Overview
This article walks the installer through bringing up an environment to run LILT on a secure platform, whether in the cloud (public or private) or on bare-metal hardware. The intent is to limit the amount of manual console interaction in favor of scripted installation that is fully documented.
References are provided throughout the installation to give additional context.
System Parameters/Versions:
- LILT: 5.0-5.1
- Base OS:
  - Amazon Linux 2023: ami-05576a079321f21f8
  - Rocky 8.10/9.5
- K8S: 1.32.4
- Pause: 3.10 (can be configured in containerd/config.toml)
- Containerd:
  - 2.0.5 (v2.0.0+ requires K8S >= 1.30 and Rocky 9 with kernel 5.x)
  - 1.6.x/1.7.x (Rocky 8 with kernel 4.x)
- Runc: 1.2.6
- CNI plugin: 1.7.1
- Flannel: 0.26.0
- NVIDIA Driver: 565.57.01, dkms
- Cuda: 12.7
- Cuda-toolkit: 12.6
- Cudnn: 9.6.0
- GPU: A10G
SSH to connect to systems.
- On Mac or Linux, you can open a terminal window and use ssh.
- On Windows, you can use the PuTTY ssh client.
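For example, from a Mac or Linux terminal (the key path, username, and address are placeholders; substitute the values for your environment):
ssh -i ~/.ssh/<key-name>.pem <user>@<node-ip>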
Web browser for post installation and verification.
Installer privileges
Many sections of this article include text formatted to indicate console input and output. Commands are prefixed with a $ or # to indicate whether or not the administrator should run them as root. Additionally, to make it clear which machine commands should be run on, the prefix master, node, or gpu is given when you should switch machines; all subsequent commands apply to that machine until the next switch.
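For example, a command shown as:
master $ kubectl get nodes
should be run on the master node, and subsequent commands continue on that node until another machine prefix (node or gpu) appears.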
Prerequisites
Before running installation scripts we need to prepare the system with all the dependencies.
Data Location
For the rest of the installation, please ensure that:
Containerd images and RPM packages are mounted and available to all of the nodes (master, worker, and GPU).
In a typical installation, the system administrator receives this data from LILT prior to installation and mounts all requirements.
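For example, assuming the delivered assets are mounted at /mnt/lilt (the path is illustrative; use whatever mount point was agreed with LILT), confirm on each node that the data is visible:
df -h /mnt/lilt
ls /mnt/lilt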
Kubernetes Cluster Installation
Installation Overview
Kubernetes
The LILT system is a collection of containers that interact with each other. LILT requires the use of Kubernetes to handle orchestration. Further, there must be persistent volume storage mounted to nodes in Kubernetes.
LILT Component Overview
Note: See the section “LILT System Architecture Diagram” for a visual depiction of the following information.
The LILT application consists of three major logical component groups:
- “Front”, which services API calls as well as the interface and editor logic
- “Neural”, which performs neural machine translation
- “Core”, which performs linguistic pre-processing and post-processing, as well as document import and export
RabbitMQ is used for message passing between the services; MySQL is used for the application database, ElasticSearch for search purposes, Minio as s3 storage, OpenEBS for storage allocation, nginx-ingress as the ingress controller, and Redis for a memory cache. For each of these 7 services, LILT will provide a containerd image that is installed as part of the installation process. Optionally, customers can use a self-hosted version of any of these services that can be pointed to by the LILT system (for example, a separately running ElasticSearch instance, AWS RDS in place of mysql, or s3 instead of minio).
Finally, there must be a persistent location for stored user documents that is mountable as a persistent volume to a Kubernetes node. This volume will be used by Minio, Elasticsearch, MySql, Redis and RabbitMQ.
LILT System Architecture Diagram
Refer to: https://self-managed-docs.lilt.com/kb/lilt-system-architecture
System Maintenance and Update Frequency
LILT recommends a system update every quarter. The update can proceed in one of two ways:
- (Recommended) A customer systems engineer installs the system from scratch given an installation document and assets package delivered via cloud or flash drive.
- A LILT systems engineer is given SSH access to the customer system, and performs the update remotely.
To schedule a system update, contact your Account Manager.
Recommended System Requirements
LILT recommends a minimum of three separate servers as nodes for the kubernetes cluster.
k8s-master server
This server controls cluster scheduling, networking and health. In comparison to the node server(s), it is resource-light.
Instance type: m5.xlarge (4 vCPUs, 16 GB RAM)
Disk space: 500 GB
k8s-node server(s)
These server(s) are the main application workhorse: they listen to the master server, host containerd containers for the main application, and mount storage. Usually, one node suffices, but for increased system performance, multiple nodes can be set up. In that case, the hardware requirements should be correspondingly replicated for each node, with the exception of the disk mounts; disk mounts need to be shared across all nodes. A few notes here:
- The total system requirements for the node server can either be fulfilled on a single machine, or split among multiple nodes that in sum are equal or greater than the recommended system requirements. However, if splitting the node server into separate physical nodes, please note that individual nodes have minimum requirements; see the details below.
- On the nodes with GPUs installed, NVIDIA drivers will have to be installed (these will be specified in the installation document).
Worker Node
Instance type: r5n.24xlarge (96 vCPUs, 768 GB RAM)
Total disk space: 2 TB
- If creating multiple drives:
  - Boot disk space: 400 GB
  - Common space: 1.6 TB (increase based on total documents ingested)
- If unable to create multiple mounts/drives and all data is stored in root, a single 2 TB drive is sufficient.
GPU Node
V3 models require a minimum of 24 GB GPU memory; either two combined T4s, one L4, one A10, or one A100. Batch processing requires a minimum of one GPU; T4, L4, A10, or A100.
The following are tested recommendations, but any configuration can be utilized as long as the minimum requirements are met.
Minimum instances, T4 GPUs (node with 2 GPUs required for translate v3, node with 1 GPU required for batch):
- 1 x g4ad.8xlarge (32 vCPUs, 128 GB RAM, 2 GPUs)
- 1 x g4dn.4xlarge (16 vCPUs, 64 GB RAM, 1 GPU)
Sufficient instance type, T4 GPUs:
g4dn.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)
Preferred instance type, A10 GPUs (better performance than T4 GPUs):
g5.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)
Optimal instance type, A10 GPUs:
g5.48xlarge (192 vCPUs, 192 GB RAM, 8 GPUs)
Total disk space: 2 TB. These store pods’ runtime assets, including large neural assets.
Install Kubernetes Cluster
All Nodes (master, worker, gpu)
Log in to each node and complete the following steps.
Commands must be performed as root (since this involves installation of Kubernetes).
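To switch to the root user on a node:
sudo su -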
Step 1: Update base OS, Install required packages
Each base OS has different package requirements.
# update OS
dnf check-release-update
sudo dnf update -y
# install bash
sudo dnf install bash -y
# install git
sudo dnf install -y git
# install ip route, used by k8s
sudo dnf install -y iproute-tc
# update system
dnf update -y
# utilities
sudo dnf install -y curl wget vim bash-completion gnupg2 lvm2
# required to unzip packages
dnf install zip unzip -y
# (optional) install aws
yum remove awscli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install
# install bash
sudo dnf install bash -y
# install jq
sudo dnf install jq -y
# install git
sudo dnf install -y git
# reboot so that the below packages can use the updated kernel
reboot
Step 2: Install helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
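You can confirm that helm installed correctly by checking its version:
helm version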
Step 3: Install k9s (MASTER NODE only. Optional, but highly recommended)
export K9S_V="0.50.4"
curl -LO "https://github.com/derailed/k9s/releases/download/v$K9S_V/k9s_Linux_amd64.tar.gz"
tar -xzf k9s_Linux_amd64.tar.gz
sudo mv k9s /usr/local/bin/
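You can confirm that k9s is available on the PATH:
k9s version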
Step 4: Set kernel parameters as required by Istio
# avoid ztunnel container restarts due to load
# append to end of file
cat <<EOF >> /etc/security/limits.conf
* soft nofile 131072
* hard nofile 131072
EOF
cat <<EOF >> /etc/systemd/system.conf
DefaultLimitNOFILE=131072
EOF
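These limits take effect for new sessions and services after the reboot in Step 11; you can spot-check the open-files limit for the current shell with:
ulimit -n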
Step 5: Set kernel parameters as required by Kubernetes
bash -c 'cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
nf_nat
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
EOF'
modprobe overlay
modprobe br_netfilter
modprobe nf_nat
modprobe xt_REDIRECT
modprobe xt_owner
modprobe iptable_nat
modprobe iptable_mangle
modprobe iptable_filter
bash -c 'cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF'
sysctl --system
Step 6: Turn off swap, disable SELinux, update local firewall changes, and modify max_map_count
NOTE: if using a version of K8S older than v1.29, it is possible that the insecure port (10255) is still being utilized. The following setup does NOT include the old insecure port. It is highly recommended that customers upgrade to K8S v1.30 or higher, which includes the new secure port (10250) by default.
# disable swap
swapoff -a
sed -e '/swap/s/^/#/g' -i /etc/fstab
# disable SELinux
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
# setup firewall
dnf install -y firewalld
systemctl unmask firewalld && systemctl enable firewalld && systemctl restart firewalld
firewall-cmd --permanent --zone=public --set-target=ACCEPT
firewall-cmd --permanent --add-port={22,80,443,2379,2380,5000,6443,10250,10251,10252}/tcp
# api
firewall-cmd --permanent --add-port={5005,8011,8080}/tcp
# istio
firewall-cmd --permanent --add-port={15000,15001,15006,15008,15009,15010,15012,15014,15017,15020,15021,15090,15443,20001}/tcp
# flannel
firewall-cmd --permanent --add-port=8472/udp
# clickhouse
firewall-cmd --permanent --add-port=8123/tcp
# WSO2
firewall-cmd --permanent --add-port={4000,9443,9763}/tcp
# reload firewall
firewall-cmd --reload
# Add vm.max_map_count=262144 to /etc/sysctl.conf
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
# Apply the sysctl settings
sudo sysctl -p
# Verify the change
sysctl vm.max_map_count
Step 7: Install containerd and add runc to the runtime
# containerd
export CONTAINERD_V="2.0.5"
export RUNC_V="1.2.6"
wget https://github.com/containerd/containerd/releases/download/v$CONTAINERD_V/containerd-$CONTAINERD_V-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-$CONTAINERD_V-linux-amd64.tar.gz
# if going to use systemd (yes in this install), need to add the following file
mkdir -p /usr/local/lib/systemd/system/
curl "https://raw.githubusercontent.com/containerd/containerd/main/containerd.service" -o "/usr/local/lib/systemd/system/containerd.service"
# restart
systemctl daemon-reload
systemctl enable --now containerd
# runc
wget https://github.com/opencontainers/runc/releases/download/v$RUNC_V/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc
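You can confirm that both binaries are installed and on the PATH:
containerd --version
runc --version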
Step 8: Install cni plugin
# cni plugin
export CNI_V="1.7.1"
wget https://github.com/containernetworking/plugins/releases/download/v$CNI_V/cni-plugins-linux-amd64-v$CNI_V.tgz
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v$CNI_V.tgz
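You can confirm the plugins were unpacked:
ls /opt/cni/bin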
Step 9: Setup containerd
Create containerd directory:
mkdir -p /etc/containerd/
NOTE: If you need to specify the pause sandbox image, add the following to each respective config.toml:
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "k8s.gcr.io/pause:3.10"
Master and Worker Nodes:
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
EOF
sudo systemctl restart containerd
GPU Nodes:
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
EOF
sudo systemctl restart containerd
Step 10: Install & Setup - Kubernetes
# set the kubernetes repository
export K8S_V="1.32"
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF
# install kubelet, kubeadm, kubectl, create symlink
sudo dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
# enable kubelet
sudo systemctl enable --now kubelet
Verify the installation by checking the package versions:
kubectl version --client && kubeadm version
Step 11: Reboot node/server
reboot
Step 12 (GPU NODE ONLY): Install NVIDIA Drivers
Each base OS has different package requirements; run only the block below that matches the node's base OS.
Amazon Linux 2023:
# update dependencies and kernel
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms
sudo systemctl enable --now dkms
if (uname -r | grep -q ^6.12.); then
sudo dnf install -y kernel-devel-$(uname -r) kernel6.12-modules-extra
else
sudo dnf install -y kernel-devel-$(uname -r) kernel-modules-extra
fi
# add nvidia repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
# clean cache
sudo dnf clean expire-cache
# install driver
sudo dnf module install -y nvidia-driver:565-dkms
# install cuda toolkit
sudo dnf install -y cuda-toolkit-12-6
# install cudnn
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz
# unpack
tar -xf cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz
# copy to dirs
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/include/* /usr/local/cuda/include/
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/lib/* /usr/local/cuda/lib64/
# set permissions
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
# optional, remove downloads to free up space
rm -rf cudnn-linux-x86_64-9.6.0.74_cuda12-archive*
# update library cache
sudo ldconfig
# install container toolkit
# add repo
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
# install
sudo dnf install -y nvidia-container-toolkit
# reboot to complete install
reboot
Rocky 8:
# extra required packages
sudo dnf -y install epel-release
# get rhel/rocky OS current version
export cur_ver="rhel$(rpm -E %rhel)"
# install kernel headers
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
# install cuda-toolkit
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$cur_ver/x86_64/cuda-$cur_ver.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-6
# update nvidia path
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
# install nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms
# zlib is required for cudnn
sudo dnf -y install zlib
# install cudnn
sudo dnf -y install cudnn-cuda-12
# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# reboot to complete install
reboot
Rocky 9:
sudo dnf install -y epel-release dnf-utils gcc make kernel-devel-matched kernel-headers
sudo dnf install -y dkms
sudo dnf config-manager --set-enabled crb
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean all
# cuda toolkit required for cudnn
sudo dnf -y install cuda-toolkit-12-6
# nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms
# Run nvidia-smi and capture the output and error
nvidia_output=$(nvidia-smi 2>&1)
# Check if nvidia-smi failed with a specific error
if echo "$nvidia_output" | grep -q "NVIDIA-SMI has failed"; then
echo "nvidia-smi failed with error: NVIDIA-SMI has failed. Checking DKMS status..."
# Get the DKMS status
dkms_status=$(dkms status)
echo "DKMS Status:"
echo "$dkms_status"
# Extract the NVIDIA module version and remove the trailing colon
nvidia_version=$(echo "$dkms_status" | grep nvidia | awk '{print $1}' | awk -F/ '{print $2}' | sed 's/:$//')
if [ -n "$nvidia_version" ]; then
echo "Attempting to install NVIDIA DKMS module version: $nvidia_version"
sudo dkms install nvidia/"$nvidia_version"
else
echo "No NVIDIA DKMS module found in DKMS status. Please check manually."
fi
else
echo "nvidia-smi is working correctly."
fi
# zlib is required for cudnn
sudo dnf -y install zlib
# install cudnn
sudo dnf -y install --allowerasing cudnn9-cuda-12
# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
Verify that all NVIDIA drivers/packages were correctly installed:
nvidia-smi
Output should be similar to the following (depending on the number of server GPUs). If not, reinstall the NVIDIA drivers from the previous section:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:16.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:17.0 Off | 0 |
| 0% 19C P8 10W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:18.0 Off | 0 |
| 0% 21C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:19.0 Off | 0 |
| 0% 21C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A10G Off | 00000000:00:1A.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Verify cuda toolkit installed:
/usr/local/cuda/bin/nvcc -V
Output should be similar to the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Verify the container toolkit is installed:
nvidia-container-cli --version
Output should be similar to the following:
cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+0000
build revision: 16f37fcafcbdaf67525135104d60d98d36688ba9
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Initialize Cluster
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run the below command to initialize and set up the Kubernetes master.
NOTE: Ensure that the CNI and Service CIDRs do NOT conflict with local server/node IP addresses. CNI_CIDR must match the Flannel IP value set below.
The CIDR 192.168.100.0/19 will allow for up to 32 nodes (flannel reserves one /24 subnet for each node, and a /19 contains 32 /24 subnets). If more nodes are required, increase the netmask/subnet size.
CNI and service CIDRs can be modified to match respective network requirements:
CNI_CIDR="192.168.100.0/19"
SERVICE_CIDR="192.168.200.0/19"
NODE_NAME="master"
kubeadm init \
--pod-network-cidr $CNI_CIDR \
--service-cidr $SERVICE_CIDR \
--node-name $NODE_NAME
Save the section of the output that resembles the following, as you will need it to join the worker nodes to the master:
# example, real output will differ
kubeadm join 10.10.3.8:6443 --token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
To use kubectl commands, the kubeconfig needs to be updated (regular or root user):
# update kubeconfig
export KUBECONFIG=/etc/kubernetes/admin.conf
# or if not root user
mkdir -p $HOME/.kube
sudo cp -f /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
systemctl enable kubelet.service
systemctl start kubelet.service
Verify that the node has been installed correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 66s v1.32.4
If a period of time has passed between initializing the cluster and joining the nodes and the join token has expired, run the following command to create a new join token:
kubeadm token create --print-join-command
Worker Node
Log in to the k8s-worker node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run following command to join the Kubernetes cluster created in the previous section:
# example, real output will differ
NODE_NAME="worker"
kubeadm join 10.10.3.8:6443 --node-name $NODE_NAME \
--token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
From the master node, verify that the worker node has joined correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 149m v1.32.4
worker NotReady <none> 8s v1.32.4
GPU Node
Log in to the k8s-gpu node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run following command to join the Kubernetes cluster created in the previous section:
# example, real output will differ
NODE_NAME="gpu"
kubeadm join 10.10.3.8:6443 --node-name $NODE_NAME \
--token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
From the master node, verify that the GPU node has joined correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
gpu NotReady <none> 9s v1.32.4
master NotReady control-plane 177m v1.32.4
worker NotReady <none> 28m v1.32.4
Import Dependencies
All images and scripts are packaged based on system requirements. Lilt is designed to work as a distributed system, utilizing specific nodes for each workload.
It is not necessary to download the entire install package on each node. The following examples utilize a remote s3 bucket for retrieving data.
It is highly recommended to use a central repository for all images. The below examples utilize local image storage on each node. If using a central repository, the containerd image import sections can be ignored; however, ensure that all helm charts are modified to replace the default registry value lilt-registry.local.io:80.
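For example (assuming the installer has been unpacked to /install_dir as in Step 2 below), you can list every chart file that references the default registry before overriding it:
grep -rl "lilt-registry.local.io:80" /install_dir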
Step 1: Set Release Tag (all nodes)
This can be ignored if downloading the entire package separately.
Login to each node (master, worker, gpu) and set release tag var:
export RELEASE_TAG="lilt-enterprise-x.x.x"
Step 2: Import install packages and images
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
Download install package and documentation from remote s3 bucket:
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/install_packages/ /$RELEASE_TAG/install_packages --recursive --exclude "*" --include "on-prem-installer*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/documentation/ /$RELEASE_TAG/documentation --recursive
Unpack installer:
mkdir -p /install_dir
tar -xzf /$RELEASE_TAG/install_packages/on-prem-installer* -C /install_dir
If not using a central image repository, complete the following:
- Download required images:
  - istio pilot/proxyv2/install-cni/ztunnel/kiali
  - flannel
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*"
- Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
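You can confirm that the images were imported into containerd's k8s.io namespace:
ctr -n=k8s.io images ls | head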
Worker Node
Log in to the k8s-worker node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
If not using a central image repository, complete the following:
- Download required images (all of them):
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/ /$RELEASE_TAG/docker_images --recursive
Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
GPU Node
Log in to the k8s-gpu node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
If not using a central image repository, complete the following:
- Download required images:
  - istio pilot/proxyv2/install-cni/ztunnel/kiali/k8s-device-plugin/metrics-server/flannel
  - all neural/llm/batch
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*" --include "k8s-device-plugin*" --include "metrics-server*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/apps/ /$RELEASE_TAG/docker_images/master/apps --recursive --exclude "*" --include "neural*" --include "llm*" --include "batch*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/node/ /$RELEASE_TAG/docker_images/node --recursive --exclude "*" --include "neural*"
- Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
Step 3: Label Nodes
The Lilt cluster utilizes nodeSelector for scheduling pods on specific nodes. This is a simple way to control where pods are scheduled by adding a key-value pair to chart/manifest specifications.
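Later, after the application is installed, you can confirm that pods landed on the intended nodes by checking the NODE column:
kubectl get pods -n lilt -o wide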
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
The worker node is used for running the bulk of application workloads, typically separate from the master and GPU nodes. Label this node as worker (the node name set in the above section) by executing the following command:
kubectl label nodes worker node-type=worker
The GPU node is generally used for running GPU-intensive pods/applications apart from the worker node; however, this node can also be used for additional workloads if the instance has sufficient resources to handle both GPU and application tasks. It is up to the cluster admin to determine whether GPU nodes can also handle application workloads.
If only using this node for GPU-specific tasks, label it as gpu (the node name set in the above section) by executing the following command:
kubectl label nodes gpu capability=gpu
Optional: if the GPU node has sufficient resources to run additional workloads, also add the worker label (--overwrite is included because the capability label was already applied above):
kubectl label nodes gpu capability=gpu node-type=worker --overwrite
Verify output:
# Show labels
kubectl get nodes --show-labels
If it is necessary to remove a previous node label, use the following command. In this example, the node-type label is removed from the gpu node:
kubectl label node gpu node-type-
Step 4: Create Namespaces
The cluster uses various namespaces to separate pods and services.
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
kubectl create ns lilt
kubectl create ns istio-system
kubectl create ns kube-flannel
Confirm that the namespaces were created:
kubectl get ns
Output should be similar to the following:
NAME STATUS AGE
default Active 4h
istio-system Active 4h
kube-flannel Active 4h
kube-node-lease Active 4h
kube-public Active 4h
kube-system Active 4h
lilt Active 4h
Step 5: Modify Flannel Helm Chart Values
Flannel helm chart CIDR must match k8s CIDR range.
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Modify flannel on-prem-values.yaml:
vi /install_dir/flannel/on-prem-values.yaml
Edit podCidr to match the k8s CNI_CIDR set above:
podCidr: "192.168.100.0/19"
Save file and exit.
Step 6: Image Pull Secrets (optional)
If using a private central repository for hosting images that requires authentication, create image pull secrets for the cluster to reference. The following example utilizes a username and password; for service accounts with a json key, use this document: Create imagePullSecrets
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Create cluster image pull secret:
kubectl -n <namespace> create secret docker-registry <secret-name> \
--docker-username=<username> \
--docker-password=<password> \
--docker-server=<registry-url> \
--docker-email=<email>
NOTE: Image pull secrets are namespace-specific; a new secret must be created for each additional namespace.
Verify the secret was created:
kubectl -n <namespace> get secrets
LILT helm charts utilize a custom values file, usually on-prem-values.yaml. Update these files with the new imagePullSecret.
As an example, this is from the redis custom values:
global:
imagePullSecrets:
- <secret-name>
Install LILT
LILT app
Assuming that all images have been loaded onto the nodes as explained in the previous steps, we can proceed to install the apps.
Step 1: Installation
Run the main install script; this includes all third-party and LILT dependencies. It can take up to two hours to fully complete.
Change to the install directory:
cd /install_dir
Run the install script:
./install-lilt.sh
If k9s was installed, the install progress can be monitored with:
k9s
If k9s was not installed, use kubectl to watch pods:
kubectl get pods --all-namespaces --watch
Can also watch k8s events:
kubectl events --all-namespaces --watch
When ready, list all pods:
kubectl get pods --all-namespaces
Output should be similar to the following:
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system istio-cni-node-7fmp4 1/1 Running 0 5d1h
istio-system istio-cni-node-7fvcq 1/1 Running 0 5d1h
istio-system istio-cni-node-ljg7h 1/1 Running 0 5d1h
istio-system istiod-68fcbb5f87-vlg8m 1/1 Running 0 5d1h
istio-system kiali-c5748bfb9-7ns2g 1/1 Running 0 6h26m
istio-system prometheus-server-99f4cd586-5xbnt 1/1 Running 0 5d1h
istio-system ztunnel-b57qw 1/1 Running 0 5d1h
istio-system ztunnel-kwdz9 1/1 Running 0 5d1h
istio-system ztunnel-p8fhc 1/1 Running 0 5d1h
kube-flannel kube-flannel-ds-lvqr7 1/1 Running 0 5d2h
kube-flannel kube-flannel-ds-m7rbh 1/1 Running 0 5d2h
kube-flannel kube-flannel-ds-xqkfz 1/1 Running 0 5d2h
kube-system coredns-76f75df574-7p66z 1/1 Running 0 5d2h
kube-system coredns-76f75df574-svzk9 1/1 Running 0 5d2h
kube-system etcd-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-apiserver-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-controller-manager-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-proxy-86855 1/1 Running 0 5d2h
kube-system kube-proxy-jggvv 1/1 Running 0 5d2h
kube-system kube-proxy-k5t6k 1/1 Running 0 5d2h
kube-system kube-scheduler-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system metrics-server-865f9f4b55-b9t4r 1/1 Running 0 5d1h
lilt assignment-core-976f6bc6c-xzsr8 1/1 Running 7 (5d1h ago) 5d1h
lilt auditlog-core-67b45f45b9-h25fs 1/1 Running 4 (5d1h ago) 5d1h
lilt auth-856c98d8f6-5f262 2/2 Running 0 4d22h
lilt batch-neural-dd58d87b6-zzr6d 1/1 Running 0 5d1h
lilt batch-tb-core-7dc4fd97c7-sfmks 1/1 Running 7 (5d1h ago) 5d1h
lilt batch-worker-cpu-neural-74db44c979-pkl4t 1/1 Running 0 6h24m
lilt batch-worker-gpu-neural-866dcb59c4-hbhhd 1/1 Running 0 6h24m
lilt batch-worker-gpuv2-neural-756f78779f-4zbdf 1/1 Running 0 6h24m
lilt batch-worker-gpuv3-neural-6f596bf9f6-dbzt4 1/1 Running 0 6h24m
lilt batchv2-neural-8679969f97-st8tc 1/1 Running 0 5d1h
lilt batchv3-neural-77f96455dc-tgqth 1/1 Running 0 4d22h
lilt cache-redis-master-0 1/1 Running 0 5d1h
lilt clickhouse-shard0-0 1/1 Running 0 5d1h
lilt connectors-ingressgateway-6d85dd9b5-58n2f 1/1 Running 0 5d1h
lilt converter-core-64d87b55d-fbffd 1/1 Running 4 (5d1h ago) 5d1h
lilt elasticsearch-master-0 1/1 Running 0 5d1h
lilt elasticsearch-master-1 1/1 Running 0 5d1h
lilt file-job-core-67598b8bb4-bv5r6 1/1 Running 0 5d1h
lilt file-translation-core-76fcfd7979-qj4dq 1/1 Running 0 5d1h
lilt front-app-7f75c8745c-w2bbs 2/2 Running 0 4d22h
lilt indexer-core-6bd8f76c45-lnxqs 1/1 Running 9 (5d1h ago) 5d1h
lilt istio-ingressgateway-6dc78b487c-nht2v 1/1 Running 0 5d1h
lilt job-core-5c5494cbcd-dxqr2 1/1 Running 0 5d1h
lilt langid-neural-57898fb4c5-7xcdv 1/1 Running 0 4d22h
lilt langid-neural-57898fb4c5-nfcqr 1/1 Running 0 4d22h
lilt lexicon-core-79557f46f7-z7sbz 1/1 Running 0 5d1h
lilt lilt-beehive-5bf9d6ff6b-n87dh 1/1 Running 0 5d1h
lilt lilt-configuration-api-5c49db4797-8jnxq 2/2 Running 0 5d1h
lilt lilt-connectors-builder-678448f4f6-4g4pq 2/2 Running 0 7h50m
lilt lilt-connectors-create-admin-user-kjv2k 0/1 Completed 0 6h25m
lilt lilt-connectors-exporter-cronjob-28856445-zqnkx 0/1 Completed 0 53s
lilt lilt-connectors-scheduler-cronjob-28856445-gjhd7 0/1 Completed 0 53s
lilt lilt-core-api-67bcb48c6b-84gtt 2/2 Running 0 5d1h
lilt lilt-dataflow-clickhouse-migration-job-5tqms 0/1 Completed 0 6h25m
lilt lilt-dataflow-generate-memory-snapshot-cronjob-28855920-l5cnf 0/1 Completed 0 8h
lilt lilt-dataflow-ingest-comments-cronjob-28856400-988db 0/1 Completed 0 45m
lilt lilt-dataflow-ingest-connectorjobs-cronjob-28856400-9kzxp 0/1 Completed 0 45m
lilt lilt-dataflow-ingest-revision-reports-cronjob-28855020-mhwd4 0/1 Completed 0 23h
lilt lilt-dataflow-ingest-wpa-minio-cronjob-28856280-6p5jz 0/1 Completed 0 165m
lilt lilt-dataflow-segment-quality-cronjob-28856400-rr9rs 0/1 Completed 0 45m
lilt lilt-manager-ui-f54749467-sc6cf 2/2 Running 9 (5d1h ago) 5d1h
lilt linguist-core-7ddd8fcfbb-6rv8b 1/1 Running 0 5d1h
lilt llm-inference-neural-6bb49c48b6-mtfff 1/1 Running 0 5d1h
lilt localpv-provisioner-5cfff7dcb5-g2zc5 1/1 Running 0 5d1h
lilt memory-core-f697888d9-ccxxh 1/1 Running 0 5d1h
lilt minio-85c7cb7f85-kspj5 1/1 Running 0 5d1h
lilt mongodb-688f5c8c5b-zmh77 1/1 Running 0 5d1h
lilt mq-rabbitmq-0 1/1 Running 1 (5d1h ago) 5d1h
lilt mysql-0 1/1 Running 0 5d1h
lilt nginx-ingress-65v2k 1/1 Running 0 5d1h
lilt nvidia-device-plugin-42ntc 1/1 Running 0 5d1h
lilt nvidia-device-plugin-njpdp 1/1 Running 0 5d1h
lilt nvidia-device-plugin-rqqp9 1/1 Running 0 5d1h
lilt qa-core-5d474c46b4-sncqn 1/1 Running 3 (5d1h ago) 5d1h
lilt routing-neural-79df5944c5-m962l 1/1 Running 0 4d22h
lilt search-core-5f9ddbb697-b46q4 1/1 Running 0 5d1h
lilt segment-core-8545b9dbcf-v5jzc 1/1 Running 4 (5d1h ago) 5d1h
lilt tag-core-544c77757d-9qxm6 1/1 Running 4 (5d1h ago) 5d1h
lilt tb-core-795dbf757d-9nkl7 1/1 Running 0 5d1h
lilt tm-core-55c74c49cf-ntmqz 1/1 Running 0 5d1h
lilt translate-neural-6c46b4755d-s84n2 1/1 Running 0 5d1h
lilt translatev2-neural-67fd8b7584-wmc5w 1/1 Running 0 5d1h
lilt translatev3-neural-f7f8669c8-28pxn 1/1 Running 0 5d1h
lilt update-manager-neural-77c89bf757-zs5d5 1/1 Running 12 (4d12h ago) 5d1h
lilt update-managerv2-neural-68cbfcf6d4-wh62m 1/1 Running 11 (4d23h ago) 5d1h
lilt update-managerv3-neural-6d4598547c-dk2k4 1/1 Running 5 (4d10h ago) 5d1h
lilt update-neural-6474b69644-kmxdh 1/1 Running 0 5d1h
lilt updatev2-neural-7f4746d4d9-wklzr 1/1 Running 0 5d1h
lilt updatev3-neural-8bd9f747d-8q7dj 1/1 Running 0 5d1h
lilt valid-words-replace-neural-7f9fc59958-hk8vw 1/1 Running 0 4d22h
lilt valid-words-update-neural-6956b97db9-gf5d4 1/1 Running 0 4d22h
lilt watchdog-core-67895dbc9d-pmg8n 1/1 Running 0 5d1h
lilt workflow-core-7bff648cf5-7b77g 1/1 Running 4 (5d1h ago) 5d1h
Step 2: Access Lilt Main Page
Once all pods are running/ready, connect to the LILT main page. The default domain name is bare.lilt.com, which is reachable via nginx-ingress running on the worker node.
If accessing from a workstation inside the cluster network, get the local ip address of the worker node:
kubectl get node worker -o wide | awk -v OFS='\t\t' '{print $1, $6}'
If accessing from an external source, check the cloud provider public IPv4 address assigned to the worker node.
Once the correct ip address is determined, modify the hosts file on your local workstation (/etc/hosts on Mac or Linux, C:\Windows\System32\drivers\etc\hosts on Windows).
Add a line for the worker node ip address and bare.lilt.com, for example:
<worker-node-ip>    bare.lilt.com
Navigate to LILT using a browser at bare.lilt.com.
After the admin password is set, LILT can be accessed via bare.lilt.com/signin.
Debugging
Depending on network speed (to download/upload containerd images), some of the pods can take more time than others. Here are some known debugging techniques:
- If you see Error: UPGRADE FAILED: timed out waiting for the condition, please continue, as this can happen due to the time taken by the pods to start up; the apps deployment proceeds as expected.
- If you see pods stuck in ContainerCreating for a longer time (>15 mins), it's safe to restart the pods by using kubectl delete pod -n lilt <podname>
- If the apps aren't healthy even after all containerd images have been loaded onto the node, it is safe to revert to the previous version and redo the install, using the commands below:
master $ cd $install_dir
## Rollback
$ helm rollback lilt 1
## Clean-up jobs
$ kubectl delete jobs -n lilt --all
## Remove statefulset PVCs
$ kubectl get pvc -n lilt | grep elasticsearch
$ kubectl delete pvc -n lilt <elasticsearch pvc as per above command>
## Perform Install
$ ./install-lilt.sh
Flannel Error
Sometimes flannel does not load correctly on all nodes. Identify the node on which the flannel init container does not complete, and restart containerd on that node:
sudo systemctl restart containerd
Gateway Error 502
Sometimes after a server reboot, nginx-ingress needs to be restarted to include new values for front. Try restarting the nginx-ingress daemonset:
# restart nginx-ingress
kubectl rollout restart daemonset/nginx-ingress -n lilt
CORE Pods
When a cluster is restarted, core pods may be stuck in CrashLoopBackOff. A potential fix is to delete the elasticsearch helm deployment and PVCs and then re-deploy elasticsearch:
# delete deployment
helm delete -n lilt elasticsearch
# get list of elastic search pvcs
kubectl get pvc -n lilt | grep elasticsearch
# delete pvc volumes based on the results, there will be two
kubectl delete pvc -n lilt pvc-1d81120d-5453-4fb0-ab8a-1dde8f85628f
kubectl delete pvc -n lilt pvc-34c86f17-84fc-4c21-8e05-4b07995f9379
# re-deploy
sh install_scripts/install-elasticsearch.sh
If some core pods are still not running, try restarting the lilt-beehive deployment:
kubectl rollout restart deployment/lilt-beehive -n lilt
GPUs
If GPUs are not working, verify that the expected number is allocated to the cluster:
kubectl get nodes -o custom-columns=":metadata.name" | while read node; do kubectl describe node "$node" | grep -i "nvidia.com/gpu"; done
Output should be similar to the following (the GPU node shows 8, the master and worker nodes show 0):
nvidia.com/gpu: 8
nvidia.com/gpu: 0
nvidia.com/gpu: 0
If the output shows zero for all nodes, reinstall the GPU drivers from the above section and reapply the nvidia-device-plugin:
sh install_scripts/install-nvidia-device-plugin.sh
Translate v2/3
Sometimes after a server reboot, translate pods for v2/3 will not initialize. This can be attributed to service endpoints not refreshing from the previous deployment. Possible solutions are restarting the rabbitmq statefulset and the minio deployment:
# restart rabbitmq
kubectl rollout restart statefulset/mq-rabbitmq -n lilt
# restart minio
kubectl rollout restart deployment/minio -n lilt
Restart all Deployments/Statefulsets
If there are various pod failures, you can also try restarting all Lilt deployments and statefulsets:
kubectl rollout restart deployment -n lilt
kubectl rollout restart statefulset -n lilt
Reset Cluster
If there are any issues that can't be resolved, as a last resort reset each node in the cluster.
NOTE: resetting the cluster will not remove all configuration artifacts. Follow the onscreen instructions output by the below reset command.
All Nodes (master, worker, gpu)
Log in to each node and complete the following steps.
Commands must be performed as root (since this involves the Kubernetes installation).
Reset the node:
kubeadm reset
Reinstall the cluster starting from the beginning of this document.