Overview
This article walks the installer through bringing up an environment to run LILT on a secure platform, whether in the cloud (public or private) or on bare-metal hardware. The intent is to limit the amount of manual console interaction in favor of scripted installation that is fully documented.
References are provided throughout the installation to give additional context.
System Parameters/Versions:
- LILT: 5.0-5.1
- Base OS:
  - Amazon Linux 2023: ami-05576a079321f21f8
  - Rocky 8.10/9.5
- K8S: 1.32.4
- Pause: 3.10 (can be configured in containerd/config.toml)
- Containerd:
  - 2.0.5 (v2.0.0+ requires K8S >= 1.30 and Rocky 9 with kernel 5.x)
  - 1.6.x/1.7.x (Rocky 8 with kernel 4.x)
- Runc: 1.2.6
- CNI plugin: 1.7.1
- Flannel: 0.26.0
- NVIDIA Driver: 565.57.01, dkms
- Cuda: 12.7
- Cuda-toolkit: 12.6
- Cudnn: 9.6.0
- GPU: A10G
SSH to connect to systems.
- On Mac or Linux, you can open a terminal window and use ssh.
- On Windows, you can use the PuTTY ssh client.
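For example, from a Mac or Linux terminal (the key path, username, and address are placeholders; substitute the values for your environment):
ssh -i ~/.ssh/<key-name>.pem <user>@<node-ip>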
Web browser for post installation and verification.
Installer privileges
Many sections of this article include text formatted to indicate console input and output. Commands are prefixed with a $ or # to indicate whether or not the administrator should run them as root. Additionally, to make it clear which machine commands should be run on, the prefix master, node, or gpu is given when you should switch machines; all subsequent commands apply to that machine until the next switch.
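For example, a command shown as:
master $ kubectl get nodes
should be run on the master node, and subsequent commands continue on that node until another machine prefix (node or gpu) appears.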
Prerequisites
Before running installation scripts we need to prepare the system with all the dependencies.
Data Location
For the rest of the installation, please ensure that:
Containerd images and RPM packages are mounted and available to all of the nodes (master, worker, and GPU).
In a typical installation, the system administrator receives this data from LILT prior to installation and mounts all requirements.
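For example, assuming the delivered assets are mounted at /mnt/lilt (the path is illustrative; use whatever mount point was agreed with LILT), confirm on each node that the data is visible:
df -h /mnt/lilt
ls /mnt/lilt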
Kubernetes Cluster Installation
Installation Overview
Kubernetes
The LILT system is a collection of containers that interact with each other. LILT requires the use of Kubernetes to handle orchestration. Further, there must be persistent volume storage mounted to nodes in Kubernetes.
LILT Component Overview
Note: See the section “LILT System Architecture Diagram” for a visual depiction of the following information.
The LILT application consists of three major logical component groups:
- “Front”, which services API calls as well as the interface and editor logic
- “Neural”, which performs neural machine translation
- “Core”, which performs linguistic pre-processing and post-processing, as well as document import and export
RabbitMQ is used for message passing between the services; MySQL is used for the application database, ElasticSearch for search purposes, Minio as s3 storage, OpenEBS for storage allocation, nginx-ingress as the ingress controller, and Redis for a memory cache. For each of these 7 services, LILT will provide a containerd image that is installed as part of the installation process. Optionally, customers can use a self-hosted version of any of these services that can be pointed to by the LILT system (for example, a separately running ElasticSearch instance, AWS RDS in place of mysql, or s3 instead of minio).
Finally, there must be a persistent location for stored user documents that is mountable as a persistent volume to a Kubernetes node. This volume will be used by Minio, Elasticsearch, MySql, Redis and RabbitMQ.
LILT System Architecture Diagram
Refer to: https://self-managed-docs.lilt.com/kb/lilt-system-architecture
System Maintenance and Update Frequency
LILT recommends a system update every quarter. The update can proceed in one of two ways:
- (Recommended) A customer systems engineer installs the system from scratch given an installation document and assets package delivered via cloud or flash drive.
- A LILT systems engineer is given SSH access to the customer system, and performs the update remotely.
To schedule a system update, contact your Account Manager.
Recommended System Requirements
LILT recommends a minimum of three separate servers as nodes for the kubernetes cluster.
k8s-master server
This server controls cluster scheduling, networking and health. In comparison to the node server(s), it is resource-light.
Instance type: m5.xlarge (4 vCPUs, 16 GB RAM)
Disk space: 500 GB
k8s-node server(s)
These server(s) are the main application workhorse: they listen to the master server, host containerd containers for the main application, and mount storage. Usually, one node suffices, but for increased system performance, multiple nodes can be set up. In that case, the hardware requirements should be correspondingly replicated for each node, with the exception of the disk mounts; disk mounts need to be shared across all nodes. A few notes here:
- The total system requirements for the node server can either be fulfilled on a single machine, or split among multiple nodes that in sum are equal or greater than the recommended system requirements. However, if splitting the node server into separate physical nodes, please note that individual nodes have minimum requirements; see the details below.
- On the nodes with GPUs installed, NVIDIA drivers will have to be installed (these will be specified in the installation document).
Worker Node
Instance type: r5n.24xlarge (96 vCPUs, 768 GB RAM)
Total disk space: 2 TB
- If creating multiple drives:
  - Boot disk space: 400 GB
  - Common space: 1.6 TB (increase based on total documents ingested)
- If unable to create multiple mounts/drives and all data is stored in root, a single 2 TB drive is sufficient.
GPU Node
V3 models require a minimum of 24 GB GPU memory; either two combined T4s, one L4, one A10, or one A100. Batch processing requires a minimum of one GPU; T4, L4, A10, or A100.
The following are tested recommendations, but any configuration can be utilized as long as the minimum requirements are met.
Minimum instances, T4 GPUs (node with 2 GPUs required for translate v3, node with 1 GPU required for batch):
- 1 x g4ad.8xlarge (32 vCPUs, 128 GB RAM, 2 GPUs)
- 1 x g4dn.4xlarge (16 vCPUs, 64 GB RAM, 1 GPU)
Sufficient instance type, T4 GPUs:
g4dn.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)
Preferred instance type, A10 GPUs (better performance than T4 GPUs):
g5.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)
Optimal instance type, A10 GPUs:
g5.48xlarge (192 vCPUs, 192 GB RAM, 8 GPUs)
Total disk space: 2 TB. These store pods’ runtime assets, including large neural assets.
Install Kubernetes Cluster
All Nodes (master, worker, gpu)
Log in to each node and complete the following steps.
Commands must be performed as root (since this involves installation of Kubernetes).
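To switch to the root user on a node:
sudo su -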
Step 1: Update base OS, Install required packages
Each base OS has different package requirements.
# update OS
dnf check-release-update
sudo dnf update -y
# install bash
sudo dnf install bash -y
# install git
sudo dnf install -y git
# install ip route, used by k8s
sudo dnf install -y iproute-tc
# update system
dnf update -y
# utilities
sudo dnf install -y curl wget vim bash-completion gnupg2 lvm2
# required to unzip packages
dnf install zip unzip -y
# (optional) install aws
yum remove awscli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install
# install bash
sudo dnf install bash -y
# install jq
sudo dnf install jq -y
# install git
sudo dnf install -y git
# reboot so that the below packages can use the updated kernel
reboot
Step 2: Install helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
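You can confirm that helm installed correctly by checking its version:
helm version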
Step 3: Install k9s (MASTER NODE only. Optional, but highly recommended)
export K9S_V="0.50.4"
curl -LO "https://github.com/derailed/k9s/releases/download/v$K9S_V/k9s_Linux_amd64.tar.gz"
tar -xzf k9s_Linux_amd64.tar.gz
sudo mv k9s /usr/local/bin/
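You can confirm that k9s is available on the PATH:
k9s version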
Step 4: Set kernel parameters as required by Istio
# avoid ztunnel container restarts due to load
# append to end of file
cat <<EOF >> /etc/security/limits.conf
* soft nofile 131072
* hard nofile 131072
EOF
cat <<EOF >> /etc/systemd/system.conf
DefaultLimitNOFILE=131072
EOF
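These limits take effect for new sessions and services after the reboot in Step 11; you can spot-check the open-files limit for the current shell with:
ulimit -n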
Step 5: Set kernel parameters as required by Kubernetes
bash -c 'cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
nf_nat
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
EOF'
modprobe overlay
modprobe br_netfilter
modprobe nf_nat
modprobe xt_REDIRECT
modprobe xt_owner
modprobe iptable_nat
modprobe iptable_mangle
modprobe iptable_filter
bash -c 'cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF'
sysctl --system
Step 6: Turn off swap, disable SELinux, update local firewall changes, and modify max_map_count
NOTE: if using a version of K8S older than v1.29, it is possible that the insecure port (10255) is still being utilized. The following setup does NOT include the old insecure port. It is highly recommended that customers upgrade to K8S v1.30 or higher, which includes the new secure port (10250) by default.
# disable swap
swapoff -a
sed -e '/swap/s/^/#/g' -i /etc/fstab
# disable SELinux
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
# setup firewall
dnf install -y firewalld
systemctl unmask firewalld && systemctl enable firewalld && systemctl restart firewalld
firewall-cmd --permanent --zone=public --set-target=ACCEPT
firewall-cmd --permanent --add-port={22,80,443,2379,2380,5000,6443,10250,10251,10252}/tcp
# api
firewall-cmd --permanent --add-port={5005,8011,8080}/tcp
# istio
firewall-cmd --permanent --add-port={15000,15001,15006,15008,15009,15010,15012,15014,15017,15020,15021,15090,15443,20001}/tcp
# flannel
firewall-cmd --permanent --add-port=8472/udp
# clickhouse
firewall-cmd --permanent --add-port=8123/tcp
# WSO2
firewall-cmd --permanent --add-port={4000,9443,9763}/tcp
# reload firewall
firewall-cmd --reload
# Add vm.max_map_count=262144 to /etc/sysctl.conf
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
# Apply the sysctl settings
sudo sysctl -p
# Verify the change
sysctl vm.max_map_count
Step 7: Install containerd and add runc to the runtime
# containerd
export CONTAINERD_V="2.0.5"
export RUNC_V="1.2.6"
wget https://github.com/containerd/containerd/releases/download/v$CONTAINERD_V/containerd-$CONTAINERD_V-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-$CONTAINERD_V-linux-amd64.tar.gz
# if going to use systemd (yes in this install), need to add the following file
mkdir -p /usr/local/lib/systemd/system/
curl "https://raw.githubusercontent.com/containerd/containerd/main/containerd.service" -o "/usr/local/lib/systemd/system/containerd.service"
# restart
systemctl daemon-reload
systemctl enable --now containerd
# runc
wget https://github.com/opencontainers/runc/releases/download/v$RUNC_V/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc
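You can confirm that both binaries are installed and on the PATH:
containerd --version
runc --version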
Step 8: Install cni plugin
# cni plugin
export CNI_V="1.7.1"
wget https://github.com/containernetworking/plugins/releases/download/v$CNI_V/cni-plugins-linux-amd64-v$CNI_V.tgz
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v$CNI_V.tgz
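You can confirm the plugins were unpacked:
ls /opt/cni/bin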
Step 9: Setup containerd
Create containerd directory:
mkdir -p /etc/containerd/
NOTE: If you need to specify the pause sandbox image, add the following to each respective config.toml:
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "k8s.gcr.io/pause:3.10"
Master and Worker Nodes:
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
EOF
sudo systemctl restart containerd
GPU Nodes:
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
EOF
sudo systemctl restart containerd
Step 10: Install & Setup - Kubernetes
# set the kubernetes repository
export K8S_V="1.32"
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF
# install kubelet, kubeadm, kubectl, create symlink
sudo dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
# enable kubelet
sudo systemctl enable --now kubelet
Verify the installation by checking the package versions:
kubectl version --client && kubeadm version
Step 11: Reboot node/server
reboot
Step 12 (GPU NODE ONLY): Install NVIDIA Drivers
Each base OS has different package requirements; run only the block below that matches the node's base OS.
Amazon Linux 2023:
# update dependencies and kernel
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms
sudo systemctl enable --now dkms
if (uname -r | grep -q ^6.12.); then
sudo dnf install -y kernel-devel-$(uname -r) kernel6.12-modules-extra
else
sudo dnf install -y kernel-devel-$(uname -r) kernel-modules-extra
fi
# add nvidia repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
# clean cache
sudo dnf clean expire-cache
# install driver
sudo dnf module install -y nvidia-driver:565-dkms
# install cuda toolkit
sudo dnf install -y cuda-toolkit-12-6
# install cudnn
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz
# unpack
tar -xf cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz
# copy to dirs
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/include/* /usr/local/cuda/include/
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/lib/* /usr/local/cuda/lib64/
# set permissions
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
# optional, remove downloads to free up space
rm -rf cudnn-linux-x86_64-9.6.0.74_cuda12-archive*
# update library cache
sudo ldconfig
# install container toolkit
# add repo
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
# install
sudo dnf install -y nvidia-container-toolkit
# reboot to complete install
reboot
Rocky 8:
# extra required packages
sudo dnf -y install epel-release
# get rhel/rocky OS current version
export cur_ver="rhel$(rpm -E %rhel)"
# install kernel headers
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
# install cuda-toolkit
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$cur_ver/x86_64/cuda-$cur_ver.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-6
# update nvidia path
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
# install nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms
# zlib is required for cudnn
sudo dnf -y install zlib
# install cudnn
sudo dnf -y install cudnn-cuda-12
# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# reboot to complete install
reboot
Rocky 9:
sudo dnf install -y epel-release dnf-utils gcc make kernel-devel-matched kernel-headers
sudo dnf install -y dkms
sudo dnf config-manager --set-enabled crb
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean all
# cuda toolkit required for cudnn
sudo dnf -y install cuda-toolkit-12-6
# nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms
# Run nvidia-smi and capture the output and error
nvidia_output=$(nvidia-smi 2>&1)
# Check if nvidia-smi failed with a specific error
if echo "$nvidia_output" | grep -q "NVIDIA-SMI has failed"; then
echo "nvidia-smi failed with error: NVIDIA-SMI has failed. Checking DKMS status..."
# Get the DKMS status
dkms_status=$(dkms status)
echo "DKMS Status:"
echo "$dkms_status"
# Extract the NVIDIA module version and remove the trailing colon
nvidia_version=$(echo "$dkms_status" | grep nvidia | awk '{print $1}' | awk -F/ '{print $2}' | sed 's/:$//')
if [ -n "$nvidia_version" ]; then
echo "Attempting to install NVIDIA DKMS module version: $nvidia_version"
sudo dkms install nvidia/"$nvidia_version"
else
echo "No NVIDIA DKMS module found in DKMS status. Please check manually."
fi
else
echo "nvidia-smi is working correctly."
fi
# zlib is required for cudnn
sudo dnf -y install zlib
# install cudnn
sudo dnf -y install --allowerasing cudnn9-cuda-12
# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
Verify that all NVIDIA drivers/packages were correctly installed:
nvidia-smi
Output should be similar to the following (depending on the number of server GPUs). If not, reinstall the NVIDIA drivers from the previous section:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:16.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:17.0 Off | 0 |
| 0% 19C P8 10W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:18.0 Off | 0 |
| 0% 21C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:19.0 Off | 0 |
| 0% 21C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A10G Off | 00000000:00:1A.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 20C P8 9W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Verify cuda toolkit installed:
/usr/local/cuda/bin/nvcc -V
Output should be similar to the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Verify the container toolkit is installed:
nvidia-container-cli --version
Output should be similar to the following:
cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+0000
build revision: 16f37fcafcbdaf67525135104d60d98d36688ba9
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Initialize Cluster
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run the below command to initialize and set up the Kubernetes master.
NOTE: Ensure that the CNI and Service CIDRs do NOT conflict with local server/node IP addresses. CNI_CIDR must match the Flannel IP value set below.
The CIDR 192.168.100.0/19 will allow for up to 32 nodes (flannel reserves one /24 subnet for each node, and a /19 contains 32 /24 subnets). If more nodes are required, increase the netmask/subnet size.
CNI and service CIDRs can be modified to match respective network requirements:
CNI_CIDR="192.168.100.0/19"
SERVICE_CIDR="192.168.200.0/19"
NODE_NAME="master"
kubeadm init \
--pod-network-cidr $CNI_CIDR \
--service-cidr $SERVICE_CIDR \
--node-name $NODE_NAME
Save the section of the output that resembles the following, as you will need it to join the worker nodes to the master:
# example, real output will differ
kubeadm join 10.10.3.8:6443 --token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
To use kubectl commands, the kubeconfig needs to be updated (regular or root user):
# update kubeconfig
export KUBECONFIG=/etc/kubernetes/admin.conf
# or if not root user
mkdir -p $HOME/.kube
sudo cp -f /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
systemctl enable kubelet.service
systemctl start kubelet.service
Verify that the node has been installed correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 66s v1.32.4
If a period of time has passed between initializing the cluster and joining the nodes and the join token has expired, run the following command to create a new join token:
kubeadm token create --print-join-command
Worker Node
Log in to the k8s-worker node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run following command to join the Kubernetes cluster created in the previous section:
# example, real output will differ
NODE_NAME="worker"
kubeadm join 10.10.3.8:6443 --node-name $NODE_NAME \
--token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
From the master node, verify that the worker node has joined correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
master NotReady control-plane 149m v1.32.4
worker NotReady <none> 8s v1.32.4
GPU Node
Log in to the k8s-gpu node and follow the steps below. Note that the following steps must be performed as root (sudo su -), since they involve installation of Kubernetes.
Run following command to join the Kubernetes cluster created in the previous section:
# example, real output will differ
NODE_NAME="gpu"
kubeadm join 10.10.3.8:6443 --node-name $NODE_NAME \
--token 72dx7i.i1q8hymni6lw3f3f \
--discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
From the master node, verify that the GPU node has joined correctly:
kubectl get nodes
Expected output:
NOTE: STATUS will be NotReady until Flannel is installed.
NAME STATUS ROLES AGE VERSION
gpu NotReady <none> 9s v1.32.4
master NotReady control-plane 177m v1.32.4
worker NotReady <none> 28m v1.32.4
Import Dependencies
All images and scripts are packaged based on system requirements. Lilt is designed to work as a distributed system, utilizing specific nodes for each workload.
It is not necessary to download the entire install package on each node. The following examples utilize a remote s3 bucket for retrieving data.
It is highly recommended to use a central repository for all images. The below examples utilize local image storage on each node. If using a central repository, the containerd image import sections can be ignored; however, ensure that all helm charts are modified to replace the default registry value lilt-registry.local.io:80.
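For example (assuming the installer has been unpacked to /install_dir as in Step 2 below), you can list every chart file that references the default registry before overriding it:
grep -rl "lilt-registry.local.io:80" /install_dir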
Step 1: Set Release Tag (all nodes)
This can be ignored if downloading the entire package separately.
Login to each node (master, worker, gpu) and set release tag var:
export RELEASE_TAG="lilt-enterprise-x.x.x"
Step 2: Import install packages and images
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
Download install package and documentation from remote s3 bucket:
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/install_packages/ /$RELEASE_TAG/install_packages --recursive --exclude "*" --include "on-prem-installer*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/documentation/ /$RELEASE_TAG/documentation --recursive
Unpack installer:
mkdir -p /install_dir
tar -xzf /$RELEASE_TAG/install_packages/on-prem-installer* -C /install_dir
If not using a central image repository, complete the following:
- Download required images:
  - istio pilot/proxyv2/install-cni/ztunnel/kiali
  - flannel
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*"
- Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
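You can confirm that the images were imported into containerd's k8s.io namespace:
ctr -n=k8s.io images ls | head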
Worker Node
Log in to the k8s-worker node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
If not using a central image repository, complete the following:
- Download required images (all of them):
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/ /$RELEASE_TAG/docker_images --recursive
Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
GPU Node
Log in to the k8s-gpu node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Move to the root directory:
cd /
If not using a central image repository, complete the following:
- Download required images:
  - istio pilot/proxyv2/install-cni/ztunnel/kiali/k8s-device-plugin/metrics-server/flannel
  - all neural/llm/batch
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*" --include "k8s-device-plugin*" --include "metrics-server*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/apps/ /$RELEASE_TAG/docker_images/master/apps --recursive --exclude "*" --include "neural*" --include "llm*" --include "batch*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/node/ /$RELEASE_TAG/docker_images/node --recursive --exclude "*" --include "neural*"
- Load images to containerd (update <release-tag>):
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
Step 3: Label Nodes
The Lilt cluster utilizes nodeSelector for scheduling pods on specific nodes. This is a simple way to control where pods are scheduled by adding a key-value pair to chart/manifest specifications.
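Later, after the application is installed, you can confirm that pods landed on the intended nodes by checking the NODE column:
kubectl get pods -n lilt -o wide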
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
The worker node is used for running the bulk of application workloads, typically separate from the master and GPU nodes. Label this node as worker (the node name set in the above section) by executing the following command:
kubectl label nodes worker node-type=worker
The GPU node is generally used for running GPU-intensive pods/applications apart from the worker node; however, this node can also be used for additional workloads if the instance has sufficient resources to handle both GPU and application tasks. It is up to the cluster admin to determine whether GPU nodes can also handle application workloads.
If only using this node for GPU-specific tasks, label it as gpu (the node name set in the above section) by executing the following command:
kubectl label nodes gpu capability=gpu
Optional: if the GPU node has sufficient resources to run additional workloads, also add the worker label (--overwrite is included because the capability label was already applied above):
kubectl label nodes gpu capability=gpu node-type=worker --overwrite
Verify output:
# Show labels
kubectl get nodes --show-labels
If it is necessary to remove a previous node label, use the following command. In this example, the node-type label is removed from the gpu node:
kubectl label node gpu node-type-
Step 4: Create Namespaces
The cluster uses various namespaces to separate pods and services.
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
kubectl create ns lilt
kubectl create ns istio-system
kubectl create ns kube-flannel
Confirm that the namespaces were created:
kubectl get ns
Output should be similar to the following:
NAME STATUS AGE
default Active 4h
istio-system Active 4h
kube-flannel Active 4h
kube-node-lease Active 4h
kube-public Active 4h
kube-system Active 4h
lilt Active 4h
Step 5: Modify Flannel Helm Chart Values
Flannel helm chart CIDR must match k8s CIDR range.
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Modify flannel on-prem-values.yaml:
vi /install_dir/flannel/on-prem-values.yaml
Edit podCidr to match the k8s CNI_CIDR set above:
podCidr: "192.168.100.0/19"
Save file and exit.
Step 6: Image Pull Secrets (optional)
If using a private central repository for hosting images that requires authentication, create image pull secrets for the cluster to reference. The following example utilizes a username and password; for service accounts with a json key, use this document: Create imagePullSecrets
Master Node
Log in to the k8s-master node and follow the steps below. Note that the following steps must be performed as root (sudo su -).
Create cluster image pull secret:
kubectl -n <namespace> create secret docker-registry <secret-name> \
--docker-username=<username> \
--docker-password=<password> \
--docker-server=<registry-url> \
--docker-email=<email>
NOTE: Image pull secrets are namespace-specific; a new secret must be created for each additional namespace.
Verify the secret was created:
kubectl -n <namespace> get secrets
LILT helm charts utilize a custom values file, usually on-prem-values.yaml. Update these files with the new imagePullSecret.
As an example, this is from the redis custom values:
global:
imagePullSecrets:
- <secret-name>
Install LILT
LILT app
Assuming that all images have been loaded onto the nodes as explained in the previous steps, we can proceed to install the apps.
Step 1: Installation
Run the main install script; this includes all third-party and LILT dependencies. It can take up to two hours to fully complete.
Change to the install directory:
cd /install_dir
Run the install script:
./install-lilt.sh
If k9s was installed, the install progress can be monitored with:
k9s
If k9s was not installed, use kubectl to watch pods:
kubectl get pods --all-namespaces --watch
Can also watch k8s events:
kubectl events --all-namespaces --watch
When ready, list all pods:
kubectl get pods --all-namespaces
Output should be similar to the following:
NAMESPACE NAME READY STATUS RESTARTS AGE
istio-system istio-cni-node-7fmp4 1/1 Running 0 5d1h
istio-system istio-cni-node-7fvcq 1/1 Running 0 5d1h
istio-system istio-cni-node-ljg7h 1/1 Running 0 5d1h
istio-system istiod-68fcbb5f87-vlg8m 1/1 Running 0 5d1h
istio-system kiali-c5748bfb9-7ns2g 1/1 Running 0 6h26m
istio-system prometheus-server-99f4cd586-5xbnt 1/1 Running 0 5d1h
istio-system ztunnel-b57qw 1/1 Running 0 5d1h
istio-system ztunnel-kwdz9 1/1 Running 0 5d1h
istio-system ztunnel-p8fhc 1/1 Running 0 5d1h
kube-flannel kube-flannel-ds-lvqr7 1/1 Running 0 5d2h
kube-flannel kube-flannel-ds-m7rbh 1/1 Running 0 5d2h
kube-flannel kube-flannel-ds-xqkfz 1/1 Running 0 5d2h
kube-system coredns-76f75df574-7p66z 1/1 Running 0 5d2h
kube-system coredns-76f75df574-svzk9 1/1 Running 0 5d2h
kube-system etcd-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-apiserver-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-controller-manager-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system kube-proxy-86855 1/1 Running 0 5d2h
kube-system kube-proxy-jggvv 1/1 Running 0 5d2h
kube-system kube-proxy-k5t6k 1/1 Running 0 5d2h
kube-system kube-scheduler-master-shared-vpc 1/1 Running 1 (5d2h ago) 5d2h
kube-system metrics-server-865f9f4b55-b9t4r 1/1 Running 0 5d1h
lilt assignment-core-976f6bc6c-xzsr8 1/1 Running 7 (5d1h ago) 5d1h
lilt auditlog-core-67b45f45b9-h25fs 1/1 Running 4 (5d1h ago) 5d1h
lilt auth-856c98d8f6-5f262 2/2 Running 0 4d22h
lilt batch-neural-dd58d87b6-zzr6d 1/1 Running 0 5d1h
lilt batch-tb-core-7dc4fd97c7-sfmks 1/1 Running 7 (5d1h ago) 5d1h
lilt batch-worker-cpu-neural-74db44c979-pkl4t 1/1 Running 0 6h24m
lilt batch-worker-gpu-neural-866dcb59c4-hbhhd 1/1 Running 0 6h24m
lilt batch-worker-gpuv2-neural-756f78779f-4zbdf 1/1 Running 0 6h24m
lilt batch-worker-gpuv3-neural-6f596bf9f6-dbzt4 1/1 Running 0 6h24m
lilt batchv2-neural-8679969f97-st8tc 1/1 Running 0 5d1h
lilt batchv3-neural-77f96455dc-tgqth 1/1 Running 0 4d22h
lilt cache-redis-master-0 1/1 Running 0 5d1h
lilt clickhouse-shard0-0 1/1 Running 0 5d1h
lilt connectors-ingressgateway-6d85dd9b5-58n2f 1/1 Running 0 5d1h
lilt converter-core-64d87b55d-fbffd 1/1 Running 4 (5d1h ago) 5d1h
lilt elasticsearch-master-0 1/1 Running 0 5d1h
lilt elasticsearch-master-1 1/1 Running 0 5d1h
lilt file-job-core-67598b8bb4-bv5r6 1/1 Running 0 5d1h
lilt file-translation-core-76fcfd7979-qj4dq 1/1 Running 0 5d1h
lilt front-app-7f75c8745c-w2bbs 2/2 Running 0 4d22h
lilt indexer-core-6bd8f76c45-lnxqs 1/1 Running 9 (5d1h ago) 5d1h
lilt istio-ingressgateway-6dc78b487c-nht2v 1/1 Running 0 5d1h
lilt job-core-5c5494cbcd-dxqr2 1/1 Running 0 5d1h
lilt langid-neural-57898fb4c5-7xcdv 1/1 Running 0 4d22h
lilt langid-neural-57898fb4c5-nfcqr 1/1 Running 0 4d22h
lilt lexicon-core-79557f46f7-z7sbz 1/1 Running 0 5d1h
lilt lilt-beehive-5bf9d6ff6b-n87dh 1/1 Running 0 5d1h
lilt lilt-configuration-api-5c49db4797-8jnxq 2/2 Running 0 5d1h
lilt lilt-connectors-builder-678448f4f6-4g4pq 2/2 Running 0 7h50m
lilt lilt-connectors-create-admin-user-kjv2k 0/1 Completed 0 6h25m
lilt lilt-connectors-exporter-cronjob-28856445-zqnkx 0/1 Completed 0 53s
lilt lilt-connectors-scheduler-cronjob-28856445-gjhd7 0/1 Completed 0 53s
lilt lilt-core-api-67bcb48c6b-84gtt 2/2 Running 0 5d1h
lilt lilt-dataflow-clickhouse-migration-job-5tqms 0/1 Completed 0 6h25m
lilt lilt-dataflow-generate-memory-snapshot-cronjob-28855920-l5cnf 0/1 Completed 0 8h
lilt lilt-dataflow-ingest-comments-cronjob-28856400-988db 0/1 Completed 0 45m
lilt lilt-dataflow-ingest-connectorjobs-cronjob-28856400-9kzxp 0/1 Completed 0 45m
lilt lilt-dataflow-ingest-revision-reports-cronjob-28855020-mhwd4 0/1 Completed 0 23h
lilt lilt-dataflow-ingest-wpa-minio-cronjob-28856280-6p5jz 0/1 Completed 0 165m
lilt lilt-dataflow-segment-quality-cronjob-28856400-rr9rs 0/1 Completed 0 45m
lilt lilt-manager-ui-f54749467-sc6cf 2/2 Running 9 (5d1h ago) 5d1h
lilt linguist-core-7ddd8fcfbb-6rv8b 1/1 Running 0 5d1h
lilt llm-inference-neural-6bb49c48b6-mtfff 1/1 Running 0 5d1h
lilt localpv-provisioner-5cfff7dcb5-g2zc5 1/1 Running 0 5d1h
lilt memory-core-f697888d9-ccxxh 1/1 Running 0 5d1h
lilt minio-85c7cb7f85-kspj5 1/1 Running 0 5d1h
lilt mongodb-688f5c8c5b-zmh77 1/1 Running 0 5d1h
lilt mq-rabbitmq-0 1/1 Running 1 (5d1h ago) 5d1h
lilt mysql-0 1/1 Running 0 5d1h
lilt nginx-ingress-65v2k 1/1 Running 0 5d1h
lilt nvidia-device-plugin-42ntc 1/1 Running 0 5d1h
lilt nvidia-device-plugin-njpdp 1/1 Running 0 5d1h
lilt nvidia-device-plugin-rqqp9 1/1 Running 0 5d1h
lilt qa-core-5d474c46b4-sncqn 1/1 Running 3 (5d1h ago) 5d1h
lilt routing-neural-79df5944c5-m962l 1/1 Running 0 4d22h
lilt search-core-5f9ddbb697-b46q4 1/1 Running 0 5d1h
lilt segment-core-8545b9dbcf-v5jzc 1/1 Running 4 (5d1h ago) 5d1h
lilt tag-core-544c77757d-9qxm6 1/1 Running 4 (5d1h ago) 5d1h
lilt tb-core-795dbf757d-9nkl7 1/1 Running 0 5d1h
lilt tm-core-55c74c49cf-ntmqz 1/1 Running 0 5d1h
lilt translate-neural-6c46b4755d-s84n2 1/1 Running 0 5d1h
lilt translatev2-neural-67fd8b7584-wmc5w 1/1 Running 0 5d1h
lilt translatev3-neural-f7f8669c8-28pxn 1/1 Running 0 5d1h
lilt update-manager-neural-77c89bf757-zs5d5 1/1 Running 12 (4d12h ago) 5d1h
lilt update-managerv2-neural-68cbfcf6d4-wh62m 1/1 Running 11 (4d23h ago) 5d1h
lilt update-managerv3-neural-6d4598547c-dk2k4 1/1 Running 5 (4d10h ago) 5d1h
lilt update-neural-6474b69644-kmxdh 1/1 Running 0 5d1h
lilt updatev2-neural-7f4746d4d9-wklzr 1/1 Running 0 5d1h
lilt updatev3-neural-8bd9f747d-8q7dj 1/1 Running 0 5d1h
lilt valid-words-replace-neural-7f9fc59958-hk8vw 1/1 Running 0 4d22h
lilt valid-words-update-neural-6956b97db9-gf5d4 1/1 Running 0 4d22h
lilt watchdog-core-67895dbc9d-pmg8n 1/1 Running 0 5d1h
lilt workflow-core-7bff648cf5-7b77g 1/1 Running 4 (5d1h ago) 5d1h
Step 2: Access Lilt Main Page
Once all pods are running/ready, connect to the LILT main page. The default domain name is bare.lilt.com, which is reachable via nginx-ingress running on the worker node.
If accessing from a workstation inside the cluster network, get the local ip address of the worker node:
kubectl get node worker -o wide | awk -v OFS='\t\t' '{print $1, $6}'
If accessing from an external source, check the cloud provider public IPv4 address assigned to the worker node.
Once the correct ip address is determined, modify the hosts file on your local workstation (/etc/hosts on Mac or Linux, C:\Windows\System32\drivers\etc\hosts on Windows).
Add a line for the worker node ip address and bare.lilt.com, for example:
<worker-node-ip>    bare.lilt.com
Navigate to LILT using a browser at bare.lilt.com.
After the admin password is set, LILT can be accessed via bare.lilt.com/signin.
Debugging
Depending on network speed (to download/upload containerd images), some of the pods can take more time than others. Here are some known debugging techniques:
- If you see Error: UPGRADE FAILED: timed out waiting for the condition, please continue, as this can happen due to the time taken by the pods to start up; the apps deployment proceeds as expected.
- If you see pods stuck in ContainerCreating for a longer time (>15 mins), it's safe to restart the pods by using kubectl delete pod -n lilt <podname>
- If the apps aren't healthy even after all containerd images have been loaded onto the node, it is safe to revert to the previous version and redo the install, using the commands below:
master $ cd $install_dir
## Rollback
$ helm rollback lilt 1
## Clean-up jobs
$ kubectl delete jobs -n lilt --all
## Remove statefulset PVCs
$ kubectl get pvc -n lilt | grep elasticsearch
$ kubectl delete pvc -n lilt <elasticsearch pvc as per above command>
## Perform Install
$ ./install-lilt.sh
Flannel Error
Sometimes flannel does not load correctly on all nodes. Identify the node on which the flannel init container does not complete, and restart containerd on that node:
sudo systemctl restart containerd
Gateway Error 502
Sometimes after a server reboot, nginx-ingress needs to be restarted to include new values for front. Try restarting the nginx-ingress daemonset:
# restart nginx-ingress
kubectl rollout restart daemonset/nginx-ingress -n lilt
CORE Pods
When a cluster is restarted, core pods may be stuck in CrashLoopBackOff. A potential fix is to delete the elasticsearch helm deployment and PVCs and then re-deploy elasticsearch:
# delete deployment
helm delete -n lilt elasticsearch
# get list of elastic search pvcs
kubectl get pvc -n lilt | grep elasticsearch
# delete pvc volumes based on the results, there will be two
kubectl delete pvc -n lilt pvc-1d81120d-5453-4fb0-ab8a-1dde8f85628f
kubectl delete pvc -n lilt pvc-34c86f17-84fc-4c21-8e05-4b07995f9379
# re-deploy
sh install_scripts/install-elasticsearch.sh
If some core pods are still not running, try restarting the lilt-beehive deployment:
kubectl rollout restart deployment/lilt-beehive -n lilt
GPUs
If GPUs are not working, verify that the expected number is allocated to the cluster:
kubectl get nodes -o custom-columns=":metadata.name" | while read node; do kubectl describe node "$node" | grep -i "nvidia.com/gpu"; done
Output should be similar to the following (the GPU node shows 8, the master and worker nodes show 0):
nvidia.com/gpu: 8
nvidia.com/gpu: 0
nvidia.com/gpu: 0
If the output shows zero for all nodes, reinstall the GPU drivers from the above section and reapply the nvidia-device-plugin:
sh install_scripts/install-nvidia-device-plugin.sh
Translate v2/3
Sometimes after a server reboot, translate pods for v2/3 will not initialize. This can be attributed to service endpoints not refreshing from the previous deployment. Possible solutions are restarting the rabbitmq statefulset and the minio deployment:
# restart rabbitmq
kubectl rollout restart statefulset/mq-rabbitmq -n lilt
# restart minio
kubectl rollout restart deployment/minio -n lilt
Restart all Deployments/Statefulsets
If there are various pod failures, you can also try restarting all Lilt deployments and statefulsets:
kubectl rollout restart deployment -n lilt
kubectl rollout restart statefulset -n lilt
Reset Cluster
If there are any issues that can't be resolved, as a last resort reset each node in the cluster.
NOTE: resetting the cluster will not remove all configuration artifacts. Follow the onscreen instructions output by the below reset command.
All Nodes (master, worker, gpu)
Log in to each node and complete the following steps.
Commands must be performed as root (since this involves the Kubernetes installation).
Reset the node:
kubeadm reset
Reinstall the cluster starting from the beginning of this document.