# Install System (Amazon Linux 2023 Or Rocky 8/9)

## Overview

This article walks the installer through the process of bringing the environment up to test LILT in a secure platform, whether in the cloud (public or private), or on bare-metal hardware. The intent is to limit the amount of manual console interaction in favor of scripted installation that is fully documented.

In working through the installation, references will be made as appropriate to provide context.

#### System Parameters/Versions:

* LILT: 5.0-5.1

* Base OS:

  * Amazon Linux 2023: `ami-05576a079321f21f8`

  * Rocky 8.10/9.5

* K8S: 1.32.4

* Pause: 3.10 (can be configured in `containerd/config.toml`)

* Containerd:

  * 2.0.5 (v2.0.0+ requires K8S >= 1.30 and Rocky 9 with kernel 5.x)

  * 1.6.x/1.7.x (Rocky 8 with kernel 4.x)

* Runc: 1.2.6

* CNI plugin: 1.7.1

* Flannel: 0.26.0

* NVIDIA Driver: 565.57.01, dkms

* Cuda: 12.7

* Cuda-toolkit: 12.6

* Cudnn: 9.6.0

* GPU: L4 (current standard; A10G also validated)

## Tools you will need

SSH to connect to systems.

* On macOS or Linux, you can open a terminal window and use ssh.

* On Windows, you can use the PuTTY ssh client.

Web browser for post installation and verification.

### Installer privileges

Many sections of this article include text formatted to indicate console input and output. Commands are prefixed with `$` (regular user) or `#` (root) to show whether the administrator should run them as root. To show which machine a command runs on, the prefix node, master, or gpu is given whenever the machine should be switched. All commands that follow a switch run on that machine.

## Prerequisites

Before running installation scripts we need to prepare the system with all the dependencies.

### Data Location

For the rest of the installation, please ensure that:

Containerd, images and RPM packages are mounted and available to all of the nodes (master, worker, and GPU).

In the typical installation, the system administrator will receive this data from LILT prior to installation, and the system administrator will mount all requirements.

### Kubernetes Cluster Installation

#### Installation Overview

##### Kubernetes

The LILT system is a collection of containers that interact with each other. LILT requires the use of Kubernetes to handle orchestration. Further, there must be persistent volume storage mounted to nodes in Kubernetes.

##### LILT Component Overview

Note: See the section “LILT System Architecture Diagram” for a visual depiction of the following information.

The LILT application consists of three major logical component groups:

* “Front”, which services API calls as well as the interface and editor logic

* “Neural”, which performs neural machine translation

* “Core”, which performs linguistic pre-processing and post-processing, as well as document import and export

RabbitMQ is used for message passing between the services, MySQL for the application database, ElasticSearch for search, Minio as S3 storage, OpenEBS for storage allocation, nginx-ingress as the ingress controller, and Redis as a memory cache. For each of these seven services, LILT provides a containerd image that is installed as part of the installation process. Optionally, customers can point the LILT system at a self-hosted version of some of these services (for example, a separately running ElasticSearch instance, AWS RDS in place of MySQL, or S3 instead of Minio).

Finally, there must be a persistent location for stored user documents that is mountable as a persistent volume to a Kubernetes node. This volume will be used by Minio, ElasticSearch, MySQL, Redis, and RabbitMQ.

#### LILT System Architecture Diagram

Refer to: [https://self-managed-docs.lilt.com/kb/lilt-system-architecture](https://self-managed-docs.lilt.com/kb/lilt-system-architecture)

#### System Maintenance and Update Frequency

LILT recommends a system update every quarter. The update can proceed in one of two ways:

1. (Recommended) A customer systems engineer installs the system from scratch given an installation document and assets package delivered via cloud or flash drive.

2. A LILT systems engineer is given SSH access to the customer system, and performs the update remotely.

To schedule a system update, contact your Account Manager.

### Recommended System Requirements

LILT recommends a minimum of three separate servers as nodes for the Kubernetes cluster.

#### k8s-master server

This server controls cluster scheduling, networking and health. In comparison to the node server(s), it is resource-light.

Instance type: m5.xlarge (4 vCPUs, 16 GB RAM)

Disk space: 500 GB

#### k8s-node server(s)

These servers are the main application workhorses: they listen to the master server, host containerd containers for the main application, and mount storage. Usually one node suffices, but for increased system performance, multiple nodes can be set up. In that case, the hardware requirements should be replicated for each node, with the exception of the disk mounts, which need to be shared across all nodes. A few notes here:

1. The total system requirements for the node server can either be fulfilled on a single machine, or split among multiple nodes that in sum are equal or greater than the recommended system requirements. However, if splitting the node server into separate physical nodes, please note that individual nodes have *minimum* requirements; see the details below.

2. On the nodes with GPUs installed, NVIDIA drivers will have to be installed (these will be specified in the installation document).

##### Worker Node

Instance type: r5n.24xlarge (96 vCPUs, 768 GB RAM)

Total disk space: 2 TB

* If creating multiple drives:

  * Boot disk space: 400 GB

  * Common space: 1.6 TB (increase based on total documents ingested)

    * containerd and all PVCs

If unable to create multiple mounts/drives and all data is stored in root, one drive of 2TB is sufficient.
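The disk sizes above can be checked before installation. The following is a minimal sketch, not part of the LILT tooling: the function names `kib_to_gb` and `fs_size_gb` are hypothetical, and it assumes the figures in this article are decimal units (1 TB = 1000 GB).

```bash
# Convert a size in KiB (the unit printed by `df -k`) to whole decimal GB.
kib_to_gb() {
  echo $(( $1 * 1024 / 1000000000 ))
}

# Print the total size, in GB, of the filesystem that holds the given path.
# Usage: fs_size_gb /
fs_size_gb() {
  local size_kib
  # -P forces the POSIX one-line-per-filesystem format; column 2 is the total size.
  size_kib=$(df -Pk "$1" | awk 'NR == 2 { print $2 }')
  kib_to_gb "$size_kib"
}
```

For example, `fs_size_gb /` on a worker node with a single 2 TB root drive should report a value close to 2000, slightly less once filesystem overhead is subtracted.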

##### GPU Node

V4 models require a minimum of 24 GB GPU memory: either two combined T4s, one L4, one A10, or one A100. Batch processing requires a minimum of one GPU: T4, L4, A10, or A100.

The following are tested recommendations, but any configuration can be used as long as the minimum requirements are met.
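Once the NVIDIA driver is installed, the combined GPU memory on a node can be summed from `nvidia-smi`. This is a hedged sketch with hypothetical helper names. The threshold is passed in by the caller because `nvidia-smi` reports MiB, and a nominal 24 GB card reports less than 24576 MiB (the sample output later in this article shows 23028 MiB for an A10G), so the value 23000 used below is an assumption.

```bash
# Sum per-GPU memory totals (MiB, one integer per line) read from stdin.
sum_gpu_mem_mib() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Succeed when the summed GPU memory meets a caller-chosen threshold in MiB.
# Usage:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | gpu_mem_at_least 23000
gpu_mem_at_least() {
  local min_mib="$1" total
  total=$(sum_gpu_mem_mib)
  echo "total GPU memory: ${total} MiB (threshold: ${min_mib} MiB)"
  [ "$total" -ge "$min_mib" ]
}
```

Reading from stdin keeps the helpers usable on machines without a GPU, for example when checking saved `nvidia-smi` output from another node.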

Minimum instances, T4 GPUs (node with 2 GPUs required for translate v4, node with 1 GPU required for batch):

* 1 x g4ad.8xlarge (32 vCPUs, 128 GB RAM, 2 GPUs)

* 1 x g4dn.4xlarge (16 vCPUs, 64 GB RAM, 1 GPU)

Sufficient instance type, T4 GPUs:

g4dn.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)

Preferred instance type, L4 GPUs (current standard; see [Self-Managed Hardware Requirements](/kb/self-managed-hardware-requirements)):

g6.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs)

Optimal instance type, L4 GPUs:

g6.48xlarge (192 vCPUs, 768 GB RAM, 8 GPUs)

Alternative instance type, A10 GPUs (still supported):

g5.12xlarge (48 vCPUs, 192 GB RAM, 4 GPUs) or g5.48xlarge (192 vCPUs, 768 GB RAM, 8 GPUs)

Total disk space: 2 TB. These store pods' runtime assets, including large neural assets.

## Install Kubernetes Cluster

### All Nodes (master, worker, gpu)

Log in to each node and complete the following steps.

Commands must be run as root, since they involve installing Kubernetes.

```bash theme={null}
sudo su -
```

#### Step 1: Update base OS, Install required packages

Each base OS has different package requirements.

* *Amazon Linux 2023*:

```bash theme={null}
# update OS
dnf check-release-update
sudo dnf update -y

# install bash
sudo dnf install bash -y

# install git
sudo dnf install -y git

# install ip route, used by k8s
sudo dnf install -y iproute-tc
```

* *Rocky 8/9*:

```bash theme={null}
# update system
dnf update -y

# utilities
sudo dnf install -y curl wget vim bash-completion gnupg2 lvm2

# required to unzip packages
dnf install zip unzip -y

# (optional) install aws
yum remove awscli
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

# install bash
sudo dnf install bash -y

# install jq
sudo dnf install jq -y

# install git
sudo dnf install -y git

# reboot so that the below packages can use the updated kernel
reboot
```

#### Step 2: Install helm

```bash theme={null}
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```

#### Step 3: Install k9s (MASTER NODE only. Optional, but highly recommended)

```bash theme={null}
export K9S_V="0.50.4"
curl -LO "https://github.com/derailed/k9s/releases/download/v$K9S_V/k9s_Linux_amd64.tar.gz"
tar -xzf k9s_Linux_amd64.tar.gz
sudo mv k9s /usr/local/bin/
```

#### Step 4: Set kernel parameters as required by Istio

```bash theme={null}
# avoid ztunnel container restarts due to load
# append to end of file
cat <<EOF >> /etc/security/limits.conf
* soft nofile 131072
* hard nofile 131072
EOF

cat <<EOF >> /etc/systemd/system.conf
DefaultLimitNOFILE=131072
EOF
```

#### Step 5: Set kernel parameters as required by Kubernetes

```bash theme={null}
bash -c 'cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
nf_nat
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
EOF'

modprobe overlay
modprobe br_netfilter
modprobe nf_nat
modprobe xt_REDIRECT
modprobe xt_owner
modprobe iptable_nat
modprobe iptable_mangle
modprobe iptable_filter

bash -c 'cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF'

sysctl --system
```
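To confirm that the modules from this step are loaded, the output of `lsmod` can be compared against the list written to `/etc/modules-load.d/k8s.conf`. The helper below is a minimal sketch with a hypothetical name. It assumes every module is built as a loadable module: anything compiled into the kernel does not appear in `lsmod` and would be reported as missing.

```bash
# Kernel modules listed in /etc/modules-load.d/k8s.conf in this step.
REQUIRED_MODULES="overlay br_netfilter nf_nat xt_REDIRECT xt_owner iptable_nat iptable_mangle iptable_filter"

# Print every required module that is absent from `lsmod`-style input on stdin.
# Usage: lsmod | missing_modules
missing_modules() {
  local loaded mod
  # Skip the header line and keep only the module-name column.
  loaded=$(awk 'NR > 1 { print $1 }')
  for mod in $REQUIRED_MODULES; do
    echo "$loaded" | grep -qx "$mod" || echo "$mod"
  done
}
```

An empty result from `lsmod | missing_modules` means all eight modules are loaded. The sysctl values can be checked directly with `sysctl net.ipv4.ip_forward`.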

#### Step 6: Turn off swap, disable SELinux, update local firewall changes, and modify max\_map\_count

<Frame caption="">
  <img src="https://mintcdn.com/lilt-db26f913/zqbAWzfhk3tSI8_P/images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg?fit=max&auto=format&n=zqbAWzfhk3tSI8_P&q=85&s=ac469da7929a9d307d906f16a13f7ac3" width="18" height="18" data-path="images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg" />
</Frame>

NOTE: If you are using a version of K8S older than v1.29, the insecure port (`10255`) may still be in use. The following setup does NOT open the old insecure port. It is highly recommended that customers upgrade to K8S v1.30 or higher, which uses the secure port (`10250`) by default.

```bash theme={null}
# disable swap
swapoff -a
sed -e '/swap/s/^/#/g' -i /etc/fstab

# disable SELinux enforcement (set to permissive mode)
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# setup firewall
dnf install -y firewalld
systemctl unmask firewalld && systemctl enable firewalld && systemctl restart firewalld
firewall-cmd --permanent --zone=public --set-target=ACCEPT
firewall-cmd --permanent --add-port={22,80,443,2379,2380,5000,6443,10250,10251,10252}/tcp
# api
firewall-cmd --permanent --add-port={5005,8011,8080}/tcp
# istio
firewall-cmd --permanent --add-port={15000,15001,15006,15008,15009,15010,15012,15014,15017,15020,15021,15090,15443,20001}/tcp
# flannel
firewall-cmd --permanent --add-port=8472/udp
# clickhouse
firewall-cmd --permanent --add-port=8123/tcp
# WSO2
firewall-cmd --permanent --add-port={4000,9443,9763}/tcp
# reload firewall
firewall-cmd --reload

# Add vm.max_map_count=262144 to /etc/sysctl.conf
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
# Apply the sysctl settings
sudo sysctl -p
# Verify the change
sysctl vm.max_map_count
```
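By default the kubelet refuses to start while swap is active, so it can be worth confirming that no swap devices remain after this step. The following sketch uses a hypothetical helper name and reads `/proc/swaps`-style input from stdin, which lets it be tried on saved output as well.

```bash
# Count active swap devices from /proc/swaps-style input.
# The first line is a header, and each later non-empty line is one device.
# Usage: swap_count < /proc/swaps
swap_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

A result of `0` from `swap_count < /proc/swaps` indicates swap is off. `getenforce` should print `Permissive`, and `firewall-cmd --list-ports` lists the ports opened above.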

#### Step 7: Install containerd and add runc to the runtime

```bash theme={null}
# containerd
export CONTAINERD_V="2.0.5"
export RUNC_V="1.2.6"

wget https://github.com/containerd/containerd/releases/download/v$CONTAINERD_V/containerd-$CONTAINERD_V-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-$CONTAINERD_V-linux-amd64.tar.gz

# systemd manages containerd in this install, so add the containerd unit file
mkdir -p /usr/local/lib/systemd/system/
curl "https://raw.githubusercontent.com/containerd/containerd/main/containerd.service" -o "/usr/local/lib/systemd/system/containerd.service"

# restart
systemctl daemon-reload
systemctl enable --now containerd

# runc
wget https://github.com/opencontainers/runc/releases/download/v$RUNC_V/runc.amd64
install -m 755 runc.amd64 /usr/local/sbin/runc
```

#### Step 8: Install cni plugin

```bash theme={null}
# cni plugin
export CNI_V="1.7.1"

wget https://github.com/containernetworking/plugins/releases/download/v$CNI_V/cni-plugins-linux-amd64-v$CNI_V.tgz
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v$CNI_V.tgz
```

#### Step 9: Setup containerd

##### Create containerd directory:

```bash theme={null}
mkdir -p /etc/containerd/
```

<Frame caption="">
  <img src="https://mintcdn.com/lilt-db26f913/zqbAWzfhk3tSI8_P/images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg?fit=max&auto=format&n=zqbAWzfhk3tSI8_P&q=85&s=ac469da7929a9d307d906f16a13f7ac3" width="18" height="18" data-path="images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg" />
</Frame>

NOTE: If you need to specify the pause sandbox image, add the following to each respective `config.toml`:

```bash theme={null}
[plugins]
    [plugins."io.containerd.grpc.v1.cri"]
        sandbox_image = "registry.k8s.io/pause:3.10"
```

##### Master and Worker Nodes:

```bash theme={null}
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
    [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                    runtime_type = "io.containerd.runc.v2"
                    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                        SystemdCgroup = true
EOF
sudo systemctl restart containerd
```

##### GPU Nodes:

```bash theme={null}
cat <<EOF > /etc/containerd/config.toml
version = 2
[plugins]
    [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
            default_runtime_name = "nvidia"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
                    runtime_type = "io.containerd.runc.v2"
                    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                        SystemdCgroup = true
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
                    privileged_without_host_devices = false
                    runtime_type = "io.containerd.runc.v2"
                    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                        BinaryName = "/usr/bin/nvidia-container-runtime"
                        SystemdCgroup = true
EOF
sudo systemctl restart containerd
```
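After writing `config.toml`, a quick check can confirm that the systemd cgroup driver is enabled and which runtime is the default. This is a sketch with hypothetical helper names that only reads the file. It assumes that containerd falls back to `runc` when `default_runtime_name` is not set.

```bash
# Succeed when the given containerd config enables the systemd cgroup driver.
# Usage: uses_systemd_cgroup /etc/containerd/config.toml
uses_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# Print the configured default runtime name, or "runc" when none is set.
# Usage: default_runtime /etc/containerd/config.toml
default_runtime() {
  local name
  name=$(sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\(.*\)".*/\1/p' "$1" | head -n1)
  echo "${name:-runc}"
}
```

On a GPU node `default_runtime /etc/containerd/config.toml` should print `nvidia`; on master and worker nodes it should print `runc`. `systemctl is-active containerd` confirms that the service restarted cleanly.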

#### Step 10: Install & Setup - Kubernetes

```bash theme={null}
# set the Kubernetes minor version and add the package repository
export K8S_V="1.32"

cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v$K8S_V/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF

# install kubelet, kubeadm, kubectl
sudo dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes

# enable kubelet
sudo systemctl enable --now kubelet
```

Verify the installation by checking the package versions:

```
kubectl version --client && kubeadm version
```
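The reported versions can also be compared against the `K8S_V` value used for the repository. The helper below is a minimal sketch with a hypothetical name. It assumes version strings of the form `v1.32.4`, as printed by `kubeadm version -o short`.

```bash
# Reduce a version such as "v1.32.4" to its "major.minor" part ("1.32").
major_minor() {
  echo "${1#v}" | cut -d. -f1,2
}

# Usage (on a node where kubeadm is installed and K8S_V is still exported):
#   [ "$(major_minor "$(kubeadm version -o short)")" = "$K8S_V" ] && echo "kubeadm matches $K8S_V"
```

Keeping the comparison at the minor-version level matches how the repository URL is pinned, since the patch release installed depends on what the repository currently serves.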

#### Step 11: Reboot node/server

```
reboot
```

#### Step 12 (GPU NODE ONLY): Install NVIDIA Drivers

Each base OS has different package requirements.

* *Amazon Linux 2023*:

```bash theme={null}
# update dependencies and kernel
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms 
sudo systemctl enable --now dkms
if uname -r | grep -q '^6\.12\.'; then
  sudo dnf install -y kernel-devel-$(uname -r)  kernel6.12-modules-extra
else
  sudo dnf install -y kernel-devel-$(uname -r)  kernel-modules-extra
fi

# add nvidia repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo

# clean cache
sudo dnf clean expire-cache

# install driver
sudo dnf module install -y nvidia-driver:565-dkms

# install cuda toolkit
sudo dnf install -y cuda-toolkit-12-6

# install cudnn
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz

# unpack
tar -xf cudnn-linux-x86_64-9.6.0.74_cuda12-archive.tar.xz

# copy to dirs
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/include/* /usr/local/cuda/include/
sudo cp -P cudnn-linux-x86_64-9.6.0.74_cuda12-archive/lib/* /usr/local/cuda/lib64/

# set permissions
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

# optional, remove downloads to free up space
rm -rf cudnn-linux-x86_64-9.6.0.74_cuda12-archive*

# update library cache
sudo ldconfig

# install container toolkit
# add repo
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
# install
# Pin to compatible version 1.17.6-1 and install
NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.6-1

sudo dnf install -y  nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# NOTE: If you want to use the latest nvidia-container-runtime:
#   1. Install the latest NVIDIA container toolkit:
#        sudo dnf install -y nvidia-container-toolkit
#   2. Enable legacy mode for compatibility (tested with nvidia-container-runtime v1.18.0):
#        sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=legacy
#        sudo systemctl restart containerd
#   This ensures proper integration with containerd.


# reboot to complete install
reboot
```

* *Rocky 8*:

```bash theme={null}
# extra required packages
sudo dnf -y install epel-release

# get rhel/rocky OS current version
export cur_ver="rhel$(rpm -E %rhel)"

# install kernel headers
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# install cuda-toolkit
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$cur_ver/x86_64/cuda-$cur_ver.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-6

# update nvidia path
export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}

# install nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms

# zlib is required for cudnn
sudo dnf -y install zlib

# install cudnn
sudo dnf -y install cudnn-cuda-12

# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# Pin to compatible version 1.17.6-1 and install
NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.6-1

sudo dnf install -y  nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# NOTE: If you want to use the latest nvidia-container-runtime:
#   1. Install the latest NVIDIA container toolkit:
#        sudo dnf install -y nvidia-container-toolkit
#   2. Enable legacy mode for compatibility (tested with nvidia-container-runtime v1.18.0):
#        sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=legacy
#        sudo systemctl restart containerd
#   This ensures proper integration with containerd.

# reboot to complete install
reboot
```

* *Rocky 9*:

```bash theme={null}
sudo dnf install -y epel-release dnf-utils gcc make kernel-devel-matched kernel-headers
sudo dnf install -y dkms
sudo dnf config-manager --set-enabled crb

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf clean all

# cuda toolkit required for cudnn
sudo dnf -y install cuda-toolkit-12-6

# nvidia driver
sudo dnf -y module install nvidia-driver:565-dkms

# Run nvidia-smi and capture the output and error
nvidia_output=$(nvidia-smi 2>&1)

# Check if nvidia-smi failed with a specific error
if echo "$nvidia_output" | grep -q "NVIDIA-SMI has failed"; then
  echo "nvidia-smi failed with error: NVIDIA-SMI has failed. Checking DKMS status..."
  
  # Get the DKMS status
  dkms_status=$(dkms status)
  echo "DKMS Status:"
  echo "$dkms_status"
  
  # Extract the NVIDIA module version and remove the trailing colon
  nvidia_version=$(echo "$dkms_status" | grep nvidia | awk '{print $1}' | awk -F/ '{print $2}' | sed 's/:$//')
  
  if [ -n "$nvidia_version" ]; then
    echo "Attempting to install NVIDIA DKMS module version: $nvidia_version"
    sudo dkms install nvidia/"$nvidia_version"
  else
    echo "No NVIDIA DKMS module found in DKMS status. Please check manually."
  fi
else
  echo "nvidia-smi is working correctly."
fi

# zlib is required for cudnn
sudo dnf -y install zlib

# install cudnn
sudo dnf -y install --allowerasing cudnn9-cuda-12

# install container-toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# Pin to compatible version 1.17.6-1 and install
NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.6-1

sudo dnf install -y  nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION}  libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# NOTE: If you want to use the latest nvidia-container-runtime:
#   1. Install the latest NVIDIA container toolkit:
#        sudo dnf install -y nvidia-container-toolkit
#   2. Enable legacy mode for compatibility (tested with nvidia-container-runtime v1.18.0):
#        sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=legacy
#        sudo systemctl restart containerd
#   This ensures proper integration with containerd.

```

Verify that all NVIDIA drivers/packages were correctly installed:

```bash theme={null}
nvidia-smi
```

Output should be similar to the following (depending on the number of server GPUs). If not, reinstall NVIDIA drivers from the previous section:

```bash theme={null}
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:16.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:17.0 Off |                    0 |
|  0%   19C    P8             10W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:18.0 Off |                    0 |
|  0%   21C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:19.0 Off |                    0 |
|  0%   21C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A10G                    Off |   00000000:00:1A.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   20C    P8              9W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Verify that the CUDA toolkit is installed:

```bash theme={null}
/usr/local/cuda/bin/nvcc -V
```

Output should be similar to the following:

```bash theme={null}
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
```

Verify that the container toolkit is installed:

```bash theme={null}
nvidia-container-cli -V
```

Output should be similar to the following:

```bash theme={null}
cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+0000
build revision: 16f37fcafcbdaf67525135104d60d98d36688ba9
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
```

## Initialize Cluster

#### Master Node

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`), since they involve installing Kubernetes.

Run the command below to initialize and set up the Kubernetes master.

<Note>
  NOTE: Ensure that CNI and Service CIDRs do NOT conflict with local server/node IP addresses. CNI\_CIDR must match Flannel IP value set below.
</Note>

The CIDR `192.168.100.0/19` allows for up to 32 nodes (Flannel reserves one subnet for each node). If more nodes are required, use a larger network, that is, a shorter prefix length such as `/18`.
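The node count follows from the prefix lengths. The sketch below assumes each node receives a `/24` subnet, which is the usual Flannel default. The function name `node_capacity` is hypothetical, and the per-node prefix can be overridden if your Flannel configuration differs.

```bash
# Number of per-node subnets available in a pod CIDR.
# Usage: node_capacity POD_PREFIX [NODE_PREFIX]   (NODE_PREFIX defaults to 24, an assumption)
node_capacity() {
  local pod_prefix="$1" node_prefix="${2:-24}"
  echo $(( 1 << (node_prefix - pod_prefix) ))
}

node_capacity 19   # prints 32
node_capacity 18   # prints 64
```

Each step down in the pod prefix length doubles the number of nodes the cluster network can hold.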

CNI and service CIDRs can be modified to match respective network requirements:

```bash theme={null}
CNI_CIDR="192.168.100.0/19"
SERVICE_CIDR="192.168.200.0/19"
NODE_NAME="master"
kubeadm init \
    --pod-network-cidr $CNI_CIDR \
    --service-cidr $SERVICE_CIDR \
    --node-name $NODE_NAME
```

Save the section of the output that resembles the following, as you will need it to join the worker nodes to the master:

```bash theme={null}
# example, real output will differ
kubeadm join 10.10.3.8:6443 --token 72dx7i.i1q8hymni6lw3f3f \
        --discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7 
```

To use `kubectl` commands, update the kubeconfig (as root or as a regular user):

```bash theme={null}
# update kubeconfig
export KUBECONFIG=/etc/kubernetes/admin.conf
# or if not root user
mkdir -p $HOME/.kube
sudo cp -f /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

systemctl enable kubelet.service
systemctl start kubelet.service
```

Verify that the node has been installed correctly:

```bash theme={null}
kubectl get nodes
```

Expected output:

<Note>
  NOTE: STATUS will be `NotReady` until Flannel is installed.
</Note>

```
NAME     STATUS     ROLES           AGE   VERSION
master   NotReady   control-plane   66s   v1.32.4
```

If the join token has expired because time has passed between initializing the cluster and joining the nodes, run the following command to create a new join token:

```bash theme={null}
kubeadm token create --print-join-command
```

#### Worker Node

Log in to the k8s-worker node and follow the steps below. These steps must be performed as root (`sudo su -`), since they involve installing Kubernetes.

Run the following command to join the Kubernetes cluster created in the previous section:

```bash theme={null}
# example, real output will differ
NODE_NAME="worker"

kubeadm join 10.10.3.8:6443 --node-name $NODE_NAME \
        --token 72dx7i.i1q8hymni6lw3f3f \
        --discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7 
```

From the master node, verify that the worker node has joined correctly:

```bash theme={null}
kubectl get nodes
```

Expected output:

<Frame caption="">
  <img src="https://mintcdn.com/lilt-db26f913/zqbAWzfhk3tSI8_P/images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg?fit=max&auto=format&n=zqbAWzfhk3tSI8_P&q=85&s=ac469da7929a9d307d906f16a13f7ac3" width="18" height="18" data-path="images/f64c07f9-info-macro-icon--39985156a8a940b9a79d.svg" />
</Frame>

NOTE: STATUS will be `NotReady` until Flannel is installed.

```
NAME     STATUS     ROLES           AGE    VERSION
master   NotReady   control-plane   149m   v1.32.4
worker   NotReady   <none>          8s     v1.32.4
```

#### GPU Node

Log in to the k8s-gpu node and follow the steps below. These steps must be performed as root (`sudo su -`), since they involve installing Kubernetes.

Run the following command to join the Kubernetes cluster created in the previous section:

```bash theme={null}
# Example only: copy the join command printed on the master node; the IP, token, and hash will differ
NODE_NAME="gpu"
kubeadm join 10.10.3.8:6443 --node-name "$NODE_NAME" \
        --token 72dx7i.i1q8hymni6lw3f3f \
        --discovery-token-ca-cert-hash sha256:9befd70bffbdf6009b96b4e2a3419458a1f2c5661c062246bef4ae38ca8fd7f7
```

From the master node, verify that the GPU node has joined correctly:

```bash theme={null}
kubectl get nodes
```

Expected output:

<Note>
  NOTE: `STATUS` will be `NotReady` until Flannel is installed.
</Note>

```
NAME     STATUS     ROLES           AGE    VERSION
gpu      NotReady   <none>          9s     v1.32.4
master   NotReady   control-plane   177m   v1.32.4
worker   NotReady   <none>          28m    v1.32.4
```
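As a rough aid for the checks above, the sketch below filters node listings for anything that is not yet `Ready`. The function name `not_ready_nodes` is hypothetical and is not part of the LILT installer; it only parses the text that `kubectl get nodes --no-headers` prints.

```bash
# Hypothetical helper (not part of the LILT installer).
# Reads `kubectl get nodes --no-headers` output on stdin and prints the name
# of every node whose STATUS column is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is also reported.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage on the master node:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Before Flannel is installed, this is expected to list every node; an empty result afterwards suggests all nodes have become `Ready`.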

## Prepare the LILT Platform

### Import Dependencies

All images and scripts are packaged based on system requirements. Lilt is designed to work as a distributed system, utilizing specific nodes for each workload.

It is not necessary to download the entire install package on each node. The following examples utilize a remote s3 bucket for retrieving data.

<Note>
  NOTE: Using a central repository for all images is highly recommended. The examples below use local image storage on each node. If you use a central repository, skip the containerd image import sections and modify all helm charts to replace the default registry value `lilt-registry.local.io:80`.
</Note>

#### Step 1: Set Release Tag (all nodes)

This step can be skipped if you download the entire package separately.

Log in to each node (master, worker, gpu) and set the release tag variable:

```bash theme={null}
export RELEASE_TAG="lilt-enterprise-x.x.x"
```

#### Step 2: Import install packages and images

**Master Node**

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`).

Move to the root directory:

```bash theme={null}
cd /
```

Download install package and documentation from remote s3 bucket:

```bash theme={null}
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/install_packages/ /$RELEASE_TAG/install_packages --recursive --exclude "*" --include "on-prem-installer*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/documentation/ /$RELEASE_TAG/documentation --recursive
```

Unpack installer:

```bash theme={null}
mkdir -p /install_dir
tar -xzf /$RELEASE_TAG/install_packages/on-prem-installer* -C /install_dir
```

If not using a central image repository, complete the following:

* Download required images:

  * istio pilot/proxyv2/install-cni/ztunnel/kiali

  * flannel

```bash theme={null}
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*"
```

* Load images into containerd (the path uses the `RELEASE_TAG` variable set in Step 1; adjust it if the images are stored elsewhere):

```bash theme={null}
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
```

**Worker Node**

Log in to the k8s-worker node and follow the steps below. These steps must be performed as root (`sudo su -`).

Move to the root directory:

```bash theme={null}
cd /
```

If not using a central image repository, complete the following:

* Download required images (all of them):

```bash theme={null}
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/ /$RELEASE_TAG/docker_images --recursive
```

* Load images into containerd (the path uses the `RELEASE_TAG` variable set in Step 1; adjust it if the images are stored elsewhere):

```bash theme={null}
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
```

**GPU Node**

Log in to the k8s-gpu node and follow the steps below. These steps must be performed as root (`sudo su -`).

Move to the root directory:

```bash theme={null}
cd /
```

If not using a central image repository, complete the following:

* Download required images:

  * istio\_pilot/proxyv2/install-cni/ztunnel/kiali/k8s-device-plugin/metrics-server/flannel

  * all neural/llm/batch

```bash theme={null}
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/packages/ /$RELEASE_TAG/docker_images/master/packages --recursive --exclude "*" --include "pilot*" --include "proxyv2*" --include "install-cni*" --include "ztunnel*" --include "kiali*" --include "flannel*" --include "k8s-device-plugin*" --include "metrics-server*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/master/apps/ /$RELEASE_TAG/docker_images/master/apps --recursive --exclude "*" --include "neural*" --include "llm*" --include "batch*"
aws s3 cp s3://lilt-enterprise-releases/$RELEASE_TAG/docker_images/node/ /$RELEASE_TAG/docker_images/node --recursive --exclude "*" --include "neural*"
```

* Load images into containerd (the path uses the `RELEASE_TAG` variable set in Step 1; adjust it if the images are stored elsewhere):

```bash theme={null}
for file in $(find /$RELEASE_TAG/docker_images -type f) ; do ctr -n=k8s.io image import $file --digests=true ; done
```
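The import loop used on each node splits on whitespace, so an archive whose file name contains a space would be passed to `ctr` in pieces. A minimal sketch of an alternative is shown here, assuming the same directory layout; `import_images` and the `CTR_CMD` override are hypothetical names introduced only for illustration, and the `--digests=true` flag is placed before the file path.

```bash
# Hypothetical wrapper around the image import step (not part of the installer).
# CTR_CMD is an assumed override hook so the loop can be dry-run,
# for example with CTR_CMD=echo.
import_images() {
  dir="$1"
  # find -exec passes each file as a single argument, so names with spaces survive.
  find "$dir" -type f -exec ${CTR_CMD:-ctr} -n=k8s.io image import --digests=true {} \;
}

# Dry run, printing the commands instead of executing them:
#   CTR_CMD=echo import_images "/$RELEASE_TAG/docker_images"
# Real import:
#   import_images "/$RELEASE_TAG/docker_images"
```

Running the dry run first makes it easy to confirm which archives would be imported on a given node.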

#### Step 3: Label Nodes

The LILT cluster uses `nodeSelector` to schedule pods on specific nodes. This is a simple way to control where pods are scheduled: a key-value pair in the chart/manifest specification must match a label on the node.

**Master Node**

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`).

The worker node runs the bulk of the application workloads, typically separate from the master and GPU nodes. Label the node named `worker` (the node name set in the section above) by executing the following command:

```bash theme={null}
kubectl label nodes worker node-type=worker
```

The GPU node is generally used for running GPU-intensive pods/applications apart from the worker node. However, it can also run additional workloads if the instance has sufficient resources to handle both GPU and application tasks. The cluster admin decides whether GPU nodes also handle application workloads.

If this node runs the standard GPU services (translate, batch, vmf) and LLM services (llama, gemma, and whisper), label the node named `gpu` (the node name set in the section above) by executing the following command:

```bash theme={null}
kubectl label nodes gpu capability=gpu
```

Optional: if the GPU node has sufficient resources to run additional workloads, also add worker label:

```bash theme={null}
kubectl label nodes gpu capability=gpu node-type=worker
```

Verify output:

```bash theme={null}
# Show labels
kubectl get nodes --show-labels
```

If you need to remove a previous node label, use the following command. This example removes the `node-type` label from the `gpu` node:

```bash theme={null}
kubectl label node gpu node-type-
```

#### Step 4: Create Namespaces

The cluster uses several namespaces to separate pods and services.

**Master Node**

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`).

```bash theme={null}
kubectl create ns lilt
kubectl create ns istio-system
kubectl create ns kube-flannel
```

Confirm that namespaces were created:

```bash theme={null}
kubectl get ns
```

Output should be similar to the following:

```
NAME              STATUS   AGE
default           Active   4h
istio-system      Active   4h
kube-flannel      Active   4h
kube-node-lease   Active   4h
kube-public       Active   4h
kube-system       Active   4h
lilt              Active   4h
```
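`kubectl create ns` returns an error if a namespace already exists, which matters when this step is repeated after a partial install. The sketch below is one way to make the step repeatable; `ensure_namespaces` and the `KUBECTL` override variable are hypothetical names, not part of the LILT install scripts.

```bash
# Hypothetical helper (not part of the installer): create each namespace only
# if it does not exist yet. KUBECTL is an assumed override, defaulting to kubectl.
ensure_namespaces() {
  for ns in "$@"; do
    if ${KUBECTL:-kubectl} get ns "$ns" >/dev/null 2>&1; then
      echo "namespace $ns already exists"
    else
      ${KUBECTL:-kubectl} create ns "$ns"
    fi
  done
}

# Usage on the master node:
#   ensure_namespaces lilt istio-system kube-flannel
```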

#### Step 5: Modify Flannel Helm Chart Values

The Flannel helm chart CIDR must match the k8s CIDR range.

**Master Node**

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`).

Modify flannel `on-prem-values.yaml`:

```bash theme={null}
vi /install_dir/flannel/on-prem-values.yaml
```

Edit `podCidr` to match the k8s `CNI_CIDR` set above:

```yaml theme={null}
podCidr: "192.168.100.0/19"
```

Save file and exit.
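For scripted installs, the same edit can be made without opening an editor. The following is a minimal sketch that assumes `podCidr:` appears on its own line in `on-prem-values.yaml`; the function name `set_pod_cidr` is hypothetical, and a `.bak` copy of the file is kept.

```bash
# Hypothetical helper (not part of the installer): rewrite the podCidr value
# in a values file, preserving the indentation of the key.
# Assumes the key is written as "podCidr:" on a line of its own.
set_pod_cidr() {
  file="$1"
  cidr="$2"
  # "|" is used as the sed delimiter because the CIDR contains "/".
  sed -i.bak -E "s|^([[:space:]]*podCidr:).*|\1 \"$cidr\"|" "$file"
  # Show the resulting line(s) for a quick visual check.
  grep -n 'podCidr' "$file"
}

# Usage on the master node (use the CNI_CIDR value chosen earlier):
#   set_pod_cidr /install_dir/flannel/on-prem-values.yaml "192.168.100.0/19"
```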

#### Step 6: Image Pull Secrets (optional)

If you use a private central repository that requires authentication for hosting images, create image pull secrets for the cluster to reference. The following example uses a username and password. For service accounts with a `json` key, see [Create imagePullSecrets](/kb/create-imagepullsecrets).

**Master Node**

Log in to the k8s-master node and follow the steps below. These steps must be performed as root (`sudo su -`).

Create cluster image pull secret:

```bash theme={null}
kubectl -n <namespace> create secret docker-registry <secret-name> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-server=<registry-url> \
  --docker-email=<email>
```

<Note>
  NOTE: Image pull secrets are namespace specific. Create a separate secret in each additional namespace that needs one.
</Note>

Verify secret was created:

```bash theme={null}
kubectl get secrets -A
```
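Because the secret has to exist in every namespace that pulls from the private registry, a small loop can save repetition. This is a sketch under assumptions: `create_pull_secret`, the `KUBECTL` override, and the `REGISTRY_URL`, `REGISTRY_USER`, and `REGISTRY_PASSWORD` environment variables are hypothetical names chosen for illustration.

```bash
# Hypothetical helper (not part of the installer): create the same
# docker-registry secret in several namespaces.
# REGISTRY_URL, REGISTRY_USER and REGISTRY_PASSWORD are assumed to be exported
# beforehand; KUBECTL is an assumed override, defaulting to kubectl.
create_pull_secret() {
  secret="$1"
  shift
  for ns in "$@"; do
    ${KUBECTL:-kubectl} -n "$ns" create secret docker-registry "$secret" \
      --docker-server="$REGISTRY_URL" \
      --docker-username="$REGISTRY_USER" \
      --docker-password="$REGISTRY_PASSWORD"
  done
}

# Usage on the master node:
#   create_pull_secret <secret-name> lilt istio-system kube-flannel
```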

LILT helm charts utilize a custom values file, usually `on-prem-values.yaml`. Update these files with the new `imagePullSecret`.

As an example, this is from `redis` custom values:

```yaml theme={null}
global:
  imagePullSecrets:
    - <secret-name>
```

## Install LILT

### LILT app

Assuming that all images have been loaded onto the nodes as explained in the previous steps, proceed to install the apps.

#### Step 1: Installation

Run the main install script, which installs all third-party and LILT dependencies. This can take up to two hours to complete.

Change to install directory:

```bash theme={null}
cd /install_dir
```

Run install script:

```bash theme={null}
sh install-lilt.sh
```

If k9s is installed, use it to monitor install progress:

```bash theme={null}
k9s
```

If k9s is not installed, use kubectl to watch pods:

```bash theme={null}
kubectl get pods --all-namespaces --watch
```

You can also watch k8s events:

```bash theme={null}
kubectl events --all-namespaces --watch
```

When ready, list all pods:

```bash theme={null}
kubectl get pods --all-namespaces
```

Output should be similar to the following:

```
NAMESPACE      NAME                                                            READY   STATUS      RESTARTS         AGE
istio-system   istio-cni-node-7fmp4                                            1/1     Running     0                5d1h
istio-system   istio-cni-node-7fvcq                                            1/1     Running     0                5d1h
istio-system   istio-cni-node-ljg7h                                            1/1     Running     0                5d1h
istio-system   istiod-68fcbb5f87-vlg8m                                         1/1     Running     0                5d1h
istio-system   kiali-c5748bfb9-7ns2g                                           1/1     Running     0                6h26m
istio-system   prometheus-server-99f4cd586-5xbnt                               1/1     Running     0                5d1h
istio-system   ztunnel-b57qw                                                   1/1     Running     0                5d1h
istio-system   ztunnel-kwdz9                                                   1/1     Running     0                5d1h
istio-system   ztunnel-p8fhc                                                   1/1     Running     0                5d1h
kube-flannel   kube-flannel-ds-lvqr7                                           1/1     Running     0                5d2h
kube-flannel   kube-flannel-ds-m7rbh                                           1/1     Running     0                5d2h
kube-flannel   kube-flannel-ds-xqkfz                                           1/1     Running     0                5d2h
kube-system    coredns-76f75df574-7p66z                                        1/1     Running     0                5d2h
kube-system    coredns-76f75df574-svzk9                                        1/1     Running     0                5d2h
kube-system    etcd-master-shared-vpc                                          1/1     Running     1 (5d2h ago)     5d2h
kube-system    kube-apiserver-master-shared-vpc                                1/1     Running     1 (5d2h ago)     5d2h
kube-system    kube-controller-manager-master-shared-vpc                       1/1     Running     1 (5d2h ago)     5d2h
kube-system    kube-proxy-86855                                                1/1     Running     0                5d2h
kube-system    kube-proxy-jggvv                                                1/1     Running     0                5d2h
kube-system    kube-proxy-k5t6k                                                1/1     Running     0                5d2h
kube-system    kube-scheduler-master-shared-vpc                                1/1     Running     1 (5d2h ago)     5d2h
kube-system    metrics-server-865f9f4b55-b9t4r                                 1/1     Running     0                5d1h
lilt           assignment-core-976f6bc6c-xzsr8                                 1/1     Running     7 (5d1h ago)     5d1h
lilt           auditlog-core-67b45f45b9-h25fs                                  1/1     Running     4 (5d1h ago)     5d1h
lilt           auth-856c98d8f6-5f262                                           2/2     Running     0                4d22h
lilt           batch-tb-core-7dc4fd97c7-sfmks                                  1/1     Running     7 (5d1h ago)     5d1h
lilt           batch-worker-cpu-neural-74db44c979-pkl4t                        1/1     Running     0                6h24m
lilt           batch-worker-gpuv4-neural-866dcb59c4-hbhhd                      1/1     Running     0                6h24m
lilt           batchv4-neural-dd58d87b6-zzr6d                                  1/1     Running     0                5d1h
lilt           cache-redis-master-0                                            1/1     Running     0                5d1h
lilt           clickhouse-shard0-0                                             1/1     Running     0                5d1h
lilt           connectors-ingressgateway-6d85dd9b5-58n2f                       1/1     Running     0                5d1h
lilt           converter-core-64d87b55d-fbffd                                  1/1     Running     4 (5d1h ago)     5d1h
lilt           elasticsearch-master-0                                          1/1     Running     0                5d1h
lilt           elasticsearch-master-1                                          1/1     Running     0                5d1h
lilt           file-job-core-67598b8bb4-bv5r6                                  1/1     Running     0                5d1h
lilt           file-translation-core-76fcfd7979-qj4dq                          1/1     Running     0                5d1h
lilt           front-app-7f75c8745c-w2bbs                                      2/2     Running     0                4d22h
lilt           indexer-core-6bd8f76c45-lnxqs                                   1/1     Running     9 (5d1h ago)     5d1h
lilt           istio-ingressgateway-6dc78b487c-nht2v                           1/1     Running     0                5d1h
lilt           job-core-5c5494cbcd-dxqr2                                       1/1     Running     0                5d1h
lilt           langid-neural-57898fb4c5-7xcdv                                  1/1     Running     0                4d22h
lilt           langid-neural-57898fb4c5-nfcqr                                  1/1     Running     0                4d22h
lilt           lexicon-core-79557f46f7-z7sbz                                   1/1     Running     0                5d1h
lilt           lilt-beehive-5bf9d6ff6b-n87dh                                   1/1     Running     0                5d1h
lilt           lilt-configuration-api-5c49db4797-8jnxq                         2/2     Running     0                5d1h
lilt           lilt-connectors-builder-678448f4f6-4g4pq                        2/2     Running     0                7h50m
lilt           lilt-connectors-create-admin-user-kjv2k                         0/1     Completed   0                6h25m
lilt           lilt-connectors-exporter-cronjob-28856445-zqnkx                 0/1     Completed   0                53s
lilt           lilt-connectors-scheduler-cronjob-28856445-gjhd7                0/1     Completed   0                53s
lilt           lilt-core-api-67bcb48c6b-84gtt                                  2/2     Running     0                5d1h
lilt           lilt-dataflow-clickhouse-migration-job-5tqms                    0/1     Completed   0                6h25m
lilt           lilt-dataflow-generate-memory-snapshot-cronjob-28855920-l5cnf   0/1     Completed   0                8h
lilt           lilt-dataflow-ingest-comments-cronjob-28856400-988db            0/1     Completed   0                45m
lilt           lilt-dataflow-ingest-connectorjobs-cronjob-28856400-9kzxp       0/1     Completed   0                45m
lilt           lilt-dataflow-ingest-revision-reports-cronjob-28855020-mhwd4    0/1     Completed   0                23h
lilt           lilt-dataflow-ingest-wpa-minio-cronjob-28856280-6p5jz           0/1     Completed   0                165m
lilt           lilt-dataflow-segment-quality-cronjob-28856400-rr9rs            0/1     Completed   0                45m
lilt           lilt-manager-ui-f54749467-sc6cf                                 2/2     Running     9 (5d1h ago)     5d1h
lilt           linguist-core-7ddd8fcfbb-6rv8b                                  1/1     Running     0                5d1h
lilt           llm-inference-neural-6bb49c48b6-mtfff                           1/1     Running     0                5d1h
lilt           localpv-provisioner-5cfff7dcb5-g2zc5                            1/1     Running     0                5d1h
lilt           memory-core-f697888d9-ccxxh                                     1/1     Running     0                5d1h
lilt           minio-85c7cb7f85-kspj5                                          1/1     Running     0                5d1h
lilt           mongodb-688f5c8c5b-zmh77                                        1/1     Running     0                5d1h
lilt           mq-rabbitmq-0                                                   1/1     Running     1 (5d1h ago)     5d1h
lilt           mysql-0                                                         1/1     Running     0                5d1h
lilt           nginx-ingress-65v2k                                             1/1     Running     0                5d1h
lilt           nvidia-device-plugin-42ntc                                      1/1     Running     0                5d1h
lilt           nvidia-device-plugin-njpdp                                      1/1     Running     0                5d1h
lilt           nvidia-device-plugin-rqqp9                                      1/1     Running     0                5d1h
lilt           qa-core-5d474c46b4-sncqn                                        1/1     Running     3 (5d1h ago)     5d1h
lilt           routing-neural-79df5944c5-m962l                                 1/1     Running     0                4d22h
lilt           search-core-5f9ddbb697-b46q4                                    1/1     Running     0                5d1h
lilt           segment-core-8545b9dbcf-v5jzc                                   1/1     Running     4 (5d1h ago)     5d1h
lilt           tag-core-544c77757d-9qxm6                                       1/1     Running     4 (5d1h ago)     5d1h
lilt           tb-core-795dbf757d-9nkl7                                        1/1     Running     0                5d1h
lilt           tm-core-55c74c49cf-ntmqz                                        1/1     Running     0                5d1h
lilt           translatev4-neural-6c46b4755d-s84n2                              1/1     Running     0                5d1h
lilt           update-managerv4-neural-77c89bf757-zs5d5                        1/1     Running     0                5d1h
lilt           updatev4-neural-6474b69644-kmxdh                                1/1     Running     0                5d1h
lilt           valid-words-replace-neural-7f9fc59958-hk8vw                     1/1     Running     0                4d22h
lilt           valid-words-update-neural-6956b97db9-gf5d4                      1/1     Running     0                4d22h
lilt           watchdog-core-67895dbc9d-pmg8n                                  1/1     Running     0                5d1h
lilt           workflow-core-7bff648cf5-7b77g                                  1/1     Running     4 (5d1h ago)     5d1h
```
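With this many pods, scanning the list by eye is error-prone. The sketch below filters the listing down to pods that still need attention; `unhealthy_pods` is a hypothetical name, and the filter assumes the column order that `kubectl get pods --all-namespaces --no-headers` prints (NAMESPACE, NAME, READY, STATUS).

```bash
# Hypothetical helper (not part of the installer).
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin and
# prints pods that are neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")               # READY column, e.g. "1/2"
    if ($4 == "Completed") next         # finished jobs are fine
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
  }'
}

# Usage on the master node:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

An empty result suggests every pod is either `Running` with all containers ready or `Completed`.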

#### Step 2: Access Lilt Main Page

Once all pods are running and ready, connect to the LILT main page. The default domain name is [bare.lilt.com](http://bare.lilt.com), which is reachable via `nginx-ingress` running on the worker node.

If accessing from a workstation inside the cluster network, get the local IP address of the `worker` node:

```bash theme={null}
kubectl get node worker -o wide | awk -v OFS='\t\t' '{print $1, $6}'
```

If accessing from an external source, check the public IPv4 address that the cloud provider assigned to the `worker` node.

Once the correct IP address is determined, modify the `hosts` file on your local workstation:

```bash theme={null}
vi /etc/hosts
```

Add a line with the `worker` node IP address and bare.lilt.com:

```bash theme={null}
10.10.3.24   bare.lilt.com
```
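If the hosts entry is added by script rather than by hand, it helps to avoid duplicate lines on repeated runs. The sketch below is an illustration only; `add_hosts_entry` is a hypothetical name, and writing to `/etc/hosts` requires root on Mac or Linux.

```bash
# Hypothetical helper: append "<ip> <hostname>" to a hosts file unless the
# hostname is already mapped on a non-comment line.
add_hosts_entry() {
  ip="$1"
  host="$2"
  file="${3:-/etc/hosts}"
  if awk -v h="$host" '
       /^[[:space:]]*#/ { next }
       { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit !found }' "$file"; then
    echo "$host already present in $file"
  else
    printf '%s   %s\n' "$ip" "$host" >> "$file"
  fi
}

# Usage on the workstation (replace the IP with the worker node address):
#   sudo sh -c '. ./hosts-helper.sh; add_hosts_entry 10.10.3.24 bare.lilt.com'
# hosts-helper.sh is an assumed file name holding the function above.
```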

Navigate to LILT using a browser, [bare.lilt.com](http://bare.lilt.com/signin):

<Frame caption="">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/lilt-db26f913/images/c8894060-att_5_for_27820036.png" />
</Frame>

After the admin password is set, LILT can be accessed via [bare.lilt.com/signin](http://bare.lilt.com/signin).

<Frame caption="">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/lilt-db26f913/images/8886aa8d-att_1_for_27820036.png" />
</Frame>

#### Debugging

Depending on network speed (for downloading or uploading containerd images), some pods can take longer to start than others. Here are some known debugging techniques:

1. If you see *Error: UPGRADE FAILED: timed out waiting for the condition*, continue with the installation. This can happen when pods take a long time to start; the apps are still deployed as expected.

2. If you see pods stuck in *ContainerCreating* for a long time (more than 15 minutes), it is safe to restart them with `kubectl delete pod -n lilt <podname>`.

3. If the apps are not healthy even after all containerd images have been loaded on the node, it is safe to roll back to the previous version and redo the install using the commands below:

```bash theme={null}
# Run on the master node
cd /install_dir
## Rollback
helm rollback lilt 1
## Clean-up jobs
kubectl delete jobs -n lilt --all
## Remove statefulset PVCs
kubectl get pvc -n lilt | grep elasticsearch
kubectl delete pvc -n lilt <elasticsearch pvc as per above command>
## Perform Install
./install-lilt.sh
```

#### Flannel Error

Sometimes `flannel` does not load correctly on all nodes. Identify the node on which the `flannel` init container does not complete, then restart `containerd` on that node:

```bash theme={null}
sudo systemctl restart containerd
```

#### Gateway Error 502

Sometimes after a server reboot, `nginx-ingress` needs to be restarted to include new values for `front`. Try restarting the `nginx-ingress` daemonset:

```bash theme={null}
# restart nginx-ingress
kubectl rollout restart daemonset/nginx-ingress -n lilt
```

#### CORE Pods

When a cluster is restarted, `core` pods may be stuck in `CrashLoopBackOff`. A potential fix is to delete the `elasticsearch` helm deployment and PVCs and then re-deploy `elasticsearch`:

```bash theme={null}
# delete deployment
helm delete -n lilt elasticsearch

# get list of elastic search pvcs
kubectl get pvc -n lilt | grep elasticsearch

# delete pvc volumes based on the results, there will be two
kubectl delete pvc -n lilt pvc-1d81120d-5453-4fb0-ab8a-1dde8f85628f
kubectl delete pvc -n lilt pvc-34c86f17-84fc-4c21-8e05-4b07995f9379

# re-deploy
sh install_scripts/install-elasticsearch.sh
```
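The PVC names differ per cluster, so they cannot be copied from the example above. One possible way to avoid retyping them is sketched here, assuming the relevant claims contain `elasticsearch` in their name; `delete_es_pvcs` and the `KUBECTL` override are hypothetical names. With `-o name`, kubectl prints each claim as `persistentvolumeclaim/<name>`, which `kubectl delete` accepts directly.

```bash
# Hypothetical helper (not part of the installer): delete every PVC in the
# lilt namespace whose name contains "elasticsearch".
# KUBECTL is an assumed override, defaulting to kubectl.
delete_es_pvcs() {
  ${KUBECTL:-kubectl} get pvc -n lilt -o name \
    | grep elasticsearch \
    | while IFS= read -r pvc; do
        ${KUBECTL:-kubectl} delete -n lilt "$pvc"
      done
}

# Usage on the master node, after `helm delete -n lilt elasticsearch`:
#   delete_es_pvcs
```

Review the output of `kubectl get pvc -n lilt | grep elasticsearch` first, since deleting a PVC discards the data stored on it.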

If some core pods are still not running, try restarting the `lilt-beehive` deployment:

```bash theme={null}
kubectl rollout restart deployment/lilt-beehive -n lilt
```

#### GPUs

If GPUs are not working, verify that the expected number is allocated to the cluster:

```bash theme={null}
kubectl get nodes -o custom-columns=":metadata.name" | while read node; do kubectl describe node "$node" | grep -i "nvidia.com/gpu"; done
```

Output should be similar to the following (the GPU node shows 8; the master and worker nodes show 0):

```bash theme={null}
nvidia.com/gpu:   8
nvidia.com/gpu:   0
nvidia.com/gpu:   0
```
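A more compact view of the same information can be produced with `custom-columns`, which prints one line per node. The sketch below is an assumption-based alternative, not the documented procedure: `gpu_total` and the `KUBECTL` override are hypothetical names, and nodes without the `nvidia.com/gpu` resource show `<none>` in the GPU column.

```bash
# Hypothetical helper (not part of the installer): list allocatable GPUs per
# node and print a total. KUBECTL is an assumed override, defaulting to kubectl.
gpu_total() {
  ${KUBECTL:-kubectl} get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | awk '
        { print }
        $2 ~ /^[0-9]+$/ { total += $2 }   # skip "<none>" entries
        END { print "total allocatable GPUs: " (total + 0) }'
}

# Usage on the master node:
#   gpu_total
```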

If the output shows zero for all nodes, reinstall the GPU drivers as described in the section above and reapply the nvidia-device-plugin with `sh install_scripts/install-nvidia-device-plugin.sh`.

#### Translate v4

Sometimes after a server reboot, translate pods will not initialize. This can be attributed to service endpoints not refreshing from the previous deployment. Possible solutions are restarting the `rabbitmq` statefulset and `minio` deployment:

```bash theme={null}
# restart rabbitmq
kubectl rollout restart statefulset/mq-rabbitmq -n lilt

# restart minio
kubectl rollout restart deployment/minio -n lilt
```

#### Restart all Deployments/Statefulsets

If several pods are failing, you can also try restarting all LILT deployments and statefulsets:

```
kubectl rollout restart deployment -n lilt
kubectl rollout restart statefulset -n lilt
```
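`kubectl rollout restart` returns immediately, so it does not show whether the restarted workloads came back. A minimal sketch for waiting on them follows; `wait_for_rollouts`, the `KUBECTL` override, and the 10-minute timeout are assumptions chosen for illustration.

```bash
# Hypothetical helper (not part of the installer): wait for every deployment
# and statefulset in a namespace to finish rolling out, and report the ones
# that do not finish within the timeout.
# KUBECTL is an assumed override, defaulting to kubectl.
wait_for_rollouts() {
  ns="${1:-lilt}"
  ${KUBECTL:-kubectl} get deployment,statefulset -n "$ns" -o name \
    | while IFS= read -r res; do
        ${KUBECTL:-kubectl} rollout status -n "$ns" "$res" --timeout=10m \
          || echo "not ready: $res"
      done
}

# Usage on the master node:
#   wait_for_rollouts lilt
```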

## Reset Cluster

If there are any issues that can’t be resolved, reset each node in the cluster as a last resort.

<Note>
  NOTE: Resetting the cluster will not remove all configuration artifacts. Follow the on-screen instructions printed by the reset command below.
</Note>

#### All Nodes (master, worker, gpu)

Log in to each node and complete the following steps.

Commands must be performed as root because they modify the Kubernetes installation.

```bash theme={null}
sudo su -
```

Reset the node:

```bash theme={null}
kubeadm reset
```

Reinstall the cluster starting from the beginning of this document.