Overview

This article walks the installer through creating a test environment for LILT on a secure platform in fully managed Amazon Elastic Kubernetes Service (EKS), with either public or private access. A fully managed cluster is used to limit root access and to provide automated horizontal scaling of nodes and pods across multiple availability zones (AZs).

Tools you will need

  • A web browser for verifying resources in the AWS console during and after installation.
  • An AWS user with the permissions required to create/modify EKS, IAM, ECR, EBS and EC2 resources.
  • Access to the lilt-enterprise-releases S3 installation bucket.
  • Required utilities for the local machine:
    • AWS CLI
    • Terraform (version ~> 1.9.8)
    • kubectl
    • Helm (version ~> 2.0)
    • Docker/containerd
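Before continuing, it can help to confirm that the required utilities are installed and that the AWS CLI is authenticated as the intended user; a quick sanity check (illustrative only):
# confirm tool versions and AWS identity
aws --version
terraform version
kubectl version --client
helm version
docker --version
aws sts get-caller-identity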

Structure of this Article

The first several sections of this article cover manually preparing and creating the Kubernetes cluster, its nodes and other supporting packages. Once the Kubernetes cluster is ready, the remaining sections guide you through a partially automated install process to deploy and configure the software in your environment.

Installer privileges

All commands were run as a user with the appropriate permissions, as stated in the "Tools you will need" section above. To make it clear which machines/nodes are being referenced, the node prefixes main (control plane), worker and gpu are used throughout this article.

EKS Cluster

Base Image

Installation was tested with Amazon Linux 2023 (AL2023), but customers may choose any base OS supported by EC2 and EKS. Customers can also create a custom Amazon Machine Image (AMI) based on operational requirements (the correct NVIDIA drivers must be bootstrapped/installed).

EKS control-plane

The control plane handles cluster scheduling, networking and health, and is fully managed by AWS.

Worker-node instance(s)

This instance is the main application workhorse: it interacts with the control plane, hosts the containers for the main application and mounts storage. One node is usually sufficient, but the worker pool can be scaled horizontally for increased system performance and multi-AZ redundancy. With multiple worker nodes, the hardware requirements should be replicated accordingly; disk mounts also need to be shared across all nodes, which requires a distributed storage solution such as the EBS CSI Controller. The total system requirements can either be fulfilled by a single machine or split among multiple nodes that together meet or exceed the recommended system requirements (for multi-AZ redundancy, each node must be able to fully support the entire system's requirements).
  • Instance type: r5n.24xlarge (96 vCPUs, 768 GB RAM)
  • Boot disk space:
    • local storage, 1000 GB (ensure the /var partition used by containerd has at least 250 GB)

GPU Node

As with worker nodes, a single GPU node or multiple GPU nodes can be used to meet application demand. If using custom AMIs, the NVIDIA drivers must be installed; the default AL2023 EKS images already include the required NVIDIA drivers.
  • Instance type: g5.12xlarge (48 vCPUs, 192 GB RAM)
  • GPU: 4 x NVIDIA A10
  • Boot disk space:
    • local storage, 1000 GB (ensure the /var partition used by containerd has at least 250 GB)

Prerequisites

Since every customer and environment is different, the user should already have an EKS cluster installed and running. This can be accomplished with Terraform, eksctl or AWS CloudFormation. LILT has example scripts for both Terraform and eksctl that can be provided on request.
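As a minimal illustration only (the cluster name, region, Kubernetes version and node size below are placeholders and should be adapted to your environment; the LILT example scripts remain the reference), an eksctl invocation could look like:
# illustrative eksctl example - adjust name, region, version and instance types
eksctl create cluster \
  --name lilt-test \
  --region us-east-1 \
  --version 1.30 \
  --nodegroup-name worker \
  --node-type r5n.24xlarge \
  --nodes 1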

Nodes

At least one worker and one gpu node, with an AL2023 or Rocky 8 base OS
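To confirm that both node types are registered and running the expected OS image:
# verify node count, status and OS image
kubectl get nodes -o wide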

Addons

  • AWS load balancing, either a Network Load Balancer or the AWS Load Balancer Controller
  • EBS CSI Controller (provides the shared storage needed for node scaling)
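One way to confirm both addons are present and healthy (assuming the EBS CSI driver was installed as an EKS addon and the controllers run in kube-system; the cluster name is a placeholder):
# EBS CSI driver installed as an EKS addon
aws eks describe-addon --cluster-name <cluster-name> --addon-name aws-ebs-csi-driver
# controller pods should be Running
kubectl get pods -n kube-system | grep -E 'ebs-csi|aws-load-balancer'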

Image Repository

ECR; all images must be hosted in a central private repository (external repositories have demonstrated inconsistent results due to the large LLM images).
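Note that, with default settings, ECR does not create repositories automatically on push, so each repository should exist before images are loaded in the steps below (the repository name and region are placeholders):
# create a private repository per image
aws ecr create-repository --repository-name <image-name> --region <region>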

Ports

It is up to the user whether to manage access via AWS security groups or firewalld on each node. Regardless, the following ports must be accessible:
# general cluster
22,80,443,2379,2380,5000,6443,10250,10251,10252
# api
5005,8011,8080
# istio
15000,15001,15006,15008,15009,15010,15012,15014,15017,15020,15021,15090,15443,20001
# clickhouse
8123
# WSO2
4000,9443,9763
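If access is managed with security groups, each of the ports above can be opened with a rule similar to the following (the security group ID, port and source CIDR are placeholders):
# illustrative ingress rule for a single port
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 6443 \
  --cidr 10.0.0.0/16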

Kernel

If utilizing the AWS EKS AL2023 base OS, most of these settings are already included by default:
# set kernel parameters as required by Istio
# avoid ztunnel container restarts due to load
# append to end of file
cat <<EOF >> /etc/security/limits.conf
* soft nofile 131072
* hard nofile 131072
EOF
cat <<EOF >> /etc/systemd/system.conf
DefaultLimitNOFILE=131072
EOF

# kernel params for k8s
bash -c 'cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
nf_nat
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
EOF'

modprobe overlay
modprobe br_netfilter
modprobe nf_nat
modprobe xt_REDIRECT
modprobe xt_owner
modprobe iptable_nat
modprobe iptable_mangle
modprobe iptable_filter

bash -c 'cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF'

sysctl --system
echo "Kernel parameters configured."

# Disable swap
echo "Disabling swap..."
swapoff -a
sed -e '/swap/s/^/#/g' -i /etc/fstab
echo "Swap disabled."

# Disable SELinux
echo "Disabling SELinux..."
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo "SELinux disabled."

# Configure vm.max_map_count for Elasticsearch
echo "Configuring vm.max_map_count for Elasticsearch..."
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
echo "vm.max_map_count configured."

Containerd

Each node should have its respective config.toml. Please ensure that GPU nodes have the config modified for NVIDIA.
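On GPU nodes built from a custom AMI, one common way to add the NVIDIA runtime to config.toml is the NVIDIA Container Toolkit helper shown below as a sketch (the default AL2023 EKS GPU images already ship with this configured, so this step can be skipped there):
# register the NVIDIA runtime in /etc/containerd/config.toml and restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd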

Install Package

LILT provides a complete install package available for download from S3. The package needs to be loaded onto a local machine that can use helm and kubectl for installation. The following steps cover the manual installation procedure, but users can create automated scripts based on their environment.

Set Version

Set the install package version:
export RELEASE_TAG="lilt-enterprise-2025.03.11"

Download Package

Download the entire install package and create an install directory:
aws s3 sync s3://lilt-enterprise-releases/$RELEASE_TAG/ ./$RELEASE_TAG/
mkdir -p ./$RELEASE_TAG/install_dir
tar -xzf ./$RELEASE_TAG/install_packages/on-prem-installer* -C ./$RELEASE_TAG/install_dir

Load images to ECR

Images included in the installer package are tagged for the LILT default image repository and need to be retagged to match the new user environment. The following example shows how to tag and push images from a local machine to ECR. Set the user environment variables:
export AWS_ACCOUNT="1234567890"
export AWS_REGION="us-east-1"
Ensure that the user has ECR access and is logged in:
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com
Load all install package images to local machine docker, tag them and push to ECR:
# Loop over all Docker image tar files
for image in $(find "./$RELEASE_TAG/docker_images" -type f); do
  echo "Processing image file: $image"

  # Load the Docker image
  loaded_image=$(docker load < "$image" | awk '/Loaded image:/ {print $3}')
  if [ -z "$loaded_image" ]; then
    echo "Failed to load image from $image, skipping..."
    continue
  fi
  echo "Loaded image: $loaded_image"
  
  # Construct the new ECR image name, replace default LILT registry
  new_image="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${loaded_image#lilt-registry.local.io:80/gcr.io/lilt-service-48916b30/}"
  
  # Retag the image
  docker tag "$loaded_image" "$new_image"
  echo "Tagged $loaded_image as $new_image"

  # Push the new tagged image to ECR
  docker push "$new_image"
  echo "Pushed $new_image to ECR"
done
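Once the loop completes, it is worth spot-checking that the repositories and tags are present in ECR (the repository name is a placeholder):
# list repositories and spot-check one image
aws ecr describe-repositories --region $AWS_REGION --query 'repositories[].repositoryName'
aws ecr list-images --repository-name <image-name> --region $AWS_REGION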

Create EBS CSI Controller Storage Class

Local PVC node storage does not allow for pod/node horizontal scaling. The best option is to use the EBS CSI Controller, one of the prerequisites mentioned above. EFS is also a viable option, at the discretion of the customer environment.
# create yaml
vi storageclass.yaml

# paste the following code
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3
parameters:
  fsType: ext4
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

# apply yaml
kubectl apply -f storageclass.yaml

Update Image Repository

By default, the helm charts and on-prem-values.yaml use a local docker registry and need to be updated with the correct ECR values:
find install_dir -type f -exec sed -i "s|lilt-registry.local.io:80|$AWS_ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com|g" {} +

Update Storage Class

By default, helm chart values and scripts are set to the localpv storage class. Update all of them with the EBS CSI storage class created in the previous step:
find install_dir -type f -exec sed -i 's|openebs-hostpath|ebs-csi-gp3|g' {} +
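A quick grep can confirm that no files still reference the old registry or storage class before continuing:
# both commands should report the substitutions as complete
grep -rl 'lilt-registry.local.io:80' install_dir || echo "registry substitution complete"
grep -rl 'openebs-hostpath' install_dir || echo "storage class substitution complete"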

Node Labels

Pods are scheduled based on the helm chart nodeSelector attribute, so all respective nodes must be labeled with the worker or gpu labels. Worker nodes:
kubectl label nodes worker node-type=worker
GPU nodes:
kubectl label nodes gpu capability=gpu
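Verify that the labels were applied:
# node-type and capability should appear as columns for the labeled nodes
kubectl get nodes -L node-type,capability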

Install LILT

Once all cluster and install package prerequisites are met, you are ready to install the LILT app. Since the EKS cluster uses its own internal CNI, and the EBS CSI Controller storage class will handle PVC scheduling, the flannel and localpv charts can be commented out of the main install script. Open the main install script and comment out the following sections:
# open script
vi install_dir/install-lilt.sh

# comment out
# kubectl label --overwrite ns kube-flannel pod-security.kubernetes.io/enforce=privileged
# sh install_scripts/install-flannel.sh
# wait until pod ready
# kubectl wait --namespace kube-flannel --for=condition=ready pod -l app=flannel --timeout=180s

# storage class
# sh install_scripts/install-localpv-provisioner.sh
# wait until pod ready
# kubectl wait --namespace lilt --for=condition=ready pod -l app=localpv-provisioner --timeout=180s

Run the main install script

The install script is located in the main install_dir:
sh install_dir/install-lilt.sh
This will create/install the required namespaces, secrets, certs, third-party apps and LILT charts. The total install takes around 60-90 minutes depending on system performance. Verify that all apps are running:
kubectl get pods -n lilt

Debugging

Depending on the network speed for downloading images from ECR, some pods can take more time than others. Here are some known debugging techniques:
  • Error: UPGRADE FAILED: timed out waiting for the condition: please continue, as this can happen due to the time taken by the pods to start up; the app deployment still completes as expected.
  • Pods stuck in ContainerCreating for more than 15 minutes: it is safe to restart them with kubectl delete pod -n lilt <podname>.
  • If the apps are not healthy even after all images have been loaded to the node, it is safe to roll back to the previous version and redo the install, using the commands below:
# Rollback to docker-registry version
helm rollback lilt 1 
# Clean-up jobs
kubectl delete jobs -n lilt --all 
# Remove statefulset PVCs, elasticsearch as an example
kubectl get pvc -n lilt | grep elasticsearch
kubectl delete pvc -n lilt <elasticsearch pvc as per previous command>