Overview
This article walks the installer through the process of creating a test environment for LILT on a secure platform in a fully managed cloud Elastic Kubernetes Service (EKS) cluster (public or private access). The purpose of using a fully managed cluster is to limit root access and to provide automated horizontal scaling for nodes and pods across multiple availability zones (AZ).

Tools you will need
- A web browser to browse the AWS console during/post installation for verification.
- AWS user with required permissions to create/modify EKS, IAM, ECR, EBS and EC2.
- Access to s3 lilt-enterprise-releases installation bucket.
- Required utilities for the local machine:
  - AWS CLI
  - Terraform (version ~> 1.9.8)
  - kubectl
  - Helm (version ~> 2.0)
  - Docker/containerd
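To confirm the utilities are available on the local machine before starting, a quick version check can be run (output will vary by environment):

```
# Verify required CLI utilities are installed and on the PATH
aws --version
terraform version
kubectl version --client
helm version
docker --version
```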
Structure of this Article
The first several sections of this article are related to manually preparing/creating the Kubernetes cluster/nodes and other supporting packages. Once the Kubernetes cluster is ready, a number of sections will guide you through a partially-automated install process to deploy and configure the software inside your environment.

Installer privileges
All commands were run as a user with the appropriate permissions stated in the “Tools” section above. Additionally, to help the user know which machines/nodes are being referenced, the node prefixes main (control plane), worker, and gpu are used throughout this article.

EKS Cluster
Recommended System Requirements
Base Image
Installation was tested with Amazon Linux 2023 (AL2023), but customers may choose any bare OS supported by EC2 and EKS. Additionally, customers can create a custom Amazon Machine Image (AMI) based on operational requirements (this requires the correct NVIDIA drivers to be bootstrapped/installed).

EKS control-plane
This controls cluster scheduling, networking, and health. This is managed by AWS.

Worker-node instance(s)
This instance is the main application workhorse: it interacts with the control plane, hosts containers for the main application, and mounts storage. Usually one node is sufficient, but it can be horizontally scaled for increased system performance and multi-AZ redundancy. When using multiple worker nodes, the hardware requirements should be replicated accordingly; however, disk mounts need to be shared across all nodes, which requires a distributed storage solution such as the EBS CSI Controller. The total system requirements can either be fulfilled on a single machine or split among multiple nodes that in sum are equal to or greater than the recommended system requirements (for multi-AZ redundancy, each node must be able to fully support the full system requirements).

- Instance type: r5n.24xlarge (96 vCPUs, 768 GB RAM)
- Boot disk space: local storage, 1000 GB (ensure the /var partition used by containerd has at least 250 GB)
GPU Node
As with worker nodes, a single GPU node or multiple GPU nodes can be used to meet application demand. If using custom AMIs, the NVIDIA drivers will have to be installed; the default AL2023 EKS images already include the required NVIDIA drivers.

- Instance type: g5.12xlarge (48 vCPUs, 192 GB RAM)
- GPU: 4 x NVIDIA A10
- Boot disk space: local storage, 1000 GB (ensure the /var partition used by containerd has at least 250 GB)
Prerequisites
Since every customer and environment is different, the user should already have an EKS cluster installed and running. This can be accomplished with terraform, eksctl, or AWS CloudFormation. LILT has example scripts for both terraform and eksctl, which can be provided on request.
Nodes
At least one worker and one gpu node, with an AL2023 or Rocky 8 base OS.
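Once the nodes have joined the cluster, they can be verified with kubectl:

```
# List cluster nodes with instance details and OS image
kubectl get nodes -o wide
```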
Addons
- AWS load balancing, either a Network Load Balancer or the AWS Load Balancer Controller
- EBS CSI Controller (provides the storage necessary for node scaling)
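These addons can be verified before proceeding. A hedged check, assuming the EBS CSI driver was installed as an EKS managed addon and the Load Balancer Controller via its default Helm chart (names and install locations are assumptions):

```
# Check the EBS CSI driver managed addon status
aws eks describe-addon --cluster-name <cluster-name> --addon-name aws-ebs-csi-driver

# Check that the AWS Load Balancer Controller deployment is running
kubectl get deployment -n kube-system aws-load-balancer-controller
```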
Image Repository
ECR; all images must be in a central private repository (external repositories demonstrated inconsistent results due to the large LLM images).

Ports
It is up to the user whether to manage access via AWS security groups or firewalld on each node. Regardless, the following ports must be accessible:

Kernel
If utilizing the AWS EKS AL2023 base OS, most of these settings are already included by default:

Containerd
Each node should have the respective config.toml. Please ensure that GPU nodes have the config modified for NVIDIA.
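As one way to apply the NVIDIA modification on GPU nodes, the NVIDIA Container Toolkit can generate the runtime entries in config.toml. This is a sketch, assuming the toolkit is already installed on the node:

```
# On each GPU node: add the NVIDIA runtime to the containerd config and restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Confirm the nvidia runtime appears in the effective containerd configuration
sudo containerd config dump | grep -i nvidia
```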
Install Package
LILT provides a complete install package available for download from s3. The package needs to be loaded on a local machine that can utilize helm and kubectl for installation. The following steps cover manual installation procedures, but users can create automated scripts based on their environment.
Set Version
Set the install package version:
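For example, the version can be exported as a shell variable for use in the later steps (the variable name and value below are illustrative placeholders):

```
# Set the LILT install package version used throughout the remaining steps
export VERSION=<lilt-release-version>
```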
Download Package

Download the entire install package and create an install directory:
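A hedged sketch of the download step, assuming the release artifacts live under a versioned prefix in the lilt-enterprise-releases bucket (the exact prefix layout and local directory are assumptions):

```
# Create a local install directory and sync the release package from s3
mkdir -p ~/lilt-install && cd ~/lilt-install
aws s3 sync "s3://lilt-enterprise-releases/<release-prefix>/${VERSION}/" .
```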
Load images to ECR

The images included in the installer package are tagged with the LILT default image repository and need to be retagged to match the new user environment. The following example depicts tagging and pushing images from a local machine to ECR. Set the user environment variables:
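A hedged sketch of the retag-and-push flow, assuming the target ECR repositories already exist (the account ID, region, and image names are placeholders):

```
# Environment for the target ECR registry (placeholders)
export AWS_ACCOUNT_ID=<aws-account-id>
export AWS_REGION=<aws-region>
export ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Authenticate docker to ECR
aws ecr get-login-password --region "${AWS_REGION}" \
  | docker login --username AWS --password-stdin "${ECR_REGISTRY}"

# Example: retag a single image from the LILT default repository and push it to ECR
docker tag <lilt-default-repo>/<image>:<tag> "${ECR_REGISTRY}/<image>:<tag>"
docker push "${ECR_REGISTRY}/<image>:<tag>"
```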
Create EBS CSI Controller Storage Class

Local PVC node storage does not allow for pod/node horizontal scaling. The best option is to utilize the EBS CSI Controller, one of the prerequisites mentioned above. EFS is also a viable option, but this is at the discretion of the customer environment.
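A minimal sketch of an EBS-backed storage class, assuming the EBS CSI driver from the prerequisites is installed (the class name and parameters below are assumptions and can be adjusted per environment):

```
# Create a gp3-backed storage class served by the EBS CSI driver
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```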
Update Image Repository

By default, the helm charts and on-prem-values.yaml utilize a local docker registry and need to be updated with the correct ECR values. They also reference the localpv storage class; update all references to use the EBS-CSI storage class set in the previous steps.
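One hedged way to make these substitutions is with sed against on-prem-values.yaml (the placeholder registry value and storage class name below are assumptions; verify the actual keys in the file before running):

```
# Point image references at the private ECR registry and swap the storage class
sed -i "s|<lilt-default-repo>|${ECR_REGISTRY}|g" on-prem-values.yaml
sed -i "s|localpv|ebs-csi-gp3|g" on-prem-values.yaml
```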
Node Labels
Pods are scheduled based on the helm chart nodeSelector attribute. All respective nodes must be labeled with the worker or gpu annotations.

Worker nodes:
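A hedged sketch of the labeling commands; the exact label key/value pairs must match the nodeSelector entries in the helm charts, so the values below are placeholders (GPU nodes are labeled analogously):

```
# Label the worker node(s) so application pods schedule onto them
kubectl label node <worker-node-name> <worker-label-key>=<worker-label-value>

# Label the GPU node(s) for GPU workloads
kubectl label node <gpu-node-name> <gpu-label-key>=<gpu-label-value>

# Verify the labels
kubectl get nodes --show-labels
```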
Install LILT
Once all cluster and install package prerequisites are met, you are ready to install the LILT app. Since the EKS cluster utilizes an internal CNI and the EBS CSI Controller storage class will handle PVC scheduling, the flannel and localpv charts can be commented out of the main install script.

Open the main install script and comment out the following sections:
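The exact sections depend on the install script shipped in the package. As a hedged illustration only, assuming a shell script where each chart install sits on its own line (the script name is a placeholder), the flannel and localpv lines can be commented out with sed instead of an editor:

```
# Comment out any lines referencing the flannel and localpv charts in the main install script
sed -i -e 's/^\(.*flannel.*\)$/# \1/' -e 's/^\(.*localpv.*\)$/# \1/' <main-install-script>.sh
```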
Then run the install from the install_dir:
Debugging
Depending on the network speed for downloading images from ECR, some of the pods can take more time than others to start. Here are some known debugging techniques:
- Error: UPGRADE FAILED: timed out waiting for the condition: please continue, as this can happen due to the time taken by the pods to start up; the app deployment still happens as expected.
- If pods are stuck in ContainerCreating for more than 15 minutes, it is safe to restart them using kubectl delete pod -n lilt <podname>.
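Standard kubectl inspection commands help narrow down why a pod is slow to start (the lilt namespace is the one used by the install):

```
# Overall pod status in the application namespace
kubectl get pods -n lilt

# Events and image-pull progress for a specific pod
kubectl describe pod -n lilt <podname>

# Application logs once the container has started
kubectl logs -n lilt <podname>
```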
If the apps are not healthy even after all images have been loaded to the node, it is safe to revert to the previous version and redo the install using the following commands:
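The exact revert commands are environment-specific. As a hedged sketch, assuming the application was installed as a Helm release in the lilt namespace (the release name is a placeholder):

```
# Inspect the release history and roll back to the previous revision
helm history <release-name> -n lilt
helm rollback <release-name> <revision> -n lilt

# Watch pods come back up after the rollback
kubectl get pods -n lilt -w
```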