This article details the infrastructure changes required when installing this version of LILT.

2025

Patch Updates

The configuration of GPU requirements has changed. Starting with this release, 24 GB of VRAM is required to run the translate pod successfully. For customers running on T4 GPUs (16 GB of VRAM each), this means the GPU node must have at least two (2) T4s attached to it. Put another way, it is not sufficient to have two (2) nodes, each with one (1) T4 GPU attached.

2024 Q4

Deprecation Notice

The analytics-api application, part of the old analytics implementation, has been officially deprecated and removed.

Node Labeling

To support better utilization of clusters, we have adjusted our recommended node labeling. See Node Labels for the expected labeling of nodes.

Troubleshooting

Updated the Troubleshooting guide to include the newly added CLI command to reset the AI models.

Default Values

Updated resource defaults for services to optimize performance. These changes are documented within Resource Metrics.

Flannel CNI

Flannel is now packaged as a Helm chart and installed by the overall install-lilt.sh script; it is no longer a separate deployment. If upgrading LILT from a previous version where Flannel is already installed, comment out the flannel section of the install script:
# install-lilt.sh

# Install flannel, on-prem customers only
kubectl label --overwrite ns kube-flannel pod-security.kubernetes.io/enforce=privileged
sh install_scripts/install-flannel.sh
# wait until pod ready
kubectl wait --namespace kube-flannel --for=condition=ready pod -l app=flannel --timeout=180s
If installing flannel for the first time via the helm chart, ensure that the podCidr is consistent with K8S cluster settings:
# flannel/on-prem-values.yaml

flannel:
  podCidr: "192.168.100.0/19"
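Before committing to a podCidr, it can help to sanity-check that expected pod addresses actually fall within it. A minimal bash sketch (the helper functions below are illustrative, not part of LILT tooling):

```shell
#!/usr/bin/env bash
# Illustrative sanity check: confirm that a pod IP falls inside the
# podCidr configured for flannel.
ip_to_int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

in_cidr() {  # usage: in_cidr <ip> <cidr>
  local base="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$base") & mask )) ]
}

# Note: 192.168.100.0/19 actually spans 192.168.96.0 - 192.168.127.255
in_cidr 192.168.100.5 192.168.100.0/19 && echo "in range"
```

On kubeadm-style clusters, the cluster's configured CIDR can usually be found with `kubectl cluster-info dump | grep -m1 cluster-cidr`.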

Redis

Memory limits have been implemented to prevent pod restarts and crashes by ensuring that consumed memory does not exceed the pod's resource limits. The maxmemory setting must be slightly below the pod memory limit. Memory can be increased if required:
# redis/on-prem-values.yaml

global:
  redis:
    # maxmem needs to be just below resource limit
    maxmemory: "5.8gb"
    maxmemoryPolicy: "allkeys-lru"  # evict least recently used keys
master:
  resources:
    limits:
      memory: 6G
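The "slightly below" relationship can be computed rather than hand-tuned. A small sketch, assuming ~95% of the pod limit is an acceptable margin (the 95% figure is our assumption, not a LILT default):

```shell
# Derive a redis maxmemory value from the pod memory limit (in GB),
# leaving ~5% headroom for redis overhead not counted by maxmemory.
limit_gb=6
maxmemory=$(awk -v l="$limit_gb" 'BEGIN { printf "%.1fgb", l * 0.95 }')
echo "$maxmemory"   # 5.7gb for a 6G pod limit
```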
Persistent storage is now disabled by default. This increases performance and reduces storage requirements.
# redis/on-prem-values.yaml

master:
  persistence:
    enabled: false
If persistence is required (for example, for cache security logging or auditing), re-enable it and ensure that the storage size is sufficient for estimated usage:
# redis/on-prem-values.yaml

master:
  persistence:
    enabled: true
    size: 20Gi   # up to 100Gi for heavy usage

Istio

Higher open-file (nofile) limits are required to prevent ztunnel pod restarts. A reboot is required for both changes to take effect:
# avoid ztunnel container restarts due to load
# append to end of file
cat <<EOF >> /etc/security/limits.conf
* soft nofile 131072
* hard nofile 131072
EOF

cat <<EOF >> /etc/systemd/system.conf
DefaultLimitNOFILE=131072
EOF

Firewall Ports

Additional ports are required for Istio, the API, ClickHouse, and Flannel. Please ensure that the following are enabled:
firewall-cmd --permanent --add-port={22,80,443,2379,2380,5000,6443,10250,10251,10252,10255}/tcp
# api
firewall-cmd --permanent --add-port={5005,8011,8080}/tcp
# istio
firewall-cmd --permanent --add-port={15000,15001,15006,15008,15009,15010,15012,15014,15017,15020,15021,15090,15443,20001}/tcp
# flannel
firewall-cmd --permanent --add-port=8472/udp
# clickhouse
firewall-cmd --permanent --add-port=8123/tcp
# --permanent changes only take effect after a reload
firewall-cmd --reload

Containerd

Additional workloads now run on the GPU node in parallel with the worker node. If you are NOT using a centralized repository for all images, ensure that the following images are loaded via containerd on the GPU node:
pilot*
proxyv2*
install-cni*
ztunnel*
kiali*
flannel*     # (only if new install)
k8s-device-plugin*
metrics-server*
neural*   # (all neural from docker_images master/node)
llm*
batch*
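One way to preload these on the GPU node is to import saved image tarballs with containerd's ctr tool. A dry-run sketch that prints the import commands for review (the docker_images directory and .tar naming are assumptions based on the master/node image distribution mentioned above):

```shell
# Print (rather than run) the containerd import command for each image
# tarball. Drop the `echo` to actually import; ctr must be run where
# containerd's socket is available (i.e., on the GPU node, as root).
for tar in docker_images/*.tar; do
  [ -e "$tar" ] || continue   # skip cleanly if no tarballs are present
  echo ctr -n k8s.io images import "$tar"
done
```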

2024 Q2

Hardware Requirements

Due to the inclusion of V2 and V3 models by default, the hardware requirements have changed. Resource Metrics reflects the additional deployments that need to be considered, and the following recommendations have been updated:
  • Master node disk requirement has increased from 200 GB to 500 GB to accommodate additional container images, configuration, and logging.
  • GPU Node instance type updated from g4dn.2xlarge (8 vCPUs, 32 GB RAM) to g4dn.8xlarge (32 vCPUs, 128 GB RAM), in order to be able to run the V2 and V3 services.
  • Due to the added models, the hard disk space requirements have been increased. See Installation Requirements for more information.

V2 and V3 Language Model Updates

As we introduce newer, more accurate language models into LILT, we continually update our hardware requirements. See Language Models for the latest V2 and V3 model information; additional detail on Resource Requirements can be found in the Knowledge Base.

Operating System Requirements

CentOS 7 → Rocky Linux 8

Some previous LILT installations were performed on CentOS 7, which reached End of Life (EOL) on June 30, 2024. New software features of the LILT platform are incompatible with CentOS 7[1], so all installations still on CentOS 7 should migrate before adopting this release. The recommended base operating system, and the one tested in our QA environment, is Rocky Linux 8, which provides a secure environment similar to CentOS 7 and has an EOL date of May 2029.

Istio Module Support

Modules
All installations have updated modules needed to support Istio. See the section regarding kernel modules, which now includes the following modules to install:
overlay
br_netfilter
nf_nat
xt_REDIRECT
xt_owner
iptable_nat
iptable_mangle
iptable_filter
These modules must be loaded on all existing nodes.
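Loading the modules immediately and persisting them across reboots can be done with the standard modules-load.d mechanism (run as root; the file name lilt-istio.conf below is our choice, not a LILT convention):

```shell
# Write the module list so systemd loads it at every boot (run as root),
# then modprobe each module to load it immediately without rebooting.
mkdir -p /etc/modules-load.d
printf '%s\n' overlay br_netfilter nf_nat xt_REDIRECT xt_owner \
  iptable_nat iptable_mangle iptable_filter \
  > /etc/modules-load.d/lilt-istio.conf
while read -r m; do
  modprobe "$m" || echo "could not load $m"
done < /etc/modules-load.d/lilt-istio.conf
```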
Ports
All installations have updated firewall port changes needed to support Istio. See the section regarding Firewall Settings, which now includes opening TCP ports 15000, 15001, 15006, 15008, 15009, 15010, 15012, 15014, 15017, 15020, 15021, 15090, 15443, and 20001 for Istio.

Configuration Updates

Custom Domains

New configurations should be performed as described in Set custom Domain and Certificates and Set Connectors Domain.

Upgrading Process

The upgrading process, which involves Helm values files, has been updated for this release to make upgrading simpler in the future. See Q2 2024 Updates for more details.

MinIO Resize

In previous releases, the PersistentVolume for MinIO was set to 200GB. With the release of V2 and V3 models, this is no longer large enough. The default size has been updated from 200GB to 400GB; however, this will not automatically resize existing installations. If your backing MinIO PersistentVolume is resizable, please resize it to 400GB. If it cannot be resized, the recommended procedure is as follows:
  • Back up the MinIO data (if necessary)
  • Delete the MinIO PersistentVolumeClaim and PersistentVolume
  • Restart the new MinIO deployment
  • Restore MinIO data (if necessary)
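Where the StorageClass supports volume expansion, the delete-and-restore procedure above can be avoided by patching the PVC in place. A sketch that prints the command for review (the PVC name minio and namespace default are assumptions; check the actual names with `kubectl get pvc -A`):

```shell
# Print the in-place resize command; remove the `echo` to run it.
# This only works when the PVC's StorageClass has allowVolumeExpansion: true.
PATCH='{"spec":{"resources":{"requests":{"storage":"400Gi"}}}}'
echo kubectl -n default patch pvc minio --type merge -p "$PATCH"
```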

WPA Metrics

WPA metrics, as described in Generate Evaluation Metrics (WPA, BLEU), are now enabled by default.

SMTP Notifications

See the new SMTP Email Notifications page for details on configuring SMTP notifications.

Guide on how to handle GPU worker counts

As GPU processing has become increasingly critical in LILT’s models, we’ve added Configuring GPU Worker Counts in LILT to assist system administrators in configuring the LILT application for GPU use.

Vulnerability (CVE) Scan Results

LILT has conducted thorough scans of all services and components to confirm there are no components rated as High or Critical CVEs[2]. Self-Hosted customers can find further details in the CVE Scan PDF provided with the release.

Known Issues

MongoDB Upgrade

The latest MongoDB version has a known issue that may cause it to fall into a CrashLoop upon upgrading. If this occurs, the recommended fix is as follows:
  • Back up the MongoDB data (if necessary)
  • Delete the MongoDB PersistentVolumeClaim and PersistentVolume
  • Restart the new MongoDB deployment
  • Restore MongoDB data (if necessary)

Known CVE issues

Vulnerability Reference | Application | Mitigation / Notes
CVE-2024-31580 | neural | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-31583 | neural | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2023-6378 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2023-6481 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-22257 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2016-1000027 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-22243 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-22259 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-22262 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2023-32697 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2022-1471 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2022-25857 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
CVE-2024-21634 | core-api | To be fixed in next release; requires upgrade of a package used by internal dependencies.
GHSA-m425-mq94-257g | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2024-24790 | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2023-39325 | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2023-45283 | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2023-45287 | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2023-45288 | localpv-provisioner | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2023-1370 | elasticsearch | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2021-40690 | elasticsearch | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2022-1471 | elasticsearch | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.
CVE-2024-41110 | istiod | Vulnerability exists in the latest version of this application. Waiting on a newer release to fix.

Appendix

[1] The Analytics dashboard relies on Istio, which uses some underlying code libraries in the operating system kernel that are not present in CentOS 7.
[2] Fixed High and Critical CVEs refer to vulnerabilities for which fixes are available and do not break the integration of the application and its dependencies. It should be noted that vulnerabilities are continuously discovered, and new CVEs may have been identified but not yet addressed.