Overview

Dataflow is the analytics ETL (Extract, Transform, Load) pipeline that powers LILT’s reporting and usage metrics. It continuously moves data from the primary MySQL database into ClickHouse, an analytics-optimized column store, where it is aggregated into materialized views for dashboards and operational reporting. Dataflow runs as a collection of scheduled jobs orchestrated by Argo Workflows inside your Kubernetes cluster. Each job handles a specific data domain (file translations, documents, users, connector jobs, etc.) and runs on its own schedule. Two special operational modes exist beyond the regular scheduled ETL:
  • Initial Load: Automatically runs all enabled ETL jobs in sequence when the system is first installed or upgraded. This ensures ClickHouse is populated with historical data.
  • Backfill: A manually-triggered operation that rebuilds ClickHouse materialized views. Used after schema changes or when historical data needs to be reprocessed.

Architecture

Data Flow

MySQL (primary database)
        ↓
Dataflow ETL Jobs (Python, running as Argo Workflows)
        ↓
ClickHouse (analytics database, lilt_analytics schema)
        ↓
Materialized Views (pre-aggregated metrics for dashboards)

Key Components

| Component | Description |
| --- | --- |
| Argo Workflows | Orchestrates all ETL jobs as CronWorkflows (scheduled) or Workflows (manual/one-shot) |
| CronWorkflows | Kubernetes resources that run ETL jobs on defined schedules (e.g., every 15 minutes, every 2 hours) |
| WorkflowTemplates | Reusable job definitions shared across all ETL jobs |
| Cloud SQL Proxy (GCP only) | Sidecar container providing secure MySQL connectivity |
| ClickHouse | Analytics database using the lilt_analytics schema with ReplacingMergeTree tables |

How ETL Jobs Work

Each ETL job follows the same pattern:
  1. Connect to MySQL and ClickHouse
  2. Read the high-water mark: the timestamp of the most recent record already loaded into ClickHouse
  3. Fetch a batch of records from MySQL that are newer than the high-water mark
  4. Transform the records into the ClickHouse schema format
  5. Insert the batch into ClickHouse
  6. Repeat until no more new records are found
  7. Sleep between batches to avoid overloading the source database
ClickHouse tables use the ReplacingMergeTree engine, which provides upsert semantics — if a record with the same primary key is inserted again (e.g., during a backfill), the latest version automatically replaces the old one during background merges.
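In Python, the seven steps look roughly like the sketch below. This is illustrative only: the client libraries (pymysql, clickhouse_connect), the table and column names, and the batch size are assumptions, not the actual Dataflow implementation.
import time

import clickhouse_connect  # assumed client library
import pymysql  # assumed client library

BATCH_SIZE = 5000
PAUSE_SECONDS = 2  # step 7: throttle between batches

mysql = pymysql.connect(host="mysql", user="etl", password="secret", database="lilt")
ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Step 2: read the high-water mark from ClickHouse (hypothetical table/column);
# fall back to the epoch on a first run against an empty table.
hwm = ch.query("SELECT max(updated_at) FROM file_translations").result_rows[0][0]
hwm = hwm or "1970-01-01 00:00:00"

while True:
    # Step 3: fetch the next batch of MySQL rows newer than the high-water mark.
    with mysql.cursor() as cur:
        cur.execute(
            "SELECT id, status, updated_at FROM FileTranslation"
            " WHERE updated_at > %s ORDER BY updated_at LIMIT %s",
            (hwm, BATCH_SIZE),
        )
        rows = cur.fetchall()
    if not rows:
        break  # step 6: no new records left

    # Steps 4-5: transform (a pass-through here) and insert into ClickHouse.
    ch.insert("file_translations", list(rows), column_names=["id", "status", "updated_at"])
    hwm = rows[-1][2]  # advance the high-water mark
    time.sleep(PAUSE_SECONDS)
Because the target table uses ReplacingMergeTree, re-running this loop over an interval that was already loaded is safe: re-inserted rows simply replace their earlier versions during background merges.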

ETL Job Reference

The following jobs are available. Each can be independently enabled or disabled in your configuration:
| Job Name | Default Schedule | Description |
| --- | --- | --- |
| ingest-file-translation | Every 15 minutes | File translation records and metadata |
| ingest-documents | Every hour | Document-level translation data |
| ingest-users | Every 2 hours | User account information |
| ingest-connectorjobs | Every 2 hours | Connector integration job records |
| ingest-comments | Every 2 hours | Translation comments and annotations |
| segment-quality | Every 2 hours | Segment-level quality metrics |
| ingest-otd | Twice daily (3am, 2pm) | On-time delivery data |
| ingest-revision-reports | Daily (5am) | Revision report data |
| ingest-financialdata | Daily (5am) | Financial and billing data (GCP only) |
| linguist-activity-cdc | Every 2 hours | Linguist activity change data capture (GCP only) |
| ingest-wpa-minio | Daily (2am) | Word-pair analysis data from object storage |
| ingest-wpa-bigquery | Every 2 hours | Word-pair analysis data from BigQuery (GCP only) |
| generate-memory-snapshot | Daily (8pm) | Translation memory snapshots |
| refresh-reviewed-words | Every 2 hours | Verified/reviewed word counts |
| backfill | Manual | Rebuild materialized views |
| debug | Manual | Debug pod (runs indefinitely for troubleshooting) |
Jobs marked (GCP only) depend on Google Cloud services and are not applicable to self-managed deployments. They should remain disabled in your configuration.

Initial Load

What It Does

The Initial Load is a one-time workflow that runs automatically after a Helm install or upgrade (when enabled). It ensures that ClickHouse is fully populated with data by executing all enabled ETL jobs in series, followed by the backfill job.

Execution Order

  1. All enabled ETL jobs run one at a time, in sequence (excluding backfill and debug)
  2. Each job is allowed to fail without blocking subsequent jobs
  3. The backfill job runs last, rebuilding materialized views with the freshly ingested data

When It Runs

The Initial Load is triggered as a Helm post-install/post-upgrade hook. It runs automatically whenever the dataflow chart is installed or upgraded, provided it is enabled in your configuration.
The Initial Load workflow may take significant time depending on the volume of historical data. Monitor progress via kubectl (see Troubleshooting).

Configuration

dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true  # Set to true to run on install/upgrade

Manually Triggering an Initial Load

If you need to re-run the Initial Load outside of a Helm install/upgrade cycle, create a new Workflow from the existing WorkflowTemplate:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-initial-load-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-initial-load
EOF
Replace <release-name> with your Helm release name (typically lilt) and <namespace> with your deployment namespace.

Backfill Operations

What It Does

The backfill job rebuilds ClickHouse materialized views — pre-aggregated summary tables that power dashboards and reports. This is necessary when:
  • New columns have been added to base tables (via ClickHouse migrations)
  • Historical data needs to be reprocessed to populate new fields
  • Materialized views have become inconsistent or corrupted

How It Works

  1. Drops existing materialized views
  2. Recreates them with the current schema definition
  3. Backfills historical data in time-based batches:
    • Batch 1: Last 12 months (most recent data first)
    • Batch 2: 12–24 months ago
    • Batch 3: Older than 24 months
This batched approach ensures that the most valuable recent data is available first and prevents timeouts during large rebuilds.
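The batch ordering can be sketched as follows. backfill_window is a hypothetical stand-in for the real rebuild step, and the view names are two of the views listed in the next section; none of this is the actual implementation of bin.rebuild_materialized_views.
from datetime import datetime, timedelta, timezone

def backfill_window(view, start, end):
    # Hypothetical stand-in: the real job would re-aggregate the given
    # time window into the view's target table.
    print(f"rebuilding {view}: {start or 'start of history'} -> {end}")

now = datetime.now(timezone.utc)
year = timedelta(days=365)

# Most recent window first, oldest last, so dashboards recover quickly
# and no single rebuild grows large enough to time out.
windows = [(now - year, now), (now - 2 * year, now - year), (None, now - 2 * year)]

for view in ("mv_mt_translation_throughput", "mv_user_daily_verified_words"):
    for start, end in windows:
        backfill_window(view, start, end)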

Key Materialized Views

| View | Purpose |
| --- | --- |
| mv_mt_translation_throughput | Aggregated instant/MT translation metrics by user, date, language pair, and domain |
| mv_user_daily_verified_words | Daily verified translation word counts per user |
| mv_verified_translation_throughput | Project-level verified translation throughput |
| mv_online_offline_total | Online vs offline translation totals |
| mv_jobs_costs_summary | Job cost summaries by domain |
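After a backfill completes, a quick spot-check confirms a view repopulated. A minimal sketch using clickhouse_connect (an assumed client library; the date column name is also an assumption):
import clickhouse_connect  # assumed client library

ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Row count and date range of one rebuilt view; "date" is a hypothetical column.
result = ch.query("SELECT count(), min(date), max(date) FROM mv_user_daily_verified_words")
print(result.result_rows[0])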

Configuration

The backfill job is defined as a manually-triggered CronWorkflow:
dataflow:
  onpremValues:
    jobs:
      scripts:
        backfill:
          enabled: true   # Create the CronWorkflow resource
          # trigger is always "manual" — no automatic schedule
When enabled: true, a suspended CronWorkflow is created. It will not run on a schedule — it must be triggered manually.
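One way to confirm the resource landed in the suspended state is the Kubernetes Python client. This sketch assumes a namespace and release both named lilt and a CronWorkflow named lilt-backfill, all of which may differ in your deployment:
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# CronWorkflow is an Argo custom resource; the name here is an assumption.
cw = api.get_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="lilt", plural="cronworkflows", name="lilt-backfill",
)
print("suspended:", cw["spec"].get("suspend"))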

Running a Backfill

Trigger the backfill by creating a new Workflow from the CronWorkflow’s WorkflowTemplate:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-backfill-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "python -m bin.rebuild_materialized_views"
EOF

Monitoring Progress

# List running workflows
kubectl get workflows -n <namespace> | grep dataflow

# Watch workflow status
kubectl get workflow -n <namespace> <workflow-name> -w

# Stream pod logs for a running workflow
kubectl logs -n <namespace> -l workflows.argoproj.io/workflow=<workflow-name> -f

Configuration Reference

Enabling/Disabling Jobs

Each ETL job can be independently toggled. For self-managed deployments, the recommended configuration is:
dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true

      scripts:
        # Enable for self-managed (uses object storage)
        ingest-wpa-minio:
          enabled: true

        # Disable — requires Google BigQuery (not available in self-managed)
        ingest-wpa-bigquery:
          enabled: false
        ingest-financialdata:
          enabled: false
        linguist-activity-cdc:
          enabled: false

        # Enable for backfill operations
        backfill:
          enabled: true

        # Optional debug pod (enable only when troubleshooting)
        debug:
          enabled: false
Jobs that depend on Google BigQuery (ingest-wpa-bigquery, ingest-financialdata, linguist-activity-cdc) must remain disabled in self-managed deployments as the BigQuery service is not available.

ClickHouse Migrations

Dataflow includes a migration job that runs automatically during Helm sync operations. This job applies any pending ClickHouse schema changes (new tables, columns, or materialized views) before ETL jobs run. The migration job:
  • Runs as a Kubernetes Job with an ArgoCD sync hook
  • Applies migrations sequentially and idempotently
  • Is automatically cleaned up after completion (default TTL: 10 minutes)
After a migration adds new columns to existing tables, a backfill may be required to populate those columns for historical records. Check the release notes for your version to determine if a backfill is recommended after upgrade.
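The sequential-and-idempotent pattern can be sketched as follows; the schema_migrations ledger table and the migrations/ directory layout are assumptions for illustration, not the actual job's internals:
import pathlib

import clickhouse_connect  # assumed client library

ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Hypothetical ledger of applied migrations so that re-runs are no-ops.
ch.command(
    "CREATE TABLE IF NOT EXISTS schema_migrations"
    " (name String, applied_at DateTime DEFAULT now())"
    " ENGINE = MergeTree ORDER BY name"
)
applied = {row[0] for row in ch.query("SELECT name FROM schema_migrations").result_rows}

# Apply pending .sql files in lexical order, i.e. sequentially.
for path in sorted(pathlib.Path("migrations").glob("*.sql")):
    if path.name in applied:
        continue  # already applied: safe to re-run the job
    ch.command(path.read_text())
    ch.insert("schema_migrations", [[path.name]], column_names=["name"])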

Troubleshooting

Viewing Job Status

List all dataflow CronWorkflows:
kubectl get cronworkflows -n <namespace> | grep dataflow
List recent workflow runs:
kubectl get workflows -n <namespace> --sort-by=.metadata.creationTimestamp

Common Issues

Jobs are not running

  1. Check that Argo Workflows is installed and healthy:
    kubectl get pods -n <namespace> | grep argo
    
  2. Verify CronWorkflows exist and are not suspended (scheduled jobs should show suspend: false):
    kubectl get cronworkflows -n <namespace> -o yaml | grep -A2 "name:\|suspend:"
    
  3. Check that the dataflow image can be pulled:
    kubectl get events -n <namespace> --field-selector reason=Failed | grep dataflow
    

ClickHouse migration job hangs or fails

  1. Check the migration job logs:
    kubectl logs -n <namespace> -l job-name=<release-name>-clickhouse-migration-job-<revision>
    
  2. Verify the lilt_analytics database exists in ClickHouse:
    kubectl exec -it -n <namespace> <clickhouse-pod> -- \
      clickhouse-client --query "SHOW DATABASES"
    

Initial Load did not run after upgrade

The Initial Load only runs as a Helm hook. If it did not trigger:
  1. Verify initialLoad.enabled: true in your values
  2. Check Helm hook execution:
    kubectl get workflows -n <namespace> | grep initial-load
    
  3. Manually trigger if needed (see Manually Triggering an Initial Load)

Enabling the Debug Pod

For deeper troubleshooting, enable the debug pod to get an interactive shell with the dataflow environment:
dataflow:
  onpremValues:
    jobs:
      scripts:
        debug:
          enabled: true
After deploying, trigger the debug workflow:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-debug-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "sleep infinity"
EOF
Then exec into the running pod:
kubectl exec -it -n <namespace> <debug-pod-name> -- /bin/bash
This gives you access to the full dataflow environment for running individual ETL commands, testing database connectivity, or inspecting the runtime configuration.

Error Retry Behavior

ETL jobs include built-in retry logic for transient errors:
| Error Type | Behavior |
| --- | --- |
| HTTP 500/502/503/504 | Retries up to 3 times with 5-second delay |
| Connection reset/refused | Retries up to 3 times with 5-second delay |
| Timeout errors | Retries up to 3 times with 5-second delay |
| ClickHouse memory limit exceeded | Fails immediately (increase resource limits) |
| Other database errors | Fails immediately (check logs for details) |
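That behavior amounts to a small wrapper around each operation. Here is a sketch matching the table above (3 attempts, 5-second delay); the string-marker matching is illustrative only, not Dataflow's actual error classification:
import time

TRANSIENT_MARKERS = ("500", "502", "503", "504",
                     "Connection reset", "Connection refused", "timed out")

def with_retries(fn, attempts=3, delay=5):
    """Retry transient errors; let everything else fail immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            transient = any(marker in str(exc) for marker in TRANSIENT_MARKERS)
            if not transient or attempt == attempts:
                raise  # e.g. memory-limit or other database errors fail fast
            time.sleep(delay)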