Overview

Dataflow is the analytics ETL (Extract, Transform, Load) pipeline that powers LILT’s reporting and usage metrics. It continuously moves data from the primary MySQL database into ClickHouse, an analytics-optimized column store, where it is aggregated into materialized views for dashboards and operational reporting. Dataflow runs as a collection of scheduled jobs orchestrated by Argo Workflows inside your Kubernetes cluster. Each job handles a specific data domain (file translations, documents, users, connector jobs, etc.) and runs on its own schedule. Two special operational modes exist beyond the regular scheduled ETL:
  • Initial Load: Automatically runs all enabled ETL jobs in sequence when the system is first installed or upgraded. This ensures ClickHouse is populated with historical data.
  • Backfill: A manually-triggered operation that rebuilds ClickHouse materialized views. Used after schema changes or when historical data needs to be reprocessed.

Architecture

Data Flow

MySQL (primary database)
        ↓
Dataflow ETL Jobs (Python, running as Argo Workflows)
        ↓
ClickHouse (analytics database, lilt_analytics schema)
        ↓
Materialized Views (pre-aggregated metrics for dashboards)

Key Components

| Component | Description |
| --- | --- |
| Argo Workflows | Orchestrates all ETL jobs as CronWorkflows (scheduled) or Workflows (manual/one-shot) |
| CronWorkflows | Kubernetes resources that run ETL jobs on defined schedules (e.g., every 15 minutes, every 2 hours) |
| WorkflowTemplates | Reusable job definitions shared across all ETL jobs |
| Cloud SQL Proxy (GCP only) | Sidecar container providing secure MySQL connectivity |
| ClickHouse | Analytics database using the lilt_analytics schema with ReplacingMergeTree tables |

How ETL Jobs Work

Each ETL job follows the same pattern:
  1. Connect to MySQL and ClickHouse
  2. Read the high-water mark: the timestamp of the most recent record already loaded into ClickHouse
  3. Fetch a batch of records from MySQL that are newer than the high-water mark
  4. Transform the records into the ClickHouse schema format
  5. Insert the batch into ClickHouse
  6. Repeat until no more new records are found
  7. Sleep between batches to avoid overloading the source database
ClickHouse tables use the ReplacingMergeTree engine, which provides upsert semantics — if a record with the same primary key is inserted again (e.g., during a backfill), the latest version automatically replaces the old one during background merges.
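In Python, the seven steps look roughly like the sketch below. This is illustrative only: the client libraries (pymysql, clickhouse_connect), the table and column names, and the batch size are assumptions, not the actual Dataflow implementation.
import time

import clickhouse_connect  # assumed client library
import pymysql  # assumed client library

BATCH_SIZE = 5000
PAUSE_SECONDS = 2  # step 7: throttle between batches

mysql = pymysql.connect(host="mysql", user="etl", password="secret", database="lilt")
ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Step 2: read the high-water mark from ClickHouse (hypothetical table/column);
# fall back to the epoch on a first run against an empty table.
hwm = ch.query("SELECT max(updated_at) FROM file_translations").result_rows[0][0]
hwm = hwm or "1970-01-01 00:00:00"

while True:
    # Step 3: fetch the next batch of MySQL rows newer than the high-water mark.
    with mysql.cursor() as cur:
        cur.execute(
            "SELECT id, status, updated_at FROM FileTranslation"
            " WHERE updated_at > %s ORDER BY updated_at LIMIT %s",
            (hwm, BATCH_SIZE),
        )
        rows = cur.fetchall()
    if not rows:
        break  # step 6: no new records left

    # Steps 4-5: transform (a pass-through here) and insert into ClickHouse.
    ch.insert("file_translations", list(rows), column_names=["id", "status", "updated_at"])
    hwm = rows[-1][2]  # advance the high-water mark
    time.sleep(PAUSE_SECONDS)
Because the target table uses ReplacingMergeTree, re-running this loop over an interval that was already loaded is safe: re-inserted rows simply replace their earlier versions during background merges.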

ETL Job Reference

The following jobs are available. Each can be independently enabled or disabled in your configuration:
| Job Name | Default Schedule | Description |
| --- | --- | --- |
| ingest-file-translation | Every 15 minutes | File translation records and metadata |
| ingest-documents | Every hour | Document-level translation data |
| ingest-users | Every 2 hours | User account information |
| ingest-connectorjobs | Every 2 hours | Connector integration job records |
| ingest-comments | Every 2 hours | Translation comments and annotations |
| segment-quality | Every 2 hours | Segment-level quality metrics |
| ingest-otd | Twice daily (3am, 2pm) | On-time delivery data |
| ingest-revision-reports | Daily (5am) | Revision report data |
| ingest-financialdata | Daily (5am) | Financial and billing data (GCP only) |
| linguist-activity-cdc | Every 2 hours | Linguist activity change data capture (GCP only) |
| ingest-wpa-minio | Daily (2am) | Word-pair analysis data from object storage |
| ingest-wpa-bigquery | Every 2 hours | Word-pair analysis data from BigQuery (GCP only) |
| generate-memory-snapshot | Daily (8pm) | Translation memory snapshots |
| refresh-reviewed-words | Every 2 hours | Verified/reviewed word counts |
| backfill | Manual | Rebuild materialized views |
| debug | Manual | Debug pod (runs indefinitely for troubleshooting) |
Jobs marked (GCP only) depend on Google Cloud services and are not applicable to self-managed deployments. They should remain disabled in your configuration.

Initial Load

What It Does

The Initial Load is a one-time workflow that runs automatically after a Helm install or upgrade (when enabled). It ensures that ClickHouse is fully populated with data by executing all enabled ETL jobs in series, followed by the backfill job.

Execution Order

  1. All enabled ETL jobs run one at a time, in sequence (excluding backfill and debug)
  2. Each job is allowed to fail without blocking subsequent jobs
  3. The backfill job runs last, rebuilding materialized views with the freshly ingested data

When It Runs

The Initial Load is triggered as a Helm post-install/post-upgrade hook. It runs automatically whenever the dataflow chart is installed or upgraded, provided it is enabled in your configuration.
The Initial Load workflow may take significant time depending on the volume of historical data. Monitor progress via kubectl (see Troubleshooting).

Configuration

dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true  # Set to true to run on install/upgrade

Manually Triggering an Initial Load

If you need to re-run the Initial Load outside of a Helm install/upgrade cycle, create a new Workflow from the existing WorkflowTemplate:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-initial-load-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-initial-load
EOF
Replace <release-name> with your Helm release name (typically lilt) and <namespace> with your deployment namespace.

Backfill Operations

What It Does

The backfill job rebuilds ClickHouse materialized views — pre-aggregated summary tables that power dashboards and reports. This is necessary when:
  • New columns have been added to base tables (via ClickHouse migrations)
  • Historical data needs to be reprocessed to populate new fields
  • Materialized views have become inconsistent or corrupted

How It Works

  1. Drops existing materialized views
  2. Recreates them with the current schema definition
  3. Backfills historical data in time-based batches:
    • Batch 1: Last 12 months (most recent data first)
    • Batch 2: 12–24 months ago
    • Batch 3: Older than 24 months
This batched approach ensures that the most valuable recent data is available first and prevents timeouts during large rebuilds.
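The batch ordering can be sketched as follows. backfill_window is a hypothetical stand-in for the real rebuild step, and the view names are two of the views listed in the next section; none of this is the actual implementation of bin.rebuild_materialized_views.
from datetime import datetime, timedelta, timezone

def backfill_window(view, start, end):
    # Hypothetical stand-in: the real job would re-aggregate the given
    # time window into the view's target table.
    print(f"rebuilding {view}: {start or 'start of history'} -> {end}")

now = datetime.now(timezone.utc)
year = timedelta(days=365)

# Most recent window first, oldest last, so dashboards recover quickly
# and no single rebuild grows large enough to time out.
windows = [(now - year, now), (now - 2 * year, now - year), (None, now - 2 * year)]

for view in ("mv_mt_translation_throughput", "mv_user_daily_verified_words"):
    for start, end in windows:
        backfill_window(view, start, end)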

Key Materialized Views

| View | Purpose |
| --- | --- |
| mv_mt_translation_throughput | Aggregated instant/MT translation metrics by user, date, language pair, and domain |
| mv_user_daily_verified_words | Daily verified translation word counts per user |
| mv_verified_translation_throughput | Project-level verified translation throughput |
| mv_online_offline_total | Online vs offline translation totals |
| mv_jobs_costs_summary | Job cost summaries by domain |
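After a backfill completes, a quick spot-check confirms a view repopulated. A minimal sketch using clickhouse_connect (an assumed client library; the date column name is also an assumption):
import clickhouse_connect  # assumed client library

ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Row count and date range of one rebuilt view; "date" is a hypothetical column.
result = ch.query("SELECT count(), min(date), max(date) FROM mv_user_daily_verified_words")
print(result.result_rows[0])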

Configuration

The backfill job is defined as a manually-triggered CronWorkflow:
dataflow:
  onpremValues:
    jobs:
      scripts:
        backfill:
          enabled: true   # Create the CronWorkflow resource
          # trigger is always "manual" — no automatic schedule
When enabled: true, a suspended CronWorkflow is created. It will not run on a schedule — it must be triggered manually.
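One way to confirm the resource landed in the suspended state is the Kubernetes Python client. This sketch assumes a namespace and release both named lilt and a CronWorkflow named lilt-backfill, all of which may differ in your deployment:
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# CronWorkflow is an Argo custom resource; the name here is an assumption.
cw = api.get_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="lilt", plural="cronworkflows", name="lilt-backfill",
)
print("suspended:", cw["spec"].get("suspend"))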

Running a Backfill

Trigger the backfill by creating a new Workflow from the CronWorkflow’s WorkflowTemplate:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-backfill-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "python -m bin.rebuild_materialized_views"
EOF

Monitoring Progress

# List running workflows
kubectl get workflows -n <namespace> | grep dataflow

# Watch workflow status
kubectl get workflow -n <namespace> <workflow-name> -w

# Stream pod logs for a running workflow
kubectl logs -n <namespace> -l workflows.argoproj.io/workflow=<workflow-name> -f

Configuration Reference

Enabling/Disabling Jobs

Each ETL job can be independently toggled. For self-managed deployments, the recommended configuration is:
dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true

      scripts:
        # Enable for self-managed (uses object storage)
        ingest-wpa-minio:
          enabled: true

        # Disable — requires Google BigQuery (not available in self-managed)
        ingest-wpa-bigquery:
          enabled: false
        ingest-financialdata:
          enabled: false
        linguist-activity-cdc:
          enabled: false

        # Enable for backfill operations
        backfill:
          enabled: true

        # Optional debug pod (enable only when troubleshooting)
        debug:
          enabled: false
Jobs that depend on Google BigQuery (ingest-wpa-bigquery, ingest-financialdata, linguist-activity-cdc) must remain disabled in self-managed deployments as the BigQuery service is not available.

ClickHouse Migrations

Dataflow includes a migration job that runs automatically during Helm sync operations. This job applies any pending ClickHouse schema changes (new tables, columns, or materialized views) before ETL jobs run. The migration job:
  • Runs as a Kubernetes Job with an ArgoCD sync hook
  • Applies migrations sequentially and idempotently
  • Is automatically cleaned up after completion (default TTL: 10 minutes)
After a migration adds new columns to existing tables, a backfill may be required to populate those columns for historical records. Check the release notes for your version to determine if a backfill is recommended after upgrade.
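The sequential-and-idempotent pattern can be sketched as follows; the schema_migrations ledger table and the migrations/ directory layout are assumptions for illustration, not the actual job's internals:
import pathlib

import clickhouse_connect  # assumed client library

ch = clickhouse_connect.get_client(host="clickhouse", database="lilt_analytics")

# Hypothetical ledger of applied migrations so that re-runs are no-ops.
ch.command(
    "CREATE TABLE IF NOT EXISTS schema_migrations"
    " (name String, applied_at DateTime DEFAULT now())"
    " ENGINE = MergeTree ORDER BY name"
)
applied = {row[0] for row in ch.query("SELECT name FROM schema_migrations").result_rows}

# Apply pending .sql files in lexical order, i.e. sequentially.
for path in sorted(pathlib.Path("migrations").glob("*.sql")):
    if path.name in applied:
        continue  # already applied: safe to re-run the job
    ch.command(path.read_text())
    ch.insert("schema_migrations", [[path.name]], column_names=["name"])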

Troubleshooting

Viewing Job Status

List all dataflow CronWorkflows:
kubectl get cronworkflows -n <namespace> | grep dataflow
List recent workflow runs:
kubectl get workflows -n <namespace> --sort-by=.metadata.creationTimestamp

Common Issues

Jobs are not running

  1. Check that Argo Workflows is installed and healthy:
    kubectl get pods -n <namespace> | grep argo
    
  2. Verify CronWorkflows exist and are not suspended (scheduled jobs should show suspend: false):
    kubectl get cronworkflows -n <namespace> -o yaml | grep -A2 "name:\|suspend:"
    
  3. Check that the dataflow image can be pulled:
    kubectl get events -n <namespace> --field-selector reason=Failed | grep dataflow
    

ClickHouse migration job hangs or fails

  1. Check the migration job logs:
    kubectl logs -n <namespace> -l job-name=<release-name>-clickhouse-migration-job-<revision>
    
  2. Verify the lilt_analytics database exists in ClickHouse:
    kubectl exec -it -n <namespace> <clickhouse-pod> -- \
      clickhouse-client --query "SHOW DATABASES"
    

Initial Load did not run after upgrade

The Initial Load only runs as a Helm hook. If it did not trigger:
  1. Verify initialLoad.enabled: true in your values
  2. Check Helm hook execution:
    kubectl get workflows -n <namespace> | grep initial-load
    
  3. Manually trigger if needed (see Manually Triggering an Initial Load)

Enabling the Debug Pod

For deeper troubleshooting, enable the debug pod to get an interactive shell with the dataflow environment:
dataflow:
  onpremValues:
    jobs:
      scripts:
        debug:
          enabled: true
After deploying, trigger the debug workflow:
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-debug-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "sleep infinity"
EOF
Then exec into the running pod:
kubectl exec -it -n <namespace> <debug-pod-name> -- /bin/bash
This gives you access to the full dataflow environment for running individual ETL commands, testing database connectivity, or inspecting the runtime configuration.

Error Retry Behavior

ETL jobs include built-in retry logic for transient errors:
| Error Type | Behavior |
| --- | --- |
| HTTP 500/502/503/504 | Retries up to 3 times with 5-second delay |
| Connection reset/refused | Retries up to 3 times with 5-second delay |
| Timeout errors | Retries up to 3 times with 5-second delay |
| ClickHouse memory limit exceeded | Fails immediately (increase resource limits) |
| Other database errors | Fails immediately (check logs for details) |
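That behavior amounts to a small wrapper around each operation. Here is a sketch matching the table above (3 attempts, 5-second delay); the string-marker matching is illustrative only, not Dataflow's actual error classification:
import time

TRANSIENT_MARKERS = ("500", "502", "503", "504",
                     "Connection reset", "Connection refused", "timed out")

def with_retries(fn, attempts=3, delay=5):
    """Retry transient errors; let everything else fail immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            transient = any(marker in str(exc) for marker in TRANSIENT_MARKERS)
            if not transient or attempt == attempts:
                raise  # e.g. memory-limit or other database errors fail fast
            time.sleep(delay)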