Overview
Dataflow is the analytics ETL (Extract, Transform, Load) pipeline that powers LILT’s reporting and usage metrics. It continuously moves data from the primary MySQL database into ClickHouse, an analytics-optimized column store, where it is aggregated into materialized views for dashboards and operational reporting.
Dataflow runs as a collection of scheduled jobs orchestrated by Argo Workflows inside your Kubernetes cluster. Each job handles a specific data domain (file translations, documents, users, connector jobs, etc.) and runs on its own schedule.
Two special operational modes exist beyond the regular scheduled ETL:
- Initial Load: Automatically runs all enabled ETL jobs in sequence when the system is first installed or upgraded. This ensures ClickHouse is populated with historical data.
- Backfill: A manually-triggered operation that rebuilds ClickHouse materialized views. Used after schema changes or when historical data needs to be reprocessed.
Architecture
Data Flow
Data flows from the primary MySQL database through the scheduled ETL jobs into ClickHouse base tables, where materialized views aggregate it for dashboards and reports.
Key Components
| Component | Description |
|---|---|
| Argo Workflows | Orchestrates all ETL jobs as CronWorkflows (scheduled) or Workflows (manual/one-shot) |
| CronWorkflows | Kubernetes resources that run ETL jobs on defined schedules (e.g., every 15 minutes, every 2 hours) |
| WorkflowTemplates | Reusable job definitions shared across all ETL jobs |
| Cloud SQL Proxy (GCP only) | Sidecar container providing secure MySQL connectivity |
| ClickHouse | Analytics database using lilt_analytics schema with ReplacingMergeTree tables |
How ETL Jobs Work
Each ETL job follows the same pattern (a conceptual sketch follows this list):
- Connect to MySQL and ClickHouse
- Read the high-water mark — the timestamp of the last record processed in ClickHouse
- Fetch a batch of records from MySQL that are newer than the high-water mark
- Transform the records into the ClickHouse schema format
- Insert the batch into ClickHouse
- Repeat until no more new records are found
- Sleep between batches to avoid overloading the source database
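A minimal conceptual sketch of this high-water-mark loop using the command-line clients; the table name lilt_analytics.file_translation, the updated_at column, and the batch size are assumptions for illustration, since the real jobs run inside the dataflow image rather than as shell commands:

```sh
# 1. Read the high-water mark: the newest timestamp already in ClickHouse
#    (table and column names are assumptions)
HWM=$(clickhouse-client --query \
  "SELECT max(updated_at) FROM lilt_analytics.file_translation")

# 2. Fetch one batch of MySQL rows newer than the high-water mark
mysql --batch \
  -e "SELECT * FROM file_translation WHERE updated_at > '${HWM}' ORDER BY updated_at LIMIT 10000" \
  > batch.tsv

# 3. Insert the batch into ClickHouse; a real job repeats steps 1-3 until no
#    new rows remain, sleeping between batches to avoid overloading MySQL
clickhouse-client --query \
  "INSERT INTO lilt_analytics.file_translation FORMAT TabSeparatedWithNames" < batch.tsv
```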
ETL Job Reference
The following jobs are available. Each can be independently enabled or disabled in your configuration:
| Job Name | Default Schedule | Description |
|---|---|---|
| ingest-file-translation | Every 15 minutes | File translation records and metadata |
| ingest-documents | Every hour | Document-level translation data |
| ingest-users | Every 2 hours | User account information |
| ingest-connectorjobs | Every 2 hours | Connector integration job records |
| ingest-comments | Every 2 hours | Translation comments and annotations |
| segment-quality | Every 2 hours | Segment-level quality metrics |
| ingest-otd | Twice daily (3am, 2pm) | On-time delivery data |
| ingest-revision-reports | Daily (5am) | Revision report data |
| ingest-financialdata | Daily (5am) | Financial and billing data (GCP only) |
| linguist-activity-cdc | Every 2 hours | Linguist activity change data capture (GCP only) |
| ingest-wpa-minio | Daily (2am) | Word-pair analysis data from object storage |
| ingest-wpa-bigquery | Every 2 hours | Word-pair analysis data from BigQuery (GCP only) |
| generate-memory-snapshot | Daily (8pm) | Translation memory snapshots |
| refresh-reviewed-words | Every 2 hours | Verified/reviewed word counts |
| backfill | Manual | Rebuild materialized views |
| debug | Manual | Debug pod (runs indefinitely for troubleshooting) |
Jobs marked (GCP only) depend on Google Cloud services and are not applicable to self-managed deployments. They should remain disabled in your configuration.
Initial Load
What It Does
The Initial Load is a one-time workflow that runs automatically after a Helm install or upgrade (when enabled). It ensures that ClickHouse is fully populated with data by executing all enabled ETL jobs in series, followed by the backfill job.
Execution Order
- All enabled ETL jobs run one at a time, in sequence (excluding backfill and debug)
- Each job is allowed to fail without blocking subsequent jobs
- The backfill job runs last, rebuilding materialized views with the freshly ingested data
When It Runs
The Initial Load is triggered as a Helm post-install/post-upgrade hook. It runs automatically whenever the dataflow chart is installed or upgraded, provided it is enabled in your configuration.
The Initial Load workflow may take significant time depending on the volume of historical data. Monitor progress via kubectl (see Troubleshooting).
Configuration
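A minimal values sketch, assuming the flag lives under an initialLoad block as referenced in Troubleshooting (the exact nesting in your chart's values file may differ):

```yaml
# Illustrative only: confirm the key path against your chart's values.yaml
initialLoad:
  enabled: true   # run all enabled ETL jobs, then backfill, after install/upgrade
```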
Manually Triggering an Initial Load
If you need to re-run the Initial Load outside of a Helm install/upgrade cycle, create a new Workflow from the existing WorkflowTemplate, as sketched below. Replace <release-name> with your Helm release name (typically lilt) and <namespace> with your deployment namespace.
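A hedged example using the Argo CLI; the WorkflowTemplate name <release-name>-initial-load is an assumption, so list the available templates first to confirm it:

```sh
# List the WorkflowTemplates installed by the chart
kubectl get workflowtemplates -n <namespace>

# Submit a one-off Workflow from the Initial Load template (name is assumed)
argo submit --from workflowtemplate/<release-name>-initial-load -n <namespace>
```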
Backfill Operations
What It Does
The backfill job rebuilds ClickHouse materialized views: pre-aggregated summary tables that power dashboards and reports. This is necessary when:
- New columns have been added to base tables (via ClickHouse migrations)
- Historical data needs to be reprocessed to populate new fields
- Materialized views have become inconsistent or corrupted
How It Works
- Drops existing materialized views
- Recreates them with the current schema definition
- Backfills historical data in time-based batches:
- Batch 1: Last 12 months (most recent data first)
- Batch 2: 12–24 months ago
- Batch 3: Older than 24 months
Key Materialized Views
| View | Purpose |
|---|---|
| mv_mt_translation_throughput | Aggregated instant/MT translation metrics by user, date, language pair, and domain |
| mv_user_daily_verified_words | Daily verified translation word counts per user |
| mv_verified_translation_throughput | Project-level verified translation throughput |
| mv_online_offline_total | Online vs offline translation totals |
| mv_jobs_costs_summary | Job cost summaries by domain |
Configuration
The backfill job is defined as a manually triggered CronWorkflow; a values sketch follows. With enabled: true, a suspended CronWorkflow is created. It will not run on a schedule and must be triggered manually.
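A minimal values sketch, assuming the backfill job is toggled with a per-job enabled flag like the other jobs (the exact key path is an assumption):

```yaml
# Illustrative only: confirm the key path against your chart's values.yaml
backfill:
  enabled: true   # creates a suspended CronWorkflow; it never runs on a schedule
```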
Running a Backfill
Trigger the backfill by creating a new Workflow from the CronWorkflow’s WorkflowTemplate, then watch its progress; a combined command sketch follows under Monitoring Progress.
Monitoring Progress
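A hedged sketch using the Argo CLI, assuming the CronWorkflow is named <release-name>-backfill (the real name may differ, so list the CronWorkflows first):

```sh
# Find the backfill CronWorkflow, then submit a one-off Workflow from it
kubectl get cronworkflows -n <namespace>
argo submit --from cronwf/<release-name>-backfill -n <namespace>

# Watch the resulting Workflow and follow its logs
kubectl get workflows -n <namespace> -w
argo logs @latest -n <namespace> --follow
```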
Configuration Reference
Enabling/Disabling Jobs
Each ETL job can be independently toggled. For self-managed deployments, the recommended configuration is sketched below.
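A hedged values sketch; the job names match the ETL Job Reference table above, but the top-level key and nesting are assumptions to confirm against your chart's values.yaml:

```yaml
# Illustrative only: key paths may differ in your chart version
etlJobs:
  ingest-file-translation:
    enabled: true
  ingest-documents:
    enabled: true
  ingest-users:
    enabled: true
  # GCP-only jobs must stay disabled in self-managed deployments
  ingest-wpa-bigquery:
    enabled: false
  ingest-financialdata:
    enabled: false
  linguist-activity-cdc:
    enabled: false
```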
Jobs that depend on Google BigQuery (ingest-wpa-bigquery, ingest-financialdata, linguist-activity-cdc) must remain disabled in self-managed deployments, as the BigQuery service is not available.
ClickHouse Migrations
Dataflow includes a migration job that runs automatically during Helm sync operations. This job applies any pending ClickHouse schema changes (new tables, columns, or materialized views) before ETL jobs run. The migration job:
- Runs as a Kubernetes Job with an ArgoCD sync hook
- Applies migrations sequentially and idempotently
- Is automatically cleaned up after completion (default TTL: 10 minutes)
After a migration adds new columns to existing tables, a backfill may be required to populate those columns for historical records. Check the release notes for your version to determine if a backfill is recommended after upgrade.
Troubleshooting
Viewing Job Status
List all dataflow CronWorkflows:
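For example (the namespace is a placeholder for your deployment namespace):

```sh
kubectl get cronworkflows -n <namespace>
```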
Common Issues
Jobs are not running
- Check that Argo Workflows is installed and healthy (see the command sketch after this list)
- Verify CronWorkflows exist and are not suspended (scheduled jobs should show suspend: false)
- Check that the dataflow image can be pulled
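A hedged set of commands for these checks; the argo namespace and the example job name are assumptions:

```sh
# Is the Argo Workflows controller running? (namespace "argo" is an assumption)
kubectl get pods -n argo

# Do the CronWorkflows exist, and is suspend set to false?
kubectl get cronworkflows -n <namespace>
kubectl get cronworkflow ingest-users -n <namespace> -o yaml | grep suspend

# Is the dataflow image pullable? Look for ImagePullBackOff on recent pods
kubectl get pods -n <namespace> --sort-by=.metadata.creationTimestamp | tail
```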
ClickHouse migration job hangs or fails
- Check the migration job logs (see the command sketch after this list)
- Verify the lilt_analytics database exists in ClickHouse
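A hedged sketch; the job name and ClickHouse pod name are placeholders to adjust for your release:

```sh
# Inspect the migration job and its logs
kubectl get jobs -n <namespace>
kubectl logs job/<migration-job-name> -n <namespace>

# Confirm the lilt_analytics database exists in ClickHouse
kubectl exec -n <namespace> <clickhouse-pod> -- clickhouse-client --query "SHOW DATABASES"
```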
Initial Load did not run after upgrade
The Initial Load only runs as a Helm hook. If it did not trigger:
- Verify initialLoad.enabled: true in your values
- Check Helm hook execution (see the command sketch after this list)
- Manually trigger if needed (see Manually Triggering an Initial Load)
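A hedged sketch for checking hook execution; release name and namespace are placeholders:

```sh
# Show the hooks Helm rendered for the release
helm get hooks <release-name> -n <namespace>

# Look for the Workflow the post-install/post-upgrade hook should have created
kubectl get workflows -n <namespace>
```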
Enabling the Debug Pod
For deeper troubleshooting, enable the debug pod to get an interactive shell with the dataflow environment, as sketched below.
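A minimal sketch, assuming the debug job is toggled with a per-job enabled flag like the others (the key path is an assumption), followed by attaching to the pod:

```yaml
# values.yaml (illustrative; confirm the exact key in your chart)
debug:
  enabled: true
```

```sh
# Find the debug pod and open a shell inside it
kubectl get pods -n <namespace> | grep debug
kubectl exec -it <debug-pod-name> -n <namespace> -- /bin/sh
```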
Error Retry Behavior
ETL jobs include built-in retry logic for transient errors:
| Error Type | Behavior |
|---|---|
| HTTP 500/502/503/504 | Retries up to 3 times with 5-second delay |
| Connection reset/refused | Retries up to 3 times with 5-second delay |
| Timeout errors | Retries up to 3 times with 5-second delay |
| ClickHouse memory limit exceeded | Fails immediately (increase resource limits) |
| Other database errors | Fails immediately (check logs for details) |

