> ## Documentation Index
> Fetch the complete documentation index at: https://support.lilt.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Dataflow Analytics Pipeline

> Understand and configure the Dataflow analytics service, including ETL jobs, backfill operations, and the initial data load process

## Overview

Dataflow is the analytics ETL (Extract, Transform, Load) pipeline that powers LILT's reporting and usage metrics. It continuously moves data from the primary MySQL database into ClickHouse, an analytics-optimized column store, where it is aggregated into materialized views for dashboards and operational reporting.

Dataflow runs as a collection of scheduled jobs orchestrated by Argo Workflows inside your Kubernetes cluster. Each job handles a specific data domain (file translations, documents, users, connector jobs, etc.) and runs on its own schedule.

Two special operational modes exist beyond the regular scheduled ETL:

* **Initial Load**: Automatically runs all enabled ETL jobs in sequence when the system is first installed or upgraded. This ensures ClickHouse is populated with historical data.
* **Backfill**: A manually-triggered operation that rebuilds ClickHouse materialized views. Used after schema changes or when historical data needs to be reprocessed.

## Architecture

### Data Flow

```
MySQL (primary database)
    │
    ▼
Dataflow ETL Jobs (Python, running as Argo Workflows)
    │
    ▼
ClickHouse (analytics database, lilt_analytics schema)
    │
    ▼
Materialized Views (pre-aggregated metrics for dashboards)
```

### Key Components

| Component                      | Description                                                                                         |
| ------------------------------ | --------------------------------------------------------------------------------------------------- |
| **Argo Workflows**             | Orchestrates all ETL jobs as CronWorkflows (scheduled) or Workflows (manual/one-shot)               |
| **CronWorkflows**              | Kubernetes resources that run ETL jobs on defined schedules (e.g., every 15 minutes, every 2 hours) |
| **WorkflowTemplates**          | Reusable job definitions shared across all ETL jobs                                                 |
| **Cloud SQL Proxy** (GCP only) | Sidecar container providing secure MySQL connectivity                                               |
| **ClickHouse**                 | Analytics database using `lilt_analytics` schema with ReplacingMergeTree tables                     |

### How ETL Jobs Work

Each ETL job follows the same pattern:

1. **Connect** to MySQL and ClickHouse
2. **Read the high-water mark** — the timestamp of the last record processed in ClickHouse
3. **Fetch a batch** of records from MySQL that are newer than the high-water mark
4. **Transform** the records into the ClickHouse schema format
5. **Insert** the batch into ClickHouse
6. **Repeat** until no more new records are found
7. **Sleep** between batches to avoid overloading the source database

ClickHouse tables use the **ReplacingMergeTree** engine, which provides upsert semantics — if a record with the same primary key is inserted again (e.g., during a backfill), the latest version automatically replaces the old one during background merges.

## ETL Job Reference

The following jobs are available. Each can be independently enabled or disabled in your configuration:

| Job Name                   | Default Schedule | Description                                       |
| -------------------------- | ---------------- | ------------------------------------------------- |
| `ingest-file-translation`  | Every 15 minutes | File translation records and metadata             |
| `ingest-documents`         | Every hour       | Document-level translation data                   |
| `ingest-users`             | Every 2 hours    | User account information                          |
| `ingest-connectorjobs`     | Every 2 hours    | Connector integration job records                 |
| `ingest-comments`          | Every 2 hours    | Translation comments and annotations              |
| `segment-quality`          | Every 2 hours    | Segment-level quality metrics                     |
| `ingest-otd`               | Daily (3am, 2pm) | On-time delivery data                             |
| `ingest-revision-reports`  | Daily (5am)      | Revision report data                              |
| `ingest-financialdata`     | Daily (5am)      | Financial and billing data (GCP only)             |
| `linguist-activity-cdc`    | Every 2 hours    | Linguist activity change data capture (GCP only)  |
| `ingest-wpa-minio`         | Daily (2am)      | Word-pair analysis data from object storage       |
| `ingest-wpa-bigquery`      | Every 2 hours    | Word-pair analysis data from BigQuery (GCP only)  |
| `generate-memory-snapshot` | Daily (8pm)      | Translation memory snapshots                      |
| `refresh-reviewed-words`   | Every 2 hours    | Verified/reviewed word counts                     |
| `backfill`                 | Manual           | Rebuild materialized views                        |
| `debug`                    | Manual           | Debug pod (runs indefinitely for troubleshooting) |

<Note>
  Jobs marked **(GCP only)** depend on Google Cloud services and are not applicable to self-managed deployments. They should remain disabled in your configuration.
</Note>

## Initial Load

### What It Does

The Initial Load is a one-time workflow that runs automatically after a Helm install or upgrade (when enabled). It ensures that ClickHouse is fully populated with data by executing all enabled ETL jobs in series, followed by the backfill job.

### Execution Order

1. All enabled ETL jobs run **one at a time, in sequence** (excluding `backfill` and `debug`)
2. Each job is allowed to fail without blocking subsequent jobs
3. The `backfill` job runs **last**, rebuilding materialized views with the freshly ingested data

### When It Runs

The Initial Load is triggered as a **Helm post-install/post-upgrade hook**. It runs automatically whenever the dataflow chart is installed or upgraded, provided it is enabled in your configuration.

<Note>
  The Initial Load workflow may take significant time depending on the volume of historical data. Monitor progress via `kubectl` (see [Troubleshooting](#troubleshooting)).
</Note>

### Configuration

```yaml theme={null}
dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true  # Set to true to run on install/upgrade
```

### Manually Triggering an Initial Load

If you need to re-run the Initial Load outside of a Helm install/upgrade cycle, create a new Workflow from the existing WorkflowTemplate:

```bash theme={null}
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-initial-load-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-initial-load
EOF
```

Replace `<release-name>` with your Helm release name (typically `lilt`) and `<namespace>` with your deployment namespace.

## Backfill Operations

### What It Does

The backfill job rebuilds ClickHouse **materialized views** — pre-aggregated summary tables that power dashboards and reports. This is necessary when:

* New columns have been added to base tables (via ClickHouse migrations)
* Historical data needs to be reprocessed to populate new fields
* Materialized views have become inconsistent or corrupted

### How It Works

1. **Drops** existing materialized views
2. **Recreates** them with the current schema definition
3. **Backfills** historical data in time-based batches:
   * Batch 1: Last 12 months (most recent data first)
   * Batch 2: 12–24 months ago
   * Batch 3: Older than 24 months

This batched approach ensures that the most valuable recent data is available first and prevents timeouts during large rebuilds.

### Key Materialized Views

| View                                 | Purpose                                                                            |
| ------------------------------------ | ---------------------------------------------------------------------------------- |
| `mv_mt_translation_throughput`       | Aggregated instant/MT translation metrics by user, date, language pair, and domain |
| `mv_user_daily_verified_words`       | Daily verified translation word counts per user                                    |
| `mv_verified_translation_throughput` | Project-level verified translation throughput                                      |
| `mv_online_offline_total`            | Online vs offline translation totals                                               |
| `mv_jobs_costs_summary`              | Job cost summaries by domain                                                       |

### Configuration

The backfill job is defined as a manually-triggered CronWorkflow:

```yaml theme={null}
dataflow:
  onpremValues:
    jobs:
      scripts:
        backfill:
          enabled: true   # Create the CronWorkflow resource
          # trigger is always "manual" — no automatic schedule
```

When `enabled: true`, a **suspended** CronWorkflow is created. It will not run on a schedule — it must be triggered manually.

### Running a Backfill

Trigger the backfill by creating a new Workflow from the CronWorkflow's WorkflowTemplate:

```bash theme={null}
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-backfill-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "python -m bin.rebuild_materialized_views"
EOF
```

### Monitoring Progress

```bash theme={null}
# List running workflows
kubectl get workflows -n <namespace> | grep dataflow

# Watch workflow status
kubectl get workflow -n <namespace> <workflow-name> -w

# Stream pod logs for a running workflow
kubectl logs -n <namespace> -l workflows.argoproj.io/workflow=<workflow-name> -f
```

## Configuration Reference

### Enabling/Disabling Jobs

Each ETL job can be independently toggled. For self-managed deployments, the recommended configuration is:

```yaml theme={null}
dataflow:
  onpremValues:
    jobs:
      initialLoad:
        enabled: true

      scripts:
        # Enable for self-managed (uses object storage)
        ingest-wpa-minio:
          enabled: true

        # Disable — requires Google BigQuery (not available in self-managed)
        ingest-wpa-bigquery:
          enabled: false
        ingest-financialdata:
          enabled: false
        linguist-activity-cdc:
          enabled: false

        # Enable for backfill operations
        backfill:
          enabled: true

        # Optional debug pod (enable only when troubleshooting)
        debug:
          enabled: false
```

<Note>
  Jobs that depend on Google BigQuery (`ingest-wpa-bigquery`, `ingest-financialdata`, `linguist-activity-cdc`) must remain **disabled** in self-managed deployments as the BigQuery service is not available.
</Note>

## ClickHouse Migrations

Dataflow includes a **migration job** that runs automatically during Helm sync operations. This job applies any pending ClickHouse schema changes (new tables, columns, or materialized views) before ETL jobs run.

The migration job:

* Runs as a Kubernetes Job with an ArgoCD sync hook
* Applies migrations sequentially and idempotently
* Is automatically cleaned up after completion (default TTL: 10 minutes)

<Note>
  After a migration adds new columns to existing tables, a **backfill** may be required to populate those columns for historical records. Check the release notes for your version to determine if a backfill is recommended after upgrade.
</Note>

## Troubleshooting

### Viewing Job Status

List all dataflow CronWorkflows:

```bash theme={null}
kubectl get cronworkflows -n <namespace> | grep dataflow
```

List recent workflow runs:

```bash theme={null}
kubectl get workflows -n <namespace> --sort-by=.metadata.creationTimestamp
```

### Common Issues

#### Jobs are not running

1. **Check that Argo Workflows is installed and healthy:**
   ```bash theme={null}
   kubectl get pods -n <namespace> | grep argo
   ```

2. **Verify CronWorkflows exist and are not suspended** (scheduled jobs should show `suspend: false`):
   ```bash theme={null}
   kubectl get cronworkflows -n <namespace> -o yaml | grep -A2 "name:\|suspend:"
   ```

3. **Check that the dataflow image can be pulled:**
   ```bash theme={null}
   kubectl get events -n <namespace> --field-selector reason=Failed | grep dataflow
   ```

#### ClickHouse migration job hangs or fails

1. **Check the migration job logs:**
   ```bash theme={null}
   kubectl logs -n <namespace> -l job-name=<release-name>-clickhouse-migration-job-<revision>
   ```

2. **Verify the `lilt_analytics` database exists** in ClickHouse:
   ```bash theme={null}
   kubectl exec -it -n <namespace> <clickhouse-pod> -- \
     clickhouse-client --query "SHOW DATABASES"
   ```

#### Initial Load did not run after upgrade

The Initial Load only runs as a Helm hook. If it did not trigger:

1. Verify `initialLoad.enabled: true` in your values
2. Check Helm hook execution:
   ```bash theme={null}
   kubectl get workflows -n <namespace> | grep initial-load
   ```
3. Manually trigger if needed (see [Manually Triggering an Initial Load](#manually-triggering-an-initial-load))

### Enabling the Debug Pod

For deeper troubleshooting, enable the debug pod to get an interactive shell with the dataflow environment:

```yaml theme={null}
dataflow:
  onpremValues:
    jobs:
      scripts:
        debug:
          enabled: true
```

After deploying, trigger the debug workflow:

```bash theme={null}
kubectl create -n <namespace> -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <release-name>-debug-manual-
spec:
  workflowTemplateRef:
    name: <release-name>-job
  arguments:
    parameters:
      - name: command
        value: "sleep infinity"
EOF
```

Then exec into the running pod:

```bash theme={null}
kubectl exec -it -n <namespace> <debug-pod-name> -- /bin/bash
```

This gives you access to the full dataflow environment for running individual ETL commands, testing database connectivity, or inspecting the runtime configuration.

### Error Retry Behavior

ETL jobs include built-in retry logic for transient errors:

| Error Type                       | Behavior                                     |
| -------------------------------- | -------------------------------------------- |
| HTTP 500/502/503/504             | Retries up to 3 times with 5-second delay    |
| Connection reset/refused         | Retries up to 3 times with 5-second delay    |
| Timeout errors                   | Retries up to 3 times with 5-second delay    |
| ClickHouse memory limit exceeded | Fails immediately (increase resource limits) |
| Other database errors            | Fails immediately (check logs for details)   |
