Build & run Clinker

The fastest way to start understanding a system is to run it once, end to end, and watch it do something real. Before we open a single source file, let’s get Clinker compiling and move some actual data through it.

You’ll be able to: build the engine, run a real pipeline, and read its result — and you’ll have met the shape of every Clinker job before we explain any of it.

What Clinker is (the one-paragraph version)

Clinker is a bounded-memory, single-process batch executor for finite ETL jobs. You describe a job as a pipeline in YAML — a set of nodes (a source, some transforms, an output) wired into a graph. Per-record logic is written in a small expression language called CXL. You hand the pipeline to the clinker command; it reads records from the source, pushes them through the graph, writes the output, and exits. Finite and batch: the sources end, the job drains, the process stops.

That’s the whole mental model for now. We’ll earn every word of it over the coming phases.
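To make that shape concrete before touching the real engine, here is a minimal sketch of a finite batch pipeline in Python. It is not Clinker's implementation: `run_pipeline`, `add_greeting`, and the record layout are hypothetical names chosen for illustration. It only mirrors the description above: read from a source that ends, push each record through the transforms in order, write it out, and stop.

```python
from typing import Callable, Iterable

Record = dict[str, str]
Transform = Callable[[Record], Record]


def run_pipeline(source: Iterable[Record],
                 transforms: list[Transform],
                 sink: list[Record]) -> int:
    """Push every record from a finite source through the transforms, in
    order, append the result to the sink, and return the number written."""
    written = 0
    for record in source:          # the source is finite, so this loop ends
        for transform in transforms:
            record = transform(record)
        sink.append(record)        # only one record is in flight at a time
        written += 1
    return written                 # the job has drained; nothing is left to do


def add_greeting(record: Record) -> Record:
    # Hypothetical transform: return a copy of the record with one extra field.
    return {**record, "greeting": "hello " + record["name"]}


out: list[Record] = []
count = run_pipeline([{"name": "a"}, {"name": "b"}], [add_greeting], out)
print(count)  # 2
print(out[1])  # {'name': 'b', 'greeting': 'hello b'}
```

The source here is a two-element list for brevity. Passing a generator that yields one record per input line would keep memory flat regardless of input size, which is the property "bounded-memory" refers to.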

Clinker ships runnable example pipelines. We’ll use the canonical one:

clinker · customer_etl.yaml example @47d2e12

It’s a customer ETL job — a CSV of customers in, a CSV of flagged customers out, with two transforms in between:

nodes:
  - type: source # read customers.csv
    name: customers
    config: { type: csv, path: ./data/customers.csv, ... }
  - type: transform # add an is_active flag
    name: active_only
    input: customers
    config:
      cxl: |
        emit is_active = status == "active"
  - type: transform # classify into a gold/standard tier
    name: final_flag
    input: active_only
    config:
      cxl: |
        emit tier = if lifetime_value.to_int() > $vars.gold_threshold then "gold" else "standard"
  - type: output # write the result
    name: results
    input: final_flag
    config: { type: csv, path: ./output/customers.csv }

Four nodes — source → transform → transform → output — a tiny pipeline that is nonetheless a complete Clinker job. The input is small and human-readable:

customer_id,first_name,last_name,email,status,lifetime_value,zip_code
1001,Alice,Chen,alice.chen@acme.com,active,15200,94103
1002,Bob,Martinez,bob.m@globex.com,active,8400,10001
1003,Carol,Johnson,carol.j@example.com,inactive,3200,60601

Alice is active with a lifetime value above the gold_threshold (default 10000), so she’ll be flagged gold; Bob is active but below it, so standard; Carol is inactive, so her is_active flag comes out false, and her lifetime value is below the threshold too.
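As a cross-check on those expectations, the two CXL expressions can be mirrored in a few lines of Python and applied to the three sample rows. This is a sketch under assumptions, not a CXL interpreter: it assumes `emit` adds a field and leaves the others untouched, that `to_int()` behaves like an integer parse, and that `gold_threshold` keeps its default of 10000. The `flag` helper is a hypothetical name.

```python
import csv
import io

SAMPLE = """\
customer_id,first_name,last_name,email,status,lifetime_value,zip_code
1001,Alice,Chen,alice.chen@acme.com,active,15200,94103
1002,Bob,Martinez,bob.m@globex.com,active,8400,10001
1003,Carol,Johnson,carol.j@example.com,inactive,3200,60601
"""

GOLD_THRESHOLD = 10000  # the default quoted in the text


def flag(row: dict) -> dict:
    """Mirror both transforms: add is_active, then add tier."""
    out = dict(row)
    # emit is_active = status == "active"
    out["is_active"] = row["status"] == "active"
    # emit tier = if lifetime_value.to_int() > threshold then "gold" else "standard"
    # (assumption: the tier depends only on lifetime_value, as the expression reads)
    is_gold = int(row["lifetime_value"]) > GOLD_THRESHOLD
    out["tier"] = "gold" if is_gold else "standard"
    return out


rows = [flag(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r["first_name"], r["is_active"], r["tier"])
# Alice True gold
# Bob True standard
# Carol False standard
```

Note that the tier expression never looks at status, so under this reading Carol still receives a tier; her inactivity shows up only in is_active.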

Clinker pins its toolchain: a rust-toolchain.toml selects the exact Rust version, so rustup installs the right compiler automatically on your first cargo invocation. From your clinker checkout:

cargo build -p clinker

The first build compiles the whole workspace and takes a few minutes; after that, builds are incremental and fast. (Phase 0.2 is all about that fast inner loop.)

Two ways to run a pipeline — start with the one that doesn’t touch any data.

1. See the plan, without executing — --explain. Run the example from the examples/pipelines/ directory (so the pipeline’s ./data/... paths resolve):

cd examples/pipelines
cargo run -p clinker -- run customer_etl.yaml --explain

Clinker compiles the pipeline into an execution plan and prints it — but runs nothing:

=== Execution Plan ===
Mode: Streaming
Transforms: 2
Output projections: 1
DAG nodes: 4
arbitration: BackPressurePreferred -> Priority
Source DAG:
Tier 0: customers

Four DAG nodes, two transforms, “Streaming” mode. You’re looking at the plan — the proof that the job is well-formed — before any record moves. We’ll come back to this view in lesson 0.4, and to why planning is separate from running much later.
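The node and transform counts in that plan can be recomputed by hand from the pipeline's node list. The sketch below does so and also performs one simple well-formedness check: every `input` must name a declared node. It is an illustration only. Clinker's planner is not shown in this lesson and certainly checks more than this, and `summarize` is a hypothetical helper.

```python
# The four nodes of customer_etl.yaml, reduced to type, name, and input.
nodes = [
    {"type": "source", "name": "customers"},
    {"type": "transform", "name": "active_only", "input": "customers"},
    {"type": "transform", "name": "final_flag", "input": "active_only"},
    {"type": "output", "name": "results", "input": "final_flag"},
]


def summarize(nodes: list[dict]) -> dict:
    """Reject dangling inputs, then count the nodes the way the plan reports them."""
    names = {n["name"] for n in nodes}
    for n in nodes:
        upstream = n.get("input")  # sources have no input
        if upstream is not None and upstream not in names:
            raise ValueError(
                f"node {n['name']!r} reads from unknown node {upstream!r}"
            )
    return {
        "dag_nodes": len(nodes),
        "transforms": sum(n["type"] == "transform" for n in nodes),
        "sources": [n["name"] for n in nodes if n["type"] == "source"],
    }


summary = summarize(nodes)
print(summary)
# {'dag_nodes': 4, 'transforms': 2, 'sources': ['customers']}
```

A misspelled `input` (say, `active_onyl`) would raise here before any record is read, which is the kind of failure a plan-only step is meant to surface.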

2. Actually move data — --dry-run. A dry run processes records and writes the result to your terminal instead of to the output file:

cargo run -p clinker -- run customer_etl.yaml --dry-run -n 5
INFO clinker: Pipeline complete: 5 total, 5 ok, 5 written, 0 dlq

Five records in, five processed, five written, zero rejected — and the process exits 0. That summary line (total / ok / written / dlq) is Clinker telling you the finite job ran clean. (dlq is the dead-letter queue — rejected records; we’ll meet it soon.)
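If you want to check that summary from a script rather than by eye, a small parser is enough. The sketch below assumes the line always has the shape shown above; that format is inferred from this single example, not from a documented contract. `parse_summary` and `ran_clean` are hypothetical helpers, and "clean" here means zero dead-lettered records with every record processed successfully.

```python
import re

# Pattern inferred from the one summary line shown in this lesson.
SUMMARY = re.compile(
    r"Pipeline complete: (\d+) total, (\d+) ok, (\d+) written, (\d+) dlq"
)


def parse_summary(line: str) -> dict[str, int]:
    """Extract the four counters from a 'Pipeline complete' log line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("no pipeline summary found in line")
    total, ok, written, dlq = (int(g) for g in match.groups())
    return {"total": total, "ok": ok, "written": written, "dlq": dlq}


def ran_clean(counts: dict[str, int]) -> bool:
    # Assumption: a clean run rejects nothing and processes every record.
    return counts["dlq"] == 0 and counts["ok"] == counts["total"]


counts = parse_summary(
    "INFO clinker: Pipeline complete: 5 total, 5 ok, 5 written, 0 dlq"
)
print(counts)             # {'total': 5, 'ok': 5, 'written': 5, 'dlq': 0}
print(ran_clean(counts))  # True
```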

You just ran a four-node DAG end to end. Each of those nodes — the source, the two CXL transforms, the output — is a door we’ll open in later lessons. Next: the fast edit-and-check loop you’ll live in while working on the engine.