Testing strategy

Welcome to Phase 5 — Extending & Contributing. Phases 0–4 taught you to read the engine; this phase turns you toward changing it. And the first thing a contributor needs is not a clever feature — it’s a way to know they didn’t break anything. This lesson is the engine’s answer to “how do I prove a change is correct?”

You’ll be able to: name the two tiers where clinker’s tests live, explain what “testing to the boundary” buys you, read a snapshot test and a golden-baseline regression test, and read the one property test that checks a fast band-join against a brute-force oracle.

Two tiers: unit tests next to the code, integration tests at the seam

Clinker’s tests live in two clearly separated places, and the split is the first thing to internalize:

Inline unit tests — a #[cfg(test)] mod tests block at the bottom of a source file, testing that file’s internals. They can reach private functions.
Integration tests — files under crates/<crate>/tests/, compiled as separate crates that can only call the crate’s public API.

A small inline unit test, right beside the coercion logic it covers:

clinker-record ·coercion.rs ·test_coerce_string_to_int_valid test @47d2e12

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_coerce_string_to_int_valid() {
        // exercises coerce_to_int directly — a private-ish unit of behavior
    }
}
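To make the visibility rule concrete, here is a minimal self-contained sketch. The `coerce_to_int` body below is a hypothetical stand-in (the real function's signature is not shown in this lesson), and the nested `string_to_int` module is an assumed layout that illustrates how a scenario module plus a behavior function produce a readable test name.

```rust
// Hypothetical stand-in for the coercion logic; not clinker's real function.
// There is no `pub`, so only code inside this file can call it.
fn coerce_to_int(raw: &str) -> Option<i64> {
    raw.trim().parse::<i64>().ok()
}

#[cfg(test)]
mod tests {
    use super::*; // brings the private function into scope

    // One nested module per scenario, one test function per behavior.
    mod string_to_int {
        use super::*;

        #[test]
        fn valid_digits_parse() {
            assert_eq!(coerce_to_int(" 42 "), Some(42));
        }

        #[test]
        fn non_numeric_is_rejected() {
            assert_eq!(coerce_to_int("4x2"), None);
        }
    }
}
```

Run under `cargo test`, a failure in the second case would be reported with a path ending in `tests::string_to_int::non_numeric_is_rejected`, which already says which scenario and which behavior broke.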

The naming is deliberate and worth copying: tests read as module::tests::scenario::behavior, so a failure name tells you what broke before you open the file. The testing-commands doc records the convention and the canonical run commands:

clinker ·50_TESTING_AND_COMMANDS.md doc @47d2e12

# Fast signal after an edit — does it still compile?
cargo check --workspace --locked --offline

# Run ONE test exactly (the `::`-separated path is the full test name):
cargo test -p clinker-exec --lib --offline \
  executor::tests::spill_dir_unavailable_midrun::unarmed_seam... -- --exact

# The whole suite. The ulimit prefix is load-bearing: the default 1024-fd
# limit makes clinker-exec's spill tests fail with "Too many open files".
ulimit -n 4096 && cargo test --workspace --locked --offline

Testing to the boundary

An integration test under tests/ can’t see internals — so it’s forced to drive the engine the way a user does: feed YAML and input bytes to the public executor, then assert on the output bytes and the run report. That’s “testing to the boundary,” and it’s the engine’s most valuable kind of test, because it survives any internal refactor that keeps the public behavior the same.

clinker-exec ·aggregate_integration.rs ·test_e2e_group_by_sum_count test @47d2e12

// End-to-end: CSV in → aggregate(group_by:[dept], sum + count) → CSV out.
let csv = "dept,salary\neng,100\neng,200\nsales,50\n";
let (report, output) = run_single(yaml, csv);

assert_eq!(report.dlq_entries.len(), 0);
assert_eq!(report.counters.ok_count, 2, "two output groups");
assert_eq!(
    sorted_body_lines(&output),
    vec!["eng,300,2".to_string(), "sales,50,1".to_string()],
);

Nothing here names a private type. If someone rewrites the aggregation operator’s internals tomorrow, this test still passes as long as the two eng rows (100 and 200) still aggregate to eng,300,2. That’s the whole point: the assertion is pinned to the contract, not the implementation.
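The same idea can be shown in miniature. The sketch below is not clinker's executor: `run_group_by` and `aggregate` are hypothetical names, and the test sits inline only so the example is self-contained (in a real crate it would live under tests/). What it illustrates is that the test touches only the public function and only its output bytes.

```rust
use std::collections::BTreeMap;

/// Public entry point (hypothetical): CSV text in, aggregated CSV text out.
pub fn run_group_by(csv: &str) -> String {
    let mut out = String::from("dept,sum,count\n");
    for (dept, (sum, count)) in aggregate(csv) {
        out.push_str(&format!("{dept},{sum},{count}\n"));
    }
    out
}

// Private internals: free to be rewritten (hash map, streaming, spilling)
// as long as `run_group_by` keeps producing the same bytes.
// This sketch silently skips malformed rows; a real engine would not.
fn aggregate(csv: &str) -> BTreeMap<String, (i64, u64)> {
    let mut groups: BTreeMap<String, (i64, u64)> = BTreeMap::new();
    for line in csv.lines().skip(1) {
        let Some((dept, salary)) = line.split_once(',') else { continue };
        let Ok(salary) = salary.trim().parse::<i64>() else { continue };
        let entry = groups.entry(dept.to_string()).or_insert((0, 0));
        entry.0 += salary;
        entry.1 += 1;
    }
    groups // BTreeMap iterates in key order, so output is deterministic
}

#[cfg(test)]
mod tests {
    use super::run_group_by; // only the public function is imported

    #[test]
    fn group_by_sum_count_end_to_end() {
        let csv = "dept,salary\neng,100\neng,200\nsales,50\n";
        assert_eq!(run_group_by(csv), "dept,sum,count\neng,300,2\nsales,50,1\n");
    }
}
```

Swapping the `BTreeMap` for any other grouping strategy leaves the test untouched, provided the output order is kept stable.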

Snapshot tests: assert on a big blob without hand-writing it

Some outputs are too large to hand-author an assert_eq! for, such as the full text of an execution plan from --explain. Clinker uses the insta crate: you write the test, run it once, review and accept the output insta captured, and commit the resulting .snap file. Later runs compare against that file; an intentional change is reviewed and re-accepted.

clinker-exec ·cull_explain_snapshot.rs ·explain_renders_cull_two_output_ports test @47d2e12

#[test]
fn explain_renders_cull_two_output_ports() {
    let text = render_explain(yaml);
    // a few structural asserts first (these document intent) ...
    assert!(text.contains("FORK [cull] 'drop_bad'"));
    // ... then snapshot the whole rendered plan under a stable name:
    insta::assert_snapshot!("explain_cull_two_output_ports", text);
}

The committed snapshot it locks against starts with an insta header and then the captured value:

---
source: crates/clinker-exec/tests/cull_explain_snapshot.rs
expression: text
---
=== Execution Plan ===

Mode: Streaming
...

When you intentionally change --explain output, the snapshot test fails, you eyeball the diff, and accept it with cargo insta review (or INSTA_UPDATE=always). The discipline: a snapshot diff in a PR is a visible, reviewable record of an output-format change — it can’t sneak through silently.
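The review-then-accept discipline can be sketched with the standard library alone. This is not insta's implementation or API: `check_snapshot` and `accept_snapshot` are hypothetical helpers, loosely modeled on the idea that a changed output is parked in a pending file rather than overwriting the accepted snapshot.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper, not insta's API: compare `actual` to `<name>.snap`.
/// A missing or different snapshot is never overwritten silently; the new
/// text is parked in `<name>.snap.new` for a human to review.
fn check_snapshot(dir: &Path, name: &str, actual: &str) -> Result<(), String> {
    let accepted = dir.join(format!("{name}.snap"));
    let pending = dir.join(format!("{name}.snap.new"));
    match fs::read_to_string(&accepted) {
        Ok(expected) if expected == actual => Ok(()),
        _ => {
            fs::write(&pending, actual).map_err(|e| e.to_string())?;
            Err(format!("snapshot '{name}' needs review: {}", pending.display()))
        }
    }
}

/// The explicit "accept" step: promote the pending file to the snapshot.
fn accept_snapshot(dir: &Path, name: &str) -> std::io::Result<()> {
    fs::rename(
        dir.join(format!("{name}.snap.new")),
        dir.join(format!("{name}.snap")),
    )
}
```

The design point is that nothing becomes the accepted snapshot without a separate, deliberate step, so the change shows up as a file diff in the PR.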

Golden-baseline regression seeds

The strongest refactor net in the codebase is a corpus of golden baselines: real pipelines whose exact output bytes are committed under tests/fixtures/baselines/ (e.g. csv_transform_sink.expected.csv). A driver runs each pipeline and compares the fresh output to the committed file, byte for byte:

clinker-exec ·pre_lift_baselines.rs ·compare_or_write fn @47d2e12

fn compare_or_write(baseline_name: &str, actual: &str) {
    let p = baseline_root().join(baseline_name);
    if update_mode() || !p.exists() {
        // First run (or UPDATE_BASELINES=1): capture the golden.
        std::fs::write(&p, actual.as_bytes()).unwrap();
        return;
    }
    let expected = std::fs::read_to_string(&p).unwrap();
    assert_eq!(actual, expected, "byte-mismatch against baseline {}", p.display());
}

Read the control flow carefully — it’s the whole regression-seed pattern in ten lines. The first time a fixture runs (or whenever you deliberately set UPDATE_BASELINES=1), the current output is written as the new golden. Every run after that compares. So the seed is captured once, then frozen; any future change that alters a single byte of any baseline pipeline’s output trips a named failure. (Note: the corpus is keyed by fixture name, not by issue number — clinker doesn’t tag regression tests with bug IDs.)
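To watch the capture-then-compare lifecycle without touching the real fixtures, here is a hedged adaptation of the function above. It keeps the same branch structure, but takes the baseline root and the update flag as parameters (assumptions made for testability; the original reads them from baseline_root() and update_mode()) and returns an outcome instead of panicking.

```rust
use std::fs;
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Outcome {
    Captured, // golden written (file was missing, or update requested)
    Matched,  // fresh output is byte-identical to the golden
    Mismatch, // fresh output differs from the golden
}

// Same control flow as the original, with `root` and `update` made explicit.
fn compare_or_write(
    root: &Path,
    name: &str,
    actual: &str,
    update: bool,
) -> std::io::Result<Outcome> {
    let p = root.join(name);
    if update || !p.exists() {
        fs::write(&p, actual.as_bytes())?;
        return Ok(Outcome::Captured);
    }
    let expected = fs::read_to_string(&p)?;
    Ok(if expected == actual {
        Outcome::Matched
    } else {
        Outcome::Mismatch
    })
}
```

Calling it four times in a scratch directory with ("a,1\n", false), ("a,1\n", false), ("a,2\n", false), ("a,2\n", true) yields Captured, Matched, Mismatch, Captured. One consequence of the `!p.exists()` branch is worth keeping in mind: a baseline that was deleted or never committed is recaptured instead of failing, so an unexpected new file under the baselines directory after a test run deserves a look.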

One property test: fast algorithm vs. slow oracle

Most tests check fixed examples. A property test instead generates hundreds of random inputs and checks an invariant on every one. Clinker uses proptest for exactly one high-value case: its fast band-join (iejoin_numeric) must agree with a dead-simple, obviously-correct nested-loop join on every random input.

clinker-exec ·iejoin.rs ·proptest_iejoin_matches_nested_loop test @47d2e12

proptest! {
    #![proptest_config(ProptestConfig::with_cases(256))]
    #[test]
    fn proptest_iejoin_matches_nested_loop((left, right, op1, op2) in arb_inputs()) {
        let actual: HashSet<(usize, usize)> =
            iejoin_numeric(&left, &right, op1.to_range(), op2.to_range())
                .into_iter().collect();
        let expected = nested_loop(&left, &right, op1, op2);   // the slow oracle
        prop_assert_eq!(actual, expected);
    }
}

This is the oracle pattern: you have a fast implementation you’re unsure about and a slow implementation you trust, and you assert they always agree. It’s worth the machinery precisely because the fast path (coarse-filter striding, permutation indexing) is the kind of code that’s easy to get subtly wrong. For straightforward behavior, a handful of example tests is cheaper and clearer — don’t reach for proptest by default. (The repo also has a combine_iejoin_prop.rs scaffold; the live property test is the inline one shown here.)
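The shape of the oracle pattern can be sketched without proptest. Everything below is illustrative: the fast join is a simple sort-plus-binary-search for the single predicate left < right (not clinker's iejoin), and a small hand-rolled linear congruential generator stands in for proptest's strategies, so there is no shrinking of failing inputs.

```rust
/// Fast path (illustrative): all (i, j) with left[i] < right[j], found by
/// sorting the right side once and binary-searching per left row.
fn fast_less_than_join(left: &[i64], right: &[i64]) -> Vec<(usize, usize)> {
    let mut order: Vec<usize> = (0..right.len()).collect();
    order.sort_by_key(|&j| right[j]);
    let mut pairs = Vec::new();
    for (i, &l) in left.iter().enumerate() {
        // First position in `order` whose value is strictly greater than l.
        let start = order.partition_point(|&j| right[j] <= l);
        pairs.extend(order[start..].iter().map(|&j| (i, j)));
    }
    pairs.sort_unstable();
    pairs
}

/// Slow oracle: check every pair.
fn nested_loop_join(left: &[i64], right: &[i64]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (i, &l) in left.iter().enumerate() {
        for (j, &r) in right.iter().enumerate() {
            if l < r {
                pairs.push((i, j));
            }
        }
    }
    pairs // already sorted: i ascending, then j ascending
}

/// Tiny deterministic generator (an LCG) standing in for proptest.
fn next_u64(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

fn random_column(state: &mut u64) -> Vec<i64> {
    let len = (next_u64(state) % 8) as usize; // 0..=7 rows, so empty inputs occur
    (0..len)
        .map(|_| (next_u64(state) % 10) as i64 - 5) // small range forces duplicates
        .collect()
}

/// Returns the first disagreeing input, if any, over `cases` random trials.
fn find_counterexample(cases: u32, seed: u64) -> Option<(Vec<i64>, Vec<i64>)> {
    let mut state = seed;
    for _ in 0..cases {
        let (left, right) = (random_column(&mut state), random_column(&mut state));
        if fast_less_than_join(&left, &right) != nested_loop_join(&left, &right) {
            return Some((left, right));
        }
    }
    None
}
```

A test would assert that `find_counterexample(256, seed)` returns `None`. Keeping the value range small is a deliberate choice: ties and empty inputs are where a fast join's boundary conditions tend to go wrong, and a narrow range makes them frequent.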

Match the test to what you’re changing

// quick check

You refactor the internals of the aggregation operator but intend zero change to its output. Which test most directly protects you, and why?

Investigate the suite

✓ Checkpoint — testing strategy

💡 Hint 1

Run the aggregate integration test first — it’s the boundary test from the tour. Then list the baselines directory: each .expected.csv is one frozen golden output. Open compare_or_write and trace the update_mode() || !p.exists() branch.

💡 Hint 2

To see the regression net work: pick a baseline pipeline, change its expected .csv by one byte, and re-run — the byte-mismatch panic names the fixture. Restore it with git restore, or regenerate with UPDATE_BASELINES=1.

What the tour establishes

Tests live in two tiers: inline #[cfg(test)] mod tests (private internals) and crates/*/tests/ integration files (public API only). The most durable tests drive the public executor and assert on output bytes; that is testing to the boundary. Large outputs use insta snapshots (reviewable diffs); whole-pipeline outputs use committed golden .expected.csv baselines compared by compare_or_write (captured when the file is missing or when UPDATE_BASELINES=1 is set, compared on every other run). One proptest checks the fast band-join against a nested-loop oracle, the right tool when a fast algorithm needs a trusted slow twin.

// quick check

What does setting UPDATE_BASELINES=1 do for a golden-baseline fixture that already has a committed .expected.csv?

You should be able to:

You can state the difference between an inline #[cfg(test)] unit test and a tests/ integration test (what each can reach)
You can explain what a golden baseline is, what happens on a fixture’s first run versus later runs, and what setting UPDATE_BASELINES=1 changes
You can explain when a property test (oracle pattern) earns its keep versus a few example tests

Verify in the checkout:

ulimit -n 4096 && cargo test -p clinker-exec --offline --test aggregate_integration
grep -rn 'insta::assert_snapshot' crates/clinker-exec/tests/ | head
ls crates/clinker-exec/tests/fixtures/baselines/
grep -n 'proptest_iejoin_matches_nested_loop' crates/clinker-exec/src/pipeline/iejoin.rs

You can now prove a change is correct. Next: make your first real change — add a builtin to CXL, the engine’s expression language.