Error handling
A pipeline chewing through ten million records will hit failures of very different weight. One malformed row in a CSV is data — annoying, expected, and no reason to throw away the other 9,999,999 records. A disk filling mid-write is infrastructure. And an invariant the compiler proved at plan time being violated at run time is a bug in clinker itself — which must stop everything, loudly. A good error type doesn’t just say that something failed; it encodes which kind, because the kind decides the fate of the run.
You’ll be able to: read the PipelineError enum, explain how From impls let ?
propagate subsystem errors, and describe why an Internal error always aborts while a per-record
Eval error can be quarantined.
One enum that aggregates every failure
PipelineError is the engine’s top-level runtime error. It’s a sum type with one variant per
subsystem failure, plus a set of specific “this went wrong” variants:
clinker-plan · error.rs · PipelineError type @47d2e12

```rust
#[derive(Debug)]
pub enum PipelineError {
    Config(crate::config::ConfigError),
    Format(clinker_format::FormatError),
    Eval(cxl::eval::EvalError),
    Io(std::io::Error),
    /// Plan-time invariant violated at runtime — Clinker bug, not a data
    /// error. ALWAYS aborts the run regardless of ErrorStrategy::Continue.
    Internal { op: &'static str, node: String, detail: String },
    MemoryBudgetExceeded { node: String, used: u64, limit: u64, /* … */ },
    // ... ~25 variants; many documented "Always aborts the run"
}
```

Notice this enum is hand-written: no thiserror derive. error.rs writes its own
Display, its own impl std::error::Error, and its own From conversions. That’s a
deliberate choice for a type this central: the Display strings are diagnostics shown to
users, worth hand-tuning. (You’ll meet thiserror elsewhere in the codebase; it’s a fine tool,
just not used for this type.)
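To make "hand-written" concrete, here is a minimal sketch of what writing Display and std::error::Error by hand looks like. SketchError, its two variants, and the message wording are assumptions made for illustration; they are not clinker's actual Display strings.

```rust
use std::error::Error;
use std::fmt;
use std::io;

// Reduced stand-in for illustration; not clinker's actual PipelineError.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
    MemoryBudgetExceeded { node: String, used: u64, limit: u64 },
}

// Hand-written Display: each variant gets a message tuned for the operator.
impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Io(e) => write!(f, "I/O error: {e}"),
            Self::MemoryBudgetExceeded { node, used, limit } => write!(
                f,
                "node '{node}' used {used} bytes, over its {limit}-byte budget"
            ),
        }
    }
}

// Hand-written Error: expose the wrapped error so callers can walk the cause chain.
impl Error for SketchError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::Io(e) => Some(e),
            Self::MemoryBudgetExceeded { .. } => None,
        }
    }
}

fn main() {
    let over = SketchError::MemoryBudgetExceeded {
        node: "sort_1".to_string(),
        used: 2048,
        limit: 1024,
    };
    let disk = SketchError::Io(io::Error::new(io::ErrorKind::Other, "disk full"));
    println!("{over}");
    println!("{disk} (has source: {})", disk.source().is_some());
}
```

A derive macro would generate roughly these impls from attributes; writing them out keeps every user-facing string in one match that is easy to review.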
From impls are what make ? work
The magic word in idiomatic Rust error handling is ?. Writing let r = next_record()?; means
“if this is Err, return it from the enclosing function — converting it to my error type on
the way out.” That conversion is powered by the From trait. Each subsystem error gets a
hand-written From into PipelineError:
```rust
impl From<std::io::Error> for PipelineError {
    fn from(e: std::io::Error) -> Self {
        Self::Io(e)
    }
}

impl From<cxl::eval::EvalError> for PipelineError {
    fn from(e: cxl::eval::EvalError) -> Self {
        Self::Eval(e)
    }
}
// ... ConfigError, FormatError, SchemaError, SpillError
```

With those in place, a function returning Result<_, PipelineError> can call into the format
layer, the IO layer, and the CXL evaluator and just write ? after each — every foreign error
is auto-lifted into the right PipelineError variant. No match, no manual map_err. The
six From impls are the seams that let one error type absorb six others.
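The lifting can be tried with nothing but std. The sketch below assumes a two-variant SketchError and two hypothetical helpers, parse_total and total_from_file; the file path is likewise made up and is expected not to exist.

```rust
use std::fs;
use std::io;
use std::num::ParseIntError;

// Reduced stand-in for illustration; not clinker's actual PipelineError.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        Self::Io(e)
    }
}

impl From<ParseIntError> for SketchError {
    fn from(e: ParseIntError) -> Self {
        Self::Parse(e)
    }
}

// Sums whitespace-separated integers. `expr?` behaves roughly like
// `match expr { Ok(v) => v, Err(e) => return Err(From::from(e)) }`.
fn parse_total(text: &str) -> Result<i64, SketchError> {
    let mut total = 0;
    for token in text.split_whitespace() {
        total += token.parse::<i64>()?; // ParseIntError lifted to SketchError::Parse
    }
    Ok(total)
}

// Two different foreign error types, one return type, no map_err.
fn total_from_file(path: &str) -> Result<i64, SketchError> {
    let text = fs::read_to_string(path)?; // io::Error lifted to SketchError::Io
    parse_total(&text)
}

fn main() {
    println!("{:?}", parse_total("1 2 3"));
    println!("{:?}", parse_total("1 x 3"));
    println!("{:?}", total_from_file("/definitely/missing/sketch-input.txt"));
}
```

Removing either From impl turns the matching ? into a compile error, which is a quick way to see that the conversion is resolved by the type system rather than at run time.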
The taxonomy: abort versus quarantine
Here’s where the type earns its keep. The variants split into two fates:
- Recoverable (per-record data errors). A Value that won’t cast, a cxl::eval::EvalError on one row. Under the right policy these are sent to a dead-letter queue (DLQ): the bad record is set aside and the run continues.
- Fatal (always abort). Internal, MemoryBudgetExceeded, schema mismatches, sort-order violations. These stop the run regardless of policy. The variant docs say so in capital letters: “ALWAYS aborts the run regardless of ErrorStrategy::Continue.”
Which policy governs the recoverable ones is a config setting:
clinker-plan · pipeline.rs · ErrorStrategy type @47d2e12

```rust
pub enum ErrorStrategy {
    FailFast,   // any error stops the run (the default)
    Continue,   // recoverable errors go to the DLQ; keep going
    BestEffort,
}
```

And the actual routing, the place where a per-record eval error meets the policy, is one function in the executor:
clinker-exec · dispatch.rs · dispatch_transform_eval_error fn @47d2e12

```rust
fn dispatch_transform_eval_error(/* … */) -> Result<…, PipelineError> {
    if ctx.strategy == ErrorStrategy::FailFast {
        return Err(eval_err.into()); // propagate — the ? at the call site aborts
    }
    // otherwise: classify and route the bad record to the DLQ, run continues
}
```

The crucial asymmetry: only recoverable errors flow through this router. The fatal variants
are never offered to the DLQ at all — Internal and its kin are constructed and ?-propagated
directly, bypassing this function entirely. So ErrorStrategy::Continue can keep a job alive
through a million bad rows, but it cannot suppress a clinker bug. That’s by design: a
Continue policy is a statement about your data, never a license to limp on through a broken
engine.
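One way to picture that asymmetry is through the router's signature. The sketch below is an assumption about shape, not clinker's code: EvalError, SketchError, Strategy, route_eval_error and require_sorted are all hypothetical stand-ins, and the DLQ is just a Vec of strings.

```rust
// Stand-in for cxl::eval::EvalError; the real type is richer.
#[derive(Debug)]
struct EvalError(String);

// Reduced stand-in for illustration; not clinker's actual PipelineError.
#[derive(Debug)]
enum SketchError {
    Eval(EvalError),
    Internal { op: &'static str, detail: String },
}

impl From<EvalError> for SketchError {
    fn from(e: EvalError) -> Self {
        Self::Eval(e)
    }
}

#[derive(Clone, Copy, PartialEq)]
enum Strategy {
    FailFast,
    Continue,
}

// The router takes an EvalError, not a SketchError, so a fatal variant
// cannot even be passed to it.
fn route_eval_error(
    strategy: Strategy,
    err: EvalError,
    dlq: &mut Vec<String>,
) -> Result<(), SketchError> {
    if strategy == Strategy::FailFast {
        return Err(err.into()); // the caller's ? ends the run
    }
    dlq.push(err.0); // quarantine the record's error; the run continues
    Ok(())
}

// Fatal path: the error is built and returned directly; no strategy is consulted.
fn require_sorted(prev: i64, next: i64) -> Result<(), SketchError> {
    if next < prev {
        return Err(SketchError::Internal {
            op: "merge",
            detail: format!("{next} arrived after {prev}"),
        });
    }
    Ok(())
}

fn main() {
    let mut dlq = Vec::new();
    let routed = route_eval_error(Strategy::Continue, EvalError("bad cast".into()), &mut dlq);
    println!("continue: {routed:?}, dlq: {dlq:?}");
    let stopped = route_eval_error(Strategy::FailFast, EvalError("bad cast".into()), &mut dlq);
    println!("fail-fast: {stopped:?}");
    println!("fatal: {:?}", require_sorted(10, 3));
}
```

Because require_sorted never receives a Strategy, no configuration value can change its outcome.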
Why Internal is its own fate
Pause on Internal { op, node, detail }. Its doc calls it a “Clinker bug, not a data error.”
It exists for the cases that the plan/runtime boundary (last lesson) was supposed to make
impossible — an invariant that compilation proved, tripped anyway at run time. Routing such a
thing to the DLQ would be exactly wrong: it would hide a bug behind a “bad record” label and
let a broken run produce plausible-looking output. Making Internal always-fatal means the
engine fails loudly at the first sign it has violated its own contract — the run dies, the
operator sees it, nobody trusts corrupt output. The error taxonomy is how the engine refuses to
paper over its own bugs.
Build the taxonomy in miniature
?, From, and the abort/quarantine split, all in std:
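A minimal sketch of such a miniature follows. It is an illustration built only on std, not clinker's code: MiniError, Strategy, process and run are hypothetical names, doubling the number stands in for real work, and the DLQ is a plain Vec of strings.

```rust
use std::num::ParseIntError;

#[derive(Debug)]
enum MiniError {
    // Recoverable: one record failed to evaluate.
    Eval(ParseIntError),
    // Fatal: an engine bug. Unused until you try the exercise in the text.
    #[allow(dead_code)]
    Internal { detail: String },
}

impl From<ParseIntError> for MiniError {
    fn from(e: ParseIntError) -> Self {
        Self::Eval(e)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    FailFast,
    Continue,
}

// Per-record work: parse the record, then double it.
fn process(raw: &str) -> Result<i64, MiniError> {
    let n: i64 = raw.parse()?; // ParseIntError lifted to MiniError::Eval
    Ok(n * 2)
}

// Returns (outputs, dead-letter queue), or the error that aborted the run.
fn run(records: &[&str], strategy: Strategy) -> Result<(Vec<i64>, Vec<String>), MiniError> {
    let mut out = Vec::new();
    let mut dlq = Vec::new();
    for &raw in records {
        match process(raw) {
            Ok(v) => out.push(v),
            // Only the recoverable variant is ever offered to the DLQ,
            // and only when the policy allows it.
            Err(MiniError::Eval(e)) if strategy == Strategy::Continue => {
                dlq.push(format!("{raw}: {e}"));
            }
            // Everything else, including every Internal, aborts the run.
            Err(e) => return Err(e),
        }
    }
    Ok((out, dlq))
}

fn main() {
    let records = ["5", "oops", "37"];
    println!("Continue: {:?}", run(&records, Strategy::Continue));
    println!("FailFast: {:?}", run(&records, Strategy::FailFast));
}
```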
Run it. Under Continue, "oops" is quarantined and "37" still processes; under FailFast,
the first bad record stops everything. Now change process to return Internal for "oops"
and watch it abort under both policies — that’s the fatal class refusing to be quarantined.
// quick check
Why does an Internal error abort the run even under ErrorStrategy::Continue, while an Eval error can be sent to the DLQ?
The taxonomy ties fate to kind. Continue is a statement about messy data, not permission to run a broken engine, so Internal (and other always-abort variants) bypass the DLQ router and propagate directly.
Read the error layer
You can now read the engine’s failure taxonomy and say why some errors quarantine a record while others kill the run. One subsystem has shown up in nearly every lesson (the expression language, CXL) but always from the outside. The final Phase 3 lesson goes inside it: how a formula becomes a typechecked, compile-once program that runs over every record.