Error handling

A pipeline chewing through ten million records will hit failures of very different weight. One malformed row in a CSV is data — annoying, expected, and no reason to throw away the other 9,999,999 records. A disk filling mid-write is infrastructure. And an invariant the compiler proved at plan time being violated at run time is a bug in clinker itself — which must stop everything, loudly. A good error type doesn’t just say that something failed; it encodes which kind, because the kind decides the fate of the run.

You’ll be able to: read the PipelineError enum, explain how From impls let ? propagate subsystem errors, and describe why an Internal error always aborts while a per-record Eval error can be quarantined.

PipelineError is the engine’s top-level runtime error. It’s a sum type with one variant per subsystem failure, plus a set of specific “this went wrong” variants:

clinker-plan · error.rs · PipelineError type @47d2e12
#[derive(Debug)]
pub enum PipelineError {
    Config(crate::config::ConfigError),
    Format(clinker_format::FormatError),
    Eval(cxl::eval::EvalError),
    Io(std::io::Error),
    /// Plan-time invariant violated at runtime — Clinker bug, not a data
    /// error. ALWAYS aborts the run regardless of ErrorStrategy::Continue.
    Internal { op: &'static str, node: String, detail: String },
    MemoryBudgetExceeded { node: String, used: u64, limit: u64, /* … */ },
    // ... ~25 variants; many documented "Always aborts the run"
}

Notice this enum is hand-written — no thiserror derive. error.rs writes its own Display, its own impl std::error::Error, and its own From conversions. That’s a deliberate choice for a type this central: the Display strings are diagnostics shown to users, worth hand-tuning. (You’ll meet thiserror elsewhere in the codebase; it’s a fine tool, just not used for this type.)
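
To make "hand-written" concrete, here is a minimal sketch of what writing Display and std::error::Error by hand can look like. It uses a two-variant stand-in for PipelineError; the message wording and the source() override are assumptions for illustration, not clinker's actual diagnostics.

```rust
use std::error::Error;
use std::fmt;
use std::io;

// Simplified stand-in for PipelineError: two variants only.
#[derive(Debug)]
enum PipelineError {
    Io(io::Error),
    Internal { op: &'static str, node: String, detail: String },
}

impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Wording is illustrative, not clinker's real messages.
            Self::Io(e) => write!(f, "I/O error: {e}"),
            Self::Internal { op, node, detail } => {
                write!(f, "internal error in {op} at node '{node}': {detail}")
            }
        }
    }
}

impl Error for PipelineError {
    // Assumption: expose the wrapped error so callers can walk the cause chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::Io(e) => Some(e),
            Self::Internal { .. } => None,
        }
    }
}

fn main() {
    let err = PipelineError::Internal {
        op: "sort",
        node: "n3".to_string(),
        detail: "input not ordered".to_string(),
    };
    println!("{err}");
}
```

The cost relative to a derive is a few dozen lines of boilerplate; the benefit is full control over every string a user will read.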

The magic word in idiomatic Rust error handling is ?. Writing let r = next_record()?; means “if this is Err, return it from the enclosing function — converting it to my error type on the way out.” That conversion is powered by the From trait. Each subsystem error gets a hand-written From into PipelineError:

impl From<std::io::Error> for PipelineError {
    fn from(e: std::io::Error) -> Self { Self::Io(e) }
}
impl From<cxl::eval::EvalError> for PipelineError {
    fn from(e: cxl::eval::EvalError) -> Self { Self::Eval(e) }
}
// ... ConfigError, FormatError, SchemaError, SpillError

With those in place, a function returning Result<_, PipelineError> can call into the format layer, the IO layer, and the CXL evaluator and just write ? after each — every foreign error is auto-lifted into the right PipelineError variant. No match, no manual map_err. The six From impls are the seams that let one error type absorb six others.
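
The auto-lifting can be sketched with nothing but std. In this sketch the enum is a stand-in, Parse is a hypothetical variant playing the role of a subsystem error, and read_count is an invented function; only the From-plus-? mechanism mirrors what the text describes.

```rust
use std::io::{self, Read};
use std::num::ParseIntError;

// Stand-in error type; `Parse` is hypothetical, standing in for a subsystem error.
#[allow(dead_code)] // payloads are only inspected through Debug in this sketch
#[derive(Debug)]
enum PipelineError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl From<io::Error> for PipelineError {
    fn from(e: io::Error) -> Self { Self::Io(e) }
}
impl From<ParseIntError> for PipelineError {
    fn from(e: ParseIntError) -> Self { Self::Parse(e) }
}

// Two different foreign error types, one return type, no map_err.
fn read_count(mut src: impl Read) -> Result<u64, PipelineError> {
    let mut text = String::new();
    src.read_to_string(&mut text)?;      // io::Error      -> PipelineError::Io
    let n = text.trim().parse::<u64>()?; // ParseIntError  -> PipelineError::Parse
    Ok(n)
}

fn main() {
    println!("{:?}", read_count("42\n".as_bytes()));
    println!("{:?}", read_count("oops".as_bytes()));
}
```

Each ? expands to roughly "on Err(e), return Err(From::from(e))", which is why the compiler demands a From impl for every error type that ? touches inside the function.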

Here’s where the type earns its keep. The variants split into two fates:

  • Recoverable (per-record data errors). A Value that won’t cast, a cxl::eval::EvalError on one row. Under the right policy these are sent to a dead-letter queue (DLQ) — the bad record is set aside and the run continues.
  • Fatal (always abort). Internal, MemoryBudgetExceeded, schema mismatches, sort-order violations. These stop the run regardless of policy. The variant docs say so in capital letters: “ALWAYS aborts the run regardless of ErrorStrategy::Continue.”
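
One way to picture the two fates is as a classification over the variants. The sketch below is hypothetical: is_fatal is an invented helper, and the excerpts shown in this lesson do not say whether clinker has such a method. The enum is a three-variant stand-in with a String in place of cxl::eval::EvalError.

```rust
// Stand-in enum; field sets are trimmed and `Eval` carries a String
// instead of cxl::eval::EvalError.
#[allow(dead_code)] // payloads are not read in this sketch
#[derive(Debug)]
enum PipelineError {
    Eval(String),
    Internal { op: &'static str, node: String, detail: String },
    MemoryBudgetExceeded { node: String, used: u64, limit: u64 },
}

impl PipelineError {
    // Hypothetical helper, not taken from clinker's source.
    // No wildcard arm: adding a variant forces a decision about its fate.
    fn is_fatal(&self) -> bool {
        match self {
            Self::Eval(_) => false,
            Self::Internal { .. } | Self::MemoryBudgetExceeded { .. } => true,
        }
    }
}

fn main() {
    let bad_row = PipelineError::Eval("cannot cast 'abc' to int".to_string());
    let bug = PipelineError::Internal {
        op: "merge",
        node: "n7".to_string(),
        detail: "inputs not sorted".to_string(),
    };
    println!("bad row fatal? {}", bad_row.is_fatal());
    println!("engine bug fatal? {}", bug.is_fatal());
}
```

Leaving out the `_ =>` arm is the design point: the compiler rejects the match until every new variant has been assigned to one class or the other.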

Which policy governs the recoverable ones is a config setting:

clinker-plan · pipeline.rs · ErrorStrategy type @47d2e12
pub enum ErrorStrategy {
    FailFast, // any error stops the run (the default)
    Continue, // recoverable errors go to the DLQ; keep going
    BestEffort,
}

And the actual routing — the place a per-record eval error meets the policy — is one function in the executor:

clinker-exec · dispatch.rs · dispatch_transform_eval_error fn @47d2e12
crates/clinker-exec/src/executor/dispatch.rs
fn dispatch_transform_eval_error(/* … */) -> Result</* … */, PipelineError> {
    if ctx.strategy == ErrorStrategy::FailFast {
        return Err(eval_err.into()); // propagate; the ? at the call site aborts
    }
    // otherwise: classify and route the bad record to the DLQ, run continues
}

The crucial asymmetry: only recoverable errors flow through this router. The fatal variants are never offered to the DLQ at all — Internal and its kin are constructed and ?-propagated directly, bypassing this function entirely. So ErrorStrategy::Continue can keep a job alive through a million bad rows, but it cannot suppress a clinker bug. That’s by design: a Continue policy is a statement about your data, never a license to limp on through a broken engine.

Pause on Internal { op, node, detail }. Its doc calls it “a Clinker bug, not a data error.” It exists for the cases that the plan/runtime boundary (last lesson) was supposed to make impossible — an invariant that compilation proved, tripped anyway at run time. Routing such a thing to the DLQ would be exactly wrong: it would hide a bug behind a “bad record” label and let a broken run produce plausible-looking output. Making Internal always-fatal means the engine fails loudly at the first sign it has violated its own contract — the run dies, the operator sees it, nobody trusts corrupt output. The error taxonomy is how the engine refuses to paper over its own bugs.

?, From, and the abort/quarantine split — all in std:

rust // editable
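
A minimal std-only sketch of such a program follows. Everything in it is an assumption made for illustration: process is a hypothetical per-record function that parses an integer and doubles it, ParseIntError stands in for cxl::eval::EvalError, the DLQ is a plain Vec, and ErrorStrategy is trimmed to two variants.

```rust
use std::num::ParseIntError;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorStrategy { FailFast, Continue }

#[allow(dead_code)] // `Internal` is unused until you try the exercise
#[derive(Debug)]
enum PipelineError {
    Eval(ParseIntError),          // per-record, recoverable
    Internal { detail: String },  // engine bug, always fatal
}

impl From<ParseIntError> for PipelineError {
    fn from(e: ParseIntError) -> Self { Self::Eval(e) }
}

// Process one record; `?` lifts ParseIntError into PipelineError::Eval.
fn process(record: &str) -> Result<i64, PipelineError> {
    let n: i64 = record.parse()?;
    Ok(n * 2)
}

// Returns (outputs, dead-letter queue), or the error that aborted the run.
fn run(
    records: &[&str],
    strategy: ErrorStrategy,
) -> Result<(Vec<i64>, Vec<String>), PipelineError> {
    let mut out = Vec::new();
    let mut dlq = Vec::new();
    for &rec in records {
        match process(rec) {
            Ok(v) => out.push(v),
            // Only the recoverable class is ever offered to the policy.
            Err(PipelineError::Eval(_)) if strategy == ErrorStrategy::Continue => {
                dlq.push(rec.to_string());
            }
            // FailFast, or any fatal variant under any policy: abort the run.
            Err(e) => return Err(e),
        }
    }
    Ok((out, dlq))
}

fn main() {
    let records = ["12", "oops", "37"];
    println!("Continue: {:?}", run(&records, ErrorStrategy::Continue));
    println!("FailFast: {:?}", run(&records, ErrorStrategy::FailFast));
}
```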

Run it. Under Continue, "oops" is quarantined and "37" still processes; under FailFast, the first bad record stops everything. Now change process to return Internal for "oops" and watch it abort under both policies — that’s the fatal class refusing to be quarantined.

// quick check

Why does an Internal error abort the run even under ErrorStrategy::Continue, while an Eval error can be sent to the DLQ?

You can now read the engine’s failure taxonomy and say why some errors quarantine a record while others kill the run. One subsystem has shown up in nearly every lesson — the expression language, CXL — but always from the outside. The final Phase 3 lesson goes inside it: how a formula becomes a typechecked, compile-once program that runs over every record.