Error handling

A pipeline chewing through ten million records will hit failures of very different weight. One malformed row in a CSV is data — annoying, expected, and no reason to throw away the other 9,999,999 records. A disk filling mid-write is infrastructure. And an invariant the compiler proved at plan time being violated at run time is a bug in clinker itself — which must stop everything, loudly. A good error type doesn’t just say that something failed; it encodes which kind, because the kind decides the fate of the run.

You’ll be able to: read the PipelineError enum, explain how From impls let ? propagate subsystem errors, and describe why an Internal error always aborts while a per-record Eval error can be quarantined.

One enum that aggregates every failure

PipelineError is the engine’s top-level runtime error. It’s a sum type with one variant per subsystem failure, plus a set of specific “this went wrong” variants:

clinker-plan ·error.rs ·PipelineError type @47d2e12

#[derive(Debug)]
pub enum PipelineError {
    Config(crate::config::ConfigError),
    Format(clinker_format::FormatError),
    Eval(cxl::eval::EvalError),
    Io(std::io::Error),
    /// Plan-time invariant violated at runtime — Clinker bug, not a data
    /// error. ALWAYS aborts the run regardless of ErrorStrategy::Continue.
    Internal { op: &'static str, node: String, detail: String },
    MemoryBudgetExceeded { node: String, used: u64, limit: u64, /* … */ },
    // ... ~25 variants; many documented "Always aborts the run"
}

Notice this enum is hand-written — no thiserror derive. error.rs writes its own Display, its own impl std::error::Error, and its own From conversions. That’s a deliberate choice for a type this central: the Display strings are diagnostics shown to users, worth hand-tuning. (You’ll meet thiserror elsewhere in the codebase; it’s a fine tool, just not used for this type.)
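To get a feel for what "hand-written" costs and buys, here is a minimal sketch on a two-variant stand-in. The name MiniError, the message wording, and the choice to expose the wrapped io::Error through source() are all assumptions made for illustration; they are not copied from error.rs.

```rust
use std::error::Error;
use std::fmt;
use std::io;

// Hypothetical stand-in for PipelineError; names and messages are illustrative.
#[derive(Debug)]
enum MiniError {
    Io(io::Error),
    Internal { op: &'static str, node: String, detail: String },
}

// Hand-written Display: every arm is a user-facing diagnostic we control.
impl fmt::Display for MiniError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MiniError::Io(e) => write!(f, "I/O failure: {e}"),
            MiniError::Internal { op, node, detail } => {
                write!(f, "internal error in {op} at node '{node}': {detail}")
            }
        }
    }
}

// Hand-written Error impl: report the wrapped error as the source, if any.
impl Error for MiniError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            MiniError::Io(e) => Some(e),
            MiniError::Internal { .. } => None,
        }
    }
}

fn main() {
    let bug = MiniError::Internal {
        op: "sort",
        node: "orders".to_string(),
        detail: "input not sorted".to_string(),
    };
    println!("{bug}");

    let io_err = MiniError::Io(io::Error::new(io::ErrorKind::Other, "disk full"));
    println!("{io_err} (has source: {})", io_err.source().is_some());
}
```

The trade is a few dozen lines of boilerplate in exchange for full control over every message an operator will read.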

`From` impls are what make `?` work

The magic word in idiomatic Rust error handling is ?. Writing let r = next_record()?; means “if this is Ok, unwrap the value into r; if it is Err, return it from the enclosing function, converting it to my error type on the way out.” That conversion is powered by the From trait. Each subsystem error gets a hand-written From into PipelineError:

impl From<std::io::Error> for PipelineError {
    fn from(e: std::io::Error) -> Self { Self::Io(e) }
}
impl From<cxl::eval::EvalError> for PipelineError {
    fn from(e: cxl::eval::EvalError) -> Self { Self::Eval(e) }
}
// ... ConfigError, FormatError, SchemaError, SpillError

With those in place, a function returning Result<_, PipelineError> can call into the format layer, the IO layer, and the CXL evaluator and just write ? after each — every foreign error is auto-lifted into the right PipelineError variant. No match, no manual map_err. The six From impls are the seams that let one error type absorb six others.
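The same effect can be reproduced with nothing but std. In this sketch, io::Error and ParseIntError play the role of two subsystem errors, and MiniError, parse_count, and read_count are hypothetical names invented for the example rather than anything in clinker.

```rust
use std::fs;
use std::io;
use std::num::ParseIntError;

// Hypothetical stand-in: two std error types act as "subsystem" errors.
#[derive(Debug)]
enum MiniError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl From<io::Error> for MiniError {
    fn from(e: io::Error) -> Self { Self::Io(e) }
}
impl From<ParseIntError> for MiniError {
    fn from(e: ParseIntError) -> Self { Self::Parse(e) }
}

fn parse_count(text: &str) -> Result<u64, MiniError> {
    let n = text.trim().parse::<u64>()?; // ParseIntError -> MiniError::Parse
    Ok(n)
}

// Two different foreign error types, one return type, no map_err.
fn read_count(path: &str) -> Result<u64, MiniError> {
    let text = fs::read_to_string(path)?; // io::Error -> MiniError::Io
    parse_count(&text)
}

fn main() {
    println!("{:?}", parse_count(" 42\n"));
    println!("{:?}", parse_count("abc"));
    println!("{:?}", read_count("/definitely/not/a/real/file"));
}
```

Each ? picks its From impl from the error type of the expression it follows, which is why the two call sites look identical but land in different variants.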

The taxonomy: abort versus quarantine

Here’s where the type earns its keep. The variants split into two fates:

Recoverable (per-record data errors). A Value that won’t cast, a cxl::eval::EvalError on one row. Under the right policy these are sent to a dead-letter queue (DLQ) — the bad record is set aside and the run continues.
Fatal (always abort). Internal, MemoryBudgetExceeded, schema mismatches, sort-order violations. These stop the run regardless of policy. The doc comment on Internal spells it out with a capitalized ALWAYS: “ALWAYS aborts the run regardless of ErrorStrategy::Continue.”

Which policy governs the recoverable ones is a config setting:

clinker-plan ·pipeline.rs ·ErrorStrategy type @47d2e12

pub enum ErrorStrategy {
    FailFast,    // any error stops the run (the default)
    Continue,    // recoverable errors go to the DLQ; keep going
    BestEffort,
}

And the actual routing — the place a per-record eval error meets the policy — is one function in the executor:

clinker-exec ·dispatch.rs ·dispatch_transform_eval_error fn @47d2e12

fn dispatch_transform_eval_error(/* … */) -> Result<…, PipelineError> {
    if ctx.strategy == ErrorStrategy::FailFast {
        return Err(eval_err.into());     // propagate — the ? at the call site aborts
    }
    // otherwise: classify and route the bad record to the DLQ, run continues
}

The crucial asymmetry: only recoverable errors flow through this router. The fatal variants are never offered to the DLQ at all — Internal and its kin are constructed and ?-propagated directly, bypassing this function entirely. So ErrorStrategy::Continue can keep a job alive through a million bad rows, but it cannot suppress a clinker bug. That’s by design: a Continue policy is a statement about your data, never a license to limp on through a broken engine.
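One way to picture that asymmetry is to give the router a signature that can only accept a recoverable failure. The sketch below does that with a Vec standing in for the DLQ. Every name in it (Strategy, MiniError, dispatch_eval_error, the empty-record "invariant") is a hypothetical simplification, and BestEffort is left out because the excerpt above does not document its behavior.

```rust
// Hypothetical miniature of the routing asymmetry; none of these names are clinker's.
#[derive(Debug, PartialEq)]
enum Strategy { FailFast, Continue }

#[derive(Debug, PartialEq)]
enum MiniError {
    Eval(String),           // recoverable: one bad record
    Internal(&'static str), // fatal: engine bug
}

// The only place the strategy is consulted. It takes just the recoverable
// payload, so a fatal error cannot be handed to it by construction.
fn dispatch_eval_error(
    strategy: &Strategy,
    record: &str,
    reason: String,
    dlq: &mut Vec<(String, String)>,
) -> Result<(), MiniError> {
    if *strategy == Strategy::FailFast {
        return Err(MiniError::Eval(reason));
    }
    dlq.push((record.to_string(), reason)); // quarantine; the run continues
    Ok(())
}

fn eval(record: &str) -> Result<i64, String> {
    record.parse::<i64>().map_err(|_| format!("not an integer: {record}"))
}

fn run(records: &[&str], strategy: Strategy) -> Result<(i64, Vec<(String, String)>), MiniError> {
    let mut dlq = Vec::new();
    let mut sum = 0;
    for r in records {
        if r.is_empty() {
            // Stand-in for a violated invariant: returned directly, never routed.
            return Err(MiniError::Internal("empty record reached the evaluator"));
        }
        match eval(r) {
            Ok(v) => sum += v,
            Err(reason) => dispatch_eval_error(&strategy, r, reason, &mut dlq)?,
        }
    }
    Ok((sum, dlq))
}

fn main() {
    println!("{:?}", run(&["1", "x", "2"], Strategy::Continue));
    println!("{:?}", run(&["1", "x", "2"], Strategy::FailFast));
    println!("{:?}", run(&["1", "", "2"], Strategy::Continue));
}
```

Under Continue the bad record lands in the DLQ and the sum still completes; the empty record aborts the run under either strategy because it never passes through the router.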

Why `Internal` is its own fate

Pause on Internal { op, node, detail }. Its doc calls it “a Clinker bug, not a data error.” It exists for the cases that the plan/runtime boundary (last lesson) was supposed to make impossible — an invariant that compilation proved, tripped anyway at run time. Routing such a thing to the DLQ would be exactly wrong: it would hide a bug behind a “bad record” label and let a broken run produce plausible-looking output. Making Internal always-fatal means the engine fails loudly at the first sign it has violated its own contract — the run dies, the operator sees it, nobody trusts corrupt output. The error taxonomy is how the engine refuses to paper over its own bugs.
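In practice, an Internal value tends to be built at the exact spot where runtime code relies on something the planner promised. Here is a minimal sketch of that pattern. The slot lookup, the function name slot, and the node name are invented for illustration; only the three field names op, node, and detail come from the enum shown earlier.

```rust
// Hypothetical stand-in carrying only the Internal variant.
#[derive(Debug)]
enum MiniError {
    Internal { op: &'static str, node: String, detail: String },
}

// Runtime lookup of a slot index that the planner is assumed to have validated.
// If the promise is broken, report a bug with enough context to locate it.
fn slot<'a>(buffers: &'a [Vec<u8>], node: &str, index: usize) -> Result<&'a Vec<u8>, MiniError> {
    buffers.get(index).ok_or_else(|| MiniError::Internal {
        op: "slot_lookup",
        node: node.to_string(),
        detail: format!("plan assigned slot {index}, but only {} exist", buffers.len()),
    })
}

fn main() {
    let buffers = vec![vec![1u8], vec![2u8]];
    println!("{:?}", slot(&buffers, "join_1", 1));
    println!("{:?}", slot(&buffers, "join_1", 5));
}
```

Returning a structured error instead of indexing directly (which would panic) keeps the failure loud while still telling the operator which operation and which node broke the contract.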

Build the taxonomy in miniature

?, From, and the abort/quarantine split — all in std:


#[derive(Debug)]
enum DataError { BadField(String) } // recoverable: one bad record

#[derive(Debug)]
enum PipelineError {
    Data(DataError),        // recoverable: may go to the DLQ
    Internal(&'static str), // a bug: always fatal
}

// From lets `?` convert DataError into PipelineError automatically.
impl From<DataError> for PipelineError {
    fn from(e: DataError) -> Self { PipelineError::Data(e) }
}

fn parse_age(s: &str) -> Result<u32, DataError> {
    s.parse().map_err(|_| DataError::BadField(format!("not a number: {s}")))
}

fn process(s: &str) -> Result<u32, PipelineError> {
    let age = parse_age(s)?; // From<DataError> fires here, no manual map_err
    if age > 200 {
        // 'impossible': treat as a bug, not as bad data
        return Err(PipelineError::Internal("age over 200 slipped past validation"));
    }
    Ok(age)
}

// The taxonomy decides each error's fate.
fn run(records: &[&str], fail_fast: bool) {
    for r in records {
        match process(r) {
            Ok(age) => println!("ok: {age}"),
            Err(PipelineError::Internal(msg)) => {
                println!("ABORT (bug): {msg}");
                return; // internal errors always stop the run
            }
            Err(PipelineError::Data(e)) => {
                if fail_fast {
                    println!("ABORT (fail-fast): {e:?}");
                    return;
                }
                println!("quarantine to DLQ, keep going: {e:?}");
            }
        }
    }
}

fn main() {
    let records = ["42", "oops", "37"];
    println!("-- Continue: recoverable errors quarantined --");
    run(&records, false);
    println!("-- FailFast: first error aborts --");
    run(&records, true);
}

Run it. Under Continue, "oops" is quarantined and "37" still processes; under FailFast, the first bad record stops everything. Now change process to return Internal for "oops" and watch it abort under both policies — that’s the fatal class refusing to be quarantined.

// quick check

Why does an Internal error abort the run even under ErrorStrategy::Continue, while an Eval error can be sent to the DLQ?

The taxonomy ties fate to kind. Continue is a statement about messy data, not permission to run a broken engine, so Internal (and other always-abort variants) bypass the DLQ router and propagate directly.

Read the error layer

✓ Checkpoint — error handling

💡 Hint 1

Count the impl From< lines — each one is a subsystem ? can absorb. Then read a couple of the variant doc comments that contain “Always aborts the run”; those are the fatal class.

What the tour establishes

PipelineError is hand-written with six From impls, so ? lifts ConfigError, FormatError, EvalError, SchemaError, io::Error, and SpillError into it. The “Always aborts the run” grep lists the fatal variants (Internal, MemoryBudgetExceeded, schema/sort-order violations, …). dispatch_transform_eval_error is the only router that consults ErrorStrategy, and it only ever sees recoverable per-record eval errors — the fatal variants never reach it. Fate follows kind.

// quick check

What does writing `let r = some_subsystem_call()?;` inside a fn returning Result<_, PipelineError> rely on?

? converts via From. Each subsystem error has a hand-written From into PipelineError, so the foreign error is auto-lifted into the right variant on the way out — no derive needed.

You should be able to:

You can explain how a From<E> impl lets ? convert a subsystem error into PipelineError
You can name a recoverable variant and a fatal variant and say what happens to each under Continue
You can explain why Internal is deliberately not DLQ-eligible

Verify in the checkout:

grep -n 'pub enum PipelineError' crates/clinker-plan/src/error.rs
grep -n 'impl From<' crates/clinker-plan/src/error.rs
grep -in 'Always aborts the run' crates/clinker-plan/src/error.rs | head
grep -n 'fn dispatch_transform_eval_error' crates/clinker-exec/src/executor/dispatch.rs

You can now read the engine’s failure taxonomy and say why some errors quarantine a record while others kill the run. One subsystem has shown up in nearly every lesson — the expression language, CXL — but always from the outside. The final Phase 3 lesson goes inside it: how a formula becomes a typechecked, compile-once program that runs over every record.