PipelineError — recoverable vs fatal, encoded in the type

The Rust question for this lesson: the last three lessons each built error types scoped to a single layer (CoercionError for coercion, FormatError for the format readers, SpillError for disk spill, ChannelError for channel files), but a single pipeline run touches all of them. When a FormatError raised deep in a reader and a SpillError raised in the executor both have to travel to the same place, the top of the run, what type do they become? And once there, how does the runtime decide whether a failure means “skip this bad row and keep going” or “stop the entire run now”? This lesson answers both with PipelineError: the top-level vocabulary that aggregates every subsystem error, and whose variants encode Clinker’s recoverable-vs-fatal model. It completes Phase 4.

One error type to rule the run

A top-level error type does two jobs. First, aggregation: it has a variant for each leaf error, with a From impl so ? can lift any of them. Second, classification: its variants carry enough information for the runtime to decide how to react. Here’s the shape, in miniature — two leaf errors from two “subsystems,” aggregated and classified:

#[derive(Debug)]
struct ParseError(String); // a "bad row" from the read layer
#[derive(Debug)]
struct DiskError(String); // the spill volume filled, from the exec layer

// The top-level error AGGREGATES the leaf errors, plus its own variants.
#[derive(Debug)]
enum JobError {
    Parse(ParseError), // a bad row: recoverable
    Disk(DiskError),   // disk full: fatal
    Internal(String),  // OUR bug: always fatal
}

// From impls thread the leaf errors up: ? lifts a ParseError into a JobError.
impl From<ParseError> for JobError {
    fn from(e: ParseError) -> Self { JobError::Parse(e) }
}
impl From<DiskError> for JobError {
    fn from(e: DiskError) -> Self { JobError::Disk(e) }
}

impl JobError {
    // The variant decides routing: a bad row can be skipped; a full disk or a
    // bug cannot. This is the recoverable-vs-fatal model in one method.
    fn is_fatal(&self) -> bool {
        match self {
            JobError::Parse(_) => false, // recoverable: skip the row, keep going
            JobError::Disk(_) => true,   // fatal: stop the run
            JobError::Internal(_) => true,
        }
    }
}

// A worker whose failing step uses ? to lift a ParseError into a JobError:
fn process_row(raw: &str) -> Result<i64, JobError> {
    let n: i64 = raw.parse().map_err(|_| ParseError(format!("bad int: {raw}")))?;
    Ok(n * 2)
}

fn main() {
    let errors = [
        JobError::Parse(ParseError("bad row 7".into())),
        JobError::Disk(DiskError("no space left".into())),
        JobError::Internal("unreachable arm".into()),
    ];
    for e in &errors {
        let route = if e.is_fatal() { "ABORT run" } else { "skip row -> DLQ" };
        println!("{e:?} -> {route}");
    }
    println!("{:?}", process_row("21")); // Ok(42)
    println!("{:?}", process_row("xx")); // Err(Parse(ParseError("bad int: xx")))
}

Two ideas to hold onto. The From impls are the same ? on-ramps from lesson 8 — only now they lift leaf errors into the aggregate, so a function deep in the read layer can ? and have its ParseError arrive at the top as a JobError. And is_fatal shows where the recoverable-vs-fatal decision lives: in which variant the failure is. Not a boolean threaded through every call — the type carries it.
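
To see the on-ramps working across layers, here is a minimal sketch that reuses the toy's ParseError, DiskError and JobError. The functions read_field, spill and run, and the free_slots stand-in for disk space, are hypothetical names invented for this illustration; none of them is Clinker code.

```rust
#[derive(Debug)]
struct ParseError(String);
#[derive(Debug)]
struct DiskError(String);

#[derive(Debug)]
enum JobError {
    Parse(ParseError),
    Disk(DiskError),
}

impl From<ParseError> for JobError {
    fn from(e: ParseError) -> Self { JobError::Parse(e) }
}
impl From<DiskError> for JobError {
    fn from(e: DiskError) -> Self { JobError::Disk(e) }
}

// Read layer (hypothetical): knows only about ParseError.
fn read_field(raw: &str) -> Result<i64, ParseError> {
    raw.trim().parse().map_err(|_| ParseError(format!("bad int: {raw}")))
}

// Exec layer (hypothetical): knows only about DiskError.
// `free_slots` is an assumed stand-in for remaining disk space.
fn spill(values: &[i64], free_slots: usize) -> Result<usize, DiskError> {
    if values.len() > free_slots {
        return Err(DiskError(format!("need {} slots, have {free_slots}", values.len())));
    }
    Ok(values.len())
}

// Top of the run: two different leaf error types, one return type.
fn run(rows: &[&str], free_slots: usize) -> Result<usize, JobError> {
    let mut values = Vec::new();
    for raw in rows {
        values.push(read_field(raw)?); // ParseError -> JobError::Parse
    }
    let written = spill(&values, free_slots)?; // DiskError -> JobError::Disk
    Ok(written)
}

fn main() {
    println!("{:?}", run(&["1", "2"], 8));      // succeeds
    println!("{:?}", run(&["1", "x"], 8));      // fails in the read layer
    println!("{:?}", run(&["1", "2", "3"], 2)); // fails in the exec layer
}
```

Neither leaf function mentions JobError; only run's signature does, and ? selects the matching From impl at each call site.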

// quick check

In an aggregating error type like JobError, what is each From impl for?

Clinker’s PipelineError aggregates every subsystem

PipelineError is exactly this type at engine scale: a variant for each subsystem error, a From per leaf type, and a hand-written Display. Here is the shape (about twenty variants in full):

clinker-plan ·error.rs ·PipelineError type @47d2e12

#[derive(Debug)]
pub enum PipelineError {
    Config(crate::config::ConfigError),       // wraps a leaf error
    Schema(crate::schema::SchemaError),
    Format(clinker_format::FormatError),      // wraps lesson 9/10's FormatError
    Spill(crate::runtime_error::SpillError),  // wraps lesson 8's SpillError
    Io(std::io::Error),
    /// Plan-time invariant violated at runtime — Clinker bug, not a data
    /// error. ALWAYS aborts the run regardless of `ErrorStrategy::Continue`.
    Internal { op: &'static str, node: String, detail: String },
    /// Finalize-time accumulator failure. Routed to the DLQ under `Continue`,
    /// propagated under `FailFast`.
    Accumulator { transform: String, binding: String, source: AccumulatorError },
    /// ALWAYS aborts the run regardless of `ErrorStrategy::Continue`; this is
    /// a halt directive, not a per-record error.
    DlqRateExceeded { observed_rate: f64, max_rate: f64, /* … */ },
    // … ~20 variants total
}

The From impls thread every leaf error you’ve already met up into this one type:

clinker-format ·error.rs ·FormatError type @47d2e12

impl From<clinker_format::FormatError> for PipelineError {
    fn from(e: clinker_format::FormatError) -> Self { Self::Format(e) }
}
impl From<crate::runtime_error::SpillError> for PipelineError {
    fn from(e: crate::runtime_error::SpillError) -> Self { Self::Spill(e) }
}
// … plus From for ConfigError, SchemaError, EvalError, io::Error

So when a format reader deep in the executor returns a FormatError (lessons 9–10), the executor boundary writes reader.next_record()? and the From<FormatError> lifts it into a PipelineError::Format automatically — the lesson-8 ?/From mechanism, now operating one level up. SpillError (lesson 8) flows up the same way. Every leaf vocabulary you built drains into this single top-level type.

One thing to notice: PipelineError is hand-rolled — #[derive(Debug)], a hand-written Display, and hand-written From impls — not thiserror, even though lesson 10 just showed the derive. Its Display is elaborate (the Multiple variant joins a whole list of child errors; the Compilation variant prints multi-line CXL messages; several variants render E-coded diagnostics), and a few variants carry Arc<Schema> or a nested PipelineError. The team kept full control of that rendering by writing it out. The lesson of 9 and 10 stands: choose hand-rolled or derived per type — Clinker derives its leaf errors and hand-rolls this aggregate.
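
As a rough illustration of the control a hand-written Display buys, the sketch below renders a toy aggregate whose Multiple variant joins its children, one per line. The variant set, field names and message wording are assumptions made for this example; this is not PipelineError's actual rendering.

```rust
use std::fmt;

#[derive(Debug)]
struct ParseError(String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "bad row: {}", self.0)
    }
}

// A toy aggregate (hypothetical, not Clinker's type).
#[derive(Debug)]
enum JobError {
    Parse(ParseError),
    Internal { op: &'static str, detail: String },
    Multiple(Vec<JobError>), // a nested list of child errors
}

impl fmt::Display for JobError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Delegate to the leaf's own Display, adding a prefix.
            JobError::Parse(e) => write!(f, "parse error: {e}"),
            JobError::Internal { op, detail } => {
                write!(f, "internal error in {op}: {detail}")
            }
            // Multi-line rendering: a header, then one numbered line per child.
            JobError::Multiple(children) => {
                write!(f, "{} errors:", children.len())?;
                for (i, child) in children.iter().enumerate() {
                    write!(f, "\n  {}. {child}", i + 1)?;
                }
                Ok(())
            }
        }
    }
}

fn main() {
    let e = JobError::Multiple(vec![
        JobError::Parse(ParseError("row 7".into())),
        JobError::Internal { op: "sort", detail: "missing key".into() },
    ]);
    println!("{e}");
}
```

The Multiple arm loops over a list and recurses into each child's Display, which is the kind of rendering that is simpler to write out by hand than to express in a single format-string attribute.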

The recoverable-vs-fatal model

Here is the payoff, and the reason the variants are documented so carefully. Read the doc comments again — they aren’t description, they’re routing:

Internal { … } — “ALWAYS aborts the run regardless of ErrorStrategy::Continue.” A Clinker bug, not a data error; continuing would process corrupt state.
Accumulator { … } — “Routed to the DLQ under Continue, propagated under FailFast.” A real data failure, so it obeys the user’s strategy.
DlqRateExceeded { … } — “ALWAYS aborts… this is a halt directive.” Too many rows already went to the dead-letter queue; a configured ceiling was crossed.

That is the model. The runtime carries an ErrorStrategy — Continue (route a bad record to the dead-letter queue and keep processing) or FailFast (stop on the first failure). A recoverable error (a bad row: Format, Accumulator) obeys that strategy. A fatal error overrides it: Internal and the invariant-violation variants always abort because they signal a bug; the policy halts (DlqRateExceeded, SpillCapExceeded) always abort because a limit the user configured was reached. This is lesson 8’s strict-vs-lenient choice — fail loudly, or absorb and continue — lifted to the whole pipeline and turned into a typed vocabulary: the decision is encoded in which variant the failure is, readable straight off the type.
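
The routing described above can be sketched as a small loop. Everything here is a toy under stated assumptions: this ErrorStrategy has only the two modes named in the text, the dead-letter queue is a plain Vec, and the "!bug" input is a hypothetical trigger standing in for an invariant violation. Clinker's real executor is not shown.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorStrategy {
    Continue, // route a bad record to the DLQ and keep going
    FailFast, // stop on the first failure
}

#[derive(Debug, PartialEq)]
enum JobError {
    Parse(String),    // a bad row: recoverable
    Internal(String), // our bug: always fatal
}

impl JobError {
    fn is_fatal(&self) -> bool {
        match self {
            JobError::Parse(_) => false,
            JobError::Internal(_) => true,
        }
    }
}

// Hypothetical worker: "!bug" simulates an invariant violation.
fn process_row(raw: &str) -> Result<i64, JobError> {
    if raw == "!bug" {
        return Err(JobError::Internal("invariant violated".into()));
    }
    raw.parse::<i64>()
        .map_err(|_| JobError::Parse(format!("bad int: {raw}")))
}

// Returns (good values, dead-letter queue), or the error that stopped the run.
fn run(
    rows: &[&str],
    strategy: ErrorStrategy,
) -> Result<(Vec<i64>, Vec<JobError>), JobError> {
    let mut out = Vec::new();
    let mut dlq = Vec::new();
    for raw in rows {
        match process_row(raw) {
            Ok(v) => out.push(v),
            // A fatal variant overrides the strategy.
            Err(e) if e.is_fatal() => return Err(e),
            // A recoverable variant obeys it.
            Err(e) => match strategy {
                ErrorStrategy::Continue => dlq.push(e),
                ErrorStrategy::FailFast => return Err(e),
            },
        }
    }
    Ok((out, dlq))
}

fn main() {
    println!("{:?}", run(&["1", "x", "3"], ErrorStrategy::Continue));
    println!("{:?}", run(&["1", "x", "3"], ErrorStrategy::FailFast));
    println!("{:?}", run(&["1", "!bug", "3"], ErrorStrategy::Continue));
}
```

The strategy is consulted only in the recoverable arm; the fatal arm returns before the strategy is ever looked at, which is what "the variant overrides the strategy" means in code.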

The recoverable-vs-fatal classification is documented on each variant, not left for readers to infer. So the type isn’t just a list of ways the run can fail; it’s a list annotated with how the runtime must treat each one. That is the deepest reason to model failure with a precise enum rather than a string or an opaque box: the variant is the routing decision, and the compiler guarantees every failure is one of them.

Your turn

The outline’s exercise — categorize variants as recoverable vs fatal — plus a toy extension.

(a) For each of these five real PipelineError variants, decide whether it is recoverable (routed to the dead-letter queue under ErrorStrategy::Continue) or fatal (aborts the run regardless of strategy), and name the one phrase in its doc comment that tells you: Format(FormatError), Internal { … }, Accumulator { … }, DlqRateExceeded { … }, SortOrderViolation { … }.

(b) In the toy above, add a Network(NetError) variant (define a struct NetError(String)) with its From impl, and decide its is_fatal arm. Is a dropped connection mid-read recoverable or fatal? Justify it the way Clinker’s doc comments do — one sentence on what continuing would mean.

💡 Hint 1

(a) Look for the words “ALWAYS aborts” / “always aborts” (fatal) versus “Routed to the DLQ under Continue” or “per-record” (recoverable). A bug or a crossed policy ceiling is fatal; a single bad row is recoverable. (b) There’s no single right answer — but a transient network drop is often retried/skipped (recoverable), whereas an auth failure would be fatal. State which you mean and why continuing is safe or unsafe.

Show solution

(a)

Format(FormatError) — recoverable. A bad input record; the doc on FormatError itself says “the executor decides whether to abort or skip based on the error strategy,” so under Continue it goes to the DLQ.
Internal { … } — fatal. “ALWAYS aborts the run regardless of ErrorStrategy::Continue” — it’s a Clinker bug, and continuing would process corrupt state.
Accumulator { … } — recoverable. “Routed to the DLQ under Continue, propagated under FailFast” — a real data failure that obeys the strategy.
DlqRateExceeded { … } — fatal. “ALWAYS aborts… this is a halt directive” — a configured ceiling on DLQ volume was crossed.
SortOrderViolation { … } — fatal. “ALWAYS hard-aborts regardless of error strategy” — the user’s declared sort order was wrong, so the result would be incorrect.

(b) A reasonable modeling, treating a transient drop as recoverable:

#[derive(Debug)]
struct NetError(String);

// in enum JobError, add the variant:
//     Network(NetError),

impl From<NetError> for JobError {
    fn from(e: NetError) -> Self { JobError::Network(e) }
}

// in is_fatal(), add the arm:
JobError::Network(_) => false,  // recoverable: a transient drop, so skip/retry the row;
                                // continuing is safe because no committed state is corrupt

If instead the variant meant an authentication or configuration failure, you’d return true: every subsequent read would fail the same way, so continuing is pointless. The discipline is the same as Clinker’s — say, at the definition site, what continuing would mean.

Common misconceptions

“Aggregating errors throws away the original.” No — PipelineError wraps each leaf error in a variant (Format(FormatError)) and From moves it in whole; the Display even delegates back to it (write!(f, "format error: {e}")). The detail survives; only the static type is unified, so one Result<_, PipelineError> can carry any subsystem’s failure.
“ErrorStrategy::Continue means nothing ever aborts.” No — Continue routes recoverable per-record errors to the dead-letter queue, but fatal variants (Internal, the invariant violations, policy halts like DlqRateExceeded) abort regardless. The variant overrides the strategy.
“thiserror is always the right choice for an error type.” Not always — Clinker derives its leaf errors (ChannelError, lesson 10) but hand-rolls PipelineError, whose elaborate multi-line Display and Arc/nested-error variants the team chose to render by hand. Match the tool to the type.

✓ Checkpoint

// quick check

Where does the recoverable-vs-fatal decision live in Clinker's design?

You should be able to:

Explain the two jobs of a top-level error type: aggregation (From per leaf) and classification
Trace how a FormatError from a reader becomes a PipelineError at the executor boundary
State the recoverable-vs-fatal model and where each variant's routing is documented
Give one reason PipelineError is hand-rolled while ChannelError is derived

Verify in the checkout:

cargo test -p clinker-plan spill_cap_exceeded_renders_e320_distinct_from_oom
cargo test -p clinker-plan

Verify it for real

A real clinker-plan test pins the hand-written Display for one of the fatal policy-halt variants — it builds a SpillCapExceeded and checks the rendered message is unmistakable:

cargo test -p clinker-plan spill_cap_exceeded_renders_e320_distinct_from_oom

It constructs PipelineError::spill_cap_exceeded(…) and asserts to_string() contains the E320 code, the configured cap and current byte counts, and the phrase “not an out-of-memory condition” — so an operator who hits a disk-cap halt is never misled into chasing a memory leak. That’s the hand-rolled Display from this lesson, proven to render a precise, structured diagnostic. Run the whole cargo test -p clinker-plan suite to exercise the From conversions and the other variants’ rendering.
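
The same style of check can be tried on a toy. The sketch below is not Clinker's test or type: ToyError, its field names and the exact message wording are assumptions for illustration. Only the E320 code and the “not an out-of-memory condition” phrase come from the description above.

```rust
use std::fmt;

// A toy stand-in (hypothetical) for a policy-halt variant.
#[derive(Debug)]
enum ToyError {
    SpillCapExceeded { cap_bytes: u64, current_bytes: u64 },
}

impl fmt::Display for ToyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyError::SpillCapExceeded { cap_bytes, current_bytes } => write!(
                f,
                "E320: spill cap exceeded ({current_bytes} bytes used, cap {cap_bytes} bytes); \
                 this is not an out-of-memory condition"
            ),
        }
    }
}

fn main() {
    let msg = ToyError::SpillCapExceeded { cap_bytes: 1024, current_bytes: 2048 }.to_string();

    // Pin the parts an operator relies on, not the whole string.
    assert!(msg.contains("E320"));
    assert!(msg.contains("1024"));
    assert!(msg.contains("2048"));
    assert!(msg.contains("not an out-of-memory condition"));

    println!("{msg}");
}
```

Asserting on substrings rather than the full message keeps a test like this stable when the surrounding wording changes, while still failing if the code, the numbers or the key phrase disappear.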

Where this leads

That completes Phase 4. You can now read and design Rust’s whole failure model: Result and ? (lesson 8), an error type built by hand (9), the same generated by thiserror (10), and a top-level vocabulary that aggregates every subsystem error and encodes a recoverable-vs-fatal model (this one). Every fallible signature in Clinker is now legible to you.

Phase 5 turns from modeling failure to modeling behavior. The leaf-to-aggregate From impls you just saw are one kind of shared interface; next comes the general tool for them — traits: how RecordStorage, FormatReader, and friends let different concrete types present one common API, and how generics dispatch over that API without a runtime cost.