PipelineError — recoverable vs fatal, encoded in the type
The Rust question for this lesson: the last three lessons built error types one layer at a
time (CoercionError for coercion, FormatError for the format readers, SpillError for disk
spill, ChannelError for channel files), but a single pipeline run touches all of them.
When a FormatError raised deep in a reader and a SpillError raised in the executor both have
to travel to the same place — the top of the run — what type do they become? And once there,
how does the runtime decide whether a failure means “skip this bad row and keep going” or
“stop the entire run now”? This lesson answers both with PipelineError: the top-level
vocabulary that aggregates every subsystem error, and whose variants encode Clinker’s
recoverable-vs-fatal model. It completes Phase 4.
One error type to rule the run
A top-level error type does two jobs. First, aggregation: it has a variant for each leaf
error, with a From impl so ? can lift any of them. Second, classification: its variants
carry enough information for the runtime to decide how to react. Here’s the shape, in miniature —
two leaf errors from two “subsystems,” aggregated and classified:
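A minimal sketch of such a toy, under stated assumptions: JobError, ParseError and is_fatal are the names used in the discussion that follows, while DiskError, parse_row, write_total and step are hypothetical names invented here for illustration.

```rust
// Leaf error from a hypothetical "read" subsystem.
#[derive(Debug)]
struct ParseError(String);

// Leaf error from a hypothetical "write" subsystem (name assumed).
#[derive(Debug)]
struct DiskError(String);

// The aggregate: one variant per leaf error.
#[derive(Debug)]
enum JobError {
    Parse(ParseError),
    Disk(DiskError),
}

// The From impls are what let `?` lift a leaf error into JobError.
impl From<ParseError> for JobError {
    fn from(e: ParseError) -> Self { JobError::Parse(e) }
}
impl From<DiskError> for JobError {
    fn from(e: DiskError) -> Self { JobError::Disk(e) }
}

impl JobError {
    // Classification lives in the variant, not in a flag passed around.
    fn is_fatal(&self) -> bool {
        match self {
            JobError::Parse(_) => false, // one bad row: skippable
            JobError::Disk(_) => true,   // output is broken: stop
        }
    }
}

fn parse_row(raw: &str) -> Result<i64, ParseError> {
    raw.trim()
        .parse::<i64>()
        .map_err(|_| ParseError(format!("not a number: {raw:?}")))
}

fn write_total(total: i64) -> Result<(), DiskError> {
    // Pretend the disk fills up once the total passes 100.
    if total > 100 { Err(DiskError("disk full".to_string())) } else { Ok(()) }
}

// Both uses of `?` convert a leaf error into JobError via the From impls.
fn step(raw: &str, total: i64) -> Result<i64, JobError> {
    let n = parse_row(raw)?;
    write_total(total + n)?;
    Ok(total + n)
}

fn main() {
    let mut total = 0;
    for raw in ["40", "oops", "50", "30"] {
        match step(raw, total) {
            Ok(t) => total = t,
            Err(e) if e.is_fatal() => {
                println!("fatal, stopping: {e:?}");
                break;
            }
            Err(e) => println!("skipping row: {e:?}"),
        }
    }
    println!("total = {total}");
}
```

The caller never inspects a separate flag: it asks the error itself whether the run can go on.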
Two ideas to hold onto. The From impls are the same ? on-ramps from lesson 8 — only now they
lift leaf errors into the aggregate, so a function deep in the read layer can ? and have its
ParseError arrive at the top as a JobError. And is_fatal shows where the recoverable-vs-fatal
decision lives: in which variant the failure is. Not a boolean threaded through every call — the
type carries it.
// quick check
In an aggregating error type like JobError, what is each From impl for?
Clinker’s PipelineError aggregates every subsystem
PipelineError is exactly this type at engine scale: a variant for each subsystem error, a From
per leaf type, and a hand-written Display. Here is the shape (about twenty variants in full):
clinker-plan · error.rs · PipelineError type @47d2e12

#[derive(Debug)]
pub enum PipelineError {
    Config(crate::config::ConfigError), // wraps a leaf error
    Schema(crate::schema::SchemaError),
    Format(clinker_format::FormatError), // wraps lesson 9/10's FormatError
    Spill(crate::runtime_error::SpillError), // wraps lesson 8's SpillError
    Io(std::io::Error),
    /// Plan-time invariant violated at runtime — Clinker bug, not a data
    /// error. ALWAYS aborts the run regardless of `ErrorStrategy::Continue`.
    Internal { op: &'static str, node: String, detail: String },
    /// Finalize-time accumulator failure. Routed to the DLQ under `Continue`,
    /// propagated under `FailFast`.
    Accumulator { transform: String, binding: String, source: AccumulatorError },
    /// ALWAYS aborts the run regardless of `ErrorStrategy::Continue`; this is
    /// a halt directive, not a per-record error.
    DlqRateExceeded { observed_rate: f64, max_rate: f64, /* … */ },
    // … ~20 variants total
}

The From impls thread every leaf error you’ve already met up into this one type:

clinker-format · error.rs · FormatError type @47d2e12

impl From<clinker_format::FormatError> for PipelineError {
    fn from(e: clinker_format::FormatError) -> Self { Self::Format(e) }
}
impl From<crate::runtime_error::SpillError> for PipelineError {
    fn from(e: crate::runtime_error::SpillError) -> Self { Self::Spill(e) }
}
// … plus From for ConfigError, SchemaError, EvalError, io::Error

So when a format reader deep in the executor returns a FormatError (lessons 9–10), the executor
boundary writes reader.next_record()? and the From<FormatError> lifts it into a
PipelineError::Format automatically — the lesson-8 ?/From mechanism, now operating one level
up. SpillError (lesson 8) flows up the same way. Every leaf vocabulary you built drains into
this single top-level type.
One thing to notice: PipelineError is hand-rolled — #[derive(Debug)], a hand-written
Display, and hand-written From impls — not thiserror, even though lesson 10 just showed the
derive. Its Display is elaborate (the Multiple variant joins a whole list of child errors; the
Compilation variant prints multi-line CXL messages; several variants render E-coded diagnostics),
and a few variants carry Arc<Schema> or a nested PipelineError. The team kept full control of
that rendering by writing it out. The lesson of 9 and 10 stands: choose hand-rolled or derived per
type — Clinker derives its leaf errors and hand-rolls this aggregate.
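To make the hand-written Display concrete, here is a minimal sketch with only two variants. The types are simplified stand-ins, and the exact text the Multiple arm produces is an assumption; only the "format error: {e}" delegation mirrors what this lesson quotes from the real type.

```rust
use std::fmt;

// Stand-in leaf error; the real FormatError carries more structure.
#[derive(Debug)]
struct FormatError(String);

impl fmt::Display for FormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

// Simplified aggregate: one wrapping variant, one list-of-children variant.
#[derive(Debug)]
enum PipelineError {
    Format(FormatError),
    Multiple(Vec<PipelineError>),
}

impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Delegate to the leaf's own Display, adding a prefix.
            PipelineError::Format(e) => write!(f, "format error: {e}"),
            // Render every child on its own line (layout assumed).
            PipelineError::Multiple(errors) => {
                write!(f, "{} errors:", errors.len())?;
                for e in errors {
                    write!(f, "\n  - {e}")?;
                }
                Ok(())
            }
        }
    }
}

fn main() {
    let err = PipelineError::Multiple(vec![
        PipelineError::Format(FormatError("row 3: unterminated quote".to_string())),
        PipelineError::Format(FormatError("row 9: bad UTF-8".to_string())),
    ]);
    println!("{err}");
}
```

A per-variant layout like the Multiple arm is the kind of rendering that is easier to express in a written-out match than in a single derive attribute string.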
The recoverable-vs-fatal model
Here is the payoff, and the reason the variants are documented so carefully. Read the doc comments again; they aren’t description, they’re routing:

- Internal { … }: “ALWAYS aborts the run regardless of ErrorStrategy::Continue.” A Clinker bug, not a data error; continuing would process corrupt state.
- Accumulator { … }: “Routed to the DLQ under Continue, propagated under FailFast.” A real data failure, so it obeys the user’s strategy.
- DlqRateExceeded { … }: “ALWAYS aborts… this is a halt directive.” Too many rows already went to the dead-letter queue; a configured ceiling was crossed.
That is the model. The runtime carries an ErrorStrategy — Continue (route a bad record to the
dead-letter queue and keep processing) or FailFast (stop on the first failure). A
recoverable error (a bad row: Format, Accumulator) obeys that strategy. A fatal error
overrides it: Internal and the invariant-violation variants always abort because they signal a
bug; the policy halts (DlqRateExceeded, SpillCapExceeded) always abort because a limit the
user configured was reached. This is lesson 8’s strict-vs-lenient choice — fail loudly, or absorb
and continue — lifted to the whole pipeline and turned into a typed vocabulary: the decision is
encoded in which variant the failure is, readable straight off the type.
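The routing rule can be sketched in a few lines. This is an illustration under stated assumptions, not Clinker's executor: the variants are trimmed stand-ins, and Disposition, route, and the is_fatal method on this toy PipelineError are hypothetical names introduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorStrategy {
    Continue,
    FailFast,
}

// Trimmed stand-ins for three of the variants discussed above.
#[derive(Debug)]
enum PipelineError {
    Format(String),
    Internal { detail: String },
    DlqRateExceeded { observed_rate: f64, max_rate: f64 },
}

// Hypothetical outcome type: what the runtime does with a failure.
#[derive(Debug, PartialEq)]
enum Disposition {
    SendToDlq,
    Abort,
}

impl PipelineError {
    // Hypothetical helper: the variant alone decides fatality.
    fn is_fatal(&self) -> bool {
        match self {
            PipelineError::Format(_) => false,
            PipelineError::Internal { .. } | PipelineError::DlqRateExceeded { .. } => true,
        }
    }
}

// Fatal variants override the strategy; recoverable ones obey it.
fn route(err: &PipelineError, strategy: ErrorStrategy) -> Disposition {
    if err.is_fatal() {
        return Disposition::Abort;
    }
    match strategy {
        ErrorStrategy::Continue => Disposition::SendToDlq,
        ErrorStrategy::FailFast => Disposition::Abort,
    }
}

fn main() {
    let failures = [
        PipelineError::Format("row 7: bad date".to_string()),
        PipelineError::Internal { detail: "plan invariant broken".to_string() },
        PipelineError::DlqRateExceeded { observed_rate: 0.31, max_rate: 0.10 },
    ];
    for err in &failures {
        for strategy in [ErrorStrategy::Continue, ErrorStrategy::FailFast] {
            println!("{err:?} under {strategy:?} -> {:?}", route(err, strategy));
        }
    }
}
```

Only the Format row changes outcome between the two strategies; the other two abort either way, which is the override described above.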
Your turn
The outline’s exercise (categorize variants as recoverable vs fatal), plus a toy extension.

(a) For each of these five real PipelineError variants, decide whether it is recoverable (routed to the dead-letter queue under ErrorStrategy::Continue) or fatal (aborts the run regardless of strategy), and name the one phrase in its doc comment that tells you: Format(FormatError), Internal { … }, Accumulator { … }, DlqRateExceeded { … }, SortOrderViolation { … }.

(b) In the toy above, add a Network(NetError) variant (define a struct NetError(String)) with its From impl, and decide its is_fatal arm. Is a dropped connection mid-read recoverable or fatal? Justify it the way Clinker’s doc comments do: one sentence on what continuing would mean.
💡 Hint 1
(a) Look for the words “ALWAYS aborts” / “always aborts” (fatal) versus “Routed to the DLQ
under Continue” or “per-record” (recoverable). A bug or a crossed policy ceiling is fatal; a
single bad row is recoverable.
(b) There’s no single right answer — but a transient network drop is often retried/skipped
(recoverable), whereas an auth failure would be fatal. State which you mean and why continuing is
safe or unsafe.
Show solution
(a)
- Format(FormatError): recoverable. A bad input record; the doc on FormatError itself says “the executor decides whether to abort or skip based on the error strategy,” so under Continue it goes to the DLQ.
- Internal { … }: fatal. “ALWAYS aborts the run regardless of ErrorStrategy::Continue”; it’s a Clinker bug, and continuing would process corrupt state.
- Accumulator { … }: recoverable. “Routed to the DLQ under Continue, propagated under FailFast”; a real data failure that obeys the strategy.
- DlqRateExceeded { … }: fatal. “ALWAYS aborts… this is a halt directive”; a configured ceiling on DLQ volume was crossed.
- SortOrderViolation { … }: fatal. “ALWAYS hard-aborts regardless of error strategy”; the user’s declared sort order was wrong, so the result would be incorrect.
(b) A reasonable modeling, treating a transient drop as recoverable:
#[derive(Debug)]
struct NetError(String);

impl From<NetError> for JobError {
    fn from(e: NetError) -> Self { JobError::Network(e) }
}

// in is_fatal():
JobError::Network(_) => false, // recoverable: a transient drop; skip/retry the row,
                               // continuing is safe because no committed state is corrupt

If instead the variant meant an authentication or configuration failure, you’d return true: every
subsequent read would fail the same way, so continuing is pointless. The discipline is the same as
Clinker’s — say, at the definition site, what continuing would mean.
Common misconceptions
- “Aggregating errors throws away the original.” No. PipelineError wraps each leaf error in a variant (Format(FormatError)) and From moves it in whole; the Display even delegates back to it (write!(f, "format error: {e}")). The detail survives; only the static type is unified, so one Result<_, PipelineError> can carry any subsystem’s failure.
- “ErrorStrategy::Continue means nothing ever aborts.” No. Continue routes recoverable per-record errors to the dead-letter queue, but fatal variants (Internal, the invariant violations, policy halts like DlqRateExceeded) abort regardless. The variant overrides the strategy.
- “thiserror is always the right choice for an error type.” Not always. Clinker derives its leaf errors (ChannelError, lesson 10) but hand-rolls PipelineError, whose elaborate multi-line Display and Arc/nested-error variants the team chose to render by hand. Match the tool to the type.
Verify it for real
A real clinker-plan test pins the hand-written Display for one of the fatal policy-halt
variants. It builds a SpillCapExceeded and checks that the rendered message is unmistakable:

cargo test -p clinker-plan spill_cap_exceeded_renders_e320_distinct_from_oom

It constructs PipelineError::spill_cap_exceeded(…) and asserts to_string() contains the E320
code, the configured cap and current byte counts, and the phrase “not an out-of-memory
condition” — so an operator who hits a disk-cap halt is never misled into chasing a memory leak.
That’s the hand-rolled Display from this lesson, proven to render a precise, structured
diagnostic. Run the whole cargo test -p clinker-plan suite to exercise the From conversions and
the other variants’ rendering.
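The technique of pinning a Display rendering can be sketched without the real crate. Everything here is a stand-in: the field names (cap_bytes, current_bytes) and the message wording are assumptions, and only the E320 code and the phrase “not an out-of-memory condition” come from the description above.

```rust
use std::fmt;

// Stand-in aggregate with a single policy-halt variant (fields assumed).
#[derive(Debug)]
enum PipelineError {
    SpillCapExceeded { cap_bytes: u64, current_bytes: u64 },
}

impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Message wording is invented for this sketch.
            PipelineError::SpillCapExceeded { cap_bytes, current_bytes } => write!(
                f,
                "E320: spill cap exceeded ({current_bytes} of {cap_bytes} bytes); \
                 this is not an out-of-memory condition"
            ),
        }
    }
}

fn main() {
    let err = PipelineError::SpillCapExceeded { cap_bytes: 1024, current_bytes: 2048 };
    let msg = err.to_string();

    // Pin the parts an operator relies on, not the whole sentence.
    assert!(msg.contains("E320"));
    assert!(msg.contains("1024"));
    assert!(msg.contains("2048"));
    assert!(msg.contains("not an out-of-memory condition"));

    println!("{msg}");
}
```

Asserting on substrings rather than the full string keeps the check focused on the diagnostic code, the byte counts and the disambiguating phrase, so incidental wording can change without breaking it.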
Where this leads
That completes Phase 4. You can now read and design Rust’s whole failure model: Result and
? (lesson 8), an error type built by hand (9), the same generated by thiserror (10), and a
top-level vocabulary that aggregates every subsystem error and encodes a recoverable-vs-fatal model
(this one). Every fallible signature in Clinker is now legible to you.
Phase 5 turns from modeling failure to modeling behavior. The leaf-to-aggregate From impls you
just saw are one kind of shared interface; next comes the general tool for them — traits: how
RecordStorage, FormatReader, and friends let different concrete types present one common API,
and how generics dispatch over that API without a runtime cost.