Span-preserving parse & strictness

You write colmn: when you meant column:. What should happen? A lenient parser shrugs, ignores the key it doesn’t recognise, and runs your pipeline with a silently missing setting, a bug you discover hours later in the output. Clinker takes the opposite stance: that typo is an error, raised at plan time, pointing at the exact line. Three design choices make that possible, and all three are worth stealing.

You’ll be able to: explain why all YAML flows through one from_str chokepoint, what deny_unknown_fields buys, and how Spanned<T> lets an error point back at the source line.

Choice one: every value remembers where it came from

When serde turns YAML into a Rust struct, it normally throws away where each value sat in the file. That’s fine until you need to say “the problem is here.” So clinker parses into Spanned<T> — a wrapper that keeps the value and its source location:

clinker-plan ·yaml.rs ·Spanned type @47d2e12
// re-exported by clinker from the serde-saphyr crate
pub struct Spanned<T> {
    pub value: T,
    pub referenced: Location, // where this value is used in the source
    pub defined: Location,    // where it was defined (e.g. a YAML anchor)
}

A Spanned<PipelineNode> is a pipeline node that still knows its line and column. The engine holds the whole pipeline as Vec<Spanned<PipelineNode>> precisely so that any later complaint — a bad reference, a security rejection, a type error — can be rendered as a diagnostic that points at the source, the way a good compiler underlines the offending token rather than saying “error somewhere in your file.”
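
To see why carrying the location pays off, here is a minimal std-only sketch of how a span can become a compiler-style diagnostic. The Location fields, the single-field Spanned, and render_diagnostic are hypothetical stand-ins for illustration, not clinker's or serde-saphyr's actual definitions.

```rust
// Hypothetical stand-ins for illustration; not the real clinker / serde-saphyr types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    line: usize,   // 1-based line number
    column: usize, // 1-based column number
}

#[derive(Debug)]
struct Spanned<T> {
    value: T,
    referenced: Location,
}

/// Build a message that shows the offending source line with a caret under the column.
fn render_diagnostic<T>(source: &str, item: &Spanned<T>, message: &str) -> String {
    let loc = item.referenced;
    // Look up the source line; fall back to an empty string if the span is out of range.
    let text = source.lines().nth(loc.line.saturating_sub(1)).unwrap_or("");
    let caret = format!("{}^", " ".repeat(loc.column.saturating_sub(1)));
    format!(
        "error: {message}\n --> line {}, column {}\n{text}\n{caret}",
        loc.line, loc.column
    )
}

fn main() {
    let source = "nodes:\n  - input: missing_node\n";
    // Pretend a span-preserving parser produced this value and recorded where it sat.
    let bad_ref = Spanned {
        value: String::from("missing_node"),
        referenced: Location { line: 2, column: 12 },
    };
    let message = format!("unknown node `{}`", bad_ref.value);
    println!("{}", render_diagnostic(source, &bad_ref, &message));
}
```

The check that discovers the bad reference can run long after parsing; because the location travels with the value, the report can still quote the original line.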

This is the same philosophy as last lesson’s ValidatedPath: carry the extra fact in the type rather than recomputing or guessing it later. There the fact was “screened”; here it’s “came from line N.”

Choice two: unknown fields are rejected, not ignored

The strictness lives on the config structs themselves, via a serde attribute:

clinker-plan ·source.rs ·deny_unknown_fields doc @47d2e12
#[derive(Deserialize)]
#[serde(deny_unknown_fields)]
pub struct WatermarkConfig {
    pub column: String,
    // ... a stray `colmn:` key here is a PARSE ERROR, not a silent no-op
}

#[serde(deny_unknown_fields)] flips serde from “ignore keys I don’t recognise” to “refuse any key I don’t recognise.” That single attribute is what converts your colmn: typo from a silent misconfiguration into a loud, located failure before the pipeline runs. The attribute appears on config structs across the crate; it’s a house rule, not a one-off.

Choice three: one chokepoint for all of it

Strictness and span-tracking only hold if nothing sneaks around them. So clinker funnels all YAML parsing through a single function. No other code is permitted to call the underlying parser directly:

clinker-plan ·yaml.rs ·from_str fn @47d2e12
pub fn from_str<'de, T>(yaml: &'de str) -> Result<T, YamlError>
where
    T: Deserialize<'de>,
{
    if yaml.len() > MAX_INPUT_BYTES {
        // pre-parse rejection — cheap, before the parser sees a huge input
        return Err(YamlError(make_oversize_error(yaml.len())));
    }
    serde_saphyr::from_str_with_options(yaml, budget_options()).map_err(YamlError)
}

The module doc states the rule plainly: “This module is the single entry point for YAML parsing in clinker. No other code path is permitted to call serde_saphyr::from_str*.” Why insist on one door? Two reasons, both architectural:

  • Defence in depth. budget_options() caps input size (32 MB), nesting depth (256), and node count (100 000), disables !include, and enforces an alias/anchor ratio against “billion laughs” expansion attacks. A single chokepoint means those limits apply to every parse — there’s no forgotten code path that parses unbounded input.
  • Bus-factor containment. serde-saphyr is a pre-1.0, single-maintainer dependency. Routing every call through one wrapper means that if it ever needs replacing, there’s exactly one file to change, not a hundred call sites scattered across the crate.

A chokepoint is the parser-side cousin of the proof token: instead of trusting every caller to remember the limits, you make the one gate enforce them for everyone.
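
The shape of a chokepoint can be sketched with nothing but std. In this toy version the "parser" is a private function inside a module, so the only public way in is the gate that applies the limits first. Everything here is an assumption made for illustration: raw_parse, the two YamlError variants, the indentation-based depth measure, and the deliberately tiny limits. Clinker's rule is stated in its module docs; this sketch swaps in module privacy to get a similar effect at compile time.

```rust
mod yaml {
    // Illustrative limits, tiny on purpose; not clinker's real values.
    pub const MAX_INPUT_BYTES: usize = 64;
    pub const MAX_DEPTH: usize = 3;

    #[derive(Debug, PartialEq)]
    pub enum YamlError {
        TooLarge(usize),
        TooDeep { line: usize },
    }

    // Nesting depth, treating two spaces of indentation as one level (a simplification).
    fn depth(line: &str) -> usize {
        (line.len() - line.trim_start_matches(' ').len()) / 2
    }

    // Private: code outside this module cannot call it, so it cannot skip the limits.
    fn raw_parse(input: &str) -> Vec<(usize, String)> {
        input
            .lines()
            .filter(|l| !l.trim().is_empty())
            .map(|l| (depth(l), l.trim().to_string()))
            .collect()
    }

    /// The single door: every caller gets the same checks before any parsing happens.
    pub fn from_str(input: &str) -> Result<Vec<(usize, String)>, YamlError> {
        // Cheap pre-parse rejection of oversized input.
        if input.len() > MAX_INPUT_BYTES {
            return Err(YamlError::TooLarge(input.len()));
        }
        // Reject input nested deeper than the budget allows, naming the line.
        for (i, line) in input.lines().enumerate() {
            if depth(line) > MAX_DEPTH {
                return Err(YamlError::TooDeep { line: i + 1 });
            }
        }
        Ok(raw_parse(input))
    }
}

fn main() {
    println!("{:?}", yaml::from_str("a:\n  b: 1\n"));
    println!("{:?}", yaml::from_str(&"x".repeat(100)));
    // yaml::raw_parse("a: 1"); // would not compile: `raw_parse` is private
}
```

Replacing the underlying parser in this layout means editing raw_parse and nothing else, which is the containment argument from the second bullet in miniature.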

No serde needed to feel it. Here’s a tiny config reader that is both strict (rejects unknown keys) and span-aware (knows the line), using only std:
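
A minimal sketch of such a reader follows. The function name parse_strict, the flat key: value format, and the field names are illustrative assumptions, not code from clinker.

```rust
/// Parse flat `key: value` lines, keeping each entry's 1-based line number.
/// Any key outside `allowed` is an error that names the line it sits on.
fn parse_strict(src: &str, allowed: &[&str]) -> Result<Vec<(String, String, usize)>, String> {
    let mut out = Vec::new();
    for (idx, raw) in src.lines().enumerate() {
        let line_no = idx + 1; // span-aware: remember where this entry came from
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blank lines and comments
        }
        let (key, value) = line
            .split_once(':')
            .ok_or_else(|| format!("line {line_no}: expected `key: value`"))?;
        let key = key.trim();
        // Strict: refuse any key we do not recognise.
        if !allowed.contains(&key) {
            return Err(format!("line {line_no}: unknown field `{key}`"));
        }
        out.push((key.to_string(), value.trim().to_string(), line_no));
    }
    Ok(out)
}

fn main() {
    let allowed = ["name", "column"];
    let good = "name: orders\ncolumn: updated_at\n";
    let typo = "name: orders\ncolmn: updated_at\n";
    println!("{:?}", parse_strict(good, &allowed));
    println!("{:?}", parse_strict(typo, &allowed));
}
```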

The good config parses; the typo is rejected with line 2: unknown field `colmn`. Swap the if !allowed.contains(…) check for a continue and you’ve built the lenient parser: it accepts the typo and quietly drops the setting. That one branch is the whole difference between “fails loudly at plan time, here” and “fails mysteriously at run time, somewhere.”

// quick check

What does #[serde(deny_unknown_fields)] change about parsing a config?

Two lessons, two flavours of “make the type carry the guarantee”: a proof token for security, spans-and-strictness for config. Both feed the same destination — the validated plan the executor runs. Next we meet that plan itself, and the typed handle that says “this has been fully compiled” the same unforgeable way ValidatedPath said “this was screened.”