Span-preserving parse & strictness

You write colmn: when you meant column:. What should happen? A lenient parser shrugs, ignores the key it doesn’t recognise, and runs your pipeline with a silently missing setting, a bug you discover hours later in the output. Clinker takes the opposite stance: that typo is an error, raised at plan time, pointing at the exact line. Three design choices make that possible, and all three are worth stealing.

You’ll be able to: explain why all YAML flows through one from_str chokepoint, what deny_unknown_fields buys, and how Spanned<T> lets an error point back at the source line.

Choice one: every value remembers where it came from

When serde turns YAML into a Rust struct, it normally throws away where each value sat in the file. That’s fine until you need to say “the problem is here.” So clinker parses into Spanned<T> — a wrapper that keeps the value and its source location:

clinker-plan · yaml.rs · Spanned type @47d2e12

// re-exported by clinker from the serde-saphyr crate
pub struct Spanned<T> {
    pub value: T,
    pub referenced: Location,   // where this value is used in the source
    pub defined: Location,      // where it was defined (e.g. a YAML anchor)
}

A Spanned<PipelineNode> is a pipeline node that still knows its line and column. The engine holds the whole pipeline as Vec<Spanned<PipelineNode>> precisely so that any later complaint — a bad reference, a security rejection, a type error — can be rendered as a diagnostic that points at the source, the way a good compiler underlines the offending token rather than saying “error somewhere in your file.”

This is the same philosophy as last lesson’s ValidatedPath: carry the extra fact in the type rather than recomputing or guessing it later. There the fact was “screened”; here it’s “came from line N.”
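
To see what “carry the location in the type” buys at error time, here is a minimal std-only sketch of rendering a diagnostic from a stored location. The Location struct with its line and column fields, the simplified Spanned, and the render_diagnostic helper are all hypothetical stand-ins for illustration; they are not serde-saphyr’s or clinker’s actual API.

```rust
/// Hypothetical stand-in for a source location (1-based line and column).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    line: usize,
    column: usize,
}

/// Simplified span wrapper: the value plus where it was used in the source.
struct Spanned<T> {
    value: T,
    referenced: Location,
}

/// Render a compiler-style diagnostic: the message, the offending source
/// line, and a caret under the column the span points at.
fn render_diagnostic(source: &str, loc: Location, message: &str) -> String {
    // Look up the offending line; fall back to empty text if out of range.
    let text = source.lines().nth(loc.line.saturating_sub(1)).unwrap_or("");
    let pad = " ".repeat(loc.column.saturating_sub(1));
    format!(
        "error: {message}\n --> line {}, column {}\n  | {text}\n  | {pad}^",
        loc.line, loc.column
    )
}

fn main() {
    let source = "source: customers.csv\nwatermark:\n  column: updatd_at";
    // Pretend a parser handed us this value together with its location.
    let column = Spanned {
        value: String::from("updatd_at"),
        referenced: Location { line: 3, column: 11 },
    };
    let message = format!("no column named `{}`", column.value);
    println!("{}", render_diagnostic(source, column.referenced, &message));
}
```

Because the location travels with the value, the code that detects the problem never has to re-scan the file or guess: it reads the span off the value and hands it to the renderer.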

Choice two: unknown fields are rejected, not ignored

The strictness lives on the config structs themselves, via a serde attribute:

clinker-plan · source.rs · deny_unknown_fields doc @47d2e12

#[derive(Deserialize)]
#[serde(deny_unknown_fields)]
pub struct WatermarkConfig {
    pub column: String,
    // ... a stray `colmn:` key here is a PARSE ERROR, not a silent no-op
}

#[serde(deny_unknown_fields)] flips serde from “ignore keys I don’t recognise” to “refuse any key I don’t recognise.” That single attribute is what converts your colmn: typo from a silent misconfiguration into a loud, located failure before the pipeline runs. The attribute appears on config structs across the crate; it’s a house rule, not a one-off.

Choice three: one chokepoint for all of it

Strictness and span-tracking only hold if nothing sneaks around them. So clinker funnels all YAML parsing through a single function. No other code is permitted to call the underlying parser directly:

clinker-plan · yaml.rs · from_str fn @47d2e12

pub fn from_str<'de, T>(yaml: &'de str) -> Result<T, YamlError>
where
    T: Deserialize<'de>,
{
    if yaml.len() > MAX_INPUT_BYTES {
        // pre-parse rejection — cheap, before the parser sees a huge input
        return Err(YamlError(make_oversize_error(yaml.len())));
    }
    serde_saphyr::from_str_with_options(yaml, budget_options()).map_err(YamlError)
}

The module doc states the rule plainly: “This module is the single entry point for YAML parsing in clinker. No other code path is permitted to call serde_saphyr::from_str*.” Why insist on one door? Two reasons, both architectural:

Defence in depth. budget_options() caps input size (32 MB), nesting depth (256), and node count (100 000), disables !include, and enforces an alias/anchor ratio against “billion laughs” expansion attacks. A single chokepoint means those limits apply to every parse — there’s no forgotten code path that parses unbounded input.
Bus-factor containment. serde-saphyr is a pre-1.0, single-maintainer dependency. Routing every call through one wrapper means that if it ever needs replacing, there’s exactly one file to change, not a hundred call sites scattered across the crate.

A chokepoint is the parser-side cousin of the proof token: instead of trusting every caller to remember the limits, you make the one gate enforce them for everyone.
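
The “one door” idea can be sketched with nothing but module privacy. In this std-only toy, the raw parser is a private function, so code outside the module cannot reach it, and the only public entry point applies the limits first. Everything here is an assumption made for illustration: the tiny limit values, the two-spaces-per-level nesting rule, and the YamlError variants are not clinker’s real ones.

```rust
mod yaml {
    // Illustrative limits only; far smaller than anything a real config needs.
    const MAX_INPUT_BYTES: usize = 64;
    const MAX_DEPTH: usize = 2;

    #[derive(Debug, PartialEq)]
    pub enum YamlError {
        TooLarge { bytes: usize },
        TooDeep { line: usize },
    }

    /// Private stand-in for the underlying parser. Code outside this module
    /// cannot name it, so nothing can parse without going through `from_str`.
    fn raw_parse(input: &str, max_depth: usize) -> Result<Vec<(usize, String)>, YamlError> {
        let mut nodes = Vec::new();
        for (i, raw) in input.lines().enumerate() {
            // Toy nesting rule: two leading spaces per level.
            let depth = (raw.len() - raw.trim_start().len()) / 2;
            if depth > max_depth {
                return Err(YamlError::TooDeep { line: i + 1 });
            }
            nodes.push((depth, raw.trim().to_string()));
        }
        Ok(nodes)
    }

    /// The one public door: every caller gets the same size and depth limits.
    pub fn from_str(input: &str) -> Result<Vec<(usize, String)>, YamlError> {
        if input.len() > MAX_INPUT_BYTES {
            // Cheap rejection before the parser does any work.
            return Err(YamlError::TooLarge { bytes: input.len() });
        }
        raw_parse(input, MAX_DEPTH)
    }
}

fn main() {
    println!("{:?}", yaml::from_str("source:\n  path: a.csv"));
    println!("{:?}", yaml::from_str("a:\n  b:\n    c:\n      d: 1"));
    println!("{:?}", yaml::from_str(&"x".repeat(100)));
    // yaml::raw_parse("a: 1", 9); // would not compile: `raw_parse` is private
}
```

In this sketch the compiler enforces the rule. The real module wraps an external crate, which any code could still import directly, so there the rule is stated in the module doc rather than enforced by privacy.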

Strict + located, in miniature

No serde needed to feel it. Here’s a tiny config reader that is both strict (rejects unknown keys) and span-aware (knows the line), using only std:

rust // editable

struct Spanned { value: String, line: usize }

fn parse_strict(input: &str) -> Result<Vec<(String, Spanned)>, String> {
    let allowed = ["source", "column", "format"];
    let mut out = Vec::new();
    for (i, raw) in input.lines().enumerate() {
        let line = i + 1; // 1-based, like an editor
        let (key, val) = raw
            .split_once(':')
            .ok_or_else(|| format!("line {line}: expected `key: value`"))?;
        let key = key.trim();
        if !allowed.contains(&key) {
            // strict: unknown key is an error, and we know exactly where
            return Err(format!("line {line}: unknown field `{key}`"));
        }
        out.push((key.to_string(), Spanned { value: val.trim().into(), line }));
    }
    Ok(out)
}

fn main() {
    let good = "source: customers.csv\ncolumn: id\nformat: csv";
    match parse_strict(good) {
        Ok(fields) => {
            for (key, span) in &fields {
                // read the value AND its remembered source line
                println!("line {}: {} = {}", span.line, key, span.value);
            }
        }
        Err(e) => println!("rejected: {e}"),
    }

    // Typo: `colmn` instead of `column`.
    let typo = "source: customers.csv\ncolmn: id";
    match parse_strict(typo) {
        Ok(_) => println!("accepted (a lenient parser would do this: silently wrong)"),
        Err(e) => println!("rejected: {e}"),
    }
}


The good config parses; the typo is rejected with line 2: unknown field `colmn`. Swap the return Err(…) inside the if !allowed.contains(…) check for a continue and you’ve built the lenient parser: it accepts the typo and quietly drops the setting. That one branch is the whole difference between “fails loudly at plan time, here” and “fails mysteriously at run time, somewhere.”
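
For contrast, here is a sketch of that lenient variant, again std-only. The name parse_lenient is made up for this illustration, and the sketch drops span tracking to keep the difference visible: the only behavioural change from the strict reader is that unrecognised input is skipped rather than reported.

```rust
/// Lenient variant: unknown keys are skipped instead of rejected.
fn parse_lenient(input: &str) -> Vec<(String, String)> {
    let allowed = ["source", "column", "format"];
    let mut out = Vec::new();
    for raw in input.lines() {
        // Lines without a `key: value` shape are silently skipped too.
        let Some((key, val)) = raw.split_once(':') else { continue };
        let key = key.trim();
        if !allowed.contains(&key) {
            continue; // lenient: the unknown key vanishes without a trace
        }
        out.push((key.to_string(), val.trim().to_string()));
    }
    out
}

fn main() {
    let typo = "source: customers.csv\ncolmn: id";
    let fields = parse_lenient(typo);
    // The parse "succeeds", but the setting the user meant to provide is gone.
    let column = fields
        .iter()
        .find(|(k, _)| k.as_str() == "column")
        .map(|(_, v)| v.as_str());
    println!("parsed {} field(s); column = {:?}", fields.len(), column);
    // prints: parsed 1 field(s); column = None
}
```

Nothing in the return type even hints that a key was dropped, which is why the failure only surfaces later, far from its cause.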

// quick check

What does #[serde(deny_unknown_fields)] change about parsing a config?

deny_unknown_fields is the strictness lever: an unrecognised key becomes an error at parse time. (Remembering the source line is the separate job of Spanned.)

Trace the chokepoint

✓ Checkpoint — strict, located parsing

💡 Hint 1

Read the yaml.rs module doc comment at the top first — it states the single-entry-point rule and the two reasons. Then count how many config files carry deny_unknown_fields.

What the tour shows

from_str is the lone public parse entry; its module doc forbids any other call into serde_saphyr, and it applies a shared resource budget plus a pre-parse size check. The deny_unknown_fields grep returns a long list — it’s a crate-wide convention, not one struct. And Spanned is imported and used pervasively so that diagnostics can point at the source line. Together: one strict, bounded, location-preserving door for all config.

// quick check

Why route every YAML parse through a single wrapper function?

You should be able to:

You can explain why all YAML goes through one from_str and name one benefit
You can say what deny_unknown_fields does and why a silent-ignore default is dangerous
You can explain what Spanned<T> carries beyond the value and why that matters for errors

Verify in the checkout:

grep -n 'pub fn from_str' crates/clinker-plan/src/yaml.rs
grep -rn 'deny_unknown_fields' crates/clinker-plan/src/config/ | head
grep -n 'Spanned' crates/clinker-plan/src/yaml.rs | head

Two lessons, two flavours of “make the type carry the guarantee”: a proof token for security, spans-and-strictness for config. Both feed the same destination — the validated plan the executor runs. Next we meet that plan itself, and the typed handle that says “this has been fully compiled” the same unforgeable way ValidatedPath said “this was screened.”