Traits: the IO seam

Welcome to Phase 3 — Planning & Expressions. Phase 2 took the data layer apart cell by cell. Now we move up a layer to the machinery that turns a YAML file into the validated plan you’ve been running — and the seams where new behaviour plugs in. This first lesson is about the most important seam in the engine: how it reads and writes file formats it was never specifically written for.

You’ll be able to: read the FormatReader trait, explain how every format reaches the rest of the engine through it, and say why the engine stores readers as trait objects (Box<dyn FormatReader>) rather than concrete types.

The motivating problem: one engine, eight formats

Clinker ships readers and writers for CSV, JSON, XML, fixed-width, EDIFACT, X12, HL7, and SWIFT. That’s eight wildly different ways to lay bytes on disk. Yet the executor — the part that pushes records through the DAG — contains zero format-specific code. It never asks “is this a CSV?” How can the same execution loop drive all eight?

The answer is a trait: a contract that says what a reader can do without fixing which reader it is. Every format implements the same two methods, and the engine talks only to the contract.

clinker-format ·traits.rs ·FormatReader trait @47d2e12

/// Streaming record reader. Yields records one at a time.
pub trait FormatReader: Send {
    fn schema(&mut self) -> Result<Arc<Schema>, FormatError>;
    fn next_record(&mut self) -> Result<Option<Record>, FormatError>;
    // ... plus default-bodied methods for multi-file / envelope handling
}

Two required methods carry the whole seam: ask for the schema, then pull records until next_record returns Ok(None). A FormatReader is exactly “a thing the engine can ask for a schema and then drain, one Record at a time.” The write side is its mirror image — consume records one at a time and flush:

clinker-format ·traits.rs ·FormatWriter trait @47d2e12

/// Streaming record writer. Consumes records one at a time.
pub trait FormatWriter: Send {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError>;
    fn flush(&mut self) -> Result<(), FormatError>;
    // ... plus default-bodied document-framing methods
}
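Taken together, the two protocols can be sketched as a small pump: ask for the schema, pull until Ok(None), write each record, then flush. This is a minimal model rather than clinker's code. Schema, Record, and FormatError are simplified stand-ins, and VecReader, PipeWriter, and pump are hypothetical names invented for the sketch; only the four method signatures follow the excerpts above.

```rust
use std::sync::Arc;

// Simplified stand-ins for the engine's types (assumptions for this sketch).
struct Schema {
    columns: Vec<String>,
}
struct Record(Vec<String>);
#[derive(Debug)]
struct FormatError(String);

// The two required methods of each trait, as quoted in the excerpts.
trait FormatReader: Send {
    fn schema(&mut self) -> Result<Arc<Schema>, FormatError>;
    fn next_record(&mut self) -> Result<Option<Record>, FormatError>;
}

trait FormatWriter: Send {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError>;
    fn flush(&mut self) -> Result<(), FormatError>;
}

// Hypothetical in-memory reader, used only to exercise the loop.
struct VecReader {
    schema: Arc<Schema>,
    rows: std::vec::IntoIter<Vec<String>>,
}

impl VecReader {
    fn new(columns: &[&str], rows: Vec<Vec<&str>>) -> Self {
        let schema = Schema {
            columns: columns.iter().map(|c| c.to_string()).collect(),
        };
        let rows: Vec<Vec<String>> = rows
            .into_iter()
            .map(|row| row.into_iter().map(String::from).collect())
            .collect();
        VecReader { schema: Arc::new(schema), rows: rows.into_iter() }
    }
}

impl FormatReader for VecReader {
    fn schema(&mut self) -> Result<Arc<Schema>, FormatError> {
        Ok(Arc::clone(&self.schema))
    }
    fn next_record(&mut self) -> Result<Option<Record>, FormatError> {
        Ok(self.rows.next().map(Record))
    }
}

// Hypothetical writer: one pipe-delimited line per record, held in memory.
struct PipeWriter {
    pending: String, // written but not yet flushed
    flushed: String, // what a real writer would have pushed to its sink
}

impl FormatWriter for PipeWriter {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError> {
        self.pending.push_str(&record.0.join("|"));
        self.pending.push('\n');
        Ok(())
    }
    fn flush(&mut self) -> Result<(), FormatError> {
        self.flushed.push_str(&self.pending);
        self.pending.clear();
        Ok(())
    }
}

// Schema first, then pull until Ok(None), then flush. Errors propagate with `?`.
fn pump(
    reader: &mut dyn FormatReader,
    writer: &mut dyn FormatWriter,
) -> Result<usize, FormatError> {
    let schema = reader.schema()?;
    let mut count = 0;
    while let Some(record) = reader.next_record()? {
        if record.0.len() != schema.columns.len() {
            return Err(FormatError(format!("record {count} does not match the schema width")));
        }
        writer.write_record(&record)?;
        count += 1;
    }
    writer.flush()?;
    Ok(count)
}

fn main() {
    let mut reader = VecReader::new(&["id", "name"], vec![vec!["1", "a"], vec!["2", "b"]]);
    let mut writer = PipeWriter { pending: String::new(), flushed: String::new() };
    match pump(&mut reader, &mut writer) {
        Ok(n) => print!("copied {n} records:\n{}", writer.flushed),
        Err(e) => eprintln!("copy failed: {e:?}"),
    }
}
```

Note that pump names neither VecReader nor PipeWriter. It would work unchanged with any other pair of implementors.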

Notice the supertrait bound : Send. The doc comment is explicit about why: a reader is moved onto the executor’s per-source ingest thread, so it must be Send. It is deliberately not Sync — a single reader is driven by one thread, streaming, never shared. That bound is a small architectural decision encoded in the type.
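A small sketch of what the Send bound buys, under simplifying assumptions: the trait is cut down to one method, and CountingReader and ingest_on_thread are hypothetical names, not clinker's. The reader holds a Cell, which makes it Send but not Sync, and it can still be moved onto a worker thread because the trait only demands Send.

```rust
use std::cell::Cell;
use std::thread;

// One-method stand-in for the real trait; the `Send` supertrait is the point.
trait FormatReader: Send {
    fn next_record(&mut self) -> Option<String>;
}

// Hypothetical reader. `Cell<u32>` is Send but not Sync, so this struct is
// Send but not Sync as well.
struct CountingReader {
    remaining: Cell<u32>,
}

impl FormatReader for CountingReader {
    fn next_record(&mut self) -> Option<String> {
        let n = self.remaining.get();
        if n == 0 {
            return None;
        }
        self.remaining.set(n - 1);
        Some(format!("row {n}"))
    }
}

// Moves the boxed reader onto its own thread and drains it there.
// `thread::spawn` requires the closure (and so the reader) to be Send;
// the supertrait bound is what makes `Box<dyn FormatReader>` qualify.
fn ingest_on_thread(mut reader: Box<dyn FormatReader>) -> usize {
    let handle = thread::spawn(move || {
        let mut count = 0;
        while reader.next_record().is_some() {
            count += 1;
        }
        count
    });
    handle.join().expect("ingest thread panicked")
}

fn main() {
    let reader = Box::new(CountingReader { remaining: Cell::new(3) });
    println!("drained {} records", ingest_on_thread(reader));
}
```

Had the trait required Sync as well, CountingReader could not implement it, even though no second thread ever touches the reader.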

A minimal model: the trait is the seam

Strip the engine away and the pattern is small. A trait with one method, two implementors, and a function that works on any implementor:

rust // editable

trait FormatReader {
    fn next_record(&mut self) -> Option<String>;
}

struct CsvReader { rows: Vec<String> }
struct JsonReader { left: u32 }

impl FormatReader for CsvReader {
    fn next_record(&mut self) -> Option<String> {
        self.rows.pop()
    }
}
impl FormatReader for JsonReader {
    fn next_record(&mut self) -> Option<String> {
        if self.left == 0 { return None; }
        self.left -= 1;
        Some(format!("json row {}", self.left))
    }
}

// Drains ANY reader: it knows the trait, not the type.
fn drain(reader: &mut dyn FormatReader) -> usize {
    let mut n = 0;
    while reader.next_record().is_some() { n += 1; }
    n
}

fn main() {
    let mut csv = CsvReader { rows: vec!["a".into(), "b".into()] };
    let mut json = JsonReader { left: 3 };
    println!("csv rows:  {}", drain(&mut csv));
    println!("json rows: {}", drain(&mut json));
}


drain is the executor in miniature: it takes &mut dyn FormatReader and never learns whether it’s draining CSV or JSON. Add a third format tomorrow and drain doesn’t change by a character. That is the seam doing its job.

Why a trait object, and what `dyn` costs

The format a job uses isn’t known when the engine is compiled — it’s chosen at run time, from the plan. So the executor needs a single variable that can hold any reader. That’s a trait object: Box<dyn FormatReader> — a heap-allocated value plus a hidden pointer to a vtable (the table of “which next_record does this actual reader use?”). Calling through it is dynamic dispatch: the concrete method is looked up at run time.
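The "value plus vtable pointer" shape is visible from safe Rust. The sketch below reuses the toy readers from the minimal model (not clinker's real types) and adds a hypothetical helper, drain_all. It prints the size of a Box pointing at a concrete reader next to the size of a boxed trait object, then stores two different readers in one Vec.

```rust
use std::mem::size_of;

trait FormatReader {
    fn next_record(&mut self) -> Option<String>;
}

struct CsvReader { rows: Vec<String> }
struct JsonReader { left: u32 }

impl FormatReader for CsvReader {
    fn next_record(&mut self) -> Option<String> {
        self.rows.pop()
    }
}
impl FormatReader for JsonReader {
    fn next_record(&mut self) -> Option<String> {
        if self.left == 0 { return None; }
        self.left -= 1;
        Some(format!("json row {}", self.left))
    }
}

// One element type holds every implementor; each call to next_record
// goes through that element's vtable.
fn drain_all(readers: Vec<Box<dyn FormatReader>>) -> Vec<usize> {
    readers
        .into_iter()
        .map(|mut reader| {
            let mut n = 0;
            while reader.next_record().is_some() { n += 1; }
            n
        })
        .collect()
}

fn main() {
    // A Box of a sized type is a single pointer. A boxed trait object is a
    // fat pointer: the data pointer plus a pointer to the vtable.
    println!("Box<CsvReader>:        {} bytes", size_of::<Box<CsvReader>>());
    println!("Box<dyn FormatReader>: {} bytes", size_of::<Box<dyn FormatReader>>());

    let readers: Vec<Box<dyn FormatReader>> = vec![
        Box::new(CsvReader { rows: vec!["a".into(), "b".into()] }),
        Box::new(JsonReader { left: 3 }),
    ];
    println!("records per reader: {:?}", drain_all(readers));
}
```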

In real clinker, the concrete readers are themselves generic structs (CsvReader<R>, FixedWidthReader<R>, and so on), each generic over the byte source R. They’re built type-specifically, then immediately boxed into a trait object in one factory function so that everything downstream is uniform:

clinker-exec ·mod.rs ·RecordSource trait @47d2e12

// crates/clinker-exec/src/executor/ingest.rs — the dispatch boundary
fn build_format_reader(/* … */) -> Box<dyn FormatReader> {
    match &input.format {
        InputFormat::Csv(opts)  => Box::new(CsvReader::new(/* … */)),
        InputFormat::Json(opts) => Box::new(JsonReader::new(/* … */)),
        // ... one arm per format, each boxed into the same trait-object type
    }
}

This is worth pausing on, because it’s the shape of every plug-in seam in clinker: a typed enum (InputFormat) is matched once at the boundary, each arm constructs the right concrete reader, and all of them collapse into one Box<dyn FormatReader>. There is no string-keyed “format registry” — dispatch is over the closed enum you met in Phase 2, so an unknown format is a plan-time error, not a runtime lookup miss.
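Because the excerpt above elides its arguments, here is a self-contained sketch of the same shape. Everything in it is a stand-in: the option structs and their fields, the two-variant InputFormat, the line-based readers, and the (format, input) signature are assumptions made for illustration, not clinker's definitions.

```rust
trait FormatReader: Send {
    fn next_record(&mut self) -> Option<String>;
}

// Hypothetical option structs; the real ones are not shown in this lesson.
struct CsvOptions {
    delimiter: char,
}
struct JsonOptions {
    skip_blank: bool,
}

// Closed set of formats, reduced to two variants for the sketch.
enum InputFormat {
    Csv(CsvOptions),
    Json(JsonOptions),
}

struct CsvReader {
    lines: std::vec::IntoIter<String>,
    delimiter: char,
}
struct JsonReader {
    lines: std::vec::IntoIter<String>,
    skip_blank: bool,
}

impl FormatReader for CsvReader {
    // Splits on the configured delimiter and re-joins the fields with tabs.
    fn next_record(&mut self) -> Option<String> {
        let line = self.lines.next()?;
        Some(line.split(self.delimiter).collect::<Vec<_>>().join("\t"))
    }
}

impl FormatReader for JsonReader {
    // Treats each line as one record, optionally skipping blank lines.
    fn next_record(&mut self) -> Option<String> {
        loop {
            let line = self.lines.next()?;
            if self.skip_blank && line.trim().is_empty() {
                continue;
            }
            return Some(line);
        }
    }
}

fn owned_lines(input: &str) -> std::vec::IntoIter<String> {
    input.lines().map(String::from).collect::<Vec<_>>().into_iter()
}

// The only function that names concrete reader types.
fn build_format_reader(format: &InputFormat, input: &str) -> Box<dyn FormatReader> {
    match format {
        InputFormat::Csv(opts) => Box::new(CsvReader {
            lines: owned_lines(input),
            delimiter: opts.delimiter,
        }),
        InputFormat::Json(opts) => Box::new(JsonReader {
            lines: owned_lines(input),
            skip_blank: opts.skip_blank,
        }),
    }
}

// Downstream code sees only the trait object.
fn drain(mut reader: Box<dyn FormatReader>) -> usize {
    let mut n = 0;
    while reader.next_record().is_some() {
        n += 1;
    }
    n
}

fn main() {
    let csv = build_format_reader(&InputFormat::Csv(CsvOptions { delimiter: ';' }), "a;b\nc;d");
    let json = build_format_reader(&InputFormat::Json(JsonOptions { skip_blank: true }), "{}\n\n{}");
    println!("csv records:  {}", drain(csv));
    println!("json records: {}", drain(json));
}
```

One consequence of matching on an enum rather than looking up a string: the match must be exhaustive, so adding a variant without a corresponding arm is a compile error instead of a miss discovered at run time.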

What does dyn cost? One pointer-indirection per call. Here that’s invisible: a vtable call is dwarfed by the file IO and parsing that produce each record. Dynamic dispatch is the right trade exactly where the set of types is open-ended and chosen at run time, and the per-call cost is noise against the work being dispatched. The next lesson shows the opposite choice — generics — for the field-read hot path where that same pointer-indirection would not be noise.

One step further: the transport generalization

FormatReader assumes bytes: it decodes a stream. But a SQL SELECT cursor yields rows with no byte body at all. So one crate up, the executor defines a broader contract, RecordSource, and bridges the file case to it with a single impl on the boxed trait object:

impl RecordSource for Box<dyn FormatReader> {
    // a file transport reaches the row-oriented seam by wrapping its byte reader
}

You don’t need the details yet — Phase 4 returns to it. The point is the layering: a narrow, byte-oriented trait (FormatReader) nested inside a broader, transport-agnostic one (RecordSource), each a clean seam. Traits compose into layers the same way types do.
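To make the layering concrete without anticipating Phase 4, here is a toy version. The lesson does not show RecordSource's methods, so the single method next_row is an assumption, as are LineReader, SqlCursor, and count_rows. The sketch shows only the shape: a file reader reaches the broader seam through the bridging impl, while a row-native source implements the broader trait directly.

```rust
// Narrow, byte-oriented seam (cut down to one method for the sketch).
trait FormatReader: Send {
    fn next_record(&mut self) -> Option<String>;
}

// Broader, transport-agnostic seam. `next_row` is a hypothetical method name.
trait RecordSource {
    fn next_row(&mut self) -> Option<String>;
}

// The bridge: any boxed file reader is also a record source.
impl RecordSource for Box<dyn FormatReader> {
    fn next_row(&mut self) -> Option<String> {
        // Auto-deref reaches the inner `dyn FormatReader`.
        self.next_record()
    }
}

// Hypothetical file-backed reader.
struct LineReader {
    lines: std::vec::IntoIter<String>,
}
impl FormatReader for LineReader {
    fn next_record(&mut self) -> Option<String> {
        self.lines.next()
    }
}

// Hypothetical row-native source with no byte stream behind it.
struct SqlCursor {
    rows: std::vec::IntoIter<String>,
}
impl RecordSource for SqlCursor {
    fn next_row(&mut self) -> Option<String> {
        self.rows.next()
    }
}

// Written against the broader seam only.
fn count_rows(source: &mut dyn RecordSource) -> usize {
    let mut n = 0;
    while source.next_row().is_some() {
        n += 1;
    }
    n
}

fn main() {
    let mut file: Box<dyn FormatReader> = Box::new(LineReader {
        lines: vec!["a".to_string(), "b".to_string()].into_iter(),
    });
    let mut cursor = SqlCursor {
        rows: vec!["r1".to_string()].into_iter(),
    };
    println!("file rows:   {}", count_rows(&mut file));
    println!("cursor rows: {}", count_rows(&mut cursor));
}
```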

// quick check

Why does the engine store readers as Box<dyn FormatReader> instead of a concrete reader type?

The format comes from the plan at run time. Box<dyn FormatReader> is one type that can hold any implementor; the concrete method is resolved through a vtable. The per-call cost is negligible against file IO.

Read the real seam

✓ Checkpoint — the IO seam

💡 Hint 1

Start at the trait definition in traits.rs, then follow build_format_reader in crates/clinker-exec/src/executor/ingest.rs to see where concrete readers are boxed.

What the grep tour shows

The first two greps land on the trait declarations — two required methods each, both with a : Send supertrait. The third shows the factory: a match over the InputFormat enum whose arms each Box::new(...) a concrete reader into the shared Box<dyn FormatReader> return type. That single function is the entire dynamic-dispatch boundary for ingest.

// quick check

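A new format, Avro, is added. Which existing code must change to teach the executor to read it?

Only the dispatch boundary: InputFormat gains an Avro variant, and build_format_reader gains one arm that boxes the new reader. The execution loop stays untouched, because it only ever sees Box<dyn FormatReader>.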

You should be able to:

State the two required methods of FormatReader and describe what a full drain looks like
Explain why readers are stored as trait objects and what dynamic dispatch costs here
Explain why the : Send bound is on the trait but : Sync is not

Verify in the checkout:

grep -n 'pub trait FormatReader' crates/clinker-format/src/traits.rs
grep -n 'pub trait FormatWriter' crates/clinker-format/src/traits.rs
grep -rn 'Box<dyn FormatReader>' crates/clinker-exec/src/executor/ingest.rs

You’ve seen the engine’s open seam: a trait, implemented per format, boxed into a trait object at one boundary so the rest of the engine stays format-blind. Next: the other dispatch strategy — generics — and why the field-read hot path makes the opposite choice.