
Traits: the IO seam

Welcome to Phase 3 — Planning & Expressions. Phase 2 took the data layer apart cell by cell. Now we move up a layer to the machinery that turns a YAML file into the validated plan you’ve been running — and the seams where new behaviour plugs in. This first lesson is about the most important seam in the engine: how it reads and writes file formats it was never specifically written for.

You’ll be able to: read the FormatReader trait, explain how every format reaches the rest of the engine through it, and say why the engine stores readers as trait objects (Box<dyn FormatReader>) rather than concrete types.

The motivating problem: one engine, eight formats


Clinker ships readers and writers for CSV, JSON, XML, fixed-width, EDIFACT, X12, HL7, and SWIFT. That’s eight wildly different ways to lay bytes on disk. Yet the executor — the part that pushes records through the DAG — contains zero format-specific code. It never asks “is this a CSV?” How can the same execution loop drive all eight?

The answer is a trait: a contract that says what a reader can do without fixing which reader it is. Every format implements the same two methods, and the engine talks only to the contract.

clinker-format ·traits.rs ·FormatReader trait @47d2e12
/// Streaming record reader. Yields records one at a time.
pub trait FormatReader: Send {
    fn schema(&mut self) -> Result<Arc<Schema>, FormatError>;
    fn next_record(&mut self) -> Result<Option<Record>, FormatError>;
    // ... plus default-bodied methods for multi-file / envelope handling
}

Two required methods carry the whole seam: ask for the schema, then pull records until next_record returns Ok(None). A FormatReader is exactly “a thing the engine can ask for a schema and then drain, one Record at a time.” The write side is its mirror image — consume records one at a time and flush:

clinker-format ·traits.rs ·FormatWriter trait @47d2e12
/// Streaming record writer. Consumes records one at a time.
pub trait FormatWriter: Send {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError>;
    fn flush(&mut self) -> Result<(), FormatError>;
    // ... plus default-bodied document-framing methods
}
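The two contracts fit together as a read loop feeding a write loop. The sketch below is a minimal illustration under stated assumptions, not clinker's code: `Record` is just a `String`, the schema is a plain `Vec<String>`, `FormatError` is a one-field struct, the `Send` bound is omitted, and `VecReader`, `LineWriter`, and `copy_all` are hypothetical names.

```rust
use std::io::Write;

// Hypothetical stand-ins for the real types (assumptions, not clinker's definitions).
#[derive(Debug)]
struct FormatError(String);
type Record = String;

trait FormatReader {
    fn schema(&mut self) -> Result<Vec<String>, FormatError>;
    fn next_record(&mut self) -> Result<Option<Record>, FormatError>;
}

trait FormatWriter {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError>;
    fn flush(&mut self) -> Result<(), FormatError>;
}

// Toy reader over rows that are already in memory.
struct VecReader {
    columns: Vec<String>,
    rows: std::vec::IntoIter<Record>,
}

impl FormatReader for VecReader {
    fn schema(&mut self) -> Result<Vec<String>, FormatError> {
        Ok(self.columns.clone())
    }
    fn next_record(&mut self) -> Result<Option<Record>, FormatError> {
        // Ok(None) signals clean end of input; Err would signal a decode failure.
        Ok(self.rows.next())
    }
}

// Toy writer: one record per line into any byte sink.
struct LineWriter<W: Write> {
    out: W,
}

impl<W: Write> FormatWriter for LineWriter<W> {
    fn write_record(&mut self, record: &Record) -> Result<(), FormatError> {
        writeln!(self.out, "{record}").map_err(|e| FormatError(e.to_string()))
    }
    fn flush(&mut self) -> Result<(), FormatError> {
        self.out.flush().map_err(|e| FormatError(e.to_string()))
    }
}

// Ask for the schema, pull until Ok(None), then flush. No format is named here.
fn copy_all(
    reader: &mut dyn FormatReader,
    writer: &mut dyn FormatWriter,
) -> Result<usize, FormatError> {
    let _schema = reader.schema()?;
    let mut count = 0;
    while let Some(record) = reader.next_record()? {
        writer.write_record(&record)?;
        count += 1;
    }
    writer.flush()?;
    Ok(count)
}

fn main() {
    let mut reader = VecReader {
        columns: vec!["id".to_string()],
        rows: vec!["1".to_string(), "2".to_string()].into_iter(),
    };
    let mut writer = LineWriter { out: Vec::<u8>::new() };
    let copied = copy_all(&mut reader, &mut writer).expect("copy failed");
    print!("{}", String::from_utf8_lossy(&writer.out));
    println!("copied {copied} records");
}
```

The `?` on `next_record()` is what separates the two ways a read can stop: an `Err` propagates out of the loop, while `Ok(None)` ends it normally.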

Notice the supertrait bound : Send. The doc comment is explicit about why: a reader is moved onto the executor’s per-source ingest thread, so it must be Send. It is deliberately not Sync — a single reader is driven by one thread, streaming, never shared. That bound is a small architectural decision encoded in the type.
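What the `Send` supertrait buys can be shown in a few lines. This is a sketch with hypothetical names (`Reader`, `Countdown`, `drain_on_thread`), not clinker's ingest code: because the trait requires `Send`, a boxed trait object can be moved into a spawned thread and driven there by that thread alone.

```rust
use std::thread;

// Hypothetical stand-in for a streaming reader trait. `Send` is a supertrait,
// so `Box<dyn Reader>` can be moved across threads.
trait Reader: Send {
    fn next_record(&mut self) -> Option<u32>;
}

// Toy implementor that yields n, n-1, ..., 1 and then stops.
struct Countdown(u32);

impl Reader for Countdown {
    fn next_record(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None;
        }
        let current = self.0;
        self.0 -= 1;
        Some(current)
    }
}

// Moves the reader onto a dedicated thread and drains it there.
// `thread::spawn` requires its closure to be `Send`, which holds only
// because the captured `Box<dyn Reader>` is `Send`.
fn drain_on_thread(mut reader: Box<dyn Reader>) -> Vec<u32> {
    let handle = thread::spawn(move || {
        let mut out = Vec::new();
        while let Some(record) = reader.next_record() {
            out.push(record);
        }
        out
    });
    handle.join().expect("ingest thread panicked")
}

fn main() {
    let records = drain_on_thread(Box::new(Countdown(3)));
    println!("{records:?}");
}
```

Dropping `: Send` from the trait definition makes the `thread::spawn` call fail to compile, which is the sense in which the bound encodes the threading decision in the type.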

Strip the engine away and the pattern is small. A trait with one method, two implementors, and a function that works on any implementor:

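A minimal sketch of that pattern, using hypothetical toy types rather than clinker's real ones: the trait is cut down to a single method returning `Option<String>` (no schema, no error type, no `Send` bound), and the two readers only hand back text that is already in memory.

```rust
// Toy version of the seam: one method.
trait FormatReader {
    // Returns the next record, or None when the input is exhausted.
    fn next_record(&mut self) -> Option<String>;
}

// Hypothetical CSV-ish reader: one record per line of text.
struct CsvReader {
    lines: Vec<String>,
    pos: usize,
}

impl CsvReader {
    fn new(text: &str) -> Self {
        Self {
            lines: text.lines().map(str::to_owned).collect(),
            pos: 0,
        }
    }
}

impl FormatReader for CsvReader {
    fn next_record(&mut self) -> Option<String> {
        let line = self.lines.get(self.pos).cloned()?;
        self.pos += 1;
        Some(line)
    }
}

// Hypothetical JSON-ish reader: the input was already split into objects.
struct JsonReader {
    objects: std::vec::IntoIter<String>,
}

impl JsonReader {
    fn new(objects: Vec<String>) -> Self {
        Self {
            objects: objects.into_iter(),
        }
    }
}

impl FormatReader for JsonReader {
    fn next_record(&mut self) -> Option<String> {
        self.objects.next()
    }
}

// Works on any implementor and never names a concrete reader type.
// Returns how many records were drained.
fn drain(reader: &mut dyn FormatReader) -> usize {
    let mut count = 0;
    while let Some(_record) = reader.next_record() {
        count += 1;
    }
    count
}

fn main() {
    let mut csv = CsvReader::new("a,1\nb,2\nc,3");
    let mut json = JsonReader::new(vec![r#"{"k":1}"#.to_owned()]);
    println!("csv records: {}", drain(&mut csv));
    println!("json records: {}", drain(&mut json));
}
```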

drain is the executor in miniature: it takes &mut dyn FormatReader and never learns whether it’s draining CSV or JSON. Add a third format tomorrow and drain doesn’t change by a character. That is the seam doing its job.

The format a job uses isn’t known when the engine is compiled; it’s chosen at run time, from the plan. So the executor needs a single variable that can hold any reader. That’s a trait object: Box<dyn FormatReader>, a pointer to a heap-allocated value paired with a hidden pointer to a vtable (the table of “which next_record does this actual reader use?”). Calling through it is dynamic dispatch: the concrete method is looked up at run time.
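The hidden vtable pointer is observable in the size of the box. In this sketch (`Reader` and `Counter` are hypothetical), a box of a concrete type is one pointer wide and a box of the trait object is two. The two-word width reflects how current Rust compilers lay out trait-object pointers; treat the exact number as an implementation detail rather than a language guarantee.

```rust
use std::mem::size_of;

// Hypothetical one-method trait, standing in for a reader contract.
trait Reader {
    fn next_record(&mut self) -> Option<u8>;
}

// Toy implementor: counts down to zero.
struct Counter(u8);

impl Reader for Counter {
    fn next_record(&mut self) -> Option<u8> {
        if self.0 == 0 {
            return None;
        }
        self.0 -= 1;
        Some(self.0)
    }
}

// Width of a box whose concrete type is known at compile time.
fn thin_pointer_width() -> usize {
    size_of::<Box<Counter>>()
}

// Width of a box that erases the concrete type behind the trait.
fn fat_pointer_width() -> usize {
    size_of::<Box<dyn Reader>>()
}

fn main() {
    println!("Box<Counter>:    {} bytes", thin_pointer_width());
    println!("Box<dyn Reader>: {} bytes", fat_pointer_width());

    // The call below goes through the vtable: the compiler only knows `dyn Reader`.
    let mut reader: Box<dyn Reader> = Box::new(Counter(2));
    while let Some(n) = reader.next_record() {
        println!("record {n}");
    }
}
```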

In real clinker, the concrete readers are themselves generic structs — CsvReader<R>, FixedWidthReader<R>, and so on, generic over the byte source R. They’re built type-specifically, then immediately boxed into a trait object at one factory function so that everything downstream is uniform:

clinker-exec ·mod.rs ·RecordSource trait @47d2e12
// crates/clinker-exec/src/executor/ingest.rs — the dispatch boundary
fn build_format_reader(/* … */) -> Box<dyn FormatReader> {
    match &input.format {
        InputFormat::Csv(opts) => Box::new(CsvReader::new(/* … */)),
        InputFormat::Json(opts) => Box::new(JsonReader::new(/* … */)),
        // ... one arm per format, each boxed into the same trait-object type
    }
}

This is worth pausing on, because it’s the shape of every plug-in seam in clinker: a typed enum (InputFormat) is matched once at the boundary, each arm constructs the right concrete reader, and all of them collapse into one Box<dyn FormatReader>. There is no string-keyed “format registry” — dispatch is over the closed enum you met in Phase 2, so an unknown format is a plan-time error, not a runtime lookup miss.
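The match-once-then-box shape can be run in isolation. Everything in this sketch is hypothetical (the two-variant `InputFormat`, the `DelimitedReader` and `WholeLineReader` types, and `build_reader`); it only illustrates how a closed enum collapses into one trait-object type at a single function.

```rust
// Hypothetical reader contract: each record is a list of fields.
trait Reader {
    fn next_record(&mut self) -> Option<Vec<String>>;
}

// Hypothetical closed set of formats. Adding a variant forces the match
// in `build_reader` to be updated, or the program will not compile.
enum InputFormat {
    Delimited { delimiter: char },
    WholeLine,
}

struct DelimitedReader {
    lines: std::vec::IntoIter<String>,
    delimiter: char,
}

struct WholeLineReader {
    lines: std::vec::IntoIter<String>,
}

// Helper: split text into owned lines.
fn split_lines(text: &str) -> std::vec::IntoIter<String> {
    text.lines().map(str::to_owned).collect::<Vec<_>>().into_iter()
}

impl Reader for DelimitedReader {
    fn next_record(&mut self) -> Option<Vec<String>> {
        let line = self.lines.next()?;
        Some(line.split(self.delimiter).map(str::to_owned).collect())
    }
}

impl Reader for WholeLineReader {
    fn next_record(&mut self) -> Option<Vec<String>> {
        self.lines.next().map(|line| vec![line])
    }
}

// The dispatch boundary: the only place that names concrete reader types.
fn build_reader(format: &InputFormat, text: &str) -> Box<dyn Reader> {
    match format {
        InputFormat::Delimited { delimiter } => Box::new(DelimitedReader {
            lines: split_lines(text),
            delimiter: *delimiter,
        }),
        InputFormat::WholeLine => Box::new(WholeLineReader {
            lines: split_lines(text),
        }),
    }
}

fn main() {
    let format = InputFormat::Delimited { delimiter: ',' };
    let mut reader = build_reader(&format, "a,b\nc,d");
    while let Some(fields) = reader.next_record() {
        println!("{fields:?}");
    }
}
```

Because the match has no wildcard arm, the compiler checks that every variant is handled, which is the compile-time counterpart of having no string-keyed registry to miss in.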

What does dyn cost? One pointer indirection per call, plus any inlining the compiler could otherwise have done across that call. Here that’s invisible: a vtable call is dwarfed by the file IO and parsing that produce each record. Dynamic dispatch is the right trade exactly where the concrete type is chosen at run time and the per-call cost is noise against the work being dispatched. The next lesson shows the opposite choice, generics, for the field-read hot path where that same indirection would not be noise.

One step further: the transport generalization


FormatReader assumes bytes: it decodes a stream. But a SQL SELECT cursor yields rows with no byte body at all. So one crate up, the executor defines a broader contract, RecordSource, and bridges the file case to it with a single impl on the boxed trait object:

crates/clinker-exec/src/source/mod.rs
impl RecordSource for Box<dyn FormatReader> {
    // a file transport reaches the row-oriented seam by wrapping its byte reader
}

You don’t need the details yet — Phase 4 returns to it. The point is the layering: a narrow, byte-oriented trait (FormatReader) nested inside a broader, transport-agnostic one (RecordSource), each a clean seam. Traits compose into layers the same way types do.
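The layering can be sketched with two toy traits. The method names (`next_row`), the `LineReader` and `RowCursor` types, and `collect_rows` are all assumptions made for illustration; the source does not show what RecordSource actually declares.

```rust
// Narrow, byte-oriented contract (toy version).
trait FormatReader {
    fn next_record(&mut self) -> Option<String>;
}

// Broader, transport-agnostic contract (hypothetical shape).
trait RecordSource {
    fn next_row(&mut self) -> Option<String>;
}

// The bridge: any boxed byte reader is also a row source.
impl RecordSource for Box<dyn FormatReader> {
    fn next_row(&mut self) -> Option<String> {
        // Deref through the Box to the underlying reader.
        (**self).next_record()
    }
}

// File-like transport: decodes newline-separated records from bytes.
struct LineReader {
    bytes: Vec<u8>,
    pos: usize,
}

impl FormatReader for LineReader {
    fn next_record(&mut self) -> Option<String> {
        if self.pos >= self.bytes.len() {
            return None;
        }
        let rest = &self.bytes[self.pos..];
        let end = rest.iter().position(|&b| b == b'\n').unwrap_or(rest.len());
        let record = String::from_utf8_lossy(&rest[..end]).into_owned();
        self.pos += end + 1; // skip the newline, or step past the end
        Some(record)
    }
}

// Cursor-like transport: rows exist already, there are no bytes to decode,
// so it implements the broader trait directly.
struct RowCursor {
    rows: std::vec::IntoIter<String>,
}

impl RecordSource for RowCursor {
    fn next_row(&mut self) -> Option<String> {
        self.rows.next()
    }
}

// Downstream code sees only the broader seam.
fn collect_rows(source: &mut dyn RecordSource) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(row) = source.next_row() {
        out.push(row);
    }
    out
}

fn main() {
    let mut file: Box<dyn FormatReader> = Box::new(LineReader {
        bytes: b"a\nb".to_vec(),
        pos: 0,
    });
    let mut cursor = RowCursor {
        rows: vec!["x".to_string()].into_iter(),
    };
    println!("{:?}", collect_rows(&mut file));
    println!("{:?}", collect_rows(&mut cursor));
}
```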

// quick check

Why does the engine store readers as Box<dyn FormatReader> instead of a concrete reader type?

You’ve seen the engine’s open seam: a trait, implemented per format, boxed into a trait object at one boundary so the rest of the engine stays format-blind. Next: the other dispatch strategy — generics — and why the field-read hot path makes the opposite choice.