What is a record?
In Phase 0 you ran customer_etl and Clinker reported “5 records.” Phase 1 follows
one of those records all the way through the engine. But first: what is a
record? It’s the unit of data that flows through every pipeline, and it’s built from
three vocabulary types you’ll see everywhere.
You’ll be able to: name the three types that make up a record (Value, Record,
Schema) and explain how a CSV row becomes one. (We meet them shallowly here; Phase 2
opens each one up.)
A cell is a Value
Every single cell of data in Clinker is one Value, a closed set of nine shapes
(null, bool, integer, float, string, date, datetime, array, map):
clinker-record · value.rs · Value type @47d2e12

    pub enum Value {
        Null,
        Bool(bool),
        Integer(i64),
        Float(f64),
        String(FieldStr),
        Date(NaiveDate),
        DateTime(NaiveDateTime),
        Array(Vec<Value>),
        Map(Box<IndexMap<Box<str>, Value>>),
    }

When the CSV source reads Alice’s row, it doesn’t guess types: it reads every field
as a string (Value::String). Turning "15200" into a number is a later step (a
transform’s job), not the reader’s. So right after reading, Alice is a row of strings.
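To make the two steps concrete, here is a minimal sketch, not Clinker's actual reader or transform code. The `Value` below is a cut-down stand-in with three variants, a plain `String` replaces `FieldStr`, and `read_row` and `to_integer` are hypothetical helpers. Mapping a failed parse to `Null` is also an assumption made for illustration.

```rust
// Cut-down stand-in for Clinker's Value: only the variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Integer(i64),
    String(String), // the real enum wraps a FieldStr; plain String is an assumption
}

/// Hypothetical reader step: split one line on commas; every field stays a string.
/// (No quoting or escaping here; a real CSV reader handles both.)
fn read_row(line: &str) -> Vec<Value> {
    line.split(',')
        .map(|field| Value::String(field.to_string()))
        .collect()
}

/// Hypothetical transform step: try to turn a string cell into an integer.
/// A cell that does not parse becomes Null (an assumption for this sketch).
fn to_integer(v: &Value) -> Value {
    match v {
        Value::String(s) => s
            .trim()
            .parse::<i64>()
            .map(Value::Integer)
            .unwrap_or(Value::Null),
        other => other.clone(),
    }
}

fn main() {
    let row = read_row("Alice,15200");
    println!("{:?}", row); // every cell is still a string here
    println!("{:?}", to_integer(&row[1])); // the number appears only after the transform
}
```

The point of the split is that the reader never has to be right about types; the decision lives in one explicit place later in the pipeline.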
A row is a Record
A Record is one row: a list of Values lined up against a Schema that names the
columns.
clinker-record · mod.rs · Record type @47d2e12
The real definition, trimmed to the part that matters now:

    pub struct Record {
        schema: Arc<Schema>, // the column names + order, shared by every row of a source
        values: Vec<Value>,  // the cells, positional: values[i] belongs to column i
        // ... plus per-record scoped vars and document context, for later
    }

Two ideas to take away: the cells are a plain Vec<Value> indexed by position, and
the schema is held behind an Arc — a shared handle, so a million rows from one
source all point at the same schema instead of each carrying their own copy. (Why
Arc, and what that costs, is a Phase 2 question.)
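The sharing can be observed directly with the standard library. The sketch below uses simplified stand-ins for `Schema` and `Record` (string cells, a single column) and a hypothetical `make_records` helper; only `Arc::clone`, `Arc::strong_count`, and `Arc::ptr_eq` are real `std` APIs.

```rust
use std::sync::Arc;

// Simplified stand-ins; the field types are assumptions for this sketch.
struct Schema {
    columns: Vec<String>,
}

struct Record {
    schema: Arc<Schema>,
    values: Vec<String>,
}

/// Hypothetical helper: build `n` records that all share one schema allocation.
fn make_records(schema: &Arc<Schema>, n: usize) -> Vec<Record> {
    (0..n)
        .map(|i| Record {
            // Cloning an Arc bumps a reference count; no column names are copied.
            schema: Arc::clone(schema),
            values: vec![i.to_string()],
        })
        .collect()
}

fn main() {
    let schema = Arc::new(Schema {
        columns: vec!["customer_id".to_string()],
    });
    let records = make_records(&schema, 3);

    // One handle held by `schema` plus one per record.
    println!("handles: {}", Arc::strong_count(&schema));
    // Both records point at the very same Schema in memory.
    println!(
        "same allocation: {}",
        Arc::ptr_eq(&records[0].schema, &records[2].schema)
    );
    println!("{} = {}", records[1].schema.columns[0], records[1].values[0]);
}
```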
The columns are a Schema
The Schema is the list of column names and their order; it’s what lets
values[5] mean “lifetime_value”:
clinker-record · schema.rs · Schema type @47d2e12

    pub struct Schema {
        columns: Vec<Box<str>>, // column names, in order
        field_metadata: Vec<Option<FieldMetadata>>,
        index: HashMap<Box<str>, usize>, // name -> position, for O(1) lookup
    }

In customer_etl, the source declared the schema right in the YAML:
customer_id, first_name, last_name, email, status, lifetime_value, zip_code. That’s
the schema every customer record is bound to.
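A rough sketch of how a name-to-position index like this can be built and queried, using the customer_etl column list. This is a simplified stand-in, not Clinker's `Schema`: it uses `String` instead of `Box<str>`, drops `field_metadata`, and the `new` and `position` methods are hypothetical names.

```rust
use std::collections::HashMap;

// Simplified stand-in: ordered names plus a name -> position index.
struct Schema {
    columns: Vec<String>,
    index: HashMap<String, usize>,
}

impl Schema {
    /// Hypothetical constructor: derive the index from the ordered names.
    fn new(names: &[&str]) -> Self {
        let columns: Vec<String> = names.iter().map(|n| n.to_string()).collect();
        let index = columns
            .iter()
            .enumerate()
            .map(|(i, name)| (name.clone(), i))
            .collect();
        Schema { columns, index }
    }

    /// Hypothetical lookup: the position of a column, if the schema has it.
    fn position(&self, name: &str) -> Option<usize> {
        self.index.get(name).copied()
    }
}

/// The column list the customer_etl source declares.
fn customer_schema() -> Schema {
    Schema::new(&[
        "customer_id",
        "first_name",
        "last_name",
        "email",
        "status",
        "lifetime_value",
        "zip_code",
    ])
}

fn main() {
    let schema = customer_schema();
    println!("{:?}", schema.position("lifetime_value")); // Some(5)
    println!("{:?}", schema.position("phone")); // None: not a declared column
    println!("{} columns", schema.columns.len());
}
```

Keeping both the ordered list and the map costs a little memory once per source, in exchange for lookups by name that do not scan the column list.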
Build one yourself
Here’s a record modeled in miniature: column names plus a Vec of values, exactly
Alice’s row as the reader first sees it. Run it, then add the zip_code column and its
value:
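A minimal sketch of such a miniature record, written as a standalone program. `MiniRecord` and its `get` method are hypothetical and are not Clinker's `Record`. Only "Alice" and "15200" come from the text above; the other cell values are placeholders invented for the example.

```rust
// A miniature record: column names plus a Vec of values, all strings,
// as a CSV reader would first see them. Hypothetical stand-in type.
struct MiniRecord {
    columns: Vec<&'static str>,
    values: Vec<String>,
}

impl MiniRecord {
    /// Look a cell up by name: find the column's position, then index the values.
    fn get(&self, name: &str) -> Option<&str> {
        let i = self.columns.iter().position(|c| *c == name)?;
        self.values.get(i).map(|v| v.as_str())
    }
}

fn alice() -> MiniRecord {
    MiniRecord {
        columns: vec![
            "customer_id",
            "first_name",
            "last_name",
            "email",
            "status",
            "lifetime_value",
        ],
        // Placeholder data except "Alice" and "15200".
        values: ["1", "Alice", "Example", "alice@example.com", "active", "15200"]
            .iter()
            .map(|s| s.to_string())
            .collect(),
    }
}

fn main() {
    let rec = alice();
    for (name, value) in rec.columns.iter().zip(&rec.values) {
        println!("{name} = {value}");
    }
    println!("{:?}", rec.get("lifetime_value"));
    // Exercise: push "zip_code" onto `columns` and a matching value onto `values`.
}
```

Because the two vectors are matched purely by position, the exercise only works if the new name and the new value are pushed at the same index.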
// quick check
A Record holds its cells as a Vec<Value>. How does it know which cell is which column?
Cells are positional: values[i] belongs to the i-th column of the Schema behind the Arc<Schema>. The schema is stored once and shared by every record of the source, not copied per row.
Checkpoint
You have the vocabulary of a record. Next: how the YAML pipeline you wrote becomes something the engine can actually run.