Collections & iterators
You now know a record is “a Vec<Value> behind a schema.” This lesson unpacks both
collections — the Vec of cells and the HashMap that turns a column name into a
position — and then the iterator pattern that lets the engine stream a billion-row file
without holding it in memory.
You’ll be able to: explain how a name lookup becomes a position lookup, and why the
engine reads records lazily with iterators instead of collecting them into a Vec.
A record’s two collections
clinker-record ·mod.rs ·Record type @47d2e12
pub struct Record {
    schema: Arc<Schema>,   // column names + order, shared across rows
    values: Vec<Value>,    // the cells, indexed by position
    // ...
}
The cells are a Vec<Value> — a growable, contiguous array — indexed by position.
But your CXL says status, a name. The bridge is in the Schema, which keeps a
HashMap from column name to index:
clinker-record ·schema.rs ·Schema type @47d2e12
pub struct Schema {
    columns: Vec<Box<str>>,                      // the column names, in order
    field_metadata: Vec<Option<FieldMetadata>>,
    index: HashMap<Box<str>, usize>,             // name -> position
}
So record.get("status") is two steps: the schema’s HashMap turns "status" into,
say, 4; then values[4] is the cell. Name → index → value. The schema is stored once
and shared (that Arc again), so this map isn’t duplicated per row.
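The two-step lookup can be sketched in a few lines. This is a minimal illustration, not the engine's code: the `Value` enum, `Schema::new`, and the body of `get` are assumptions made up for the example, and `field_metadata` is left out.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for the engine's Value type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

struct Schema {
    columns: Vec<Box<str>>,          // column names, in order
    index: HashMap<Box<str>, usize>, // name -> position
}

impl Schema {
    // Hypothetical constructor: build the name -> position map once.
    fn new(names: &[&str]) -> Self {
        let columns: Vec<Box<str>> = names.iter().map(|n| Box::from(*n)).collect();
        let index = columns
            .iter()
            .enumerate()
            .map(|(position, name)| (name.clone(), position))
            .collect();
        Schema { columns, index }
    }
}

struct Record {
    schema: Arc<Schema>, // shared: cloning the Arc does not copy the map
    values: Vec<Value>,
}

impl Record {
    // Step 1: name -> index via the schema. Step 2: index -> cell.
    fn get(&self, name: &str) -> Option<&Value> {
        let position = *self.schema.index.get(name)?;
        self.values.get(position)
    }
}

fn main() {
    let schema = Arc::new(Schema::new(&["id", "status"]));
    let row = Record {
        schema: Arc::clone(&schema),
        values: vec![Value::Int(7), Value::Text("open".into())],
    };
    println!("{} columns", schema.columns.len());
    println!("{:?}", row.get("status")); // Some(Text("open"))
    println!("{:?}", row.get("missing")); // None
}
```

Every row built from the same `Arc<Schema>` reuses one map, so adding rows only adds cells.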
A small but telling detail: the column names are stored as Box<str>, not String. A
String is a pointer, a length, and a capacity (room to grow); these names never grow,
so Box<str> drops the capacity field and keeps just the pointer and the length. The
engine prefers the leaner type on the hot path, a habit you’ll see repeatedly.
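The size difference is easy to measure. The sketch below assumes the current standard library layout, where a String is three machine words; the language does not formally guarantee that layout, so treat the numbers as an observation rather than a contract. The helper name `handle_sizes` is made up for the example.

```rust
use std::mem::size_of;

// Returns (size of a String handle, size of a Box<str> handle) in bytes.
fn handle_sizes() -> (usize, usize) {
    (size_of::<String>(), size_of::<Box<str>>())
}

fn main() {
    let (string_size, boxed_size) = handle_sizes();
    // On a 64-bit target this typically prints 24 and 16.
    println!("String:   {} bytes", string_size);
    println!("Box<str>: {} bytes", boxed_size);

    // Converting a String into Box<str> releases any spare capacity.
    let mut name = String::with_capacity(64);
    name.push_str("status");
    let frozen: Box<str> = name.into_boxed_str();
    println!("{} ({} bytes of text)", frozen, frozen.len());
}
```

One word per name is small, but it applies to every column name and every key in the index map.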
Iterators: one record at a time
A reader doesn’t load the file and hand you a Vec<Record>. It hands you records one
at a time, on demand — that’s what next_record is: pull the next row, or None at
the end. This is the iterator pattern, and it’s lazy: nothing is computed until asked,
and only one record is held at a time.
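A pull-based reader might look like the sketch below. `LineReader` is hypothetical and far simpler than a real reader: it splits an in-memory string on newlines and commas, and its `next_record` signature is an assumption, not the engine's. What it shows is the shape: each call does the work for one row and nothing more.

```rust
// Hypothetical reader over an in-memory string; the engine's reader differs.
struct LineReader<'a> {
    lines: std::str::Lines<'a>,
}

impl<'a> LineReader<'a> {
    fn new(input: &'a str) -> Self {
        LineReader { lines: input.lines() }
    }

    // Pull the next row, or None at the end. Nothing is split until called.
    fn next_record(&mut self) -> Option<Vec<&'a str>> {
        let line = self.lines.next()?;
        Some(line.split(',').collect())
    }
}

// Implementing Iterator on top of next_record enables for-loops and adapters.
impl<'a> Iterator for LineReader<'a> {
    type Item = Vec<&'a str>;

    fn next(&mut self) -> Option<Self::Item> {
        self.next_record()
    }
}

fn main() {
    let mut reader = LineReader::new("1,open\n2,closed\n");
    while let Some(record) = reader.next_record() {
        // Only this one record is alive in the loop body.
        println!("{:?}", record);
    }
}
```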
That laziness is a load-bearing property, not a style choice. A streaming pipeline keeps
roughly one record’s worth of data live, regardless of whether the file is a thousand
rows or a billion. Collecting the whole input into a Vec first would throw away the
engine’s entire bounded-memory promise. (How the engine holds the line when an operation
does need to accumulate — a sort, a join — is the subject of Phase 4.)
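To make the bounded-memory point concrete, here is a small sketch that counts matching rows while holding a single line buffer. The function `count_status` and the comma-separated layout are assumptions for illustration; the point is that memory use depends on the longest line, not on the number of rows.

```rust
use std::io::{self, BufRead, Cursor};

// Count rows whose second field equals `wanted`, one line in memory at a time.
fn count_status<R: BufRead>(mut input: R, wanted: &str) -> io::Result<u64> {
    let mut line = String::new(); // one buffer, reused for every row
    let mut matches = 0;
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            break; // end of input
        }
        if line.trim_end().split(',').nth(1) == Some(wanted) {
            matches += 1;
        }
    }
    Ok(matches)
}

fn main() -> io::Result<()> {
    // A Cursor stands in for a file; any BufRead source works the same way.
    let data = Cursor::new("1,open\n2,closed\n3,open\n");
    println!("{}", count_status(data, "open")?); // prints 2
    Ok(())
}
```

Swapping the loop for "read every line into a Vec, then count" gives the same answer, but memory then grows with the file.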
Map, filter, lazily
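Adapters such as filter and map only describe work; nothing runs until a consumer pulls values through the chain. The sketch below is a standalone illustration with plain integers rather than engine records, and the function name is made up. It counts how many inputs were actually inspected to show that the chain stops as soon as take(2) is satisfied.

```rust
use std::cell::Cell;

// Returns (the first two even squares, how many inputs were inspected).
fn first_two_even_squares(limit: u64) -> (Vec<u64>, u32) {
    let inspected = Cell::new(0u32);
    let squares: Vec<u64> = (1..=limit)
        .inspect(|_| inspected.set(inspected.get() + 1)) // count each pull
        .filter(|n| n % 2 == 0) // keep even numbers
        .map(|n| n * n) // square them
        .take(2) // stop after two results
        .collect(); // the only step that drives the chain
    (squares, inspected.get())
}

fn main() {
    // The range covers a billion values, but only four are ever produced.
    let (squares, inspected) = first_two_even_squares(1_000_000_000);
    println!("{:?} after inspecting {} inputs", squares, inspected);
    // prints: [4, 16] after inspecting 4 inputs
}
```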
Quick check
How does looking up the field named status find the right cell when cells are stored by position?
The schema keeps a name→index HashMap (shared across rows). Lookup is name → index → values[index], O(1) on average, and the names aren't duplicated per record.
Checkpoint
You know the collections inside a record and why they stream. Next: the smart pointers —
Box, Rc, Arc — that the engine uses to share data without copying it.