Skip to content

Collections & iterators

You now know a record is “a Vec<Value> behind a schema.” This lesson unpacks both collections — the Vec of cells and the HashMap that turns a column name into a position — and then the iterator pattern that lets the engine stream a billion-row file without holding it in memory.

You’ll be able to: explain how a name lookup becomes a position lookup, and why the engine reads records lazily with iterators instead of collecting them into a Vec.

clinker-record ·mod.rs ·Record type @47d2e12
pub struct Record {
schema: Arc<Schema>, // column names + order, shared across rows
values: Vec<Value>, // the cells, indexed by position
// ...
}

The cells are a Vec<Value> — a growable, contiguous array — indexed by position. But your CXL says status, a name. The bridge is in the Schema, which keeps a HashMap from column name to index:

clinker-record ·schema.rs ·Schema type @47d2e12
pub struct Schema {
columns: Vec<Box<str>>, // the column names, in order
field_metadata: Vec<Option<FieldMetadata>>,
index: HashMap<Box<str>, usize>, // name -> position
}

So record.get("status") is two steps: the schema’s HashMap turns "status" into, say, 4; then values[4] is the cell. Name → index → value. The schema is stored once and shared (that Arc again), so this map isn’t duplicated per row.

A small but telling detail: the column names are stored as Box<str>, not String. A String carries a length and spare capacity for growth; these names never grow, so Box<str> drops the capacity field and stores just the bytes. The engine prefers the leaner type on the hot path — a habit you’ll see repeatedly.

A reader doesn’t load the file and hand you a Vec<Record>. It hands you records one at a time, on demand — that’s what next_record is: pull the next row, or None at the end. This is the iterator pattern, and it’s lazy: nothing is computed until asked, and only one record is held at a time.

That laziness is a load-bearing property, not a style choice. A streaming pipeline keeps roughly one record’s worth of data live, regardless of whether the file is a thousand rows or a billion. Collecting the whole input into a Vec first would throw away the engine’s entire bounded-memory promise. (How the engine holds the line when an operation does need to accumulate — a sort, a join — is the subject of Phase 4.)

rust // editable

// quick check

How does looking up the field named status find the right cell when cells are stored by position?

You know the collections inside a record and why they stream. Next: the smart pointers — Box, Rc, Arc — that the engine uses to share data without copying it.