Collections & iterators

You now know a record is “a Vec<Value> behind a schema.” This lesson unpacks both collections — the Vec of cells and the HashMap that turns a column name into a position — and then the iterator pattern that lets the engine stream a billion-row file without holding it in memory.

You’ll be able to: explain how a name lookup becomes a position lookup, and why the engine reads records lazily with iterators instead of collecting them into a Vec.

A record’s two collections

clinker-record · mod.rs · Record type @47d2e12

pub struct Record {
    schema: Arc<Schema>,   // column names + order, shared across rows
    values: Vec<Value>,    // the cells, indexed by position
    // ...
}

The cells are a Vec<Value> — a growable, contiguous array — indexed by position. But your CXL says status, a name. The bridge is in the Schema, which keeps a HashMap from column name to index:

clinker-record · schema.rs · Schema type @47d2e12

pub struct Schema {
    columns: Vec<Box<str>>,           // the column names, in order
    field_metadata: Vec<Option<FieldMetadata>>,
    index: HashMap<Box<str>, usize>,  // name -> position
}

So record.get("status") is two steps: the schema’s HashMap turns "status" into, say, 4; then values[4] is the cell. Name → index → value. The schema is stored once and shared (that Arc again), so this map isn’t duplicated per row.
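To make the two steps concrete, here is a minimal sketch of the lookup path. The types are simplified stand-ins, not the real clinker-record definitions: cells are plain String instead of Value, field_metadata is omitted, and the Schema::new constructor and the get signature are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for the real Schema (illustration only).
struct Schema {
    columns: Vec<Box<str>>,          // the column names, in order
    index: HashMap<Box<str>, usize>, // name -> position
}

impl Schema {
    // Hypothetical constructor: builds the name -> position map once.
    fn new(names: &[&str]) -> Self {
        let columns: Vec<Box<str>> = names.iter().map(|n| Box::from(*n)).collect();
        let index = columns
            .iter()
            .enumerate()
            .map(|(pos, name)| (name.clone(), pos))
            .collect();
        Schema { columns, index }
    }
}

// Simplified stand-in for the real Record: cells are String, not Value.
struct Record {
    schema: Arc<Schema>, // shared, so the map is not copied per row
    values: Vec<String>, // the cells, indexed by position
}

impl Record {
    // Hypothetical signature: name -> index -> value.
    fn get(&self, name: &str) -> Option<&str> {
        let pos = *self.schema.index.get(name)?;  // step 1: name -> index
        self.values.get(pos).map(String::as_str)  // step 2: index -> value
    }
}

fn main() {
    let schema = Arc::new(Schema::new(&["id", "name", "status"]));
    let row = Record {
        schema: Arc::clone(&schema),
        values: vec!["7".to_string(), "Ada".to_string(), "active".to_string()],
    };

    println!("columns: {}", row.schema.columns.len());
    println!("status:  {:?}", row.get("status"));
    println!("missing: {:?}", row.get("missing"));
}
```

Every row built from the same Arc<Schema> reuses one map, so adding rows adds cells but no new name strings.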

A small but telling detail: the column names are stored as Box<str>, not String. A String is three words wide (pointer, length, capacity), and the capacity exists only to leave room for growth. These names never grow, so Box<str> drops the capacity field and keeps just the pointer and the length. The engine prefers the leaner type on the hot path, a habit you’ll see repeatedly.
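You can check the size difference yourself. This is a small sketch that prints the size of each handle on whatever target you compile for. The helper name handle_sizes is made up for this example, and the exact layout of String is a property of current Rust toolchains rather than a language guarantee.

```rust
use std::mem::size_of;

// Returns (size of String, size of Box<str>) in bytes for the current target.
fn handle_sizes() -> (usize, usize) {
    (size_of::<String>(), size_of::<Box<str>>())
}

fn main() {
    let (string_bytes, boxed_bytes) = handle_sizes();
    println!("String:   {string_bytes} bytes (pointer + length + capacity)");
    println!("Box<str>: {boxed_bytes} bytes (pointer + length)");

    // Freezing a finished name: into_boxed_str gives up any spare capacity.
    let mut name = String::with_capacity(64);
    name.push_str("status");
    let frozen: Box<str> = name.into_boxed_str();
    println!("{frozen} ({} bytes of text)", frozen.len());
}
```

On a typical 64-bit target the two handles come out at three words and two words respectively. The saving is one word per name, which is small on its own but adds up in structures the engine touches constantly.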

Iterators: one record at a time

A reader doesn’t load the file and hand you a Vec<Record>. It hands you records one at a time, on demand — that’s what next_record is: pull the next row, or None at the end. This is the iterator pattern, and it’s lazy: nothing is computed until asked, and only one record is held at a time.

That laziness is a load-bearing property, not a style choice. A streaming pipeline keeps roughly one record’s worth of data live, regardless of whether the file is a thousand rows or a billion. Collecting the whole input into a Vec first would throw away the engine’s entire bounded-memory promise. (How the engine holds the line when an operation does need to accumulate — a sort, a join — is the subject of Phase 4.)
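Here is a rough sketch of what a pull-based reader can look like. It is not the engine’s reader: the LineReader type, the comma splitting, and the exact next_record signature are assumptions for illustration, and I/O errors simply end the stream to keep the example short.

```rust
use std::io::BufRead;

// Hypothetical reader, loosely modelled on the lesson's `next_record`.
struct LineReader<R: BufRead> {
    source: R,
    buf: String, // reused for every row, so memory stays flat
}

impl<R: BufRead> LineReader<R> {
    fn new(source: R) -> Self {
        LineReader { source, buf: String::new() }
    }

    // Pull the next row, or None at the end of input.
    // In this sketch an I/O error also ends the stream.
    fn next_record(&mut self) -> Option<Vec<String>> {
        self.buf.clear();
        match self.source.read_line(&mut self.buf) {
            Ok(0) | Err(_) => None,
            Ok(_) => Some(
                self.buf
                    .trim_end()
                    .split(',')
                    .map(str::to_owned)
                    .collect(),
            ),
        }
    }
}

// Counts rows whose second cell is "active", holding one record at a time.
fn count_active<R: BufRead>(reader: &mut LineReader<R>) -> usize {
    let mut active = 0;
    while let Some(record) = reader.next_record() {
        if record.get(1).map(String::as_str) == Some("active") {
            active += 1;
        }
        // `record` is dropped here, before the next row is read.
    }
    active
}

fn main() {
    let data = "1,active\n2,inactive\n3,active\n";
    let mut reader = LineReader::new(data.as_bytes());
    println!("active: {}", count_active(&mut reader));
}
```

The loop body never sees more than one record, so peak memory depends on the widest row and not on the number of rows.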

Map, filter, lazily

rust // editable

fn main() {
  // Pretend these are records streaming from a source.
  let rows = ["active", "inactive", "active", "active"];

  // An iterator chain: lazy until `count` pulls it.
  let active = rows
      .iter()
      .filter(|status| **status == "active")
      .count();

  println!("active customers: {active}");

  // Nothing in the chain ran until count() asked for values —
  // and it never built an intermediate Vec of all rows.
}
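If you want to see the laziness rather than take it on trust, count how often the filter closure runs. This sketch uses a Cell as a visit counter. The function name lazy_demo and the counter are additions for illustration only.

```rust
use std::cell::Cell;

// Returns (closure calls before count, active rows, closure calls after count).
fn lazy_demo() -> (usize, usize, usize) {
    let rows = ["active", "inactive", "active", "active"];
    let visited = Cell::new(0);

    // Building the chain does no work: the closure has not run yet.
    let chain = rows.iter().filter(|status| {
        visited.set(visited.get() + 1);
        **status == "active"
    });
    let before = visited.get();

    // count() pulls every item through the filter, one at a time.
    let active = chain.count();
    let after = visited.get();

    (before, active, after)
}

fn main() {
    let (before, active, after) = lazy_demo();
    println!("closure calls before count(): {before}");
    println!("active customers: {active}");
    println!("closure calls after count():  {after}");
}
```

The closure runs zero times while the chain is being built and once per row when count() consumes it.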

// quick check

How does looking up the field named status find the right cell when cells are stored by position?

The schema keeps a name→index HashMap (shared across rows). Lookup is name → index → values[index]: one hash lookup (O(1) on average) followed by a Vec index, and the names aren't duplicated per record.

Checkpoint

// quick check

Why does the engine pull records one at a time instead of reading the whole file into a Vec<Record>?

You should be able to:

Describe the name→index→value lookup path
Explain why readers stream records lazily instead of collecting them
Say why column names are Box<str> rather than String

Verify in the checkout:

grep -nA4 'pub struct Record' crates/clinker-record/src/record/mod.rs
grep -n 'pub struct Schema' crates/clinker-record/src/schema.rs

You know the collections inside a record and why records are streamed rather than collected. Next: the smart pointers (Box, Rc, Arc) that the engine uses to share data without copying it.