Benchmark & measure memory

All of Phase 4 has rested on a claim: this is cheaper. A 24-byte FieldStr saves an allocation; the arbitrator keeps memory bounded; spilling trades RAM for disk. None of that is worth anything unless you can measure it. This final lesson is about clinker’s measurement tools — and they come in two flavours that are easy to conflate: a fast estimate the runtime uses to budget memory, and an exact count the benchmarks use to verify it.

You’ll be able to: explain the heap_size cost model and why it’s an estimate, read a criterion benchmark, and explain how a custom GlobalAlloc counts allocated bytes.

The cost model: `heap_size`

Recall lessons 4.4–4.5: operators charge bytes against the memory budget. What is a record’s byte cost? The engine estimates it with heap_size — owned heap bytes, accounted per Value variant:

clinker-record ·value.rs ·heap_size fn @47d2e12

/// Estimated heap bytes owned by this value (excludes the enum itself).
pub fn heap_size(&self) -> usize {
    match self {
        Value::String(s) => s.heap_size(),                 // 0 if inline, else byte length
        Value::Array(arr) => {
            arr.capacity() * std::mem::size_of::<Value>()
                + arr.iter().map(Value::heap_size).sum::<usize>()
        }
        Value::Map(m) => m.iter().map(|(k, v)| k.len() + v.heap_size()).sum(),
        _ => 0,                                            // scalars live inline — no heap
    }
}

Two things to read here. Scalars (Integer, Bool, dates) return 0 — they live inline in the 32-byte Value, owning no heap. And a short inline FieldStr also returns 0, which is the 24-byte type from last lesson paying off in the cost model directly. Crucially, this is an estimate, computed by walking the value — not a real allocator measurement. It has to be: charging runs on the hot path, per record, per operator, and you cannot attach an allocator probe to every value without wrecking the throughput you’re trying to bound. A cheap, consistent estimate is the right tool for a runtime budget.
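To make the shape of that walk concrete, here is a minimal sketch of the same idea on a toy type. `ToyValue` and `record_cost` are hypothetical names invented for this example, not clinker's `Value` or any real clinker function, and the toy uses a plain `String`, so it has no inline-string case that returns 0.

```rust
// Toy model only: `ToyValue` is a hypothetical stand-in, not clinker's `Value`.
// Strings here are plain `String`s, so there is no inline-string case.
#[allow(dead_code)]
enum ToyValue {
    Integer(i64),
    Bool(bool),
    Text(String),
    Array(Vec<ToyValue>),
}

impl ToyValue {
    /// Estimated heap bytes owned by this value, excluding the enum itself.
    fn heap_size(&self) -> usize {
        match self {
            // Scalars live inside the enum, so they own no heap.
            ToyValue::Integer(_) | ToyValue::Bool(_) => 0,
            // Charge the byte length. This is an estimate: the real
            // allocation may be larger when capacity exceeds length.
            ToyValue::Text(s) => s.len(),
            // The backing buffer, plus whatever each element owns.
            ToyValue::Array(items) => {
                items.capacity() * std::mem::size_of::<ToyValue>()
                    + items.iter().map(ToyValue::heap_size).sum::<usize>()
            }
        }
    }
}

/// Sum the estimate over a record's fields (hypothetical helper), the kind
/// of number an operator could charge against a budget.
fn record_cost(fields: &[ToyValue]) -> usize {
    fields.iter().map(ToyValue::heap_size).sum()
}

fn main() {
    let record = vec![
        ToyValue::Integer(42),
        ToyValue::Bool(true),
        ToyValue::Text("hello".to_string()),
    ];
    // Two scalars cost 0 and "hello" costs 5, so this prints 5.
    println!("estimated heap bytes: {}", record_cost(&record));

    let nested = ToyValue::Array(vec![ToyValue::Integer(1)]);
    println!("array estimate: {}", nested.heap_size());
}
```

The walk is a few integer additions per value and never calls the allocator. That is why an estimate of this shape is affordable per record.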

The benchmark suite: criterion

For the other job — proving a change actually made things faster or smaller — clinker uses criterion benchmarks. They live in each crate’s benches/ directory: record_ops (record create/get/set/clone, value_heap_size), arbitration_poll, and more.

clinker-record ·record_ops.rs ·bench_value_heap_size bench @47d2e12

// crates/clinker-record/benches/record_ops.rs — the criterion shape
fn bench_value_heap_size(c: &mut Criterion) {
    let string = Value::String("a medium length string".into());
    c.bench_function("value_heap_size/string", |b| {
        b.iter(|| black_box(string.heap_size()));   // black_box stops the optimizer
    });
}

black_box is the load-bearing detail: it hides its argument from the optimizer, so the compiler can’t “see through” the benchmark and delete the work you’re trying to time. You run these with cargo bench -p clinker-record --bench record_ops (or --bench arbitration_poll for the arbitrator-poll cost). The arbitration_poll suite, for instance, times should_spill across consumer-registry sizes to keep the arbitrator’s per-poll cost from regressing as pipelines deepen.
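If you want to see where black_box sits without pulling in criterion, a hand-rolled timing loop using only the standard library is enough. This is a rough sketch, not how criterion works internally: `sum_lengths` and `time_it` are hypothetical names, and there is no warm-up, sampling, or statistics here.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// The work under test: a hypothetical stand-in for something like `heap_size`.
fn sum_lengths(words: &[&str]) -> usize {
    words.iter().map(|w| w.len()).sum()
}

/// Time `iters` calls of `f` and return the total elapsed time.
/// Criterion does far more; this only shows where `black_box` goes.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Treat the result as observed so the call cannot be deleted.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    let words = ["a", "medium", "length", "string"];
    // black_box on the input keeps the compiler from precomputing the answer.
    let elapsed = time_it(1_000_000, || sum_lengths(black_box(&words)));
    println!("1,000,000 calls took {elapsed:?}");
}
```

Try deleting both black_box calls and building with optimizations: the measured time may collapse toward zero, because nothing forces the loop body to run.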

The exact count: a custom global allocator

An estimate budgets; a count verifies. To measure the real bytes a section allocates, clinker ships a counting global allocator behind the bench-alloc feature:

clinker-bench-support ·alloc.rs ·AccountingAlloc type @47d2e12

pub struct AccountingAlloc { allocs: AtomicUsize, bytes_alloc: AtomicUsize, /* ... */ }

// SAFETY: every call forwards to the System allocator after counting.
unsafe impl GlobalAlloc for AccountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.allocs.fetch_add(1, Ordering::Relaxed);
        self.bytes_alloc.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }     // delegate the real work
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        /* count, then */ unsafe { System.dealloc(ptr, layout) }
    }
}

Install it with #[global_allocator] and every allocation in the process flows through it, bumping atomic counters (the interior-mutability pattern from 4.4, and the unsafe impl discipline from 4.6). A scoped Region snapshots the counters before and after a block to report exactly how many bytes it allocated — that’s how the executor’s per-stage heap_delta_bytes metric is captured under the feature. The cost is real (it adds contention per allocation), so it’s a measurement build, not the production path.
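The snapshot-before, snapshot-after idea can be sketched in a few lines. Everything named here (`CountingAlloc`, `ToyRegion`, `measure`, `BYTES_ALLOC`) is hypothetical and chosen for illustration; clinker's real Region may look quite different.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical names throughout: this is not clinker's `Region` API.
static BYTES_ALLOC: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

// SAFETY: every call forwards unchanged to System; we only bump a counter.
unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        BYTES_ALLOC.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Remembers the counter at creation; `bytes()` reports growth since then.
struct ToyRegion {
    start: usize,
}

impl ToyRegion {
    fn begin() -> Self {
        ToyRegion { start: BYTES_ALLOC.load(Ordering::Relaxed) }
    }
    fn bytes(&self) -> usize {
        BYTES_ALLOC.load(Ordering::Relaxed) - self.start
    }
}

/// Run `f` and report how many bytes were requested while it ran.
fn measure<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let region = ToyRegion::begin();
    let out = f();
    let bytes = region.bytes();
    (out, bytes)
}

fn main() {
    let (buf, bytes) = measure(|| Vec::<u8>::with_capacity(4096));
    std::hint::black_box(&buf); // keep the allocation from being optimized out
    println!("stage allocated {bytes} bytes (capacity {})", buf.capacity());
}
```

The counter is process-wide, so a sketch like this also picks up allocations made by other threads during the measured block. A per-stage metric is easiest to trust when the measured work runs without concurrent allocation.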

So the two tools divide the labour cleanly: heap_size estimates, cheaply, for the live budget; AccountingAlloc counts, exactly, for offline verification. Confusing the two gets you the worst of each: budgeting off the real allocator puts counting overhead on the hot path, and verifying against the estimate only checks the model against itself.

Count allocations yourself

The Rust playground lets you install a global allocator, so you can build a miniature AccountingAlloc right here — same unsafe impl GlobalAlloc, same forward-to-System pattern:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

struct Counter;
static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

// SAFETY: forwards every call to System after bumping the counters.
unsafe impl GlobalAlloc for Counter {
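  unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
      ALLOCS.fetch_add(1, Ordering::Relaxed);
      BYTES.fetch_add(layout.size(), Ordering::Relaxed);
      // Explicit unsafe block, matching the AccountingAlloc excerpt above.
      unsafe { System.alloc(layout) }
  }
  unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
      // Frees are not counted in this miniature; just forward.
      unsafe { System.dealloc(ptr, layout) }
  }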
}

#[global_allocator]
static A: Counter = Counter;

fn main() {
  let before = BYTES.load(Ordering::Relaxed);
  let v: Vec<u8> = Vec::with_capacity(1024);   // exactly one heap allocation
  let after = BYTES.load(Ordering::Relaxed);
  println!("bytes the Vec allocated: {}", after - before);
  println!("total allocations so far: {}", ALLOCS.load(Ordering::Relaxed));
  std::hint::black_box(&v);
}

The Vec::with_capacity(1024) shows up as a 1024-byte allocation in the delta — an exact count, not an estimate. This is clinker’s AccountingAlloc in miniature: a GlobalAlloc impl that counts and forwards. Swap with_capacity(1024) for a String::from("...") or a vec![0u8; 100] and watch the bytes track precisely.

// quick check

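Why does the runtime budget charge memory using heap_size (an estimate) instead of the exact AccountingAlloc count?

Because charging happens on the hot path, per record and per operator. heap_size is a cheap walk over the value, whereas the counting allocator adds overhead to every allocation in the process and exists only in bench-alloc measurement builds.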

Measure it for yourself

✓ Checkpoint — benchmarking & measuring

💡 Hint 1

Compare the two: heap_size walks a value and sums an estimate; AccountingAlloc intercepts the real allocator. Which one runs on every record during a job, and which only under a bench build?

What the tour establishes

heap_size is a per-variant estimate of owned heap bytes (scalars and short inline strings cost 0), cheap enough to charge against the budget per record.
Criterion benches (record_ops, arbitration_poll) time real operations, with black_box defeating the optimizer; you run them with cargo bench.
AccountingAlloc is an unsafe impl GlobalAlloc that counts every allocation and forwards to System, used under the bench-alloc feature for exact measurement.
Estimate for the live budget, exact count for verification: different tools, different jobs.

// quick check

What is black_box used for in a criterion benchmark?

Without black_box, the compiler can prove the benchmarked result is unused and optimize the work away, timing nothing. black_box forces the value to be treated as observed, so the work actually runs.

You should be able to:

Explain what heap_size estimates and why an estimate (not real allocation) is right for the live budget
Read a criterion bench and say what black_box is for
Explain how AccountingAlloc counts bytes and how it differs in purpose from heap_size

Verify in the checkout:

grep -n 'fn heap_size' crates/clinker-record/src/value.rs
grep -n 'struct AccountingAlloc\|impl GlobalAlloc' crates/clinker-bench-support/src/alloc.rs
grep -n 'fn bench_should_spill' crates/clinker-exec/benches/arbitration_poll.rs
cargo bench -p clinker-record --bench record_ops -- value_heap_size

That’s Phase 4 — Execution & Memory. You’ve gone from “the plan is validated” to the live machine that runs it: closed-enum dispatch, one thread per source feeding bounded channels, buffers that spill, an arbitrator that bounds memory through interior mutability, the unsafe core of the field string, and the tools that measure all of it. The thread tying Phases 3 and 4 together: push proof to the boundary, then run hard inside it — with the type system, the budget, and the benchmarks each enforcing a different guarantee. Phase 5 turns you from reader to contributor: adding an operator, a format, a CXL builtin, and passing the review gauntlet.