Benchmark & measure memory
All of Phase 4 has rested on a claim: this is cheaper. A 24-byte FieldStr saves an allocation;
the arbitrator keeps memory bounded; spilling trades RAM for disk. None of that is worth anything
unless you can measure it. This final lesson is about clinker’s measurement tools — and they
come in two flavours that are easy to conflate: a fast estimate the runtime uses to budget
memory, and an exact count the benchmarks use to verify it.
You’ll be able to: explain the heap_size cost model and why it’s an estimate, read a criterion
benchmark, and explain how a custom GlobalAlloc counts allocated bytes.
The cost model: heap_size
Recall lessons 4.4–4.5: operators charge bytes against the memory budget. What is a record’s byte
cost? The engine estimates it with heap_size — owned heap bytes, accounted per Value variant:
clinker-record ·value.rs ·heap_size fn @47d2e12

    /// Estimated heap bytes owned by this value (excludes the enum itself).
    pub fn heap_size(&self) -> usize {
        match self {
            Value::String(s) => s.heap_size(), // 0 if inline, else byte length
            Value::Array(arr) => {
                arr.capacity() * std::mem::size_of::<Value>()
                    + arr.iter().map(Value::heap_size).sum::<usize>()
            }
            Value::Map(m) => m.iter().map(|(k, v)| k.len() + v.heap_size()).sum(),
            _ => 0, // scalars live inline — no heap
        }
    }

Two things to read here. Scalars (Integer, Bool, dates) return 0 — they live inline in the
32-byte Value, owning no heap. And a short inline FieldStr also returns 0, which is the
24-byte type from last lesson paying off in the cost model directly. Crucially, this is an
estimate, computed by walking the value — not a real allocator measurement. It has to be:
charging runs on the hot path, per record, per operator, and you cannot attach an allocator probe to
every value without wrecking the throughput you’re trying to bound. A cheap, consistent estimate is
the right tool for a runtime budget.
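
To make the cost model concrete, here is a minimal sketch that mirrors the heap_size walk on a simplified, hypothetical Value enum. It assumes a plain String in place of FieldStr (so there is no inline case) and a BTreeMap for maps; the real clinker types differ.

```rust
use std::collections::BTreeMap;
use std::mem::size_of;

// Hypothetical, simplified stand-in for clinker's Value. The real enum has
// more variants and stores strings in a FieldStr that can live inline.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Value {
    Integer(i64),
    Bool(bool),
    String(String),
    Array(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

impl Value {
    /// Estimated heap bytes owned by this value (excludes the enum itself).
    fn heap_size(&self) -> usize {
        match self {
            // Assumption: a plain String has no inline form, so count its bytes.
            Value::String(s) => s.len(),
            // The backing buffer is charged by capacity, then each element's own heap.
            Value::Array(arr) => {
                arr.capacity() * size_of::<Value>()
                    + arr.iter().map(Value::heap_size).sum::<usize>()
            }
            // Key bytes plus each value's own heap.
            Value::Map(m) => m.iter().map(|(k, v)| k.len() + v.heap_size()).sum(),
            // Scalars own no heap.
            _ => 0,
        }
    }
}

fn main() {
    let scalar = Value::Integer(42);
    let text = Value::String("hello".to_string());
    let array = Value::Array(vec![Value::Bool(true), text.clone()]);
    let map = Value::Map(BTreeMap::from([("k".to_string(), text.clone())]));

    println!("scalar: {}", scalar.heap_size());
    println!("string: {}", text.heap_size());
    println!("array:  {}", array.heap_size());
    println!("map:    {}", map.heap_size());
}
```

The scalar reports 0 and the five-byte string reports 5. The array adds its backing buffer (capacity times the size of one Value) on top of its elements' own heap bytes, and nothing here asks the allocator what was really allocated.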
The benchmark suite: criterion
For the other job — proving a change actually made things faster or smaller — clinker uses
criterion benchmarks. They live in each crate’s benches/ directory:
record_ops (record create/get/set/clone, value_heap_size), arbitration_poll, and more.
clinker-exec ·arbitration_poll.rs ·bench_should_spill bench @47d2e12

    // crates/clinker-record/benches/record_ops.rs — the criterion shape
    fn bench_value_heap_size(c: &mut Criterion) {
        let string = Value::String("a medium length string".into());
        c.bench_function("value_heap_size/string", |b| {
            b.iter(|| black_box(string.heap_size())); // black_box stops the optimizer
        });
    }

black_box is the load-bearing detail: it hides its argument from the optimizer, so the compiler
can’t “see through” the benchmark and delete the work you’re trying to time. You run these with
cargo bench -p clinker-record --bench record_ops (or --bench arbitration_poll for the
arbitrator-poll cost). The arbitration_poll suite, for instance, times should_spill across
consumer-registry sizes to keep the arbitrator’s per-poll cost from regressing as pipelines deepen.
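
The role of black_box can be seen without criterion at all. The following is a std-only sketch, not clinker's benchmark code: total_len and time_iters are hypothetical helpers, and the timing loop is a crude substitute for criterion's statistical sampling. It uses std::hint::black_box, which the standard library describes as a best-effort hint to the optimizer.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the work under test: sum the byte lengths.
fn total_len(fields: &[&str]) -> usize {
    fields.iter().map(|f| f.len()).sum()
}

/// Calls `f` exactly `iters` times and returns the elapsed wall-clock time.
fn time_iters<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Passing the result through black_box marks it as used, so the
        // compiler is discouraged from deleting the call as dead code.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    let fields = ["id", "name", "a medium length string"];
    // black_box on the input discourages constant-folding the whole sum.
    let elapsed = time_iters(1_000_000, || total_len(black_box(&fields)));
    println!("1_000_000 iterations took {:?}", elapsed);
}
```

Without the two black_box calls, an optimizing build is free to compute the sum once at compile time, and the loop would then time nothing.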
The exact count: a custom global allocator
An estimate budgets; a count verifies. To measure the real bytes a section allocates, clinker
ships a counting global allocator behind the bench-alloc feature:
clinker-bench-support ·alloc.rs ·AccountingAlloc type @47d2e12

    pub struct AccountingAlloc { allocs: AtomicUsize, bytes_alloc: AtomicUsize, /* ... */ }

    // SAFETY: every call forwards to the System allocator after counting.
    unsafe impl GlobalAlloc for AccountingAlloc {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            self.allocs.fetch_add(1, Ordering::Relaxed);
            self.bytes_alloc.fetch_add(layout.size(), Ordering::Relaxed);
            unsafe { System.alloc(layout) } // delegate the real work
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            /* count, then */
            unsafe { System.dealloc(ptr, layout) }
        }
    }

Install it with #[global_allocator] and every allocation in the process flows through it, bumping
atomic counters (the interior-mutability pattern from 4.4, and the unsafe impl discipline from
4.6). A scoped Region snapshots the counters before and after a block to report exactly how many
bytes it allocated — that’s how the executor’s per-stage heap_delta_bytes metric is captured under
the feature. The cost is real (it adds contention per allocation), so it’s a measurement build, not
the production path.
So the two tools divide the labour cleanly: heap_size estimates, cheaply, for the live budget;
AccountingAlloc counts, exactly, for offline verification. Confusing the two — budgeting off real
allocation, or benchmarking off the estimate — would get you the worst of each.
Count allocations yourself
The Rust playground lets you install a global allocator, so you can build a miniature
AccountingAlloc right here — same unsafe impl GlobalAlloc, same forward-to-System pattern:
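
A minimal sketch of such a miniature allocator is given here. The names CountingAlloc, BYTES_ALLOCATED, and bytes_allocated_by are hypothetical, chosen for illustration rather than taken from clinker, and the sketch tracks only bytes requested, with no Region type and no deallocation counter.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical miniature of AccountingAlloc: a single byte counter.
struct CountingAlloc;

static BYTES_ALLOCATED: AtomicUsize = AtomicUsize::new(0);

// SAFETY: every call forwards unchanged to the System allocator after counting.
unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        BYTES_ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Frees are not counted in this sketch; only forward them.
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Route every allocation in the process through the counter.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result plus the bytes requested while it ran.
fn bytes_allocated_by<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = BYTES_ALLOCATED.load(Ordering::Relaxed);
    let out = f();
    let after = BYTES_ALLOCATED.load(Ordering::Relaxed);
    (out, after - before)
}

fn main() {
    let (buf, delta) = bytes_allocated_by(|| Vec::<u8>::with_capacity(1024));
    println!("capacity {} -> {} bytes allocated", buf.capacity(), delta);
}
```

The counter is process-wide, so the delta is only attributable to the closure when no other thread allocates during the measurement.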
The Vec::with_capacity(1024) shows up as a 1024-byte allocation in the delta — an exact count,
not an estimate. This is clinker’s AccountingAlloc in miniature: a GlobalAlloc impl that counts
and forwards. Swap with_capacity(1024) for a String::from("...") or a vec![0u8; 100] and watch
the bytes track precisely.
// quick check
Why does the runtime budget charge memory using heap_size (an estimate) instead of the exact AccountingAlloc count?
The two tools serve different jobs: a cheap consistent estimate for live budgeting on the hot path, and an exact count for offline verification. Using the expensive exact count for per-record budgeting would defeat the purpose.
Measure it for yourself
That’s Phase 4 — Execution & Memory. You’ve gone from “the plan is validated” to the live machine that runs it: closed-enum dispatch, one thread per source feeding bounded channels, buffers that spill, an arbitrator that bounds memory through interior mutability, the unsafe core of the field string, and the tools that measure all of it. The thread tying Phases 3 and 4 together: push proof to the boundary, then run hard inside it — with the type system, the budget, and the benchmarks each enforcing a different guarantee. Phase 5 turns you from reader to contributor: adding an operator, a format, a CXL builtin, and passing the review gauntlet.