unsafe & FieldStr
This is the lowest the data layer goes. Value::String is the dominant payload in an ETL stream,
so its width sets the per-Value cost — the very number that drives the RSS and spill thresholds
from the last four lessons. To keep Value at its 32-byte budget (Phase 2) and support three
different string-ownership strategies, clinker hand-packs them into a 24-byte type called
FieldStr, using unsafe. It’s the engine’s deepest use of unsafe Rust — and, more usefully, a
model of how to use it responsibly: a tiny unsafe core, fenced in by layout assertions and
documented invariants.
You’ll be able to: explain FieldStr’s three arms, read an unsafe block guarded by a SAFETY
comment, and explain how layout assertions make the unsafe sound.
Three strategies in one 24-byte type
A field string is one of three things, and which one is a memory-cost decision:
- inline: short values (≤ 23 bytes: ids, codes, flags) live in the struct, zero heap allocation;
- shared: longer values that repeat or get cloned across stages are Arc<str>-backed, so a clone is an O(1) refcount bump, not a copy;
- unique: long values flagged unique by the schema are Box<str>-backed, dropping the ~16-byte Arc refcount header that’s pure waste when a value never repeats (UUIDs, addresses, notes).
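To see why the packing needs care, here is a minimal safe sketch of the same three strategies as an ordinary enum. `SafeFieldStr`, its methods, and the `unique` flag are hypothetical names for illustration, not clinker's code; the point is only that a compiler-managed discriminant cannot share a byte with a 23-byte inline arm.

```rust
use std::mem::size_of;
use std::sync::Arc;

const INLINE_CAP: usize = 23; // same inline limit as the lesson

// Hypothetical safe stand-in: the compiler picks the layout and adds its own discriminant.
enum SafeFieldStr {
    Inline { len: u8, data: [u8; INLINE_CAP] },
    Shared(Arc<str>),
    Unique(Box<str>),
}

impl SafeFieldStr {
    // `unique` stands in for a schema flag saying the column never repeats.
    fn new(s: &str, unique: bool) -> Self {
        if s.len() <= INLINE_CAP {
            let mut data = [0u8; INLINE_CAP];
            data[..s.len()].copy_from_slice(s.as_bytes());
            SafeFieldStr::Inline { len: s.len() as u8, data }
        } else if unique {
            SafeFieldStr::Unique(Box::from(s))
        } else {
            SafeFieldStr::Shared(Arc::from(s))
        }
    }

    fn strategy(&self) -> &'static str {
        match self {
            SafeFieldStr::Inline { .. } => "inline",
            SafeFieldStr::Shared(_) => "shared",
            SafeFieldStr::Unique(_) => "unique",
        }
    }

    fn as_str(&self) -> &str {
        match self {
            // Checked conversion: the safe version rescans the bytes on every read.
            SafeFieldStr::Inline { len, data } => {
                std::str::from_utf8(&data[..*len as usize]).expect("constructor wrote valid UTF-8")
            }
            SafeFieldStr::Shared(s) => &**s,
            SafeFieldStr::Unique(s) => &**s,
        }
    }
}

fn main() {
    let long = "a free-text note that is longer than 23 bytes";
    for v in [
        SafeFieldStr::new("SKU-0042", false),
        SafeFieldStr::new(long, false),
        SafeFieldStr::new(long, true),
    ] {
        println!("{:>6}: {}", v.strategy(), v.as_str());
    }
    println!("safe enum size: {} bytes", size_of::<SafeFieldStr>());
}
```

With a 23-byte inline arm plus its length byte, the enum already fills 24 bytes before the discriminant is added, so its size must exceed 24. Closing that gap is what the hand-packed layout is for.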
All three share one 24-byte footprint via a private union, discriminated by a single trailing tag
byte:
clinker-record · field_str.rs · FieldStr type @47d2e12
```rust
#[repr(C)]
union Repr {
    inline: Inline,   // [u8; 23] of UTF-8 + a tag byte
    heap: HeapHeader, // ptr + len + padding + the same tag byte
}

pub struct FieldStr {
    repr: Repr, // inline bytes, an Arc<str> (shared), or a Box<str> (unique)
}
```

```text
FieldStr: 24 bytes, two overlapping arms (a union), ONE shared tag byte:

inline: [ 23 bytes of UTF-8 ........................ ][ tag = len 0..=23 ]
heap:   [ ptr: *const u8 ][ len: usize ][ padding ...][ tag = 0xFF/0xFE ]
                                                      ^ same offset (INLINE_CAP) in both arms
```

The trailing byte does double duty: for the inline arm it’s the length (always ≤ 23); for the
heap arms it’s a sentinel (0xFF shared, 0xFE unique) chosen above 23 so it can never be
mistaken for an inline length. That overlap is the trick that hits 24 bytes — and the reason
unsafe is unavoidable: reading a field of a union is unsafe, because the compiler can’t know
which arm is currently valid.
clinker-record · field_str.rs · INLINE_CAP doc @47d2e12
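The shared tag byte can be tried out in isolation. The sketch below is a simplified, hypothetical reconstruction: the field names `data`, `_pad`, the `PAD` constant, and the helper functions are assumptions, and the heap arm here merely points at a `&'static str` instead of owning an Arc or Box, so no Drop logic is needed.

```rust
use std::mem::size_of;

const INLINE_CAP: usize = 23;
const TAG_SHARED: u8 = 0xFF;
// Padding that pushes `tag` to the last byte on both 32- and 64-bit targets.
const PAD: usize = INLINE_CAP - 2 * size_of::<usize>();

#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)] // `data` is not read back in this sketch
struct Inline {
    data: [u8; INLINE_CAP],
    tag: u8, // byte length, 0..=23
}

#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)] // ptr/len are not read back in this sketch
struct HeapHeader {
    ptr: *const u8,
    len: usize,
    _pad: [u8; PAD],
    tag: u8, // sentinel above INLINE_CAP
}

#[repr(C)]
union Repr {
    inline: Inline,
    heap: HeapHeader,
}

impl Repr {
    fn tag(&self) -> u8 {
        // SAFETY: both arms are #[repr(C)] with a `u8` tag at offset INLINE_CAP,
        // and both constructors below initialise that byte, so reading it through
        // the inline arm is valid whichever arm was written.
        unsafe { self.inline.tag }
    }
}

fn inline_repr(s: &str) -> Option<Repr> {
    if s.len() > INLINE_CAP {
        return None;
    }
    let mut data = [0u8; INLINE_CAP];
    data[..s.len()].copy_from_slice(s.as_bytes());
    Some(Repr { inline: Inline { data, tag: s.len() as u8 } })
}

// Simplification: borrows a static string instead of owning an Arc<str>.
fn static_heap_repr(s: &'static str) -> Repr {
    Repr { heap: HeapHeader { ptr: s.as_ptr(), len: s.len(), _pad: [0; PAD], tag: TAG_SHARED } }
}

fn main() {
    let short = inline_repr("SKU-0042").expect("fits inline");
    let long = static_heap_repr("a value too long to store inline, kept elsewhere");
    println!("inline tag = {}", short.tag());
    println!("heap tag   = {:#04x}", long.tag());
    println!("size       = {} bytes", size_of::<Repr>());
}
```

Note that `tag()` never needs to know which arm is live; that is exactly the property the tag-offset overlap buys.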
unsafe, fenced by a SAFETY contract
Here’s the read path. as_str checks the tag, then reads the matching arm, with each read in an
unsafe block prefaced by a // SAFETY: comment stating the invariant that makes it sound:

```rust
pub fn as_str(&self) -> &str {
    let tag = self.tag();
    if tag <= INLINE_CAP as u8 {
        // SAFETY: inline arm. `tag` is the byte length (<= INLINE_CAP), and the
        // constructor only ever wrote valid UTF-8 into `data[..len]`.
        unsafe {
            let inline = &self.repr.inline;
            std::str::from_utf8_unchecked(&inline.data[..tag as usize])
        }
    } else {
        // SAFETY: heap arm. ptr/len came from Arc::into_raw / Box::into_raw on a
        // `str`, so the bytes are live, owned by this FieldStr, and valid UTF-8.
        unsafe {
            let heap = &self.repr.heap;
            let bytes = std::slice::from_raw_parts(heap.ptr, heap.len);
            std::str::from_utf8_unchecked(bytes)
        }
    }
}
```

unsafe doesn’t mean “unchecked and hope.” It means the compiler can’t verify this, so I’m
asserting an invariant I’m responsible for. The SAFETY comment is that assertion written down. The
Drop impl is the same discipline in reverse: it reconstitutes the Arc/Box from the raw pointer
to release exactly one ownership token, with a SAFETY note that each heap arm owns exactly one such
token.
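The "one ownership token" idea can be sketched with the standard `Arc::into_raw` / `Arc::from_raw` pair. `RawShared` is a hypothetical holder, not clinker's type; it stores only a pointer and a length, as the heap arm is described as doing, and rebuilds the Arc in Drop.

```rust
use std::sync::Arc;

// Hypothetical holder for one shared string, kept as raw parts (pointer + length).
struct RawShared {
    ptr: *const u8,
    len: usize,
}

impl RawShared {
    fn new(s: Arc<str>) -> Self {
        let len = s.len();
        // into_raw hands over the Arc's ownership token without changing the refcount.
        let ptr = Arc::into_raw(s) as *const u8;
        RawShared { ptr, len }
    }

    fn as_str(&self) -> &str {
        // SAFETY: ptr/len came from Arc::into_raw on a `str` in `new`, and the
        // allocation stays alive until `drop` releases this value's token.
        unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.ptr, self.len)) }
    }
}

impl Drop for RawShared {
    fn drop(&mut self) {
        let raw = std::ptr::slice_from_raw_parts(self.ptr, self.len) as *const str;
        // SAFETY: `raw` is the same pointer Arc::into_raw returned in `new`, and
        // this value owns exactly one token, which is released exactly once here.
        unsafe { drop(Arc::from_raw(raw)) }
    }
}

fn main() {
    let original: Arc<str> = Arc::from("a long value shared across pipeline stages");
    let held = RawShared::new(Arc::clone(&original));
    println!("{} (strong count = {})", held.as_str(), Arc::strong_count(&original));
    drop(held);
    println!("after drop: strong count = {}", Arc::strong_count(&original));
}
```

Watching the strong count rise when the holder is created and fall when it is dropped is a cheap way to check that the token is released once and only once.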
What makes it actually sound: layout assertions
A SAFETY comment is a promise; the layout assertions are how the promise is kept. The whole design rests on the tag byte sitting at the same offset in both union arms, so reading the tag through either arm reads the live discriminant. A compile-time block pins exactly that:

```rust
const _: () = {
    assert!(std::mem::size_of::<FieldStr>() == 24); // the Value-budget invariant
    assert!(std::mem::offset_of!(Inline, tag) == INLINE_CAP); // tag offset coincides...
    assert!(std::mem::offset_of!(HeapHeader, tag) == INLINE_CAP); // ...in both arms
    assert!(TAG_SHARED as usize > INLINE_CAP); // sentinels can't be inline lengths
    assert!(TAG_UNIQUE as usize > INLINE_CAP);
};
```

If anyone changes a field and breaks the layout, the build fails, so the unsafe code can never run
against a layout it wasn’t written for. And because the union holds a raw pointer, the compiler
does not auto-implement Send/Sync for FieldStr, so those are hand-written unsafe impls with their
own SAFETY note (every arm’s backing is Send + Sync; the raw pointer is only ever an owned Arc/Box
in disguise). A runtime test pins the size too:
clinker-record · field_str.rs · size_is_24_bytes test @47d2e12
This is the responsible-unsafe pattern, and it’s worth internalising: keep the unsafe core small, pin every layout invariant it relies on with a static assertion, document each block’s SAFETY contract, and back it with a drop/round-trip test. Unsafe earns its keep here — one allocation saved per short field, across billions of fields — precisely because it’s this tightly fenced.
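As a rough illustration of that fencing pattern on a much smaller type, the sketch below pins a layout at compile time and hand-writes Send/Sync for a raw-pointer holder. `Header`, `from_static`, and `assert_send_sync` are hypothetical names; the SAFETY argument holds only because this sketch points at `&'static str` data and is not the argument clinker makes for its owned Arc/Box pointers. It assumes a toolchain where `std::mem::offset_of!` is available.

```rust
use std::mem::{offset_of, size_of};

// Hypothetical raw-pointer header standing in for a heap arm.
#[repr(C)]
struct Header {
    ptr: *const u8,
    len: usize,
    tag: u8,
}

// Compile-time fence: adding or reordering a field fails the build right here.
const _: () = {
    assert!(offset_of!(Header, ptr) == 0);
    assert!(offset_of!(Header, tag) == 2 * size_of::<usize>());
    assert!(size_of::<Header>() == 3 * size_of::<usize>());
};

// The raw pointer suppresses the automatic Send/Sync impls, so they are opted into by hand.
// SAFETY: in this sketch `ptr` only ever points into a `&'static str`, which is
// immutable and lives for the whole program, so any thread may read through it.
unsafe impl Send for Header {}
unsafe impl Sync for Header {}

impl Header {
    fn from_static(s: &'static str, tag: u8) -> Self {
        Header { ptr: s.as_ptr(), len: s.len(), tag }
    }

    fn as_str(&self) -> &str {
        // SAFETY: ptr/len were taken from a `&'static str` in `from_static`.
        unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.ptr, self.len)) }
    }
}

// Compiles only if T is Send + Sync, so it doubles as a static check.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<Header>();
    let h = Header::from_static("hello", 0xFF);
    // Moving the header to another thread relies on the hand-written Send impl.
    let worker = std::thread::spawn(move || (h.as_str().len(), h.tag));
    let (len, tag) = worker.join().expect("worker thread panicked");
    println!("len = {len}, tag = {tag:#04x}");
}
```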
The technique, safely
You can’t write a union in a quick playground, but you can practise the core move: skipping a
check you can prove is unnecessary, and writing the proof down. This is exactly what the inline arm
of as_str does:
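A minimal sketch of that move, assuming a hypothetical `ShortStr` type (not clinker's code) whose only constructor copies bytes out of a `&str`:

```rust
// Hypothetical fixed-capacity string: the first `len` bytes of `data` are always valid UTF-8.
struct ShortStr {
    data: [u8; 23],
    len: u8,
}

impl ShortStr {
    fn new(s: &str) -> Option<Self> {
        if s.len() > 23 {
            return None; // too long for the inline buffer
        }
        let mut data = [0u8; 23];
        data[..s.len()].copy_from_slice(s.as_bytes());
        Some(ShortStr { data, len: s.len() as u8 })
    }

    fn as_str(&self) -> &str {
        let bytes = &self.data[..self.len as usize];
        // SAFETY: `new` is the only constructor, and it copies exactly `len` bytes
        // from a `&str`, so `bytes` is a complete, valid UTF-8 sequence.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}

fn main() {
    let id = ShortStr::new("ORD-2024-é").expect("fits in 23 bytes");
    println!("{}", id.as_str());
}
```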
The unsafe block is sound because the SAFETY comment’s claim is true — the bytes really are valid
UTF-8. Feed from_utf8_unchecked genuinely invalid bytes and you’d have undefined behaviour with no
error; the entire responsibility moves from the compiler to you, which is why the invariant must be
real and written down. That’s the deal unsafe offers, and FieldStr takes it only where the payoff
is large and the invariant is pinned by an assertion.
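For contrast, the checked `std::str::from_utf8` shows what the unchecked call skips. This sketch, with a hypothetical `validate` helper, is one way to validate untrusted bytes once at a boundary so that later unchecked reads rest on a real invariant:

```rust
// Validate untrusted bytes once; only after this succeeds is an unchecked read justified.
fn validate(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

fn main() {
    let good = b"ORD-42";
    let bad = [b'O', b'R', 0xFF, 0xFE]; // 0xFF never appears in valid UTF-8

    println!("{:?}", validate(good));
    match validate(&bad) {
        Ok(s) => println!("ok: {s}"),
        // The checked path reports where validation stopped instead of invoking UB.
        Err(e) => println!("rejected after {} valid bytes", e.valid_up_to()),
    }
}
```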
// quick check
What makes FieldStr's unsafe union reads sound rather than reckless?
Soundness comes from fencing: const layout assertions pin the size and tag offset the unsafe relies on (build fails otherwise), SAFETY comments document each block's invariant, and the constructors establish those invariants. The unsafe is small and every assumption is checked or asserted.
Read the unsafe core
You’ve reached the floor of the data layer. The final Phase 4 lesson turns the lens around: how do you measure all this, from the per-value cost model and the benchmark suite to a custom allocator that counts every byte?