unsafe & FieldStr

This is the lowest the data layer goes. Value::String is the dominant payload in an ETL stream, so its width sets the per-Value cost — the very number that drives the RSS and spill thresholds from the last four lessons. To keep Value at its 32-byte budget (Phase 2) and support three different string-ownership strategies, clinker hand-packs them into a 24-byte type called FieldStr, using unsafe. It’s the engine’s deepest use of unsafe Rust — and, more usefully, a model of how to use it responsibly: a tiny unsafe core, fenced in by layout assertions and documented invariants.

You’ll be able to: explain FieldStr’s three arms, read an unsafe block guarded by a SAFETY comment, and explain how layout assertions make the unsafe sound.

A field string is one of three things, and which one is a memory-cost decision:

  • inline — short values (≤ 23 bytes: ids, codes, flags) live in the struct, zero heap allocation;
  • shared — longer values that repeat or get cloned across stages are Arc<str>-backed, so a clone is an O(1) refcount bump, not a copy;
  • unique — long values flagged unique by the schema are Box<str>-backed, dropping the ~16-byte Arc refcount header that’s pure waste when a value never repeats (UUIDs, addresses, notes).
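
To see why the three strategies are hand-packed rather than written as an ordinary enum, here is a small sketch. `NaiveFieldStr` and `clone_shares_allocation` are hypothetical names invented for this illustration, not clinker code, and the sizes assume a 64-bit target.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical "obvious" encoding of the three strategies as a plain enum.
/// Not clinker's type; it exists only to compare sizes.
#[allow(dead_code)]
enum NaiveFieldStr {
    Inline { data: [u8; 23], len: u8 },
    Shared(Arc<str>),
    Unique(Box<str>),
}

/// Cloning a shared string bumps the refcount and returns a handle
/// to the same allocation; no bytes are copied.
fn clone_shares_allocation(s: &Arc<str>) -> bool {
    let copy = Arc::clone(s);
    Arc::ptr_eq(s, &copy) && Arc::strong_count(s) >= 2
}

fn main() {
    // Both heap handles are fat pointers: address + length.
    println!("Arc<str>: {} bytes", size_of::<Arc<str>>());
    println!("Box<str>: {} bytes", size_of::<Box<str>>());

    // The inline variant already fills 24 bytes with arbitrary data, so the
    // enum's own discriminant must live outside them and the type grows past 24.
    println!("NaiveFieldStr: {} bytes", size_of::<NaiveFieldStr>());

    let shared: Arc<str> = Arc::from("a value that repeats across stages");
    println!("clone shares allocation: {}", clone_shares_allocation(&shared));
}
```

The enum cannot fit in 24 bytes because its inline variant leaves no spare bit pattern for the discriminant. Reusing the length byte as the discriminant is what recovers that space.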

All three share one 24-byte footprint via a private union, discriminated by a single trailing tag byte:

clinker-record ·field_str.rs ·FieldStr type @47d2e12
#[repr(C)]
union Repr {
    inline: Inline,   // [u8; 23] of UTF-8 + a tag byte
    heap: HeapHeader, // ptr + len + padding + the same tag byte
}

pub struct FieldStr {
    repr: Repr, // inline bytes, an Arc<str> (shared), or a Box<str> (unique)
}

FieldStr: 24 bytes, two overlapping arms (a union), ONE shared tag byte:
inline: [ 23 bytes of UTF-8 ........................ ][ tag = len 0..=23 ]
heap:   [ ptr: *const u8 ][ len: usize ][ padding ...][ tag = 0xFF/0xFE ]
                                                      ^ same offset (INLINE_CAP) in both arms

The trailing byte does double duty: for the inline arm it’s the length (always ≤ 23); for the heap arms it’s a sentinel (0xFF shared, 0xFE unique) chosen above 23 so it can never be mistaken for an inline length. That overlap is the trick that hits 24 bytes — and the reason unsafe is unavoidable: reading a field of a union is unsafe, because the compiler can’t know which arm is currently valid.
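
The double-duty tag byte can be tried out in a stripped-down form. The sketch below is not clinker's code: the `_pad` field, the `Kind` enum, and the `kind_of`, `inline_repr`, and `heap_repr` helpers are assumptions made for illustration. The heap arm borrows a `'static` string instead of owning an Arc or Box, and the layout assumes a 64-bit target.

```rust
use std::mem::size_of;

const INLINE_CAP: usize = 23;
const TAG_SHARED: u8 = 0xFF;
const TAG_UNIQUE: u8 = 0xFE;

#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct Inline {
    data: [u8; INLINE_CAP],
    tag: u8, // byte length, 0..=23
}

#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct HeapHeader {
    ptr: *const u8,
    len: usize,
    _pad: [u8; 7], // explicit padding so the tag lands on byte 23 (64-bit)
    tag: u8,       // TAG_SHARED or TAG_UNIQUE
}

#[repr(C)]
union Repr {
    inline: Inline,
    heap: HeapHeader,
}

#[derive(Debug, PartialEq)]
enum Kind {
    Inline(usize),
    Shared,
    Unique,
}

fn inline_repr(s: &str) -> Repr {
    assert!(s.len() <= INLINE_CAP);
    let mut data = [0u8; INLINE_CAP];
    data[..s.len()].copy_from_slice(s.as_bytes());
    Repr { inline: Inline { data, tag: s.len() as u8 } }
}

// Borrows static data purely so the pointer has something to point at.
fn heap_repr(s: &'static str, tag: u8) -> Repr {
    Repr { heap: HeapHeader { ptr: s.as_ptr(), len: s.len(), _pad: [0; 7], tag } }
}

fn kind_of(repr: &Repr) -> Kind {
    // SAFETY: both arms are repr(C) with a `u8` tag at byte 23, and both
    // constructors above initialise it, so reading the tag through the
    // inline arm is valid whichever arm was written.
    let tag = unsafe { repr.inline.tag };
    match tag {
        TAG_SHARED => Kind::Shared,
        TAG_UNIQUE => Kind::Unique,
        len if (len as usize) <= INLINE_CAP => Kind::Inline(len as usize),
        other => unreachable!("invalid tag {other:#x}"),
    }
}

fn main() {
    // 64-bit assumption: 8 + 8 + 7 + 1 == 23 + 1 == 24.
    assert_eq!(size_of::<Repr>(), 24);
    println!("{:?}", kind_of(&inline_repr("EUR")));
    println!("{:?}", kind_of(&heap_repr("a string longer than twenty-three bytes", TAG_SHARED)));
    println!("{:?}", kind_of(&heap_repr("a string longer than twenty-three bytes", TAG_UNIQUE)));
}
```

Only the tag read needs `unsafe` here. Writing a union field is safe; it is reading one that requires the programmer to vouch for which arm is live.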

clinker-record ·field_str.rs ·INLINE_CAP doc @47d2e12

Here’s the read path. as_str checks the tag, then reads the matching arm — each read in an unsafe block prefaced by a // SAFETY: comment stating the invariant that makes it sound:

pub fn as_str(&self) -> &str {
    let tag = self.tag();
    if tag <= INLINE_CAP as u8 {
        // SAFETY: inline arm. `tag` is the byte length (<= INLINE_CAP), and the
        // constructor only ever wrote valid UTF-8 into `data[..len]`.
        unsafe {
            let inline = &self.repr.inline;
            std::str::from_utf8_unchecked(&inline.data[..tag as usize])
        }
    } else {
        // SAFETY: heap arm. ptr/len came from Arc::into_raw / Box::into_raw on a
        // `str`, so the bytes are live, owned by this FieldStr, and valid UTF-8.
        unsafe {
            let heap = &self.repr.heap;
            let bytes = std::slice::from_raw_parts(heap.ptr, heap.len);
            std::str::from_utf8_unchecked(bytes)
        }
    }
}

unsafe doesn’t mean “unchecked and hope.” It means the compiler can’t verify this, so I’m asserting an invariant I’m responsible for. The SAFETY comment is that assertion written down. The Drop impl is the same discipline in reverse: it reconstitutes the Arc/Box from the raw pointer to release exactly one ownership token, with a SAFETY note that each heap arm owns exactly one such token.
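
The Drop discipline described above can be sketched for the shared arm alone. `SharedRaw` and `counts_around_drop` are hypothetical names for this illustration. clinker's real Drop impl also handles the inline and unique arms, which this sketch leaves out.

```rust
use std::sync::Arc;

/// Hypothetical owner of one `Arc<str>` ownership token, stored as a raw
/// pointer plus length the way the heap arm is described.
struct SharedRaw {
    ptr: *const u8,
    len: usize,
}

impl SharedRaw {
    fn new(s: Arc<str>) -> Self {
        let len = s.len();
        // into_raw hands the refcount token to us without decrementing it.
        let ptr = Arc::into_raw(s) as *const u8;
        SharedRaw { ptr, len }
    }

    fn as_str(&self) -> &str {
        // SAFETY: ptr/len came from Arc::into_raw on a `str` in `new`, and the
        // token is held until Drop, so the bytes are live and valid UTF-8.
        unsafe {
            std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.ptr, self.len))
        }
    }
}

impl Drop for SharedRaw {
    fn drop(&mut self) {
        // Rebuild the fat `*const str` from the stored address and length.
        let fat = std::ptr::slice_from_raw_parts(self.ptr, self.len) as *const str;
        // SAFETY: `fat` has the address and length that Arc::into_raw returned
        // in `new`, and this value owns exactly one token, so rebuilding the
        // Arc once and dropping it releases exactly that token.
        unsafe { drop(Arc::from_raw(fat)) };
    }
}

/// Returns the strong count while a SharedRaw holds a token, and after it is dropped.
fn counts_around_drop() -> (usize, usize) {
    let original: Arc<str> = Arc::from("a long value cloned across stages");
    let raw = SharedRaw::new(Arc::clone(&original));
    let while_held = Arc::strong_count(&original);
    assert_eq!(raw.as_str(), &*original);
    drop(raw);
    (while_held, Arc::strong_count(&original))
}

fn main() {
    let (held, released) = counts_around_drop();
    println!("strong count while held: {held}, after drop: {released}");
}
```

A drop/round-trip test of this shape checks the "exactly one token" claim directly: releasing the token twice would be a double free, and never releasing it would be a leak.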

What makes it actually sound: layout assertions

A SAFETY comment is a promise; the layout assertions are how the promise is kept. The whole design rests on the tag byte sitting at the same offset in both union arms — so reading the tag through either arm reads the live discriminant. A compile-time block pins exactly that:

const _: () = {
    assert!(std::mem::size_of::<FieldStr>() == 24); // the Value-budget invariant
    assert!(std::mem::offset_of!(Inline, tag) == INLINE_CAP); // tag offset coincides...
    assert!(std::mem::offset_of!(HeapHeader, tag) == INLINE_CAP); // ...in both arms
    assert!(TAG_SHARED as usize > INLINE_CAP); // sentinels can't be inline lengths
    assert!(TAG_UNIQUE as usize > INLINE_CAP);
};

If anyone changes a field and breaks the layout, the build fails, so the unsafe code can never be compiled against a layout it wasn’t written for. Send and Sync need the same care. They are auto traits, and a raw pointer is neither Send nor Sync, so the pointer inside the union would leave FieldStr without either by default. They are therefore hand-written unsafe impls with their own SAFETY note (every arm’s backing is Send + Sync; the raw pointer is only ever an owned Arc/Box in disguise). A runtime test pins the size too:

clinker-record ·field_str.rs ·size_is_24_bytes test @47d2e12
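
The hand-written Send/Sync impls can be sketched on a smaller type. `RawShared`, `assert_send_sync`, and `len_on_other_thread` are hypothetical names for this illustration, and the sketch covers only an Arc-backed pointer, not FieldStr itself.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical wrapper: an owned `Arc<str>` stored as a raw pointer.
/// The raw pointer alone makes this type neither Send nor Sync.
struct RawShared {
    ptr: *const str,
}

// SAFETY: the pointer is only ever an owned `Arc<str>` in disguise, and
// `Arc<str>` is Send + Sync. RawShared exposes only shared reads of
// immutable bytes, so moving or sharing it across threads is sound.
unsafe impl Send for RawShared {}
unsafe impl Sync for RawShared {}

impl RawShared {
    fn new(s: Arc<str>) -> Self {
        RawShared { ptr: Arc::into_raw(s) }
    }

    fn as_str(&self) -> &str {
        // SAFETY: ptr came from Arc::into_raw and the token is held until Drop.
        unsafe { &*self.ptr }
    }
}

impl Drop for RawShared {
    fn drop(&mut self) {
        // SAFETY: releases the single token taken in `new`.
        unsafe { drop(Arc::from_raw(self.ptr)) };
    }
}

/// Compile-time check: this call only type-checks if T is Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

/// Moves the value into another thread and reads it there.
fn len_on_other_thread(value: RawShared) -> usize {
    thread::spawn(move || value.as_str().len()).join().unwrap()
}

fn main() {
    assert_send_sync::<RawShared>();
    let v = RawShared::new(Arc::from("moved across a thread boundary"));
    println!("length read on another thread: {}", len_on_other_thread(v));
}
```

With the two `unsafe impl` lines removed, both `assert_send_sync::<RawShared>()` and the `thread::spawn` call stop compiling. That is the default the hand-written impls override.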

This is the responsible-unsafe pattern, and it’s worth internalising: keep the unsafe core small, pin every layout invariant it relies on with a static assertion, document each block’s SAFETY contract, and back it with a drop/round-trip test. Unsafe earns its keep here — one allocation saved per short field, across billions of fields — precisely because it’s this tightly fenced.

You can’t write a union in a quick playground, but you can practise the core move: skipping a check you can prove is unnecessary, and writing the proof down. This is exactly what the inline arm of as_str does:
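
A minimal sketch of that move, assuming a hypothetical fixed-size `Code` type (not part of clinker) whose only constructor copies bytes out of a `&str`:

```rust
/// Hypothetical 8-byte inline string, loosely modelled on the inline arm.
struct Code {
    data: [u8; 8],
    len: u8,
}

impl Code {
    /// The only constructor: copies bytes out of a `&str`, so `data[..len]`
    /// is always valid UTF-8. Returns None if the input does not fit.
    fn new(s: &str) -> Option<Self> {
        if s.len() > 8 {
            return None;
        }
        let mut data = [0u8; 8];
        data[..s.len()].copy_from_slice(s.as_bytes());
        Some(Code { data, len: s.len() as u8 })
    }

    /// Skips UTF-8 validation because the constructor already guarantees it.
    fn as_str(&self) -> &str {
        let bytes = &self.data[..self.len as usize];
        // SAFETY: `new` is the only way to build a Code, and it copies exactly
        // `len` bytes from a `&str`, so `bytes` is valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }

    /// The safe equivalent, which re-validates on every call.
    fn as_str_checked(&self) -> &str {
        std::str::from_utf8(&self.data[..self.len as usize]).expect("invariant broken")
    }
}

fn main() {
    let code = Code::new("EUR").expect("fits in 8 bytes");
    // Both paths return the same string; the unchecked one skips the scan.
    assert_eq!(code.as_str(), code.as_str_checked());
    println!("{}", code.as_str());
}
```

The fields are private to the module and `new` is the only writer, which is what turns the SAFETY comment into a checkable claim instead of a hope.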

The unsafe block is sound because the SAFETY comment’s claim is true — the bytes really are valid UTF-8. Feed from_utf8_unchecked genuinely invalid bytes and you’d have undefined behaviour with no error; the entire responsibility moves from the compiler to you, which is why the invariant must be real and written down. That’s the deal unsafe offers, and FieldStr takes it only where the payoff is large and the invariant is pinned by an assertion.

// quick check

What makes FieldStr's unsafe union reads sound rather than reckless?

You’ve reached the floor of the data layer. The final Phase 4 lesson turns the lens around: how do you measure all this — the per-value cost model, the benchmark suite, and a custom allocator that counts every byte?