Arc — shared ownership across the engine

The Rust question for this lesson: Box (lesson 6) gave one owner a value on the heap. But ownership has been strictly single-owner this whole track — and real programs constantly need the same value owned by many places at once. Every record in a batch needs the same schema. Every row from a file shares that file’s name. None of those owners naturally outlives the others, and copying the value per owner would be wasteful. So how do you give a value several owners and free it exactly when the last one is done? With Arc — atomically reference-counted shared ownership.

Arc is the smart pointer Clinker reaches for everywhere a single value has to be shared widely and cheaply — one schema behind a million records.

Arc::new(value) puts value on the heap with a reference count attached. Every Arc::clone hands out another owning handle and bumps that count by one; every handle that drops lowers it by one. When the count hits zero — the last owner gone — the value is freed, exactly once. Crucially, Arc::clone does not copy the value: it copies a pointer and increments the counter. All handles point at the same heap allocation.

rust // editable
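A minimal sketch of the kind of program this playground runs. The `Shape` struct, its `name` field, and the handle names `first`, `a`, and `b` are assumptions made for illustration; only `Arc::new`, `Arc::clone`, `Arc::strong_count`, and `Arc::ptr_eq` come from the standard library.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the lesson's Shape type.
struct Shape {
    name: String,
}

/// Returns the strong count observed after each step.
fn count_steps() -> Vec<usize> {
    let first = Arc::new(Shape { name: String::from("circle") });
    let mut counts = vec![Arc::strong_count(&first)]; // one owner so far

    let a = Arc::clone(&first); // copies a pointer, bumps the count
    let b = Arc::clone(&first);
    counts.push(Arc::strong_count(&first)); // three owners

    // `a` is a second owner of the same allocation, not a second Shape.
    assert!(Arc::ptr_eq(&first, &a));
    assert_eq!(a.name, first.name);

    drop(b);
    counts.push(Arc::strong_count(&first)); // one handle fewer
    drop(a);
    counts.push(Arc::strong_count(&first)); // only `first` remains
    counts
}

fn main() {
    println!("{:?}", count_steps());
}
```

The Shape itself is freed when `first`, the last remaining handle, goes out of scope at the end of `count_steps`.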

Arc::ptr_eq returning true is the proof: a is not a second Shape, it’s a second owner of the one Shape. Three handles, one value, one eventual free.

// quick check

What does Arc::clone actually duplicate?

A Record doesn’t own a private Schema — it holds an Arc<Schema>, the same schema shared by every record in the batch:

clinker-record ·mod.rs ·Record type @47d2e12
pub struct Record {
    schema: Arc<Schema>, // shared: one Schema behind every record in the batch
    values: Vec<Value>,  // owned: this row's own cells
    // …
}

The doc comment on Record describes the same payoff for the document context each record carries: records from one source file “share that file’s Arc (one allocation per document, refcount-bump per record).” That is the whole idea in one line: allocate the big shared thing once, then give every record a cheap owning handle to it instead of a copy.
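To see "one allocation, refcount-bump per record" in numbers, here is a small sketch. The `Schema` and `Record` definitions below are simplified, hypothetical stand-ins (integer cells, a `columns` list) and not Clinker's real types; `build_batch` is likewise an illustrative helper.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins; not Clinker's real definitions.
struct Schema {
    columns: Vec<String>,
}

struct Record {
    schema: Arc<Schema>, // shared handle
    values: Vec<i64>,    // this row's own cells
}

/// Builds `rows` records that all share the caller's schema allocation.
fn build_batch(schema: &Arc<Schema>, rows: usize) -> Vec<Record> {
    (0..rows)
        .map(|i| Record {
            schema: Arc::clone(schema), // refcount bump, no Schema copy
            values: vec![i as i64; schema.columns.len()],
        })
        .collect()
}

fn main() {
    let schema = Arc::new(Schema {
        columns: vec!["id".to_string(), "amount".to_string()],
    });
    let batch = build_batch(&schema, 1_000);

    // One original handle plus one handle per record.
    println!("owners: {}", Arc::strong_count(&schema));
    println!(
        "first and last record share a schema: {}",
        Arc::ptr_eq(&batch[0].schema, &batch[999].schema)
    );
    println!("cells in row 7: {:?}", batch[7].values);
}
```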

The same pattern carries a record’s provenance — where it came from:

clinker-record ·provenance.rs ·RecordProvenance type @47d2e12
/// Tracks where a record came from in the source data.
/// Arc<str> fields are shared across all records from the same file/run.
pub struct RecordProvenance {
    pub source_file: Arc<str>, // the file name: one string, every row points to it
    pub source_row: u64,       // this row's own number
    pub source_batch: Arc<str>,
    pub ingestion_timestamp: NaiveDateTime,
}

Arc<str> is a shared, immutable string — the file name "data/input.csv" is allocated once and every row from that file holds a refcounted handle to it, not its own copy. The factory that stamps each row makes this explicit:

pub fn factory(source_file: Arc<str>, source_batch: Arc<str>, /* … */) -> impl FnMut(u64) -> Self {
    move |source_row| Self {
        source_file: Arc::clone(&source_file), // refcount bump, not a string copy
        source_row,
        source_batch: Arc::clone(&source_batch),
        // …
    }
}

Across a million-row file that’s one "data/input.csv" allocation and a million cheap refcount bumps — versus a million duplicated strings if these were plain Strings.
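The factory pattern can be tried in a scratch crate with a trimmed-down type. The `Provenance` struct below is a hypothetical reduction of `RecordProvenance` (no batch or timestamp fields), so treat it as a sketch of the idea rather than the engine's code.

```rust
use std::sync::Arc;

// Hypothetical, trimmed-down provenance type for illustration only.
struct Provenance {
    source_file: Arc<str>,
    source_row: u64,
}

impl Provenance {
    /// Returns a closure that stamps each row with a shared file name.
    fn factory(source_file: Arc<str>) -> impl FnMut(u64) -> Self {
        move |source_row| Self {
            source_file: Arc::clone(&source_file), // refcount bump, not a string copy
            source_row,
        }
    }
}

fn main() {
    // The string is allocated once here.
    let file: Arc<str> = Arc::from("data/input.csv");
    let mut stamp = Provenance::factory(Arc::clone(&file));

    let rows: Vec<Provenance> = (0..3).map(|n| stamp(n)).collect();

    // Every row points at the same string allocation.
    println!(
        "rows 0 and 2 share the name: {}",
        Arc::ptr_eq(&rows[0].source_file, &rows[2].source_file)
    );
    // Owners: `file`, the handle captured by `stamp`, and one per row.
    println!("owners: {}", Arc::strong_count(&file));
    println!("row {} came from {}", rows[1].source_row, rows[1].source_file);
}
```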

A batch is millions of Records drawn from the same source, all describing themselves with the same Schema and the same source-file string. Two non-Arc options both hurt: give each record its own Schema/String and you copy big shared data millions of times; thread a borrow (&Schema) through everything and you tie every record’s lifetime to one original owner that must outlive them all — unworkable when records flow through a pipeline and get buffered, reordered, and handed off. Arc is the third way: each record is a genuine, independent owner of the shared value, and the value lives precisely until the last record referencing it is gone.
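The lifetime half of that argument can be sketched as follows: records built inside a function keep their schema alive after the function's own handle is gone, then get reordered and handed off freely. `Schema`, `Record`, and `load` are hypothetical names chosen for this example.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for illustration.
struct Schema {
    columns: Vec<String>,
}

struct Record {
    schema: Arc<Schema>,
    values: Vec<String>,
}

/// Builds two records; the local `schema` handle drops when this returns.
fn load() -> Vec<Record> {
    let schema = Arc::new(Schema { columns: vec!["name".into()] });
    vec![
        Record { schema: Arc::clone(&schema), values: vec!["ada".into()] },
        Record { schema: Arc::clone(&schema), values: vec!["grace".into()] },
    ]
    // `schema` goes out of scope here, but the records still own the Schema.
}

fn main() {
    let mut records = load();
    records.reverse(); // reorder freely: no borrow ties them to `load`
    let last = records.pop().unwrap(); // hand one record off
    drop(records); // the other owner goes away

    // `last` is now the only owner, and the Schema is still valid.
    println!("owners: {}", Arc::strong_count(&last.schema));
    println!("{} = {}", last.schema.columns[0], last.values[0]);
}
```

With a `&Schema` field instead, `load` could not return its records at all, because the borrowed schema would not outlive the function.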

A picture of what’s shared versus owned:

┌─ Record #0 ─┐   ┌─ Record #1 ─┐   ┌─ Record #2 ─┐
│ values: […] │   │ values: […] │   │ values: […] │   ← each owns its own cells
│ schema: ●───┼─┐ │ schema: ●───┼─┐ │ schema: ●───┼─┐
└─────────────┘ │ └─────────────┘ │ └─────────────┘ │
                └─────────┬───────┴─────────────────┘
        heap: Arc<Schema> { refcount: 3, … }   ← allocated ONCE, freed
                                                 when the last record drops

Two halves — use the playground.

(a) In the Shape playground above, before each println! of strong_count, write down what you expect the count to be, then run and check. Add a third Arc::clone and a drop and predict the count after each.

(b) Paste the thread version below and run it (it works). Then change every Arc to Rc (use std::rc::Rc;) and recompile. Read the error, and explain — in Send terms — why the engine’s worker-thread model forces Arc.

rust // editable
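A minimal sketch of a thread version suitable for this exercise. The `Schema` struct and the helper `column_counts_from_workers` are assumptions made for illustration; only `thread::spawn`, `JoinHandle::join`, and the `Arc` calls are standard-library APIs.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the engine's schema type.
struct Schema {
    columns: Vec<String>,
}

/// Spawns `workers` threads that each read the shared schema.
fn column_counts_from_workers(schema: &Arc<Schema>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let shared = Arc::clone(schema); // one owning handle per worker
            // `move` sends the handle to the new thread; this requires Send.
            thread::spawn(move || shared.columns.len())
        })
        .collect();

    // Wait for every worker and collect what it read.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let schema = Arc::new(Schema {
        columns: vec!["id".into(), "amount".into()],
    });
    let counts = column_counts_from_workers(&schema, 4);
    println!("each worker saw {:?} columns", counts);
    // The workers' handles were dropped when their closures finished.
    println!("owners after join: {}", Arc::strong_count(&schema));
}
```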
💡 Hint 1

thread::spawn requires its closure (and everything it captures) to be Send, so the data can legally move onto another thread. Arc<Schema> is Send (provided Schema itself is Send + Sync); Rc<Schema> never is. The error reads roughly: `Rc<Schema>` cannot be sent between threads safely.

Show solution

(a) Counts go 1 → 3 (after two clones) → back down as handles drop. Each Arc::clone is +1, each drop is −1; the Shape is freed only when the count reaches 0.

(b) With Rc, the program fails to compile at thread::spawn with an error like `Rc<Schema>` cannot be sent between threads safely, because Rc is !Send. spawn demands Send so the captured value can move to the worker thread. Arc<T> is Send + Sync whenever T is (its refcount updates atomically), so it crosses the boundary safely. That’s the whole reason Clinker uses Arc rather than Rc: a Record holding an Arc<Schema> rides onto the executor’s worker thread, and Rc simply isn’t allowed there. Arc::clone copies no schema data either way, just a pointer and a refcount bump, so sharing stays cheap.

  • “Arc::clone deep-copies the value.” No: it copies a pointer and atomically bumps a reference count; every handle aliases the one heap value (provable with Arc::ptr_eq). Cloning an Arc<Schema> does not duplicate the schema; that’s the point.
  • “Arc and Rc are interchangeable.” Only when you never leave one thread. Rc uses a non-atomic count and is !Send, so it’s faster but single-threaded; Arc is Send + Sync (for a T that is Send + Sync) and pays a small atomic cost to be shared across threads. Clinker moves records across a worker-thread boundary, so it must be Arc.
  • “An Arc lets every owner mutate the shared value.” No: Arc<T> gives shared read access; you can’t get an &mut through it while it’s shared. Mutating shared data needs interior mutability (a Mutex/RwLock, or atomics), a topic still ahead.
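The third point can be checked with the standard library's `Arc::get_mut`, which hands out an `&mut` only while the handle is the sole owner. This is a small sketch; the helper `try_push` is a hypothetical name used for the demonstration.

```rust
use std::sync::Arc;

/// Pushes `value` only if `handle` is currently the sole owner.
/// Returns whether the mutation was allowed.
fn try_push(handle: &mut Arc<Vec<i32>>, value: i32) -> bool {
    match Arc::get_mut(handle) {
        Some(cells) => {
            cells.push(value);
            true
        }
        None => false, // another owner exists, so no &mut is handed out
    }
}

fn main() {
    let mut shared = Arc::new(vec![1]);
    println!("sole owner, push allowed: {}", try_push(&mut shared, 2));

    let other = Arc::clone(&shared);
    println!("while shared, push allowed: {}", try_push(&mut shared, 3));

    drop(other);
    println!("sole owner again, push allowed: {}", try_push(&mut shared, 3));
    println!("final cells: {:?}", shared);
}
```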

The provenance test proves the sharing claim directly — clones of an Arc<str> point at one allocation:

cargo test -p clinker-record test_provenance_arc_sharing

It builds two RecordProvenance rows from the same Arc<str> file name and asserts Arc::ptr_eq(&p1.source_file, &p2.source_file) — same pointer, one string, two owners. For the counting side, drop a dbg!(Arc::strong_count(&x)) into a scratch crate (never edit clinker source) and watch it move as you clone and drop.

That completes the ownership arc of the track: a value has one owner (lesson 4), you can borrow it (5), put it on the heap (6), or share it among many owners with Arc (7). Next the track turns from who holds a value to what a value can be: modeling success or failure in the type system with Result and the ? operator — how Clinker distinguishes a bad row it can route to a dead-letter queue from a fatal error that must stop the run.