← All writing
12 October 2025·8 min

Why I Built a Lock-Free Ring Buffer Before I Needed One

Memory ordering, false sharing, and why a 64-byte alignment hint earned a 3.1× throughput win.

The first time I wrote a producer-consumer queue, I reached for a mutex. It worked. It was also wrong for the problem I would eventually face: streaming 1 kHz IMU samples into a processing pipeline with bounded jitter. Mutexes give you correctness, but they hand you priority inversion and unbounded blocking as a bonus.

The shape of the problem

Single producer, single consumer. Fixed-size circular buffer. Wait-free on both ends. The only synchronisation primitives I needed were two atomic indices and the right memory ordering on each one.

alignas(64) std::atomic<size_t> head_{0};
alignas(64) std::atomic<size_t> tail_{0};

False sharing is invisible until you measure it

Without the alignas(64), head and tail land on the same cache line. Every producer write invalidates the consumer's cached copy of tail, and vice versa. perf stat told the story: L1-dcache-load-misses dropped sharply once the indices lived on separate lines.

Acquire/release, not seq_cst

memory_order_seq_cst is the safe default and the wrong default for hot paths. The producer releases the new head; the consumer acquires it. That's the only edge that needs ordering. ThreadSanitizer agreed across 50 million operations: zero races.

What the numbers said

3.1× throughput vs. mutex-guarded baseline

Sub-350 ns write latency at p99

Deterministic jitter bounds for 1 kHz ingestion

Build the primitive before the system needs it. By the time the deadline shows up, you want the boring, measured choice already on the shelf.
C++17ConcurrencyReal-time

Next essay

Closed-Loop Actuator Control for Cochlear Surgery: Lessons from the Lab