Why I Built a Lock-Free Ring Buffer Before I Needed One
Memory ordering, false sharing, and why a 64-byte alignment hint earned a 3.1× throughput win.
The first time I wrote a producer-consumer queue, I reached for a mutex. It worked. It was also wrong for the problem I would eventually face: streaming 1 kHz IMU samples into a processing pipeline with bounded jitter. Mutexes give you correctness, but they hand you priority inversion and unbounded blocking as a bonus.
The shape of the problem
Single producer, single consumer. Fixed-size circular buffer. Wait-free on both ends. The only synchronisation primitives I needed were two atomic indices and the right memory ordering on each one.
alignas(64) std::atomic<size_t> head_{0};
alignas(64) std::atomic<size_t> tail_{0};
False sharing is invisible until you measure it
Without the alignas(64), head_ and tail_ can land on the same cache line. Every producer store to head_ then invalidates the line that also holds tail_ in the consumer's cache, and vice versa, even though neither thread ever writes the other's index. perf stat told the story: L1-dcache-load-misses dropped sharply once the indices lived on separate lines.
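The layout difference can be checked at compile time. The sketch below is illustrative only: PackedIndices, PaddedIndices, and kCacheLine are hypothetical names, and the 64-byte line size is an assumption that holds on most x86-64 and many ARM cores but is not universal.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache-line size; verify for the target CPU.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout without padding: both indices are small enough
// to sit next to each other inside a single cache line.
struct PackedIndices {
    std::atomic<std::size_t> head{0};
    std::atomic<std::size_t> tail{0};
};

// Hypothetical layout with padding: each index starts on its own
// 64-byte boundary, so producer and consumer never share a line.
struct PaddedIndices {
    alignas(64) std::atomic<std::size_t> head{0};
    alignas(64) std::atomic<std::size_t> tail{0};
};

// The packed pair fits inside one line, so false sharing is possible.
static_assert(sizeof(PackedIndices) <= kCacheLine,
              "packed indices fit in one cache line");

// The padded pair is line-aligned and spans exactly two lines.
static_assert(alignof(PaddedIndices) == kCacheLine,
              "padded struct is aligned to a cache line");
static_assert(sizeof(PaddedIndices) == 2 * kCacheLine,
              "each padded index occupies its own cache line");
```

The cost is memory: two indices now take 128 bytes instead of 16. For a single long-lived queue that is usually an easy trade.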
Acquire/release, not seq_cst
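memory_order_seq_cst is the safe default and the wrong default for hot paths. The producer releases the new head and the consumer acquires it; in the other direction, the consumer releases the new tail and the producer acquires it before reusing a slot. Those are the only edges that need ordering, and each thread can read its own index with a relaxed load. ThreadSanitizer agreed across 50 million operations: zero races.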
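A minimal sketch of how the acquire/release pairing might look in a single-producer, single-consumer ring. This is not the original implementation: SpscRing, try_push, try_pop, and slots_ are hypothetical names, and it assumes a power-of-two capacity and a default-constructible, copyable T whose alignment does not exceed 64 bytes. It also puts a release/acquire pair on tail_, so the producer cannot overwrite a slot the consumer is still reading.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative SPSC ring buffer; exactly one thread may call try_push
// and exactly one (other) thread may call try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false if the buffer is full.
    bool try_push(const T& value) {
        // Only the producer writes head_, so a relaxed load is enough.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store to tail_:
        // the consumer has finished reading any slot we may reuse.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            return false;
        }
        slots_[head & (Capacity - 1)] = value;
        // Release publishes the slot write before the new head is visible.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns std::nullopt if the buffer is empty.
    std::optional<T> try_pop() {
        // Only the consumer writes tail_, so a relaxed load is enough.
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to head_.
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return std::nullopt;
        }
        T value = slots_[tail & (Capacity - 1)];
        // Release hands the slot back to the producer after the read.
        tail_.store(tail + 1, std::memory_order_release);
        return value;
    }

private:
    // Indices count up without wrapping to Capacity; masking maps them
    // to slots, and unsigned subtraction gives the fill level.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    // Keep the payload off the tail_ line as well (assumes alignof(T) <= 64).
    alignas(64) std::array<T, Capacity> slots_{};
};
```

Neither operation loops or retries, so each call finishes in a bounded number of steps. A caller that sees try_push return false has to decide what to do with the sample (drop it, overwrite, or count the overrun); the sketch leaves that policy out.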
What the numbers said
3.1× throughput vs. mutex-guarded baseline
Sub-350 ns write latency at p99
Deterministic jitter bounds for 1 kHz ingestion
Build the primitive before the system needs it. By the time the deadline shows up, you want the boring, measured choice already on the shelf.