15 May 2025·10 min

Why I Spent a Week Writing a 150-Line Header File (And What It Taught Me About Modern CPUs)

A wait-free SPSC ring buffer in 150 lines of C++17: 3.1× throughput over mutexes, ~87 ns p99 latency, and the cache-line lesson that changed how I think about concurrency.

[Image: macro photograph of a bare CPU die with metallic blue cache geometry. Caption: Modern silicon. The cache line is where lock-free code lives or dies.]

There is a category of software problem that looks trivially simple until you sit down to solve it correctly. Passing data between two threads is one of them.

Any first-year student can write a producer-consumer queue: put a mutex around a std::queue, call push() and pop(), done. It compiles, it runs, it passes tests. And in a lot of applications it is perfectly fine.

But in real-time systems, embedded sensors, avionics data buses, radar signal processors, anything where the cost of a missed deadline is measured in physical consequences rather than slow page loads, the mutex approach has a problem. That problem has a name: priority inversion. And solving it properly sends you down a fascinating rabbit hole of CPU architecture, memory model theory, and C++ atomics.

This post is about the SPSC ring buffer I built in C++17: a wait-free, lock-free Single-Producer Single-Consumer circular buffer that benchmarks at 3.1× the throughput of a mutex-guarded queue and achieves ~87 ns p99 write latency, verified as data-race-free by ThreadSanitizer across 50,000,000 operations.

The problem with mutexes in real-time systems

Here is the scenario. You have a high-priority thread reading data from an IMU sensor at 1 kHz. You have a lower-priority thread processing that data. They communicate through a mutex-guarded queue.

Now suppose the low-priority thread holds the mutex: it is in the middle of a pop() when the operating system decides to preempt it. Your high-priority sensor thread runs, tries to push() a sample, and blocks on the mutex. The high-priority thread is now waiting for the low-priority thread to be rescheduled. This is priority inversion: a high-priority task is blocked by a low-priority one.

In a real-time embedded system, this can cause a sensor sample to be dropped, a control loop to miss its deadline, or, in safety-critical applications, a failure that has physical consequences.
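To make the scenario concrete, here is a minimal sketch of the kind of mutex-guarded queue described above. The class name MutexQueue and its interface are illustrative assumptions, not code from the project; the comments mark where a preempted lock holder stalls the other thread.

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Hypothetical baseline: a std::queue guarded by a std::mutex.
template <typename T>
class MutexQueue {
    std::mutex m_;
    std::queue<T> q_;

public:
    void push(const T& item) {
        // Blocks for as long as another thread holds m_, whatever its priority.
        std::lock_guard<std::mutex> lock(m_);
        q_.push(item);
    }

    bool try_pop(T& item) {
        // If this thread is preempted while holding the lock, every push()
        // on the other thread waits until it is scheduled again.
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty())
            return false;
        item = q_.front();
        q_.pop();
        return true;
    }
};

// Single-threaded smoke check of the sketch above.
inline void mutex_queue_example() {
    MutexQueue<int> q;
    int out = 0;
    assert(!q.try_pop(out));              // empty queue: nothing to pop
    q.push(42);
    assert(q.try_pop(out) && out == 42);  // round-trip
    (void)out;
}
```

Nothing in this sketch is wrong functionally. The problem is purely one of timing: how long push() takes depends on when the scheduler next runs the other thread.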

The fix is to eliminate the mutex entirely.

Lock-free and wait-free — what is the difference?

These terms are often used interchangeably. They are not the same.

Lock-free means the system as a whole makes progress, even if individual threads might retry. Compare-and-swap (CAS) loops are lock-free: if thread A fails its CAS, it retries, but thread B succeeded, so the system progressed.

Wait-free is stronger: every thread makes progress in a bounded number of steps, regardless of what other threads are doing. No thread can be indefinitely delayed by another.
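A small sketch may help separate the two guarantees. The function names below are hypothetical and unrelated to the ring buffer: the first is a classic CAS retry loop (lock-free), the second a single atomic add, which is wait-free on hardware with a native instruction for it, such as LOCK XADD on x86.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Lock-free: keep a running maximum with a CAS loop. When the CAS fails,
// `seen` is refreshed with the current value and this thread tries again.
// Some thread always makes progress, but there is no fixed bound on how
// many times one particular thread may have to retry.
inline void store_max_lock_free(std::atomic<std::uint64_t>& target,
                                std::uint64_t value) {
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    while (seen < value &&
           !target.compare_exchange_weak(seen, value,
                                         std::memory_order_relaxed)) {
        // Retry with the refreshed `seen`.
    }
}

// Wait-free (on x86): one fetch_add, no retry loop, so the number of steps
// is bounded no matter what other threads are doing.
inline std::uint64_t next_ticket_wait_free(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

inline void progress_example() {
    std::atomic<std::uint64_t> peak{5};
    store_max_lock_free(peak, 3);  // smaller value: nothing changes
    store_max_lock_free(peak, 9);  // larger value: the CAS installs it
    assert(peak.load() == 9);

    std::atomic<std::uint64_t> tickets{0};
    const std::uint64_t first = next_ticket_wait_free(tickets);
    const std::uint64_t second = next_ticket_wait_free(tickets);
    assert(first == 0 && second == 1);
    (void)first;
    (void)second;
}
```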

My SPSC ring buffer is wait-free. try_push() and try_pop() each execute in O(1) steps with no loops, no retries, no blocking. The worst case is that the buffer is full (push returns false) or empty (pop returns false), but the thread never waits for another thread to do anything.

This is possible because of the single-producer single-consumer constraint. With exactly one writer and one reader, you do not need CAS. You only need carefully ordered atomic loads and stores.

The data structure

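#include <array>    // std::array
#include <atomic>   // std::atomic, std::memory_order
#include <cstddef>  // std::size_t

template <typename T, std::size_t Capacity>
class SpscRingBuffer {
    // Capacity 0 would pass the bit test and Capacity 1 leaves no usable
    // slot, so require at least 2.
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of 2 (and at least 2)");

    alignas(64) std::atomic<std::size_t> head_{0};  // next slot to write (producer)
    alignas(64) std::atomic<std::size_t> tail_{0};  // next slot to read (consumer)
    std::array<T, Capacity> buf_;

    static constexpr std::size_t MASK = Capacity - 1;

public:
    [[nodiscard]] bool try_push(const T& item) noexcept;
    [[nodiscard]] bool try_pop(T& item) noexcept;
    [[nodiscard]] std::size_t size_approx() const noexcept;
    [[nodiscard]] bool empty() const noexcept;
    static constexpr std::size_t capacity() noexcept { return Capacity; }
};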

Three design decisions worth explaining immediately:

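Power-of-2 capacity. Index wrapping with index & MASK is a single bitwise AND, with no integer division involved. Compilers usually make the same reduction for index % Capacity when Capacity is a compile-time power of 2, but the mask makes it explicit. The static_assert enforces the power-of-2 requirement at compile time.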

alignas(64) on both atomics. This is the most important optimisation in the whole file. I will explain it shortly.

[[nodiscard]] on try_push and try_pop. If you ignore the return value of try_pop, you have probably introduced a bug: you think you consumed an item, but the buffer was empty. The compiler will warn you at the call site.
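The power-of-2 requirement can be checked with a few compile-time assertions. This is a sketch only; the helper name wrap is an assumption that mirrors the MASK = Capacity - 1 idea and is not part of the header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: wrap an index into [0, Capacity) with a mask.
template <std::size_t Capacity>
constexpr std::size_t wrap(std::size_t index) noexcept {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of 2");
    return index & (Capacity - 1);
}

// For power-of-2 capacities the mask and the modulo agree.
static_assert(wrap<8>(7) == 7, "last slot maps to itself");
static_assert(wrap<8>(8) == 0, "one past the end wraps to slot 0");
static_assert(wrap<8>(13) == 13 % 8, "mask equals modulo");

// Why other capacities are rejected: 6 & (6 - 1) == 4, so 6 fails the
// power-of-2 test, and masking with 5 is not the same as modulo 6.
static_assert((6 & (6 - 1)) != 0, "6 is not a power of 2");
static_assert((2 & 5) != (2 % 6), "index & 5 would map slot 2 to slot 0");

inline void wrap_example() {
    for (std::size_t i = 0; i < 64; ++i)
        assert(wrap<16>(i) == i % 16);
}
```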

The push and pop operations

template <typename T, std::size_t Capacity>
bool SpscRingBuffer<T, Capacity>::try_push(const T& item) noexcept {
    const std::size_t h = head_.load(std::memory_order_relaxed);
    const std::size_t next_h = (h + 1) & MASK;

    if (next_h == tail_.load(std::memory_order_acquire))
        return false;  // buffer full

    buf_[h] = item;
    head_.store(next_h, std::memory_order_release);
    return true;
}

template <typename T, std::size_t Capacity>
bool SpscRingBuffer<T, Capacity>::try_pop(T& item) noexcept {
    const std::size_t t = tail_.load(std::memory_order_relaxed);

    if (t == head_.load(std::memory_order_acquire))
        return false;  // buffer empty

    item = buf_[t];
    tail_.store((t + 1) & MASK, std::memory_order_release);
    return true;
}

About a dozen lines of code. Every memory ordering annotation is deliberate. Let me go through each one.
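Before looking at the orderings, a single-threaded walk-through shows the observable behaviour of the two functions. The class below is a compact restatement of the listing above with the bodies moved inline so the example compiles on its own; fill_and_drain_example is a hypothetical helper. Note that a buffer with Capacity slots accepts only Capacity - 1 items, because the push path never lets head_ catch up with tail_.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t Capacity>
class SpscRingBuffer {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of 2");
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
    std::array<T, Capacity> buf_;
    static constexpr std::size_t MASK = Capacity - 1;

public:
    [[nodiscard]] bool try_push(const T& item) noexcept {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        const std::size_t next_h = (h + 1) & MASK;
        if (next_h == tail_.load(std::memory_order_acquire))
            return false;  // buffer full
        buf_[h] = item;
        head_.store(next_h, std::memory_order_release);
        return true;
    }
    [[nodiscard]] bool try_pop(T& item) noexcept {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;  // buffer empty
        item = buf_[t];
        tail_.store((t + 1) & MASK, std::memory_order_release);
        return true;
    }
};

// Pushes until the buffer reports full, then drains it and checks FIFO
// order. Returns the number of pushes that were accepted.
inline std::size_t fill_and_drain_example() {
    SpscRingBuffer<int, 4> rb;
    std::size_t accepted = 0;
    for (int i = 0; i < 10; ++i)
        if (rb.try_push(i))
            ++accepted;  // only Capacity - 1 pushes succeed

    int out = -1;
    for (std::size_t i = 0; i < accepted; ++i) {
        const bool ok = rb.try_pop(out);
        assert(ok && out == static_cast<int>(i));  // items come out in order
        (void)ok;
    }
    const bool drained = !rb.try_pop(out);  // empty again
    assert(drained);
    (void)drained;
    return accepted;
}
```

With Capacity = 4 the helper returns 3: the fourth push is refused even though one slot is physically free.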

Memory ordering — the part most tutorials skip

The C++ memory model does not guarantee that memory writes made by one thread are immediately visible to another thread, or that they are seen in the order they were written. The compiler and the CPU are both free to reorder operations in ways that preserve single-thread semantics but can break multi-threaded code.

std::atomic operations with explicit memory ordering annotations tell the compiler and CPU what constraints apply. There are six orderings (relaxed, consume, acquire, release, acq_rel and seq_cst); I use three:

memory_order_relaxed: just do the atomic operation atomically. No ordering guarantees relative to other memory operations.

memory_order_acquire: no memory reads or writes in the current thread can be reordered to before this load. It synchronises with a corresponding release store.

memory_order_release: no memory reads or writes in the current thread can be reordered to after this store. Any thread that subsequently does an acquire load of this atomic will see all writes that preceded the release.

Here is why each annotation is what it is. In try_push, the head_.load(memory_order_relaxed) is relaxed because the producer is the only thread that ever writes head_, so it always reads its own latest value and has nothing to synchronise with. The tail_.load(memory_order_acquire) is acquire because it pairs with the consumer's tail_.store(release): once the producer sees that a slot has been freed, the consumer's read of that slot (item = buf_[t]) is guaranteed to have completed before the producer overwrites it. Acquire does not make the load any fresher. A stale tail_ is harmless, since at worst the buffer looks full and the push returns false; the real bug would be overwriting a slot the consumer is still reading. The buf_[h] = item data write must happen before the head_.store(next_h, memory_order_release), and release guarantees exactly that. A consumer whose head_.load(acquire) reads the new head_ value is also guaranteed to see the data we just wrote.

The critical synchronisation edges are: producer head_.store(release) synchronises-with consumer head_.load(acquire), so the consumer is guaranteed to see buf_[h] = item. And consumer tail_.store(release) synchronises-with producer tail_.load(acquire), so the producer only reuses a slot after the consumer has finished reading it. If I had used memory_order_seq_cst throughout, the code would still be correct but slower: on x86 a seq_cst store needs a full barrier (compilers emit either XCHG or a MOV followed by MFENCE), which costs roughly 20–40 ns per operation and defeats most of the performance advantage of going lock-free.
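The same release/acquire pairing can be shown in isolation with the classic message-passing pattern. This is a sketch with illustrative names: payload plays the role of buf_[h] and ready plays the role of head_. The spin loop exists only to keep the demo short; the ring buffer itself never spins.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

inline int release_acquire_example() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // (1) write the data
        ready.store(true, std::memory_order_release);  // (2) publish it
    });

    std::thread consumer([&] {
        // The acquire load that reads `true` synchronises-with the release
        // store in (2), so the write in (1) happens-before the read in (3).
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        observed = payload;                            // (3) reads 42
    });

    producer.join();
    consumer.join();
    assert(observed == 42);
    return observed;
}
```

If both operations on ready were relaxed, the read in (3) would race with the write in (1), which is undefined behaviour in C++ even on hardware where it happens to work.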

Cache-line padding — the invisible performance killer

Here is the thing that surprised me most when I first measured it. Even with correct memory ordering, a naive implementation that puts head_ and tail_ adjacent in memory is significantly slower than it should be. The reason is false sharing.

Modern x86-64 CPUs cache memory in 64-byte lines. When a core writes to a memory location, it takes exclusive ownership of the entire 64-byte cache line containing that location. Any other core that wants to read or write anywhere in that same line has to wait.

If head_ and tail_ share a cache line, which they will by default, since they are adjacent 8-byte values, then every time the producer writes head_, it invalidates the cache line that the consumer is reading tail_ from. And every time the consumer writes tail_, it invalidates the line the producer is reading from. They are constantly thrashing each other's L1 cache, even though they are accessing different variables.

alignas(64) std::atomic<std::size_t> head_{0};  // occupies its own cache line
alignas(64) std::atomic<std::size_t> tail_{0};  // occupies its own cache line

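This forces each atomic to start at a 64-byte boundary, placing them on separate cache lines. The producer's writes to head_ no longer invalidate the line holding tail_, and the consumer's writes to tail_ no longer invalidate the line holding head_. The false sharing between the two indices disappears. One caveat: buf_ is declared directly after tail_, so unless T itself is 64-byte aligned, the first few slots of the array share tail_'s cache line.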
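The layout difference is easy to verify with sizeof, alignof and the member addresses. The struct names below are hypothetical, and the 64-byte line size is an assumption that matches the x86-64 figure used in this post. C++17 also defines std::hardware_destructive_interference_size in <new>, where the toolchain provides it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes.
constexpr std::size_t kCacheLine = 64;

struct Unpadded {
    std::atomic<std::size_t> head{0};
    std::atomic<std::size_t> tail{0};
};

struct Padded {
    alignas(kCacheLine) std::atomic<std::size_t> head{0};
    alignas(kCacheLine) std::atomic<std::size_t> tail{0};
};

// Byte distance between the two indices inside one object.
template <typename Indices>
std::size_t index_distance(const Indices& s) {
    const auto h = reinterpret_cast<std::uintptr_t>(&s.head);
    const auto t = reinterpret_cast<std::uintptr_t>(&s.tail);
    return static_cast<std::size_t>(t - h);
}

static_assert(alignof(Padded) == kCacheLine, "object starts on a line boundary");
static_assert(sizeof(Padded) == 2 * kCacheLine, "one full line per index");

inline void layout_example() {
    Unpadded u;
    Padded p;
    assert(index_distance(u) < kCacheLine);   // near enough to share a line
    assert(index_distance(p) == kCacheLine);  // always on different lines
    (void)u;
    (void)p;
}
```

The price is memory: the padded pair occupies 128 bytes where the unpadded pair typically occupies 16.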

You can measure the effect directly on Linux:

perf stat -e cache-misses,L1-dcache-load-misses ./bench

In my measurements, alignas(64) reduced L1 cache miss rate by approximately 40–60% compared to the unpadded version, and this is visible in the benchmark results.

Benchmark results

Measured on Ubuntu 22.04, Intel Core i7-10750H, GCC 12.3, -O3 -march=native, two isolated cores:

Throughput (10,000,000 items):

Implementation          Throughput
─────────────────────────────────────────
SpscRingBuffer          ~310,000,000 ops/sec
std::mutex + std::queue ~100,000,000 ops/sec  (baseline)
─────────────────────────────────────────
Speedup: 3.1×

Write latency (1,000,000 samples):

Percentile   Latency
p50          ~22 ns
p95          ~45 ns
p99          ~87 ns
p999         ~210 ns
max          ~1,200 ns

The p99 at 87 ns is comfortably below the 350 ns target I set based on the 1 kHz IMU use case (where you have 1,000,000 ns per sample and can afford to spend a small fraction on the buffer operation). The occasional 1,200 ns max is caused by OS scheduler preemption, not by the buffer itself.
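For readers who want to reproduce a table like the one above, here is one common way to turn raw latency samples into percentiles. The post does not show how the benchmark computes them, so the nearest-rank method and the name percentile_ns are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. The vector is taken by value
// because the function sorts its own copy.
// Preconditions: samples is not empty and 0 < p <= 100.
inline std::uint64_t percentile_ns(std::vector<std::uint64_t> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const auto n = static_cast<double>(samples.size());
    const auto rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    return samples[rank - 1];
}

// Share of a 1 kHz sample period (1,000,000 ns) used by one 87 ns push:
// 0.000087, i.e. about 0.0087 % of the budget.
constexpr double kBudgetShare = 87.0 / 1'000'000.0;

inline void percentile_example() {
    std::vector<std::uint64_t> ns;
    for (std::uint64_t v = 100; v >= 1; --v)  // 100, 99, ..., 1 (unsorted)
        ns.push_back(v);
    assert(percentile_ns(ns, 50.0) == 50);
    assert(percentile_ns(ns, 99.0) == 99);
    assert(percentile_ns(ns, 100.0) == 100);
}
```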

The 3.1× throughput advantage over mutex comes from three combined effects: no system call overhead (mutexes may call into the kernel on contention), no memory fence on every operation (just acquire/release pairs), and no false sharing between head_ and tail_.

ThreadSanitizer validation

Correctness in concurrent code is notoriously hard to test. Unit tests that pass deterministically on your machine can hide data races that only surface under different thread scheduling. ThreadSanitizer (TSan) instruments your binary to detect data races at runtime by tracking memory accesses across threads.

I ran a stress test of 50,000,000 operations under TSan:

TSan stress test: 50000000 operations (producer + consumer)...
TSan stress test PASSED: 50000000 ops, zero data races

TSan is not a proof of correctness: it can only report races on the interleavings it actually observes. Still, 50 million operations exercise a large number of producer/consumer interleavings, so a clean TSan result on a lock-free data structure is meaningful evidence that the acquire/release annotations are placed correctly.

Note that TSan needs a toolchain and platform that support it. GCC and Clang on Linux work well, but the MinGW Windows toolchain does not support it, which is another reason to prefer a Linux development environment for low-level concurrent code.

The IMU demo

To make the use case concrete, I built a demo that simulates a 1 kHz IMU sensor feeding an accelerometer processing pipeline:

╔══════════════════════════════════════════╗
║   SPSC IMU Demo  ·  1 kHz  ·  5 seconds  ║
╚══════════════════════════════════════════╝

[consumer] consumed=5000   mean_az=9.8104 m/s²

╔══════════════════════════════════════════╗
║               Results                    ║
╠══════════════════════════════════════════╣
║  Total attempted  :  5000 samples        ║
║  Pushed (success) :  5000 samples        ║
║  Dropped (full)   :     0 samples        ║
║  Drop rate        :  0.000 %             ║
╚══════════════════════════════════════════╝

The buffer has 256 slots, of which 255 are usable because one slot always stays empty to distinguish full from empty. At 1 kHz, that is 255 ms of scheduling-jitter headroom: the OS can preempt the consumer for up to a quarter of a second before the producer starts dropping samples. In practice, Linux preemption latency on a non-RT kernel is typically well under 10 ms, so the buffer absorbs all jitter comfortably.
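The headroom figure is simple arithmetic, sketched here as compile-time checks. The helper names are hypothetical; the calculation assumes the buffer starts empty and samples arrive at a steady rate.

```cpp
#include <cassert>
#include <cstddef>

// The push path refuses to advance head_ onto tail_, so one slot always
// stays empty and a buffer of `capacity` slots holds capacity - 1 items.
constexpr std::size_t usable_slots(std::size_t capacity) {
    return capacity - 1;
}

// Milliseconds the consumer can stall before the producer first sees "full".
constexpr double headroom_ms(std::size_t capacity, double sample_rate_hz) {
    return static_cast<double>(usable_slots(capacity)) * 1000.0 / sample_rate_hz;
}

static_assert(usable_slots(256) == 255, "a 256-slot buffer holds 255 samples");
static_assert(headroom_ms(256, 1000.0) == 255.0, "255 ms of headroom at 1 kHz");
```

The same formula makes it easy to size the buffer for a different sensor rate or a different worst-case stall.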

Zero drops across 5,000 samples. Exactly what a safety-critical pipeline needs.

The correctness test suite

Five Google Tests cover the key invariants:

PushPop: single-threaded round-trip; push N items, pop N items, values match

WrapAround: verify correct behaviour across the index wraparound boundary

FullAndEmpty: try_push on a full buffer returns false; try_pop on empty returns false

SpscConcurrent: producer and consumer run simultaneously; all values received in order

SizeApprox: size_approx() is non-negative and ≤ Capacity under concurrent access

The SpscConcurrent test is the most important. It runs the producer and consumer on separate threads, pushes 100,000 items with known values, and verifies on the consumer side that every value is received exactly once in order. It passes consistently.
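The class declares size_approx() but this post does not show its body. One plausible way to compute the value from the two indices is sketched below; the free function occupied is an assumption, not the project's implementation. The result is approximate under concurrency because head_ and tail_ are loaded at different moments.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of occupied slots given head (next write
// position) and tail (next read position), both already wrapped to
// [0, Capacity). Unsigned subtraction wraps modulo 2^N, so masking the
// difference gives the right count even when head < tail.
constexpr std::size_t occupied(std::size_t head, std::size_t tail,
                               std::size_t mask) noexcept {
    return (head - tail) & mask;
}

// Capacity = 8, so mask = 7.
static_assert(occupied(5, 2, 7) == 3, "no wrap: slots 2, 3, 4 are occupied");
static_assert(occupied(1, 6, 7) == 3, "wrapped: slots 6, 7, 0 are occupied");
static_assert(occupied(4, 4, 7) == 0, "equal indices mean empty");
static_assert(occupied(3, 4, 7) == 7, "full: Capacity - 1 items");
```

Under this scheme the reported size never exceeds Capacity - 1, which is consistent with the one-empty-slot design.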

Where this fits in a larger system

In my CFAR radar pipeline, the data flow looks like this:

ADC samples → [SPSC ring buffer] → FFT processor → [SPSC ring buffer] → CFAR detector → output

Each stage runs on its own thread. The SPSC buffers between stages provide deterministic, wait-free handoff with bounded latency. The total pipeline latency budget is well under 1 ms for a 1024-bin frame, and the buffer operations contribute less than 1 µs of that.

This is the pattern that makes real-time multi-stage DSP pipelines possible without mutexes: one ring buffer between each adjacent pair of stages, strict SPSC discipline enforced by design, and alignas(64) throughout.

Running it yourself

git clone https://github.com/karam25pal/spsc-ring-buffer.git
cd spsc-ring-buffer
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

./imu_demo           # 1 kHz IMU demo, 5 seconds
./test_spsc          # correctness suite
./test_spsc_tsan     # TSan stress (Linux only)
./bench              # throughput + p99 latency

The entire implementation is a single header file: include/spsc_ring_buffer.hpp. No dependencies, no external libraries, pure C++17 standard library. MIT licence.

Full project at github.com/karam25pal/spsc-ring-buffer.

What I learned

The most valuable thing this project taught me is that correctness and performance in concurrent code are both about understanding what the hardware is actually doing, not just what the C++ standard says in the abstract.

alignas(64) is eleven characters. Its effect, eliminating cache-line ping-pong between two cores, shaves tens of nanoseconds off every single buffer operation. Understanding why it works requires knowing that cache coherence protocols like MESI operate at cache-line granularity, not at variable granularity. That is not in most undergraduate curricula. It is in the hardware documentation.

Similarly, the difference between memory_order_release and memory_order_seq_cst is not just academic. On x86 it is the difference between a plain store and a store followed by MFENCE, a full memory barrier that serialises all in-flight memory operations. At 310 million ops/second, every unnecessary barrier matters.

Low-level C++ forces you to understand the machine. That is uncomfortable at first, and then it becomes one of the most satisfying parts of programming.

Build the primitive before the system needs it. By the time the deadline shows up, you want the boring, measured choice already on the shelf, and a benchmark report that lets a non-technical stakeholder sign off in one read.

C++17 · Concurrency · Lock-Free · Real-time · Cache Architecture