22 October 2025·12 min

The Cache Line Is the Unit of Concurrency: A Visual Tour of False Sharing

Why a single missing `alignas(64)` can make a lock-free queue 4× slower than its mutex-guarded equivalent. A perf-counter walk through MESI, cache contention, and the padding that fixes it.

When I built the SPSC ring buffer in May, the single line that moved performance the most was `alignas(64)` on the head and tail atomics. I claimed at the time it was 'the most important optimisation in the file' but I did not show the data. This post is the data.

The setup

Two atomic counters in the same struct, written by two different threads. Nothing else is shared. The threads do not coordinate and never wait on each other; each one increments its own counter ten million times.

#include <atomic>
#include <cstddef>

struct Unpadded {
    std::atomic<std::size_t> a{0};
    std::atomic<std::size_t> b{0};   // 8 bytes after `a`
};

struct Padded {
    alignas(64) std::atomic<std::size_t> a{0};
    alignas(64) std::atomic<std::size_t> b{0};   // own cache line
};

Common sense says these should run at the same speed. Each thread touches its own variable. The atomics are independent. There is no logical sharing.
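The layout difference itself can be checked before any timing is done. What follows is a minimal sketch, assuming a 64-byte cache line and a 64-bit target. It repeats the two structs so that it compiles on its own; `kLine` and `same_line` are hypothetical names introduced here for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed cache-line size

struct Unpadded {
    std::atomic<std::size_t> a{0};
    std::atomic<std::size_t> b{0};
};

struct Padded {
    alignas(kLine) std::atomic<std::size_t> a{0};
    alignas(kLine) std::atomic<std::size_t> b{0};
};

// Padded: the struct inherits the member alignment and spans two full lines.
static_assert(alignof(Padded) == kLine, "struct is aligned to a line boundary");
static_assert(sizeof(Padded) == 2 * kLine, "one full line per counter");

// Unpadded: both counters together are far smaller than one line.
static_assert(sizeof(Unpadded) <= kLine, "both counters fit inside one line");

// Hypothetical helper: true if two addresses fall in the same 64-byte line.
inline bool same_line(const void* p, const void* q) {
    auto line = [](const void* x) {
        return reinterpret_cast<std::uintptr_t>(x) / kLine;
    };
    return line(p) == line(q);
}
```

Calling `same_line(&p.a, &p.b)` on a `Padded p` returns false by construction. For `Unpadded` the answer depends on where the object lands: it is only 8-byte aligned, so at one of the eight possible offsets within a line `a` and `b` straddle a line boundary and the false sharing disappears by accident. Declaring the benchmark instance with `alignas(64)` removes that source of noise.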

The benchmark

Intel i7-12700H, GCC 13, -O2. Two threads, ten million increments each, repeated 100 times. Time per increment, averaged:

Padded version: 3.1 ns / op

Unpadded version: 12.7 ns / op

Slowdown from a missing keyword: 4.1×
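The harness behind these numbers is not reproduced in this post. A minimal sketch of that kind of measurement could look like the following, assuming relaxed `fetch_add` increments and wall-clock time divided by the per-thread iteration count. The function name `ns_per_increment` and the choice of memory order are assumptions made here, not details taken from the original benchmark.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

struct Unpadded {
    std::atomic<std::size_t> a{0};
    std::atomic<std::size_t> b{0};
};

struct Padded {
    alignas(64) std::atomic<std::size_t> a{0};
    alignas(64) std::atomic<std::size_t> b{0};
};

// Runs two threads, each incrementing its own counter `n` times, and returns
// wall-clock nanoseconds divided by `n`. Thread start-up and join time are
// included, so `n` should be large enough to make them negligible.
template <class Counters>
double ns_per_increment(Counters& c, std::size_t n) {
    auto work = [n](std::atomic<std::size_t>& x) {
        for (std::size_t i = 0; i < n; ++i) {
            x.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a));
    std::thread t2(work, std::ref(c.b));
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(n);
}
```

Running it on one `Padded` and one `Unpadded` instance with `n` set to ten million, repeating, and averaging gives a comparison of the same shape as the one above. Absolute numbers will differ by CPU, compiler, and which cores the scheduler picks.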

Why

CPUs do not move memory between cores one byte at a time. They move cache lines, 64 bytes on current x86 parts. When thread A writes to variable `a`, the cache coherence protocol (a MESI variant on x86: MESIF on Intel, MOESI on AMD) has to make sure no other core keeps a stale copy of that cache line. If thread B's variable `b` lives on the same cache line, B's core holds a copy too, even though B has never touched `a`.

The protocol now has to invalidate B's copy, fetch the new line from A's L1, and the next time B writes to `b`, the same dance happens in reverse. This ping-pong is called false sharing: there is no logical sharing, only physical co-location on the same cache line.

MESI in one paragraph

Each cache line on each core sits in one of four states: Modified (this core has the only valid copy, and it differs from memory), Exclusive (this core has the only copy, and it matches memory), Shared (other cores may also hold the line; every copy is read-only and matches memory), or Invalid (the copy, if any, is stale and cannot be used). A write needs the line in Modified. From Exclusive the upgrade is silent, but the transition Shared → Modified (or Invalid → Modified) requires sending Invalidate messages to every other core that has the line and waiting for acknowledgements. That round trip is the cost.
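The state machine is small enough to simulate. The sketch below is a toy model with two cores, no timing, and an increment treated as a plain write. The type `Line` and both function names are hypothetical. It only counts how often a write has to invalidate the other core's copy, under the worst-case assumption that the two cores strictly alternate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class State { Modified, Exclusive, Shared, Invalid };

// One cache line as seen by two cores.
struct Line {
    std::array<State, 2> core{{State::Invalid, State::Invalid}};
    std::size_t invalidations = 0;  // writes that had to kill the other copy

    void read(int c) {
        int other = 1 - c;
        if (core[c] != State::Invalid) return;   // hit in M, E or S
        if (core[other] == State::Invalid) {
            core[c] = State::Exclusive;           // nobody else has it
        } else {
            core[other] = State::Shared;          // M/E holder downgrades
            core[c] = State::Shared;
        }
    }

    void write(int c) {
        int other = 1 - c;
        if (core[c] == State::Modified) return;   // already owned: pure hit
        if (core[c] == State::Exclusive) {        // silent upgrade, no traffic
            core[c] = State::Modified;
            return;
        }
        // Shared or Invalid: take ownership, invalidating any other copy.
        if (core[other] != State::Invalid) {
            core[other] = State::Invalid;
            ++invalidations;
        }
        core[c] = State::Modified;
    }
};

// Both counters on one line: cores 0 and 1 alternate writes to it.
inline std::size_t same_line_invalidations(std::size_t rounds) {
    Line line;
    for (std::size_t i = 0; i < rounds; ++i) {
        line.write(0);
        line.write(1);
    }
    return line.invalidations;
}

// Each counter on its own line: every core only ever writes its own line.
inline std::size_t separate_line_invalidations(std::size_t rounds) {
    Line a, b;
    for (std::size_t i = 0; i < rounds; ++i) {
        a.write(0);
        b.write(1);
    }
    return a.invalidations + b.invalidations;
}
```

In this model every write after the very first one on a shared line triggers an invalidation, while the separate-line case triggers none. Real cores do not alternate this strictly, so the model overstates the traffic, but the direction of the effect is the same.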

Seeing it with perf

perf stat -e cache-misses,cache-references,\
  L1-dcache-load-misses,offcore_response.demand_rfo.l3_miss \
  ./bench_unpadded

# unpadded: 41,287,902  L1-dcache-load-misses
# padded:        9,114  L1-dcache-load-misses

Roughly 4,500 times more L1 misses in the unpadded version (41,287,902 against 9,114), for a workload that should have hit L1 every time.

Where this bites in real code

Producer and consumer cursors in a queue (the SPSC case).

Per-thread counters in a metrics struct laid out as an array.

Adjacent elements in a `std::vector` of small structs, where neighbouring elements are written by different threads.

Fields you assumed were separated by padding, when the compiler's struct layout actually placed them next to each other.

The fix and its cost

`alignas(64)` on hot atomics, or `std::hardware_destructive_interference_size` from `<new>` if your standard library defines it (GCC 12 and later and MSVC do, as do recent Clang releases). The cost is memory: 64 bytes per padded variable instead of 8. For a handful of cursors this is invisible. For a struct used in a million-element array you would not do this; you would instead partition the data so that each thread writes its own contiguous range.

// std::hardware_destructive_interference_size lives in <new> (C++17)
alignas(std::hardware_destructive_interference_size)
std::atomic<std::size_t> head_{0};
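Where the constant may be missing, the feature-test macro `__cpp_lib_hardware_interference_size` can select a fallback. The sketch below applies that to a per-thread counter array. `kDestructive`, `PaddedCounter`, and `PerThreadCounters` are hypothetical names, and the fallback of 64 is an assumption that fits typical x86-64 parts.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>  // std::hardware_destructive_interference_size and its macro

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64;  // assumption: typical x86-64 line
#endif

// alignas on the type rounds sizeof up to the alignment, so neighbouring
// array elements can never share a line.
struct alignas(kDestructive) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kDestructive, "one line per counter");

// One slot per thread; each thread only ever writes its own slot.
template <std::size_t N>
struct PerThreadCounters {
    std::array<PaddedCounter, N> slots;

    void add(std::size_t thread_index, std::uint64_t n) {
        slots[thread_index].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Sums all slots; the result is only a snapshot if writers are running.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }
};
```

A `PerThreadCounters<8>` costs 8 × `kDestructive` bytes instead of 64, which is the memory trade described above in concrete terms.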

The opposite problem

There is a constructive version too: `std::hardware_constructive_interference_size`. When two pieces of data are always accessed together by the same thread, you want them on the same cache line so one fetch brings both. The struct layout exercise becomes: who reads what together, and who writes what apart.
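The constructive constant is mostly useful as a compile-time check that data read together still fits in one line. Here is a minimal sketch, in which the struct `RingInfo`, its fields, and the helper `fits_in_one_line` are all hypothetical and the fallback of 64 is an assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>  // std::hardware_constructive_interference_size and its macro

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kConstructive = std::hardware_constructive_interference_size;
#else
constexpr std::size_t kConstructive = 64;  // assumption: typical x86-64 line
#endif

// Read-mostly fields that one thread always reads together.
// Aligning the struct to the line size keeps it from straddling two lines.
struct alignas(kConstructive) RingInfo {
    std::uint64_t capacity;
    std::uint64_t mask;
    void* storage;
};
static_assert(sizeof(RingInfo) <= kConstructive,
              "fields that are read together should fit in one line");

// True if the byte range [p, p + size) lies within a single line.
inline bool fits_in_one_line(const void* p, std::size_t size) {
    auto offset = reinterpret_cast<std::uintptr_t>(p) % kConstructive;
    return offset + size <= kConstructive;
}
```

If a later change adds enough fields to push `RingInfo` past one line, the `static_assert` fails at compile time instead of showing up as an extra cache miss in a profile.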

What I now do reflexively

Any atomic in a shared struct gets `alignas(64)`.

Any per-thread counter array is sized so each element is at least a cache line.

Any benchmark that shows surprising slowness gets a `perf stat -e cache-misses` pass before I touch the code.

Concurrency is not really about threads. It is about memory, and how the hardware moves it between cores. Once you start thinking in cache lines, a lot of mysterious slowdowns stop being mysterious.

False sharing is the universe's polite reminder that the abstraction of independent variables is exactly that — an abstraction. The hardware sees only cache lines.
Tags: C++17, Concurrency, CPU Architecture, Performance
