Here's something that sounds backwards: a write lock can be faster than a read lock.

I know — it violates everything we've learned. The conventional wisdom says RwLock is the answer to read-heavy workloads. Multiple readers should fly. Exclusive writes should be the bottleneck.

But on modern multi-core chips? The opposite can be true. And the reason is fascinating.

The Experiment That Broke Expectations

A developer building a high-performance tensor cache in Rust hit a wall with write-lock contention. Standard optimization move: switch to RwLock so multiple threads could read simultaneously.

The result? 5x slower than the original Mutex.

They expected this:

Read Lock → Multiple threads reading in parallel → Huge throughput increase

They got this:

Read Lock → Cache line ping-pong → Atomic contention → Actually worse

What's Actually Happening

Here's the counterintuitive part: even though you're calling .read(), you're triggering a write operation at the hardware level.

Every RwLock implementation uses an atomic counter to track how many readers currently hold the lock. When thread A calls .read(), it increments this counter. When thread B calls .read() a nanosecond later, it has to:

  1. Request exclusive ownership of the cache line holding the counter, invalidating Core A's copy
  2. Pull that line across the inter-core interconnect
  3. Increment the counter
  4. Give the line up again the moment Core A touches the counter next — say, to decrement it on release

This is cache line ping-pong — the same 64-byte chunk of memory bouncing back and forth between cores, where each bounce can cost more than the actual work the lock is protecting.
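To make that hardware-level write concrete, here's a stripped-down sketch of the read-acquire path. This is a toy (the real std::sync::RwLock also handles writer bits and thread parking), but every variant does an atomic read-modify-write like this on a shared counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Toy reader-count fast path. Illustrative only — not the real
// std::sync::RwLock implementation.
struct ToyReadCounter {
    readers: AtomicUsize, // shared cache line: every core WRITES to it
}

impl ToyReadCounter {
    fn new() -> Self {
        Self { readers: AtomicUsize::new(0) }
    }

    fn acquire_read(&self) {
        // A "read" lock, but fetch_add is a read-modify-WRITE: this core
        // must own the cache line exclusively, invalidating every other
        // core's copy of it.
        self.readers.fetch_add(1, Ordering::Acquire);
    }

    fn release_read(&self) {
        // Another write to the same line on the way out.
        self.readers.fetch_sub(1, Ordering::Release);
    }
}
```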

A write lock is less noisy because it serializes access instead of inviting a stampede: contended threads queue up and wait rather than all hammering the same counter on every acquire and release. One thread does its thing and releases. Done.
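You can reproduce the effect with a quick micro-benchmark. This is a sketch, not a verdict — thread counts, iteration counts, and the actual numbers will vary wildly by machine, so treat it as a template for your own measurement:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::Instant;

// Time many tiny read-side critical sections across several threads.
// On some machines the "parallel" RwLock readers lose, because every
// .read() still writes the shared reader counter.
fn bench<F>(label: &str, read: F)
where
    F: Fn() -> u64 + Send + Clone + 'static,
{
    let start = Instant::now();
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let read = read.clone();
            thread::spawn(move || {
                let mut sum = 0u64;
                for _ in 0..100_000 {
                    sum = sum.wrapping_add(read());
                }
                sum
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("{label}: {:?}", start.elapsed());
}

fn main() {
    let m = Arc::new(Mutex::new(42u64));
    let r = Arc::new(RwLock::new(42u64));
    bench("Mutex ", move || *m.lock().unwrap());
    bench("RwLock", move || *r.read().unwrap());
}
```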

When This Matters (And When It Doesn't)

This isn't a reason to never use RwLock. It matters when:

  1. The critical section is tiny — a hash lookup, a pointer copy — so the atomic lock traffic dominates the actual work
  2. Many threads take the read lock at high frequency
  3. You're on a high-core-count machine, where moving a cache line between cores is relatively expensive

The fix? Profile first. If you see lots of time in atomic_add, you have cache contention.

Other strategies:

  1. Shard the data behind several independent locks, so threads contend on different cache lines
  2. Keep a small per-thread cache and only touch the shared structure on a miss
  3. For read-mostly data, publish an immutable snapshot behind an atomically swapped pointer (an Arc, or a crate like arc-swap) so reads do no shared writes at all
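Sharding is often the biggest win: split one hot structure into N independently locked pieces so threads land on different locks (and different cache lines). A minimal sketch — the names (ShardedMap, SHARDS) are hypothetical, not from the original tensor cache:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Number of independent shards. Illustrative; tune for your core count.
const SHARDS: usize = 16;

// One Mutex per shard: two threads touching different keys usually
// contend on different locks, and different cache lines.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<u64, String>>>,
}

impl ShardedMap {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Pick a shard from the key. A real implementation would hash the
    // key first; modulo keeps the sketch short.
    fn shard(&self, key: u64) -> &Mutex<HashMap<u64, String>> {
        &self.shards[(key as usize) % SHARDS]
    }

    fn insert(&self, key: u64, value: String) {
        self.shard(key).lock().unwrap().insert(key, value);
    }

    fn get(&self, key: u64) -> Option<String> {
        self.shard(key).lock().unwrap().get(&key).cloned()
    }
}
```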

What This Taught Me

Two things:

  1. Hardware reality beats theory. What "should" be faster often isn't. The only way to know is to profile.

  2. The borrow checker isn't the only thing to learn about. Concurrency in Rust is about more than ownership — it's about understanding what happens at the hardware level too.

The lesson isn't "don't optimize." It's: verify your optimizations actually help.


This connects to Chapter 9 of my Rust course — Concurrency. If you're learning Rust, that's where this stuff lives. The borrow checker gets all the attention, but the real performance secrets are in understanding what your code actually does at the hardware level.