If you have a shared value that many threads read but few threads modify, your intuition says: use a RwLock. Readers don't block each other. Only writers block. It's the obvious choice.

Your intuition is wrong.

Not always. But often enough that it's worth understanding why.

The Benchmark That Shouldn't Work

Here's a pattern that shows up in production systems: a hot cache that gets read thousands of times per second but only updates occasionally. You'd think parking_lot::RwLock would be perfect here. Multiple readers, zero contention.

Except the actual benchmark tells a different story. On a 16-core Apple M4, using RwLock for a trivial read operation (just returning a cached value) was seven times slower than using a Mutex.

That's not a typo. Seven times.

// The RwLock version
let reader = cache.rw_lock.read();
reader.get(&key)

// The Mutex version  
let guard = cache.mutex.lock();
guard.get(&key)

Same operation. Same data. One uses seven times more CPU.
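A minimal way to reproduce the shape of this measurement with std::sync primitives. The article's numbers came from parking_lot on an M4; the thread count and iteration count here are illustrative assumptions, and the ratio you observe will vary by machine and lock implementation:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

const THREADS: usize = 8;
const ITERS: usize = 100_000;

// Time THREADS threads each running ITERS lock-read-unlock cycles.
fn bench<F>(f: F) -> Duration
where
    F: Fn() + Send + Sync + 'static,
{
    let f = Arc::new(f);
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let f = Arc::clone(&f);
            thread::spawn(move || {
                for _ in 0..ITERS {
                    f();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    let rw = Arc::new(RwLock::new(42u64));
    let mx = Arc::new(Mutex::new(42u64));

    let rw2 = Arc::clone(&rw);
    let t_rw = bench(move || {
        // Trivial read: just return the cached value.
        assert_eq!(*rw2.read().unwrap(), 42);
    });

    let mx2 = Arc::clone(&mx);
    let t_mx = bench(move || {
        assert_eq!(*mx2.lock().unwrap(), 42);
    });

    println!("rwlock read: {t_rw:?}");
    println!("mutex lock:  {t_mx:?}");
}
```

The critical section is a single load, so whatever difference the two timings show is almost entirely lock overhead.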

What's Actually Happening

The problem isn't what you think it is. It's not that readers block each other—they don't in a proper RwLock implementation. The problem is the bookkeeping.

Every time a reader acquires the lock, the RwLock increments a shared atomic reader count. Every release decrements it. Under contention these are not cheap operations: each atomic read-modify-write pulls the counter's cache line into the acquiring core in exclusive mode, invalidating every other core's copy.

A Mutex does one atomic operation to acquire and one to release. An RwLock's read path pays the same pair, an increment and a decrement of the reader count, and crucially, it pays them even when there's no writer waiting. Worse, under contention the read acquire is typically a compare-and-swap loop that can fail and retry, so the real cost per reader is often more than two atomics.

When you have 16 threads all reading the same value, every read cycle generates 32 atomic read-modify-writes (16 acquisitions + 16 releases) on the same counter. Each one forces the cores to hand the counter's cache line back and forth, so the "concurrent" readers serialize at the hardware level, and that coherence traffic dominates the actual work.
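The bookkeeping is easiest to see in a toy reader-count lock. This is a deliberately simplified sketch, not the algorithm parking_lot or std actually uses: every read acquire is one atomic read-modify-write on a single shared counter, and every release is another, so all readers hammer the same cache line.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Arc;
use std::thread;

// Toy read path of a reader-count lock. Simplified sketch only:
// no writer support shown, no parking, no fairness.
struct ToyRwLock {
    state: AtomicIsize, // >= 0: number of active readers
}

impl ToyRwLock {
    fn new() -> Self {
        Self { state: AtomicIsize::new(0) }
    }

    fn read_acquire(&self) {
        loop {
            let s = self.state.load(Ordering::Relaxed);
            // One atomic RMW on the shared cache line per acquire;
            // under contention this CAS can fail and retry.
            if s >= 0
                && self
                    .state
                    .compare_exchange(s, s + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return;
            }
            std::hint::spin_loop();
        }
    }

    fn read_release(&self) {
        // And a second RMW on the same line at release.
        self.state.fetch_sub(1, Ordering::Release);
    }
}

fn main() {
    let lock = Arc::new(ToyRwLock::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    lock.read_acquire();
                    lock.read_release();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Every acquire was matched by a release: count is back to zero.
    assert_eq!(lock.state.load(Ordering::SeqCst), 0);
    println!("final reader count: 0");
}
```

Eight threads doing 10,000 lock/unlock cycles each means 160,000 atomic read-modify-writes on one counter, all coherence traffic, zero useful work.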

The Real Lesson

The conventional wisdom ("use RwLock for read-heavy workloads") assumes the cost of the lock is negligible relative to the critical section. But when the critical section is tiny (a hashmap lookup, a cache hit), the lock itself becomes the entire cost.

This is the paradox: the more optimized your read path, the worse RwLock performs relative to a Mutex, because the lock's fixed overhead is all that's left to measure. The supposedly cheap choice costs you seven times more.

When RwLock Actually Wins

RwLock still wins when:

- readers hold the lock long enough to do real work (scanning a large structure, computing an aggregate), so concurrent reads actually overlap;
- the lock's fixed overhead is small relative to that critical section;
- writers are rare and writes are slow, so serializing every reader behind a Mutex would hurt.
But for in-memory caches, hot paths, anything where you're just reading a value? A Mutex might be faster. Profile it. The benchmark above shows a 7x difference—your intuition isn't enough.
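To see the favorable case, make the readers do real work while holding the lock. In this sketch (the vector size, thread count, and iteration count are arbitrary assumptions), each reader scans a large vector under the lock, so the RwLock's readers can overlap where the Mutex serializes them:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::Instant;

// Simulated "expensive read": scan the whole vector under the lock.
fn scan(data: &[u64]) -> u64 {
    data.iter().copied().sum()
}

fn main() {
    let data: Vec<u64> = (0..200_000).collect();
    let expected: u64 = scan(&data);
    let rw = Arc::new(RwLock::new(data.clone()));
    let mx = Arc::new(Mutex::new(data));

    // RwLock: the four readers scan concurrently.
    let start = Instant::now();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let rw = Arc::clone(&rw);
            thread::spawn(move || {
                for _ in 0..10 {
                    assert_eq!(scan(&rw.read().unwrap()), expected);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("rwlock: {:?}", start.elapsed());

    // Mutex: identical scans, but one reader at a time.
    let start = Instant::now();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mx = Arc::clone(&mx);
            thread::spawn(move || {
                for _ in 0..10 {
                    assert_eq!(scan(&mx.lock().unwrap()), expected);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("mutex:  {:?}", start.elapsed());
}
```

Here the critical section dwarfs the lock overhead, so the RwLock's parallelism is what you're measuring, which is exactly when it earns its keep.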

What This Means for Your Code

The lesson isn't "never use RwLock." It's that performance intuition is dangerous. The obvious choice is sometimes the slow choice. The cost model of your primitives matters more than the design pattern.

Your read lock isn't a read optimization. It's a write optimization that happens to allow concurrent reads. And that distinction matters when you're optimizing for throughput.