Someone opens their messaging app. The room list is blank. They wait. They wait more. Five minutes later, the list finally appears.
The room list wasn't frozen; it was thinking. And thinking way too hard.
This is a story about how the Matrix Rust SDK fixed this problem, and what it teaches us about memory pressure, lock contention, and why Data-oriented Design matters in Rust.
## The Problem
The Matrix Rust SDK has a Room List component: the screen where you see all your chat rooms, sorted by recency. Users expect this to be instant. Instead, it was taking up to 5 minutes on some phones.
The symptoms pointed to the sorting logic. Something in how rooms were being sorted was causing massive slowdowns.
## Investigation: Finding the Culprits
The team dug into the code and found two problems working together:
### Problem 1: Memory Pressure
Each room sort required the latest_event field. But getting this field meant:
```rust
pub fn latest_event(&self) -> LatestEventValue {
    self.info.read().latest_event.clone() // Clone the whole thing
}
```
The LatestEventValue type is 144 bytes. Every comparison between two rooms required:
- Acquiring a read lock
- Cloning 144 bytes
- Doing this twice per comparison
- For hundreds of rooms
The math: 322,042 allocations, 743 MiB of memory, 287 bytes per allocation. Just to sort a list.
### Problem 2: Lock Contention
That self.info.read()? It's an RwLock. The sorters were acquiring this lock 322,042 times, once for every comparison. With three sorters running (latest event, recency, name), that's a lot of lock contention.
When you combine memory pressure with lock contention, you get a perfect storm: the allocator is stressed, threads are fighting for locks, and the CPU is waiting.
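Put together, the anti-pattern looks roughly like this. This is a simplified sketch with stand-in types, not the SDK's actual code: `LatestEventValue` is modeled as an opaque 144-byte payload, and the real comparison logic is elided.

```rust
use std::sync::RwLock;

// Stand-in for the SDK's 144-byte LatestEventValue.
#[derive(Clone)]
struct LatestEventValue {
    _payload: [u8; 144],
}

struct Room {
    info: RwLock<LatestEventValue>,
}

impl Room {
    // The problematic accessor: every call takes the lock AND clones 144 bytes.
    fn latest_event(&self) -> LatestEventValue {
        self.info.read().unwrap().clone()
    }
}

fn sort_rooms(rooms: &mut [Room]) {
    // Each comparison triggers two lock acquisitions and two 144-byte clones,
    // and a sort performs O(n log n) comparisons per pass.
    rooms.sort_by(|a, b| {
        let _left = a.latest_event(); // lock + clone
        let _right = b.latest_event(); // lock + clone
        std::cmp::Ordering::Equal // real code would compare the values
    });
}
```

The allocator churn and the lock traffic both live inside the comparator, so they scale with the number of comparisons, not the number of rooms.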
## The First Fix: Reduce Allocation Size
Instead of cloning the entire LatestEventValue, the team created specific accessors:
```rust
pub fn latest_event_is_local(&self) -> bool {
    self.info.read().latest_event.is_local()
}

pub fn latest_event_timestamp(&self) -> Option<MilliSecondsSinceUnixEpoch> {
    self.info.read().latest_event.timestamp()
}
```
Now instead of cloning 144 bytes, each call returns a bool or a small timestamp value. The lock is still acquired, but only a handful of bytes cross it.
Result: 18% throughput improvement. Better, but not enough. The 5-minute problem was still there.
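A quick way to see the difference is to compare the sizes of the values crossing the lock boundary. This sketch uses `u64` as a stand-in for the timestamp type, on a typical 64-bit target; note that `Option<u64>` needs a discriminant alongside the value, so it is 16 bytes rather than 8, but still an order of magnitude smaller than the 144-byte `LatestEventValue`.

```rust
use std::mem::size_of;

// Sizes of the values the new accessors return, on a 64-bit target.
const BOOL_SIZE: usize = size_of::<bool>(); // 1 byte
const OPT_TS_SIZE: usize = size_of::<Option<u64>>(); // 16 bytes: u64 + discriminant + padding
```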
## The Real Fix: Data-oriented Design
The second fix was more profound. Instead of fetching data during sorting, pre-fetch everything into a single struct:
```rust
pub struct RoomListItem {
    cached_latest_event_timestamp: Option<MilliSecondsSinceUnixEpoch>,
    cached_latest_event_is_local: bool,
    cached_recency_stamp: Option<RoomRecencyStamp>,
    cached_display_name: Option<String>,
    cached_is_space: bool,
    cached_state: RoomState,
}
```
This struct is 64 bytes, small enough to fit in a single CPU cache line.
Now the sorters look like this:
```rust
pub fn new_sorter() -> impl Sorter {
    let latest_events = |left: &RoomListItem, right: &RoomListItem| {
        (left.cached_latest_event_is_local, right.cached_latest_event_is_local)
    };

    move |left, right| -> Ordering { cmp(latest_events, left, right) }
}
```
No locks during sorting. The lock is acquired only once, when refreshing the cache, not 322,042 times.
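The overall shape of the fix can be sketched end to end with simplified stand-in types. The field set and sorting rules below are illustrative, not the SDK's exact ones: the point is that the lock is taken once per room at refresh time, and the sort itself touches only plain cached fields.

```rust
use std::sync::RwLock;

// Stand-in for the locked, shared room state.
struct RoomInfo {
    latest_event_timestamp: Option<u64>,
    latest_event_is_local: bool,
}

struct Room {
    info: RwLock<RoomInfo>,
}

// Plain cached data: everything the sorters need, no locks attached.
struct RoomListItem {
    cached_latest_event_timestamp: Option<u64>,
    cached_latest_event_is_local: bool,
}

// The lock is taken exactly once per room, when the cache is refreshed...
fn refresh_cache(rooms: &[Room]) -> Vec<RoomListItem> {
    rooms
        .iter()
        .map(|room| {
            let info = room.info.read().unwrap();
            RoomListItem {
                cached_latest_event_timestamp: info.latest_event_timestamp,
                cached_latest_event_is_local: info.latest_event_is_local,
            }
        })
        .collect()
}

// ...and the sort reads only the cached fields: no locks, no clones.
fn sort_items(items: &mut [RoomListItem]) {
    items.sort_by(|left, right| {
        // Local (unsent) events first, then newest timestamp first.
        right
            .cached_latest_event_is_local
            .cmp(&left.cached_latest_event_is_local)
            .then_with(|| {
                right
                    .cached_latest_event_timestamp
                    .cmp(&left.cached_latest_event_timestamp)
            })
    });
}
```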
## The Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Time | 53ms | 676µs | 98.7% faster |
| Throughput | 18.8K elem/s | 1,477K elem/s | 7,718% |
The room list went from potentially 5 minutes to instant. The benchmark shows linear, predictable performance instead of erratic spikes.
## Why This Matters
This story illustrates three key concepts:
1. Memory has a hierarchy. An L1 cache access takes about 1ns, L2 about 4ns, main memory about 100ns. Scale those numbers up and that's the difference between 1 minute and 1 hour 40 minutes. Keeping data in cache isn't a micro-optimization; it's fundamental.
2. Locks are expensive. Every read() on an RwLock has overhead. When you acquire a lock 322,000 times, that overhead adds up. Better to acquire once and work with the snapshot.
3. Data layout matters. A Vec<RoomListItem> (Array of Structures) is cache-friendly. But sometimes a Structure of Arrays is even better: instead of Vec<Room { a, b, c }>, you have RoomList { a: Vec<A>, b: Vec<B>, c: Vec<C> }. Less padding, fewer cache misses.
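Here is a minimal sketch of the two layouts; the field names are illustrative, not taken from the SDK. In the AoS form, a scan over recency stamps drags every item's other fields (and padding) through the cache; in the SoA form, the same scan reads one dense array.

```rust
// Array of Structures: each item's fields sit together.
// u64 + bool rounds up to 16 bytes per item after padding.
struct RoomAoS {
    recency: u64,
    _is_space: bool,
}

// Structure of Arrays: each field gets its own contiguous vector.
#[derive(Default)]
struct RoomListSoA {
    recency: Vec<u64>,
    is_space: Vec<bool>,
}

impl RoomListSoA {
    fn push(&mut self, recency: u64, is_space: bool) {
        self.recency.push(recency);
        self.is_space.push(is_space);
    }

    // A hot pass that needs only one field now touches only that field's data.
    fn most_recent(&self) -> Option<u64> {
        self.recency.iter().copied().max()
    }
}
```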
## The Lesson
The Rust borrow checker gives you safety. But performance? That's on you. The borrow checker won't tell you that cloning a 144-byte struct inside a sort comparator is killing your performance. It won't tell you that acquiring a lock 300,000 times is creating contention.
You have to think about:
- How data is laid out in memory
- How often locks are acquired
- Whether your hot path is cache-friendly
This is where Rust gives you control, but with control comes responsibility. The Matrix team didn't reach for fancy profiling tools first. They understood the problem at the memory level.
And that's the real power of Rust: you can optimize at this level. You just have to know why you should.