Someone opens their messaging app. The room list is blank. They wait. They wait more. Five minutes later, the list finally appears.

The room list wasn't frozen; it was thinking. And thinking way too hard.

This is a story about how the Matrix Rust SDK fixed this problem, and what it teaches us about memory pressure, lock contention, and why Data-oriented Design matters in Rust.

The Problem

The Matrix Rust SDK has a Room List component: the screen where you see all your chat rooms, sorted by recency. Users expect this to be instant. Instead, it was taking up to 5 minutes on some phones.

The symptoms pointed to the sorting logic. Something in how rooms were being sorted was causing massive slowdowns.

Investigation: Finding the Culprits

The team dug into the code and found two problems working together:

Problem 1: Memory Pressure

Each room sort required the latest_event field. But getting this field meant:

pub fn latest_event(&self) -> LatestEventValue {
    self.info.read().latest_event.clone()  // Clone the whole thing
}

The LatestEventValue type is 144 bytes. Every comparison between two rooms required acquiring the lock and cloning that 144-byte value, once for each side.

The numbers: 322,042 allocations, 743 MiB of memory, 287 bytes per allocation. All of that just to sort a list.
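To see why the counts explode, here is a minimal sketch of the allocation-heavy pattern, with simplified stand-in types (the real SDK types are richer than these):

```rust
use std::sync::RwLock;

// Hypothetical, simplified stand-ins for the SDK types.
#[derive(Clone)]
struct LatestEventValue {
    timestamp: u64,
    // ... in the real SDK this struct totals 144 bytes
}

struct RoomInfo {
    latest_event: LatestEventValue,
}

struct Room {
    info: RwLock<RoomInfo>,
}

impl Room {
    // One lock acquisition + one full clone per call.
    fn latest_event(&self) -> LatestEventValue {
        self.info.read().unwrap().latest_event.clone()
    }
}

// Sorting n rooms makes O(n log n) comparisons, and each comparison
// calls latest_event() twice: two lock acquisitions, two clones.
fn sort_rooms(rooms: &mut Vec<Room>) {
    rooms.sort_by(|left, right| {
        left.latest_event()
            .timestamp
            .cmp(&right.latest_event().timestamp)
            .reverse() // most recent first
    });
}
```

With thousands of rooms, those per-comparison clones and lock acquisitions multiply into the hundreds of thousands of allocations measured above.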

Problem 2: Lock Contention

That self.info.read()? It's a RwLock. The sorters were acquiring this lock 322,042 times, once for every comparison. With three sorters running (latest event, recency, name), that's a lot of lock contention.

When you combine memory pressure with lock contention, you get a perfect storm: the allocator is stressed, threads are fighting for locks, and the CPU is waiting.

The First Fix: Reduce Allocation Size

Instead of cloning the entire LatestEventValue, the team created specific accessors:

pub fn latest_event_is_local(&self) -> bool {
    self.info.read().latest_event.is_local()
}

pub fn latest_event_timestamp(&self) -> Option<MilliSecondsSinceUnixEpoch> {
    self.info.read().latest_event.timestamp()
}

Now, instead of cloning 144 bytes, each call returns a bool (1 byte) or a small Option (a few bytes). The lock is still acquired, but far less data is copied around.

Result: 18% throughput improvement. Better, but not enough. The 5-minute problem was still there.

The Real Fix: Data-oriented Design

The second fix was more profound. Instead of fetching data during sorting, pre-fetch everything into a single struct:

pub struct RoomListItem {
    cached_latest_event_timestamp: Option<MilliSecondsSinceUnixEpoch>,
    cached_latest_event_is_local: bool,
    cached_recency_stamp: Option<RoomRecencyStamp>,
    cached_display_name: Option<String>,
    cached_is_space: bool,
    cached_state: RoomState,
}

This struct is 64 bytes โ€” small enough to fit in a CPU cache line.
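A claim like this can be sanity-checked with std::mem::size_of. The sketch below uses hypothetical stand-ins with roughly the same shapes as the SDK types (the real definitions live in the SDK):

```rust
use std::mem::size_of;

// Stand-ins approximating the cached fields' shapes.
struct MilliSecondsSinceUnixEpoch(u64);
struct RoomRecencyStamp(u64);
enum RoomState { Joined, Left, Invited }

struct RoomListItem {
    cached_latest_event_timestamp: Option<MilliSecondsSinceUnixEpoch>, // 16 bytes
    cached_latest_event_is_local: bool,                                // 1 byte
    cached_recency_stamp: Option<RoomRecencyStamp>,                    // 16 bytes
    cached_display_name: Option<String>,                               // 24 bytes
    cached_is_space: bool,                                             // 1 byte
    cached_state: RoomState,                                           // 1 byte
}

fn main() {
    // 59 bytes of fields, padded to 8-byte alignment: one 64-byte
    // cache line. Rust is free to reorder fields to keep padding low.
    assert_eq!(size_of::<RoomListItem>(), 64);
    println!("RoomListItem is {} bytes", size_of::<RoomListItem>());
}
```

Note that Option&lt;String&gt; costs no extra space over String (the null-pointer niche encodes None), while Option around a plain u64 wrapper needs a separate discriminant, which is why those fields are 16 bytes rather than 8.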

Now the sorters look like this:

pub fn new_sorter() -> impl Sorter {
    let latest_events = |left: &RoomListItem, right: &RoomListItem| {
        // Plain field reads: no lock, no clone, no allocation.
        (left.cached_latest_event_is_local, right.cached_latest_event_is_local)
    };

    // `cmp` is a small helper that turns the extracted pair into an Ordering.
    move |left, right| -> Ordering { cmp(latest_events, left, right) }
}

No locks during sorting. The lock is only acquired once, when refreshing the cache, not 322,042 times.
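The pattern can be sketched end to end: acquire the lock once per room to build the cached items, then sort the lock-free snapshot. Names here are illustrative, not the SDK's actual API:

```rust
use std::sync::RwLock;

struct RoomInfo {
    recency_stamp: u64,
    display_name: String,
}

struct Room {
    info: RwLock<RoomInfo>,
}

// The cached, lock-free view that the sorters work on.
struct RoomListItem {
    cached_recency_stamp: u64,
    cached_display_name: String,
}

// One lock acquisition per room, done once when refreshing the cache.
fn snapshot(rooms: &[Room]) -> Vec<RoomListItem> {
    rooms
        .iter()
        .map(|room| {
            let info = room.info.read().unwrap();
            RoomListItem {
                cached_recency_stamp: info.recency_stamp,
                cached_display_name: info.display_name.clone(),
            }
        })
        .collect()
}

// Zero locks and zero allocations per comparison.
fn sort_items(items: &mut [RoomListItem]) {
    items.sort_by(|left, right| {
        right.cached_recency_stamp.cmp(&left.cached_recency_stamp) // newest first
    });
}
```

The trade-off is staleness: the snapshot reflects the rooms at refresh time, so the cache must be rebuilt when room data changes. That is one lock acquisition per update instead of one per comparison.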

The Results

| Metric     | Before       | After         | Improvement  |
|------------|--------------|---------------|--------------|
| Time       | 53ms         | 676µs         | 98.7% faster |
| Throughput | 18.8K elem/s | 1,477K elem/s | 7,718%       |

The room list went from potentially 5 minutes to instant. The benchmark shows linear, predictable performance instead of erratic spikes.

Why This Matters

This story illustrates three key concepts:

1. Memory has hierarchy. L1 cache access: 1ns. L2 cache: 4ns. Main memory: 100ns. That's like the difference between 1 minute and 1 hour 40 minutes. Keeping data in cache isn't a micro-optimization โ€” it's fundamental.

2. Locks are expensive. Every read() on an RwLock has overhead. When you acquire a lock 322,000 times, that overhead adds up. Better to acquire once and work with the snapshot.

3. Data layout matters. A Vec<RoomListItem> (Array of Structures) is cache-friendly. But sometimes a Structure of Arrays is even better โ€” instead of Vec<Room { a, b, c }>, you have RoomList { a: Vec<A>, b: Vec<B>, c: Vec<C> }. Less padding, fewer cache misses.
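The two layouts contrasted above can be sketched like this (field choices are illustrative):

```rust
// Array of Structures: one Vec, each element carries every field.
struct RoomAos {
    recency_stamp: u64,
    is_space: bool,
}
struct RoomListAos {
    rooms: Vec<RoomAos>,
}

// Structure of Arrays: one Vec per field. A pass that only reads
// recency stamps walks a dense array of u64s and nothing else.
struct RoomListSoa {
    recency_stamps: Vec<u64>,
    is_space: Vec<bool>,
}

impl RoomListSoa {
    // Returns the index of the most recent room, scanning 8 bytes
    // per room instead of a padded 16-byte AoS element.
    fn newest(&self) -> Option<usize> {
        (0..self.recency_stamps.len()).max_by_key(|&i| self.recency_stamps[i])
    }
}
```

SoA shines when hot paths touch only one or two fields; AoS stays simpler when code usually needs every field of a room at once.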

The Lesson

The Rust borrow checker gives you safety. But performance? That's on you. The borrow checker won't tell you that cloning a 144-byte struct inside a sort comparator is killing your performance. It won't tell you that acquiring a lock 300,000 times is creating contention.

You have to think about allocation sizes, how often you acquire locks, and how your data is laid out in memory.

This is where Rust gives you control, but with control comes responsibility. The Matrix team didn't reach for fancy profiling tools first. They understood the problem at the memory level.

And that's the real power of Rust: you can optimize at this level. You just have to know why you should.