Kernel Hell: SLUB Allocator Contention on NUMA Machines

So here’s a weird rabbit hole I went down recently: trying to figure out why Linux memory allocation slows to a crawl under pressure, especially on big multi-socket systems with a ton of cores. The culprit? Good ol’ SLUB. And no, I don’t mean a rude insult; I mean the SLUB allocator, the kernel’s default slab allocator, the thing that services kmalloc() and kmem_cache_alloc().

If you’ve ever profiled a high-core-count server under load and seen strange latency spikes in allocation-heavy workloads (lots of kmalloc/kfree churn from connections, files, and friends), there’s a good chance SLUB contention is part of it.

The Setup

Let’s say you’ve got a 96-core AMD EPYC box. It’s running a real-time app that’s creating and destroying small kernel objects like crazy — maybe TCP connections, inodes, structs for netlink, whatever.

Now, SLUB is supposed to be fast. It uses per-CPU caches so that you don’t have to take a lock most of the time. On the fast path, allocating an object is basically a lockless pop off a per-CPU freelist: one cmpxchg, no spinlock in sight. Great, right?
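
To make that concrete, here’s a drastically simplified user-space analogue of the fast-path idea. Every name in it is made up for illustration; the real code lives in mm/slub.c and additionally pairs the freelist pointer with a transaction counter to dodge ABA and CPU-migration races:

    /* Toy model of a per-CPU freelist with a lock-free pop. */
    #include <stdatomic.h>
    #include <stddef.h>

    struct object {
        struct object *next;                 /* freelist link lives inside the free object */
    };

    struct percpu_cache {
        _Atomic(struct object *) freelist;   /* this CPU's cached free objects */
    };

    /* Fast path: pop the head of this CPU's freelist, no spinlock anywhere. */
    static struct object *fast_alloc(struct percpu_cache *c)
    {
        struct object *old = atomic_load_explicit(&c->freelist, memory_order_acquire);

        while (old) {
            struct object *next = old->next;
            if (atomic_compare_exchange_weak_explicit(&c->freelist, &old, next,
                                                      memory_order_acq_rel,
                                                      memory_order_acquire))
                return old;                  /* got one, and we never took a lock */
        }
        return NULL;                         /* cache empty: time for the slow path */
    }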

Until it’s not.

The Problem: The Slow Path of Doom

When the per-CPU cache runs dry (allocation bursts, objects getting freed on other CPUs, plain memory pressure or fragmentation), you fall into the slow path, and that’s where things get bad:

  • SLUB grabs the per-NUMA-node list_lock spinlock to pull slabs off the node’s partial list and refill the cache.
  • If your local node is short on memory, it might fall back to a remote node, so now you’ve got cross-node memory traffic on top.
  • Meanwhile, other cores are hammering that same lock trying to do the same thing. Boom: contention. (There’s a rough sketch of this right after the list.)
  • Add slab merging and debug options like slub_debug into the mix, and now you’re in full kernel chaos mode.
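
For a feel of its shape, here’s an equally rough user-space model: once the per-CPU freelist is empty, every starving CPU on a node funnels through one per-node lock, and falls back to remote nodes when the local one has nothing. All the names are invented; in the kernel this corresponds roughly to ___slab_alloc() and get_partial_node() taking n->list_lock:

    /* Toy model: per-node object pools guarded by one lock each. */
    #include <pthread.h>
    #include <stddef.h>

    #define NR_NODES 2

    struct object {
        struct object *next;
    };

    struct node_cache {
        pthread_mutex_t list_lock;           /* stand-in for the per-node list_lock */
        struct object *partial;              /* objects still available on this node */
    };

    static struct node_cache nodes[NR_NODES] = {
        { PTHREAD_MUTEX_INITIALIZER, NULL },
        { PTHREAD_MUTEX_INITIALIZER, NULL },
    };

    /* Slow path: try the local node first, then remote nodes (the NUMA fallback). */
    static struct object *slow_alloc(int local_node)
    {
        for (int i = 0; i < NR_NODES; i++) {
            int n = (local_node + i) % NR_NODES;   /* i > 0 means cross-node traffic */
            struct node_cache *nc = &nodes[n];
            struct object *obj;

            pthread_mutex_lock(&nc->list_lock);    /* every starving CPU queues up here */
            obj = nc->partial;
            if (obj)
                nc->partial = obj->next;
            pthread_mutex_unlock(&nc->list_lock);

            if (obj)
                return obj;
        }
        return NULL;   /* nothing anywhere: reclaim, compaction, maybe the OOM killer */
    }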

If you’re really unlucky, your allocation will stall behind direct reclaim or memory compaction, or even trigger the OOM killer if the kernel can’t reclaim fast enough.
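
If you want to check whether a box is actually hitting that, the stall counters in /proc/vmstat are a decent first look. A minimal sketch (the exact counter names drift a bit between kernel versions, so verify them on yours):

    /* Print the direct-reclaim and compaction stall counters from /proc/vmstat.
     * compact_stall / allocstall_* rising while your app stalls is a strong hint
     * that allocations are waiting on reclaim or compaction. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/vmstat", "r");
        if (!f) {
            perror("/proc/vmstat");
            return 1;
        }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "compact_", 8) == 0 || strncmp(line, "allocstall", 10) == 0)
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }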

Why This Is So Hard

This isn’t just “optimize your code” kind of stuff — this is deep down in mm/slub.c, where you’re juggling:

  • Atomic operations in interrupt contexts
  • Per-CPU vs. global data structures
  • Memory locality vs. system-wide reclaim
  • The fact that one wrong lock ordering deadlocks the entire kernel

There are knobs (/proc/slabinfo and slabtop to observe, /sys/kernel/slab/<cache>/ and boot parameters like slub_min_objects to tune, slub_debug to diagnose), but they’re like trying to steer a cruise ship with a canoe paddle. You might see symptoms, but fixing the cause takes patching and testing on bare metal.
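
The observation side is at least easy to script. Here’s a tiny sketch that dumps the biggest caches out of /proc/slabinfo (it needs root, and the 100000-object cutoff is just an arbitrary number for illustration):

    /* Dump the largest slab caches from /proc/slabinfo. The v2.x format starts
     * each data line with: name <active_objs> <num_objs> <objsize> ... */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/slabinfo", "r");
        if (!f) {
            perror("/proc/slabinfo");
            return 1;
        }

        char line[512], name[64];
        unsigned long active, total;

        /* skip the version line and the column-header line */
        fgets(line, sizeof line, f);
        fgets(line, sizeof line, f);

        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "%63s %lu %lu", name, &active, &total) == 3 && total > 100000)
                printf("%-30s %10lu / %lu objects\n", name, active, total);
        }

        fclose(f);
        return 0;
    }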

Things I’m Exploring

Just for fun (and pain), I’ve been poking around the idea of:

  • Introducing NUMA-aware slab refill batching, so we reduce cross-node fallout.
  • Using BPF to trace slab allocation bottlenecks live (if you haven’t tried this yet, it’s surprisingly helpful; there’s a minimal sketch after this list).
  • Adding a kind of per-node, per-type draining system where compaction and slab freeing can happen more asynchronously.
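
On the BPF point, the cheapest useful thing I’ve found is just counting how often CPUs fall into the slow path. Below is a minimal kernel-side sketch in C (compile with clang -target bpf, attach with libbpf or bpftool); it assumes ___slab_alloc is still a probeable symbol on your kernel, which you can check in /proc/kallsyms:

    /* Count SLUB slow-path entries by kprobing ___slab_alloc(). */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } slowpath_hits SEC(".maps");

    SEC("kprobe/___slab_alloc")
    int count_slab_slowpath(void *ctx)
    {
        __u32 key = 0;
        __u64 *count = bpf_map_lookup_elem(&slowpath_hits, &key);

        if (count)
            (*count)++;   /* per-CPU map slot, so a plain increment is fine */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Dump the map every few seconds with bpftool map dump; if the hit rate spikes in step with your latency, the slow path is at least in the blast radius.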

Not gonna lie — some of this stuff is hard. It’s race-condition-central, and the kind of thing where adding one optimization breaks five other things in edge cases you didn’t know existed.

SLUB is amazing when it works. But when it doesn’t — especially under weird multi-core, NUMA, low-memory edge cases — it can absolutely wreck your performance.

And like most things in the kernel, the answer isn’t always “fix it” — sometimes it’s “understand what it’s doing and work around it.” Until someone smarter than me upstreams a real solution.