Fine-grained Locking with Two-Bit Mutexes
by Malte Skarupke
Let’s say you want to have a mutex for every item in a list with 10k elements. It feels a bit wasteful to use a std::mutex for each of those elements: on Linux std::mutex is 40 bytes, on Windows it’s 80 bytes. But mutexes don’t need to be that big. You can fit a mutex into two bits, and it’s going to be fast. So we could fit a mutex into the low bits of a pointer. If your container already stores pointers, you might be able to store a mutex for each element with zero space overhead, not even any extra operations during initialization. You’d pay no cost until you use a mutex for the first time.
Let’s start with a mutex that uses one byte. It’s easy now that C++ has added futexes to the standard:
#include <atomic>
#include <cstdint>

struct one_byte_mutex
{
    void lock()
    {
        if (state.exchange(locked, std::memory_order_acquire) == unlocked)
            return;
        while (state.exchange(sleeper, std::memory_order_acquire) != unlocked)
            state.wait(sleeper, std::memory_order_relaxed);
    }
    void unlock()
    {
        if (state.exchange(unlocked, std::memory_order_release) == sleeper)
            state.notify_one();
    }
private:
    std::atomic<uint8_t> state{ unlocked };
    static constexpr uint8_t unlocked = 0;
    static constexpr uint8_t locked = 0b01;
    static constexpr uint8_t sleeper = 0b10;
};
The futex operations here are the calls to “wait()” and “notify_one()”. These are like simpler versions of condition variables. The “state.wait(sleeper)” call will put the thread to sleep only if state == sleeper, and “notify_one()” will wake one thread that went to sleep on this variable. You can do this on any atomic variable now; there are no special requirements. How can you sleep and notify on anything? The kernel has a hash table of sleeping threads which is indexed by the address of the variable, so all you need is a unique address. Most operating systems support this, even Windows. (Microsoft, when they copied this idea from Linux, had the chutzpah to patent futexes.) If an operating system doesn’t have this, it’s implemented with a global hash table in the standard library.
This mutex is optimized for the common case when it is unlocked and there is no contention. In that case we only have to do one exchange() when locking, and one exchange() when unlocking. It’s a bit tricky to convince yourself that this is correct. I verified the implementation in TLA+. The main trick to keeping it simple is that we only try to set the state to “locked” once. If that doesn’t work, we instead set it to “sleeper”. This is necessary because we don’t know how many threads are sleeping, and unlock() clears both bits. So if there was one sleeping thread, it has to set the “sleeper” flag just in case there are others.
One tricky interleaving is if a new thread comes in and does the initial “exchange()” call at a bad time, clearing the “sleeper” bit and causing an unlocking thread to not call notify_one(). In that case the newly entering thread also sets the sleeper flag, so it will call notify_one() when it unlocks.
Two Bit Mutex
The one_byte_mutex is simple and storing a mutex in one byte is good. Storing it in two bits, say the low bits of a pointer, is trickier. Here is an implementation that does that:
template<typename T>
struct pointer_with_mutex
{
    T* get() const
    {
        uint64_t masked = pointer.load(std::memory_order_relaxed) & ~both_bits;
        return reinterpret_cast<T*>(masked);
    }
    void set(T* ptr)
    {
        static_assert(std::alignment_of<T>::value >= 4);
        uint64_t as_int = reinterpret_cast<uint64_t>(ptr);
        uint64_t old = pointer.load(std::memory_order_relaxed);
        while (!pointer.compare_exchange_weak(old, (old & both_bits) | as_int, std::memory_order_relaxed))
        {
        }
    }
    void lock()
    {
        uint64_t old = pointer.load(std::memory_order_relaxed);
        if (!(old & both_bits) && pointer.compare_exchange_strong(old, old | locked, std::memory_order_acquire))
            return;
        for (;;)
        {
            if (old & sleeper)
            {
                pointer.wait(old, std::memory_order_relaxed);
                old = pointer.load(std::memory_order_relaxed);
            }
            else if (pointer.compare_exchange_weak(old, old | sleeper, std::memory_order_acquire))
            {
                if (!(old & both_bits))
                    return;
                pointer.wait(old | sleeper, std::memory_order_relaxed);
                old = pointer.load(std::memory_order_relaxed);
            }
        }
    }
    void unlock()
    {
        uint64_t old = pointer.fetch_and(~both_bits, std::memory_order_release);
        if (old & sleeper)
            pointer.notify_one();
    }
private:
    std::atomic<uint64_t> pointer{ 0 };
    static constexpr uint64_t locked = 0b01;
    static constexpr uint64_t sleeper = 0b10;
    static constexpr uint64_t both_bits = locked | sleeper;
};
This one is significantly larger, but it’s a direct translation of the above, just replacing “exchange” with “compare_exchange” to leave the other bits unaffected, plus some extra conditionals to skip the compare_exchange when we can. (my first attempt was to just replace exchange() with fetch_or(), which would lead to simpler code, but fetch_or() just uses compare_exchange internally, in a way that was noticeably slower)
The thing to point out is that none of the code depends on which bits we use, or on what the remaining bits are used for. In this case I use them for a pointer, which has to be at least four-byte-aligned, but you could use any two bits for the mutex and store anything in the remaining bits.
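The alignment requirement is what frees up the low bits: any pointer aligned to at least four bytes has its low two bits equal to zero. A small sketch of the packing arithmetic (the helper names are mine, not from the post):

```cpp
// Sketch of the tag-bit arithmetic; helper names are mine, not the post's.
#include <cstdint>

constexpr std::uintptr_t both_bits = 0b11;

// Pack two tag bits into the low bits of a >= 4-byte-aligned pointer.
std::uintptr_t pack(void* p, std::uintptr_t tag)
{
    return reinterpret_cast<std::uintptr_t>(p) | (tag & both_bits);
}

// Mask the tag bits back off to recover the original pointer.
void* unpack_pointer(std::uintptr_t v)
{
    return reinterpret_cast<void*>(v & ~both_bits);
}

std::uintptr_t unpack_tag(std::uintptr_t v)
{
    return v & both_bits;
}
```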
Performance
How does this perform? It’s a bit slow when the mutex is contended, but it’s actually faster than std::mutex for an unlocked mutex. Here are the timings on Windows (compiled with Visual Studio 2019):
                    | lock/unlock single-threaded | lock/unlock many threads
std::mutex          | 22ns                        | 50ns
one_byte_mutex      | 10ns                        | 67ns
pointer_with_mutex  | 18ns                        | 94ns
Here are the timings on Linux (compiled with clang-12 against libc++-12):
                    | lock/unlock single-threaded | lock/unlock many threads
std::mutex          | 12ns                        | 71ns
one_byte_mutex      | 8ns                         | 228ns
pointer_with_mutex  | 15ns                        | 255ns
This is running a benchmark where I call lock(); unlock(); in a loop. The first column is running the benchmark with a single thread, so we always hit the fast path where we don’t have to go to sleep. The second column is running with sixteen threads, so the mutex will often be locked. I then divide the length of the benchmark by the number of lock()/unlock() pairs to get the average time per call. These are running on the same machine, booted into either Windows 10 or Ubuntu 20.04.
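A rough sketch of such a benchmark loop (the function name, thread count, and iteration count here are my own assumptions, not the author’s actual harness):

```cpp
// Rough sketch of the benchmark loop; the function name, thread count,
// and iteration count are my assumptions, not the author's harness.
#include <chrono>
#include <mutex>   // for std::mutex in the usage example below
#include <thread>
#include <vector>

template<typename MutexT>
double time_per_lock_unlock_ns(int num_threads, long iterations)
{
    MutexT m;
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
    {
        threads.emplace_back([&] {
            for (long i = 0; i < iterations; ++i)
            {
                m.lock();
                m.unlock();
            }
        });
    }
    for (std::thread& th : threads)
        th.join();
    std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    // Average over the total number of lock()/unlock() pairs.
    return elapsed.count() / (double(num_threads) * double(iterations));
}
```

Called as, say, time_per_lock_unlock_ns<std::mutex>(1, 1000000) for the single-threaded column and with sixteen threads for the contended one.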
So if you expect that you will usually find your mutex unlocked, these will actually be both faster and smaller for you than std::mutex. But if you have lots of contention, you may have to put a bit more work into these to make them fast. (my first guess would be that this particular benchmark would benefit from spinning for a bit before sleeping, because the critical section is small. My second guess is that these futexes aren’t implemented efficiently in the standard library yet. I didn’t look into either of these guesses)
Four Mutexes per Byte
So if we only need two bits for a mutex, can we store four mutexes in a single byte? Maybe, but you can’t control which mutex you want to wake up. Futexes work by having a hash table keyed by the address of the futex, and you can’t address bits, you can only address bytes. So all four mutexes would have the same address, which means they would all be the same futex.
You could probably still make it work by using notify_all() instead of notify_one(), and that might be fine if you expect the mutex to almost always sit idle. Or alternatively you could have a look at the hash table that your standard library uses to implement futexes in userspace, copy it, and change the key to not just be a pointer. But I’ll leave that as an exercise for the reader.
Code
The code for this is available here. It’s released under the boost license.
Also I’m trying out a donation button for the first time. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 for individuals (or your local cost for an item on a restaurant menu), or $1000 for organizations (or your local cost of hiring a contractor for a day). But any amount is appreciated.
Can you share the TLA+ spec?
It was already in the code repository. Or at least I thought it was, but turns out I had only included the file for the two_bit_mutex. Now both files are there:
https://github.com/skarupke/two_bit_mutex
This is fascinating- thanks for writing it up and making it public!
I’m curious- have you done any performance analysis under extremely high loads with many threads? It would be very interesting to see what effect false-sharing reload penalties have on an array of single-byte-packed mutexes.
Best,
Charles
The last column measures high load: one software thread per hardware thread. I did not find out why exactly it’s slower, but I do know that my checks to avoid the compare_exchange when possible made a noticeable difference to that column. This could be an effect of cache line ownership transferring between threads. Obviously you would also see that with many mutexes next to each other.
But that is one benefit of the mutex in two bits: You may be able to interleave it with your regular data. Which, depending on the size of your data, would make false-sharing rarer.
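One way to picture that interleaving (a sketch with made-up names, not code from the post): each element carries its own two tag bits, so the lock travels in the same cache line as the data it protects.

```cpp
// Hypothetical layout sketch (names are mine): the element's 'next'
// pointer doubles as its lock, so lock and data share a cache line.
#include <atomic>
#include <cstdint>

template<typename T>
struct locked_node
{
    T value{};
    // Low two bits are the mutex; remaining bits are the next pointer.
    std::atomic<std::uintptr_t> next_and_lock{0};
};
```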
I was curious how this worked on Linux, where futexes are always 32 bits, so I pulled an example apart in Compiler Explorer (https://cpp.godbolt.org/z/9K67jfhn4). It looks like both gcc and clang use a static 16-bucket side table to hold futexes and waiter counts. If the atomic variable isn’t a 32-bit primitive, it uses the futex in the corresponding bucket instead of the atomic variable itself. In this case, notify_one ends up having to wake all waiters for that bucket, since the bucket might have waiters for multiple atomic variables.

Oh neat. I was curious, too, but didn’t think it would be as easy as checking compiler explorer. (I didn’t think the call would be inlined)
The number 16 is less crazy than it sounds if you assume that you have one software thread per hardware thread. There just shouldn’t be that many sleepers. But since it’s hardcoded to be the same independent of CPU, it does seem kinda low… Especially since there is no linked list in the buckets, the way the kernel does it. (Not sure if that’s possible outside of the kernel)
Overall it looks a little clunky. This might explain why, on the same machine, Linux is so much slower than Windows under contention.
I think smaller futexes are coming to Linux, too. I once made an attempt to get them in, and I think someone more persistent than me took over.
Webkit’s ParkingLot does this in userland. Each thread’s (intrusively-linked) node is stored in TLS, and it contains the thread’s dedicated futex and the address of the condition object the thread is waiting on (if any).
I see, each bucket in the hash table has a mutex. That does make it easier. They say that they couldn’t use a spinlock because in their interface you can use arbitrary lambdas as conditionals for their futex-like. The C++ standard doesn’t have that problem: you will only do one comparison before deciding to go to sleep. So a spinlock might be fine.
This might actually be a fun project to implement…