Somebody was nice enough to link my blog post on Hacker News and Reddit. While I didn’t do that, I still read most of the comments on those websites. For some reason the comments I got on my website were much better than the comments on either of those sites. But there seem to be some common misunderstandings underlying the bad comments, so I’ll try to clear them up.

The top comment on Hacker News essentially says “meh, this can’t sort everything and we already knew that radix sort was faster.” Firstly, I don’t understand that negativity. My blog post was essentially “hey everyone, I am very excited to share that I have optimized and generalized radix sort” and your first response is “but you didn’t generalize it all the way.” Why the negativity? I take one step forward and you complain that I didn’t take two steps? Why not be happy that I made radix sort apply to more domains than where it applied before?

So I want to talk about generalizing radix sort even more: The example of something that I don’t handle is sorting a vector of std::sets. Say a vector of sets of ints. The reason why I can’t sort that is that std::set doesn’t have an operator[]. std::sort does not have that problem because std::set provides comparison operators.

There are two possible solutions here:

- Where I currently use operator[], I could use std::next instead. So instead of writing container[index] I could write *std::next(container.begin(), index).
- I could not use an index to indicate the current position, and only use iterators instead. For that I would have to allocate a second buffer to store one iterator per collection to sort.
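To make the cost of the first option concrete, here is a minimal sketch. The helper name `current_element` is mine; nothing like it exists in ska_sort:

```cpp
#include <cstddef>
#include <iterator>
#include <set>

// Sketch of solution 1: an accessor that works for containers without
// operator[], at the cost of walking from begin() every single time.
// The name current_element is mine; ska_sort has no such function.
template<typename Container>
auto current_element(const Container & container, std::size_t index)
    -> decltype(*container.begin())
{
    // O(index) pointer dereferences for a std::set, instead of the O(1)
    // that container[index] gives us for a vector
    return *std::next(container.begin(), index);
}
```

For a `std::vector` this compiles but is pointless; for a `std::set` it is the only option, and that repeated walk from `begin()` is exactly where the problem comes from.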

Both of these approaches have problems. The first one is obviously slow because I have to iterate over the collection from the beginning every time that I want to look up the current element that I want to sort by. Meaning if I need to look at the first n elements to tell two lists apart, I need to walk over those elements n times, resulting in O(n^2) pointer dereferences. The normal set comparison operators don’t have that problem because when they compare two sets, they can iterate over both in parallel. So when they need to look at n elements to tell two lists apart, they can do that in O(n).
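For comparison, the parallel iteration that keeps ordinary set comparison at O(n) looks roughly like this. This is a simplified sketch of what `std::lexicographical_compare` does, and `set_less` is my name for it:

```cpp
#include <set>

// Simplified sketch of how two sets are compared lexicographically:
// both are walked in parallel, so telling them apart after looking at
// n elements costs O(n), not O(n^2). set_less is my name for this;
// the real operator< uses std::lexicographical_compare.
bool set_less(const std::set<int> & a, const std::set<int> & b)
{
    auto it_a = a.begin();
    auto it_b = b.begin();
    for (; it_a != a.end() && it_b != b.end(); ++it_a, ++it_b)
    {
        if (*it_a < *it_b)
            return true;
        if (*it_b < *it_a)
            return false;
    }
    // if one set is a prefix of the other, the shorter one sorts first
    return it_a == a.end() && it_b != b.end();
}
```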

I also didn’t want to allocate the extra memory that would be required for the second approach because I didn’t want ska_sort to sometimes require heap allocations, and sometimes not require heap allocations depending on what type it is sorting.

The point is: I could easily generalize radix sort even more so that it can handle this case as well, but it doesn’t seem interesting. Both approaches here have clear problems. I think you should just use std::sort here. So I’ll limit ska_sort to things that can be accessed in constant time.

The other question is why you would want me to handle this. I stopped generalizing when I thought that I could handle all real use cases. Radix sort can be generalized more so that it can sort everything if you want it to. If you really need to sort a vector of std::sets or a vector of std::lists, then I can probably implement the second solution for you. But **the real question isn’t whether ska_sort can sort everything, but whether it can sort your use case. And the answer is almost certainly yes.** If you really have a use case that ska_sort can not sort, then I can understand the criticism. But what do you want to sort that can not be reached in constant time?

That being said one thing that still needs to be done is that I need to allow customization of sorting behavior. Which is also what I wrote in my last blog post. Especially when sorting strings there are good use cases for wanting custom sorting behavior. Like case insensitive sorting. Or number aware sorting so that “foo100” comes after “foo99”. I’ll present an idea for that further down in this blog post. But the work there is not to generalize ska_sort further so that it can sort more data, but instead to give more customization options for the data that it can sort.

Before finishing this section, I actually quickly implemented solution 1 from the approaches for sorting sets above, and the graph is interesting:

ska_sort actually beats std::sort for quite a while there. std::sort is only faster when there are a lot of sets. If I construct the sets such that there is very little overlap between them, ska_sort is actually always faster. Does that mean that I should provide this code? I decided against it for now because it’s not a clear win. I think if I did handle this, I would want to use the allocating solution because I expect a bigger win from that one.

One criticism that I didn’t understand at first was that I am comparing apples to oranges when I compare my ska_sort to std::sort. You can see that same criticism voiced in the top Hacker News comment mentioned above. To me they are both sorting algorithms, and who cares how they work internally? If you want to sort things, the only thing you care about is speed, not what the algorithm does internally.

A friend of mine had a good analogy though: **Comparing a better radix sort to std::sort is like writing a faster hash table and saying “look at how much faster this is than std::map.”**

However I contend that **what I did is not equivalent to writing a better hash table, but it is equivalent to writing the first general purpose hash table**. Imagine a parallel world where people have used hash tables for a while, but only ever for special purposes. Say everybody knows that you can only use hash tables if your key is an integer, otherwise you have to use a search tree. And then somebody comes along with the realization that you can store anything in a hash table as long as you provide a custom hash function. In that case it doesn’t make sense to compare this new hash table to older hash tables, because older hash tables simply can’t run most of your benchmarks because they only support integer keys.

Similarly the only thing that I could compare my radix sort to was std::sort, because older radix sort implementations couldn’t run my benchmarks because they could literally only sort ints, floats and strings.

However that argument doesn’t fully apply to me, because I made two claims: I claimed that I generalized radix sort, and also that I optimized radix sort. For the second claim I should have provided benchmarks against other radix sort implementations. And even though something like boost::spreadsort can’t run all my benchmarks, I should have still compared against it in the benchmarks that it can run, meaning the sorting of ints, floats and strings. So yeah, I don’t know what I was thinking there… Sometimes your brain just skips to the wrong conclusion and you never think about it a second time…

So anyways, here is ska_sort compared to boost::spreadsort when sorting uniformly distributed integers:

What we see here is that ska_sort is generally faster than boost::spreadsort. Except for that area between 2048 elements and 16384 elements. The reason for this is mainly that spreadsort picks a different number of partitions than I do. In each recursive step I split the input into 256 partitions. spreadsort uses more. It doesn’t use a fixed amount like I do, so I can’t tell you a simple number, except that it usually picks more than 256.

I had played around with using a different number of partitions in my first blog post about radix sort, but I didn’t have good results back then: I found that if I used 2048 partitions, sorting would be faster if the collection had between 1024 and 4096 elements; in all other cases using 256 partitions was faster. It’s probably worth trying a variable number of partitions like spreadsort uses. Maybe I can come up with an algorithm that’s always faster than spreadsort. So there you go: a real benefit just from comparing against another radix sort implementation.

Let’s also look at the graph for sorting uniformly distributed floats:

This graph is interesting for two reasons: 1. spreadsort has much smaller waves than when it sorts ints. 2. All of the algorithms seem to suddenly speed up when there are a lot of elements.

I have no good explanation for the second thing. But it is reproducible, so I could investigate more. My best guess is that this is because uniform_real_distribution just can’t produce this many unique values. (I’m just asking for floats in the range from 0 to 1) So I’m getting more duplicates back there. I tried switching to an exponential_distribution, but the graph looked similar.

The reason spreadsort has smaller waves seems to be that it adjusts its algorithm based on how big the range of inputs is. When sorting ints it could get ints over the entire 32 bit range. When sorting floats it only gets values from zero to one. I need to look more into what spreadsort actually does with that information, but it does compute it and uses it to determine how many partitions to use. There’s no time for looking into that now though. Instead let’s look at sorting of strings:

This is the “sorting long strings” graph from my last blog post. And oops, that’s embarrassing: spreadsort seems to be better at sorting strings than ska_sort. Which is surprising because I copied parts of my algorithm from spreadsort, so it should work similarly. Stepping through it, there seem to be two main reasons:

- When sorting strings, I have to subdivide the input into 257 partitions: One partition for all the strings that are shorter than my current index, and 256 partitions for all the possible character values. I do that in two passes over the input: First split off all the shorter ones, second run my normal sorting algorithm which splits the remaining data into 256 partitions. spreadsort does this in one pass. There is no reason why I couldn’t do the same thing. Except that it would complicate my algorithm even more because I’d need two slightly different versions of my inner loop. I’ll try to do it when I get to it.
- Spreadsort takes advantage of the fact that strings are always stored in a single chunk of memory. When it tries to find the longest common prefix between all the strings, it uses memcmp which will internally compare several bytes at a time. In my algorithm I have no special treatment for strings: It’s the same algorithm for strings, deques, vectors, arrays or anything else with operator[]. This means I have to compare one element at a time because if you pass in a std::deque, memcmp wouldn’t work. I could solve that by specializing for containers that have a .data() function. I would run into a second problem though: You might be sorting a vector of custom types, in which case memcmp would once again be the wrong thing. It still seems solvable: I just need even more template special cases for when the thing to be sorted is a character pointer in a container that has a data() member function. Doable, but adds more complexity.
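A sketch of what that specialization could look like, assuming a hypothetical detection trait. Both `has_data_member` and `compare_prefix` are names I made up for this sketch; this is not code from ska_sort or spreadsort:

```cpp
#include <cstddef>
#include <cstring>
#include <deque>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical detection of containers that expose contiguous storage
// through a data() member function. Both names here are mine.
template<typename T, typename = void>
struct has_data_member : std::false_type {};
template<typename T>
struct has_data_member<T, std::void_t<decltype(std::declval<const T &>().data())>>
    : std::true_type {};

// Fast path: memcmp compares several bytes at a time, but is only valid
// for contiguous containers of char.
template<typename C>
std::enable_if_t<has_data_member<C>::value, int>
compare_prefix(const C & a, const C & b, std::size_t length)
{
    return std::memcmp(a.data(), b.data(), length);
}

// Generic fallback: one element at a time. This works for std::deque
// and for containers of custom types, where memcmp would be wrong.
template<typename C>
std::enable_if_t<!has_data_member<C>::value, int>
compare_prefix(const C & a, const C & b, std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i)
    {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```

A real version would additionally have to check that the element type is a character type before taking the memcmp path, which is exactly the extra template complexity mentioned above.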

So in conclusion spreadsort will stay faster than ska_sort at sorting strings for now. The reason for that is simply that I don’t want to spend the time to implement the same optimizations at the moment.

The top Reddit comment talked about something I wrote about the recursion count. It quotes a part where I make two statements: 1. If I sort a million ints or a thousand ints, I always have to recurse at most four times. 2. If I can tell all the values apart in the first byte, I can stop recursing right there. The comment points out that these two statements apply to different ranges of inputs. Which, yes, they do. The comment makes fun of me for not stating that these apply to different ranges, then it contains some bikeshedding about internal variable names and some wrong advice about merging loops that ping-pong between the buffers in ska_sort_copy. (when ping-ponging between two buffers A and B, you can’t start the loop that reads B until the loop that reads A is finished writing to buffer B. Otherwise you read uninitialized data) I really don’t understand why this is the top comment…

But I’ll use this as an excuse to talk in detail about the recursion count and about big O complexity because that was a common topic in the comments. (including in the responses to that comment)

The point where I have to recurse into the second byte is actually more complex than you might think: I fall back to std::sort if a partition has fewer than 128 elements in it. That means that if the inputs are uniformly distributed on the first byte, I can handle up to 127*256 = 32512 values without any recursive calls. The 256 comes from the number of possible values for the first byte, and the 127 comes from the fact that if I create 256 partitions of 127 elements each, I will fall back to std::sort within each of those partitions instead of recursing to a second call of ska_sort.

Now in reality things are not that nicely distributed. Let me insert the graph again about sorting uniformly distributed ints:

The “waves” that you can see on ska_sort happen every time that I have to do one more recursive call. So what we see here is that in that middle wave, from 4096 to 16384 items, the pass that looks at the first byte creates more and more partitions that are large enough to require a recursive call. For example let’s say that at 2048 elements I randomly get 80 items with the value 62 in the first byte. Then at 4096 elements I randomly get 130 elements with the value 62 in the first byte. At 2048 elements I call std::sort directly; at 4096 elements I will do one more recursive call, splitting those 130 elements into another 256 partitions and then calling std::sort on each of those.

Then after 16384 what happens is that those partitions are big enough that I can do nice large loops through them, and the algorithm speeds up again. That is until I have to recurse a second time starting at 512k items and I slow down again.

For integers there is a natural limit to these waves: There can be at most four of these waves because there are only four bytes in an int.

That brings us to the discussion about big O complexity. A funny thing to observe in the comments was that the more confident somebody was in claiming that I got my big O complexity wrong, the more likely they were to not understand big O complexity. But I will admit that the big O complexity for radix sort is confusing because it depends on the type of the data that you’re sorting.

To start with I claim that a single pass over the data for me is O(n). This is not obvious from the source code because there are five loops in there, three of which are nested. But after looking at it a bit you find that those loops depend on two things: 1. the number of partitions, 2. the number of elements in the collection. So a first guess for the complexity would be O(n+p) where p is the number of partitions. That number is fixed in my algorithm to 256, so we end up with O(n+256) which is just O(n). But that 256 is the reason why ska_sort slows down when it has lots of small partitions.

Now every time that I can’t separate the elements into partitions of fewer than 128 elements, I have to do a recursive call. So what is the impact of that on the complexity? A simple way of looking at that is to say it’s O(n*b) where b is the number of bytes I need to look at until I can tell all the elements apart. When sorting integers, in the worst case this would be 4, so we end up with O(n*4) which is just O(n). When sorting something with a variable number of bytes, like strings, that b number could be arbitrarily big though. One trick I do to reduce the risk of hitting a really bad case there is that I skip over common prefixes. Still it’s easy to create inputs where b is equal to n. So the algorithm is O(n^2) for sorting strings. But I actually detect that case and fall back to std::sort for the entire range then. So ska_sort is actually O(n log n) for sorting strings.

I like the O(n*b) number better though because the graph doesn’t look like an O(n) graph. (ska_sort_copy however does look like an O(n) graph) The O(n*b) number gives a better understanding of what’s going on. Then we can look at the waves in the graph and say that at that point b increased by 1. And we can also see that b is not independent of n. (it will become independent of n once n is large enough, say when I’m sorting a trillion ints, but for small numbers b increases together with n)

From this analysis you would think that my algorithm is slowest when all numbers are very close to each other. Say they’re all close to 0. Because then I would have to look at all four bytes until I can tell all the numbers apart. In fact the opposite happens: My algorithm is fastest in these cases. The reason is that the first three passes over the data are very fast in this case because all elements have the same value for the first three bytes. Only the last pass actually has to do anything. (this is the “sorting geometric_distribution ints” graph from my last blog post where ska_sort ends up more than five times faster than std::sort)

Finally when looking at the complexity we have to consider the std::sort fallback. I will only ever call std::sort on partitions of fewer than 128 items. That means that the complexity of the std::sort fallback is not O(n log n) but it’s O(n log 127), which is just O(n). It’s O(n log 127) because a) I call std::sort on every element, so it has to be at least O(n), b) each of those calls to std::sort only sees at most 127 elements, so the recursive calls in quick sort are limited, and those recursive calls are responsible for the log n part of the complexity. If this sounds weird, it’s the same reason that Introsort (which is used in std::sort) is O(n log n) even though it calls insertion sort (an O(n^2) algorithm) on every single element.

Some of the best comments I got were about other good sorting algorithms. And it turns out that other people have generalized radix sort before me.

One of those is BinarSort, which was written by William Gilreath, who commented on my blog post. BinarSort basically says “everything is made out of bits, so if we sort bits, we can sort everything.” Which is a similar line of thinking to the one that led me to generalize radix sort. The big downside of looking at everything as bits is that it leads to a slow sorting algorithm: For an int with 32 bits you have to do up to 31 recursive calls. Running BinarSort through my benchmark for sorting ints looks like this:

The first thing to notice is that BinarSort looks an awful lot as if it’s O(n log n). The reason for that is the same reason that ska_sort doesn’t look like a true O(n) graph: The number of recursive calls is related to the number of elements. BinarSort has to do up to 31 recursive calls. At the point where it reaches that number of recursive calls, you would expect the graph to flatten out. The quick sort which is used in std::sort would continue to grow even then. However it looks like you need to sort a huge number of items to get to that point in the graph. Instead you see an algorithm that keeps on getting slower and slower as it has to do more and more recursive calls, never reaching the point where it would turn linear.

The other big problem with BinarSort is that even though it claims to be general, it only provides source code for sorting ints. It doesn’t provide a method for sorting other data. For example it’s easy to see that you can’t sort floats with it directly, because if you sort floats one bit at a time, positive floats come before negative floats. I now know how to sort floats using BinarSort because I did that work for ska_sort, but if I had read the paper a while ago, I wouldn’t have believed the claim that you can sort everything with it. If you only provide source code for sorting ints, I will believe that you can only sort ints.
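That sign problem has a well-known fix: remap the float’s bits to an unsigned integer before sorting. Here is a sketch of that standard trick; the function name is mine, not necessarily what ska_sort calls it:

```cpp
#include <cstdint>
#include <cstring>

// The well-known trick for making IEEE float bits sort correctly as
// unsigned integers: negative floats get all their bits flipped (they
// are stored sign-and-magnitude, so bigger bits mean more negative),
// positive floats just get the sign bit set so they land above all
// the negatives. The function name is mine.
std::uint32_t float_to_radix_key(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // type-pun without undefined behavior
    if (bits & 0x80000000u)
        return ~bits;               // negative: reverse the order
    else
        return bits | 0x80000000u;  // positive: shift above the negatives
}
```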

A much more promising approach is this paper by Fritz Henglein. I didn’t read all of the paper but it looks like he did something very similar to what I did, except he did it five years ago. According to his graphs, his sorting algorithm is also much faster than older sorting algorithms. So I think he did great work, but for some reason nobody has heard about it. The lesson that I would take from that is that if you’re doing work to improve performance of algorithms, don’t do it in Haskell. The problem is that sorting is slow in Haskell to begin with. So he is comparing his algorithm against slow sorting algorithms and it’s easy to beat slow sorting algorithms. I think that his algorithm would be faster than std::sort if he wrote it in C++, but it’s hard to tell.

A great thing that happened in the comments was that Morwenn adapted his sorting algorithm Vergesort to run on top of ska_sort. The result is an algorithm that performs very well on pre-sorted data while still being fast on random data.

This is the graph that he posted. ska_sort is just ska_sort by itself, verge_sort is a combination of verge sort and ska_sort. Best of all, he posted a comment explaining how he did it.

So that is absolutely fantastic. I’ll definitely attempt to bring his changes into the main algorithm. There might even be a way to merge his loop over the data with my first loop over the data, so that I don’t even have to do an extra pass.

This brings me to future work:

Custom sorting behavior is my next task. I don’t have a full solution yet, but I have something that can handle case-insensitive sorting of ASCII characters and it can do number-aware sorting. The hope is that something like Unicode sorting could be done with the same approach. The idea is that I expose a customization point where you can change how I use iterators in collections. You can change the return value from the iterator, and you can change how far the iterator will advance. So for case insensitive sorting you could simply return ‘a’ from the iterator when the actual value was ‘A’.
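For the case-insensitive example, the custom accessor could be as small as this. This is only a sketch of the idea, not ska_sort’s actual customization interface:

```cpp
#include <cctype>

// Hypothetical custom accessor for ASCII case-insensitive sorting:
// instead of returning the stored character, fold it to lowercase.
// This sketches the idea only; it is not ska_sort's actual interface.
unsigned char case_insensitive_key(char c)
{
    return static_cast<unsigned char>(std::tolower(static_cast<unsigned char>(c)));
}
```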

The tricky part is number-aware sorting. My current idea is that you could return an int instead of a char, and then advance the iterator several positions. You have to be a bit careful with the int that you return, because you want to be able to return either a character or a number. I could add support for std::variant (I should probably do that anyway) but we can also just say that for characters, we cast to an int, and for numbers we return the lowest int plus the number. So for “foo100” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+100, and for “foo99” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+99. Then you would advance the iterator one position for each of the first three characters, three positions for the number 100, and two positions for the number 99.

One tricky part is that you always have to move the iterator forward by the same distance when elements have the same value. Meaning if two different strings have the value INT_MIN+100 at the current index, they both have to advance their iterators by three elements; one of them can’t advance by four. I need that assumption so that for recursive calls, I only have to advance a single index: I won’t actually store the iterators that you return, only a single index that I can use for all values that fell into the same partition.

I think it’s a promising idea. It might also work for sorting Unicode, but that is such a complicated topic that I have no idea whether it will. I think the only way to find out is to start working on it and see if I run into problems.
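The tokenization step of that idea could look roughly like this. Everything here is a sketch of the proposal above, not existing ska_sort code, and the function name is mine:

```cpp
#include <cctype>
#include <climits>
#include <cstddef>
#include <string>
#include <utility>

// Sketch of the number-aware idea: at a given position, return either
// the character cast to an int, or INT_MIN plus the parsed number,
// together with how far the iterator should advance. (Note: leading
// zeros would break the "same value means same advance" rule -- "foo07"
// and "foo7" both produce INT_MIN+7 but advance differently -- so a
// real implementation would have to deal with that.)
std::pair<int, std::size_t> number_aware_token(const std::string & s, std::size_t pos)
{
    if (std::isdigit(static_cast<unsigned char>(s[pos])))
    {
        int value = 0;
        std::size_t end = pos;
        while (end < s.size() && std::isdigit(static_cast<unsigned char>(s[end])))
        {
            value = value * 10 + (s[end] - '0');
            ++end;
        }
        return { INT_MIN + value, end - pos };
    }
    return { static_cast<unsigned char>(s[pos]), 1 };
}
```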

The other task that I want to do is to merge my algorithm with verge sort so that I can also be fast for pre-sorted ranges.

The big problem that I have right now is that I actually want to take a break from this. I don’t want to work on this sorting algorithm for a while. I was actually already kinda burned out on this before I even wrote that first blog post. At that point I had spent a month of my free time on this and I was very happy to finally be done with this when I hit “Publish” on that blog post. So sorry, but the current state is what it’s going to stay at. I’m doing this in my spare time, and right now I’ve got other things that I would like to do with my spare time. (Dark Souls III is pretty great, guys)

That being said I do intend to use this algorithm at work, and I do expect some small improvements to come out of that. These things always get improved as soon as you actually start using them. Also I’ll probably get back to this at some point this year.

Until then you should give ska_sort a try. This can literally make your sorting two times faster. Also if you have data that this can’t sort, I am very curious to hear about that. Here’s a link to the source code and you can find instructions on how to use it in my last blog post.


The easiest way to quickly generate truly random numbers is to use a std::random_device to seed a std::mt19937_64. That way we pay a one-time cost of using random_device to generate a seed, and then have quick random numbers after that. Except that the standard doesn’t provide a way to do that. In fact it’s worse than that: It provides an easy, wrong way to do it (use a std::random_device to generate a single int and use that single int as the seed) and a slow, slightly wrong way to do it. (use a std::random_device to fill a std::seed_seq and use that as the seed) There’s a proposal to fix this (that link also contains reasons why the existing methods are wrong), but I’ve actually been using a tiny class for this:

```cpp
#include <random>

struct random_seed_seq
{
    template<typename It>
    void generate(It begin, It end)
    {
        for (; begin != end; ++begin)
        {
            *begin = device();
        }
    }

    static random_seed_seq & get_instance()
    {
        static thread_local random_seed_seq result;
        return result;
    }

private:
    std::random_device device;
};
```

(the license for the code in this blog post is the Unlicense)

This class has the same generate() function that std::seed_seq has and can be used to initialize a std::mt19937_64. The static get_instance() function is a small convenience to make initialization easier so that you can write this:

```cpp
std::mt19937_64 random_source{random_seed_seq::get_instance()};
```

Without the get_instance() function this would have to be a two-liner.

Finally, a lot of code doesn’t care where its random numbers come from. Sometimes you just want a random float in the range from zero to one, and you don’t want to have to set up a random engine and a random distribution. In that case you can write something like this:

```cpp
float random_float_0_1()
{
    static thread_local std::mt19937_64 randomness(random_seed_seq::get_instance());
    static thread_local std::uniform_real_distribution<float> distribution;
    return distribution(randomness);
}
```

And just like that we have easy, fast, high quality floating point numbers. Well, we do if your compiler is GCC. On my machine this last function is slightly faster than the old-school “rand() * (1.0f / RAND_MAX)”: the new function takes 11 ns, and the old-school method takes 14 ns. (measured with Google Benchmark) I attribute most of that to the Mersenne Twister being a very fast random number generator.

When I compiled it with Clang however this new function takes 80ns. Stepping through the assembly generated by both compilers reveals that the problem is that Clang doesn’t inline aggressively enough. There are some calls to compute the logarithm of the upper bound and lower bound in the uniform_real_distribution. GCC inlines those expensive calls away, Clang does not.

Not sure what to do about that last problem: The problem is with how std::uniform_real_distribution is defined: It takes the upper bound and lower bound as runtime arguments. In my code listing above they are the default arguments of 0 and 1, but since Clang doesn’t inline the call, it doesn’t know that they are constants. The only way I see around that is to re-implement std::uniform_real_distribution with constants. But that’s beyond the scope of this blog post.
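To illustrate what baking the bounds in could look like: a float mantissa holds 24 bits, so taking the top 24 bits of the engine output and scaling by 2^-24 gives a uniform float in [0, 1) with the constants visible to the compiler. This is a simplified sketch, not a drop-in replacement for std::uniform_real_distribution:

```cpp
#include <random>

// Simplified sketch of a fixed-range uniform float: take the top 24
// bits of the 64 bit output (a float mantissa holds 24 bits of
// precision) and scale by 2^-24. The bounds are compile-time
// constants, so there is nothing left for the compiler to fail to
// inline. Not a drop-in replacement for std::uniform_real_distribution.
float random_float_0_1_fixed(std::mt19937_64 & engine)
{
    return (engine() >> 40) * (1.0f / 16777216.0f); // 16777216 = 2^24
}
```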

This blog post was only supposed to be about the random_seed_seq. The other code are just examples showing how you could use it. So let’s not worry about the details of std::uniform_real_distribution, and end this by saying that you should probably use random_seed_seq to seed your random number generators.

It’s a tiny class that I find myself needing all the time. Hopefully it will also be useful for you.


Why is that an unfortunate claim? Because I’ll probably have a hard time convincing you that I did speed up sorting by a factor of two. But this should turn out to be quite a lengthy blog post, and all the code is open source for you to try out on whatever your domain is. So I might either convince you with lots of arguments and measurements, or you can just try the algorithm yourself.

Following up from my last blog post, this is of course a version of radix sort. Meaning its complexity is lower than O(n log n). I made two contributions:

- I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for relatively small collections. (starting at 128 elements)
- I generalized in-place radix sort to work on arbitrary sized ints, floats, tuples, structs, vectors, arrays, strings etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

If you just want to try the algorithm, jump ahead to the section “Source Code and Usage.”

To start off with, I will explain how you can build a sorting algorithm that’s O(n). If you have read my last blog post, you can skip this section. If you haven’t, read on:

If you were like me a month ago, you knew for sure that it’s been proven that the fastest possible sorting algorithm has to be O(n log n). There are mathematical proofs that you can’t do better. I believed that until I watched this lecture from the “Introduction to Algorithms” class on MIT OpenCourseWare. There the professor explains that sorting has to be O(n log n) when all you can do is compare items, but if you’re allowed to do more operations than just comparisons, you can make sorting algorithms faster.

I’ll show an example using the counting sort algorithm:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

template<typename It, typename OutIt, typename ExtractKey>
void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key)
{
    size_t counts[256] = {};
    for (It it = begin; it != end; ++it)
    {
        ++counts[extract_key(*it)];
    }
    size_t total = 0;
    for (size_t & count : counts)
    {
        size_t old_count = count;
        count = total;
        total += old_count;
    }
    for (; begin != end; ++begin)
    {
        std::uint8_t key = extract_key(*begin);
        out_begin[counts[key]++] = std::move(*begin);
    }
}
```

This version of the algorithm can only sort unsigned chars. Or rather it can only sort types that can provide a sort key that’s an unsigned char. Otherwise we would index out of range in the first loop. Let me explain how the algorithm works:

We have three arrays and three loops. We have the input array, the output array, and a counting array. In the first loop we fill the counting array by iterating over the input array, counting how often each element shows up.

The second loop turns the counting array into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |
| prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up five times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the key to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the object that you provided, not the key returned from that function, so the elements can be any custom struct. For example you could sort strings by length: just return the size() of the string from extract_key, clamped to at most 255. You could also write a modified version of counting_sort that tells you where the last partition begins, so that you can then sort all long strings using std::sort. (those should be a small subset of all your strings, so the second pass over them should be fast)
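As a sketch of that idea, here is the counting_sort from above used to sort strings by clamped length. The clamping lambda and the sort_by_length wrapper are my illustration, not code from the library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// the counting_sort from above, repeated so this sketch is self-contained
template<typename It, typename OutIt, typename ExtractKey>
void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key)
{
    std::size_t counts[256] = {};
    for (It it = begin; it != end; ++it)
        ++counts[extract_key(*it)];
    std::size_t total = 0;
    for (std::size_t & count : counts)
    {
        std::size_t old_count = count;
        count = total;
        total += old_count;
    }
    for (; begin != end; ++begin)
    {
        std::uint8_t key = extract_key(*begin);
        out_begin[counts[key]++] = std::move(*begin);
    }
}

// sort strings by length, clamping the key to 255 as described above
inline std::vector<std::string> sort_by_length(std::vector<std::string> in)
{
    std::vector<std::string> out(in.size());
    counting_sort(in.begin(), in.end(), out.begin(), [](const std::string & s)
    {
        return std::uint8_t(std::min<std::size_t>(s.size(), 255));
    });
    return out;
}
```

All strings longer than 255 characters land in the final partition, which a second pass with std::sort would then finish.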

The above algorithm stores the sorted elements in a separate array. But it doesn’t take much to get an in-place sorting algorithm for unsigned chars: One thing we could try is that instead of moving the elements, we swap them.

The most obvious problem that we run into with that is that when we swap the first element out of the first spot, the new element probably doesn’t want to be in the first spot. It might want to be at position 10 instead. The solution for that is simple: Keep on swapping the first element until we find an element that actually wants to be in the first spot. Only when that has happened do we move on to the second item in the array.

The second problem that we then run into is that we’ll find a lot of partitions that are already sorted. We may not know however that those are already sorted. Imagine if we have the number 3 two times and it wants to be in positions six and seven. And let’s say that as part of swapping the first element into place, we swap the first 3 to slot six, and the second 3 to slot seven. Now these are sorted and we don’t need to do anything with them any more. But when we advance on from the first element, we will at some point come across the 3 in slot six. And we’ll swap it to spot eight, because that’s the next spot that a 3 would go to. Then we find the next 3 and swap it to spot nine. Then we find the first 3 again and swap it to spot ten etc. This keeps going until we index out of bounds and crash.

The solution for the second problem is to keep a copy of the initial prefix array around so that we can tell when a partition is finished. Then we can skip over those partitions when advancing through the array.

With those two changes we have an in-place sorting algorithm that sorts unsigned chars. This is the American Flag Sort algorithm as described on Wikipedia.
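To make that concrete, here is a minimal sketch of the in-place byte sort. The function name and the details are mine; the actual American Flag Sort implementation differs in how it skips over finished partitions:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// in-place counting sort on one byte, in the spirit of American Flag Sort
template<typename It, typename ExtractKey>
void american_flag_sort(It begin, It end, ExtractKey && extract_key)
{
    std::size_t counts[256] = {};
    for (It it = begin; it != end; ++it)
        ++counts[extract_key(*it)];

    std::size_t offsets[256];     // next free slot in each partition
    std::size_t next_offset[256]; // one past the end of each partition
    std::size_t total = 0;
    for (int i = 0; i < 256; ++i)
    {
        offsets[i] = total;
        total += counts[i];
        next_offset[i] = total;
    }

    for (int i = 0; i < 256; ++i)
    {
        // keep swapping the element at the front of partition i until an
        // element that belongs there shows up, then advance
        while (offsets[i] != next_offset[i])
        {
            std::uint8_t key = extract_key(begin[offsets[i]]);
            if (key == i)
                ++offsets[i]; // already in the right partition
            else
                std::iter_swap(begin + offsets[i], begin + offsets[key]++);
        }
    }
}
```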

Radix sort takes the above algorithm, and generalizes it to integers that don’t fit into a single unsigned char. The in-place version actually uses a fairly simple trick: Sort one byte at a time. First sort on the highest byte. That will split the input into 256 partitions. Now recursively sort within each of those partitions using the next byte. Keep doing that until you run out of bytes.

If you do the math on that you will find that for a four byte integer you get 256^3 recursive calls: We subdivide into 256 partitions then recurse, subdivide each of those into 256 partitions and recurse again and then subdivide each of the smaller partitions into 256 partitions again and recurse a final time. If we actually did all of those recursions this would be a very slow algorithm. The way to get around that problem is to stop recursing when the number of items in a partition is less than some magic number, and to use std::sort within that sub-partition instead. In my case I stop recursing when a partition is less than 128 elements in size. When I have split an array into partitions that have less than that many elements, I call std::sort within these partitions.
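To make the recursion concrete, here is a sketch for four byte unsigned integers. To keep it short, the per-byte pass is a comparison sort on the current byte instead of the in-place byte sort, but the recursion structure and the fallback below 128 elements work the same way:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// byte_index 0 is the most significant byte of a four byte integer
inline std::uint8_t nth_byte(std::uint32_t value, int byte_index)
{
    return std::uint8_t(value >> (8 * (3 - byte_index)));
}

inline void radix_sort_recurse(std::uint32_t * begin, std::uint32_t * end, int byte_index)
{
    // stop recursing on small partitions or when we run out of bytes
    if (end - begin < 128 || byte_index == 4)
    {
        std::sort(begin, end);
        return;
    }
    // group elements by the current byte (stand-in for the in-place byte sort)
    std::sort(begin, end, [byte_index](std::uint32_t a, std::uint32_t b)
    {
        return nth_byte(a, byte_index) < nth_byte(b, byte_index);
    });
    // recurse into each of the resulting partitions using the next byte
    for (std::uint32_t * it = begin; it != end;)
    {
        std::uint8_t current = nth_byte(*it, byte_index);
        std::uint32_t * partition_end = it;
        while (partition_end != end && nth_byte(*partition_end, byte_index) == current)
            ++partition_end;
        radix_sort_recurse(it, partition_end, byte_index + 1);
        it = partition_end;
    }
}
```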

If you’re curious: The reason why the threshold is at 128 is that I’m splitting the input into 256 partitions. If the number of partitions is k, then the complexity of sorting on a single byte is O(n+k). The point where radix sort gets faster than std::sort is when the loop that depends on n starts to dominate over the loop that depends on k. In my implementation that’s somewhere around 0.5k. It’s not easy to move it much lower than that. (I have some ideas, but nothing has worked yet)

It should be clear that the algorithm described in the last section works for unsigned integers of any size. But it also works for collections of unsigned integers, (including pairs and tuples) and strings. Just sort by the first element, then by the next, then by the next etc. until the partition sizes are small enough. (as a matter of fact the paper that Wikipedia names as the source for its American Flag Sort article intended the algorithm as a sorting algorithm for strings)

It’s also straightforward to generalize this to work on signed integers: Just shift all the values up into the range of the unsigned integer of the same size. Meaning for an int16_t, just cast to uint16_t and add 32768.
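In code, the conversion for an int16_t might look like this (the function name is mine; the library’s actual conversion machinery is more general):

```cpp
#include <cassert>
#include <cstdint>

// shift the signed range [-32768, 32767] up to the unsigned range [0, 65535],
// so that unsigned radix sort orders the values correctly
inline std::uint16_t to_unsigned_key(std::int16_t value)
{
    return std::uint16_t(std::uint16_t(value) + 32768u);
}
```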

Michael Herf has also discovered a good way to generalize this to floating point numbers: Reinterpret cast the float to a uint32, then flip every bit if the float was negative, but flip only the sign bit if the float was positive. The same trick works for doubles and uint64s. Michael Herf explains why this works in the linked piece, but the short version of it is this: Positive floating point numbers already sort correctly if we just reinterpret cast them to a uint32. The exponent comes before the mantissa, so we would sort by the exponent first, then by the mantissa. Everything works out. Negative floating point numbers however would sort the wrong way. Flipping all the bits on them fixes that. The final remaining problem is that positive floating point numbers need to sort as bigger than negative numbers, and the easiest way to do that is to flip the sign bit since it’s the most significant bit.
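A sketch of Michael Herf’s trick (the function name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// map a float to a uint32 that sorts in the same order as the float
inline std::uint32_t float_to_key(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u)); // safe way to reinterpret the bits
    if (u & 0x80000000u)
        return ~u;                  // negative: flip every bit
    else
        return u | 0x80000000u;     // positive: flip only the sign bit
}
```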

Of the fundamental types that leaves only booleans and the various char types. Chars can just be reinterpret_casted to the unsigned type of the same size. Booleans could also be turned into an unsigned char, but we can also use a custom, more efficient algorithm for booleans: Just use std::partition instead of the normal sorting algorithm. And if we need to recurse because we’re sorting on more than one key, we can recurse into each of the two partitions.
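A sketch of that boolean case, assuming false should sort before true:

```cpp
#include <algorithm>
#include <cassert>

// "sort" a range of bools by partitioning: all false values first, then all
// true values; returns where the true partition starts, so a caller that
// needs to sort on further keys can recurse into both halves
template<typename It>
It partition_bools(It begin, It end)
{
    return std::partition(begin, end, [](bool b) { return !b; });
}
```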

And just like that we have generalized in-place radix sort to all types. Now all it takes is a bunch of template magic to make the code do the right thing for each case. I’ll spare you the details of that. It wasn’t fun.

The brief recap of the sorting algorithm for sorting one byte is:

- Count elements and build the prefix sum that tells us where to put the elements
- Swap the first element into place until we find an item that wants to be in the first position (according to the prefix sum)
- Repeat step 2 for all positions

I have implemented this sorting algorithm using Timo Bingmann’s Sound of Sorting. Here is what it looks (and sounds) like:

As you can see from the video, the algorithm spends most of its time on the first couple elements. Sometimes the array is mostly sorted by the time that the algorithm advances forward from the first item. What you can’t see in the video is the prefix sum array that’s built on the side. Visualizing that would make the algorithm more understandable, (it would make clear how the algorithm can know the final position of elements to swap them directly there) but I haven’t done the work of visualizing that.

If we want to sort multiple bytes we recurse into each of the 256 partitions and do a sort within those using the next byte. But that’s not the slow part of this. The slow part is step 2 and step 3.

If you profile this you will find that this is spending all of its time on the swapping. At first I thought that that was because of cache misses. Usually when the line of assembly that’s taking a lot of time is dereferencing a pointer, that’s a cache miss. I’ll explain what the real problem was further down, but even though my intuition was wrong it drove me towards a good speed up: If we have a cache miss on the first element, why not try swapping the second element into place while waiting for the cache miss on the first one?

I already have to keep information about which elements are done swapping, so that I can skip over them. So what I do is iterate over all elements that have not yet been swapped into place, and swap each of them into place. In one pass over the array, this will swap at least half of all elements into place. To see why, let’s walk through the list { 4, 3, 1, 2 }: We look at the first element, the 4, and swap it with the 2 at the end, giving us { 2, 3, 1, 4 }. Then we look at the second element, the 3, and swap it with the 1, giving us { 2, 1, 3, 4 }. At that point we have iterated half-way through the list and find that all the remaining elements are sorted. (we can tell because the offset stored in the prefix sum array equals the initial offset of the next partition) So the pass is over, but our list is not sorted yet. The solution is that when we get to the end of the list, we just start over from the beginning, again swapping all unsorted elements into place. In this case we only need to swap the 2 into place to get { 1, 2, 3, 4 }, at which point we know that all partitions are sorted and we can stop.

In Sound of Sorting that looks like this:

This is what the above algorithm looks like in code:

struct PartitionInfo
{
    PartitionInfo()
        : count(0)
    {
    }

    union
    {
        size_t count;
        size_t offset;
    };
    size_t next_offset;
};

template<typename It, typename ExtractKey>
void ska_byte_sort(It begin, It end, ExtractKey & extract_key)
{
    PartitionInfo partitions[256];
    for (It it = begin; it != end; ++it)
    {
        ++partitions[extract_key(*it)].count;
    }
    uint8_t remaining_partitions[256];
    size_t total = 0;
    int num_partitions = 0;
    for (int i = 0; i < 256; ++i)
    {
        size_t count = partitions[i].count;
        if (count)
        {
            partitions[i].offset = total;
            total += count;
            remaining_partitions[num_partitions] = i;
            ++num_partitions;
        }
        partitions[i].next_offset = total;
    }
    for (uint8_t * last_remaining = remaining_partitions + num_partitions,
                 * end_partition = remaining_partitions + 1;
         last_remaining > end_partition;)
    {
        last_remaining = custom_std_partition(remaining_partitions, last_remaining, [&](uint8_t partition)
        {
            size_t & begin_offset = partitions[partition].offset;
            size_t & end_offset = partitions[partition].next_offset;
            if (begin_offset == end_offset)
                return false;

            unroll_loop_four_times(begin + begin_offset, end_offset - begin_offset,
                                   [partitions = partitions, begin, &extract_key](It it)
            {
                uint8_t this_partition = extract_key(*it);
                size_t offset = partitions[this_partition].offset++;
                std::iter_swap(it, begin + offset);
            });
            return begin_offset != end_offset;
        });
    }
}

The algorithm starts off similar to counting sort above: I count how many items fall into each partition. But I changed the second loop: In the second loop I build an array of indices into all the partitions that have at least one element in them. I need this because I need some way to keep track of all the partitions that have not been finished yet. Also I store the end index for each partition in the next_offset variable. That will allow me to check whether a partition is finished sorting.

The third loop is much more complicated than counting sort. It’s three nested loops, and only the outermost is a normal for loop:

The outer loop iterates over all of the remaining unsorted partitions. It stops when there is only one unsorted partition remaining. That last partition does not need to be sorted if all other partitions are already sorted. This is an important optimization because the case where all elements fall into only one partition is quite common: When sorting four byte integers, if all integers are small, then in the first call to this function, which sorts on the highest byte, all of the keys will have the same value and will fall into one partition. In that case this algorithm will immediately recurse to the next byte.

The middle loop uses std::partition to remove finished partitions from the list of remaining partitions. I use a custom version of std::partition because std::partition will unroll its internal loop, and I do not want that. I need the innermost loop to be unrolled instead. But the behavior of custom_std_partition is identical to that of std::partition. What this loop means is that if the items fall into partitions of different sizes, say for the input sequence { 3, 3, 3, 3, 2, 5, 1, 4, 5, 5, 3, 3, 5, 3, 3 } where the partitions for 3 and 5 are larger than the other partitions, this will very quickly finish the partitions for 1, 2 and 4, and then after that the outer loop and inner loop only have to iterate over the partitions for 3 and 5. You might think that I could use std::remove_if here instead of std::partition, but I need this to be non-destructive, because I will need the same list of partitions when making recursive calls. (not shown in this code listing)

The innermost loop finally swaps elements. It just iterates over all remaining unsorted elements in a partition and swaps them into their final position. This would be a normal for loop, except I need this loop unrolled to get fast speeds. So I wrote a function called “unroll_loop_four_times” that takes an iterator and a loop count and then unrolls the loop.
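For reference, here is a sketch of what such an unrolling helper might look like. The details may differ from the version in my library:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// call to_call on iteration_count consecutive iterators, four at a time
template<typename It, typename Func>
void unroll_loop_four_times(It begin, std::size_t iteration_count, Func && to_call)
{
    std::size_t full_groups = iteration_count / 4;
    std::size_t remainder = iteration_count % 4;
    for (; full_groups > 0; --full_groups)
    {
        // four calls per iteration give the CPU more independent work
        to_call(begin);
        to_call(begin + 1);
        to_call(begin + 2);
        to_call(begin + 3);
        begin += 4;
    }
    for (; remainder > 0; --remainder, ++begin)
        to_call(begin);
}
```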

This new algorithm was immediately much faster than American Flag Sort. Which made sense because I thought I had tricked the cache misses. But as soon as I profiled this I noticed that this new sorting algorithm actually had slightly more cache misses. It also had more branch mispredictions. It also executed more instructions. But somehow it took less time. This was quite puzzling so I profiled it whichever way I could. For example I ran it in Valgrind to see what Valgrind thought should be happening. In Valgrind this new algorithm was actually slower than American Flag Sort. That makes sense: Valgrind is just a simulator, so something that executes more instructions, has slightly more cache misses and slightly more branch mispredictions would be slower. But why would it be faster running on real hardware?

It took me more than a day of staring at profiling numbers before I realized why this was faster: It has better instruction level parallelism. You couldn’t have invented this algorithm on old computers because it would have been slower on old computers. The big problem with American Flag Sort is that it has to wait for the current swap to finish before it can start on the next swap. It doesn’t matter that there is no cache-miss: Modern CPUs could execute several swaps at once if only they didn’t have to wait for the previous one to finish. Unrolling the inner loop also helps to ensure this. Modern CPUs are amazing, so they could actually run several loops in parallel even without loop unrolling, but the loop unrolling helps.

The Linux perf command has a metric called “instructions per cycle” which measures instruction level parallelism. In American Flag Sort my CPU achieves 1.61 instructions per cycle. In this new sorting algorithm it achieves 2.24 instructions per cycle. It doesn’t matter if you have to do a few instructions more, if you can do 40% more at a time.

And the thing about cache misses and branch mispredictions turned out to be a red herring: The numbers for those are actually very low for both algorithms. So the slight increase that I saw was a slight increase to a low number. Since there are only 256 possible insertion points, chances are that a good portion of them are always going to be in the cache. And for many real world inputs the number of possible insertion points will actually be much lower. For example when sorting strings, you usually get less than thirty because we simply don’t use that many different characters.

All that being said, for small collections American Flag Sort is faster. The instruction level parallelism really seems to kick in at collections of more than a thousand elements. So my final sort algorithm actually looks at the number of elements in the collection, and if it’s less than 128 I call std::sort, if it’s less than 1024 I call American Flag Sort, and if it’s more than that I run my new sorting algorithm.
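The resulting dispatch looks roughly like this. The two algorithm bodies here are stand-ins that just call std::sort so the sketch is self-contained; in the real library they are the algorithms described above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// stand-ins so the dispatch logic compiles on its own; in ska_sort these
// are the actual American Flag Sort and Ska Byte Sort inner loops
template<typename It, typename ExtractKey>
void american_flag_sort(It begin, It end, ExtractKey &&) { std::sort(begin, end); }
template<typename It, typename ExtractKey>
void ska_byte_sort(It begin, It end, ExtractKey &&) { std::sort(begin, end); }

// pick an algorithm based on the number of elements, as described above
template<typename It, typename ExtractKey>
void sort_on_byte(It begin, It end, ExtractKey && extract_key)
{
    std::ptrdiff_t num_elements = end - begin;
    if (num_elements < 128)
        std::sort(begin, end);   // comparison sort wins on tiny ranges
    else if (num_elements < 1024)
        american_flag_sort(begin, end, extract_key);
    else
        ska_byte_sort(begin, end, extract_key);
}
```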

std::sort is actually a similar combination, mixing quick sort, insertion sort and heap sort, so in a sense those are also part of my algorithm. If I tried hard enough, I could construct an input sequence that actually uses all of these sorting algorithms. That input sequence would be my very worst case: I would have to trigger the worst case behavior of radix sort so that my algorithm falls back to std::sort, and then I would also have to trigger the worst case behavior of quick sort so that std::sort falls back to heap sort. So let’s talk about worst cases and best cases.

The best case for my implementation of radix sort is if the inputs fit in few partitions. For example if I have a thousand items and they all fall into only three partitions, (say I just have the number 1 a hundred times, the number 2 four hundred times, and the number 3 five hundred times) then my outer loops do very little and my inner loop can swap everything into place in nice long uninterrupted runs.

My other best case is on already sorted sequences: In that case I iterate over the data exactly twice, once to look at each item, and once to swap each item with itself.

The worst case for my implementation can only be reached when sorting variable sized data, like strings. For fixed size keys like integers or floats, I don’t think there is a really bad case for my algorithm. One way to construct the worst case is to sort the strings “a”, “ab”, “abc”, “abcd”, “abcde”, “abcdef” etc. Since radix sort looks at one byte at a time, and that byte only allows it to split off one item, this would take O(n^2) time. My implementation detects this by recording how many recursive calls there were. If there are too many, I fall back to std::sort. Depending on your implementation of quick sort, this could also be the worst case for quick sort, in which case std::sort falls back to heap sort. I debugged this briefly and it seemed like std::sort did not fall back to heap sort for my test case. The reason for that is that my test case was sorted data and std::sort seems to use the median-of-three rule for pivot selection, which selects a good pivot on already sorted sequences. Knowing that, it’s probably possible to create sequences that hit the worst case both for my algorithm and for the quick sort used in std::sort, in which case the algorithm would fall back to heap sort. But I haven’t attempted to construct such a sequence.

I don’t know how common this case is in the real world, but one trick I took from the boost implementation of radix sort is that I skip over common prefixes. So if you’re sorting log messages and you have a lot of messages that start with “warning:” or “error:” then my implementation of radix sort would first sort those into separate partitions, and then within each of those partitions it would skip over the common prefix and continue sorting at the first differing character. That behavior should help reduce how often we hit the worst case.
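A sketch of the prefix-skipping helper (the name and signature are mine, and it assumes a non-empty range):

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// find how many leading characters all strings in [begin, end) share,
// starting the comparison at start_index; sorting can then continue at
// the first column where the strings differ
template<typename It>
std::size_t common_prefix_length(It begin, It end, std::size_t start_index)
{
    for (std::size_t prefix = start_index;; ++prefix)
    {
        if (begin->size() <= prefix)
            return prefix; // the first string ends here
        char expected = (*begin)[prefix];
        for (It it = std::next(begin); it != end; ++it)
        {
            if (it->size() <= prefix || (*it)[prefix] != expected)
                return prefix; // found a string that differs at this column
        }
    }
}
```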

Currently I fall back to std::sort if my code has to recurse more than sixteen times. I picked that number because that was the first power of two for which the worst case detection did not trigger when sorting some log files on my computer.

The sorting algorithm that I provide as a library is called “Ska Sort”, because I’m not going to come up with new algorithms very often in my lifetime, so I might as well put my name on one when I do. The improved algorithm for sorting bytes that I described above in the sections “Optimizing the Inner Loop” and “Implementation Details” is only a small part of that. That algorithm is called “Ska Byte Sort”.

In summary, Ska Sort:

- Is an in-place radix sort algorithm
- Sorts one byte at a time (into 256 partitions)
- Falls back to std::sort if a collection contains less than some threshold of items (currently 128)
- Uses the inner loop of American Flag Sort if a collection contains less than a larger threshold of items (currently 1024)
- Uses Ska Byte Sort if the collection is larger than that
- Calls itself recursively on each of the 256 partitions using the next byte as the sort key
- Falls back to std::sort if it recurses too many times (currently 16 times)
- Uses std::partition to sort booleans
- Automatically converts signed integers, floats and char types to the correct unsigned integer type
- Automatically deals with pairs, tuples, strings, vectors and arrays by sorting one element at a time
- Skips over common prefixes of collections. (for example when sorting strings)
- Provides two customization points to extract the sort key from an object: A function object that can be passed to the algorithm, or a function called to_radix_sort_key() that can be placed in the namespace of your type

So Ska Sort is a complicated algorithm. Certainly more complicated than a simple quick sort. One of the reasons for this is that in Ska Sort, I have a lot more information about the types that I’m sorting. In comparison based sorting algorithms all I have is a comparison function that returns a bool. In Ska Sort I can know that “for this collection, I first have to sort on a boolean, then on a float” and I can write custom code for both of those cases. In fact I often need custom code: The code that sorts tuples has to be different from the code that sorts strings. Sure, they have the same inner loop, but they both need to do different work to get to that inner loop. In comparison based sorting you get the same code for all types.

If you’ve got enough time on your hands that you clicked on the pieces I linked above, you will find that there are two optimizations that are considered important in my sources that I didn’t do.

The first is that the piece that talks about sorting floating point numbers sorts 11 bits at a time, instead of one byte at a time. Meaning it subdivides the range into 2048 partitions instead of 256 partitions. The benefit of this is that you can sort a four byte integer (or a four byte float) in three passes instead of four passes. I tried this in my last blog post and found it to only be faster for a few cases. In most cases it was slower than sorting one byte at a time. It’s probably worth trying that trick again for in-place radix sort, but I didn’t do that.

The second is that the American Flag Sort paper talks about managing recursions manually. Instead of making recursive calls, they keep a stack of all the partitions that still need to be sorted. Then they loop until that stack is empty. I didn’t attempt this optimization because my code is already far too complex. This optimization is easier to do when you only have to sort strings because you always use the same function to extract the current byte. But if you can sort ints, floats, tuples, vectors, strings and more, this is complicated.

Finally we get to how fast this algorithm actually is. Since my last blog post I’ve changed how I calculate these numbers. In my last blog post I actually made a big mistake: I measured how long it takes to set up my test data and to then sort it. The problem with that is that the set up can actually be a significant portion of the time. So this time I also measure the set up separately and subtract that time from the measurements so that I’m left with only the time it takes to actually sort the data. With that let’s get to our first measurement: Sorting integers: (generated using std::uniform_int_distribution)

This graph shows how long it takes to sort various numbers of items. I didn’t mention ska_sort_copy before, but it’s essentially the algorithm from my last blog post, except that I changed it so that it falls back to ska_sort instead of falling back to std::sort. (ska_sort may still decide to fall back to std::sort of course)

One problem I have with this graph is that even though I made the scale logarithmic, it’s still very difficult to see what’s going on. Last time I added another line at the bottom that showed the relative scale, but this time I have a better approach. Instead of a logarithmic scale, I can divide the total time by the number of items, so that I get the time that the sort algorithm spends per item:

With this visualization, we can see much more clearly what’s going on. All pictures below use “nanoseconds per item” as scale, like in this graph. Let’s analyze this graph a little:

For the first couple items we see that the lines are essentially the same. That’s because for less than 128 elements, I fall back to std::sort. So you would expect all of the lines to be exactly the same. Any difference in that area is measurement noise.

Then past that we see that std::sort is exactly an O(n log n) sorting algorithm. It goes up linearly when we divide the time by the number of items, which is exactly what you’d expect for O(n log n). It’s actually impressive how it forms an exactly straight line once we’re past a small number of items. ska_sort_copy is truly an O(n) sorting algorithm: The cost per item stays mostly constant as the total number of items increases. But ska_sort is… more complicated.

Those waves that we’re seeing in the ska_sort line have to do with the number of recursive calls: ska_sort is fastest when the number of items is large. That’s why the line starts off as decreasing. But then at some point we have to recurse into a bunch of partitions that are just over 128 items in size, which is slow. Then those partitions grow as the number of items increase and the algorithm is faster again, until we get to a point where the partitions are over 128 elements in size again, and we need to add another recursive step. One way to visualize this is to look at the graph of sorting a collection of int8_t:

As you can see the cost per item goes down dramatically at the beginning. Every time that the algorithm has to recurse into other partitions, we see that initial part of the curve overlaid, giving us the waves of the graph for sorting ints.

One point I made above is that ska_sort is fastest when there are few partitions to sort elements into. So let’s see what happens when we use a std::geometric_distribution instead of a std::uniform_int_distribution:

This graph is sorting four byte ints again, so you would expect to see the same “waves” that we saw in the uniformly distributed ints. I’m using a std::geometric_distribution with 0.001 as the constructor argument. Which means it generates numbers from 0 to roughly 18000, but most numbers will be close to zero. (in theory it can generate numbers that are much bigger, but 18882 is the biggest number I measured when generating the above data) And since most numbers are close to zero, we will see few recursions and because of that we see few waves, making this many times faster than std::sort.
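If you want to reproduce this, the test data was generated roughly like this (the helper name and the seed handling are mine):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// geometrically distributed ints, most of them close to zero, as described
// above; with p = 0.001 the values typically stay below roughly 18000
std::vector<int> geometric_test_data(std::size_t count, unsigned seed)
{
    std::mt19937 randomness(seed);
    std::geometric_distribution<int> distribution(0.001);
    std::vector<int> result(count);
    for (int & value : result)
        value = distribution(randomness);
    return result;
}
```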

By the way, that bump at the beginning is surprising to me. For all other data that I could find, ska_sort starts to beat std::sort at 128 items. Here it seems like ska_sort only starts to win later. I don’t know why that is. I might investigate it at some point, but I don’t want to change the threshold, because this is a good number for all other data. Changing the threshold would move all the other lines up by a little. Also since we’re sorting few items here, the difference in absolute terms is not that big: 15.8 microseconds to 16.7 microseconds for 128 items, and 32.3 microseconds to 32.9 microseconds for 256 items.

Let’s look at some more use cases. Here is my “real world” use case that I talked about in the last blog post, where I had to sort enemies in a game by distance to the player. But I wanted all enemies that are currently in combat to come first, sorted by distance, followed by all enemies that are not in combat, also sorted by distance. So I sort by a std::pair:

This turned out to be the same graph as sorting ints, except every line is shifted up by a bit. Which I guess I should have expected. But it’s good to see that the conversion trick that I have to do for floats and the splitting I have to do for pairs does not add significant overhead. A more interesting graph is the one for sorting int64s:

This is the point where ska_sort_copy is sometimes slower than ska_sort. I actually decided to lower the threshold at which ska_sort_copy falls back to ska_sort: It will now only do the copying radix sort when it has to do fewer than eight iterations over the input data. Meaning I have changed the code so that for int64s, ska_sort_copy actually just calls ska_sort. Based on the above graph you might argue that it should still do the copying radix sort, but here is a measurement of sorting a 128 byte struct that has an int64 as a sort key:

As the structs get larger, ska_sort_copy gets slower. Because of this I decided to make ska_sort_copy fall back to ska_sort for sort keys of this size.

One other thing to notice from the above graph is that it looks like std::sort and ska_sort get closer. So does ska_sort ever become slower? It doesn’t look like it. Here’s what it looks like when I sort a 1024 byte struct:

Once again this is a very interesting graph. I wish I could spend time investigating where that large gap at the end comes from. It’s not measurement noise: it’s reproducible. The way I build these graphs is that I run Google Benchmark thirty times to reduce the chance of random variation.

Talking about large data, in my last blog post my worst case was sorting a struct that has a 256 byte sort key. Which in this case means using a std::array as a sort key. This was very slow on copying radix sort because we actually have to do 256 passes over the data. In-place radix sort only has to look at enough bytes until it’s able to tell two pieces of data apart, so it might be faster. And looking at benchmarks, it seems like it is:

ska_sort_copy will fall back to ska_sort for this input, so its graph will look identical. So I fixed the worst case from my last blog post. One thing that I couldn’t profile in my last blog post was the sorting of strings, because ska_sort_copy simply cannot sort strings: it cannot handle variable sized data.

So let’s look at what happens when I’m sorting strings:

The way I build the input data here is that I take between one and three random words from my words file and concatenate them. Once again I am very happy to see how well my algorithm does. But this was to be expected: It was already known that radix sort is great for sorting strings.

But sorting strings is also when I can hit my worst case. In theory you might get cases where you have to do many passes over the data, because there simply are a lot of bytes in the input data and a lot of them are similar. So I tried what happens when I sort strings of different length, concatenating between zero and ten words from my words file:

What we see here is that ska_sort seems to become an O(n log n) algorithm when sorting millions of long strings. However it doesn’t get slower than std::sort. My best guess for the curve going up like that is that ska_sort has to do a lot of recursions on this data. It doesn’t do enough recursions to trigger my worst case detection, but those recursions are still expensive because they require one more pass over the data.

One thing I tried was lowering my recursion limit to eight, in which case I do hit my worst case detection starting at a million items. But the graph looks essentially unchanged in that case. The reason is that it’s a false positive: I didn’t actually hit my worst case. The sorting algorithm still succeeded at splitting the data into many smaller partitions, so when I fall back to std::sort, it has a much easier time than it would have had sorting the whole range.

Finally, here is what it looks like when I sort containers that are slightly more complicated than strings:

For this I generate vectors with between 0 and 20 ints in them. So I’m sorting a vector of vectors. That spike at the end is very interesting. My detection for too many recursive calls does not trigger here, so I’m not sure why sorting gets so much more expensive. Maybe my CPU just doesn’t like dealing with this much data. But I’m happy to report that ska_sort is faster than std::sort throughout, as in all the other graphs.

Since ska_sort seems to always be faster, I also generated input data that intentionally triggers the worst case for ska_sort. The below graph hits the worst case immediately starting at 128 elements. But ska_sort detects that and falls back to std::sort:


For this I’m sorting random combinations of the vectors {}, { 0 }, { 0, 1 }, { 0, 1, 2 }, … { 0, 1, 2, …, 126, 127 }. Since each element only tells my algorithm how to split off 1/128th of the input data, it would have to recurse 128 times. But at the sixteenth recursion ska_sort gives up and falls back to std::sort. In the above graph you see how much overhead that is. The overhead is bigger than I like, especially for large collections, but for smaller collections it seems to be very low. I’m not happy that this overhead exists, but I’m happy that ska_sort detects the worst case and at least doesn’t go O(n^2).

Ska_sort isn’t perfect and it has problems. Still, I believe that it will be faster than std::sort for nearly all data, and that it should almost always be preferred over std::sort.

The biggest problem it has is the complexity of the code. Especially the template magic to recursively sort on consecutive bytes. So for example currently when sorting on a std::pair<int, int> this will instantiate the sorting algorithm eight times, because there will be eight different functions for extracting a byte out of this data. I can think of ways to reduce that number, but they might be associated with runtime overhead. This needs more investigation, but the complexity of the code is also making these kinds of changes difficult. For now you can get slow compile times with this if your sort key is complex. The easiest way to get around that is to try to use a simpler sort key.

Another problem is that I’m not sure what to do for data that I can’t sort. For example this algorithm can not sort a vector of std::sets. The reason is that std::set does not have random access operators, and I need random access when sorting on one element at a time. I could write code that allows me to sort std::sets by using std::advance on iterators, but it might be slow. Alternatively I could also fall back to std::sort. Right now I do neither: I simply give a compiler error. The reason for that is that I provide a customization point, a function called to_radix_sort_key(), that allows you to write custom code to turn your structs into sortable data. If I did an automatic fallback whenever I can’t sort something, using that customization point would be more annoying: Right now you get an error message when you need to provide it, and when you have provided it, the error goes away. If I fall back to std::sort for data that I can’t sort, your only feedback would be that sorting is slightly slower. You would have to either profile this and compare it to std::sort, or you would have to step through the sorting function to be sure that it actually uses your implementation of to_radix_sort_key(). So for now I decided on giving an error message when I can’t sort a type. And then you can decide whether you want to implement to_radix_sort_key() or whether you want to use std::sort.

Another problem is that right now there can only be one sorting behavior per type. You have to provide me with a sort key, and if you provide me with an integer, I will sort your data in increasing order. If you want it in decreasing order, there is currently no easy interface for that. For integers you could solve this by flipping the sign in your key function, so this might not be too bad. But it gets more difficult for strings: If you provide me a string then I will sort it, case sensitive, in increasing order. There is currently no way to do a case-insensitive sort for strings. (or maybe you want number-aware sorting so that “bar100” comes after “bar99”; that also doesn’t work right now) I think this is a solvable problem, I just haven’t done the work yet. Since the interface of this sorting algorithm works differently from existing sorting algorithms, I have to invent new customization points.

I have uploaded the code for this to github. It’s licensed under the boost license.

The interface works slightly differently from other sorting algorithms. Instead of providing a comparison function, you provide a function which returns the sort key that the sorting algorithm uses to sort your data. For example let’s say you have a vector of enemies, and you want to sort them by distance to the player. But you want all enemies that are in combat with the player to come first, sorted by distance, and then all enemies that are not in combat, also sorted by distance. The way to do that in a classic sorting algorithm would be like this:

```cpp
std::sort(enemies.begin(), enemies.end(), [](const Enemy & lhs, const Enemy & rhs)
{
    return std::make_tuple(!is_in_combat(lhs), distance_to_player(lhs))
         < std::make_tuple(!is_in_combat(rhs), distance_to_player(rhs));
});
```

In ska_sort, you would do this instead:

```cpp
ska_sort(enemies.begin(), enemies.end(), [](const Enemy & enemy)
{
    return std::make_tuple(!is_in_combat(enemy), distance_to_player(enemy));
});
```

As you can see the transformation is fairly straightforward. Similarly let’s say you have a bunch of people and you want to sort them by last name, then first name. You could do this:

```cpp
ska_sort(contacts.begin(), contacts.end(), [](const Contact & c)
{
    return std::tie(c.last_name, c.first_name);
});
```

It is important that I use std::tie here, because presumably last_name and first_name are strings, and you don’t want to copy those. std::tie will capture them by reference.

Oh and of course if you just have a vector of simple types, you can just sort them directly:

```cpp
ska_sort(durations.begin(), durations.end());
```

In this I assume that “durations” is a vector of doubles, and you might want to sort them to find the median, 90th percentile, 99th percentile etc. Since ska_sort can already sort doubles, no custom code is required.

There is one final case and that is when sorting a collection of custom types. ska_sort only takes a single customization function, but what do you do if you have a custom type that’s nested? In that case my algorithm would have to recurse into the top-level type and would then come across a type that it doesn’t understand. When this happens you will get an error message about a missing overload for to_radix_sort_key(). What you have to do is provide an implementation of the function to_radix_sort_key() that can be found using ADL for your custom type:

```cpp
struct CustomInt
{
    int i;
};
int to_radix_sort_key(const CustomInt & i)
{
    return i.i;
}

// ... later somewhere
std::vector<std::vector<CustomInt>> collections = ...;
ska_sort(collections.begin(), collections.end());
```

In this case ska_sort will call to_radix_sort_key() for the nested CustomInts. You have to do this because there is no efficient way to provide a custom extract_key function at the top level. (at the top level you would have to convert the std::vector<CustomInt> to a std::vector<int>, and that requires a copy)

Finally I also provide a copying sort function, ska_sort_copy, which will be much faster for small keys. To use it you need to provide a second buffer that’s the same size as the input buffer. The return value of the function will then tell you whether the final sorted sequence is in the second buffer (the function returns true) or in the first buffer (the function returns false).

```cpp
std::vector<int> temp_buffer(to_sort.size());
if (ska_sort_copy(to_sort.begin(), to_sort.end(), temp_buffer.begin()))
    to_sort.swap(temp_buffer);
```

In this code I allocate a temp buffer, and if the function tells me that the result ended up in the temp buffer, I swap it with the input buffer. Depending on your use case you might not have to do a swap. And to make this fast you wouldn’t want to allocate a temp buffer just for the sorting. You’d want to re-use that buffer.

I’ve talked to a few people about this, and the usual questions I get are all related to people not believing that this is actually faster.

Q: Isn’t Radix Sort O(n+m) where m is large so that it’s actually slower than a O(n log n) algorithm? (or alternatively: Isn’t radix sort O(n*m) where m is larger than log n?)

A: Yes, radix sort has large constant factors, but in my benchmarks it starts to beat std::sort at 128 elements. And if you have a large collection, say a thousand elements, radix sort is a very clear winner.

Q: Doesn’t Radix Sort degrade to a O(n log n) algorithm? (or alternatively: Isn’t the worst case of Radix Sort O(n log n) or maybe even O(n^2)?)

A: In a sense Radix Sort has to do log(n) passes over the data. When sorting an int16, you have to do two passes over the data. When sorting an int32, you have to do four passes over the data. When sorting an int64 you have to do eight passes etc. However this is not O(n log n) because this is a constant factor that’s independent of the number of elements. If I sort a thousand int32s, I have to do four passes over that data. If I sort a million int32s, I still have to do four passes over that data. The amount of work grows linearly. And if the ints are all different in the first byte, I don’t even have to do the second, third or fourth pass. I only have to do enough passes until I can tell them all apart.

So the worst case for radix sort is O(n*b) where b is the number of bytes that I have to read until I can tell all the elements apart. If you make me sort a lot of long strings, then the number of bytes can be quite large and radix sort may be slow. That is the “worst case” graph above. If you have data where radix sort is slower than std::sort (something that I couldn’t find except when intentionally creating bad data) please let me know. I would be interested to see if we can find some optimizations for those cases. When I tried to build more plausible strings, ska_sort was always clearly faster.

And if you’re sorting something fixed size, like floats, then there simply is no accidental worst case. You are limited by the number of bytes and you will do at most four passes over the data.

Q: If those performance graphs were true, we’d be radix sorting everything.

A: They are true. Not sure what to tell you. The code is on github, so try it for yourself. And yes, I do expect that we will be radix sorting everything. I honestly don’t know why everybody settled on Quick Sort back in the day.

There are a couple obvious improvements that I may make to the algorithm. The algorithm is currently in a good state, but if I ever feel like working on this again, here are three things that I might do:

As I said in the problems section, there is currently no way to sort strings case-insensitive. Adding that specific feature is not too difficult, but you’d want some kind of generic way to customize sorting behavior. Currently all you can do is provide a custom sort key. But you can not change how the algorithm uses that sort key. You always get items sorted in increasing order by looking at one byte at a time.

When I fall back to std::sort, I re-start sorting from the beginning. As I said above I fall back to std::sort when I have split the input into partitions of less than 128 items. But let’s say that one of those partitions is all the strings starting with “warning:” and one partition is all the strings starting with “error:” then when I fall back to std::sort, I could skip the common prefix. I have the information of how many bytes are already sorted. I suspect that the fact that std::sort has to start over from the beginning is the reason why the lines in the graph for sorting strings are so parallel between ska_sort and std::sort. Making this optimization might make the std::sort fallback much faster.

I might also want to write a function that can either take a comparison function, or an extract_key function. The way it would work is that if you pass a function object that takes two arguments, this uses comparison based sorting, and if you pass a function object that takes one argument, this uses radix sorting. The reason for creating a function like that is that it could be backwards compatible to std::sort.

In summary: I have a sorting algorithm that’s faster than std::sort for most inputs. The sorting algorithm is on github and is licensed under the boost license, so give it a try.

I mainly did two things:

- I optimized the inner loop of in-place radix sort, resulting in the ska_byte_sort algorithm
- I provide an algorithm, ska_sort, that can perform Radix Sort on arbitrary types or combinations of types

To use it on custom types you need to provide a function that gives ska_sort a “sort key”, which should be an int, float, bool, vector, string, or a tuple or pair consisting of these. The list of supported types is long: any primitive type will work, as will anything with operator[], so std::array, std::deque and others will also work.

If sorting of data is critical to your performance (good chance that it is, considering how important sorting is for several other algorithms) you should try this algorithm. It’s fastest when sorting a large number of elements, but even for small collections it’s never slower than std::sort. (because it uses std::sort when the collection is too small)

The main lessons to learn from this are that even “solved” problems like sorting are worth revisiting every once in a while, and that it’s always good to learn the basics properly. I didn’t expect to learn anything from an “Introduction to Algorithms” course, but it already led to this algorithm, and I’m also tempted to attempt once again to write a faster hashtable.

If you do use this algorithm in your code, let me know how it goes for you. Thanks!


But first an explanation of what radix sort is: **Radix sort is a O(n) sorting algorithm working on integer keys.** I’ll explain below how it works, but the claim that there’s an O(n) sorting algorithm was surprising to me the first time that I heard it. I always thought there were proofs that sorting had to be O(n log n). Turns out sorting has to be O(n log n) if you use the comparison operator to sort. Radix sort does not use the comparison operator, and because of that it can be faster.

The other reason why I never looked into radix sort is that it only works on integer keys. Which is a huge limitation. Or so I thought. Turns out all this means is that your struct has to be able to provide something that acts somewhat like an integer. **Radix sort can be extended to floats, pairs, tuples and std::array**. So if your struct can provide for example a std::pair<bool, float> and use that as a sort key, you can sort it using radix sort.

I actually do this somewhat often when I write C++ code nowadays. One recent example was that I had to sort enemies in a game that I was working on. I wanted to sort enemies by distance, but I wanted all enemies that were already fighting with the player to come first. So here is what the comparison function looked like:

```cpp
bool operator<(const Enemy & a, const Enemy & b)
{
    return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a))
         < std::make_tuple(!IsInCombat(b), DistanceToPlayer(b));
}
```

Using that comparison operator will sort the enemies so that all enemies that are in combat with the player come first, (and they’re sorted by distance) and then there will be all enemies that are not in combat with the player. (also sorted by distance)

Except that by using this comparison operator I have to use an O(n log n) sorting algorithm. But you can use radix sort to sort tuples, so I could sort this in O(n). All I have to do is provide this function:

```cpp
auto sort_key(const Enemy & a)
{
    return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a));
}
```

If I use that sort_key function as input to radix sort, I can sort in O(n) instead of O(n log n). Neat, huh? So how does radix sort work?

Radix sort builds on top of an algorithm called counting sort, so I’ll explain that one first. Counting sort is also an O(n) sorting algorithm that works on integer keys. The big trick is that instead of using the comparison operator, we use integers as indices into an array. The big downside is that we need an array big enough that the largest integer can index into it. For a uint32 that’s 4 gigabytes of memory… But radix sort will overcome that downside, so for now let’s just look at counting sort on bytes. Then all we need is an array of size 256, because that’s big enough that any byte can index into it. I’ll start off by dumping in a full implementation in C++, then I’ll explain how this works.

```cpp
template<typename It, typename OutIt, typename ExtractKey>
void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key)
{
    size_t counts[256] = {};
    for (It it = begin; it != end; ++it)
    {
        ++counts[extract_key(*it)];
    }
    size_t total = 0;
    for (size_t & count : counts)
    {
        size_t old_count = count;
        count = total;
        total += old_count;
    }
    for (; begin != end; ++begin)
    {
        std::uint8_t key = extract_key(*begin);
        out_begin[counts[key]++] = std::move(*begin);
    }
}
```

There are three loops here: We iterate over the input array, then we iterate over our buffer, then we iterate over the input array a second time and write the sorted data to the output array:

The first loop counts how often each byte comes up. Remember, we can only sort bytes using this version because we only have an array of size 256. But that array is big enough to hold the information of how often each byte shows up.

The second loop turns that buffer into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |
| prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up 5 times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the number to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the type that you provided, not the key returned from that function. So this can be any custom struct. For example you could sort strings by length. Just use the size() function in extract_key, and clamp the length to at most 255. You could write a modified version of counting_sort that tells you where the position of the last bucket is, so that you can then sort all long strings using std::sort. (which should be a small subset of all your strings so that the second pass on those strings should be fast) I could also get my “enemy sorting” example from above to work: Store the boolean in the highest bit, and use the remaining bits to sort all enemies that are within 127 meters of the player. In my example 1 meter resolution would have been fine, (if one enemy is 1.1 meters away and the other is 1.2 meters away, I don’t care which comes first) and I really don’t care about enemies that are hundreds of meters away.

Counting sort is crazy fast and it really should be used more widely. But it sure would be nice if we could use keys bigger than a uint8_t.

Radix Sort builds on top of counting sort. The big problem with counting sort is that we need that buffer that counts how often every input comes up. If our input contains the number ten million, then our buffer has to be ten million items large because we need to increment the count at position ten million. Not good.

Radix sort builds on top of two neat principles:

1. counting sort is a stable sort. If two entries have the same number, they will stay in the same order.

2. If you sort numbers by their lowest digit first, and then do a stable sort on higher digits, the result will be a sorted list.

Point 2 is not obvious, so let me walk through an example. Let’s sort the integers { 11, 55, 52, 61, 12, 73, 93, 44 } first by their lowest digit. What we get is the list { 11, 61, 52, 12, 73, 93, 44, 55 }. You can try it yourself with counting sort, using “i % 10” as the extract_key function. Note that this is a stable sort, so for example 52 stays before 12. If we now do a second counting sort on this using the higher digit, we get the list { 11, 12, 44, 52, 55, 61, 73, 93 }. Which is a sorted list! Try it with counting sort using “i / 10” as the extract_key function.

This is a super neat observation. As long as you use a stable sorting algorithm, you can sort the low digits first and then sort the high digits after that.

So with that the implementation of radix sort is obvious: Just sort using one byte at a time, going from the lowest byte to the highest byte.

Now it should also be clear how to generalize radix sort to pairs, tuples and arrays. For a pair sort using the .second member first, and then sort using the .first member. For tuples and fixed size arrays use every element in decreasing order. (Unfortunately we can not use dynamically sized arrays as keys using this method, so we can for example not use strings as sort keys)

And with that we’re also coming to the biggest downside that radix sort has: This means that if we want to sort a two byte integer, we have to go over the input list four times. (counting sort goes over the list twice, and we have to call counting sort twice) For four bytes we have to go over the input list eight times, and for eight bytes we have to traverse sixteen times. For pairs and tuples this gets even bigger.

So radix sort is O(n), but it’s a large O(n). Counting sort is crazy fast, radix sort is not.

But still there should be some number for n where radix sort is faster than a sorting algorithm with O(n log n) complexity. Let’s find out where that is!

(oh but before we move on I should briefly mention how to make it work for signed integers and floats. Signed integers are somewhat straightforward: Just cast to the unsigned version and offset the values so that every value is positive. So for example for a int8_t, cast to uint8_t and add 128 so that -128 turns into 0, 0 turns into 128 and 127 turns into 255. For floats you have to reinterpret_cast to uint32_t then flip every bit if the float was negative, but flip only the sign bit if the float was positive. Michael Herf explains it here. The same approach works for doubles and uint64_t)

To start off with, let’s measure how fast it is to sort a single byte using counting sort:

I got a little bit creative on the scales here, so this graph needs some explaining. I measured how fast radix sort (which for one byte is just counting sort) and std::sort can sort an array. I measure for each power of two from 2 to 2^30. Since my data grows exponentially, I had to use a logarithmic scale. Then I had a problem because on the logarithmic scale it was difficult to see how big the difference between the two sorting algorithms was, so I added another line that follows a linear scale which shows the relative speed. **That dotted line at the bottom follows a different, linear scale.** But the numbers on it show you how big the relative speed is between the two algorithms.

With that explanation the first thing we notice is that both std::sort and radix sort seem to grow almost linearly. But then the second thing we notice is that even for fairly small numbers, counting sort beats std::sort handily. And as our data set grows, counting sort is between four and six times faster!

Next, let’s see how this holds up when we go from counting sort to radix sort:

When sorting four bytes, radix sort needs to do several passes over the data, and because of that it takes longer for it to beat std::sort. But even for relatively small data sets with a thousand elements, radix sort is several times faster than std::sort.

One interesting thing is that dip at the end: That is me running out of memory. Counting sort is not an in-place sort. It stores the results in a different buffer than the input buffer. Radix sort on an int32 will shuffle the data back and forth between the two buffers four times, so that the results actually do end up in the original buffer, but it still needs all that extra storage. At the last data point in that graph I’m sorting one billion elements, which are four bytes each, and I need two buffers. That requires eight gigabytes of RAM. My machine has sixteen gigabytes of RAM. In theory there should be some space left, but my machine starts slowing down once you use more than half of the available RAM. If I double the size again, radix sort never finishes because it starts swapping memory.

The big surprise from these measurements is that radix sort stays much faster than std::sort even though it now has to go over the input data four more times than when sorting a single byte. It’s hard to see in the graph, but in the underlying numbers it looks like **running radix sort on four bytes is two times slower than running radix sort on one byte**. And apparently std::sort also gets slower when sorting a bigger chunk of data, so radix sort still beats it.

In theory though there should be some data size where radix sort is slower than std::sort. Let’s try increasing the data size some more:

When sorting an int64, radix sort is “only” two to three times faster than std::sort. At least once you have more than 500 elements in your array. Since the difference in scale is linear though, we should be able to decrease the gap further by sorting bigger input data:

Aha, it looks like when we use sixteen bytes of data as the sort key, radix sort is finally slower than std::sort. At this point my implementation of radix sort has to do 18 passes over the input data, shuffling back and forth between the two buffers sixteen times. At some point that had to be slow. Note though that this does not mean that you can not sort large structs using radix sort. It only means that the sort key that you provide to radix sort has to be small. The size of the struct matters less. To prove that point here are the measurements for sorting a sixteen byte struct using an eight byte key:

When using a smaller key, radix sort is faster again. One thing to note though is that it’s not as fast as when we were just sorting an int64. That suggests that radix sort gets slower relative to std::sort as the size of the struct increases. The performance depends on the key size and the data size. So I decided to calculate the relative speed for a vector of size 2048. Meaning I did the above measurements with 2048 for “number of elements” and varied the key size and the data size, and plotted that in a table:

Time (in microseconds) to sort 2048 elements:

| data size | | key size 1 | key size 4 | key size 16 | key size 64 | key size 256 |
|---|---|---|---|---|---|---|
| 1 | radix sort | 16 | | | | |
| | std::sort | 81 | | | | |
| | relative speed | 5.2 | | | | |
| 4 | radix sort | 18 | 24 | | | |
| | std::sort | 88 | 87 | | | |
| | relative speed | 4.8 | 3.7 | | | |
| 16 | radix sort | 24 | 40 | 123 | | |
| | std::sort | 100 | 97 | 112 | | |
| | relative speed | 4.1 | 2.4 | 0.9 | | |
| 64 | radix sort | 57 | 119 | 347 | 1881 | |
| | std::sort | 141 | 138 | 150 | 254 | |
| | relative speed | 2.4 | 1.2 | 0.4 | 0.1 | |
| 256 | radix sort | 144 | 341 | 1195 | 5501 | 17657 |
| | std::sort | 413 | 443 | 459 | 577 | 698 |
| | relative speed | 2.9 | 1.3 | 0.4 | 0.1 | 0.04 |

One thing I should note is that my benchmark loop also generated 2048 random numbers. So the measurements above are really for generating 2048 random numbers using std::mt19937_64, and then sorting those random numbers. For the key size of 64 I had to generate eight random numbers and for the key size of 256 I had to generate 32 random numbers, so the overhead for the random number generation is larger in those columns.

So what can we read from this table? There are two main things to notice:

- As the key size increases (reading from left to right), radix sort gets much slower. std::sort also slows down, but not by as much. When sorting one byte (two passes) radix sort is always faster. Same thing when sorting four bytes (five passes). At sixteen bytes (eighteen passes in my implementation, but you could do it in seventeen) radix sort starts to lose, especially when the data to move around is large. Moving the data back and forth sixteen times is just slow.
- When the data size increases (reading from top to bottom) radix sort also gets slower relative to std::sort. However it looks like a data size increase does not cause radix sort to switch from being faster to being slower. In fact the gap in absolute terms actually widens every time that the data size increases.

The main reason why std::sort is not affected as much by an increase in key size is that it uses std::lexicographical_compare. Meaning if I have a key of size 256, which in my case was just a std::array<uint64_t, 32> and **if the first entry in the key differs, then std::sort can early out** and doesn’t even have to look at the remaining bytes. Since radix sort starts sorting at the least significant digit, it has to actually look at every single byte in the key. There is a variant of radix sort that looks at the most significant digit first, so it should perform better for larger keys, but I won’t talk about that too much in this piece.

All of this being said, how does radix sort perform on my initial use case of sorting a std::pair<bool, float>?

Radix sort performs very well on my initial use case. It’s faster starting at 64 elements in the array. That makes sense: the cost is essentially that of sorting by a four byte int, plus one pass for the boolean. And sorting by a boolean is the fastest possible version of counting sort: You don’t even need the buffer of 256 counts, you only need to count how many “false” elements there are in the array. So adding a boolean to a sort key will barely slow it down when using radix sort. Actually let’s talk about some more optimizations:

- As just mentioned, you can write a faster version of counting sort for booleans. You don’t need to keep track of 256 counts for booleans, you just need one: How many “false” elements there were. Then you write all “true” elements starting at that offset, and all “false” elements starting at offset 0.
- When sorting multiple bytes, you can combine the first two loops of all of them. For example when sorting four bytes, the straightforward implementation is to just call counting_sort four times. Then you would get eight passes over the input data. But if you allocate four counting buffers of size 256 on the stack, you can initialize all of them in one loop, and turn all of them into prefix sums in one loop. Then you only have to do five passes in total over the data.
- The article that explains how to sort floating point numbers using radix sort also has a trick of sorting 11 bits at a time. Instead of sorting one byte at a time. The benefit of that is that you can sort a 32 bit number in four passes instead of five. I tried that, and for me it only gave me performance benefits if the input data is between 1024 and 4096 elements large. For any input sizes larger or smaller than that, sorting one byte at a time was faster. The reason for these numbers is that when sorting 11 bits, the counting array is of size 2048, and apparently if you do the math, the algorithm is fastest when the counting array is roughly the same size as the input data. I haven’t looked too much into that.
- In my implementation of counting_sort above I use an array of type size_t[256]. If you know that each of the buckets in there is less than four billion elements in size, you could also use a uint32_t[256]. In fact I use a different type depending on the size of the input data. This does actually help because the main cost in counting_sort is cache misses. So if your count array is small, that means more of the other arrays can be in the cache.

Now that we know that radix sort can be fast, we can write a sorting algorithm that runs in O(n) for many inputs. I think that std::sort should be implemented like this:

template<typename It, typename OutIt, typename ExtractKey>
bool linear_sort(It begin, It end, OutIt buffer_begin, ExtractKey && key)
{
    std::ptrdiff_t num_elements = end - begin;
    auto compare_keys = [&](auto && lhs, auto && rhs)
    {
        return key(lhs) < key(rhs);
    };
    if (num_elements <= 16)
    {
        insertion_sort(begin, end, compare_keys);
        return false;
    }
    else if (num_elements <= 1024 || radix_sort_pass_count<ExtractKey, It>::value > 10)
    {
        std::sort(begin, end, compare_keys);
        return false;
    }
    else
        return radix_sort(begin, end, buffer_begin, std::forward<ExtractKey>(key));
}

First, the interface: Since this calls radix_sort, you have to provide a buffer that has the same size as the input array and a function to extract the sort key from the object. There could be a second version of this function with a default argument for the extract key function that just returns the value directly. So you can sort any type that radix sort supports. You would only have to provide an ExtractKey function for custom structs.

Next we decide which algorithm to use based on the number of elements. For a small number of elements, insertion_sort is generally thought to be the fastest algorithm. And for that I build a comparison function from the ExtractKey object. For a medium number of elements I would call std::sort. And for a large number of elements I would call radix_sort.

There is one more case where I call std::sort instead of radix_sort, and that is when radix sort would have to do a lot of passes over the input data. I can calculate how many passes radix sort has to do at compile time.
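I won’t reproduce my radix_sort_pass_count here, but a sketch of how such a compile-time count could work looks like this. (radix_key_bytes and radix_pass_count are hypothetical names; note that summing the members of a pair avoids counting padding bytes, which is why you can’t just take sizeof of the whole key)

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>

// Number of key bytes, computed from the type alone
template<typename T>
struct radix_key_bytes
    : std::integral_constant<std::size_t, sizeof(T)>
{
};
// For a pair, sum the members instead of taking sizeof(std::pair<A, B>),
// because sizeof would include padding. (tuples and arrays would follow
// the same pattern)
template<typename A, typename B>
struct radix_key_bytes<std::pair<A, B>>
    : std::integral_constant<std::size_t,
          radix_key_bytes<A>::value + radix_key_bytes<B>::value>
{
};
// With the combined-counting-loop optimization from earlier, radix sort
// does one pass per key byte plus one final pass over the data.
// (ignoring special cases like the cheaper boolean pass)
template<typename T>
struct radix_pass_count
    : std::integral_constant<std::size_t, radix_key_bytes<T>::value + 1>
{
};
```

Since these are all compile-time constants, the pass count check costs nothing at runtime.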

And finally the return value is a boolean that says whether the result was stored in the input buffer or in the output buffer. Depending on how many passes radix_sort has to do, the result could end up in either. So for example when sorting an int32, the function would return false because radix sort does four passes and the data ends up back in the input array, but when sorting a std::pair<bool, float> the function would return true because radix sort does five passes and the data ends up in the output array. The calling function then has to do something sensible with this information. If the two buffers are std::vectors, it could just swap them afterwards to get the data where it wants it to be.

Based on the benchmarks above, this algorithm would be several times faster than current implementations of std::sort for many inputs, and it would never be slower than std::sort.

How would we go about getting something like this into the standard? Well clearly we can’t change the interface of std::sort at this point. We could provide a function called std::sort_copy though that would have the above interface and could call radix_sort when that makes sense.

There is an in-place version of radix sort. If we used that, we could even use radix sort in std::sort. Except that we can’t get the ExtractKey function because std::sort takes a comparison functor. One solution for that would be to provide a customization point called std::sort_key which would work similar to std::hash. If your class provides a specialization for std::sort_key, std::sort is allowed to use an in-place version of radix sort when it makes sense, or it could build a comparison operator using std::sort_key and fall back to the old behavior.

This entire time we were building on top of counting_sort which needs to copy results to a different buffer. But if we could provide a version of radix sort that does all operations in one buffer, we could get that version into std::sort.

The in-place version of radix sort has one other very nice benefit: It starts sorting at the most significant digit. The version of radix sort that we used above started sorting at the least significant digit. This made radix sort slow for large keys because it always had to look at every byte of the key. **The in-place version could early out after looking at the first byte, which would potentially make it much faster for large keys**.

I will sketch out how in-place radix sort works, but I’ll leave the work of implementing it and measuring it to “future work.” I’ll explain why I didn’t implement it after I explain how it works.

We can’t build on top of counting sort because counting sort needs to copy results into a new buffer. But there is an in-place O(n) sorting algorithm called American Flag Sort. It works similar to counting sort in that you need an array to count how often each element appears. So if we sort bytes, we also need a count array of 256 elements. Then we also compute the prefix sum of this count array, like we did in counting sort. Only the final loop is different:

In the final loop of counting sort, we would directly move elements into the position that they need to be at. The prefix sum would tell us directly what the right position is. Since American Flag Sort is in-place, we need to swap instead. So let’s say the first element in the array actually wants to be at position 5. We swap it with whatever was at position 5. If the new element actually wants to be at position 3, we swap it with whatever was at position 3. We keep doing this until we find an element that actually wants to be the first element of the array. Only then do we move on to the second element in the array.

What tends to happen is that all the swapping at the first element moves a lot of elements into the right positions. Then all the swapping at the second element moves a lot more elements into the right position. So by the time that we’re a third of the way through the array, most elements are actually already sorted. So a lot of work happens on the first few elements, but at the later elements you mostly just determine that the elements are already where they want to be.

If you implement this you will need two copies of the prefix array. One copy that you change as you swap elements into place, (so that if two elements want to be in the bucket starting at position 5, the first one gets moved to position 5, and the second one to position 6) and one copy that you leave unchanged so that you can determine whether the element is already in the bucket that it wants to be in. (otherwise the element that you swapped into position 5 would think that its bucket now starts at position 6)

Now that we know how American Flag Sort works, we can implement radix sort on top of this. For that, American Flag Sort has to return the 256 offsets into the 256 buckets that it created. Then we call American Flag Sort again to sort within each of those 256 buckets, using the next byte in the key as the byte that we want to sort on. Meaning for a four byte integer, we have to sort recursively within a smaller bucket three more times after the initial sort. Since there are 256 buckets, and each of those gets split into 256 buckets after the second iteration, and each of those gets split again after the third iteration, we could end up calling the function on the order of 256^3 times. Since that is a crazy number, we can just call insertion_sort for any bucket that is less than 16 elements in size, which will be most buckets. And actually, since the in-place radix sort isn’t stable anyway, we can also just call std::sort for any bucket that is less than 1024 elements in size. That gets the number of recursive calls down by a lot.

This sounds simple: Use American Flag Sort to subdivide into 256 buckets using the first byte, then sort each of those buckets recursively using the remaining bytes as sort key. The problem that I ran into was that I had generalized radix sort to work on std::pair, std::tuple and std::array. On the in-place version these were far more complicated because you have to pass the logic for advancing and the next comparison function through all recursion layers. Especially the std::tuple code drove me template-crazy. Since American Flag Sort was also significantly slower than counting sort, I abandoned the in-place version for now and decided to leave that for future work.

So for now the takeaway is this: There is an in-place version of radix sort, but for now I decided that it’s too much work to implement. The part that I did implement looked several times slower than the copying radix sort. It might still be faster than std::sort, but I haven’t measured that.

In this article I found that radix sort is several times faster than std::sort for what seem like pretty normal use cases. So why isn’t it used all over the place? I did a quick poll at work, and many people had heard of it, but didn’t know how to use it or whether it was even worth using at all. Which is exactly what I was like before I started this investigation. So why isn’t it popular? I have a few explanations:

- Size overhead: Radix sort requires a second array of the same size to store the data in. If your data is small you may not want to pay for the overhead of allocating and freeing a second buffer. If your data is large you may not have enough memory to have that second buffer. Radix sort may still be a great option if your data is sized somewhere in the middle, but using radix sort means that you have to worry about these things.
- Can’t use radix sort on variable sized keys. The version of radix sort presented here only works on fixed sized keys. So it can’t sort strings for example. The in-place version of radix sort can sort strings, but I didn’t look into that too much.
- We used to always write custom comparison functions. If you don’t use std::make_tuple or std::tie to implement your comparison function, it may not be obvious how to use radix sort for your class. You need to know that you can sort tuples using radix sort, and you need to notice that you’re using tuples in your comparison functions already.
- I can’t find any place that generalizes radix sort to std::pair, std::tuple and std::array. So this might actually be an original contribution of mine. Googling for it I can find mentions of using radix sort on tuples of ints, but it seems like those people don’t realize that you can generalize beyond that. Certainly nobody suggests that you could use radix sort on a std::pair<bool, float>. (for example the boost version of radix sort can not sort a std::pair<bool, float>, and boost code is usually way too generic) If you think that radix sort is only for integers, it’s not very useful.
- Radix sort can not take advantage of already sorted data. It always has to do the same operations, no matter what the data looks like. std::sort can be much faster on already sorted data.

So there are certainly some good reasons for not using radix sort. There simply can’t be one best sorting algorithm. However I also think that radix sort lost some popularity due to historical accidents. People often don’t seem to think that it applies to their data even though it does.

Radix sort is an old algorithm, but I think it should be used more. Much more. Depending on what your sort key is, it will be several times faster than comparison based sorting algorithms.

Radix sort can be used to implement a general purpose O(n) sorting algorithm that automatically falls back to std::sort when that would be faster. I think the standard library should be modified so that it can provide this behavior. I think this is possible by offering an extension point called std::sort_key which would work similar to std::hash. Even without that the standard could provide std::sort_copy, which would promise O(n) sorting on small keys.

The final conclusion is that it’s worth learning something about algorithms even if you’ve programmed for a while. I learned how radix sort works because I’ve been watching an Introduction to Algorithms course by MIT. I didn’t expect to learn anything new in that course, but it’s already caused me to write this blog post, and it’s inspired me to take another stab at writing the world’s fastest hashtable. (in progress) So never stop learning, and try to fill in the gaps in your knowledge of the basics. You never know when it will be useful.

I have uploaded my implementation of radix sort here, licensed under the boost license.


Shenzhen I/O shows you a histogram of all the scores that other people have reached. If my solution would fall on the right of the bell curve, I would optimize it until I was on the left. After a lot of work I would usually arrive at an “optimal” solution that puts me in the best bracket on the histogram. Those solutions were always far from optimal.

When you’re competing with another player, they will probably find a way to beat your score by just a few points. Let’s say my score is 340 and a friend beats me with a score of 335. (lower is better; the score is just the number of executed instructions) What follows is a bunch of head-scratching about how you could possibly get any more cycles out of the algorithm. After an hour of staring and trying different things you find a small improvement, and your new score is 332! Proudly you tell your friend that you beat their score. Soon after, your friend beats your score with 320. Such a big jump seems impossible. But your friend somehow did it. So now you need to think outside of the box. You’re thinking the only way that you could possibly achieve such a big jump is if you could somehow combine these two different parts of the algorithm, so that they can share this one part of the work. It doesn’t seem possible, and it’s not even clear that this will buy that much of a score improvement, but it’s the only thing you can think of. So after another hour of head-scratching about how you could possibly achieve this you find a way to do it, and lo and behold, the wins are far bigger than expected: the new score is 310! And the next day your friend comes back with 290…

My friend and I have literally had cases where we went from a score of more than 500, where my friend thought that my score was impossible, down to a score of 202 for me, and 200 for my friend which put us completely off the charts. At that point a new patch hit that changed the level slightly so that our solutions didn’t work any more. (the game is still in early access) But if it hadn’t been for that, I wouldn’t have been surprised if we could have optimized this further. Almost every single time that I thought the limit was reached, we broke through it soon after.

I can now say for a fact that a lot of code out there is far from optimal. Even the code in our standard libraries that’s maintained by some of the best programmers and that’s used by millions is slower than it needs to be. It is simply faster than whatever code they compared it against.

On the second puzzle in the game, which serves as a kind of tutorial, the only possible score seems to be 240. Except there were some people over on the left of the histogram. And wondering how to get over there, my friend somehow got to 180, telling me “I think this one is optimal.” The score seemed unreachable. With a few tricks I got it down to 232. After literally days of thinking about this problem I managed to think outside the box and match my friend’s score of 180. It wasn’t until we talked about it that we realized that we had used different solutions. It was crazy to realize that there were in fact two entirely different ways to reach 180. Once I realized that I had used a different solution, I also realized that the solution that my friend picked could not be optimized further, but mine could. It took me hours, but I got the score down to 156, and then very quickly down to 149. My friend then beat me with 148 using my technique, forcing me to find one last cycle.

If nobody had gotten to the score of 180 before me, I couldn’t have thought of any faster way of solving this puzzle. Without that piece of information, the brain just comes up with reasons why the score is already optimal. Only once you know for a fact that a better solution is possible can you actually think of that solution. If you now say “but how did the first person get to a lower score?” then the answer is that the technique that my friend used is actually useful in other levels, so they could have gotten the trick from one of those and then just applied it in earlier levels once they had come up with it in a later level. Or maybe somebody got to the score of 232 which is just the 240 score with a few dirty tricks, and somebody else thought about how to get to the “impossible” score of 232, and accidentally got to 180 instead.

Michael Abrash has told a similar story: he was optimizing an inner loop and asked a friend for help. The friend stayed late at the office and at night left a message for Abrash telling him that he had gotten two more instructions out of the seemingly optimal inner loop. Abrash didn’t think that was possible, but before the friend came into work the next day, Abrash had already found how to reduce the loop by one instruction. At that point the friend admitted that he had actually made a mistake, and the two-instruction optimization wasn’t valid. But just the thought that the friend could have gotten two more instructions out of the loop made it possible for Abrash to find another optimization.

In the puzzle above where my friend and I brought our score down from more than 500 to 200, the final solution was actually much cleaner than the solution that has a score of 500. Well my final 202 score solution is a dirty mess, but somewhere around 220 I had just the most beautiful code. It was much cleaner than the code I had for a score of 270, which in turn was much cleaner than the code I had for 340, which in turn was cleaner than the code I had for 410. But even though the fast solution is much simpler and cleaner than the bloated, slow solution, you have to write the bloated, slow solution first. It is a necessary step in getting familiar with the problem. Only once you’re familiar with it can you recognize the points where it could be cleaner. The only way to get to the good solution is to perform many steps of filing off the bumps and cleaning up the dust.

Even big, algorithmic improvements come from writing the bad solution first and then making many small improvements. At some points the small improvements clarify something in the solution. They reveal a symmetry or uncover that some work was done twice. Sometimes a new fact reveals itself very hazily, and only more work and thought on the problem can slowly make it clearer. Sometimes you don’t realize that you just made a big, algorithmic improvement until after you’re done. “Oh I can delete this entire chunk of code now. How did that happen?” And then after the fact you can reason through the steps that took you there.

For all of this you have to keep working on the problem and you have to keep it in your head. (partly so that it’s in the back of your head when you’re sleeping or taking a shower) You can’t come up with improvements if you’re not actively working on the problem.

This is obvious for people who have worked with tests, but in the videogame industry where I work, unit tests are still rare. In Shenzhen I/O you are so ridiculously productive thanks to the automated tests, that I point out to everyone who has played it “you could be this productive at work if you just wrote tests.”

Tests allow you to have a feedback loop of seconds. Manual testing requires launching the game, teleporting to the point you want to test something, waiting for loading, then manually doing your test. (say by killing a goblin and checking that the right effects play when the goblin dies) Not only does the automated test drastically improve iteration times, it will probably test more cases and provide more helpful error messages when something goes wrong.

I think the fast iteration times in Shenzhen I/O are one of the main reasons for why it is so much more fun than normal programming. Fast feedback and fast iteration times just make programming better. Suddenly I want to go back to old code to see if I can improve it, because if I get a few more cycles out of it I can find out very quickly. How long does it take you to set up a test case at work that measures performance and measures improvements? How long does it take you to make sure that your optimization didn’t break anything? Does that keep you from trying more risky optimizations?

Slow iteration times make you work differently. Not only do they drag the fun out of programming, but they make you spend less time on improvements. They hurt your code quality. It’s worth spending time on improving iteration times even if you did the math and figured that people didn’t spend a lot of time compiling. It’s not just about time spent.

If our libraries were set up like Shenzhen I/O puzzles, all of our code would run much faster. The way this could work is that the standard library would define an interface, tests, and a simple implementation. Then anyone could submit better implementations. And you could judge how fast each solution completes each test. You pick the test cases that you care about and pick the implementation that does best in those.

People could provide several different implementations that do better in different scenarios. (“this one does better if your data grows and shrinks a lot, this one does better if it’s mostly stable”) All you have to do is make sure that your implementation satisfies certain tests.

I think if we had this we would quickly find a new, faster sorting algorithm. The current favorites seem to be Introsort and Timsort, but I am confident that they would be beat immediately. The reason is simply that nobody has worked on sorting algorithms in an environment like the one in Shenzhen I/O.

Shenzhen I/O has impressively good writing, and it really enhances the game. The story is that you’re a programmer who moves to China for a job. It’s a simple story, entirely told in email conversations with your in-game coworkers, but the small story snippets really lighten up the game. Your coworkers have personalities that seem well-researched, almost as if the author has experience with working abroad himself. Each puzzle also has a little back story. I find that I use the back story to determine whether my solution is “cheating” or not. You can “cheat” by adapting your solution to the test cases, so that only those pass and other tests cases might fail. Usually if the device still fulfills its purpose according to the back story, I’m fine with taking a shortcut. (e.g. it’s fine to err on the side of false positives for the “spoiler-blocking headphones”, but not for the security card reader. For that puzzle however false negatives based on timing are OK because people can just swipe the card again)

The emails contain funny moments between coworkers, informative emails where you learn something about China, and emails that mirror your own emotions: When you get access to a new part that will make puzzles easier your coworkers are ecstatic and so are you. Or you are confused early in the game because you have to learn a lot, and the game acknowledges and plays with your confusion by making part of the documentation Chinese. This actually helps because it makes clear that you don’t have to learn everything to get started, and it’s OK to be a bit confused. It’s very impressive how all of this is told in very short email conversations that take maybe a minute or two between puzzles.

If this was just a series of programming puzzles, it wouldn’t work nearly as well. Before playing this game I wouldn’t have thought that programming puzzles need a story. The game would work without a story, it just wouldn’t be as good.

You should play Shenzhen I/O. It takes all the fun parts of programming and distills them into a game. If you can, convince a friend and start roughly at the same time.

The game teaches persistence and how to improve a solution with any means necessary. The game teaches out of the box thinking. When you have a tiny, constrained problem and somehow people are much faster than you, you have to think outside the box. (or sometimes you think outside the box for an hour only to realize that there was an obvious improvement left to do inside the box)

The game shows a great way to program, making you incredibly productive which will make your work feel sluggish in comparison. It’ll make you want to improve your tools at work.

Many of these lessons aren’t new, but since Shenzhen I/O is such a condensed experience, it makes these lessons clear and easy to acquire. It’s a great way to spend time as a programmer.


To illustrate let’s look at how objects were composed before C++11, what problems we ran into, and how everything just works automatically since C++11. Let’s build an example of three objects:

struct Expensive
{
    std::vector<float> vec;
};
struct Group
{
    Group();
    Group(const Group &);
    Group & operator=(const Group &);
    ~Group();

    int i;
    float f;
    std::vector<Expensive *> e;
};
struct World
{
    World();
    World(const World &);
    World & operator=(const World &);
    ~World();

    std::vector<Group *> c;
};

Before C++11 composition looked something like this. It was OK to have a vector of floats, but you’d never have a vector of more expensive objects because any time that that vector re-allocates, you’d have a very expensive operation on your hand. So instead you’d write a vector of pointers. Let’s implement all those functions:

Group::Group()
    : i()
    , f()
{
}
Group::Group(const Group & other)
    : i(other.i)
    , f(other.f)
{
    e.reserve(other.e.size());
    for (std::vector<Expensive *>::const_iterator it = other.e.begin(), end = other.e.end(); it != end; ++it)
    {
        e.push_back(new Expensive(**it));
    }
}
Group & Group::operator=(const Group & other)
{
    i = other.i;
    f = other.f;
    for (std::vector<Expensive *>::iterator it = e.begin(), end = e.end(); it != end; ++it)
    {
        delete *it;
    }
    e.clear();
    e.reserve(other.e.size());
    for (std::vector<Expensive *>::const_iterator it = other.e.begin(), end = other.e.end(); it != end; ++it)
    {
        e.push_back(new Expensive(**it));
    }
    return *this;
}
Group::~Group()
{
    for (std::vector<Expensive *>::iterator it = e.begin(), end = e.end(); it != end; ++it)
    {
        delete *it;
    }
}
World::World()
{
}
World::World(const World & other)
{
    c.reserve(other.c.size());
    for (std::vector<Group *>::const_iterator it = other.c.begin(), end = other.c.end(); it != end; ++it)
    {
        c.push_back(new Group(**it));
    }
}
World & World::operator=(const World & other)
{
    for (std::vector<Group *>::iterator it = c.begin(), end = c.end(); it != end; ++it)
    {
        delete *it;
    }
    c.clear();
    c.reserve(other.c.size());
    for (std::vector<Group *>::const_iterator it = other.c.begin(), end = other.c.end(); it != end; ++it)
    {
        c.push_back(new Group(**it));
    }
    return *this;
}
World::~World()
{
    for (std::vector<Group *>::iterator it = c.begin(), end = c.end(); it != end; ++it)
    {
        delete *it;
    }
}

Oh god this is painful to write now, but this illustrates how people used to do composition. Or rather, most of the time people just made their type non-copyable instead. Nobody would have wanted to maintain all this code (it’s too easy to make typos in mindless code like this), so the easiest thing to do was to make the type non-copyable.

In fact, oftentimes types simply looked non-copyable because it’s difficult to reason through all these pointers. So in a sense it doesn’t matter that you could have implemented a copy constructor; the problem was that it’s difficult to reason through everything.

Nowadays I would write the above classes like this:

struct Expensive
{
    std::vector<float> vec;
};
struct Group
{
    int i = 0;
    float f = 0.0f;
    std::vector<Expensive> e;
};
struct World
{
    std::vector<Group> c;
};

This does everything that the above code did and it does it faster and with fewer heap allocations. The main feature in C++11 that made this possible was the addition of move semantics. Why isn’t this possible without move semantics? After all, that last chunk of code would have compiled fine and run fine before C++11. But before C++11 people would have changed this code to look like the code further up. To see why, imagine what happens when we add a new Group to the World.

If the vector in World reallocates its internal storage, we have to create temporary copies of our Groups and may have to allocate thousands of temporary vectors in the nested classes. Just to do an operation that’s internal to the vector. It’s terrible that we can randomly get slowdowns like this from harmless operations like a push_back.

The first time that somebody catches this in a profiler they will take a look at the codebase and find that we rarely copy Groups. So why don’t we just replace the internals with a pointer? That will make the copy more expensive but it will make growing and shrinking the vector practically free because we don’t have to copy in that case. We get a huge performance improvement and everyone is happy. And with that we’re back at the initial code.

Move semantics solve that problem. **With move semantics objects can re-organize their internals without having to copy everything that they own**. That’s obviously very useful for std::vector, but it turns out to be useful in a lot of classes.

Move semantics also gives composition to types that aren’t copyable. Before C++11 you could use RAII for non-copyable types, but then you couldn’t compose them as well as other classes. To illustrate let’s add some kind of OS handle to the Expensive struct. And let’s say that this OS handle requires manual clean-up:

struct Expensive
{
    Expensive()
        : h(GetOsHandle())
    {
    }
    ~Expensive()
    {
        FreeOsHandle(h);
    }

    HANDLE h;
    std::vector<float> vec;
};

And just with that, everything is ruined. Expensive now can’t safely be copied (the implicitly generated copy would free the handle twice) and is never moved. That immediately breaks Group, which immediately breaks World. To fix this we could change Group to use a pointer to Expensive instead of using Expensive by value. But then Group has to be non-copyable, too, and World is still broken. So now we also have to change World to store Group by pointer, and we propagate our ugliness all the way through the codebase. A single type that requires manual clean-up makes us add the boilerplate code from C++98 to do composition to all other classes that use it directly or indirectly. It’s a mess.

Of course you know the solution already: Move semantics. If we just wrap the OS handle in a type that supports move semantics, everything continues to work:

    struct WrappedOsHandle
    {
        WrappedOsHandle() : h() {}
        WrappedOsHandle(HANDLE h) : h(h) {}
        WrappedOsHandle(WrappedOsHandle && other) : h(other.h)
        {
            other.h = HANDLE();
        }
        WrappedOsHandle & operator=(WrappedOsHandle && other)
        {
            std::swap(h, other.h);
            return *this;
        }
        ~WrappedOsHandle()
        {
            if (h) FreeOsHandle(h);
        }
        operator HANDLE() const { return h; }

    private:
        HANDLE h;
    };

    struct Expensive
    {
        Expensive() : h(GetOsHandle()) {}

        WrappedOsHandle h;
        std::vector<float> vec;
    };

It’s a bit of boilerplate, but there are ways of avoiding it (for example by using a unique_ptr with a custom deleter). Now whenever we use this handle type, our class stays composable. Group keeps working, World keeps working, and everyone is happy.

There is a more fundamental reason why this works and why RAII is important for this: **Composing objects is a lot easier if certain operations are standardized**. If my object A consists of two objects B and C, it’s a lot easier to write the clean-up code for A if the clean-up code for all types is standardized. Otherwise B and C might have custom clean-up code and now A has to also have custom clean-up code. If everyone standardizes on one way to clean up objects, composition is easier.

The list of functions that make composition easier is long. It includes construction, copying, moving, assignment, swapping, destruction, reflection, comparison, hashing, checking for validity, pattern matching, interfacing with scripting languages, serialization in all its many forms and more. For example it’s a lot easier to write a hash function for my type if there is a standard way to hash my components. Or it’s a lot easier to copy my type if there is a standard way to copy my components. Not all types need all operations from this list, but if your type does need one of these, you’ll want a standard interface for your components. In fact once there is a standard way, you might as well automate this.

C++ has decided to automate the bare necessities out of that list: Construction, copying, moving, assignment and destruction. And it did this in the set of rules that we call RAII. If you use RAII, composition will be a lot easier for you. You’ll find that you’ll have a lot more types that just slot together and just work together. It’ll improve your code.

Oh and this is also another good reason to standardize reflection: With reflection, I can automate a lot of other elements in that list.


At least that’s my first impression. Still just dipping my toes in. But there is one thing I am very impressed with: How much data neural networks can express in how few connections.

To illustrate let me draw a very simple neural network. It’s not a very interesting neural network, I’m just connecting inputs to outputs:

And now let’s say that I want to teach this neural network the following pattern: Whenever input 1 fires, fire output 2. When input 2 fires, fire output 3. When input 3 fires, fire output 4. When input 4 fires, fire output 5. Output 1 never gets fired and input 5 never gets fired. To do that you use an algorithm called back propagation and repeatedly tell the network what output you expect for a given input, but that is not what I want to talk about here. I want to talk about the results. I’ll make the connections that the network learns stronger, and the connections that the network doesn’t learn weaker:

What happened here is that the weights for some connections have increased, while the weights for most connections have decreased. I haven’t talked about weights yet. Weights are what neural networks learn. When we draw networks, the nodes seem more important. But what the network actually learns and stores are weights on the connections. In the picture above the thick lines have a weight of 1, the others have a weight of 0. (weights can also go negative, and they can go up arbitrarily high)

A simple network like the one above can learn simple patterns like this. If we want to learn more complex patterns, we have to create a network that has more layers. Let’s insert a layer with two neurons in the middle:

There are many possible behaviors for those neurons in the middle, and that is the part where most of the magic happens in neural networks. It seems like picking what goes there just requires experimentation and experience. A simple neuron would be the tanh neuron which does these three steps:

- add up all of its inputs multiplied by their weights
- calculate tanh() on the sum
- fire the result to its outputs multiplied by their weights

Why tanh? Because it has a couple of convenient properties, and it happens to work better than other functions that have the same properties. But mostly it’s “because it works like this and it doesn’t work when we change it.”

If you count the number of connections on that last picture you will notice that there are fewer than there were in the first network. There were 5*5=25 at first, now there are 5*2*2=20. That reduction would be larger if I had more input and output nodes. For 10 nodes that reduction would be from 100 connections down to 40 when inserting two nodes in the middle. That’s where the compression in the title of this blog post comes from.

The question is whether we can still represent the original pattern in this new representation. To show why that is not obvious, I’ll explain why it doesn’t work if you just have one middle neuron:

If I initialize the weights on here so that the first node fires the second output, all nodes will fire the second output:

That one node in the middle messes up the ability for my network to learn my very simple rule. This network can not learn different behaviors for different input nodes because they all have to go through that one node in the middle. Remember that the node in the middle just adds all of its inputs, so it can not have different behaviors depending on which input it receives. The information of where a value came from gets lost in the summation.

So how many nodes do I need to put into the middle to still be able to learn my rule? In real neural networks you usually put hundreds of nodes into that middle layer, but what is the minimal number to learn my pattern? Before I get to the answer I need to explain one more type of neuron that enables my compression: The softmax layer. If I make my output layer a softmax layer, that means that it will look at all the activations on that layer, and that it will only fire the output with the highest activation. That’s where the “max” in the softmax comes from: It fires the max node. The “soft” part is also very important in other contexts because a softmax layer can fire more than one output, but in my case it will only ever fire one output so we’ll stay with this explanation. If I make my middle layer a tanh layer and my output layer a softmax layer, I can represent my pattern just by having two nodes in the middle:

Here I’ve made lines with positive weights blue, and lines with negative weights red. This means that if for example the first input fires, I will get these values on the nodes:

The first node activates both the middle nodes. That activates the second output node with weight two. The next two nodes are canceled out because they receive a positive weight from one of the middle nodes and a negative weight from the other. The last node is activated with weight -2. If I then run softmax on this only the top node will fire. Meaning the bottom node will not fire a negative output. Softmax suppresses everything except for the most active node.

This is a little bit simplified, because the tanh layer doesn’t give out nice round numbers and the softmax layer wants bigger numbers, but those complications are not necessary to understand what’s going on, so I’ll keep the numbers in this blog post simple. (the real weights in my test code are +9 and -9 on the connections going from the input layer to the middle layer, and +26 and -26 on the connections going from the middle layer to the output layer. I don’t know why those numbers specifically, that’s just where the network decided to stabilize)

Let’s quickly run through this for the other three cases as well:

As you can see there is always one clear winner and then the softmax layer at the end will make sure that only that one fires and the other outputs remain quiet. When I first saw this behavior I was quite impressed. In fact I saw this behavior on a network with eleven inputs, eleven outputs and just two middle nodes. Can you think of how the above layer would work with eleven inputs? It seems like there are only four possible combinations for these weights and we’ve used all of them, right? You’re underestimating neural networks. It’s quite impressive how they will try to squeeze any pattern that you throw at them into what’s available. For eleven inputs and eleven outputs it looks like this:

That is a lot of connections and a lot of numbers. The network has now decided to make some connections stronger and other connections weaker. I couldn’t keep using colors and line-style to visualize different strengths because now there are a lot of different values. Whenever I tried to simplify and not use numbers I’d break the network. So instead I just show the numbers. The upper number on each node is the weight of the connection to/from the upper node in the middle layer, and the lower number is the weight of the connection to/from the lower node in the middle layer.

Let’s walk through a random example and see how it works:

The node with the largest output value (72 = -10 * -4 + 8 * 4) is the one we want to fire, and softmax will select that. In this picture I’m just multiplying the numbers linearly because the weights on the left actually already have tanh applied (and were then multiplied by 10 to get them into the range -10 to 10 which is slightly nicer for pictures than the range -1 to 1). I can do that in this case since I only ever fire one input. And if I can just do linear math the example is easier to follow.

I’ll post a second example to show that the weights will activate the desired output for any input:

Here also for my given input n, output n+1 had the highest value at the end and softmax will make that output fire. You could go through a couple more examples and convince yourself that this network has learned the pattern that I want it to learn.

I don’t know about you, but I think this is quite impressive. For the small first network I think I could have figured out how to put the numbers in manually. (once you know the trick that the nodes can cancel each other out it’s easy) For all the connections on the larger network you would have to be very good at balancing these weights, and I honestly don’t think I could have done this by hand. The neural network however seems to just figure this out. Or rather: it figures this out if you use a tanh layer in the middle and a softmax layer at the end. If you don’t, it will stop working.

Since I’m using real numbers now I might as well explain how softmax works. Softmax uses the number at the end as an exponent and then normalizes the column afterwards. If I use 2^x it should be obvious that 2^56, the highest number, is a lot bigger than 2^44, the next highest number. If I have 2^56 in the column and I normalize it, all the other activations will go very close to zero and the output that I want to fire will be close to 1. Also by using an exponent I get no negative numbers. 2^-96 is just a number that’s very close to zero. The standard is to use e^x instead of 2^x, and I also use that, it’s just that as a programmer I find it easier to explain and easier to understand using powers of two.

With that information I can give a bit of an intuition why this happens to work if you use tanh and softmax: When using softmax, the connections can keep on improving the chance of their desired output just by making their own weights bigger. In fact since I just used linear math in my explanation above and didn’t need to run tanh() on anything, couldn’t softmax have learned those weights even without using a tanh layer? In theory it could, but in practice it will keep on bumping up the numbers to get small improvements and you will very quickly overflow floating point numbers. The tanh layer in the middle is then just responsible for keeping the numbers small: it clamps its outputs to the range -1 to 1, and the inputs won’t grow past a certain point either because at some point a bigger number just means that the output goes from 0.995 to 0.997. But since they can usually still get a small improvement you don’t lose that nice property of softmax where neurons can keep on edging out small improvements over each other.

So if you just use softmax your weights will explode and if you just use tanh your weights will stagnate too soon before a good solution emerges. If you use both you get nice greedy behavior where the connections keep on looking for better solutions, but you also keep the weights from exploding.

Now all I’ve shown is that my network can learn the rule that when input n fires, it needs to fire output n+1. It should be easy to see that just by creating a different permutation of the weights of the connections I can teach my network that for any input neuron x, it should fire any other output neuron y. The impressive part is that without the middle layer I would have needed 11*11 = 121 connections to have a network that can learn any combination, and with that middle layer I can do it in only 11*2*2 = 44 connections.

Eleven inputs and outputs is close to the limit for what you can do with two nodes in the middle. If you ask it to do much more the weights will fluctuate and they will keep on stepping on each other’s toes. But with three nodes in the middle I can learn these patterns for something like thirty inputs and outputs, and with four nodes, I can learn direct connections for more than a hundred inputs and outputs. It doesn’t grow linearly because the number of combinations doesn’t grow linearly. Luckily the back propagation algorithm scales with the number of connections, not with the number of combinations. So with a linear increase in computation cost I get a super-linear increase in the amount of stuff I can learn.

I’m still just getting started with neural networks, and real networks look nothing like my tiny examples above. I don’t know how many neurons people actually use nowadays, but even small examples talk about hundreds of neurons. At that point it’s more difficult to understand what your network has learned. You can’t visualize it as easily as the above pictures. In fact even the eleven neuron picture is difficult to understand because you can’t just look at the strong connections. Even the weak connections are important. If that middle layer had 700 neurons, then good luck getting a picture of which connections are actually important. Maybe a bunch of weak connections add up to build the output you wanted and one random strong connection suppresses all the outputs that you didn’t want.

But I hope I have given you an intuition for how neural networks can compress patterns in few weights. They use the full range of the weights to the point where a connection activated with a strong input can mean something entirely different than the same connection activated with a weak input. And best of all I didn’t have to teach them to do this. They just start behaving like this if you force them to express a complex pattern in few connections.


The obvious use case is for news reports and videos. When I now see pictures from Syria I want to have a 360 degree picture to be able to look around and get a better feel for the situation. But for video games, VR didn’t quite click for me.

At GDC I played two games that used the Oculus and Vive controllers, and now I finally get what you can do with VR in games that you can’t do otherwise: You can use your own hands to interact with things in the virtual world.

The demo that impressed me most was Epic’s Bullet Train:

In Bullet Train bullets go into slow motion as they get close to you. You can pick them out of the air, turn them around, and watch them whizz away again towards your enemies. And it feels like you’re using your own hands for this. You soon realize that you can give the bullets a little momentum by flicking them and it feels amazing.

I played this with the Oculus Touch controller, but you very quickly forget that you’re using a controller. Just like when using a classical controller, you’re soon immersed in the game and you forget that you’re pressing buttons or moving sticks. There are also a few small tricks they do to help with the illusion: In the game you see your movement as hands, and your brain quickly accepts those hands as your own hands. Additionally the buttons on the controller are very touch sensitive. Just resting your finger on a button will make that finger move in VR. Pressing the button will make the finger close in to grab something. It feels natural.

The other game I played was Fantastic Contraption:

Just look at how naturally she moves her hands to build things in there. The video is pretty long and you don’t have to watch all of it. The point is: In VR, video games allow you to use your hands and it actually works.

The Wii and the Kinect promised something similar, but they were both kinda crap. The Playstation Move was supposed to be better (I never actually tried it) but there is still that disconnect that you’re moving your hand in front of you and something happens far away on a screen. In VR you’re moving your hands in front of you and they also move in front of you in the game. It works the way it should.

In the above video there are many points where she moves her hands outside of her field of vision. Either to drop something behind her or to grab something on her side that she knows is there. That kind of spatial reasoning would have been impossible with older motion controllers.

Pool Nation is a game where it looks like they stumbled across this by accident:

I don’t know if they had planned for the interactions with the bottles and other bar items, but I can see exactly how it would have happened by accident: As soon as you put on VR goggles and VR controllers and you see a bottle standing off to the side, you will want to reach for it, pick it up, and maybe toss it somewhere. You want to use your hands. As soon as the developers put that in, all the other interactions shown in the video come naturally.

I believe this last video best captures what VR will be about a couple years from now. I think this ability to use your hands like you would expect to will be the defining feature of early successful VR games. It’s the feature that finally convinced me that you can make games in VR that you couldn’t make outside of it. Remember how hyped people were about motion controllers when the Wii came out? The feature actually works now.

Oh also I’m happy to report that I didn’t feel motion sick in the VR games I played at GDC. As someone who was quickly affected by that in the dev kits, I can report that they have that figured out well enough now that I can play for at least fifteen minutes without noticing anything.


Still, the fact remains that such arguments have been insufficient to result in widespread adoption of functional programming. We must therefore conclude that the main weakness of functional programming is the flip side of its main strength – namely that problems arise when (as is often the case) the system to be built must maintain state of some kind.

I think the reason for the lack of popularity is much simpler: Writing functional code is often backwards and can feel more like solving puzzles than like explaining a process to the computer. In functional languages I often know what I want to say, but it feels like I have to solve a puzzle in order to express it to the language. Functional programming is just a bit too weird.

To talk about functional programming let’s bake a cake. Taking a recipe from here, this is how you bake an imperative cake:

1. Preheat oven to 175 degrees C. Grease and flour 2 – 8 inch round pans. In a small bowl, whisk together flour, baking soda and salt; set aside.
2. In a large bowl, cream butter, white sugar and brown sugar until light and fluffy. Beat in eggs, one at a time. Mix in the bananas. Add flour mixture alternately with the buttermilk to the creamed mixture. Stir in chopped walnuts. Pour batter into the prepared pans.
3. Bake in the preheated oven for 30 minutes. Remove from oven, and place on a damp tea towel to cool.

I’d take some issue with the numbering there (clearly every step is actually several steps) but let’s see how we bake a functional cake.

- A cake is a hot cake that has been cooled on a damp tea towel, where a hot cake is a prepared cake that has been baked in a preheated oven for 30 minutes.
- A preheated oven is an oven that has been heated to 175 degrees C.
- A prepared cake is batter that has been poured into prepared pans, where batter is mixture that has chopped walnuts stirred in. Where mixture is butter, white sugar and brown sugar that has been creamed in a large bowl until light and fluffy…

Ah screw it I can’t finish this. I don’t know how to translate the steps without mutable state. I can either lose the ordering or I can say “then mix in the bananas,” thus modifying the existing state. Anyone want to try finishing this translation in the comments? I’d be interested in a version that uses monads and one that doesn’t use monads.

Imperative languages have this huge benefit of having implicit state. Both humans and machines are really good at implicit state attached to time. When reading the cake recipe, you know that after finishing the first instruction the oven is preheated, the pans are greased and we have mixed a batter. This doesn’t have to be explicitly stated. We have the instructions and we know what the resulting state would be of performing the instructions. Nobody is confused by the imperative recipe. If I was able to actually finish writing the functional recipe and if I showed it to my mom, she would be very confused by it. (at least the version that doesn’t use monads would be very confusing. Maybe a version using monads wouldn’t be as confusing)

I’m writing this blog post because I ran into a related problem recently. C++ templates are accidentally a functional language. When this was realized, the problem wasn’t fixed; instead the C++ designers doubled down on functional templates, which can make it terribly annoying to convert code to generic code. Here’s something I wrote recently for a parser: (I know it’s stupid to write your own parser, but the old tools like yacc or bison are bad, and when I tried to use boost spirit I ran into a few problems that took way too long to figure out, until eventually I decided to just write my own)

    ParseResult<V> VParser::parse_impl(ParseState state)
    {
        ParseResult<A> a = a_parser.parse(state);
        if (ParseSuccess<A> * success = a.get_success())
            return ParseSuccess<V>{{std::move(success->value)}, success->new_state};
        ParseResult<B> b = b_parser.parse(state);
        if (ParseSuccess<B> * success = b.get_success())
            return ParseSuccess<V>{{std::move(success->value)}, success->new_state};
        ParseResult<C> c = c_parser.parse(state);
        if (ParseSuccess<C> * success = c.get_success())
            return ParseSuccess<V>{{std::move(success->value)}, success->new_state};
        ParseResult<D> d = d_parser.parse(state);
        if (ParseSuccess<D> * success = d.get_success())
            return ParseSuccess<V>{{std::move(success->value)}, success->new_state};
        return select_parse_error(*a.get_error(), *b.get_error(), *c.get_error(), *d.get_error());
    }

This function parses a variant type called “V” by trying to parse the types A, B, C and D. They have better names in the real code but those names are not important. There is some obvious repetition here: This calls exactly the same code for four different parsers. C++ doesn’t really support the monad pattern, but I could make this reusable by writing a loop that iterates over all four, trying them in order:

    template<typename Variant, typename... Types>
    ParseResult<Variant> parse_variant(ParseState state, Parser<Types> &... parsers)
    {
        boost::optional<ParseError> error;
        template<typename T> for (Parser<T> & parser : parsers)
        {
            ParseResult<T> result = parser.parse(state);
            if (ParseSuccess<T> * success = result.get_success())
                return ParseSuccess<Variant>{{std::move(success->value)}, success->new_state};
            else
                error = select_parse_error(error, *result.get_error());
        }
        return *error;
    }

    ParseResult<V> VParser::parse_impl(ParseState state)
    {
        return parse_variant<V>(state, a_parser, b_parser, c_parser, d_parser);
    }

There is some overhead here because I have to select one of the error messages to return, but overall this is a pretty straightforward transition to do. Except you can’t do this in C++. As soon as templates are involved you have to think more functionally. Here is my solution:

    template<typename Variant, typename First>
    ParseResult<Variant> parse_variant(ParseState state, Parser<First> & first_parser)
    {
        ParseResult<First> result = first_parser.parse(state);
        if (ParseSuccess<First> * success = result.get_success())
            return ParseSuccess<Variant>{{std::move(success->value)}, success->new_state};
        else
            return *result.get_error();
    }

    template<typename Variant, typename First, typename... More>
    ParseResult<Variant> parse_variant(ParseState state, Parser<First> & first_parser, Parser<More> &... more_parsers)
    {
        ParseResult<First> result = first_parser.parse(state);
        if (ParseSuccess<First> * success = result.get_success())
            return ParseSuccess<Variant>{{std::move(success->value)}, success->new_state};
        else
        {
            ParseResult<Variant> more_result = parse_variant<Variant>(state, more_parsers...);
            if (ParseSuccess<Variant> * more_success = more_result.get_success())
                return std::move(*more_success);
            else
                return select_parse_error(*result.get_error(), *more_result.get_error());
        }
    }

    ParseResult<V> VParser::parse_impl(ParseState state)
    {
        return parse_variant<V>(state, a_parser, b_parser, c_parser, d_parser);
    }

I am actually very happy with this. Sure it’s harder to read because the iteration is hidden in a recursion, but you should have seen what I had before I came across this solution: I had a struct with a std::tuple<std::reference_wrapper<Parser<T>>...> member. If you’ve ever worked with a variadic sized std::tuple, you know that that alone turns any code into a puzzle.

In any case the point is this: I had some straight imperative code that was doing the same thing several times. In order to make it generic I couldn’t just introduce a loop around the repeated code, but I had to completely change the control flow. There is too much puzzle solving here. In fact I didn’t solve this the first time I tried. In my first attempt I ended up with something far too complicated and then just left the code in the original form. Only after coming back to the problem a few days later did I come up with the simple solution above. Making code generic shouldn’t be this complicated. The work here is not trying to figure out what to do, but it’s trying to figure out how to satisfy a system.

I get that too often in functional languages. I know C++ templates are a bad functional language, but even in good functional languages I spend too much time trying to figure out how to say things as opposed to figuring out what to say.

Now all that being said do I think that functional programming is a bad thing? Not at all! The benefits of functional programming are real and valuable. Everyone should learn at least one functional programming language and try to apply what they learned in other languages. But if functional programming languages want to become popular, they have to be less about puzzle solving.


Based on frustrations with existing solutions I decided to write something that can

- read and write existing, non-POD classes. No custom build step should be required
- write human readable formats like JSON
- read data very quickly (I care less about how fast it writes stuff. I work in an environment where there is a separate compile step for all data)

I came up with something that is faster than any of the libraries mentioned above.

The fastest possible way I know how to load stuff is to just memcpy the bytes straight from disk. So I’m going to load and save a POD struct. I know I said above that I want to save non-POD classes, but if I want to compare to memcpy, I have to keep my test case simple. Here is what it looks like:

    struct memcpy_speed_comparison
    {
        float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        int i = 0;
        float f = 0.0f;
    };

Where the loading and saving code looks like this:

    void test_write_memcpy(const std::string & filename, const std::vector<memcpy_speed_comparison> & elements)
    {
        std::ofstream file(filename, std::ios::binary);
        size_t size = elements.size();
        file.write(reinterpret_cast<const char *>(&size), sizeof(size));
        file.write(reinterpret_cast<const char *>(elements.data()), sizeof(memcpy_speed_comparison) * elements.size());
    }

    std::vector<memcpy_speed_comparison> memcpy_from_bytes(ArrayView<const unsigned char> bytes)
    {
        size_t size = *reinterpret_cast<const size_t *>(bytes.begin());
        ArrayView<const unsigned char> content = { bytes.begin() + sizeof(size), bytes.end() };
        std::vector<memcpy_speed_comparison> elements(size);
        memcpy(elements.data(), content.begin(), content.size());
        return elements;
    }

I memcpy the bytes to disk using file.write(), and then later I memcpy them back from the byte representation. (I keep using the term memcpy to indicate that I’m just copying bytes and not doing anything smarter than that. Just saying “copy” doesn’t properly convey that in my opinion, so memcpy it is)

I then use google benchmark to set up two benchmarks to measure how fast this is:

    void MemcpyInMemory(benchmark::State & state)
    {
        std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
        while (state.KeepRunning())
        {
            std::vector<memcpy_speed_comparison> comparison = elements;
            RAW_ASSERT(comparison == elements);
        }
    }

    void MemcpyReading(benchmark::State & state)
    {
        std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
        std::string memcpy_filename = "/tmp/memcpy_test";
        test_write_memcpy(memcpy_filename, elements);
        while (state.KeepRunning())
        {
            MMappedFileRead file(memcpy_filename);
            std::vector<memcpy_speed_comparison> comparison = memcpy_from_bytes(file.get_bytes());
            RAW_ASSERT(comparison == elements);
        }
    }

Google Benchmark measures everything inside of the block that starts with while (state.KeepRunning()), meaning all of the writing code above is not part of the measurements.

The first benchmark measures how fast it is to memcpy the comparison data in memory. The second measures how long it takes to memcpy them back from disk. The reason why I measure both will become clear when we look at how long the above benchmarks take:

Benchmark | Time(ns) | CPU(ns)
---|---|---
MemcpyInMemory | 482345 | 482846
MemcpyReading | 663630 | 661515

Does it surprise you that copying from disk is only slightly slower than copying in memory? I should say that this is copying from a SSD, but even then I didn’t expect loading from disk to be this fast. Turns out it’s not this fast. What I’m measuring above is how fast it is to copy from the OS cache. Which is slightly slower than copying from normal memory, because you have the overhead of a bunch of page faults as the pages are mapped in, but overall it’s still just copying from memory. In order to measure how fast it is to copy from disk I have to make sure that the file gets evicted from the OS cache between runs. Luckily on Linux you can do that with these two calls:

```cpp
fdatasync(file_descriptor);
posix_fadvise(file_descriptor, 0, 0, POSIX_FADV_DONTNEED);
```

Where posix_fadvise is the call that matters, but if you don’t call fdatasync first, the pages that haven’t been written to disk yet will not be evicted from the cache. That doesn’t matter in this benchmark because we are not writing in the loop, so the call to posix_fadvise would be enough, but for a general function that flushes a file from cache you need to call both.
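A minimal helper wrapping both calls might look like the following sketch (evict_file_from_os_cache is a name I made up for illustration; this is Linux-specific):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical cache eviction helper: flush any dirty pages to disk first,
// then tell the kernel we no longer need the cached pages for this file.
bool evict_file_from_os_cache(int file_descriptor)
{
    if (fdatasync(file_descriptor) != 0)
        return false;
    return posix_fadvise(file_descriptor, 0, 0, POSIX_FADV_DONTNEED) == 0;
}
```

Calling this between benchmark iterations forces the next read to actually hit the disk.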

With this call added I can actually measure how long it takes to memcpy data from a mmapped file, and these are the correct results:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |

Which looks much more like what you would expect: reading from an SSD is roughly ten times more expensive than reading from memory.

I’m using mmap() here, mainly because I will use mmap later and I want to be able to easily compare the numbers. If you’re interested in how fast using read(2) is, it’s essentially the same speed: 5437341 ns if I’m reading directly into the std::vector.

I decided to compare the above measurements with three libraries: Boost Serialization, protobuf and flatbuffers. Why these three? Because I knew them and because they are representative of other solutions that use similar approaches. Boost Serialization is a non-intrusive library that can save existing classes; protobuf and flatbuffers both require a custom build step. The difference between protobuf and flatbuffers is that flatbuffers doesn’t do any actual loading: you just get a view into the data on disk.

Boost Serialization requires that I add this member function to the class:

```cpp
struct memcpy_speed_comparison
{
    // ...
    template<typename Ar>
    void serialize(Ar & archive, int)
    {
        archive & vec;
        archive & i;
        archive & f;
    }
};
```

I could also write this as a non-member function but that doesn’t really matter for this example. This exposes the three members of this struct to an “archive” class that boost serialization provides and that I don’t need to know anything about, other than that it has an operator&. This code is all I have to do to read and write a vector of these.

The code for the benchmark then looks like this:

```cpp
void BoostSerializationReading(benchmark::State & state)
{
    std::string boost_filename = "/tmp/boost_serialization_test";
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    {
        std::ofstream file_out(boost_filename);
        boost::archive::binary_oarchive out(file_out);
        out & elements;
    }
    while (state.KeepRunning())
    {
        std::vector<memcpy_speed_comparison> comparison;
        std::ifstream file_in(boost_filename);
        boost::archive::binary_iarchive in(file_in);
        in & comparison;
        RAW_ASSERT(comparison == elements);
        file_in.close();
        UnixFile(boost_filename, 0).evict_from_os_cache();
    }
}
```

Which is pretty straightforward: create the archive class that boost serialization provides and call operator& on it with the data that I want to read and write. I really like this library. The evict_from_os_cache() line at the end is a bit awkward, but the reason for that is that I decided to add it to a UnixFile class of mine which just wraps open() and close().

Here is the performance of the above benchmark:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
BoostSerializationReading | 8881369 | 8200465 |

Which is disappointing. I blame this on the fact that Boost Serialization goes through the std::istream interface, but I can’t be sure. I didn’t investigate further.

Protobuf cannot read or write the above class, so I had to make protobuf generate an equivalent class. Here is the protobuf description that I used to mirror the above data:

```proto
package test;
message Vec4
{
    required float f0 = 1;
    required float f1 = 2;
    required float f2 = 3;
    required float f3 = 4;
}
message LoadFastTest
{
    required Vec4 vec = 1;
    required int32 i = 2;
    required float f = 3;
}
message Array
{
    repeated LoadFastTest array = 1;
}
```

This description will generate three classes that are kind of similar to the std::vector that I used above. The code for the benchmark looks like this:

```cpp
void ProtobufReading(benchmark::State & state)
{
    test::Array array;
    for (int i = 0; i < 100000; ++i)
    {
        test::LoadFastTest * test = array.add_array();
        test->mutable_vec()->set_f0(0.0f);
        test->mutable_vec()->set_f1(0.0f);
        test->mutable_vec()->set_f2(0.0f);
        test->mutable_vec()->set_f3(0.0f);
        test->set_f(float(i));
        test->set_i(i % 5);
    }
    StringView<const char> tmp_filename = "/tmp/protobuf_test";
    {
        UnixFile file(tmp_filename, O_RDWR | O_CREAT, 0644);
        array.SerializePartialToFileDescriptor(file.file_descriptor);
    }
    while (state.KeepRunning())
    {
        UnixFile file(tmp_filename, O_RDONLY);
        test::Array copy;
        copy.ParsePartialFromFileDescriptor(file.file_descriptor);
        for (int i = 0; i < 100000; ++i)
        {
            const test::LoadFastTest & test = copy.array(i);
            RAW_ASSERT(0.0f == test.vec().f0());
            RAW_ASSERT(0.0f == test.vec().f1());
            RAW_ASSERT(0.0f == test.vec().f2());
            RAW_ASSERT(0.0f == test.vec().f3());
            RAW_ASSERT(float(i) == test.f());
            RAW_ASSERT(i % 5 == test.i());
        }
        file.evict_from_os_cache();
    }
}
```

Which is… unpleasant. I can’t reuse my code for setting up the test data, so I had to inline it here. Also protobuf doesn’t provide an operator== for the classes that it generates, so I had to write the comparison code inline here as well. This awkwardness is actually the main reason why I wouldn’t use protobuf. I don’t want low quality classes that constantly require small bits of wrapper code all throughout my codebase. Anyway it is also really slow:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
ProtobufReading | 22447402 | 20409829 |

Since protobuf generates its own classes, it should be able to generate really fast code for reading and writing. I have no idea why it is so much slower than Boost Serialization. In any case, perhaps because protobuf is so slow, Google has a second library for reading and writing called Flatbuffers.

Flatbuffers also requires a description of your class in a separate type system. It looks like this for the above data:

```
namespace test_flatbuffers;
struct Vec4
{
    x : float = 0;
    y : float = 0;
    z : float = 0;
    w : float = 0;
}
table Monster
{
    vec : Vec4;
    i : int = 0;
    f : float = 0;
}
table Array
{
    array : [Monster];
}
root_type Array;
```

My class is named “Monster” because that is the name in the flatbuffers example. I used a “table” for the class because one cool feature of flatbuffers is that it doesn’t save members that have the default value. Meaning if you never assign a value to a member, it also doesn’t have to be saved. Unfortunately this feature doesn’t seem to work for struct members, which means that even if I never change the “vec” member, it will still always be written out to disk.

The flatbuffers interface is also a little bit curious due to the differences between structs and tables. It took me a really long time to write a simple test case. Again it’s a symptom of the same problem: they decided to use their own type system and to generate their own classes, and they are just not doing a good job at generating quality code. Here is what my test case looks like:

```cpp
void FlatbuffersReading(benchmark::State & state)
{
    namespace fb = flatbuffers;
    namespace tfb = test_flatbuffers;
    std::string filename = "/tmp/flatbuffers_test";
    {
        fb::FlatBufferBuilder builder;
        std::vector<fb::Offset<tfb::Monster>> monster_array;
        for (int i = 0; i < 100000; ++i)
        {
            tfb::Vec4 vec(0.0f, 0.0f, 0.0f, 0.0f);
            monster_array.push_back(test_flatbuffers::CreateMonster(builder, &vec, i % 5, float(i)));
        }
        auto vec = builder.CreateVector(monster_array);
        tfb::ArrayBuilder array(builder);
        array.add_array(vec);
        builder.Finish(array.Finish());
        std::ofstream file(filename);
        file.write(reinterpret_cast<const char *>(builder.GetBufferPointer()), builder.GetSize());
    }
    while (state.KeepRunning())
    {
        MMappedFileRead file(filename);
        const tfb::Array * read_array = tfb::GetArray(file.get_bytes().begin());
        auto array = read_array->array();
        int count = 0;
        for (auto it = array->begin(), end = array->end(); it != end; ++it)
        {
            RAW_ASSERT(it->vec()->x() == 0.0f);
            RAW_ASSERT(it->vec()->y() == 0.0f);
            RAW_ASSERT(it->vec()->z() == 0.0f);
            RAW_ASSERT(it->vec()->w() == 0.0f);
            RAW_ASSERT(it->f() == float(count));
            RAW_ASSERT(it->i() == count % 5);
            ++count;
        }
        RAW_ASSERT(count == 100000);
        file.close_and_evict_from_os_cache();
    }
}
```

One feature of flatbuffers is that it doesn’t actually “load” anything. It just creates a mapping onto the data that you provide it. In my case, since I mmap the file, this code would actually run really fast if I never used the data. If I remove the comparison code from the above test, my measurements are this:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
FlatbuffersReading | 208817 | 9814 |

Flatbuffers is insanely fast here because all it does is create a view onto the mapped data. As soon as I add the comparison code like in the above code snippet, I get these measurements though:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
FlatbuffersReading | 8954969 | 2633689 |

Which puts it at about the same speed as boost serialization. That’s only because I use the data just once, though: if I run the comparison code twice, flatbuffers runs more slowly than boost serialization. That’s because the flatbuffer types are really fast to load (in fact they load in zero time as long as you never use them) but slow to use. Here is the comparison to boost serialization when checking the results twice:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
FlatbuffersReading | 10637846 | 4099547 |
BoostSerializationReading | 9404415 | 8649244 |

So it’s not super slow, but it’s noticeably slower than accessing normal C++ objects.

One important feature of flatbuffers that I didn’t use above is that it can skip default values. In my comparison data above the Vec4 holds all default values, but flatbuffers doesn’t seem to be able to take advantage of that. If I instead leave all the values at their defaults, so that flatbuffers can actually skip them, the speed for loading and checking the results once is this:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466284 | 464406 |
MemcpyReading | 5580953 | 1627145 |
FlatbuffersReading | 6930696 | 2155805 |

This makes flatbuffers really interesting if your use case has a lot of default values or if you only have to look at parts of the data. If you’re careful with your data access, flatbuffers might only load what you access. (what actually gets loaded depends on the OS paging if you use mmap like I did)

Overall the above solutions are all disappointing though. I always had the memcpy comparison in there to show what the optimal number is. Both flatbuffers and protobuf should have an easy time reaching that number because they can generate their own code. So let’s try to write something that reaches that optimal number. To start, I will try to use an approach similar to boost serialization, but I will mmap the data in instead of going through std::istream. I also want an interface that is flexible enough that it can be used for a human readable json file, so I use macros. The solution that I end up with requires that I write the following function somewhere:

```cpp
REFLECT_CLASS_START(memcpy_speed_comparison, 0)
    REFLECT_MEMBER(vec);
    REFLECT_MEMBER(i);
    REFLECT_MEMBER(f);
REFLECT_CLASS_END()
```

This generates a function very similar to the boost serialization function above, but since I use macros I also have the names of the members, which makes it possible for me to write out a json file. For now though I will test the binary format.

The number zero in the macro REFLECT_CLASS_START() above is the version number. I want to offer similar versioning support to boost serialization, and for that I need the current version somewhere. That being said, my binary format does not support versioning. The way this will work is that json will be used to save all data while it is still being edited. Then there will be a compile step which loads the json data and writes out the binary data. (it’s a tiny program that literally just calls read_json followed by write_binary) After that compile step the data can be loaded very quickly without the overhead of the version support.

So I aim to provide all the functionality of the other libraries, just not in one file format. The above description is minimal enough that I have the flexibility to do this. I could even generate mappings for other programming languages from the above code. After all it’s really just a reflection system.
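I haven’t shown the macro implementation, but one plausible sketch of the expansion (my guess at the shape, not the actual metaf code) is to generate a function that hands every member, together with its name, to a generic visitor:

```cpp
#include <string>
#include <vector>

// Hypothetical expansion: each class gets a template function that visits
// every registered member with its name. A binary writer, a binary reader
// and a json writer are then just three different visitors.
#define REFLECT_CLASS_START(type, version)                  \
    template<typename Visitor>                              \
    void reflect_members(Visitor && visitor, type & object) \
    {
#define REFLECT_MEMBER(name) visitor(#name, object.name)
#define REFLECT_CLASS_END()  }

struct memcpy_speed_comparison
{
    float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    float f = 0.0f;
};

REFLECT_CLASS_START(memcpy_speed_comparison, 0)
    REFLECT_MEMBER(vec);
    REFLECT_MEMBER(i);
    REFLECT_MEMBER(f);
REFLECT_CLASS_END()

// example visitor: collect the member names, e.g. for a json writer
std::vector<std::string> member_names(memcpy_speed_comparison & object)
{
    std::vector<std::string> names;
    reflect_members([&](const char * name, auto &&) { names.push_back(name); }, object);
    return names;
}
```

Because the member names survive into the visitor, the same member list can drive both the binary format and the human readable json format.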

Here is the benchmark code for my binary format:

```cpp
void ReflectionReading(benchmark::State & state)
{
    std::string serialization_filename_fast = "/tmp/serialization_test_fast";
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    {
        std::ofstream file(serialization_filename_fast);
        metaf::BinaryOutput output(file);
        metaf::write_binary(output, elements);
    }
    while (state.KeepRunning())
    {
        MMappedFileRead file(serialization_filename_fast);
        metaf::BinaryInput input = file.get_bytes();
        std::vector<memcpy_speed_comparison> comparison;
        metaf::read_binary(input, comparison);
        RAW_ASSERT(comparison == elements);
        file.close_and_evict_from_os_cache();
    }
}
```

And here is how fast this is:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 484328 | 484871 |
MemcpyReading | 5590015 | 1616091 |
ReflectionReading | 5672753 | 1851692 |

It’s pretty much as fast as memcpy. To see how fast this really is I will try to get rid of the file overhead and just serialize to memory and read from memory. Here is my test for that:

```cpp
void ReflectionInMemory(benchmark::State & state)
{
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    std::stringstream buffer;
    metaf::BinaryOutput output(buffer);
    metaf::write_binary(output, elements);
    std::string in_memory = buffer.str();
    while (state.KeepRunning())
    {
        metaf::BinaryInput input({ reinterpret_cast<const unsigned char *>(in_memory.data()),
                                   reinterpret_cast<const unsigned char *>(in_memory.data() + in_memory.size()) });
        std::vector<memcpy_speed_comparison> comparison;
        metaf::read_binary(input, comparison);
        RAW_ASSERT(comparison == elements);
    }
}
```

And here is how fast this runs:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 484328 | 484871 |
MemcpyReading | 5590015 | 1616091 |
ReflectionInMemory | 598699 | 602207 |
ReflectionReading | 5672753 | 1851692 |

Remember that the MemcpyInMemory test just calls the std::vector copy constructor. Turns out you can get performance pretty close to that even when going through a stringstream. What is important to note here is that while I am going through the std::ostream interface for writing, I am not going through the std::istream interface for reading. For reading I am using an ArrayView class. (or array_ref or span or whatever it will end up being called once it’s standardized) So there are no virtual functions and the optimizer can go to town. In fact when you look at the assembly of this code, it’s pretty close to optimal. It’s just assigning the ints and floats that are members of the objects directly. I still haven’t shown the code for this, but remember that all I have is the REFLECT_MEMBER() calls from above. It should be straightforward to see how to expand that macro to assign from an ArrayView to those members with optimal assembly.
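To illustrate why the reading side optimizes so well, here is a minimal sketch of reading PODs out of a byte range (ByteView and read_pod are illustrative names, not the actual metaf interface):

```cpp
#include <cassert>
#include <cstring>

// Minimal stand-in for an ArrayView/span of bytes. Reading a POD is a
// fixed-size memcpy that the optimizer turns into a plain load.
struct ByteView
{
    const unsigned char * current;
    const unsigned char * end;
};

template<typename T>
T read_pod(ByteView & in)
{
    assert(in.current + sizeof(T) <= in.end);
    T result;
    std::memcpy(&result, in.current, sizeof(T));
    in.current += sizeof(T);
    return result;
}
```

Since nothing here is virtual, the whole read chain can be inlined, and each read_pod call compiles down to a single load of an int or float.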

One thing that is interesting is that even just with this code, I can be faster than memcpy. Let me make one change to the class that I’m working with:

```cpp
struct memcpy_speed_comparison
{
    alignas(16) float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    float f = 0.0f;
};
```

Just by adding alignas(16), my code is faster. Why would you align the float[4]? Because then you can use SIMD instructions directly without having to load the data unaligned. Here is the performance with this change:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 682278 | 680601 |
MemcpyReading | 7601910 | 2309319 |
ReflectionInMemory | 906126 | 906933 |
ReflectionReading | 6030785 | 2210939 |

Why is my solution faster? Because now there are eight bytes of padding at the end of the struct. The memcpy solution has to copy those bytes: it has no type information and doesn’t know what it is copying, so it has to copy the padding too. My method ignores the padding because I didn’t expose it to the system. Because of that it has to copy fewer bytes from disk, and in this benchmark the time spent loading from disk dominates.
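You can verify the padding with a couple of static_asserts: with alignas(16) the struct’s size gets rounded up to a multiple of its alignment, leaving eight bytes of tail padding after the last member.

```cpp
#include <cassert>

struct padded
{
    alignas(16) float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    float f = 0.0f;
};

// 16 bytes of vec + 4 + 4 = 24 bytes of members, rounded up to alignment 16
static_assert(alignof(padded) == 16, "alignas propagates to the struct");
static_assert(sizeof(padded) == 32, "eight bytes of tail padding");
```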

This is relevant even if you don’t have padding bytes. One approach that game developers use sometimes is what’s called “memcpy followed by pointer patching”. You use it if you are copying structs that are not just PODs. The way it works is you memcpy in the whole struct, and then you do a second pass over the data to initialize all of your pointer values, because you can’t persist pointers to disk using memcpy. So you write offsets into your pointer members and then patch those offsets once loading has finished. Except there are always edge cases where you don’t want to load a member from disk but want to use custom logic to initialize it later. Depending on how many of those members you have, this can be faster since memcpy still has to copy them in and my method can ignore them.

If you look at the measurements above, you will notice that most of the time for reading is spent idle. Out of the 6ms that it takes to load using reflection, the CPU is active for only 2.2ms. The remainder of the time it is just waiting for the disk. So if we could make the data that we load a little bit smaller, we could spend quite a lot of CPU time decompressing the data, and loading should be faster overall.

LZ4 is a compression library that promises that it has an “extremely fast decoder” which sounds exactly like what I need. I have to confess that I didn’t try any other compression library, (Brotli and LZHAM also claim to decompress quickly) but the reason for that is that I got very good results from LZ4, and that LZ4 just comes across as a high quality library. (it was easy to integrate and has simple, clean function interfaces) Meaning I am very happy with it and didn’t feel the need to investigate anything else.

So what happens when we compress our data before writing it to disk and decompress it when reading it back? Things get a lot faster:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 466064 | 466548 |
MemcpyReading | 5457860 | 1358706 |
MemcpyCompressedReading | 2210455 | 1407887 |
ReflectionInMemory | 597280 | 597650 |
ReflectionReading | 5553340 | 1535985 |
ReflectionCompressedReading | 2321451 | 1456029 |

(I undid the alignas(16) change from above before running this)

I expected CPU time to go up and overall time to go down. Overall time did go down, but CPU time didn’t go up as expected. I think the reason is that since I’m reading fewer bytes from disk (my test data compresses very well, down to 500 KB from 2.4 MB) I’m saving a lot of CPU instructions simply because I have to read less. So even though LZ4 requires a bunch more processing, I’m saving CPU time in other places and I come away from this with unchanged CPU time. (this does suggest that I should investigate other compression libraries which have better compression at the cost of a slower decoder, because there should be further gains to be had here, but I’ll leave that as an exercise for the reader)

I wouldn’t always expect to get gains this strong. My test data is pretty simple. This is the point where I should run this on real data, but I don’t have real data readily available. If I try to randomize the data I can see that the amount of compression varies with how much randomness I put in. If I just use a std::uniform_real_distribution, the compressed version is slower than the uncompressed version. If I use a std::binomial_distribution or a std::geometric_distribution to better approximate real world data, I still see big benefits. But at that point I’m just testing my assumptions about compression, which isn’t terribly useful for real world performance. So I’ll have to decide on some real data to use and run this again at some point.

One thing that C++ has taught us is that you can write faster code if you have type information. That’s why std::sort is faster than qsort. So I wanted to apply that idea to this problem. Let me start off by saying that it didn’t entirely work. My type aware compression attempts are faster than normal memcpy, but they are not faster than memcpy with LZ4 compression. But I thought this is interesting, and I’ll definitely revisit these when I’m working on real data to see how they work there.

My first method of compression is to store any int that fits into seven bits as only one byte. I’m using a method similar to UTF-8, where the first seven bits of each byte contain data and the eighth bit indicates whether there is more data to come. The result is that 32 bit integers can be anywhere between 1 and 5 bytes in size, and 64 bit integers can be anywhere between 1 and 9 bytes in size. I can only think of one case where you’re going to hit the worst case, and that is when storing hashes. For those I would eventually have some kind of type tag to indicate that hashes should be stored without compression, so that they always use four bytes instead of five. But for all other ints, you will usually benefit a lot from this compression.
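The encoding can be sketched in a few lines (a generic varint with little-endian seven bit groups; the exact on-disk layout of my format may differ):

```cpp
#include <cstdint>
#include <vector>

// Each byte stores seven bits of the value; the high bit says "more follows".
// Small values take one byte, a full 32 bit value takes up to five.
void write_varint(std::vector<unsigned char> & out, std::uint32_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<unsigned char>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value));
}

std::uint32_t read_varint(const unsigned char *& in)
{
    std::uint32_t result = 0;
    int shift = 0;
    while (*in & 0x80)
    {
        result |= std::uint32_t(*in++ & 0x7f) << shift;
        shift += 7;
    }
    return result | (std::uint32_t(*in++) << shift);
}
```

For example 100 fits in one byte, while 1000000 needs three, and only a value with bits set above bit 27 pays for the full five bytes.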

Floating point numbers are a bit more complicated: it’s difficult to store them with incremental precision like I did for ints. A 32 bit float consists of one sign bit, eight exponent bits and 23 mantissa bits. If the eight exponent bits are all 1, that indicates that the number is either NaN or infinity. So if I drop support for NaN and infinity, I can instead use that special case to indicate that this is a compressed float.

So I first read sixteen bits of a float, which includes the sign bit, the eight exponent bits, and seven bits of the mantissa. If one of the exponent bits is zero, I read the remaining sixteen bits and assign the whole float. If they are all 1 however, I know that there is a compressed float stored in the seven bits of the mantissa. A lot of useful numbers can be expressed with that, like 0, 0.125, 0.5, 1, 2, 3, 4, 10, the negative versions of all of those, and 240 other numbers close to zero. So I am satisfied that my compression case will be hit often enough. In the case where it doesn’t get hit, the float will still only use four bytes on disk and I will just waste a few CPU cycles that I would otherwise spend waiting for disk.

Once I defined this eight bit float format, (sign bit plus three bits exponent plus four bits mantissa) the question arose of what to do with denormalized floats in this new format. I could use them to indicate special common values like 0.1, and I could also use them to support NaN and infinity again. I have an implementation of supporting NaN and infinity using that method, but I don’t have it enabled by default. The reason for that is that I have to be a bit careful with code size. Any time that I increase the code size, the inliner changes behavior. Since my code needs to be inlined to get its performance, I need to be careful.
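Decoding such an eight bit float could look like the following sketch, where I treat the sign bit and the seven payload bits as one byte and assume an exponent bias of 3 (the real format’s bias and denormal handling may differ):

```cpp
#include <cmath>

// Hypothetical decoder for a 1-3-4 minifloat: sign bit, three exponent bits
// (assumed bias 3), four mantissa bits. Denormals (exponent == 0) extend the
// range near zero, exactly like they do for regular IEEE floats.
float decode_minifloat(unsigned char byte)
{
    float sign = (byte & 0x80) ? -1.0f : 1.0f;
    int exponent = (byte >> 4) & 0x7;
    int mantissa = byte & 0xf;
    if (exponent == 0)
        return sign * std::ldexp(mantissa / 16.0f, 1 - 3); // denormal range
    return sign * std::ldexp(1.0f + mantissa / 16.0f, exponent - 3);
}
```

With these assumptions the byte 0x30 decodes to 1.0, 0x40 to 2.0, 0x64 to 10.0 and the denormal 0x08 to 0.125, which covers the kind of useful numbers close to zero mentioned above.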

I have not written compression code for double precision numbers yet. That is probably also worth doing, I just haven’t thought about it yet.

The final method of compression that I wrote is the same one that flatbuffers uses: If a value is never changed, don’t write it to disk. This is especially useful for big classes that have grown over the years. For example in Bullet Physics a rigid object can be marked as “static” in which case many of the members on there can never be modified. And the majority of physics objects are static objects. Wouldn’t it be nice if all the meaningless members that are never changed just weren’t saved and loaded?

The question is how can I tell whether a member has been modified? Easy: Serialize out a default constructed struct and remember what all the values were. If a member on another struct has the same value as it had on the default constructed one, I don’t need to save it.

For this I store a small bitset with each struct that indicates which members are saved on disk and which aren’t. For small structs this bitset is just one byte, if you have more than eight members it’s two bytes, if you have more than sixteen members it’s four bytes and if you have more than 32 members it’s eight bytes. I don’t support more than 64 members. If you have more members than that, you have to create a nested struct. It’s not super complicated to support more than 64 members, but for now I just made my life easy.
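Here is a hand-rolled sketch of that idea for a hypothetical struct (the real version would be generated from the reflection macros rather than written per class):

```cpp
#include <cstdint>

// hypothetical example inspired by the static rigid body case above
struct rigid_body
{
    float mass = 1.0f;
    float friction = 0.5f;
    int flags = 0;
};

// One bit per member: set only if the member differs from the value it has
// in a default constructed instance. Members whose bit is clear are never
// written to disk, and reading just leaves the default in place.
std::uint8_t changed_member_bits(const rigid_body & value)
{
    static const rigid_body defaults;
    std::uint8_t bits = 0;
    if (value.mass != defaults.mass)         bits |= 1 << 0;
    if (value.friction != defaults.friction) bits |= 1 << 1;
    if (value.flags != defaults.flags)       bits |= 1 << 2;
    return bits;
}
```

A fully default object then costs one byte on disk, which is exactly the win you want for the static physics objects described above.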

If I change my test case so that the float[4] member is never changed and so that both the int and the float are filled with random numbers that mostly benefit from my compression, I get this speed:

Benchmark | Time(ns) | CPU(ns) |
---|---|---|
MemcpyInMemory | 482933 | 483374 |
MemcpyReading | 5467386 | 1410453 |
MemcpyCompressedReading | 2525274 | 1779318 |
ReflectionInMemory | 1899832 | 1903480 |
ReflectionReading | 2692708 | 2015040 |
ReflectionCompressedReading | 2953027 | 2203459 |

Look at how close ReflectionReading now is to MemcpyCompressedReading. ReflectionCompressedReading became slower because now I’m compressing the same data twice and paying for the overhead of that. The biggest speed-up comes from the fact that I’m just not loading 2/3 of the data because it is unchanged, so I don’t have to persist it to disk. Is that a realistic number? I think it is, but it depends on your data. The LZ4 compressed version is also benefitting from that.

So type aware compression gets pretty damn close to type-unaware compression. But it doesn’t beat it. I have ideas for how to improve this further: Right now I’m looking at each object separately. If I could just say “object number 500 is exactly like object number 210 except the transform has changed” then I could compress this a lot more. But generating the code for that is difficult with what I have available at compile time in C++. I’m also thinking about possibly solving that at a higher level. But right now I don’t have a good approach for this idea.

But writing these has been a lot of fun. It’s really rewarding to do these micro optimizations like saving bytes off of an int. There are so many more ideas for this: you could compress strings based on what they are used for (paths, menu options and scripts all use different character sets), you could save a matrix as a vector, quaternion and scale and save six floats that way, or you could compress normals (graphics programmers really like doing this). That last one is a lossy compression, which LZ4 is not allowed to do. If you know that that’s OK sometimes, you can save more bytes.

But for now I admit defeat and say that I recommend using LZ4.

Could boost serialization, protobuf and flatbuffers also benefit from these optimizations?

Since my library has a similar implementation to Boost Serialization, I believe that all the same optimizations would also apply to that library. The main difference between my library and Boost Serialization is that I support writing a human readable json, and that I don’t support versioning in my binary format. Neither of these is very important for the above optimizations.

Flatbuffers would have a hard time because it wants to do zero processing on loading. I believe that it would still see significant speed ups if it used the type aware compression to compress its ints and floats, but that speed up would turn into a slowdown if you use the members of the loaded data more than once.

Protobuf could use the same optimizations. In fact protobuf should be the fastest of all of these libraries because it can generate its own code. I have no idea why it is so slow.

Maybe if I had looked at other libraries that generate custom code for loading I would have found one that is as fast as memcpy. But even then, if you apply the above optimizations to a library like that, what is the point? The big trade-off of a library that generates classes for you is that you give up code quality in order to get fast loading and saving as well as support in multiple languages. If I can make loading and saving just as fast for any custom class, why would I use a library that generates code? And I am also convinced that I can generate support for other languages from my macros. I haven’t written that code yet, but it seems straightforward: I can just write out a description of all the members, so if I wanted to do this cheaply I could generate the protobuf file from my macros. Then all I would have to do is write loading code that understands the protobuf file format. So I honestly can’t think of a reason why you would use a library that generates code for you based on a description in a parallel type system. Just use normal classes and write high quality code.

I have uploaded the code for all of this to github. Unfortunately I can’t claim that this library is finished. It works, but it needs a few good clean ups. There is also no documentation. I wouldn’t recommend that you use it for production. That being said for me personally the next step is to use it in production because I believe that only by taking that step can I get the quality to where it has to be. The best way to get a proof of concept to a releasable quality level is to just use it and see what issues you run into. But for anyone else I would consider this as just that: a proof of concept.
