I can program and like games

## Tag: C++

### I Wrote a Faster Sorting Algorithm

These days it’s a pretty bold claim to say that you invented a sorting algorithm that’s 30% faster than the state of the art. Unfortunately I have to make a far bolder claim: I wrote a sorting algorithm that’s twice as fast as std::sort for many inputs. And except when I specifically construct cases that hit my worst case, it is never slower than std::sort. (And even when I hit those worst cases, I detect them and automatically fall back to std::sort.)

Why is that an unfortunate claim? Because I’ll probably have a hard time convincing you that I did speed up sorting by a factor of two. But this should turn out to be quite a lengthy blog post, and all the code is open source for you to try out on whatever your domain is. So I might either convince you with lots of arguments and measurements, or you can just try the algorithm yourself.

Following up from my last blog post, this is of course a version of radix sort. Meaning its complexity is lower than O(n log n). I made two contributions:

1. I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for relatively small collections. (starting at 128 elements)
2. I generalized in-place radix sort to work on arbitrary sized ints, floats, tuples, structs, vectors, arrays, strings etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

If you just want to try the algorithm, jump ahead to the section “Source Code and Usage.”

I recently learned how radix sort works, and in hindsight it’s weird that I never really learned about it before, and that it doesn’t seem to be widely used. In this blog post I claim that std::sort should use radix sort for large arrays, and I will provide a simple implementation that does that.

But first an explanation of what radix sort is: Radix sort is an O(n) sorting algorithm that works on integer keys. I’ll explain below how it works, but the claim that there’s an O(n) sorting algorithm was surprising to me the first time that I heard it. I always thought there were proofs that sorting had to be O(n log n). Turns out sorting has to be O(n log n) only if you use the comparison operator to sort. Radix sort does not use the comparison operator, and because of that it can be faster.
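To make the “no comparison operator” point concrete, here is a minimal least-significant-byte radix sort for 32-bit unsigned integers. This is a simplified sketch of the general technique, not the optimized in-place version this post describes:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort 32-bit unsigned ints with four counting-sort passes, one per byte.
// No element is ever compared to another element: each pass is O(n),
// so the whole sort is O(n).
void radix_sort_u32(std::vector<std::uint32_t> & data)
{
    std::vector<std::uint32_t> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        // histogram of the current byte
        std::array<std::size_t, 256> count{};
        for (std::uint32_t x : data)
            ++count[(x >> shift) & 0xff];
        // exclusive prefix sum: count[b] becomes the first output
        // position for byte value b
        std::size_t total = 0;
        for (std::size_t & c : count)
        {
            std::size_t old = c;
            c = total;
            total += old;
        }
        // stable scatter into the buffer, then reuse it as the input
        for (std::uint32_t x : data)
            buffer[count[(x >> shift) & 0xff]++] = x;
        data.swap(buffer);
    }
}
```

Because each pass is stable, sorting by the low byte first and the high byte last leaves the whole array in order.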

The other reason why I never looked into radix sort is that it only works on integer keys. Which is a huge limitation. Or so I thought. Turns out all this means is that your struct has to be able to provide something that acts somewhat like an integer. Radix sort can be extended to floats, pairs, tuples and std::array. So if your struct can provide for example a std::pair<bool, float> and use that as a sort key, you can sort it using radix sort.
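For floats, the usual way to get such an integer is the standard bit trick: reinterpret the float’s bits and fix up the sign so that unsigned integer order matches float order. A sketch of that well-known technique (not necessarily the exact key extraction this post uses):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Map a float to a uint32_t whose unsigned order matches the float order.
std::uint32_t float_to_radix_key(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // type-pun without UB
    if (bits & 0x80000000u)
        return ~bits;                  // negative: flip all bits to reverse the order
    else
        return bits | 0x80000000u;     // positive: set the sign bit to sort above negatives
}
```

With a key like this, an integer radix sort applies unchanged to floats.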

### C++11 Completed RAII, Making Composition Easier

The addition of move semantics in C++11 is not just a performance and safety improvement. It’s also the feature that completed RAII. And as of C++11 I believe that RAII is absolutely necessary to make object composition easy in the language.

To illustrate let’s look at how objects were composed before C++11, what problems we ran into, and how everything just works automatically since C++11. Let’s build an example of three objects:

```
struct Expensive
{
    std::vector<float> vec;
};
struct Group
{
    Group();
    Group(const Group &);
    Group & operator=(const Group &);
    ~Group();
    int i;
    float f;
    std::vector<Expensive *> e;
};
struct World
{
    World();
    World(const World &);
    World & operator=(const World &);
    ~World();
    std::vector<Group *> c;
};
```

Before C++11, composition looked something like this. It was OK to have a vector of floats, but you’d never have a vector of more expensive objects, because any time that vector re-allocates, you’d have a very expensive operation on your hands. So instead you’d write a vector of pointers. Let’s implement all those functions:
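For contrast, here is what the same composition can look like since C++11 (a sketch: move semantics make reallocation cheap, so the pointers and all five hand-written special member functions simply disappear):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Since C++11: hold values directly. Vector growth moves the elements
// instead of copying them, and the compiler-generated copy, move and
// destructor do the right thing, recursively.
struct Expensive
{
    std::vector<float> vec;
};
struct Group
{
    int i;
    float f;
    std::vector<Expensive> e; // values, not Expensive *
};
struct World
{
    std::vector<Group> c; // values, not Group *
};
```

Copying a World still deep-copies everything, but moving it is just a few pointer swaps, which is exactly what a growing vector does.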

### plalloc: A simple stateful allocator for node based containers

Allocators in C++ are awkward classes. Allocators are usually zero size to make the containers small, and they have no state because that used to be difficult to do in C++. But nowadays we don’t really care about the size of our containers (we care about the size of the content of the containers) and since C++11 we have support for stateful allocators.

The problem I’m trying to solve is that node based containers have bad cache behavior because they allocate all over the place. So I wrote a small allocator which gives out memory from a contiguous block. It speeds up std::map and std::unordered_map. It’s called plalloc because I think this is a pool allocator.

Code is below:
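To illustrate the idea (this is my own minimal sketch of a stateful pool allocator, not plalloc itself): hand out nodes from large contiguous blocks and recycle freed nodes through an intrusive free list, which is all a node based container ever asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <type_traits>
#include <vector>

// Minimal C++11 stateful pool allocator (a sketch, not plalloc).
// Single-object allocations come from contiguous blocks; freed nodes go
// onto an intrusive free list for reuse.
template<typename T>
class pool_allocator
{
public:
    using value_type = T;

    pool_allocator() : state(std::make_shared<pool_state>()) {}
    pool_allocator(const pool_allocator &) = default; // copies share the pool
    template<typename U>
    pool_allocator(const pool_allocator<U> &)     // a rebound copy gets its own
        : state(std::make_shared<pool_state>()) {} // pool, sized for its own T

    T * allocate(std::size_t n)
    {
        if (n != 1) // node based containers only ever ask for one node
            return static_cast<T *>(::operator new(n * sizeof(T)));
        if (!state->free_list)
            grow();
        node * result = state->free_list;
        state->free_list = result->next;
        return reinterpret_cast<T *>(result);
    }
    void deallocate(T * ptr, std::size_t n)
    {
        if (n != 1)
        {
            ::operator delete(ptr);
            return;
        }
        node * returned = reinterpret_cast<node *>(ptr);
        returned->next = state->free_list;
        state->free_list = returned;
    }

    bool operator==(const pool_allocator & other) const { return state == other.state; }
    bool operator!=(const pool_allocator & other) const { return state != other.state; }

private:
    union node
    {
        node * next;
        typename std::aligned_storage<sizeof(T), alignof(T)>::type storage;
    };
    struct pool_state
    {
        node * free_list = nullptr;
        std::vector<std::unique_ptr<node[]>> blocks;
    };
    void grow()
    {
        static const std::size_t block_size = 256;
        state->blocks.emplace_back(new node[block_size]);
        node * block = state->blocks.back().get();
        for (std::size_t i = block_size; i-- > 0;)
        {
            block[i].next = state->free_list;
            state->free_list = &block[i];
        }
    }
    std::shared_ptr<pool_state> state;
};
```

Dropping it into std::map is one extra template argument: std::map&lt;int, int, std::less&lt;int&gt;, pool_allocator&lt;std::pair&lt;const int, int&gt;&gt;&gt;. The map rebinds the allocator to its internal node type, so all tree nodes end up in the same contiguous blocks.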

### Comma Operator RAII Abuse

Here’s a neat little trick that one of my co-workers, Clint Levijoki, discovered. In C++ you often use an RAII wrapper that you place on the stack if you want to be sure that code gets run at a later point. One good example would be std::lock_guard which you use if you want to be sure that a lock gets released in the future, or scoped profiling which you use to stop a timer in the future. For example for this:

```
std::string foo();
void bar()
{
    std::string baz = foo();
    // do something with baz
}
```

If you want to profile foo() you’d write it like this:

```
std::string foo();
void bar()
{
    std::string baz;
    {
        ScopedProfiler profile_foo("foo()");
        baz = foo();
    }
    // do something with baz
}
```

Which is less pretty and slightly slower. Alternatively you can use the comma operator and do it like this:

```
std::string foo();
void bar()
{
    std::string baz = (ScopedProfiler("foo()"), foo());
    // do something with baz
}
```

And this will start a timer before calling foo(), and stop the timer after calling foo(). You could wrap it in a macro to make it more readable. And the benefit is obviously that you don’t have to destroy your function flow when you want to insert RAII objects.
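A minimal sketch of how that could look (this ScopedProfiler is a stand-in that just prints the elapsed time; the real one presumably feeds a profiler):

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Stand-in profiler: start a timer on construction, report on destruction.
struct ScopedProfiler
{
    explicit ScopedProfiler(const char * name)
        : name(name), start(std::chrono::steady_clock::now())
    {
    }
    ~ScopedProfiler()
    {
        auto elapsed = std::chrono::steady_clock::now() - start;
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        std::printf("%s: %lld ns\n", name, ns);
    }
    const char * name;
    std::chrono::steady_clock::time_point start;
};

// The comma operator constructs the profiler, then evaluates the expression;
// the temporary is destroyed at the end of the full expression, stopping the timer.
#define PROFILE_EXPR(name, expr) (ScopedProfiler(name), (expr))
```

So the example becomes std::string baz = PROFILE_EXPR("foo()", foo()); — one line, and the RAII object never interrupts the function flow.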

### Another Opinion on “Almost Always Auto”

Herb Sutter has been promoting his almost always auto style again, and I think it is harmful. I would agree with “almost always auto” in Scala. I disagree with it in C++. And that’s because there is a slight difference in syntax for type inference between the two languages.

Here’s type deduction in Scala:

```
val inferred = 0
val typed: Int = 1
```

And here it is in C++:

```
auto inferred = 0;
int typed = 1;
```

Seems similar, right? But the difference in syntax leads to different long term programmer behavior.
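One concrete way the C++ behavior diverges (my example, not Sutter’s): because auto makes the declared type follow the initializer, it will happily deduce a type you never meant to hold, where the typed form would have converted:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

std::vector<bool> flags{true, false};
bool typed = flags[0];    // the proxy converts to bool immediately
auto inferred = flags[0]; // deduces std::vector<bool>::reference, a proxy object

static_assert(!std::is_same<decltype(inferred), bool>::value,
              "auto deduced the proxy, not bool");
```

In Scala, writing the type is purely additive to the inferred form, so annotating is painless; in C++, auto and the explicit type are competing spellings, and they don’t always mean the same thing.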

### I Wrote a Faster Hash Table

edit: turns out you can get an even faster hash table by using this allocator with boost::multi_index. I now recommend that solution over the hash table from the post below. But anyway here is the original post:

This is a follow-up post to “I Wrote a Fast Hash Table.”

I’ve spent some time optimizing my sherwood_map implementation and now it is faster than boost::unordered_map and boost::multi_index. Which is what I would have expected from my first implementation, but it turns out that those containers are pretty damn fast.

If you don’t know what Robin Hood Linear Probing is I would encourage you to read the previous post and the post that I linked to from that one. With that said let’s talk about details.

### I Wrote a Fast Hash Table

As others have pointed out, Robin Hood Hashing should be your default hash table implementation. Unfortunately I couldn’t find a hash table on the internet that uses Robin Hood Hashing while offering an STL-style interface. I know that some people don’t like the STL, but I’ve found that those people tend to write poorer interfaces. So I learn from that by not trying to invent my own interface. You can use this hash table as a replacement for std::unordered_map, and you will get a speedup in most cases.

In order to reduce name conflicts I call it sherwood_map, because Robin Hood Hashing.
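The core quantity in Robin Hood hashing is how far an element has landed from its ideal slot. A sketch of that distance computation (not sherwood_map’s actual code):

```cpp
#include <cassert>
#include <cstddef>

// Distance between an element's current slot and its ideal slot, with
// wrap-around for elements pushed past the end of the table.
std::size_t distance_from_ideal(std::size_t hash, std::size_t index, std::size_t capacity)
{
    return (index + capacity - hash % capacity) % capacity;
}
```

During insertion you compare this distance for the incoming element against the distance of the element already sitting in the slot; if the resident is closer to its ideal slot than the incoming element, the resident is the “rich” one and gets displaced, which keeps probe lengths uniformly short.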

### Reinventing the Wheel for Better Compile Time

I have a new laptop on which everything compiles slowly. Which means I sometimes get really annoyed with C++ and its slow compile times. The two most common causes of slow compile times are large headers and complicated templates. Large headers are somewhat easy to identify, and you can usually improve your compile time by using precompiled headers, but it has traditionally been hard to find out which templates are hurting your compile time. Luckily there is a project called Templight which can tell you how much time Clang is spending on instantiating templates. Using Templight I have found that some templates take unnecessarily long to compile.

So I re-implemented std::unique_ptr, boost::flat_map, and await.

### Introducing the Asserting Mutex

Multithreaded code in C++ tends to become brittle over time. If you write your code well you’ll need almost no synchronization between different threads, but the price of that is that your code will be littered with undocumented conventions about when you can read or modify which state. In your average threaded C++ application there are countless potential race conditions, all of which never happen because people follow conventions about when to do what. Until someone doesn’t know about a convention they have to follow, or until you change the conventions and forget to update one piece of code that you didn’t know about.

Enter the asserting mutex. The asserting mutex is a conditional breakpoint that triggers only if a potential race condition actually happens. I call it asserting mutex because you use it like a mutex to protect a critical section. It works very simply: If one thread locks the asserting mutex and a second thread attempts to lock the asserting mutex before the first thread unlocks it, you get an assert. And it guarantees that both threads will still be inside the critical section when you get the assert. The cost is one atomic increment and one atomic decrement, which is not free but cheap enough that you can place lots of asserting mutexes in your code before they cause problems. So you could use this to document many of your threading conventions. Used correctly this is a breakpoint that makes it very easy to find data races.
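A minimal version of that mechanism, one atomic counter and an assert on contention (this is my sketch, not the complete implementation):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// A minimal asserting_mutex along the lines described: lock() never blocks,
// it just asserts that no other thread is currently inside the critical
// section. The cost is one atomic increment plus one atomic decrement.
struct asserting_mutex
{
    void lock()
    {
        // fetch_add returns the previous value; anything but 0 means
        // another thread is between its lock() and unlock()
        int previous = count.fetch_add(1, std::memory_order_acquire);
        assert(previous == 0 && "potential race condition detected");
        (void)previous;
    }
    void unlock()
    {
        count.fetch_sub(1, std::memory_order_release);
    }
    std::atomic<int> count{0};
};
```

Because it provides lock() and unlock(), std::lock_guard&lt;asserting_mutex&gt; works on it directly, so it drops into existing code wherever you would otherwise only document a threading convention.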

Here is the complete code: