Reinventing the Wheel for Better Compile Time

by Malte Skarupke

I have a new laptop on which everything compiles slowly. Which means I sometimes get really annoyed with C++ and its slow compile times. The two most common causes of slow compile times are large headers and complicated templates. Large headers are somewhat easy to identify, and you can usually improve your compile time by using precompiled headers, but it has traditionally been hard to find out which templates are hurting your compile time. Luckily there is a project called Templight which can tell you how much time Clang is spending on instantiating templates. Using Templight I have found that some templates take unnecessarily long to compile.

So I re-implemented std::unique_ptr, boost::flat_map, and await.


std::unique_ptr should be very simple. All you need to do is disable the copy constructor, implement a move constructor, and delete the pointer in your destructor. But it gets more complicated because you can specify a custom deleter, and if that custom deleter has zero size, std::unique_ptr will store it using the empty base class optimization so that it takes up no space. But if you use a function pointer as your deleter it can’t do that, because you can’t inherit from a pointer. So there has to be some compile time logic that determines how std::unique_ptr stores the deleter object. I use std::unique_ptr a lot, and that compile time logic eventually starts to hurt you. Especially if all you ever do is use the default deleter, in which case you don’t even need that logic.

I implemented a version of unique_ptr, which I call dunique_ptr (for default unique_ptr), that only has the behavior of std::unique_ptr<T, std::default_delete<T>>, and measured its compile time against std::unique_ptr:

Num Instantiations   dunique_ptr (seconds)   std::unique_ptr (seconds)
2                    0.76                    0.88
4                    0.78                    1.00
8                    0.80                    1.03
16                   0.85                    1.44
32                   1.11                    1.89
64                   1.18                    3.32
128                  1.63                    5.57
256                  2.54                    10.01

Which means that dunique_ptr takes roughly 30ms less to compile per instantiation on my laptop, which I am pretty happy with. And since I was using unique_ptr several times in a header that gets commonly included, this would actually add up to a few seconds in my project.

Depending on how brave you are in dealing with header include order and all the problems that can come from it, you can even use this implementation as the template specialization of std::unique_ptr<T, std::default_delete<T>>, and you’ll get the same improvement in compile time. To do that you just do this:

namespace std
{
template<typename T>
class unique_ptr<T, std::default_delete<T>>
{
    // add the code from the link above
};
}
This can easily break though if you ever forget to include the header in which you do this, and debugging the resulting ODR bugs can be a real pain. It would be great if the maintainers of standard libraries did this instead. The savings in time and electricity would quickly add up.



boost::flat_map is a sorted vector with an interface similar to std::map. It gives you faster lookup than std::map, but insertion and removal complexity is O(n). So depending on how often you add or remove elements vs how often you look up or modify existing elements, boost::flat_map can be much faster than std::map. Unfortunately it compiles slowly, so I re-implemented it.

The biggest problem with boost::flat_map (which it shares with many boost libraries) is the size of its header. My test file compiles in 2.23 seconds if I #include my flat_map, and in 3.03 seconds if I #include boost::flat_map. Those 0.8 seconds quickly add up if you use boost::flat_map in a header that is included often. I have also run a test similar to the one above, where I instantiate the template very often, and here are my timings:

Number of Instantiations   my flat_map (seconds)   boost::flat_map (seconds)
2                          2.87                    3.82
4                          3.48                    4.52
8                          4.59                    5.92
16                         6.91                    8.84
32                         11.61                   14.52
64                         20.96                   26.13
128                        40.08                   50.19

Which I am quite happy about. The number of instantiations is lower because flat_map is a bigger template than unique_ptr, but also because I am instantiating a more complex test case. All that means though is that you can benefit from the reduced compile time more quickly.

To my surprise my flat_map compiles more slowly than boost::flat_map when I instantiate the boost::flat_map unit test. When I compile boost’s flat_tree test with boost::flat_map it takes 17 seconds, compared to 18 seconds for my flat_map. But using Templight I quickly found out why: it turns out that the range insert function of boost::flat_map is quite simple and has O(n^2) complexity, even though the documentation claims that it’s O(n log(n)). My flat_map has a range insert function which actually has O(n log(n)) complexity. That makes quite a big difference: when I insert 10,000 random integers into a map it takes 16ms for my flat_map, 316ms for boost::flat_map and 28ms for std::map. And since boost::flat_map’s range insert is in a different complexity class, it’s trivial to make it look bad: if I insert 100,000 random integers instead, those numbers go up to 180ms, 30,751ms and 443ms respectively.

The price for that is that my flat_map compiles more slowly if you use range insert (which you also use when you use the initializer_list constructor). I think that’s a price worth paying though, because the range insert function makes it easier to have large flat_maps. If you want a large flat_map using boost::flat_map, you have to sort your elements before you insert them.

But why does my flat_map compile faster in most cases? For one thing it does less than boost::flat_map: I don’t provide flat_multimap, flat_set and flat_multiset. But looking at the Templight output it’s not obvious what gives me a big benefit; it looks like it’s just many small details. One of those details is that I use std::vector where boost uses boost::vector, and apparently the libstdc++ version of vector compiles ever so slightly faster.


await was a proposed new language keyword that would have made working with asynchronous code easier. But instead of explaining it I’ll defer to the paper N3722 and to this talk by Deon Brewis for motivation. I recommend that you watch at least the first few minutes of the talk because await is great. It wasn’t long until someone found out that you can implement await using coroutines without needing to extend the language. There’s an implementation by Evgeny Panasyuk here. Which is very clever, but it compiles slowly.

There is no simple link for my implementation because you can’t implement await in a single header. To implement await you need coroutines, so I also provide my own implementation of those. This also comes with a variation of my implementation of std::function and an extension to std::future that gives it a .then() method, which made this end up as something like ten files in total.

So here’s a link to a github folder.

Brief instructions on how to use it: call a function with resumable(), then use the await macro on futures, like in the linked talk above. When the future’s result is ready you can continue your function by calling await_tasks_to_finish.run_single_task(), where await_tasks_to_finish is a global variable. This means that, unlike the proposed addition to the standard, there is no magic scheduling going on here, and at some point you have to take action to continue suspended functions.

But the main point here is compile time, and a file that includes Panasyuk’s header takes 11 seconds to compile, while the same file including my header takes 3 seconds. Compiling the example from Panasyuk’s implementation takes 18 seconds, compared to 8 seconds if I compile the same example using my implementation. Which is not ideal, but Templight tells me that I’m spending compile time all over the place with no obvious point for optimization. The only class that seems too slow is std::thread (don’t ever use std::bind or std::tuple, folks, they make your class compile slowly) but I don’t want to re-implement that. Also std::thread is not really part of the await implementation; it’s just there for the example.

One comment about await though: I don’t plan on using this implementation. It’s a neat implementation because it has the same syntax as the proposed language extension, but it bothers me that you can accidentally call await when it is not valid to do so. Because of that I’ll probably switch it away from being a macro to being a function on an object. And I’ll only give you that object when it is valid to call await. That’s for a different post though which is not about reinventing the wheel.