I Wrote a Fast Hash Table

by Malte Skarupke

As others have pointed out, Robin Hood Hashing should be your default hash table implementation. Unfortunately I couldn’t find a hashtable on the internet that uses robin hood hashing while offering a STL-style interface. I know that some people don’t like the STL but I’ve found that those people tend to write poorer interfaces. So I learn from that by not trying to invent my own interface. You can use this hash table as a replacement for std::unordered_map and you will get a speedup in most cases.

In order to reduce name conflicts I call it sherwood_map, because Robin Hood Hashing.

It’s good to have bad jokes in your code. Other than that name though there are good reasons to prefer this hash table over std::unordered_map.

The main benefit of sherwood_map over unordered_map is that it doesn’t put every element into it’s own heap allocation. Less allocations give you a speedup because a) less allocations and b) more cache hits.

To be up front here is the list of limitations:

  • No support for custom allocator propagation. Sorry. I didn’t know about this feature and I learned about it too late. It can be quite a pain to implement so I haven’t done it yet.
  • Less exception safety. I assume noexcept moves. If you have that you can throw pretty much anywhere else though.
  • Only load factors up to 1
  • No iterations within a bucket. I don’t know what that interface is for and it is difficult to implement in Robin Hood Hashing.
  • Any modification to the container will invalidate all iterators and references

I think those limitations are OK.

But let’s get to performance tests. Since sherwood_map may have to move elements around when the map gets modified, that would suggest that sherwood_map may be slower than unordered_map for large objects. Because of that I ran all performance tests with a small, (sherwood_map<int, int>) a medium-sized, (sherwood_map<pair<int, string>, pair<int, string>>) and a large (sherwood_map<int, char[512]>) object.

Clang 3.4 GCC 4.8.2
sherwood_map unordered_map speedup sherwood_map unordered_map speedup
insert small 113 134 16% 106 136 23%
medium 124 150 17% 128 152 16%
large 326 393 17% 286 387 26%
reserve insert small 72 119 39% 72 119 46%
medium 80 131 39% 82 136 39%
large 147 342 57% 114 335 66%
erase small 137 196 30% 149 199 25%
medium 205 225 9% 205 231 11%
large 347 361 4% 353 345 -2%
lookup small 228 361 37% 230 473 52%
medium 243 436 44% 242 634 62%
large 277 571 51% 263 709 63%
modify small 384 459 16% 388 508 24%
medium 549 519 -6% 530 577 8%
large 1120 724 -55% 1279 750 -71%

Explanations for the tests:

  • insert inserts a bunch of elements into a map
  • reserve insert is the same test but I call reserve() before inserting the elements
  • erase builds a map and then erases all elements again
  • lookup builds a map and then performs many lookups where most of the lookups will fail
  • modify builds a map and then erases one element and adds one new element over and over again

In all of the non-insert tests I call reserve() when building the map. But I use a different number of elements for each test, so the overhead for building the map is not always the same. I also fill up the map until it has 85% load factor, which is the default max_load_factor() for sherwood_map. The default max_load_factor() for unordered_map is 100%, but for comparisons sake I use the same value for both.

The results are kind of what I expected. The biggest speed improvements are for lookups, and for when inserting into a map that has reserved enough space to fit all elements. I didn’t measure this exactly, but I expect that the main reason for the improved lookup time is that there are fewer pointer indirections, and the reason for the improved insert time is that sherwood_map doesn’t need to do new heap allocations.

I was surprised though to see that sherwood_map also performs better in almost all other areas. Only when there are many modifications to the map is unordered_map faster. Note though that this is not the case if you modify values that are already stored in the map. Only when you erase objects and add new objects is unordered_map faster. (updating existing objects is as fast as performing a lookup)

I also did a brief test to see how much memory each map is using: When storing one million objects in a unordered_map<int, string> my application uses 39 MiB of memory. When storing one million objects in a sherwood_map<int, string> my applications uses 27.6 MiB of memory. I expect that the reason for this is that unordered_map has more overhead for every object stored in the map. I did not modify the max_load_factor of the map for this test, so unordered_map uses a max_load_factor of 1, where sherwood_map uses a max_load_factor of 0.85.

One thing that I’m kind of embarrassed about (because I just did a big post about this) is that sherwood_map compiles more slowly than unordered_map. I know why it compiles more slowly and I know how to change it, but it requires some non-trivial changes to the structure of the map. Maybe I’ll do that when I support allocator propagation. But C++ does make it difficult to predict how fast or slow your code will compile… (in my case it’s actually because std::swap compiles slowly, who would have thought…)

So when should you use unordered_map? If you plan to do lots of modifications to the map (lots of erases and insertions of new objects) or if you need your iterators and references to not be invalidated. Otherwise using Robin Hood Linear Probing seems to be a better alternative.

And here is a link to the github files.

edit:

gnzlbg in the comments pointed me to a Joaquín M López Muñoz  blog where he had done performance comparisons for hashtable implementations. I ran his same benchmarks and the results are… embarrasing. Let’s start with the results:

Inserting elements:

running_insertion

Reserving and then inserting elements:

no_rehash_running_insertion

Erasing elements:

scattered_erasure

Successful lookups:

scattered_successful_lookup

Unsuccessful lookups:

scattered_unsuccessful_lookup

In all of these lower numbers are better. So sherwood_map doesn’t do terribly, but unfortunately the right hand side of that graph doesn’t really count. The left hand side is interesting because at least for me most maps have few elements. Unfortunately I ran my comparison above on the right hand side of the graph. I didn’t do that intentionally, I just ran a test that took long enough to be easy to measure.

So what do I take from this? Time to optimize sherwood_map :-). I spend all of my time on the calculation for the bucket of a given index. The way you find out when to stop iterating in Robin Hood Linear Probing is to look at how far you are from the initial bucket of a given element. When you find an element that’s less steps away from its initial bucket than your element is from your initial bucket, then you can stop iterating because you know that your object is not in the map. Unfortunately that means that as you’re iterating you keep on having to calculate the initial bucket of elements that you walk past. So I’ll play around with caching that calculation. If that doesn’t work I could also change my bucket count to be a power of two. That’s frowned upon for hashtables because it increases clustering but the increased clustering may be worth it if iterating over those clusters to find your element is cheap.

I’ll probably write a new blog post once I’ve done that.