Probably Dance

I can program and like games

Tag: C++

plalloc: A simple stateful allocator for node based containers

Allocators in C++ are awkward classes. Allocators are usually zero size to make the containers small, and they have no state because that used to be difficult to do in C++. But nowadays we don’t really care about the size of our containers (we care about the size of the content of the containers) and since C++11 we have support for stateful allocators.

The problem I’m trying to solve is that node based containers have bad cache behavior because they allocate all over the place. So I wrote a small allocator which gives out memory from a contiguous block. It speeds up std::map and std::unordered_map. It’s called plalloc because I think this is a pool allocator.

Code is below:

Read the rest of this entry »

Comma Operator RAII Abuse

Here’s a neat little trick that one of my co-workers, Clint Levijoki, discovered. In C++ you often use an RAII wrapper that you place on the stack if you want to be sure that code gets run at a later point. One good example would be std::lock_guard which you use if you want to be sure that a lock gets released in the future, or scoped profiling which you use to stop a timer in the future. For example for this:

std::string foo();
void bar()
    std::string baz = foo();
    // do something with baz

If you want to profile foo() you’d write it like this:

std::string foo();
void bar()
    std::string baz;
        ScopedProfiler profile_foo("foo()");
        baz = foo();
    // do something with baz

Which is less pretty and slightly slower. Alternatively you can use the comma operator and do it like this:

std::string foo();
void bar()
    std::string baz = (ScopedProfiler("foo()"), foo());
    // do something with baz

And this will start a timer before calling foo(), and stop the timer after calling foo(). You could wrap it in a macro to make it more readable. And the benefit is obviously that you don’t have to destroy your function flow when you want to insert RAII objects.

Read the rest of this entry »

Another Opinion on “Almost Always Auto”

Herb Sutter has been promoting his almost always auto style again, and I think it is harmful. I would agree with “almost always auto” in Scala. I disagree with it in C++. And that’s because there is a slight difference in syntax for type inference between the two languages.

Here’s type deduction in Scala:

val inferred = 0
val typed : Int = 1

And here it is in C++

auto inferred = 0;
int typed = 1;

Seems similar, right? But the difference in syntax leads to different long term programmer behavior.

Read the rest of this entry »

I Wrote a Faster Hash Table

edit: turns out you can get an even faster hash table by using this allocator with boost::multi_index. I now recommend that solution over the hash table from the post below. But anyway here is the original post:

This is a follow up post to “I Wrote a Fast Hash Table.

And I’ll start off with a download link.

I’ve spent some time optimizing my sherwood_map implementation and now it is faster than boost::unordered_map and boost::multi_index. Which is what I would have expected from my first implementation, but it turns out that those containers are pretty damn fast.

If you don’t know what Robin Hood Linear Probing is I would encourage you to read the previous post and the post that I linked to from that one. With that said let’s talk about details.

Read the rest of this entry »

I Wrote a Fast Hash Table

As others have pointed out, Robin Hood Hashing should be your default hash table implementation. Unfortunately I couldn’t find a hashtable on the internet that uses robin hood hashing while offering a STL-style interface. I know that some people don’t like the STL but I’ve found that those people tend to write poorer interfaces. So I learn from that by not trying to invent my own interface. You can use this hash table as a replacement for std::unordered_map and you will get a speedup in most cases.

In order to reduce name conflicts I call it sherwood_map, because Robin Hood Hashing.

Read the rest of this entry »

Reinventing the Wheel for Better Compile Time

I have a new laptop on which everything compiles slowly. Which means I sometimes get really annoyed with C++ and its slow compile times. The two most common causes of slow compile times are large headers and complicated templates. Large headers are somewhat easy to identify, and you can usually improve your compile time by using precompiled headers, but it has traditionally been hard to find out which templates are hurting your compile time. Luckily there is a project called Templight which can tell you how much time Clang is spending on instantiating templates. Using Templight I have found that some templates take unnecessarily long to compile.

So I re-implemented std::unique_ptr, boost::flat_map, and await.

Read the rest of this entry »

Introducing the Asserting Mutex

Multithreaded code in C++ tends to become brittle over time. If you write your code well you’ll need almost no synchronization between different threads, but the price of that is that your code will be littered with undocumented conventions of when you can read or modify which state. In your average threaded C++ application there are countless potential race conditions, all of which never happen because people follow conventions about when to do what. Until someone doesn’t know about a convention that he has to follow or until you change the conventions and you forget to update one piece of code that you didn’t know about.

Enter the asserting mutex. The asserting mutex is a conditional breakpoint that triggers only if a potential race condition actually happens. I call it asserting mutex because you use it like a mutex to protect a critical section. It works very simply: If one thread locks the asserting mutex and a second thread attempts to lock the asserting mutex before the first thread unlocks it, you get an assert. And it guarantees that both threads will still be inside the critical section when you get the assert. The cost is one atomic increment and one atomic decrement, which is not free but cheap enough that you can place lots of asserting mutexes in your code before they cause problems. So you could use this to document many of your threading conventions. Used correctly this is a breakpoint that makes it very easy to find data races.

Here is the complete code:

Read the rest of this entry »

Type-safe Pimpl implementation without overhead

I like the pimpl idiom because I like to keep my headers as clean as possible, and other people’s headers are dirty. Unfortunately the pimpl idiom never feels like a good solution because it has runtime overhead that wouldn’t be needed if I didn’t care about clean headers so much.

If you’re not familiar with the Pimpl idiom, it stands for “pointer to implementation” and you use it in C/C++ headers to use a class without having to include the other header in your header. You can also use it to hide your implementation from your users so that you can change the internals of your class and nobody has to know. It’s used all over the place but it has one disadvantage: You always need an extra heap allocation and every method performs an extra pointer dereference.

This code fixes that, so that there can be zero runtime overhead. Here’s how to use it:

class btRigidBody;
class MyRigidBody
    // ...
    ForwardDeclaredStorage<btRigidBody, 768> bulletBody;

And the code is below:
Read the rest of this entry »

4GB per Vector

I recently had a problem where I had a vector that was growing once, being iterated over once, and then deallocated. And it was bothering me how much time I spent on reallocating. The problem was that I also could not predict well how big the vector would be ahead of time. It could vary by orders of magnitude for input that looked similar at first.

As I was researching the minor page faults that I was hitting during the reallocations I came across this stack overflow question. There’s a person there who speaks of resizing an array by requesting a lot of memory from the OS, but then only actually committing the memory as he wants the array to grow.

Which is smart, and I wonder why nobody has taken it further. I think all vectors should operate like that all the time. Thus I wrote my own vector class which always requests 4GB of memory from the OS, and which then only actually commits that memory one page at a time. After all we have this enormous address space in 64 bit that we won’t be using for a long time.

Read the rest of this entry »

Handmade Coroutines for Windows

In a previous post I implemented coroutines using the ucontext.h header. In this post I will show a replacement that should be much faster. I will also provide a Windows implementation.

I like to start off with some code, so here is the complete code for switching the stack in Linux:

    pushq %rbp
    movq %rsp, %rbp
    // store rbx and r12 to r15 on the stack. these will be restored
    // after we switch back
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, (%rdi) // store stack pointer
    // set up the other guy's stack pointer
    movq %rsi, %rsp
    // and we are now in the other context
    // restore registers
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp

Read the rest of this entry »