A faster implementation of std::function
by Malte Skarupke
As I wrote in my last post, I consider std::function to be a very important class that will change how you design your code, because it means that you have to use inheritance less often. In that post I was very impressed with the performance of std::function when compiled with optimizations. Unfortunately std::function can be far slower than a virtual function call in debug.
I wanted a std::function implementation that doesn’t have too big of a performance impact on your application when debugging it, so I wrote my own. It is also faster than all other implementations that I could find in release mode.
The code is in the public domain (I want all library writers to start using it) and here is a download link.
I mainly looked at the implementations of Visual Studio and libstdc++. Both use a small functor optimization where you store small functors inside the std::function itself rather than allocating them on the heap. libstdc++ only uses that optimization for function pointers and member function pointers, though. Visual Studio uses it for anything that is 24 bytes in size or less, which is incorrect because then they can no longer offer a noexcept swap; but since Visual Studio currently doesn’t support noexcept, my best guess is that they didn’t notice. I chose to use 16 bytes for the small functor optimization, and I only place objects in it that are no-throw move constructible.
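In sketch form, that in-place decision looks something like the following. The names here are assumptions for illustration, not the actual implementation:

```cpp
#include <type_traits>
#include <cstddef>

// Sketch of the small-functor criteria described above; names are invented.
typedef std::aligned_storage<16, alignof(double)>::type functor_padding;

template <typename T>
struct is_inplace_allocated
{
    static const bool value =
        // it has to fit into the 16 bytes of internal storage...
        sizeof(T) <= sizeof(functor_padding)
        // ...and moving it may not throw, so that swap() can be noexcept
        && std::is_nothrow_move_constructible<T>::value;
};
```

Anything that fails either test goes to the heap instead.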
I use the 8 bytes that I gain over Visual Studio to store a function pointer in the struct, which allows me to call the provided functor directly. I took that optimization from libstdc++. All other operations use virtual function calls. But storing the pointer for operator() directly instead of in a vtable means one less pointer indirection and one less possible cache miss.
Except that I don’t actually use virtual functions. Instead I use a manager function that acts very similar to a vtable. That is an optimization that comes from boost, and it prevents RTTI code bloat. We never need RTTI for the internals of a std::function, but there is no way to prevent the compiler from generating the relevant information if you have a virtual function in your class. So you hand-roll your own virtual function implementation.
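The hand-rolled scheme can be sketched like this; it is a minimal illustration with made-up names, not the library's actual code:

```cpp
#include <new>

// One manager function per stored type replaces the vtable: it receives an
// operation code and performs clone/destroy for that type. Since there are
// no virtual functions, the compiler emits no RTTI for these internals.
union functor_storage { void* heap; char inplace[16]; };

enum class operation { clone, destroy };

template <typename T>
void manager(operation op, functor_storage& dst, functor_storage& src)
{
    switch (op)
    {
    case operation::clone:
        // copy-construct the small functor from src into dst
        new (dst.inplace) T(*reinterpret_cast<const T*>(src.inplace));
        break;
    case operation::destroy:
        reinterpret_cast<T*>(dst.inplace)->~T();
        break;
    }
}
```

The std::function equivalent stores one `manager` pointer per object, and operations like copying or destruction dispatch through it.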
One optimization that I came up with myself is to have no conditional in operator(). The first line in all other implementations looks like this: “if (!*this) throw bad_function_call();” The optimization is to put that functionality inside a functor that gets assigned to an empty std::function, so that operator() looks the same whether the std::function is empty or not. The disadvantage is that operator bool() needs a virtual function call in my implementation. I consider that a good trade-off, as I expect functions to be checked far less often than they are called.
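Roughly, the trick looks like this. This is a condensed sketch with invented names; the real class does much more:

```cpp
#include <functional> // std::bad_function_call

// The empty state is just another target whose operator() throws, so the
// hot path in operator() below needs no null check at all.
struct empty_target_t
{
    int operator()(int) const { throw std::bad_function_call(); }
};

// one instantiation per stored type; statically knows how to call it
template <typename T>
int caller(const void* target, int arg)
{
    return (*static_cast<const T*>(target))(arg);
}

const empty_target_t empty_target = {};

struct tiny_function
{
    // default-constructed (empty) state points at the throwing target
    int (*call)(const void*, int) = &caller<empty_target_t>;
    const void* target = &empty_target;

    // branchless: identical code whether empty or not
    int operator()(int arg) const { return call(target, arg); }
};
```

Calling an empty `tiny_function` still throws bad_function_call, but the throw lives in the empty target rather than in operator().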
To get it to run slightly faster in debug I made sure that there were as few function calls as possible between calling operator() and your functor being called. The biggest performance problems in debug were in Visual Studio. Since Visual Studio seems to inline std::forward even in debug, I allowed myself the simplification of using std::forward.
Here are some measurements for performance:
|                            | Virtual Function | std::function | My std::function |
|----------------------------|------------------|---------------|------------------|
| Visual Studio 2012 Debug   | 1917             | 4195          | 3262             |
| Visual Studio 2012 Release | 800              | 770           | 630              |
| Clang 3.1 -O0 -g3          | 1864             | 3161          | 2513             |
| Clang 3.1 -O3              | 564              | 474           | 456              |
| GCC 4.7.2 -O0 -g3          | 1755             | 3363          | 2587             |
| GCC 4.7.2 -O3              | 555              | 466           | 431              |
(Smaller is better. Time in milliseconds, but that doesn’t really matter because I used different machines. So only compare within a row, not across rows)
The code for the tests is here. I rotated the order of the tests (to reduce the effect of the cache and of heap fragmentation) and chose the shortest time that each implementation was able to achieve.
I only tested with lambdas that benefit from the small functor optimization because I think that those will be most common.
My implementation beats std::function in all categories, and it beats a virtual function call in optimized builds. In debug builds it is roughly 50% slower than a virtual function call, which is not ideal, but it is better than the other implementations, and I think I can live with this. The benefits of std::function still far outweigh the downsides, and I think that other code will slow you down more in debug.
You can define FUNC_NO_EXCEPTIONS to prevent the function from using exceptions. In that case you will get a segmentation fault instead when calling an empty std::function. You can define FUNC_NO_RTTI to compile the std::function if you have turned off RTTI in your compiler. In that case the target_type() and target() member functions will not be available. That idea came from libc++.
Here is the download link again and now go and experiment with std::function. One thing I want to try is to introduce Python’s function decorators into C++. Not in terms of syntax, but in terms of semantics. Doing that would have been very difficult with virtual functions, but should be trivial with a std::function. And they should be very fast if you apply the decorator before you store your functor in a std::function.
Why should operator bool() need a virtual function call? You keep a direct pointer to the function to call, right? And there should only be one empty function for any given type, so you should be able to just do a pointer comparison.
It seems that as long as you’re storing a function pointer, you should be able to store the *actual* member function pointer of the functor’s operator(), and handle virtual operator()s with a stub function. Hooray C++ abstraction penalty!
Your test suite should probably include functors of a size such that they don’t get allocated in-place (i.e. 16-, 24-, and 32-byte functors).
But how do you hold the functor’s operator() member function pointer? You need to keep it in a type-erased context and convert it back at the time of invocation, which is why I see no other way than having a type-erased templated function (in his case “call”) to do the job, and then you get double indirection again.
John’s idea worked. Operator bool is now just a comparison in the implementation. It also meant that less code gets generated.
You don’t need to store the functor’s operator() member function pointer. Every time you instantiate a func::function with a new typename T, it creates a new templated function that calls operator() on that T. That function knows how to call operator(); you just have to keep a pointer to it.
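The point can be sketched like this, with illustrative names; `invoke_erased` plays the role of the templated call function:

```cpp
// No member function pointer is stored anywhere. Instantiating with a new
// type T stamps out one plain function that statically knows how to call
// T's operator(); a pointer to that function is all that is kept.
template <typename T>
double invoke_erased(void* erased, double arg)
{
    // the call to operator() is resolved at compile time inside this stub
    return (*static_cast<T*>(erased))(arg);
}

template <typename T>
auto make_invoker(T&) -> double (*)(void*, double)
{
    return &invoke_erased<T>;
}
```

The returned pointer is an ordinary function pointer, so calling through it is a single indirection even though the stored type is erased.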
I’m commenting your comment here as I can’t nest them more:
I see that your operator() does
call(manager_storage.functor, FUNC_FORWARD(Arguments, arguments)...);
with call being a function pointer. Then in your templated call you do:
(*reinterpret_cast<const typename std::allocator_traits<Allocator>::pointer &>(storage))(FUNC_FORWARD(Arguments, arguments)...)
So you’re following function pointers two times. Is this not double indirection as well?
The second indirection is not a function pointer. It’s reading the ‘this’ pointer to the object itself. (You copied the version when the function object is too big to fit into the internal storage. There is a second, similar call above that does less work for small function objects)
Importantly we don’t have to wait for the second indirection to proceed. If typename T is a lambda function, the result of this read is only needed the first time that you use a variable from the capture list inside the function body. It’s not blocking until then. With an empty capture list, this read is never needed. The compiler knows this because it inlines the body of the lambda there. (Because the call to operator() is resolved at compile time)
This should also be in cache because it’s right next to ‘call’, which we just read.
I hadn’t thought properly about operator bool(). The issue was that I’m not actually storing a pointer to the function that gets called, but a pointer to a function that calls the target of the std::function, and that target may or may not be a function pointer.
So what I’m storing looks like this
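A sketch of that layout, with assumed names:

```cpp
#include <type_traits>

// Sketch of the stored layout: a direct call pointer plus 16 bytes of
// storage for the target (or a pointer to it). Names are assumptions.
typedef std::aligned_storage<16, alignof(double)>::type functor_storage_type;

struct stored_data
{
    // invoked directly by operator(); a templated stub that knows the
    // target's real type (or, for an empty function, an empty-state stub)
    int (*call)(const functor_storage_type&, int);
    // the target lives here in place, or a pointer to it if heap allocated
    functor_storage_type storage;
};
```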
And then the function behind that pointer knows that it can reinterpret_cast the functor_storage_type to the target type and call that. (or if the target is heap allocated, it will reinterpret_cast the functor_storage_type to the correct pointer, dereference it and call that)
I had made the functor that gets called in there be the one that throws an exception. But I could have made the function that this pointer points to throw the exception itself. That would have made the operator bool() come down to a function pointer comparison.
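In sketch form (invented names), the change looks like:

```cpp
#include <functional> // std::bad_function_call

// The empty state's call stub itself throws, so operator bool() reduces to
// comparing the stored pointer against that one well-known stub.
typedef int (*call_ptr)(const void*, int);

int empty_call(const void*, int)
{
    throw std::bad_function_call();
}

struct tiny_function
{
    call_ptr call = &empty_call;
    const void* target = nullptr;

    int operator()(int arg) const { return call(target, arg); }

    explicit operator bool() const
    {
        return call != &empty_call; // just a pointer comparison
    }
};
```

Since the stub is unique per program, the comparison is exact: no other target can ever alias it.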
I uploaded a new version of the code that does just that.
As for other performance measurements: If I create the trivial test of bloating the lambda so that it gets heap allocated (I made it be 40 bytes in size) there is no difference in performance. I assume that this is because I only allocate 1000 objects and since I make a total of 100 million calls to operator(), everything will be in the cache for the vast majority of the test. And then the one additional pointer dereference doesn’t matter.
If I allocate more objects, so that they no longer fit in the cache, you start to see that the virtual function performs better than my std::function implementation. Somewhere at around 55000 objects a virtual function call is faster than my std::function. If I take it to the extreme and make the vectors contain 10 million elements, my std::function implementation takes roughly 1.5 times as long as using virtual functions.
Testing in debug doesn’t show a big difference to the results in the post, even at 10 million objects.
For functors that benefit from the small functor optimization, the virtual function is never faster than my std::function. Not even at 10 million elements in the vectors.
In all of these cases my std::function remains faster than the libstdc++ version of std::function. The gap widens in fact, but not by much.
All of this is with the same test code I linked above (just some numbers changed) compiled in GCC 4.7.2 running on a Core i5 3450 with 8 gigabytes of ram.
I did not run these tests again on all compilers.
Another interesting test would be to create many many different objects. Currently I have two types for the virtual function test, and two types for the std::function test. It would be interesting to have a test with hundreds of thousands of types. Then the question would be at which point the code bloat of std::function starts affecting performance. But I don’t know an easy way of testing that. Maybe with some boost preprocessor hackery…
Since the overhead for calling std::function is greater than for virtual functions, I suspect the performance penalty you observed is simply due to memory allocation/caching.
Why is the overhead greater for std::function? You can implement std::function as one virtual function call, in which case the overhead would be exactly the same (after optimization). And I implemented it as a function pointer call, which is one pointer dereference less.
I don’t remember the details but I think when I looked at the disassembly for the std::function call vs the virtual function call, there were one or two less instructions per loop iteration for std::function. That alone could explain the difference in speed.
As for memory related issues: Yes, that came up, but as I said I ran the test multiple times, and I rotated the order of the tests. So one time I allocated all the stuff for virtual functions first, once I allocated everything for std::function first, and once I allocated everything for my function implementation first. There were real differences in performance between those runs, so I decided to pick the fastest run that each option was able to achieve. I think that’s fair. I don’t think I had a run where virtual functions were faster than my std::function implementation.
To reduce memory related issues you could put everything in one huge array. In my opinion that wouldn’t give you valuable test results because that’s not what you’re going to get in the real world. If you had everything in one array, why would you use virtual functions or something like std::function?
In my test (VS2012, x86, full optimizations) there was one additional nullptr check before each call, plus LambdaA::update had one additional memory access. Other than that they were pretty much identical.
The nullptr check sounds like you’re looking at Microsoft’s implementation of std::function. My implementation is branchless. For me (VS2012, x64, full optimizations) the inner loop has six instructions for std::function while the virtual function inner loop has the exact same six instructions plus one additional pointer dereference. Which is what you would expect.
But I am also getting the one additional memory access in LambdaA compared to UpdateableA. Which makes sense because in UpdateableA the arguments passed to the functions are all you need, while LambdaA needs to get its captured this pointer first.
This made me curious as to whether a comparison of just LambdaA vs UpdateableA would lead to a draw, because then both would have the same number of instructions. (the virtual function of UpdateableA has one more in the loop, while LambdaA has one more in the called function)
Turns out that barely changes the outcome. I don’t know why. The two loops are running almost the same instructions (only the order and some arguments are changed), but for some reason std::function is still drastically faster.
If someone has an idea as to what could cause this, I would be very curious.
I couldn’t test your code since I don’t have the CTP installed. As I said before, the difference in performance is probably due to cache misses. If you only allocate one instance of LambdaA and LambdaB each and fill the slots with those two, the difference should disappear.
I believe there is a small issue with your code: is_inplace_allocated doesn’t take into account the alignment requirements of the type. This can easily be solved by adding the following condition:
alignof(functor_padding) % alignof(T) == 0
You could use std::alignment_of<T>::value instead of alignof for compatibility with older compilers which don’t fully support C++11.
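With that condition added, the trait reads something like this (a sketch with assumed names):

```cpp
#include <type_traits>
#include <cstddef>

// Sketch of is_inplace_allocated with the alignment fix folded in.
typedef std::aligned_storage<16, alignof(double)>::type functor_padding;

template <typename T>
struct is_inplace_allocated
{
    static const bool value =
        sizeof(T) <= sizeof(functor_padding)
        && std::is_nothrow_move_constructible<T>::value
        // the fix: the storage's alignment must also satisfy T's
        && std::alignment_of<functor_padding>::value
               % std::alignment_of<T>::value == 0;
};
```

An over-aligned type that would technically fit in 16 bytes now correctly falls back to the heap.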
Thanks, that’s a good point. I have added that.
I ran tests using virtual methods and std::function without any state, and with VS2012 Update 2 std::function is noticeably slower than a virtual call. It seems to be doing more than one indirection.
Try running tests for your implementation first and std::function next. It seems to be affecting the results.
These results are interesting, but not that surprising once your explanation is taken into account.
However, these tests are testing a single function. Your explanation, that std::function can be implemented as a single virtual function call, but can also be implemented more efficiently, makes sense for the case where you have a single function. If you have multiple functions that you are calling in related blocks of code, I wonder what it would take for the virtual function to become faster. I do not know if modern compilers perform this optimization, but presumably the optimizer could retrieve the information of multiple functions when it accesses the vtable, allowing the cost to be amortized over several calls to different functions, whereas std::function would require an indirection with each new call.
I am not sure how to devise a fair test for this, though.
In theory virtual functions can’t become faster, because a virtual call is always two pointer indirections: first to get the vtable, second to get the actual function pointer. When you pass a lambda to a std::function, that creates a special function which calls the lambda without an additional indirection, so it is just one pointer indirection to get to that special function.
As for vtable optimizations: I rather liked this talk: http://channel9.msdn.com/Events/GoingNative/2013/Compiler-Confidential (starts talking about vtable optimizations at 22:26)
All the optimizations that he talks about could in theory also apply to std::function. A vtable might be easier to reason about though because you have more type information to know the possible set of functions to call.
So then it comes down to how smart the optimizer is. When I did the measurements above, I stepped through the assembly and none of these optimizations were happening. So the loop simply benefited from having one less indirection.
The dropbox link to your std::function implementation is no longer accessible, could you please make it available somewhere? github would be nice!
Oh that is embarrassing. The dropbox link still works for me, but I’ve also uploaded the function on github here:
you compare_functions.cpp link has broken, could you please upload to your github ? thanks.
I’ve updated the link. Thanks for letting me know!
This is very nice, thanks. One question: If I create a null func::function, and assign that to a std::function, it seems that the std::function’s operator bool then returns true. Is this intended? I would expect it to return false in this case, because the func::function is null.
Sadly that doesn’t work. I would have to write a custom conversion operator. The problem is that as far as std::function is concerned, a null func::function is a perfectly fine callable object. The fact that it will throw an exception or assert on call doesn’t matter.