plalloc: A simple stateful allocator for node based containers

by Malte Skarupke

Allocators in C++ are awkward classes. They are usually zero-size to keep containers small, and they are stateless because stateful allocators used to be difficult to do in C++. But nowadays we don't really care about the size of the container object (we care about the size of the container's contents), and since C++11 we have support for stateful allocators.

The problem I’m trying to solve is that node based containers have bad cache behavior because they allocate all over the place. So I wrote a small allocator which gives out memory from a contiguous block. It speeds up std::map and std::unordered_map. It’s called plalloc because I think this is a pool allocator.

Code is below:

#pragma once

#include <cstddef> // size_t
#include <memory>
#include <utility> // std::forward, std::move
#include <vector>

template<typename T>
struct plalloc
{
    typedef T value_type;

    plalloc() = default;
    template<typename U>
    plalloc(const plalloc<U> &) {}
    plalloc(const plalloc &) {}
    plalloc & operator=(const plalloc &) { return *this; }
    plalloc(plalloc &&) = default;
    plalloc & operator=(plalloc &&) = default;

    typedef std::true_type propagate_on_container_copy_assignment;
    typedef std::true_type propagate_on_container_move_assignment;
    typedef std::true_type propagate_on_container_swap;

    bool operator==(const plalloc & other) const
    {
        return this == &other;
    }
    bool operator!=(const plalloc & other) const
    {
        return !(*this == other);
    }

    T * allocate(size_t num_to_allocate)
    {
        if (num_to_allocate != 1)
        {
            return static_cast<T *>(::operator new(sizeof(T) * num_to_allocate));
        }
        else if (available.empty())
        {
            // first allocate 8, then double whenever
            // we run out of memory
            size_t to_allocate = 8 << memory.size();
            available.reserve(to_allocate);
            std::unique_ptr<value_holder[]> allocated(new value_holder[to_allocate]);
            value_holder * first_new = allocated.get();
            memory.emplace_back(std::move(allocated));
            size_t to_return = to_allocate - 1;
            for (size_t i = 0; i < to_return; ++i)
            {
                available.push_back(std::addressof(first_new[i].value));
            }
            return std::addressof(first_new[to_return].value);
        }
        else
        {
            T * result = available.back();
            available.pop_back();
            return result;
        }
    }
    void deallocate(T * ptr, size_t num_to_free)
    {
        if (num_to_free == 1)
        {
            available.push_back(ptr);
        }
        else
        {
            ::operator delete(ptr);
        }
    }

    // boilerplate that shouldn't be needed, except
    // libstdc++ doesn't use allocator_traits yet
    template<typename U>
    struct rebind
    {
        typedef plalloc<U> other;
    };
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;
    template<typename U, typename... Args>
    void construct(U * object, Args &&... args)
    {
        new (object) U(std::forward<Args>(args)...);
    }
    template<typename U, typename... Args>
    void construct(const U * object, Args &&... args) = delete;
    template<typename U>
    void destroy(U * object)
    {
        object->~U();
    }

private:
    union value_holder
    {
        value_holder() {}
        ~value_holder() {}
        T value;
    };

    std::vector<std::unique_ptr<value_holder[]>> memory;
    std::vector<T *> available;
};

Since support for C++11 allocators isn't universal yet, there's a bit of old boilerplate in there, but overall it's still pretty compact.

To see how it works let’s look at the deallocate function first: Whenever there’s a large block of memory, I just call regular operator delete. Small blocks of memory go into the available list. The allocate function matches this: Large blocks of memory come from operator new, small blocks come from the available list. When the available list is empty, a new chunk of memory is allocated and all new objects from that chunk get added to the available list.

Should be pretty straightforward. The benefit of this is that the memory for one container gets allocated mostly contiguously. It only makes sense for node based containers though, since it falls back to regular operator new and operator delete for large blocks of memory.

One thing that’s a bit ironic is that the only container that fully supports stateful allocators, boost::unordered_map, actually asserts when it tries to move plalloc. The assert comes from a sequence where it does “this->alloc = std::move(other.alloc); ASSERT(this->alloc == other.alloc);” which will always be false for stateful allocators. I consider this a defect in boost::unordered_map and will file a bug for it.

I still had the code for generating the graphs from my sherwood_map post, but this time I didn't want to spend as much time on generating them, so they are less pretty because they have fewer data points:

[Graph: Insert]
[Graph: Reserve then Insert]
[Graph: Erase]
[Graph: Successful Lookup]
[Graph: Unsuccessful Lookup]

In all of these the x axis is the number of items in the container. The y axis doesn’t mean much except to compare relative numbers. Lower numbers are faster. I used maps from int to pair<string, string>.

As you can see, using plalloc improves performance in almost all benchmarks. Obviously insert and erase are faster because the allocation code has been simplified a lot, but you probably don't care about those (at least I don't). The lookup performance is probably what matters most, so here are the speed up numbers for those:

Test                 Container             Average speed up with plalloc
Successful Lookup    std::unordered_map    0.3%
                     boost::unordered_map  11.2%
                     boost::multi_index    4.8%
Unsuccessful Lookup  std::unordered_map    4.6%
                     boost::unordered_map  6.5%
                     boost::multi_index    3.8%

boost::multi_index improved enough to beat sherwood_map. Which means I’ll have to spend more time optimizing it…

There’s no graph for sherwood_map with plalloc because it doesn’t make any sense. For sherwood_map plalloc would just be a slower version of std::allocator because sherwood_map is not node based.

Also here are graphs comparing std::map with plalloc to boost::flat_map:

[Graph: Successful Lookup]
[Graph: Unsuccessful Lookup]
[Graph: Insert]
[Graph: Erase]

The average speed up for successful lookup is 16.8%, and the average speed up for unsuccessful lookup is 6.7%.

A few notes about this: boost::flat_map is still much faster. I didn't try to find out exactly why, but my assumption would be that the processor can predict more than one jump ahead of time, because it doesn't have to finish fetching the old memory before it knows where the next jump could go. I had to make the insert graph log scale because it's very slow to insert random values into boost::flat_map (and LibreOffice doesn't seem to have a way to split an axis). That's not a problem with the container though; it just shows that you shouldn't do this (and it's not fair to profile the container for something that you shouldn't do). I left boost::flat_map out of the erase graph for the same reason. That dent in the unsuccessful lookup graph is weird. At first I thought it was a measuring error, but it is repeatable. Finding out what causes it would take too long though, so it'll just stay an unsolved mystery. I wouldn't expect to find it on all machines.

So the benefits of plalloc are:

  • It makes code faster

And the downsides are

  • It will keep as much memory as you had in your worst case. If your container usually has 1,000 elements but every now and then it has 10,000, then plalloc will always keep enough memory for 10,000
  • That remains true even if you clear the container. The memory only gets returned when the container is destroyed

On the other hand whatever method that you were going to use instead of a node based container probably has these same downsides.

One thing that you may notice is that this comes pretty close to turning node based containers into a packed array. It doesn't exactly, but it goes in the same direction. The biggest downside of plalloc compared to a packed array is that you can't iterate over the nodes in the order in which they are in the array. But I think you could write an implementation of std::unordered_map which uses a packed array internally and which does allow you to iterate in order. I think the standard allows that…

You can probably use plalloc even if your container doesn’t fully support stateful allocators. Just be careful about copying or moving the containers. I can’t predict what behavior you’d get.

The license for the code is this:

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>