Quickly Loading Things From Disk

by Malte Skarupke

Loading things from disk is a surprisingly unsolved problem in C++. If you have a class that you want to save to JSON or binary and load back later, you’ll probably end up writing custom parsing code. Solutions like Protobuf or Flatbuffers generate custom classes that are… how should I put it… bad code. They force you to work in a parallel type system that’s not quite as powerful as C++ and that doesn’t allow you to write custom member functions. You pretty much have to wrap anything that they generate. Boost Serialization is a better solution in that it can load and save existing classes, which lets you keep your code quality higher, but it’s unable to write human readable files. The default implementation that it provides is also surprisingly slow. I blame that on the fact that it loads through std::istream.

Based on frustrations with existing solutions I decided to write something that can

  1. read and write existing, non-POD classes. No custom build step should be required
  2. write human readable formats like JSON
  3. read data very quickly (I care less about how fast it writes stuff. I work in an environment where there is a separate compile step for all data)

I came up with something that is faster than any of the libraries mentioned above.

Establishing a Baseline

The fastest possible way I know how to load stuff is to just memcpy the bytes straight from disk. So I’m going to load and save a POD struct. I know I said above that I want to save non-POD classes, but if I want to compare to memcpy, I have to keep my test case simple. Here is what it looks like:

struct memcpy_speed_comparison
{
    float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    float f = 0.0f;
};

Where the loading and saving code looks like this:

void test_write_memcpy(const std::string & filename, const std::vector<memcpy_speed_comparison> & elements)
{
    std::ofstream file(filename, std::ios::binary);
    size_t size = elements.size();
    file.write(reinterpret_cast<const char *>(&size), sizeof(size));
    file.write(reinterpret_cast<const char *>(elements.data()), sizeof(memcpy_speed_comparison) * elements.size());
}
std::vector<memcpy_speed_comparison> memcpy_from_bytes(ArrayView<const unsigned char> bytes)
{
    size_t size = *reinterpret_cast<const size_t *>(bytes.begin());
    ArrayView<const unsigned char> content = { bytes.begin() + sizeof(size), bytes.end() };
    std::vector<memcpy_speed_comparison> elements(size);
    memcpy(elements.data(), content.begin(), content.size());
    return elements;
}

I memcpy the bytes to disk using file.write(), and then later I memcpy them back from the byte representation. (I keep using the term memcpy to indicate that I’m just copying bytes and not doing anything smarter than that. Just saying “copy” doesn’t properly convey that in my opinion, so memcpy it is)

I then use google benchmark to set up two benchmarks to measure how fast this is:

void MemcpyInMemory(benchmark::State & state)
{
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    while (state.KeepRunning())
    {
        std::vector<memcpy_speed_comparison> comparison = elements;
        RAW_ASSERT(comparison == elements);
    }
}
void MemcpyReading(benchmark::State & state)
{
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    std::string memcpy_filename = "/tmp/memcpy_test";
    test_write_memcpy(memcpy_filename, elements);
    while (state.KeepRunning())
    {
        MMappedFileRead file(memcpy_filename);
        std::vector<memcpy_speed_comparison> comparison = memcpy_from_bytes(file.get_bytes());
        RAW_ASSERT(comparison == elements);
    }
}

Google Benchmark measures everything inside the block that starts with while (state.KeepRunning()), which means that all of the setup and writing code above it is not part of the measurements.

The first benchmark measures how fast it is to memcpy the comparison data in memory. The second measures how long it takes to memcpy them back from disk. The reason why I measure both will become clear when we look at how long the above benchmarks take:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 482345 482846
MemcpyReading 663630 661515

Does it surprise you that copying from disk is only slightly slower than copying in memory? I should say that this is copying from an SSD, but even then I didn’t expect loading from disk to be this fast. Turns out it’s not this fast. What I’m measuring above is how fast it is to copy from the OS cache. That is slightly slower than copying from normal memory, because you have the overhead of a bunch of page faults as the pages are mapped in, but overall it’s still just copying from memory. In order to measure how fast it is to copy from disk, I have to make sure that the file gets evicted from the OS cache between runs. Luckily you can do that on Linux with these two calls:

    fdatasync(file_descriptor);
    posix_fadvise(file_descriptor, 0, 0, POSIX_FADV_DONTNEED);

Where posix_fadvise is the call that matters, but if you don’t call fdatasync first, the pages that haven’t been written to disk yet will not be evicted from the cache. That doesn’t matter in this benchmark because we are not writing in the loop, so the call to posix_fadvise would be enough, but for a general function that flushes a file from cache you need to call both.
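
Wrapped into a small helper, those two calls might look something like this (just a sketch; in my code this lives on the UnixFile class, whose implementation isn’t shown here):

#include <fcntl.h>
#include <unistd.h>

// flush a file from the OS page cache, given an already opened file descriptor
void evict_from_os_cache(int file_descriptor)
{
    // write out any dirty pages first, otherwise they stay cached
    fdatasync(file_descriptor);
    // then tell the kernel that the cached pages are no longer needed
    posix_fadvise(file_descriptor, 0, 0, POSIX_FADV_DONTNEED);
}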

With this call added I can actually measure how long it takes to memcpy data from a mmapped file, and these are the correct results:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145

Which looks much more like what you would expect. Reading from an SSD is roughly ten times more expensive than reading from memory.

I’m using mmap() here, mainly because I will use mmap later and I want to be able to easily compare the numbers. If you’re interested in how fast using read(2) is, it’s essentially the same speed: 5437341 ns if I’m reading directly into the std::vector.
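
For reference, a read(2) version of the loading function might look roughly like this (a sketch with all error handling omitted, not the exact code behind that measurement):

#include <fcntl.h>
#include <unistd.h>
#include <vector>

std::vector<memcpy_speed_comparison> read_with_read2(const char * filename)
{
    int fd = open(filename, O_RDONLY);
    // the file starts with the element count, followed by the raw elements
    size_t size = 0;
    read(fd, &size, sizeof(size));
    std::vector<memcpy_speed_comparison> elements(size);
    read(fd, elements.data(), size * sizeof(memcpy_speed_comparison));
    close(fd);
    return elements;
}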

Comparing with Existing Solutions

I decided to compare the above measurements with three solutions: Boost Serialization, Protobuf and Flatbuffers. Why these three? Because I knew them and because they are representative of other solutions that use similar approaches. Boost Serialization is a non-intrusive library that can save existing classes; Protobuf and Flatbuffers both require a custom build step. The difference between Protobuf and Flatbuffers is that Flatbuffers doesn’t do any actual loading. You just get a view into the data on disk.

Boost Serialization

Boost Serialization requires that I add this member function to the class:

struct memcpy_speed_comparison
{
    // ...

    template<typename Ar>
    void serialize(Ar & archive, int)
    {
        archive & vec;
        archive & i;
        archive & f;
    }
};

I could also write this as a non-member function but that doesn’t really matter for this example. This exposes the three members of this struct to an “archive” class that boost serialization provides and that I don’t need to know anything about, other than that it has an operator&. This code is all I have to do to read and write a vector of these.
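
For reference, the non-member version would look roughly like this (a sketch of boost’s non-intrusive form, not code from my benchmarks):

namespace boost { namespace serialization {
template<typename Ar>
void serialize(Ar & archive, memcpy_speed_comparison & value, unsigned int)
{
    archive & value.vec;
    archive & value.i;
    archive & value.f;
}
} }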

The code for the benchmark then looks like this:

void BoostSerializationReading(benchmark::State & state)
{
    std::string boost_filename = "/tmp/boost_serialization_test";
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    {
        std::ofstream file_out(boost_filename);
        boost::archive::binary_oarchive out(file_out);
        out & elements;
    }
    while (state.KeepRunning())
    {
        std::vector<memcpy_speed_comparison> comparison;
        std::ifstream file_in(boost_filename);
        boost::archive::binary_iarchive in(file_in);
        in & comparison;
        RAW_ASSERT(comparison == elements);
        file_in.close();
        UnixFile(boost_filename, 0).evict_from_os_cache();
    }
}

Which is pretty straightforward: create the archive class that boost serialization provides and call operator& on it with the data that I want to read and write. I really like this library. The evict_from_os_cache() line at the end is a bit awkward, but the reason for that is that I decided to add it to a UnixFile class of mine which just wraps open() and close().

Here is the performance of the above benchmark:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
BoostSerializationReading 8881369 8200465

Which is disappointing. I blame this on the fact that Boost Serialization goes through the std::istream interface, but I can’t be sure. I didn’t investigate further.

Protobuf

Protobuf cannot read or write the above class, so I had to make protobuf generate an equivalent class. Here is the protobuf description that I used to mirror the above data:

package test;

message Vec4
{
    required float f0 = 1;
    required float f1 = 2;
    required float f2 = 3;
    required float f3 = 4;
}

message LoadFastTest
{
    required Vec4 vec = 1;
    required int32 i = 2;
    required float f = 3;
}

message Array
{
    repeated LoadFastTest array = 1;
}

This description will generate three classes that are kind of similar to the std::vector that I used above. The code for the benchmark looks like this:

void ProtobufReading(benchmark::State & state)
{
    test::Array array;
    for (int i = 0; i < 100000; ++i)
    {
        test::LoadFastTest * test = array.add_array();
        test->mutable_vec()->set_f0(0.0f);
        test->mutable_vec()->set_f1(0.0f);
        test->mutable_vec()->set_f2(0.0f);
        test->mutable_vec()->set_f3(0.0f);
        test->set_f(float(i));
        test->set_i(i % 5);
    }
    StringView<const char> tmp_filename = "/tmp/protobuf_test";
    {
        UnixFile file(tmp_filename, O_RDWR | O_CREAT, 0644);
        array.SerializePartialToFileDescriptor(file.file_descriptor);
    }
    while (state.KeepRunning())
    {
        UnixFile file(tmp_filename, O_RDONLY);
        test::Array copy;
        copy.ParsePartialFromFileDescriptor(file.file_descriptor);
        for (int i = 0; i < 100000; ++i)
        {
            const test::LoadFastTest & test = copy.array(i);
            RAW_ASSERT(0.0f == test.vec().f0());
            RAW_ASSERT(0.0f == test.vec().f1());
            RAW_ASSERT(0.0f == test.vec().f2());
            RAW_ASSERT(0.0f == test.vec().f3());
            RAW_ASSERT(float(i) == test.f());
            RAW_ASSERT(i % 5 == test.i());
        }
        file.evict_from_os_cache();
    }
}

Which is… unpleasant. I can’t reuse my code for setting up the test data, so I had to inline it here. Also, protobuf doesn’t provide an operator== for the classes that it generates, so I had to write the comparison code inline as well. This awkwardness is actually the main reason why I wouldn’t use protobuf. I don’t want low-quality classes that constantly require small bits of wrapper code all throughout my codebase. On top of that, it is also really slow:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
ProtobufReading 22447402 20409829

Since protobuf generates its own classes, it should be able to generate really fast code for reading and writing. I have no idea why it is so much slower than Boost Serialization. That being said, since protobuf is so slow, Google has a second library for reading and writing called Flatbuffers.

Flatbuffers

Flatbuffers also requires a description of your class in a separate type system. It looks like this for the above data:

namespace test_flatbuffers;

struct Vec4
{
  x : float = 0;
  y : float = 0;
  z : float = 0;
  w : float = 0;
}

table Monster
{
  vec : Vec4;
  i : int = 0;
  f : float = 0;
}

table Array
{
  array : [Monster];
}

root_type Array;

My class is named “Monster” because that is the name in the flatbuffers example. I used a “table” for the class because one cool feature of flatbuffers is that it doesn’t save members that have the default value. Meaning if you never assign a value to a member, it also doesn’t have to be saved. Unfortunately this feature doesn’t seem to work for struct members, which means that even if I never change the “vec” member, it will still always be written out to disk.

The flatbuffers interface is also a little bit curious due to the differences between structs and tables. It took me a really long time to write a simple test case. Again, it’s a symptom of the problem that they decided to use their own type system and to generate their own classes, and they are just not doing a good job of generating quality code. Here is what my test case looks like:

void FlatbuffersReading(benchmark::State & state)
{
    namespace fb = flatbuffers;
    namespace tfb = test_flatbuffers;
    std::string filename = "/tmp/flatbuffers_test";
    {
        fb::FlatBufferBuilder builder;
        std::vector<fb::Offset<tfb::Monster>> monster_array;
        for (int i = 0; i < 100000; ++i)
        {
            tfb::Vec4 vec(0.0f, 0.0f, 0.0f, 0.0f);
            monster_array.push_back(test_flatbuffers::CreateMonster(builder, &vec, i % 5, float(i)));
        }
        auto vec = builder.CreateVector(monster_array);
        tfb::ArrayBuilder array(builder);
        array.add_array(vec);
        builder.Finish(array.Finish());
        std::ofstream file(filename);
        file.write(reinterpret_cast<const char *>(builder.GetBufferPointer()), builder.GetSize());
    }

    while (state.KeepRunning())
    {
        MMappedFileRead file(filename);
        const tfb::Array * read_array = tfb::GetArray(file.get_bytes().begin());
        auto array = read_array->array();
        int count = 0;
        for (auto it = array->begin(), end = array->end(); it != end; ++it)
        {
            RAW_ASSERT(it->vec()->x() == 0.0f);
            RAW_ASSERT(it->vec()->y() == 0.0f);
            RAW_ASSERT(it->vec()->z() == 0.0f);
            RAW_ASSERT(it->vec()->w() == 0.0f);
            RAW_ASSERT(it->f() == float(count));
            RAW_ASSERT(it->i() == count % 5);
            ++count;
        }
        RAW_ASSERT(count == 100000);
        file.close_and_evict_from_os_cache();
    }
}

One feature of flatbuffers is that it doesn’t actually “load” anything. It just creates a mapping onto the data that you provide it. In my case, since I mmap the file, this code would actually run really fast if I never used the data. If I remove the comparison code from the above test, my measurements are this:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
FlatbuffersReading 208817 9814

Flatbuffers is insanely fast here because all it does is create a view onto the mapped data. As soon as I add the comparison code like in the above code snippet, I get these measurements though:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
FlatbuffersReading 8954969 2633689

Which puts it at about the same speed as boost serialization. That is only because I use the data just once, though. If I run the comparison code twice, flatbuffers runs more slowly than boost serialization. That’s because the flatbuffer types are really fast to load (in fact they load in zero time as long as you never use them) but slow to use. Here is the comparison to boost serialization when checking the results twice:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
FlatbuffersReading 10637846 4099547
BoostSerializationReading 9404415 8649244

So it’s not super slow, but it’s noticeably slower than accessing normal C++ objects.

One important feature of flatbuffers that I couldn’t take advantage of above is that it can skip default values. In my comparison data the vec4 always has default values, but flatbuffers doesn’t seem to be able to exploit that for struct members. If I make all the values default values, though, the speed for loading and checking the results once is this:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466284 464406
MemcpyReading 5580953 1627145
FlatbuffersReading 6930696 2155805

This makes flatbuffers really interesting if your use case has a lot of default values or if you only have to look at parts of the data. If you’re careful with your data access, flatbuffers might only load what you access. (what actually gets loaded depends on the OS paging if you use mmap like I did)

Writing Our Own

Overall the above solutions are all disappointing though. I always had the memcpy comparison in there to show what the optimal number is. Both flatbuffers and protobuf should have an easy time reaching that number because they can generate their own code. So let’s try to write something that reaches that optimal number. To start, I will try to use an approach similar to boost serialization, but I will mmap the data in instead of going through std::istream. I also want an interface that is flexible enough that it can be used for a human readable json file, so I use macros. The solution that I end up with requires that I write the following function somewhere:

REFLECT_CLASS_START(memcpy_speed_comparison, 0)
    REFLECT_MEMBER(vec);
    REFLECT_MEMBER(i);
    REFLECT_MEMBER(f);
REFLECT_CLASS_END()

This generates a function very similar to the boost serialization function above, but since I use macros I also have the names of the members, which makes it possible for me to write out a json file. For now though I will test the binary format. The number zero in the macro REFLECT_CLASS_START() above is the version number. I want to offer similar versioning support to boost serialization, and for that I need the current version somewhere. That being said, my binary format does not support versioning. The way this will work is that json will be used to save all data while it is still being edited. Then there will be a compile step which loads the json data and writes out the binary data. (it’s a tiny program that literally just calls read_json followed by write_binary) After that compile step the data can be loaded very quickly, without the overhead of the version support. So I aim to provide all the functionality of the other libraries, just not in one file format. The above description is minimal enough that I have the flexibility to do this. I could even generate mappings for other programming languages from the above code. After all, it’s really just a reflection system.
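
To give an idea of the shape this takes, here is a minimal sketch of how such macros could expand (an illustration, not the exact implementation in my library): the result is a function that hands every member, together with its name, to a visitor. A binary reader or writer ignores the names; a json writer uses them.

// illustration only: one possible expansion into a visitor function. The
// version number is ignored here to keep the sketch short.
#define REFLECT_CLASS_START(type, version) \
    template<typename Visitor> \
    void reflect_members(type & object, Visitor && visit) \
    {
#define REFLECT_MEMBER(name) \
        visit(#name, object.name)
#define REFLECT_CLASS_END() \
    }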

Here is the benchmark code for my binary format:

void ReflectionReading(benchmark::State & state)
{
    std::string serialization_filename_fast = "/tmp/serialization_test_fast";
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    {
        std::ofstream file(serialization_filename_fast);
        metaf::BinaryOutput output(file);
        metaf::write_binary(output, elements);
    }
    while (state.KeepRunning())
    {
        MMappedFileRead file(serialization_filename_fast);
        metaf::BinaryInput input = file.get_bytes();
        std::vector<memcpy_speed_comparison> comparison;
        metaf::read_binary(input, comparison);
        RAW_ASSERT(comparison == elements);
        file.close_and_evict_from_os_cache();
    }
}

And here is how fast this is:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 484328 484871
MemcpyReading 5590015 1616091
ReflectionReading 5672753 1851692

It’s pretty much as fast as memcpy. To see how fast this really is I will try to get rid of the file overhead and just serialize to memory and read from memory. Here is my test for that:

void ReflectionInMemory(benchmark::State & state)
{
    std::vector<memcpy_speed_comparison> elements = generate_comparison_data();
    std::stringstream buffer;
    metaf::BinaryOutput output(buffer);
    metaf::write_binary(output, elements);
    std::string in_memory = buffer.str();
    while (state.KeepRunning())
    {
        metaf::BinaryInput input({ reinterpret_cast<const unsigned char *>(in_memory.data()), reinterpret_cast<const unsigned char *>(in_memory.data() + in_memory.size()) });
        std::vector<memcpy_speed_comparison> comparison;
        metaf::read_binary(input, comparison);
        RAW_ASSERT(comparison == elements);
    }
}

And here is how fast this runs:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 484328 484871
MemcpyReading 5590015 1616091
ReflectionInMemory 598699 602207
ReflectionReading 5672753 1851692

Remember that the MemcpyInMemory test just calls the std::vector copy constructor. Turns out you can get performance pretty close to that even when going through a stringstream. What is important to note here is that while I am going through the std::ostream interface for writing, I am not going through the std::istream interface for reading. For reading I am using an ArrayView class. (or array_ref or span or whatever it will end up being called once it’s standardized) So there are no virtual functions and the optimizer can go to town. In fact when you look at the assembly of this code, it’s pretty close to optimal. It’s just assigning the ints and floats that are members of the objects directly. I still haven’t shown the code for this, but remember that all I have is the REFLECT_MEMBER() calls from above. It should be straightforward to see how to expand that macro to assign from an ArrayView to those members with optimal assembly.
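
Here is a minimal sketch of what that reading side could look like, reusing the reflect_members expansion sketched earlier (again an illustration, not the exact code from my library):

#include <cstring>

// in this sketch a BinaryInput is nothing more than a cursor into the mapped
// bytes; reading a trivially copyable member is a memcpy plus a pointer bump,
// which the compiler turns into plain loads and stores
struct BinaryInputSketch
{
    const unsigned char * current;
    const unsigned char * end;
};

template<typename T>
void read_trivial(BinaryInputSketch & input, T & target)
{
    memcpy(&target, input.current, sizeof(T));
    input.current += sizeof(T);
}

void read_binary(BinaryInputSketch & input, memcpy_speed_comparison & object)
{
    reflect_members(object, [&](const char * /*name*/, auto & member)
    {
        read_trivial(input, member);
    });
}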

Faster than Memcpy

One thing that is interesting is that even just with this code, I can be faster than memcpy. Let me make one change to the class that I’m working with:

struct memcpy_speed_comparison
{
    alignas(16) float vec[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    int i = 0;
    float f = 0.0f;
};

Just by adding alignas(16), my code is faster. Why would you align the float[4]? Because then you can use SIMD instructions directly without having to load the data unaligned. Here is the performance with this change:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 682278 680601
MemcpyReading 7601910 2309319
ReflectionInMemory 906126 906933
ReflectionReading 6030785 2210939

Why is my solution faster? Because now there are eight bytes of padding at the end of the struct. The memcpy solution has to copy those bytes: it has no type information and doesn’t know what it is copying, so it has to copy the padding bytes too. My method ignores the padding because I didn’t expose it to the system. Because of that it has to copy fewer bytes from disk, and in this benchmark the time spent loading from disk dominates.

This is relevant even if you don’t have padding bytes. One approach that game developers use sometimes is what’s called “memcpy followed by pointer patching”. You use it if you are copying structs that are not just PODs. The way it works is you memcpy in the whole struct, and then you do a second pass over the data to initialize all of your pointer values, because you can’t persist pointers to disk using memcpy. So you write offsets into your pointer members and then patch those offsets once loading has finished. Except there are always edge cases where you don’t want to load a member from disk but want to use custom logic to initialize it later. Depending on how many of those members you have, this can be faster since memcpy still has to copy them in and my method can ignore them.
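
A sketch of the pointer patching idea, with a hypothetical node type (this is not code from this post):

#include <cstddef>
#include <cstdint>

// hypothetical example: on disk the "next" pointers are stored as byte offsets
// from the start of the loaded block, and a second pass after the memcpy turns
// them back into real pointers
struct node
{
    int value;
    node * next;
};

void patch_pointers(unsigned char * block_start, node * nodes, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        uintptr_t offset = reinterpret_cast<uintptr_t>(nodes[i].next);
        nodes[i].next = reinterpret_cast<node *>(block_start + offset);
    }
}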

Compressing Data

If you look at the measurements above, you will notice that most of the time for reading is spent idle. Out of the 6ms that it takes to load using reflection, the CPU is active for only 2.2ms. The remainder of the time it is just waiting for the disk. So if we could make the data that we load a little bit smaller, we could spend quite a lot of CPU time decompressing the data, and loading should be faster overall.

LZ4

LZ4 is a compression library that promises an “extremely fast decoder”, which sounds exactly like what I need. I have to confess that I didn’t try any other compression library (Brotli and LZHAM also claim to decompress quickly), but the reason for that is that I got very good results from LZ4, and that LZ4 just comes across as a high quality library. (it was easy to integrate and has simple, clean function interfaces) Meaning I am very happy with it and didn’t feel the need to investigate anything else.
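
For reference, the integration has roughly this shape when using LZ4’s one-shot functions (a sketch, not the exact code from my benchmarks):

#include <lz4.h>
#include <string>
#include <vector>

// compress an already serialized buffer before writing it to disk
std::vector<char> lz4_compress(const std::string & serialized)
{
    int max_size = LZ4_compressBound(int(serialized.size()));
    std::vector<char> compressed(max_size);
    int written = LZ4_compress_default(serialized.data(), compressed.data(),
                                       int(serialized.size()), max_size);
    compressed.resize(written);
    return compressed;
}

// decompress it again after reading; the uncompressed size has to be stored
// somewhere alongside the compressed bytes, for example in a small header
std::string lz4_decompress(const std::vector<char> & compressed, int uncompressed_size)
{
    std::string result(uncompressed_size, '\0');
    LZ4_decompress_safe(compressed.data(), &result[0], int(compressed.size()), uncompressed_size);
    return result;
}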

So what happens when we compress our data before writing it to disk and decompress it when reading it back? Things get a lot faster:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 466064 466548
MemcpyReading 5457860 1358706
MemcpyCompressedReading 2210455 1407887
ReflectionInMemory 597280 597650
ReflectionReading 5553340 1535985
ReflectionCompressedReading 2321451 1456029

(I undid the alignas(16) change from above before running this)

I expected CPU time to go up and overall time to go down. Overall time did go down, but CPU time didn’t go up as expected. I think the reason is that since I’m reading fewer bytes from disk (my test data compresses very well, down to 500 KB from 2.4 MB), I’m saving a lot of CPU instructions simply because I have to read less. So even though LZ4 requires a bunch more processing, I’m saving CPU time in other places and I’m coming away from this with unchanged CPU time. (this does suggest that I should investigate other compression libraries which compress better at the cost of a slower decoder, because there should be further gains to be had here, but I’ll leave that as an exercise for the reader)

I wouldn’t always expect to get gains this strong. My test data is pretty simple. This is the point where I should run this on real data, but I don’t have real data readily available. If I try to randomize the data I can see that the amount of compression varies with how much randomness I put in. If I just use a std::uniform_real_distribution, the compressed version is slower than the uncompressed version. If I use a std::binomial_distribution or a std::geometric_distribution to better approximate real world data, I still see big benefits. But at that point I’m just testing my assumptions about compression, which isn’t terribly useful for real world performance. So I’ll have to decide on some real data to use and run this again at some point.

Type Aware Compression

One thing that C++ has taught us is that you can write faster code if you have type information. That’s why std::sort is faster than qsort. So I wanted to apply that idea to this problem. Let me start off by saying that it didn’t entirely work: my type aware compression attempts are faster than plain memcpy, but they are not faster than memcpy with LZ4 compression. Still, I think this is interesting, and I’ll definitely revisit these approaches when I’m working with real data to see how they behave there.

Most Ints Are Small

My first method of compression is to store any int that fits into seven bits as only one byte. I’m using a method similar to utf8, where the first seven bits contain data, and the last bit indicates whether there is more data to come. The result of this is that 32 bit integers can be anywhere between 1 and 5 bytes in size, and 64 bit integers can be anywhere between 1 and 9 bytes in size. I can only think of one case where you’re going to hit the worst case, and that is when storing hashes. For that I would eventually have some kind of type tag to indicate that hashes should just be stored without compression so that they will always use four bytes instead of five bytes. But for all other ints, you will usually benefit a lot from this compression.
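
A sketch of that encoding for unsigned 32 bit ints (the same idea as protobuf’s varints; not the exact code from my library):

#include <cstdint>
#include <vector>

void write_compressed_uint32(std::vector<unsigned char> & out, uint32_t value)
{
    // seven bits of payload per byte; a set high bit means another byte follows
    while (value >= 0x80)
    {
        out.push_back(static_cast<unsigned char>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value));
}

uint32_t read_compressed_uint32(const unsigned char *& in)
{
    uint32_t result = 0;
    for (int shift = 0; ; shift += 7)
    {
        unsigned char byte = *in++;
        result |= uint32_t(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0)
            return result;
    }
}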

Most Floats Are Also Small

Floating point numbers are a bit more complicated: it’s difficult to store them with incremental precision like I did for ints. A single precision float consists of one sign bit, eight exponent bits and 23 mantissa bits. If the eight exponent bits are all 1, that indicates that the number is either NaN or infinity. So if I drop support for NaN and infinity, I can instead use that special case to indicate that this is a compressed float.

So I first read sixteen bits of a float, which includes the sign bit, the eight exponent bits, and seven bits of the mantissa. If one of the exponent bits is zero, I read the remaining sixteen bits and assign the whole float. If they are all 1 however, I know that there is a compressed float stored in the seven bits of the mantissa. A lot of useful numbers can be expressed with that, like 0, 0.125, 0.5, 1, 2, 3, 4, 10, the negative versions of all of those, and 240 other numbers close to zero. So I am satisfied that my compression case will be hit often enough. In the case where it doesn’t get hit, the float will still only use four bytes on disk and I will just waste a few CPU cycles that I would otherwise spend waiting for disk.

Once I defined this eight bit float format, (sign bit plus three bits exponent plus four bits mantissa) the question arose of what to do with denormalized floats in this new format. I could use them to indicate special common values like 0.1, and I could also use them to support NaN and infinity again. I have an implementation of supporting NaN and infinity using that method, but I don’t have it enabled by default. The reason for that is that I have to be a bit careful with code size. Any time that I increase the code size, the inliner changes behavior. Since my code needs to be inlined to get its performance, I need to be careful.
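
To make the format concrete, the decoding path might look roughly like this. This is a sketch, not the code from my library, and the exponent bias of the small format is a detail I haven’t specified above; a bias of 4 is assumed here because it covers the example values listed earlier.

#include <cmath>
#include <cstdint>
#include <cstring>

// decode the eight bit float: 1 sign bit, 3 exponent bits, 4 mantissa bits,
// with an assumed bias of 4 (covers 0.125, 0.5, 1, 2, 3, 4, 10 and so on)
float decode_minifloat(unsigned char bits)
{
    float sign = (bits & 0x80) ? -1.0f : 1.0f;
    int exponent = (bits >> 4) & 0x7;
    int mantissa = bits & 0xf;
    if (exponent == 0)
        return sign * std::ldexp(mantissa / 16.0f, 1 - 4); // denormals, including zero
    return sign * std::ldexp(1.0f + mantissa / 16.0f, exponent - 4);
}

float read_compressed_float(const unsigned char *& in)
{
    // the first sixteen bits are the top half of a regular float: the sign
    // bit, all eight exponent bits and the top seven mantissa bits
    uint16_t high_bits;
    memcpy(&high_bits, in, sizeof(high_bits));
    in += sizeof(high_bits);
    if ((high_bits & 0x7f80) != 0x7f80)
    {
        // normal case: read the remaining sixteen bits and reassemble the float
        uint16_t low_bits;
        memcpy(&low_bits, in, sizeof(low_bits));
        in += sizeof(low_bits);
        uint32_t all_bits = (uint32_t(high_bits) << 16) | low_bits;
        float result;
        memcpy(&result, &all_bits, sizeof(result));
        return result;
    }
    // compressed case: the exponent bits are all ones, which a regular float
    // can never have here, so the sign bit plus the seven spare bits hold the
    // small float
    return decode_minifloat(static_cast<unsigned char>(((high_bits >> 8) & 0x80) | (high_bits & 0x7f)));
}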

I have not written compression code for double precision numbers yet. That is probably also worth doing; I just haven’t thought about it yet.

Most Values Are Never Changed

The final method of compression that I wrote is the same one that flatbuffers uses: if a value is never changed, don’t write it to disk. This is especially useful for big classes that have grown over the years. For example, in Bullet Physics a rigid body can be marked as “static”, in which case many of its members can never be modified. And the majority of physics objects are static objects. Wouldn’t it be nice if all the meaningless members that are never changed just weren’t saved and loaded?

The question is how can I tell whether a member has been modified? Easy: Serialize out a default constructed struct and remember what all the values were. If a member on another struct has the same value as it had on the default constructed one, I don’t need to save it.

For this I store a small bitset with each struct that indicates which members are saved on disk and which aren’t. For small structs this bitset is just one byte; if you have more than eight members it’s two bytes, more than sixteen members four bytes, and more than 32 members eight bytes. I don’t support more than 64 members. If you have more members than that, you have to create a nested struct. It’s not super complicated to support more than 64 members, but for now I just made my life easy.
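
As an illustration of the idea, here is a sketch of the writing side, hand-written for the test struct rather than generated from the macros (the reading side mirrors it: default construct the object, read the bitset, then read only the members whose bit is set):

#include <cstring>
#include <vector>

void append_bytes(std::vector<unsigned char> & out, const void * data, size_t size)
{
    const unsigned char * bytes = static_cast<const unsigned char *>(data);
    out.insert(out.end(), bytes, bytes + size);
}

// one bit per member says whether that member follows in the stream; members
// that still have their default constructed value are skipped entirely
void write_changed_members(std::vector<unsigned char> & out, const memcpy_speed_comparison & object)
{
    static const memcpy_speed_comparison defaults{};
    unsigned char changed = 0;
    if (memcmp(object.vec, defaults.vec, sizeof(object.vec)) != 0) changed |= 1;
    if (object.i != defaults.i) changed |= 2;
    if (object.f != defaults.f) changed |= 4;
    out.push_back(changed);
    if (changed & 1) append_bytes(out, object.vec, sizeof(object.vec));
    if (changed & 2) append_bytes(out, &object.i, sizeof(object.i));
    if (changed & 4) append_bytes(out, &object.f, sizeof(object.f));
}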

Performance of Type Aware Compression

If I change my test case so that the float[4] member is never changed and so that both the int and the float are filled with random numbers that mostly benefit from my compression, I get this speed:

Benchmark Time(ns) CPU(ns)
MemcpyInMemory 482933 483374
MemcpyReading 5467386 1410453
MemcpyCompressedReading 2525274 1779318
ReflectionInMemory 1899832 1903480
ReflectionReading 2692708 2015040
ReflectionCompressedReading 2953027 2203459

Look at how close ReflectionReading now is to MemcpyCompressedReading. ReflectionCompressedReading became slower because now I’m compressing the same data twice and paying for the overhead of that. The biggest speed-up comes from the fact that two thirds of the data is simply never written to disk because it is unchanged, so it doesn’t have to be loaded back either. Is that a realistic number? I think it is, but it depends on your data. The LZ4 compressed version is also benefiting from that.

So type aware compression gets pretty damn close to type-unaware compression. But it doesn’t beat it. I have ideas for how to improve this further: Right now I’m looking at each object separately. If I could just say “object number 500 is exactly like object number 210 except the transform has changed” then I could compress this a lot more. But generating the code for that is difficult with what I have available at compile time in C++. I’m also thinking about possibly solving that at a higher level. But right now I don’t have a good approach for this idea.

But writing these has been a lot of fun. It’s really rewarding to do micro optimizations like shaving bytes off of an int. There are so many more ideas for this: you could compress strings based on what they are used for (paths, menu options and scripts all use different character sets), you could save a matrix as a vector, quaternion and scale and save six floats that way, or you could compress normals (graphics programmers really like doing this). That last one is a lossy compression, which LZ4 is not allowed to do. If you know that’s OK for your data, you can save even more bytes.

But for now I admit defeat and say that I recommend using LZ4.

Optimizing the Other Libraries

Could boost serialization, protobuf and flatbuffers also benefit from these optimizations?

Since my library has a similar implementation to Boost Serialization, I believe that all the same optimizations would also apply to that library. The main difference between my library and Boost Serialization is that I support writing a human readable json, and that I don’t support versioning in my binary format. Neither of these is very important for the above optimizations.

Flatbuffers would have a hard time because it wants to do zero processing on loading. I believe that it would still see significant speed ups if it used the type aware compression to compress its ints and floats, but that speed up would turn into a slowdown if you use the members of the loaded data more than once.

Protobuf could use the same optimizations. In fact protobuf should be the fastest of all of these libraries because it can generate its own code. I have no idea why it is so slow. Maybe if I had looked at other libraries that generate custom loading code, I would have found one that is as fast as memcpy. But even then, if you apply the above optimizations to a library like that, what is the point? The big trade-off of a library that generates classes for you is that you have to give up code quality in order to get fast loading and saving as well as support for multiple languages. If I can make loading and saving just as fast for any custom class, why would I use a library that generates code? And I am also convinced that I can generate support for other languages from my macros. I haven’t written that code yet, but it seems straightforward. As evidence for this, it should be easy to see that I can just write out a description of all the members, so if I wanted to do this cheaply I could generate the protobuf file from my macros. Then all I would have to do is write loading code that understands the protobuf file format. So I honestly can’t think of a reason why you would use a library that generates code for you based on a description in a parallel type system. Just use normal classes and write high quality code.

Library

I have uploaded the code for all of this to github. Unfortunately I can’t claim that this library is finished. It works, but it needs a few good clean-ups. There is also no documentation. I wouldn’t recommend that you use it in production. That being said, for me personally the next step is to use it in production, because I believe that only by taking that step can I get the quality to where it has to be. The best way to get a proof of concept to a releasable quality level is to just use it and see what issues you run into. But anyone else should consider this as just that: a proof of concept.