It’s a very interesting idea: The problem is that states often lower taxes with the hope of attracting business or talent, but there is very little evidence about whether that actually works. So the authors of that paper decided to find a group of influential people who are somewhat easy to track: people who apply for lots of patents, the so called “star scientists” from the title. So the authors built a huge database, tracking where the top 5% of scientists who applied for the most patents had moved to over the years.

And the authors claim that they found pretty clear evidence that people like to move from high-tax states to low-tax states, so the conclusion is that if you want to attract top scientists, you should lower taxes.

Except, I dug through the data and I found the opposite. Yes, top scientists do move to states that have lower taxes, but high tax states have such a large lead in the number of scientists, that that little bit of migration doesn’t matter. But we’ll have to get to that conclusion one step at a time.

First some doubt: The “lower taxes attract business and talent” argument is clearly true at some point: If in one state you had to pay 90% taxes and in another 10% taxes, of course you’d move to the state where you have to pay 10% taxes. But if in one state you’d have to pay 35% and in the other you’d have to pay 32%, would you really decide based on that? Or would you decide based on something else, like the price of real estate, the quality of life, quality of schools or quality of the work that you’d be doing?

Reading the above article, there are a few reasons to be suspicious: The top two hacker news comments about the article point out two big mistakes: 1. The authors ignore sales tax. 2. The authors didn’t measure where people choose to locate, they measured where people migrate. Which is a subtle but very important difference.

The other reason to be suspicious is that the article reminded me of this piece, which talks about the “Rich states, poor states” index by a conservative think tank. In that index the authors find a bunch of indicators that allow them to claim that states that tax highly are not doing well and states that have low taxes are doing great. For example they claim that New York and California have been doing terribly for the last ten years, because lots of people are moving away from those states. They apparently never paused to think that if New York and California are doing terribly in an index called “Rich states, poor states”, you probably have it exactly backwards…

So when I saw that same measure of “migration” being used again to claim that states with low taxes attract lots of scientists, I became suspicious…

But I do have to say one very positive thing about the authors: They made their data available. Unfortunately the paper is behind a paywall, but the data is available for download for free. So I looked at the data.

One thing I’ll say at the beginning is that for some reason the authors’ data seems to go wonky somewhere around the year 2003. At that point the number of scientists takes a downward turn in all states until it approaches zero in the year 2009. I don’t know why that is. Maybe something in the way they collected the data. Because of that I chose to put the end point for all my analysis at the year 2003.

With that out of the way, let’s plot where all the scientists were in 1977:

OK in 1977 California had the most scientists, followed by New Jersey, New York, Illinois and Pennsylvania. I was very surprised to see New Jersey up there. I don’t think of it as a center of science these days. But I did expect California and New York to be up high.

Now for comparison here is the same graph for 2003:

In 2003 California completely dominates. It has by far the most scientists. New York is a distant second, followed by Texas.

So obviously the big story over the observed time period is the rise of Silicon Valley in California. This is our first indication that something is not right about the story of scientists moving to low-tax states. California is not exactly known for having low taxes, and seeing how most of the growth happened in California, where did scientists move to low-tax states? Sure there are lots more scientists in Texas and Washington, but then we also saw healthy growth in New York, another high-tax state.

So what did they actually claim in the article? Their main claim is:

Our empirical analysis uncovers large effects of personal and business taxes on star scientists’ migration patterns. The probability of moving from an origin state to a destination state increases when the ‘net-of-tax rate’ (after-tax income) in the latter increases relative to the former.

And they use these plots to back that claim up:

The description in the article reads “Each panel plots out-migration (measured as the log of the ratio of out-migrants to non-movers) against the net-of-tax rate (which is simply one minus the tax rate, thereby representing the share of a dollar’s income that is retained after taxes). The data underlying the figures represent changes in out-migration and changes in net-of-tax rates relative to their average levels over 1977-2000.”

Now I have looked at these graphs for quite a while, and I can honestly say that I have no idea what they mean. The first thing they do is measure out-migration by measuring how many scientists move, compared to how many stay, and then they take the log of that. Why the log? I have no idea, but it makes the y-axis completely unreadable. They also claim to have taken the log of the net-of-tax-rate, but if you take the log of something and arrive at 0.02, that means that the original value was greater than 1. How can the net-of-tax-rate be greater than 1? There are no states with negative tax rates. So I have no idea what the x-axis is either… I tried looking at the outliers to see if I could identify them by looking at the data. Surely California has to be an outlier on one of these graphs. But I honestly don’t know which one it is. Also there is only something like 40 dots. Shouldn’t there be one per state? What are these dots?

They make one other big claim in the article, which is that when you lower taxes, scientists will come to your state:

if after-tax income in a state increases by 1% due to a cut in personal income taxes, the stock of scientists in the state experiences a percentage increase of 0.4% per year until relative taxes change again.

[…]

when we focus on the timing of the effects, we find that changes in mobility follow changes in taxes, rather than preceding them. The effect on mobility tends to grow over time, presumably because it takes time for firms and workers to relocate.

In a sense this claim is even more important than the first one, because it gives us actionable advice: Lower taxes and you will get top scientists. Because of the paywall I don’t have access to the paper behind this article, but the appendix for the paper is available. Luckily the appendix contains a graph that backs up this claim:

So what this displays is that at the time of the red line the taxes decreased, and the number of scientists went up. (or the taxes increased and the number of scientists went down. Both of those events would make the line go up)

These are pretty strong effects. They claim that if you lower your taxes, you will get, on average, 100 star scientists coming into your state over the next ten years. I’m a bit worried about the size of the whiskers on those plots, but the measured effect is so big that I’m inclined to believe them. (also you would expect states to promote the fact that they lowered taxes, so you would expect business to move there)

So the first graph didn’t make sense to me, but this second one does. Before I investigate this claim though, I want to make a detour, because

I mean really, somebody did the work of tracking the population and movement of scientists over several decades, and they make all that data available. How can you not just want to dig in?

The first question that my graph from 2003 above throws up is this: Sure, California has a lot of scientists, but California is also the biggest state. Number 2 and 3 are New York and Texas, which are also the two next biggest states. So isn’t this just “states with a high population have lots of scientists”? I visualized that relationship in Gapminder, but I can’t embed custom HTML in this blog, so here is a picture:

So that’s a pretty clear relationship: The smaller your population, the fewer scientists you have. Sure, Florida is punching a bit under its weight and Massachusetts is punching above its weight, but overall there is a very clear trend here: If you’re further to the right, you tend to be higher up. And in the corner we see the giant outlier that is California. If you were to draw a trend-line through all other states, California would be way above it.

What if we account for this population/scientist correlation, and instead of drawing total number of star scientists, we draw number of scientists per million inhabitants?

Now we see which states are really doing a good job at this science thing: Idaho actually had a huge number of star scientists in 2003 relative to its population. It had 1004 star scientists for a population of 1.36 million, giving it more than 700 star scientists per million inhabitants. California is still clearly above average, but New York and Texas now look less impressive. They’re still doing good, but plenty of other states are doing better.

But I’m now convinced that this number is the right choice for the Y-axis. We need to remove the effect that the population has on the number of scientists. If we do that we can see which states are doing a good job or a bad job. With that, let’s try plotting the number of scientists against the taxes.

On the x axis I have the average income tax that somebody in the top 1% had to pay in that state. Why the top 1%? Because that’s what the paper chose to do. Their reasoning is that top scientists are more likely to be in the top 1%. In their own data, they found that 14% of the top scientists are in the top 1% in terms of income. Should we therefore use the tax rate that the top 1% has to pay? It’s not clearly the right choice, but it’s also not clearly wrong. If you’re going to make a choice of moving somewhere based on taxes, is it plausible that you’d look at the top tax rate? I think it is. Also one nice thing of using the tax rate for the 1% is that the difference in taxes is bigger there. (for reference: To be in the top 1% you had to make more than $380,000 in the year 2011. (the last year that they have data for) Apparently 14% of star scientists make that much money)

In any case if we sort these by tax rate, we see that a lot of states with high taxes are punching above their weight. Idaho, Vermont, Minnesota and California all have a lot of scientists relative to their population. In the low tax states we see that Washington state is doing pretty well. But overall there seems to be an upward trend here.

Ah, I hear your objection: I’m not looking at the sales tax here. Sure, Washington and Texas have no income tax, but they get a lot of their income from the sales tax. So what happens if we factor in the sales tax?

I added the sales tax to the income tax. That seems like a good approximation of what actually happens. When I do that the upward trend becomes even clearer. Washington is not as big of an outlier anymore.

I should mention that there are a few problems with this graph: The choice of using the income tax of the top 1% is now more questionable, because it exaggerates the size of the income tax relative to the sales tax. Also the sales tax is actually too small here because many cities have sales tax on top of that: Seattle, New York City, Houston, Austin, San Antonio and others have sales taxes on top of the state sales tax. But to be consistent with the original paper I chose to just use the same numbers.

At this point I would say that there is a correlation between having high taxes and having lots of scientists. Clearly just having high taxes is not enough. Look at Arkansas and Maine down there: High taxes but no science. Also I should say that this is still an approximation. New Hampshire is still listed as 0 taxes on the left, because it has no income and no sales tax. But clearly New Hampshire has to have some taxes. I just don’t have the numbers for that because I only have the data that the original authors collected. (or maybe they have the total tax numbers somewhere, but I don’t know what all their numbers mean) But with all these caveats out of the way, it is clear that the states with lots of scientists tend to have lots of taxes.

Now could this be a fluke because Idaho and Vermont are very small states, and there tend to be more outliers in small samples? Let’s find out by plotting the number of scientists again, as opposed to the number of scientists per million inhabitants:

Now California is our big outlier again, but even without it I still think there is a slight upward trend. Clearly we still have the high-tax-no-scientists states at the bottom right, but if you read from top to bottom, you get California, New York, Texas, Massachusetts and Minnesota as the top 5. Four out of five have high taxes, and Texas has a large population.

So now we know what the situation is at the end of the capture. But that doesn’t disprove the paper: It could still be that these states got all their scientists when they had lower taxes, and that the states have since raised the taxes.

That is actually the main reason why I did the work of adding the data to Gapminder: In Gapminder it’s very easy to find out which came first: Did the number of scientists go up first, or did the taxes go up first? Here is what that last graph looks like animated:

Watch this a couple times and see what you notice. It’s very noisy movement at first, but after a while you start to see patterns.

The first big pattern is that taxes used to be lower on average. Most states moved to the right. New York is unusual in that it moved to the left. The second pattern is that in the 90s, science went up pretty much across the board. It seems like there is more movement on the right side of the graph than on the left, but not by much.

To try to figure out the question of “does lowering taxes attract scientists,” let’s look at a couple of the big movers: California, Texas and Washington moved right before they moved up. New York moved left before it moved up. Massachusetts and Minnesota seem to walk left and right, but mostly stay in place while they go up. New Jersey moved to the right and down. The trends are ambiguous. This is weird because this is the graph where we should see the “if you lower taxes, you will get on average 100 scientists, if you raise taxes you will lose on average 100 scientists” result from the appendix of the paper that I posted above. A result that strong should be pretty clear, but the data is ambiguous at best, and maybe even points in the other direction.

Let’s look at the graph again where we measure scientists per million inhabitants:

In this one we can see the smaller states that are doing well again. Idaho and Vermont got lots of scientists in the 90s. That was buried in the graph above because these are small states. Both of these states moved to the right before moving up. On the other side we see Delaware which moved to the left, then hops up and down a bit but generally seems to move up.

So overall there is just no clear trend that would say that you lose scientists if you raise taxes and you gain scientists if you lower taxes. So what happened here? This is probably the most important claim from the article, and it should have been very clear, but it’s just not there.

The answer is in the number that they chose for their “Number of Stars Before and After Tax Change Event” graph: They chose the total income tax, not just the state income tax. Let me show you what it looks like when we plot the total income tax:

This looks pretty funny because all the dots are moving around together. Why? Because the total income tax is dominated by the federal income tax. Any time that the federal income tax went up or down, all the points move left or right.

With that knowledge we can explain the finding of the paper: In general the federal income tax went down over the observed period, and the number of scientists went up. With that alone you would find “after lowering taxes, the number of scientists goes up.” You’re measuring a time period where tax rates went down and US population went up. Those two have nothing to do with each other though. You’re just seeing baby boomers becoming great scientists at the same time as the tax rate goes down. (and the tax rate went down because it had been at historic highs after the world wars and several New Deal governments, none of which has anything to do with lowering taxes to attract science) So does lowering taxes lead to more scientists? I don’t know, but I do know that we can not infer it from this last animation, and unfortunately that’s what they did in that graph from the appendix.

So how do we figure out if people move from one state to another because of taxes? Well the authors of the paper collected thousands of migrations between states. So if we plot those against the difference in tax rate, we might be able to see a pattern. Let’s try to do that:

Here I plotted one dot for each pair of states where people moved from one state to the other. For example that highest dot is the migration of 112 scientists from New Jersey to California in the year 2000. It’s to the right of the center because they moved to a state with higher taxes. The second highest point is the migration of 104 star scientists from California to Texas in the year 2001. It’s to the left of the center because they moved to a state with lower taxes. All the other dots are similar pairs of migration. Most of them are pretty far down, because usually just a couple scientists move per year for any given pair of states.

So what can we see from this? It’s hard to see, but there is a slight tilt to the left. Since it’s so hard to see, I added a dot where the average value is. It’s slightly to the left of center. On average people move to a state that has a tax rate that’s 0.3% lower than the state that they came from. So if they came from a state where they paid 9% taxes, on average people will move to a state where they pay 8.7% taxes.

So I finally showed that yes, there is a tendency for people to move to states where they pay less taxes. But it also presents us with a problem: How can this be true at the same time as California has by far the most scientists while also having a high tax rate?

The answer is that those California scientists don’t come from migration. By the year 2003, a total of 11610 scientists had moved to California, and a total of 11048 scientists had moved from California. Which means net-migration to California over the entire observed time period is 562 scientists. California started with 3151 scientists in 1977, yet somehow California had 14428 top scientists in 2003. Which means we have 10715 unaccounted scientists that had to have come from somewhere. And that brings us to the title of this blog post: Where do top scientists come from?

I think the answer lies in the definition chosen for the paper. They used the top 5% of scientists who submitted the most patents. How do you get to be in that group? Well you start off with no patents, then with your first patent you get to be in the bottom 95% and then you work your way up. So if you are a state that wants to have a top scientist, what is the best way to get one? You could either try to entice one to come from a different state, or you could look at some of the bottom 95% of scientists that already live in your state, (or students or young scientists without any patents) and you could try to turn one of them into a top scientists.

Which of those paths is more common? For the California example above clearly the second path is much more common. California got more than 10,000 top scientists from within California, and only 562 scientists from outside California. But maybe the low tax states got lots of scientists from migration. And which state lost the most to migration? New Jersey maybe? Lets find out:

Here I plotted the number of top scientists who migrated to or from a state on the y axis, against the taxes on the x axis. And wow, New York lost a lot of scientists. It lost nearly 1000 scientists to migration over the observed time period. New Jersey was also hit pretty bad, losing 748 top scientists between 1977 and 2003. California, with a gain of 562, gained the most. But with the exception of California there seems to be a downward trend here: The more you’re to the left, the more likely you are to have gained scientists from migration. The more you’re to the right, the more likely you are to have lost scientists to other states. The top five states are now California, Texas, Florida, Washington and North Carolina. Three out of those five have low taxes.

Let’s also plot the migration per million inhabitants, so that small states can get a chance to shine if they did well relative to their size:

New Jersey does even worse now. It lost 87 scientists per million inhabitants. But we have a new worst performer with Delaware, which lost 130 top scientists per million inhabitants. California isn’t looking all that impressive now, and the states who are doing best are New Hampshire, New Mexico, Wyoming, Nevada and Washington. Four out of five are low tax states. We also have a new bad high tax state with Idaho, who lost 55 scientists per million inhabitants in the time period from 1977 to 2003. So it seems pretty clear at this point that top scientists do migrate from tax-heavy states to low tax states.

But wait a second, wasn’t Idaho doing very well above? How can it lose scientists to migration and still be one of the top states in terms of scientist per million inhabitants? To resolve that discrepancy we have to look at a second number: The number of homegrown scientists. For that I take the number at the end of the observation, subtract the number at the beginning of the observation and subtract the number gained from migration. I ran through the example for California above and ended up at 10715 homegrown scientists. Here is what it looks like for all states:

Now New York is looking pretty good again. It had 2490 scientists in 1977, grew to 4260 scientists in 2003, all while losing 998 scientists to migration to other states. That means that 2768 new top scientists came from New York in the time period from 1977 to 2003. That’s very impressive. The top five states that generated the most scientists are California, New York, Minnesota, Texas and Washington. Once again we see that big states are over-represented, so here is the same graph for “homegrown scientists per million inhabitants”:

Idaho does great again now. So does Vermont, Minnesota, Oregon and Washington. Comparing this to the graph of migration per million (two pictures up) we can see that many of the states who do very well here did poorly there: Idaho, Minnesota and New York for example. They generated a lot of scientists, and also lost a lot of scientists to migration. Ohio on the other hand is a state that isn’t doing well: It lost a lot of scientists to migration, and didn’t grow many.

Except, Ohio actually doing alright: It generated 34 top scientists per million inhabitants, and only lost 30 per million. Why does it look so badly in this graph? Because the scales are all different. Look at the difference in scales compared to two pictures up: The top states generated 200 to 800 scientists per million inhabitants, but in the migration graph higher up the top states only managed to attract 40 to 80 scientists per million inhabitants through migration.

So what does that mean? It means if you want top scientists, you’re probably better off helping your local scientists to become top scientists, rather than trying to attract them through migration. You simply can’t get the same numbers through migration that you can get from homegrown scientists.

So from this the data would suggest that

- High tax states seem to generate more scientists
- Some of those scientists move to lower tax states

So does lowering taxes attract scientists? Yes, but not many. States that have higher taxes seem to be able to get more scientists. Why? I don’t know. I mean there are some high tax states that are doing really badly. There are many possible theories that would lead from high taxes to many scientists. California for example has a really strong network of universities. One thing I need to point out though before we make theories is that this phenomenon of high tax states having more scientists is a recent development. At the start of the observation, we don’t see this. I showed the animation above, but let me freeze frame the “scientists per million” on the first year, 1977:

In 1977 there is no clear upward trend. Maybe there is a really slight upward trend, but really if there is a trend here I would say that states that are in the middle of the pack do well. If you have low taxes you do badly, and if you have high taxes you do well, but not as well as the people in the middle. But there are also a few things misleading about this graph. Delaware for example had less than 600,000 inhabitants at the time. It’s easier to be an outlier in the “scientists per million inhabitants” category when you’re small. Also the X axis is misleading, because federal taxes are missing. The income tax rate in New Jersey in 1977 was actually higher than in California in 2003. You can kinda see that in the animation higher up where I plot the “total income tax.” It’s the animation where the bubbles bounce left and right a lot. The only conclusion I can draw from this is that in the 70s high tax states didn’t have such a clear lead over the other states as they did in 2003, so for some reason whatever benefit that you get from higher taxes really took off in those 26 years.

At this point I could try to make theories for why high tax states generated more top scientists from 1977 to 2003, but I’ll leave that for others. I set out to investigate the question whether lower taxes lead to more top scientists, and I disproved that theory. There is clearly no correlation, and therefore there is no causation. In fact the correlation seems to go the other way, but theories for why that is will have to wait for another time.

In conclusion what I found is that the states that are best at generating scientists tend to have higher taxes. Some of the top scientists move from the generating states to lower tax states, but not nearly enough to make up the difference.

So if you want top scientists in your state, should you lower taxes to attract scientists? Maybe, but you can probably get more if you can figure out what those states on the right, the ones with higher taxes, have done between 1977 and 2003. It seems like you can get literally ten times more top scientists from growing scientists in your own state (200 to 800 scientists per million inhabitants) than you can get from migration from other states. (40 to 80 scientists per million inhabitants)

Whether you decide to raise taxes to invest them in science funding, or if you decide to lower taxes to try to get people to migrate to your state, be careful: There are low tax states which didn’t manage to attract migration, and there are high tax states that didn’t manage to generate scientists. The situation is more complex, and it can’t be controlled through taxes alone.

For further reading I have this article, which claims that most science gets done by big corporations. Which would certainly explain why California and New York are doing so well, but it just begs the next question: Why are there so many big companies in Silicon Valley and New York City? Also here is a piece that claims that 76% of venture capital funding goes to California, New York and Massachusetts. I have no idea why those three states in particular, but it also seems relevant… And finally I work in computer science, a field where like half of the really big inventions came from governments or universities, (or government-funded research at universities) so that would be the clearest link between taxes and top scientists. I’m sure somebody has numbers on where most government funding for research goes, and we could correlate that with this database of top scientists.

But for now I end this investigation here and I thank you for reading this.

I exported the Gapminder graph and uploaded the file here. You have to extract that zip archive somewhere and then open index.html. This file is probably what you want if you just want to play around with the data and look for correlations. For example one fun analysis that I didn’t include above is “maybe the higher tax states have a higher GDP per person and because of that they have more scientists.” That is interesting because there is no correlation between tax and GDP, but there is correlation between taxes and scientists, and there is correlation between GDP and scientists. Meaning A correlates with B and B correlates with C, but A does not correlate with C.

If you want more data, the authors of the original paper have made their data available here. There is much, much more data in that dataset than what I included in the Gapminder export.

I used this R script to generate pictures from the data and to convert it for Gapminder. To run the R script you have to change the “base_folder” variable at the top of the file to point to where you extracted the data from the paper. Also, after running the script, you have to manually open the file “for_gapminder.csv” that the script writes to the data/ folder, and remove the first column in there. I did that using sublime text, but you can also remove it using LibreOffice Calc. There is probably also some way to do it in code. After you have removed the first column, you can open the file in Gapminder.

]]>

I’m going to talk about video games, but this is also about games in general. Why do kids play with dolls? Because they want to learn about family life. (or about conflicts when playing with action figures) This is not explicit learning like we learn from a teacher, but you act out situations and adjust your behavior depending on how your play partner reacts. Why do we send our kids to football practice? Not because we think that they need to learn the valuable skill of kicking a ball into a net. No it’s because we want them to learn about working in teams and about pacing themselves and about playing fair and all that.

The things we learn are obvious in those scenarios. It’s well known that it’s important for kids to play in order to figure out how to act in the world in a safe environment. But I claim that the same thing is true for video games, and my example will be Super Mario World.

Why Super Mario World? Because I bought a SNES classic and I’ve been playing it a lot. Going back to it, I’m actually very impressed with the game, but that’s for another blog post.

Let me talk about a specific situation where I learned something small from Super Mario World: Me and my girlfriend currently have a friend staying with us for a week. Last Friday night we went out separately, and when me and my girlfriend came home we found our friend playing Super Mario World in the living room. We briefly made fun of him but then joined him. Very quickly you had three thirty year old adults sitting on the floor in front of the TV playing SNES like six year olds.

Our friend had remembered a secret that allowed him to skip half of world 3 and all of world 4 of the game. You just have to beat two really tough levels. Our friend had a very hard time at those levels, going through several continues and not making much progress. I tried to give him advice, but naturally I also started making fun of him: “this shortcut seems to take longer than going the normal route.” This went on for a while and I found new jokes to make because our friend kept on trying the same level, not making any progress. At some point my girlfriend got upset and told me to stop nagging him. So I scaled it back a little but didn’t completely stop. By the end of the evening my girlfriend was quite upset with me, so we had to have a conversation about this.

There’s a lot to learn in this simple social interaction, so let me walk through it.

First, why do we nag people? I was nagging my friend because I thought he was employing a bad strategy. I have played a lot of games and I know from experience that if you are stuck on something like that you can easily burn out on the game. Even if you eventually overcome the challenge, you might never turn on the game after that because you’re burned out. So the smarter thing to do is to go do something else for a while and then come back to the challenge, so I was trying to get my friend to take the “normal” route instead of the difficult shortcut. My friend thought he was showing perseverance, but I thought he was just being stubborn. And stubbornness leads to burn out which is the opposite of perseverance.

So why nag him instead of just telling him this outright? Because you can’t just tell people “you’re doing this wrong, you should do X instead.” That would introduce some power dynamics because suddenly I claim that I’m better than him, and the other person might get defensive. So instead you try to do a subtle intervention, and you nag him a little bit. He can either respond to it, or he can ignore it. And he can also easily tell me to stop at any time without anyone losing face. (and in this case it actually turned out that I was wrong since he has since beaten the game, so doing a subtle intervention like that is good because if it turns out to be a non-issue, then it’s good that you didn’t escalate it)

We never think about it this explicitly, (I didn’t think of this while I was nagging him) but that’s the kinds of dynamics that are going on here. It’s awkward to state all this and it’s even more awkward to say all of this in person. Since my girlfriend was upset about the nagging (this friend I’m talking about is more her friend than my friend) I thought that our friend might also be upset about it, so I tried explaining things the next morning. Turns out he wasn’t upset, but when I explained that I had learned the “difference between perseverance and stubborness” that I mentioned above, he got defensive. He said things like “I’m not trying to beat the game, I’m just playing for nostalgia.” I tried explaining that what I had learned wasn’t about that, but it wasn’t a good conversation to have so we dropped it. You can’t say these things directly, so the slight nagging I employed the night before was the best strategy.

Social rules like this are really complex and you can’t teach the subtleties directly, because we can’t even articulate all the subtleties. It depends so much on the situation. I claim that we learn these things from playing games, and that these things are the reason why we play games. We never learn them as explicitly as I stated them here, but we try to figure these things out. When should you be stubborn? When should you nag, when should you say things directly? The rules for these are complex, require a lot of experience, and you can’t transfer them directly, so you learn them through play.

In the conversation that my girlfriend and I had that night, she was upset at me: “Why did you have to nag him like that? He was just playing a game.” But my point is that the nagging has to happen precisely because he is playing a game. We all know people that are too stubborn in their job or in other competitive situations, but you can’t necessarily tell them that in those situations. Because in the real world the stakes are often higher, so if you come along and tell them that they are doing something wrong, there is even more tension. And if you nag them, they might snap back at you.

But then we also play games, and the hope is that if a person has a personality like that, you would also see that personality in the game. And since it’s “just a game” this is the perfect time to try to improve on these things. It’s a low-risk environment where we can easily practice our social interactions, and if there is an uneasy interaction, we will all forget about it within a week. Because it really wasn’t that bad, and that’s the whole point of doing this in the context of a game. All that happened was a bit of fine tuning in everybody’s interactions.

So then if games are about learning about ourselves like that, why is Super Mario World a good game? Because there is so much to learn there. It’s an amplified version of the real world: It’s a world rich of challenges, opportunities, risks and rewards.

You always encounter new levels and new enemy types, so you learn about how to approach new situations. How to be careful in a new situation. How to incrementally learn more about it and to explore safely. (but you also learn that often if you’re trying to be too safe, you don’t succeed either, so you can’t be too timid) There are levels where you have to be patient, levels where you have to be quick, levels where you have to be creative and levels that are about perseverance. There are situations where you have to grab an opportunity, and there are situations where the “opportunity” is actually a trap.

So you can learn and experiment with all those skills and you can try to fine-tune them. You can do this much more quickly than you can in real life, because there simply isn’t that much going on in real life. On top of that you can learn a whole different layer of skills when playing with friends. Whether you’re competing, cooperating, or taking turns, there is a lot of social behavior to practice.

These are all really valuable life skills and we learn them in games. This may sound too grandiose, because after all it’s “just a game” and I do agree that it’s weird to state these things so explicitly. It’s not like you’re going to become a really smart person because you found a smart trick in Mario to get extra lives. But if you watch kids play, you will see how bad they are at a lot of these things and how they get better the more they play. And I claim that the things you learn in the game transfer into real life. Obviously something like “here is a trick for easily beating enemy X” is not going to transfer into real life, but the creativity that you employed in finding that trick is going to transfer.

Our subconscious knows this and this is why it likes playing these games. To beat Super Mario World, you need to be good at a lot of high-level-skills like “perseverance,” “patience,” “curiosity” and “creativity” because otherwise you’re going to get stuck on a difficult level half-way through. When a game stops teaching us those meta-skills, it becomes less interesting. If all it presents us with is harder versions of the same problems that we have already solved, we get frustrated or bored. We don’t actually care about getting better at jumping or at killing enemies. So don’t just give us harder versions of that. We care about getting better at the higher level skills, so instead we want new situations where we can try what we have learned and refine it further.

When thinking about a new theory like this, it’s always fun to try to pattern-match it to see if it also explains other things.

One thing it explains is why we play less as we get older: As kids we have to learn so much about the world. All these high level skills have to be learned and fine tuned. But at some point we’re pretty good at them, and we’re learning less from playing new games. At that point we can either play more complex games like Dark Souls or Europa Universalis, or if we don’t have time for that we stop playing. The normal games stop being interesting for us for the same reason that playgrounds stopped being interesting for us: We have learned all we can learn from playing on a playground, and we have learned all we can learn from playing simple games.

This theory also explains why so many educational games are bad. They think that in order for games to be educational, they have to teach you something directly. So they make you do math exercises or something stupid like that. But they lose all the much richer learning about ourselves that happens in game like Super Mario World. It’s been said before, but “The Sims” is how you make a good educational game: The thing you’re learning is intrinsically interesting. It starts off with the same appeal as playing with dolls, where you learn about family life. And then you also learn about having a job, using money, building a house, getting better at skills etc. None of this is taught explicitly, and you’re not going to get any big wisdoms out of it, but you are going to learn and tune some of those implicit social rules that everybody has to learn to live in modern society.

So how do we make games like that? I don’t have the answer to that question yet. A few games came to mind while writing this blog post, but I don’t think the creators of any of those set out to make games where you develop as a person. What I do think though is that if at some point through the development of your game you realize that “this is a game about breaking through your perceived limits” (as in Dark Souls) then you can use that realization to make choices that make the game better. One example that I came up with is if The Sims is a game about learning how to use money (among many other things) then it should probably contain credit cards or student loans or some way to learn about debt and interest on debt. Obviously you don’t want to allow the player to get themselves into really bad situations, but you could for example have debt collectors come to the house and take the most valuable furniture. Would that be fun for the player? Maybe not. Would players enjoy the game more because they have more possibilities in the game? I think they might.

One thing I tried while writing this blog post is to use this angle to try to understand a game that I didn’t understand before. I chose Candy Crush Saga. Playing it, thinking “what do you have to learn to beat this game?” made me think that Candy Crush is a game about being lucky. It’s a game about setting up situations where luck can strike, and spotting lucky opportunities when they come about where you didn’t expect them. You know all those theories about slot machines being “due to hit” or lucky streaks and all that? In Candy Crush these really do happen, because the game is about setting up the board so that these do happen. Is that a valuable skill to learn? I think it’s a very valuable skill. I think being lucky is half of my success in life. You should read this article about the difference between people who consider themselves lucky and people who don’t consider themselves lucky. When I first read that article I realized how much of the good things in my life happened because of habits like those described in the article. Now is Candy Crush a good game to learn these skills? Not necessarily. I think it’s better at it than other match-3 games like Puzzle Quest, but all the free-to-play things in there with the mind tricks that try to get you to pay money are… problematic. But I now think that I could make a game based on Candy Crush that’s more explicitly about teaching luck to people, and I think it would be a good game.

So I think this is a useful theory. And hey, isn’t it great to realize that video games teach important life skills? If somebody ever complains about you “wasting time” because you’re playing video games, you can say “no I’m learning about my role in society, about balancing work, leisure and family and I’m experimenting with different life trajectories” (when playing The Sims) or “no I’m learning about dealing with frustration, exploring novel situations and finding alternate solutions when I’m stuck” (when playing Super Mario World) or “no I’m learning about focus, creative thinking and how to use both my active mind and my background mind to solve difficult problems” (when playing The Witness) or “no I’m learning about my limits and how seemingly impossible things can be achieved with practice, perseverance and attention.” (when playing Dark Souls)

The main part of this blog post is over, but I think it’s interesting how I can’t find people talking about this theory before me. It is well known that as kids, animals (including humans) play games in order to learn how to act. And they do this through play because that’s a safer environment where you can gather a lot of experience in a short amount of time. But for some reason that thinking hasn’t encompassed video games.

I was reading the book “The Art of Failure” by Jesper Juul while writing this blog post, because I thought it might have some thinking in this direction. The book description starts off with

We may think of video games as being “fun,” but in The Art of Failure, Jesper Juul claims that this is almost entirely mistaken. When we play video games, our facial expressions are rarely those of happiness or bliss. Instead, we frown, grimace, and shout in frustation as we lose, or die, or fail to advance to the next level. Humans may have a fundamental desire to succeed and feel competent, but game players choose to engage in an activity in which they are nearly certain to fail and feel incompetent. So why do we play video games even though they make us unhappy? Juul examines this paradox.

And in the book he talks about the paradox of failure, which is that we generally avoid failure, but we also seek out games where we fail. One thing the book talks about is how people generally like a game less if they beat it on their first try, compared to if they fail at least once and then beat the game. It all sounds like my theory of “we play games to learn” is a perfect explanation for this paradox. And the book talks about it, for example it has this quote from Ben Franklin:

The game of Chess is not merely an idle amusement. Several very valuable qualities of the mind, useful in the course of human life, are to be acquired or strengthened by it, so as to become habits, ready on all occasions… we learn by Chess the habit of not being discouraged by present appearances in the state of our affairs, the habit of hoping for a favourable change, and that of persevering in the search of resources.

But somehow, even though the book has that paradox of failure, and it has this quote by Ben Franklin, it never seems to say that the solution is that we play video games to learn these skills, to develop as a person. (just as it’s the reason why animals play) This was really puzzling to me. Even at the end of the book it feels like the authors are unsatisfied because they haven’t really solved the paradox. I wanted to shout at the author “you are so close, why don’t you just take this tiny step and state that this skill development is why we play games.”

I think one part of the book gives away why they don’t think this. Whenever the author talk about “learning” from video games, the author actually only talks about getting better at the game. Whenever he talks about video games, he doesn’t think that you can develop as a person, he thinks that you can only develop to get better at the game.

This became clear to me when the author talks about the difference between games of chance and games of skill. In that chapter he also talks about a third category of games, games of labor, which are games that primarily reward time invested. In those games the actions you perform are not particularly challenging, and you will mostly succeed if you invest a lot of time. Examples are World of Warcraft or Farmville. (it should be noted that most games are a mix of skill, chance and labor, these are just examples where the labor part is particularly strong) After explaining the distinction, the book states this:

For those who are afraid of failure, this is close to an ideal state. For those who think of games as personal struggles for improvement, games of labor are anathema.

This sentence only makes sense if the “improvement” that the author is thinking about is only in terms of getting better at the game. If the “improvement” was about developing as a person, then games of labor are as valid as games of skill or games of chance. Because in the real world there are a lot of activities that are simply a lot of work. For example manual labor on a farm like in Farmville is a lot of repetitive work. Not everyone can do that kind of job. But you can learn to get better at something like that by playing a game like Farmville, which teaches that for some things it’s not about your skill or intelligence, but it’s about putting in the time necessary to do something good. (but of course Farmville is probably not the best game to learn this, as it has all the same problematic elements as Candy Crush)

So I think part of the reason why people don’t think that games are about personal development is that the fact that you’re getting better at the game is so obvious. And then they miss the less obvious effect that you’re also getting better at higher level skills that you’ll need in life.

Games are great, folks. And we shouldn’t feel bad for playing them. Just, you know, don’t let this be an excuse to play too much. After all the reason why you’re learning and developing in games is to use what you learned in the real world. (and to then get better than you could get just from playing games)

]]>

People who saw the talk said that they really liked it, and they keep on telling me how much they liked it. So I decided to record the talk again and upload it.

The pitch for the talk is that the results of the Game Outcomes Project is the best evidence we have for what makes great game development teams and what makes bad game development teams. And I think that every game developer should know this stuff. So I talk about what you should focus on when making a game, and I give advice for how to get there. So the game outcomes project found “really successful teams do X” and I present that, and then also have a section at the end of the talk where I say “here is how you can actually get good at doing X.” Here is the talk:

]]>

The idea is that this should be something similar as George Polya’s “How to Solve It” but for doing research instead of solving problems. There is a lot of overlap between those two ideas, so I will quote a lot from Polya, but I will also add ideas from other sources. I should say though that my sources are mostly from Computer Science, Math and Physics. So this list will be biased towards those fields.

My other background here is that I work in video game AI so I’ve read a lot of AI literature and have found parallels between solving AI problems and solving research problems. So I will try to generalize patterns that AI research has found about how to solve hard problems.

A lot of practical advice will be for getting you unstuck. But there will also be advice for the general approach to doing research.

The general framework is that of exploration and exploitation. Exploitation means you are getting more out of old ideas. Exploration means you are looking for entirely new ideas. You may be thinking that doing research is more exploration than exploitation, but it’s actually a mix which contains more exploitation than exploration. Really new ideas get discovered rarely, and most of the work is to realize all the consequences of existing ideas.

The two analogies I like for this are hill climbing and exploring the ocean.

Exploitation is as an effort in hill climbing. Hill climbing comes from the family of AI problems that deal with search, where search means “I’m at point A, I want to get to point B.” It’s a very general problem that applies to more than just trying to find your way on a map. For research point A is “here is what I know now” and point B is “here is what I would like to find out/prove/demonstrate/get to work/make happen etc.”

There is a large number of search algorithms, and you only really use hill climbing if your problem has the following criteria: You can’t see very far, the problem is very complex, you don’t know where the goal is and progress is slow. Meaning there is heavy fog in the mountains, the mountains are crazy complex, you may end up at a different peak than what you had planned (or maybe you just want to get out of a valley and don’t know ahead of time where that will take you) and to top it off it’s all covered in snow, making progress very slow. So you can’t just say “I’m going to explore a thousands paths” because exploring one path might take you a week before you find out that it leads to a dead end.

At that point all fancy AI techniques are out of the door and we’re left with simple hill climbing. Luckily AI has several improvements over the simple “go up” approach that will just get you stuck on the first small hill.

For the “exploration” part of exploration and exploitation I want to use the analogy of exploring the ocean. You obviously can’t do hill climbing there. You could try to bring a long rope and measure the depth of the ocean, but then you would always just move straight back to the island that you came from. Because if you just left the harbor, in which direction does the floor of the ocean go up? Back into the harbor. You have to cover a large distance before you can do hill climbing.

The “exploring the ocean” analogy is not perfect, because there is a property to this kind of research where the more you’re trying to reach a goal, the less likely you’re going to get there. I guess it works if you have a wrong idea of where the goal is. Like Columbus thinking that India was much closer, and accidentally discovering America.

The best explanation I have found for this is by Kenneth Stanley in his talk The Myth of the Objective – Why Greatness Cannot be Planned. I recommend watching the talk, but if you don’t want to do that I will mention the main points further down.

For now the main point is that there are some discoveries that can only come from free exploration. You find a topic that’s interesting and you go and explore in that direction, without any specific aim other than to find what’s over there. Then at some point you start doing hill climbing to actually get results, but you can’t start off with it.

In this section I’ll mention the general approach to doing research. You’re probably doing many of these things already because they’re common sense but it’s still worth pointing these things out once. Especially students often get these things wrong, and then it’s good to be able to recognize what they do differently than you, and then it’s good to have the words for the common sense.

When doing research it’s easy to fool yourself. So it is very important that you go out of your way to prove yourself wrong. Feynman thought this was very important when talking about Cargo Cult science. I’m slightly misquoting him here because he doesn’t just talk about proving yourself wrong, but about a broader scientific honesty:

But there is

onefeature I notice that is generally missing in Cargo Cult Science. That is the idea that we all hope you have learned in studying science in school – we never explicitly say what thisis, but just hope that you catch on by all the examples of scientific investigation. It is interesting, therefore, to bring it out now and speak of it explicitly. It’s a kind of scientific integrity, a principle of thought that corresponds to a kind of utter honesty – a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid – not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked – to make sure the other fellow can tell they have been eliminated.Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can – if you know anything at all wrong, or possibly wrong – to explain it. If you make a theory, for example, and advertise it, put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it.

He goes on to talk about more things that you should do, but for now I just want to talk about the part of proving yourself wrong, because it is genuinely helpful when doing research.

It’s also the main difference between real research and pseudo-science. People in pseudo-sciences never try to prove themselves wrong. It’s also the main difference between real medicine and alternative medicine. People who promote crystal healing never try to prove themselves wrong. At least not seriously. Same thing between real journalism and conspiracy theorists. A real journalist will try to prove his theories wrong many times before publishing. Especially if it’s about a conspiracy.

But there is a more pragmatic reason: Proving yourself wrong early will save you from wasting time. Of course if you try to do research into a crazy theory like crystal healing, you’re wasting time. But there are also many reasonable cases where the fastest way to find out if an approach will work is to try to prove it wrong.

Now this can be a little bit tricky, because as Feynman also said “The first principle is that you must not fool yourself – and you are the easiest person to fool.” Meaning it’s hard for you to prove yourself wrong. It’s easier for you to fool yourself. But once you get better at proving yourself wrong, you tend to find shortcuts. Ways to rule out in a day what would have taken you a month to confirm. It turns out that often you only need rough heuristics to prove yourself wrong where proving yourself right requires actually working all the way through the problem.

You want to be a little bit careful with this because sometimes good ideas lurk in areas that most people stay far away from because of some “probably won’t work” heuristic, but usually those heuristics are a good idea. And even if you don’t want to use heuristics, it’s still often faster to prove yourself wrong than to prove yourself right, so the advice stands.

For some people it’s very hard to prove themselves wrong. There is one final trick that can even help those people: For some reason we are really good at proving other people wrong. If somebody else comes to you with a crazy idea you can immediately tell that it’s a crazy idea. Much more quickly than if it was your own idea. So the final trick is to ask others to prove you wrong. Meaning just ask a colleague to run an idea by them. And then listen to what they say.

I don’t know a good name for this, so I’ll use the name of the AI technique. You probably do this automatically, but it’s worth pointing out, because sometimes I see people who don’t do this, and they are really screwed.

Simulated Annealing is a very general approach, which roughly says that you should figure out the big picture out before you figure out the details. And it does that by prescribing what your response should be when you’ve walked all the way up in hill climbing and gotten stuck. Getting stuck means you’re at some local peak and it seems like you can’t see any paths that take you any higher. It seems like there are only steep cliffs or options that take you back down the hill. And ideally you’re not just stuck for five minutes, but you’re stuck for an hour or a day or more.

The general pattern is that every time you get stuck, you do a reset. But the size of the reset becomes smaller and smaller. At first you reset your progress completely and start over from the very beginning and try an entirely different approach. After you’ve tried a few different approaches, the next time that you reset yourself, do a smaller reset so that you stay in the area that took you the furthest. Don’t try new approaches any more, but try different variations of one approach. Later you do smaller resets still and maybe just try a few different solutions to specific problems. And at the end you do really small resets and just tweak some numbers.

What this means is that when you first work on a problem, you shouldn’t spend too much time fiddling around with the details. Instead try a different approach.

And then later, when you’re fiddling around with the details, you should not go back and try a whole different approach.

There is a progression to this. You often see inexperienced researches spend too much time on the details early on. Or maybe they have come very far and are already twiddling with the numbers when they find that their whole approach was wrong and they have to start over with an entirely different approach. That is very demoralizing. You have to do that exploration early on before you ever get to the details.

Or sometimes people are just trying lots of different approaches and are never actually doing one approach seriously. Simulated Annealing says that you should walk up until you’re stuck. Don’t switch to a different approach until you’ve gotten stuck. (sometimes that takes too long and you end up spending weeks on one approach. In that case set a time limit and do a reset once a week or so)

So when you first get stuck, do big resets and try entirely different approaches. Then over time do smaller and smaller resets.

This also gives a natural end to your research. Once you’re done twiddling with the details you’re done, period. (you don’t need to go back and try an entirely different approach since you already explored those earlier)

This one is not an AI technique but my own observation. It’s also something that all good researches do automatically, but it’s worth pointing out explicitly.

You want to be incremental. In the hill climbing analogy, imagine that there are several paths already carved into the mountains where previous researchers have made progress before you. You almost always want to start off from one of those paths. In fact some of those paths have become very wide because there are lots of researches doing work up at the end of those paths, so the path is well-trod.

You may actually want to avoid paths that are too wide, but only if you are experienced already. If you are a grad student doing your first research, don’t stray too far from where others are.

The advice to be incremental may be disappointing because you want to invent the next big theory like general relativity or the next Internet or whatever. But actually the more you read about those and how they came about, the more you realize that they were actually quite incremental. There are really very few inventions which can not be traced back to ideas that slowly accumulated and evolved over many years. Sometimes an idea seems really impressive to the outside world because to them it’s all new. But then you look at the author’s work and find that they had been silently working on it incrementally for the last ten years.

You may think that the “be incremental” advice does not apply to the “ocean explorer” analogy of research, but you’d be wrong. Few good things have come from just setting off into completely uncharted territory. Usually you want to hop from island to island. The “Myth of the Objective” talk that I talked about above strongly emphasizes how important stepping stones are for this kind of research. The results in their program couldn’t have come about if people couldn’t have built on top of each other’s results.

The AI technique for this is called Local Beam Search (the link is to “Beam Search” because it seems like Local Beam Search is never mentioned online…) which is a variant of hill climbing where we do several searches at the same time. That’s the whole trick. Programmers are not good at naming things.

Doing several searches at the same time is an easy thing to do for a computer, but it’s hard to do for a person. But I think we can get the same benefits without literally doing the searches at the same time. I’m going to quote from the book “Artificial Intelligence – A Modern Approach” (second edition) by Russel & Norvig to list the benefits:

In a local beam search, useful information is passed among the parallel search threads. […] The algorithm quickly abandons unfruitful searches and moves its resources to where the most progress is being made.

So how do we get these benefits as a simple human who can’t do multiple searches at the same time? One thing we can do is keep track of where you walked down one path but you could have walked down the other path. Explore the other paths every once in a while. Since humans have to do this sequentially, it’s actually similar to the simulated annealing I mentioned above. But the idea would be to do multiple searches at once, where each of them follows the simulated annealing approach.

One other approach to this is to always have more than one project. Here is Robert Gallager talking about this in the context of a talk about Claude Shannon:

Be interested in several interesting problems at all times. Instead of just working intensely on one problem, have ten problems in the back of your mind. Be thinking about them, be reading things about them, wake up in the morning, and review whether there’s anything interesting about any of them. And invariably […] something triggers one of those problems and you start working on it.

Why is that so important? I would say that one of the most difficult things in trying to do research is “how do you give up on a problem?” So many students doing a thesis just beat themselves over the head for year after year after year saying “I’ve got to finish this problem, I said I was going to do it and I’m going to do it.” If it’s an experiment you’re going to do, yes you can do that. You can do it more and more carefully if something isn’t working you can fix it to the point where it works. [But] if you’re trying to prove some theorem and the theorem happens to be not true, then your chances of success are very low.

If you have these ten problems in the back of your mind and there is one problem that’s been driving you crazy, what are you going to do? It’s going to sink further and further back in your mind and just because you’ve had more interesting things to do, you’ve gotten away from it. After a year you won’t even remember you were working on it. It will have disappeared. That’s a far better thing than to reason it out and say “I don’t think I can go on any further with this because of this, this and this reason.” Because your problem is you don’t understand the problem well enough to understand why you oughta give up on it. So you just find that you can do other things which become more interesting temporarily.

I think this quote is spot on, but there is an additional benefit to working on several things at the same time: There is lots of cross pollination between ideas. Also talking about Claude Shannon, this article has another good quote about this:

His information theory paper drew on his fascination with codebreaking, language, and literature. As he once explained to Bush:

“I’ve been working on three different ideas simultaneously, and strangely enough it seems a more productive method than sticking to one problem.”

The final angle in which this is helpful is that research just takes time. Some things can’t be hurried. If you work on multiple things at the same time, then that allows you to work on one thing for a longer time. If one project needs to take ten years, then there is no way that you can work on it full time for ten years. But if you also work on other things during those ten years, all of a sudden it’s doable.

I’m using Barbara Oakley’s terms from her Learning How to Learn online course.

The idea is that the brain has two distinct ways of working: The focused mode where you actively work on a problem, and the diffuse mode where you’re doing something else entirely but your subconscious is working on the problem. The diffuse mode is responsible for a lot of eureka stories, including the original one: Archimedes was stuck on a problem, trying to figure out whether a crown was pure gold or not. Then on a trip to a bath he is relaxing, mind drifting off, watching the water move, when suddenly the answer jumps into his head.

For me I also often get ideas like this while taking a shower. Some people say that physical exercise helps them get into the right mode. That doesn’t work for me, but long walks certainly do work. Some people say that they only get into this mode when sleeping and that they wake up with good ideas. That works very rarely for me, but maybe you have more luck with it. It seems like you need to be somewhat relaxed, your mind mostly idle. Then the background processes in your brain get to work and can form new connections that weren’t clear before. You have to stop thinking about the problem for a while and then an answer may drift back up from some deeper part of your brain that you don’t have direct access to.

The tricky part is that you can’t easily schedule what your brain is going to work on in the diffuse mode. Distractions like smart phones are really harmful, but even if you turn your phone off you can easily get into a mode where all you can think of is the latest controversy in the news. Rich Hickey talks about how he deals with this issue in his talk Hammock Driven Development: (he talks about sleep because for him a lot of this thinking happens while sleeping)

So imagine somebody says “I have this problem…” and you look at it for ten minutes and go to sleep. Are you going to solve that problem in your sleep? No. Because you didn’t think about it hard enough while you were awake for it to become important to your mind when you’re asleep. You really do have to work hard, thinking about the problem during the day, so that it becomes an agenda item for your background mind. That’s how it works.

For me sleeping doesn’t work, but I’ve certainly found the same thing to be true when going for long walks. If the last thing that I did before the walk was check the news, my brain will keep on going back to whatever I was reading. If the last thing was that I worked really hard on a problem, I may have a chance of finding a solution to the problem while going for a walk. (can’t force it though, you need to allow your mind to drift off and then drift back to the topic)

This is no guarantee for success. Oftentimes the solution of the diffuse mind doesn’t actually work. Or it’s just one step of the solution and after you take that one step you’re just as stuck as you were before. But anytime that I’m making really good progress, it’s a combination of focused mode and diffuse mode work.

This is another one one that is so basic that it’s rarely stated, but I sometimes see people confused about this. For example I was going through the book reviews of Kenneth Stanley’s book about how objectives can be harmful, and one person says that the book is clearly wrong because there are studies about how helpful goal setting is, with a link to this article. And since I believe in proving myself wrong I naturally read that article. It turns out that the professor mentioned in that course, Jordan Peterson, has made the course available on Youtube. And if you listen to what he actually says about setting goals, it’s a lot more compatible with Kenneth Stanley’s idea:

You don’t get something you don’t aim at. That just doesn’t work out. So lots of people aim at nothing and that’s what they get. So if you aim at something you have a reasonable crack at getting it. You tend to change what you’re aiming at a bit along the way, because like, what do you know? You aim there, you’re wrong. But you get a little closer. And then you aim there, and you’re still wrong. You get a little closer and you aim there, and as you move towards what you’re aiming at, your characterization of what to aim at becomes more and more sophisticated. So it doesn’t really matter if you’re wrong to begin with as long as you’re smart enough to learn on the way, and as long as you specify a goal.

This is spot on as far as I’m concerned. You need a goal to start with. But as you travel towards that goal, you may find reasons to change the goal and you shouldn’t be afraid to do that if you have a good reason. He goes on to say that it’s OK to specify a vague goal as long as you’re going to refine that goal along the way. (I think there is also some connection here to Scott Adams’ theory of “using systems instead of goals” but I haven’t thought that through)

Kenneth Stanley developed an algorithm called “Minimal Criterion Novelty Search” after his discovery about how harmful it can be to aim for a goal too rigidly. Novelty Search just tries visit as many different places in the search space as possible. Meaning it generates novel approaches to whatever problem you’re working on. Doesn’t matter if those novel approaches don’t look like they would solve the problem. “Minimal Criterion” says that the novel behaviors should still behave above some minimum threshold like “don’t get eaten by predators before you reproduce.” You can define your own minimal criterion for your problem, but it shouldn’t be very challenging to overcome. He has then shown that for tricky problems, novelty search is better than goal oriented search because novelty search doesn’t go for the goal and doesn’t get stuck on whatever the “trick” is. It just tries to reach as many different points as possible and will eventually automatically find its way around the trick.

The diffuse mode from the last section is a good way to get unstuck. But it’s a bit unreliable, and it needs to be fed. You need to work on the problem hard before the diffuse mode can provide you with a solution. But how do you do that if you’re stuck? And are there any more direct ways to get unstuck? In the hill climbing analogy, imagine you have come across a steep cliff and all you can see is ways back down or sideways.

Some of the advice here is to find good sideways steps that have helped others, other advice is for discovering as many sideways steps as you can. Others are about finding new starting points. It’s all about making movement in the hope that you will come across some hidden path that can take you up again.

This is the Feynman Algorithm for solving problems:

- Write down the problem.
- Think real hard.
- Write down the solution.

The algorithm is of course a joke because Richard Feynman made physics look so easy. Except that a friend of mine once said that the Feynman algorithm actually worked for him. Since then I have tried it a few times and it has actually really helped me, too. The important step seems to be step 1: Write down the problem. Sometimes we seem to be stuck, but we’re not actually all that clear of what exactly we’re stuck on. Putting it into writing forces us to consider what exactly the problem is, and sometimes just doing that is enough. If it’s not, step 2 has also brought me to the solution. Literally just sitting there staring at the formulation of the problem on the paper. Seems unlikely, but sometimes it works. (because you never actually thought about the explicitly stated problem)

A related solution from computer science is Rubber Duck Debugging. The idea is that if you’re completely stumped on trying to figure out a bug in your code, sometimes it helps to explain it to somebody else. That other person doesn’t actually have to understand what you’re talking about. It just helps talking through the problem. So a rubber duck is good enough for this.

I have to confess that I don’t find it easy to talk to a rubber duck, so I usually try explaining my current problem to my girlfriend. She doesn’t know a whole lot about computer science, but if I say that “I just need to talk through this problem once” then she will usually make an effort to listen. It’s also a good exercise to try to explain the problem in a way that somebody who is not familiar with algorithms and data structures can understand. The goal isn’t really to get her to understand it, but to get myself to talk about the problem fully enough that she could understand it.

Oftentimes that’s all it takes to find the thing that you forgot to check.

One thing that has really helped me on this is George Polya’s book “How to Solve It” which makes you ask yourself these questions:

What is the unknown? What are the data? What is the condition? Is it possible to satisfy the condition? Is the condition sufficient to determine the unknown? Or is it insufficient? Or redundant? Or contradictory?

Draw a figure. Introduce suitable notation.

Separate the various parts of the condition. Can you write them down?

It’s an exercise in getting good about “stating the problem.” And going through it explicitly somehow helps. Polya also points out that sometimes you’re stuck simply because you lost sight of the goal, which is another explanation for why stating the problem helps you get unstuck.

Polya also has a list of proverbs in his book that I will quote sometimes, here are his proverbs for this one:

Who understands ill, answers ill. (who understands the problem badly, answers it badly)

Think on the end before you begin.

A fool looks to the beginning, a wise man regards the end.

A wise man begins in the end, a fool ends in the beginning.

Here is Claude Shannon about simplifying problems:

Almost every problem that you come across is befuddled with all kinds of extraneous data of one sort or another; and if you can bring this problem down into the main issues, you can see more clearly what you’re trying to do and perhaps find a solution. Now, in so doing, you may have stripped away the problem that you’re after. You may have simplified it to a point that it doesn’t even resemble the problem that you started with; but very often if you can solve this simple problem, you can add refinements to the solution of this until you get back to the solution of the one you started with.

And here is Robert Galager again about an experience when he had a complex problem and asked Claude Shannon for help:

He looked at it, sort of puzzled, and said, ‘Well, do you really need this assumption?’ And I said, well, I suppose we could look at the problem without that assumption. And we went on for a while. And then he said, again, ‘Do you need this other assumption?’ And he kept doing this, about five or six times. At a certain point, I was getting upset, because I saw this neat research problem of mine had become almost trivial. But at a certain point, with all these pieces stripped out, we both saw how to solve it. And then we gradually put all these little assumptions back in and then, suddenly, we saw the solution to the whole problem.

Another thing I like to do for this is to solve the problem for one case. Instead of trying to attack the general problem, pick a simple case and solve it. Then another. Then another. Then another. Then look for patterns. Don’t look for patterns until you’ve solved three or four specific cases. The cases I usually look at are “what if these are all zero?” Or “what if this always takes the same amount of time?” Or “what if everybody wants the exact same thing?” And then further questions are small variations on that like “what if these are all 1? Or what if these are all zero except for that variable?” Or “what if these take different amounts of time but they start at regular intervals?” Or “what if everybody wants the exact same thing except for that one special case?”

Polya in “How to Solve It” of course also has questions for this:

If you cannot solve the proposed problem try to solve first some related problem. Could you imagine a more accessible related problem? A more general problem? A more special problem? Keep only a part of the condition, drop the other part; how far is the unknown then determined, how can it vary? Could you change the unknown or the data, or both if necessary, so that the new unknown and the new data are nearer to each other?

There is a subtle benefit to simplifying the problem which I’ll explain using the concept of “overfitting” from machine learning. Overfitting happens when your algorithm didn’t really learn the underlying pattern, but just memorized all the training examples. (and then doesn’t work on new examples) Overfitting means that you learned both the signal and the noise. One way to make overfitting less likely is to simplify or to generalize because simplifying the problem reduces the noise in the problem. (I will talk about generalizing further down) This is a bit of an abstract concept and probably deserves a fuller discussion (particularly because some simplifications actually increase your risk of overfitting) but for now I just want to say that solving a simplified problem can reveal broader truths than solving a complex problem, so don’t feel bad for simplifying. It can have real benefits.

We already had Polya’s advice of “Draw a picture. Introduce suitable notations” above, but this goes further. We can often use the visual processes in our brain to solve problems.

This applies to many more problems than math problems. Lots of math has geometric interpretations, but so do other fields. You can draw diagrams or plots or maps or simplified sketches or any number of other things.

One trick is to try to visualize as much data as possible. Draw scatter plots. Then draw small multiples of scatter plots. Add layers, colors, work at different scales, anything that allows you to show more data without confusion. Let your eye do the filtering later. While we would have a hard time dealing with thousands of numbers in writing, we have a very easy time finding patterns in thousands of numbers in a scatter plot. Here is a quote from Edward Tufte’s Envisioning Information (page 50):

We thrive in information-thick worlds because of our marvelous and everyday capacities to select, edit, single out, structure, highlight, group, pair, merge, […] focus, organize, condense, reduce, […] categorize, catalog, […] isolate, discriminate, distinguish, […] filter, lump, skip, smooth, chunk, average, approximate, cluster, aggregate, outline, summarize, itemize, review, dip into, flip through, browse, […].

Visual displays rich with data are not only an appropriate and proper complement to human capabilities, but also such designs are frequently optimal. If the visual task is contrast, comparison and choice – as it so often is – then the more relevant information within eyespan, the better. Vacant, low-density displays, the dreaded posterization of data spread over pages and pages, require viewers to rely on visual memory – a weak skill – to make a contrast, a comparison, a choice.

A common theme of Tufte is that we are really good at looking at lots of data. It’s not good to only show parts of the data at a time. Better to show as much as possible, and people will focus on what they want.

Except of course it’s not quite that simple, because if you just show as much as possible, you often have an unreadable mess. Edward Tufte’s books are all about showing as much as possible without having a mess on your hand. But really you can get pretty far by just trying and iterating on your visualizations. Try combining visualizations, then try separating them. Try looking at multiple things next to each other. Try zooming out or try zooming in etc.

This is one of the main points in Polya’s “How to Solve It.” He thinks mobilizing prior knowledge is one of the most important things you can do. To do this you of course have to be fluent in the field that you’re researching. Here are his questions related to this:

Have you seen it before? Or have you seen the same problem in a slightly different form?

Do you know a related problem? Do you know a theorem that could be useful?

Look at the unknown! And try to think of a familiar problem having the same or a similar unknown.

Here is a problem related to yours and solved before. Could you use it? Could you use its result? Could you use its method? Should you introduce some auxiliary element in order to make its use possible?

This is also a point where it’s useful to work on several things at the same time. Because somehow it seems that formulas or methods or insights from one area often apply in a different area. I don’t know why that is. Maybe there are only a finite number of concepts and connections between them, so we see the same concepts in several fields. (whatever explanation we come up with would also have to explain why garbage-can decision making works so well, so my explanation isn’t very good…) Here is Feynman talking about this:

[After deriving the conservation of angular momentum from the laws of gravity]. And thus we can roughly understand the qualitative shape of the spiral nebulae. We can also understand in the same way the way a skater spins when he starts with his leg out, moving slowly, and as he pulls the leg in he spins faster.

But I didn’t prove it for the skater. The skater uses muscle force. Gravity is a different force. Yet it’s true for the skater. Now we have a problem. We can deduce, often, from one part of physics, like the law of gravitation, a principle which turns out to be much more valid than the derivation.

[…]

So we have these wide principles which sweep across all the different laws. And if one takes too seriously these derivations, and feels that “this is only valid because this is valid” you can not understand the interconnections of the different branches of physics. Some day, when physics is complete, then all the deductions will be made. But while we don’t know all the laws, we can use some to make guesses at the theorems which extend beyond the proof. So in order to understand the physics one must always have a neat balance and contain in his head all the various propositions and their interrelationships because the laws often extend beyond the range of their deductions.

(edited heavily for brevity)

This is true in across parts of physics and it’s also true across entirely different fields, but I should also state that most of the time, the “related problems” you want to look at are going to be pretty close by. Polya gives examples like “to find the center of mass of a tetrahedon, see if you can use the method of the simpler related problem of finding the center of mass of a triangle.”

But I also want to bring it back to Polya’s idea of mobilizing prior knowledge: There is a lot of evidence that in most fields, the main difference between experts and novices is how much experience or knowledge of the field they have, and how good they are at organizing this knowledge. This comes out of Kahneman’s and Tversky’s work with expert firemen, but also from research about chess grandmasters. The better somebody gets at chess, the more they use their memory. (as measured by brain activity) So you have to build that pool of knowledge. You have to know lots of related problems and you have to be able to draw connections to them.

This is very related to the previous problem and it’s also something that I can find plenty of quotes for. Here is Claude Shannon for example:

Another approach for a given problem is to try to restate it in just as many different forms as you can. Change the words. Change the viewpoint. Look at it from every possible angle. After you’ve done that, you can try to look at it from several angles at the same time and perhaps you can get an insight into the real basic issues of the problem, so that you can correlate the important factors and come out with the solution.

Polya’s questions about this topic are simpler in that they are simply “Can you restate the problem? Could you restate it still differently? Go back to definitions.”

Polya then goes on to list several reasons for why this helps. One is that a different approach to the problem might reveal different associations, allowing us to find other related problems. (see the point above) A second reason I will just quote:

We cannot hope to solve any worth-while problem without intense concentration. But we are easily tired by intense concentration of our attention upon the same point. In order to keep the attention alive, the object on which it is directed must unceasingly change.

If our work progresses, there is something to do, there are new points to examine, our attention is occupied, our interest is alive. But if we fail to make progress, our attention falters, our interest fades, we get tired of the problem, our thoughts begin to wander, and there is danger of losing the problem altogether. To escape from this danger we have to

set ourselves a new questionabout the problem.The new question unfolds untried possibilities of contact with our previous knowledge, it revives our hope of making useful contacts. The new question reconquers our interest by varying the problem, by showing some new aspect of it.

See also the point about Simulated Annealing above which says that you should frequently try new approaches. But the size of the change that you make should differ over time.

And finally here is Feynman talking about the same idea. When talking about what if several theories have the same mathematical consequences, he says that “every theoretical physicist that’s any good knows six or seven different theoretical representations for exactly the same physics and knows that they’re all equivalent, and that nobody is ever going to be able to decide which one is right, but he keeps them in his head hoping that they will give him different ideas.” As for how they may help he says that a simple change in one approach may be a very different theory than a simple change in a different approach. And that changes which look natural in one theory may not look natural in another.

This one is related to some of the points that I made in “draw a picture” above, but it’s also worth talking about the data separately, without the context of a picture. To start off with here are Polya’s questions related to this topic:

Did you use all the data? Did you use the whole condition? Have you taken into account all essential notions involved in the problem?

Could you derive something useful from the data? Could you think of other data appropriate to determine the unknown? Could you change the unknown or the data, or both if necessary, so that the new unknown and the new data are nearer to each other?

One thing I would like to point out here is that there are many ways to organize data. I have literally given talks where all I did was take existing data and organized it in a different way to put emphasis on different conclusions. The original authors had organized their data by categories. I had organized it by strength of correlation. There are many ways to sort, filter, group or abstract data, and there are often many different insights to be gained depending on how you go about doing this.

Polya is referring to something else here though. For him the “data” are the information given for a problem. His example problem for this question is “We are given three points A, B and C. Draw a line through A which passes between B and C and is at equal distance to B and C.” And his point is that after drawing a picture of the dots with the desired line, the solution comes almost automatically if you just draw lines using all the available data. (the points A, B and C, as well as the desired line) So for your problem the data may just be any available information.

Changing the data can mean a lot of things from “collect more information” to “if I assume that this variable is always 0, would that simplify the problem?” And changing the unknown means that if the data suggests a different goal, at least consider that other goal. Maybe it’s a better goal than what you were looking for.

Richard Feynman was famous for this because he said that his Nobel prize came directly from playing around with physics. Here is the quote for that and it’s a great read. (unfortunately too long to be included in this blog post)

In the Feynman quote playing around means investigating a problem that has no practical applications. But you can even do this within a problem. You can play around with equations. Do random substitutions. See what the consequences would be if you cube a variable rather than squaring it. Take the equations in a circle and back to where they started. Do anything that you are curious about. You can play around with experiments. If you are working on some variable and normal values are in the range from 20 to 30, then try the values 1 and 100, just to see what happens. If nothing bad happens, try the values 0.1 and 1000. Antibiotics were discovered because an experiment went wrong and Alexander Fleming reacted with curiosity rather than frustration.

Here is a quote from Carver Mead that is in a similar vein to the Feynman story above:

John Bardeen was just the most unassuming guy. I remember the second seminar Bardeen gave at Caltech — I think it was just after he got his second Nobel Prize for the BCS theory, and it was some superconducting thing he was doing. He had one grad student working on it and they were working on this little thing, and he gave his whole talk on this little dipshit phenomenon that was just this little thing. I was sitting there in the front row being very jazzed about this, because it was great; he was still going strong.

So on the way out, people were talking and one of my colleagues was saying, “I can’t imagine, here’s this guy that has two Nobel Prizes and he’s telling us about this dipshit little thing.” I said, “Don’t you understand? That’s how he got his second Nobel Prize.” Because most people, one Nobel Prize will kill them for life, because nothing would be good enough for them to work on because it’s not Nobel Prize–quality stuff, whereas if you’re just doing it because it’s interesting, something might be interesting enough that it’s going to be another home run. But you’re never going to find out if all you think about is Nobel prizes.

This one is connected to the point about “use a related problem” above, but there is additional value to be gained from reading a related paper that I haven’t talked about above.

Reading a related paper is especially valuable if you try to reproduce the related paper. For me that’s often easy to do in computer science because I can implement the program. If it’s hard to do in your field, don’t be afraid to take shortcuts. (potentially huge shortcuts) You’re not trying to verify the paper, the value of the exercise actually comes from walking in other people’s shoes for a while. See what they did and why they did it. Criticize their ideas and their approach.

If you start from the other paper’s starting point, you will come across plenty of opportunities to do things differently. Maybe one of those different paths can give you an idea for your problem. And different starting points run into different problems, which sometimes allows you to dodge a problem that you ran into. Meaning the problem literally doesn’t even show up just because you came from a different angle.

Another thing I like to do is read old papers. You will be surprised at which alternatives they explored back then. (whatever “back then” means for your field) When a field is young, people are more open-minded. Often, the old alternative theories are obviously ridiculous now, but sometimes there are ideas there that should be revisited. Even if you don’t come across anything like that, I still just get random ideas from exposing myself to naive (but smart) ways of thinking about the problem.

Just as reading a paper is a good exercise for getting a different view point, so is starting from the end. The AI method for this is called bidirectional search, and there are real mathematical reasons for why this helps. Here is the picture for bidirectional search from Russel & Norvig’s “Artificial Intelligence – A Modern Approach”:

To explain this image, imagine we have no idea where the goal is. So we start branching out from the start point, exploring all directions. The longer this keeps on going, the bigger area we have to explore and the more we’ll slow down. If we also search from the goal, we can cut that time down dramatically. Instead of having to make one very big circle, we can make two small circles. In this picture the circles are about to touch, and as soon as they touch it’s an easy exercise to connect the circles and to draw a single path connecting the start to the goal.

With this picture in mind you can also see why so much of the advice above is about finding different starting points: If we had multiple starting points, chances are good that the circles can be even smaller. The further we move on from a starting point the more expansion slows (because the number of paths grows proportional to the area which grows at the square of the radius) so you want to be incremental (pick a goal that’s not too far away) and you may want to try multiple starting points.

Now strictly speaking bidirectional search is not a valid thing to do when doing hill climbing, because in hill climbing we have no idea where the goal is. But usually when doing research you have at least some idea of what you’re looking for or what you expect to find. Or you have some idea of what would overcome the current thing that you’re stuck on. Sometimes it helps to just make goals up. Meaning literally say “it would be really helpful if X was true” and then work backwards, try to figure out what you would need to make X true. Making good guesses as to what are good points to work backwards from is something that takes practice.

We all work at some level of abstraction, but sometimes you need to dive deeper and get into the lower levels. Meaning you need to take apart the machine that you’re working with and put it back together. Hook the sensors up directly to your computer instead of a separate display. (so you can write your own display code) Step through the lower level code. Write your own version of the lower level code. Multiply the equations all the way out. Run through them with real world numbers instead of using abstract symbols.

Meaning do the work that the people who provided you with your tools did.

Here is Bob Johnstone talking about Nobel laureate Shuji Nakamura:

Modifying the equipment was the key to his success… For the first three months after he began his experiments, Shuji tried making minor adjustments to the machine. It was frustrating work… Nakamura eventually concluded that he was going to have to make major changes to the system. Once again he would have to become a tradesman, or rather, tradesmen: plumber, welder, electrician — whatever it took. He rolled up his sleeves, took the equipment apart, then put it back together exactly the way he wanted it. …

Elite researchers at big firms prefer not to dirty their hands monkeying with the plumbing: that is what technicians are paid for. If at all possible, most MOCVD researchers would rather not modify their equipment. When modification is unavoidable, they often have to ask the manufacturer to do it for them. That typically means having to wait for several months before they can try out a new idea.

The ability to remodel his reactor himself thus gave Nakamura a huge competitive advantage. There was nothing stopping him; he could work as fast as he wanted. His motto was: Remodel in the morning, experiment in the afternoon. …

Previously he had served a ten-year self-taught apprenticeship in growing LEDs. Now he had rebuilt a reactor with his own hands. This experience gave him an intimate knowledge of the hardware that none of his rivals could match. Almost immediately, Nakamaura was able to grow better films of gallium nitride than anyone had ever produced before.

(from Brilliant!: Shuji Nakamura And the Revolution in Lighting Technology, p 107)

I want to caution against being too eager about this. You can waste huge amounts of time diving into the deeper levels. There is infinite amount of work down there, and there are reasons why we work at higher levels. The approach for this is to do the smallest dive possible. Only if that doesn’t work should you dive into the lower levels for longer amount of times. (the quote above mentions how Shuji Nakamura was frustrated for three months before he decided to dive deeper. That sounds like a reasonable amount of time)

A related problem is that sometimes you need to doubt the lower levels, but you have to be especially careful about this. But it does happen that the lower level formulas are wrong about something. Even the laws of physics still have holes in them which we have to fill up with Dark Matter and Dark Energy. That doesn’t mean that you should immediately question those laws of physics. You should do the smallest intervention possible and dive one level down. Don’t ever skip levels. Meaning first question whether something in your experiment is wrong. Then question whether your equipment is wrong, then maybe question if a formula from a previous paper is wrong, then slowly work your way down. Only if no higher level mistake can explain your observations should you keep on diving deeper. Think of it as detective work. There are heuristics for what to doubt (“how many known problems does this have?” “how much would break if this changed?”) but you will often follow the heuristics automatically if you just work one level at a time. In computer science this still happens with some regularity, and here is a good read about somebody who did this properly and worked his way slowly through every level until they could conclude that they found a hardware bug.

Sometimes it helps to show your unfinished idea to someone who is going to hate it. It’s a very unpleasant experience to do this. But if you do this you will hear all the many reasons why your idea can’t possibly work and why you should just abandon it right now. This can do two things: 1. It can actually increase your resolve to fix this problem. (I’ll show that idiot who thinks this can’t be done) 2. It brings up areas that you have avoided so far. Somehow, people who hate your idea are really good at finding open wounds that they can drive their thumb into to hurt you. Often times those open wounds are what you have avoided even though it’s exactly what you should be working on, as unpleasant as that may be. It sucks when somebody tells you “your idea sucks because it can’t deal with X” because you suspect that it’s true and you have unconsciously avoided dealing with X so far. But it can feel great when you then go back and finally tackle X and it turns out that you find a really elegant way to solve that problem, proving the idiot hater wrong and making some progress while you were at it.

The Internet is a great source for this kind of negativity. Sometimes coworkers and friends can identify your problem spots in a nicer way, but the problem with coworkers and friends is that they often have the same mindset as you. You can avoid that by asking new people in your group for advice. You have to get them when they’re still in the “why the heck do we do it like this?” stage, before they have advanced to the acceptance of the “this is just how we do things here” stage. So it’s tricky. (the two stages may not be this obvious) The most reliable way to get criticism is to ask someone who will hate your idea.

I started this section off with making this sound totally sucky. Because it usually is, and to do this you have to be ready for the unpleasant emotions. But this can actually be a more or less sucky experience, depending on who you get the criticism from. When you’re on one side of an argument, it’s easy to find someone on the other side who is a bit of an idiot and then you point and laugh and say “look at how much of an idiot they are on the other side.” That is easier for you to do, but it’s harder to learn from that. You have to put in more work to understand their point. And even though you’ll dismiss it, it will still negatively affect your mood. The better way to go is to find a smart person on the other side who can articulate themselves well. Ideally they can even state your viewpoint pretty well and can still tell you why their side is right. It’s easier to learn from those people, but you won’t naturally seek them out because you’ll learn all the parts where you are wrong.

If you’re out of all other options, sometimes you should just do something stupid.

Do something that would never work. Do something that might work, but it’s obviously inefficient or inelegant. Add five special cases. Do something hand-wavy that would never survive peer-review. Assume something that you can’t justify assuming. Do something where you already know three cases where it won’t work. Sometimes those surprise you by unexpectedly working or by giving you an answer that is almost right.

Do you have an idea that probably won’t help and it involves going through fifty cases that take an hour of tedious work each? Sometimes you just gotta do it, even if it probably won’t help. Repetition helps understanding, so maybe you will discover a new angle. Or maybe you will find ways to automate the work.

If all of the other advice for getting unstuck hasn’t helped, doing something stupid can help. In the hill climbing analogy it’s taking a step downhill. Or spending way too much work on a sideways step. The idea is to specifically do what you have tried to avoid doing. Obviously don’t do this as your first attempt at getting unstuck.

If after this last point you’re still stuck, maybe try being more incremental. Maybe the thing you’re trying to do is just not ready to be tackled yet. Find a half-way goal and aim for that. Otherwise I’ll talk about making progress next, and there may be more hints there.

In this part I will talk about the normal day to day things that you should do all the time. Why didn’t I put this before the “getting unstuck” section? Because getting unstuck is more interesting and now that I have your attention, I can spend it on making you read things that you should do every day.

Polya’s book “How to Solve It” has a chapter called “Wisdom of Proverbs” in which he talks about some of these always applicable things using proverbs. I kinda like that. It’s cute. So I will quote his proverbs when appropriate.

This one is trivial, because this is what we have been talking about for the whole list. Going up means taking one step towards your goal.

Even though this is obvious, I often catch myself doing this wrong. I’ll be thinking way too much about all the possible paths I could take and which problem I would encounter where, that I never actually end up doing a step. For me as a programmer a step may just mean “start writing some code.” (and don’t worry too much about organizing for now) Or it may just mean “work through a few cases” or just doing anything that gets you to actually do something as opposed to just thinking about it. Doing helps with thinking. I’ve found that solutions just come automatically as soon as I start working. Half the problems I worried about never actually show up. Half of the remaining problems end up being simple. Just start doing a step that seems to go uphill. (there’s actually an AI technique for this called Stochastic Hill Climbing which relies on the same insight that sometimes it’s too much work to find the best path and you should just choose any path)

Being lucky is a skill that you can learn. And it’s actually a fairly easy skill to learn. That may sound surprising to some people (especially to unlucky people) but it’s true. Here, Richard Wiseman writes about his research into luck. What he did is he found people who thought of themselves as especially lucky or especially unlucky and he asked them a lot of questions. Here are a few excerpt from the article that gives you a good idea for what he found:

I gave both lucky and unlucky people a newspaper, and asked them to look through it and tell me how many photographs were inside. On average, the unlucky people took about two minutes to count the photographs, whereas the lucky people took just seconds. Why? Because the second page of the newspaper contained the message: “Stop counting. There are 43 photographs in this newspaper.” This message took up half of the page and was written in type that was more than 2in high. It was staring everyone straight in the face, but the unlucky people tended to miss it and the lucky people tended to spot it.

For fun, I placed a second large message halfway through the newspaper: “Stop counting. Tell the experimenter you have seen this and win £250.” Again, the unlucky people missed the opportunity because they were still too busy looking for photographs.

[…]

And so it is with luck – unlucky people miss chance opportunities because they are too focused on looking for something else. They go to parties intent on finding their perfect partner and so miss opportunities to make good friends. They look through newspapers determined to find certain types of job advertisements and as a result miss other types of jobs. Lucky people are more relaxed and open, and therefore see what is there rather than just what they are looking for.

My research revealed that lucky people generate good fortune via four basic principles. They are skilled at creating and noticing chance opportunities, make lucky decisions by listening to their intuition, create self-fulfilling prophesies via positive expectations, and adopt a resilient attitude that transforms bad luck into good.

[…]

In the wake of these studies, I think there are three easy techniques that can help to maximise good fortune:

- Unlucky people often fail to follow their intuition when making a choice, whereas lucky people tend to respect hunches. Lucky people are interested in how they both think and feel about the various options, rather than simply looking at the rational side of the situation. I think this helps them because gut feelings act as an alarm bell – a reason to consider a decision carefully.
- Unlucky people tend to be creatures of routine. They tend to take the same route to and from work and talk to the same types of people at parties. In contrast, many lucky people try to introduce variety into their lives. For example, one person described how he thought of a colour before arriving at a party and then introduced himself to people wearing that colour. This kind of behaviour boosts the likelihood of chance opportunities by introducing variety.
- Lucky people tend to see the positive side of their ill fortune. They imagine how things could have been worse. In one interview, a lucky volunteer arrived with his leg in a plaster cast and described how he had fallen down a flight of stairs. I asked him whether he still felt lucky and he cheerfully explained that he felt luckier than before. As he pointed out, he could have broken his neck.

I can’t overstate how important this stuff is. Half of the advice from this blog post is due to me being lucky. For example the way I found Kenneth Stanley’s great talk “Why Greatness Cannot Be Planned: The Myth of the Objective” was that I was following Bret Victor on Twitter (or maybe it was from the RSS feed of his quotes page) because he is a constant source of new perspectives. That lead me to watch this talk by Carver Mead about a new theory of gravity. Which I watched even though I have no reason at all to look into this. I barely know any physics. But come on, a new theory of gravity. And it’s supposed to be simpler than Einstein’s theory while still making all the same predictions. That’s interesting. Then I went to find out more about the conference that that talk was given at and finally stumbled onto Kenneth Stanley’s talk.

None of these steps have any obvious practical benefit for me, but they lead me to a great talk, which coincidentally has the best demonstration I have ever seen about why you should behave in exactly this way.

Being lucky can mean that you never actually find what you’re looking for. You may find something else entirely. The list of scientific discoveries that were made “accidentally” is long. But you need to learn to be lucky, otherwise you will miss those chances when you encounter them.

Here are Polya’s proverbs for this topic:

Arrows are made of all sorts of wood.

As the wind blows you must set your sail.

Cut your cloak according to the cloth.

We must do as we may if we can’t do as we would.

A wise man changes his mind, a fool never does.

Have two strings in your bow.

A wise man will make more opportunities than he finds.

A wise man will make tools of what comes to hand.

A wise man turns chance into good fortune.

The title of this section is referring to the Woody Allen quote “80 percent of success is showing up”. This means showing up to work every day and working on a problem. Thomas Edison is supposed to have said that “ninety per cent of a man’s success in business is perspiration.”

“Showing up” can be more broadly applied: Show up to conferences. Show up to lunch with coworkers because that’s where you will have good discussions. Show up to dinner parties because that’s where you might meet people who can give you fresh ideas. Write the papers you’re supposed to write. Read the papers you’re supposed to read.

Part of this is to “be lucky” as in the point above. You can’t be lucky if you don’t show up. So you also want to get yourself into environments where you can show up to all these events. There is a reason why good research rarely comes out of some small town in the middle of nowhere: There are not enough opportunities to show up to out there. You want to at least live in a college town or a big city.

Here is Polya’s list of proverbs for this section:

Diligence is the mother of good luck.

Perseverance kills the game.

An oak is not felled at one stroke.

If at first you don’t succeed, try, try again.

Try all the keys in the bunch.

One final thing that I should point out is that I intentionally didn’t call this section “work hard.” I think that “show up” is a better advice. This is not about working 80 hour weeks. It’s about showing up to work on a problem every day.

The term “iteration time” is a standard term in video game development which roughly measures how much time passes between being finished with a change and seeing the change in the game. So for me as a programmer I make a change, then I have to compile the code, launch the game, get to a point where I can test my change and then test my change. Let’s say compiling takes ten seconds, launching the game takes twenty seconds, and getting to my test setup takes another ten seconds, then my iteration time is 40 seconds. So if I decide to make another small change, I have to wait another 40 seconds before I can see the result. If I can cut the compile time in half then my iteration time is just 35 seconds, which is a good improvement. If I can create a test setup that doesn’t require the whole game to boot then maybe I can get my iteration time down to just 15 seconds.

At the beginning of this blog post I talked about how research is often characterized by slow progress. Exploring one path might take you a week before you find out that it’s a dead end. You shouldn’t just accept that. You should find ways to reduce that time.

Improving iteration time helps in many non-obvious ways: If you can improve iteration times, you can make it cheaper to make mistakes. If an experiment takes you two hours, you probably don’t want to make a mistake and you’ll be very careful. If you can do the experiment in a minute, then some mistakes are OK and you can play around more. But even if you just reduce it from two hours to one hour and 45 minutes, that will still improve your work a little bit. And maybe you can find more improvements after that.

Now you have to invest time to save time, so sometimes it’s not worth it. But sometimes you’ll be surprised. I’ve had an argument about improving iteration times below two seconds. The other person argued that if your iteration time is only two seconds, how much time are you going to save by reducing the iteration time to one second? (and how much effort do you need to invest to achieve a 50% reduction?) But what happens is that when you reduce iteration times, you work differently. If your iteration time is milliseconds, all of a sudden you can work entirely differently. You can try several alternatives per second and create an interactive animation showing the alternatives. You can try different parameters in real time and see what happens. You can show several different variations of the problem on the screen at the same time. At some point you can write a program that just explores a million options and gives out the best one. (but then ironically that program would have slow iteration times, so maybe an interactive tool would be better)

Improving iteration times is a lot about automation, but often it’s also just about being observant as to where you are losing time. You can apply a lot of lessons from factories here. Standardize processes, specialize, batch your work etc. Also if you don’t know how to program, then you should probably learn how to. It’s easier now than it ever was. And to automate simple tasks like “entering numbers into an excel sheet” you don’t need a full computer science education.

Here is something you should do whenever you’re finished with a step: See what the implications of that step are beyond the specific step. See if it has broader applications. Here is Claude Shannon talking about this:

Another mental gimmick for aid in research work, I think, is the idea of generalization. This is very powerful in mathematical research. The typical mathematical theory developed in the following way to prove a very isolated, special result, particular theorem – someone always will come along and start generalizing it. He will leave it where it was in two dimensions before he will do it in N dimensions; or if it was in some kind of algebra, he will work in a general algebraic field; if it was in the field of real numbers, he will change it to a general algebraic field or something of that sort. This is actually quite easy to do if you only remember to do it. If the minute you’ve found an answer to something, the next thing to do is to ask yourself if you can generalize this anymore – can I make the same, make a broader statement which includes more – there, I think, in terms of engineering, the same thing should be kept in mind. As you see, if somebody comes along with a clever way of doing something, one should ask oneself “Can I apply the same principle in more general ways? Can I use this same clever idea represented here to solve a larger class of problems? Is there any place else that I can use this particular thing?”

In this talk Clay Christensen points out that generalizing makes it easier to prove yourself wrong. When you generalize your concept, you have more examples to test it against, and you can use those examples to improve your theory. If you test it against a new example and your theory doesn’t work, you have to either define the limits of your theory, or you have to explain why it sometimes behaves differently. (and then sometimes these new explanation help explain oddities in your original data)

There is also a quote by Feynman which I can’t find right now where he essentially says “if you’re not generalizing, then what’s the point?” With the reason being that the only way that science progresses is to make guesses beyond the specifics of what we observed.

One word of caution about this is that you can also be over-eager about this. Don’t try to find a pattern if all you have is two examples. (or god forbid only one example) You usually want to generalize after you’ve seen three or four examples of something. Of course the trick is in recognizing that three apparently different things are actually examples of the same thing.

This is another thing that you probably do automatically, but it’s worth pointing out: When you’re moving forward, you should try hard to keep moving forward. The usual example of this is when you were stuck for a while: Once you’re over the hurdle, you should keep working on the thing that got you over the hurdle, because you can probably make more progress there.

Csikszentmihalyi talks about the concept of “Flow” in relation to this, which is a highly focused mental state that you enter when you’re doing concentrated work. You want to stay in that state.

The easiest way to get this wrong is to get stuck on small bumps. There are plenty of small speed bumps along the way that will just slow you down. If there is something in a paper that you don’t understand, ignore it and keep reading. Maybe it will become clear later. If you already know something to be true, but proving it is tricky, skip over the proof. You can fill in the gaps later. If you’re working an algorithm but an edge case is driving you nuts, don’t handle the edge case now. Just solve the cases that you actually need.

It’s important that you revisit each of these points later to fill in the gaps (because sometimes good discoveries hide in small irregularities) but you shouldn’t let a small speed bump stop you when you were making good progress before.

Here are Polya’s proverbs for this, the first one being ironic:

Do and undo, the day is long enough.

If you will sail without danger you must never put to sea.

Do the likeliest and hope the best.

Use the means and God will give the blessing.

This is the opposite advice of the previous point, but what can I say. Sometimes you gotta keep on moving forward, sometimes you have to be careful. Often you have to do both.

You can waste a huge amount of time if you mess up a step and never notice.

Polya’s questions for this are “Carrying out the plan of the solution, check each step. Can you see clearly that the step is correct? Can you prove it?”

You get better at this with experience. As you gain more experience, you will just intuitively avoid problems. So if it looks like a very experienced person isn’t checking every step, it may just be that they have taken steps like this a thousand times before.

On the other hand for me personally I feel like I’ve become more and more careful the more I have programmed. My changelists these days tend to be smaller than they used to be. I rarely make huge changes nowadays. Instead I try to make many smaller steps, each of which I can reason about.

The other thing I’d like to mention in this context is that sometimes slowing down can help. Sometimes if you have to make a decision, it’s best to wait for a while before making it. Try to work around it and get a better lay of the land. This is why procrastination sometimes works. Sometimes with delay the correct choice becomes clear. Sometimes all you’re doing is delaying though…

Polya’s proverbs for this section are these:

Look before you leap.

Try before you trust.

A wise delay makes the road safe.

Step after step the ladder is ascended.

Little by little as the cat ate the flickle.

Do it by degrees.

This is a point that I can’t possibly do justice to. Whole books have been written about how to form effective teams, so my advice in a blog post like this has to be hopelessly incomplete.

Research has the curious character where it’s often better when done by yourself. Kenneth Stanley has an amazing illustration of the damage that committees can do to research in his talk. (same talk that I keep referring to) If you have to constantly justify what you’re doing, you won’t do the exploration that’s necessary to actually get anywhere. Yet at the same time none of the pictures that he shows in his talk are the result of people working alone. So how do we square that circle?

Research about effective teams has shown that one of the most important things is emotional safety. You should be safe to speak up, safe to ask stupid questions, safe to follow hunches, safe to take a risk, and safe to admit mistakes. If you make decisions by committee, none of these things are true because you have to constantly justify what you are doing, and you have to constantly compete with others to make sure that your priority is still everyone’s priority.

One piece of advice that I like for this is the practice of “Yes, And” from improv comedy. If somebody has an idea, you can’t say “no that’s stupid.” (or use a more subtle way to shut it down) You have to say “yes”, and you have to add something to it to keep the idea alive. I have the idea from this talk by Uri Alon, who gives the following example:

We were stuck for a year trying to understand the intricate biochemical networks inside our cells, and we said, “We are deeply in the cloud,” and we had a playful conversation where my student Shai Shen Orr said, “Let’s just draw this on a piece of paper, this network,” and instead of saying, “But we’ve done that so many times and it doesn’t work,” I said, “Yes, and let’s use a very big piece of paper,” and then Ron Milo said, “Let’s use a gigantic architect’s blueprint kind of paper, and I know where to print it,” and we printed out the network and looked at it, and that’s where we made our most important discovery, that this complicated network is just made of a handful of simple, repeating interaction patterns like motifs in a stained glass window.

(the term “being in the cloud” is what I would call being stuck in a local maximum using the hill climbing analogy)

Other important things are having a clear, well communicated vision for what you’re trying to do. This doesn’t have to be a specific goal, but it should at least be a direction. That way all the creative attempts that people are taking in your group (because it’s safe for them to do so) will automatically work together. Competing goals within the group can be really harmful here, so you want to resolve disagreements. And changing the vision can also be really harmful. If you have to change direction, you have to communicate that very well.

The final thing is that diversity has been shown to help. Which makes sense if you look at how much of the advice above is about finding different view points.

Whew, you’ve made it to the end and you’ve made a discovery. Now make sure to look back. Polya has these questions for you:

Can you derive the result differently? Can you see it at a glance? Can you use the result, or the method, for some other problem?

The last question aims at the “generalizing” point I have talked about above. But the moment just after you have finished is often the moment where you can do your best work. You can flatten out all the bumps that accumulated in your work over time. You can straighten out the lines, clean up the formulas. Maybe something that seemed odd before now makes a lot of sense and offers a hint for further research. This is the time where you can turn this result into something really good that others will actually want to use. Take some extra time here.

Polya’s proverbs are “He thinks not well that thinks not again.” And “Second thoughts are best.” He also says that it’s really good if you can, with the benefit of hindsight, find a second way to derive the result. “It is safe riding at two anchors.”

You have reached the end of my list. If you still haven’t had enough, here are some of my sources. Otherwise the conclusion is below.

George Polya – How to Solve It – A New Aspect of Mathematical Method

This book is written by someone who was thought carefully about how we solve problems. If I hadn’t read this book, I couldn’t have noticed other patterns.

Kenneth Stanley – Why Greatness Can Not Be Planned – The Myth of the Objective

I love this talk because he explains everything with pictures. For example when he shows the pictures that you get from voting compared to the pictures you get from individual exploration, it really is better than a thousand words about the subject could be.

Rich Hickey – Hammock Driven Development

This has more insights about focused mode and diffuse mode than I actually used in this blog post. I think this is also the place where I first heard about “How to Solve It.”

Claude Shannon – Creative Thinking

This is a transcript of a talk that Claude Shannon gave. The good section is the part about his tricks for doing research. I suspect that the text got messed up by some kind of automatic digitization method, so if somebody has a better source, I would be very thankful.

Robert Galager talking about Claude Shannong

This talk builds on the above list and adds more tricks that Claude Shannon used. Some of those I didn’t mention because I didn’t talk about how to find good topics.

One thing I wish I could link to is a talk or article that generalizes from AI methods to scientific research. I did some of that above, but I have no sources for that other than my own interpretations. I could link you to AI books but they typically spend a very small amount of time on hill climbing.

I don’t think my list is complete, but I think I have a pretty good sample. For example I have not read any of Csikszentmihalyi’s work. I’m sure I could add at least one or two points to my list if I did. But as I kept adding things over the years, I was frustrated by how few people seem to know these things. For example I referred to a TED talk above that talks about being stuck, and the guy doesn’t refer to Polya. And Polya’s “How to Solve It” simply has the best list for getting unstuck, so it should always be mentioned when you’re talking about being stuck. After I saw a few incomplete opinions like that, I decided I had to write this blog post, even if my own list was also incomplete.

The list is necessarily short because it’s a blog post and it’s intended as something that you can re-read the next time that you’re having problems.

There are several directions that a list of “advice for doing research” could be expanded. For example I could talk about heuristics for identifying good research, (it seems solvable, the old theory has known problems, it would simplify things, the underlying conditions have changed, it would help someone, your unconscious keeps on drawing you back to it…) or I could talk about progress and about what you should do in which stage of research (Clay Christensen talks about that here) but I had to stop at some point, and having a list of tricks and habits seems like a good thing to have.

If you’ve made it to the end of this blog post, then I thank you very much for reading. I recommend that you come back here every once in a while to re-read the list. It’s what I’m doing with Polya’s book.

]]>

That is until recently, when I came across the paper Imaginary Numbers are not Real – the Geometric Algebra of Spacetime which arrives at quaternions using only 3D math, using no imaginary numbers, and in a form that generalizes to 2D, 3D, 4D or any other number of dimensions. (and quaternions just happen to be a special case of 3D rotations)

In the last couple weeks I finally took the time to work through the math enough that I am convinced that this is a much better way to think of quaternions. So in this blog post I will explain…

- … how quaternions are 3D constructs. The 4D interpretation just adds confusion
- … how you don’t need imaginary numbers to arrive at quaternions. The term will not come up (other than to point out the places where other people need it, and why we don’t need it)
- … where the double cover of quaternions comes from, as well as how you can remove it if you want to (which makes quaternions a whole lot less weird)
- … why you actually want to keep the double cover, because the double cover is what makes quaternion interpolation great

Unfortunately I will have to teach you a whole new algebra to get there: Geometric Algebra. I only know the basics though, so I’ll stick to those and keep it simple. You will see that the geometric algebra interpretation of quaternions is much simpler than the 4D interpretation, so I can promise you that it’s worth spending a little bit of time to learn the basics of Geometric Algebra to get to the good stuff.

OK so what is this Geometric Algebra? It’s an alternative to linear algebra. Instead of matrices, there are multiple kinds of vectors, and there is a more powerful vector multiplication.

Let’s start with vector multiplication. In linear algebra we know two ways to multiply vectors: The dot product (producing a scalar) and the cross product (producing a vector). Where the dot product works for any number of dimensions, and the cross product only works in 3D. Geometric algebra also uses the dot product, but it adds a new product, the wedge product: . The result of the wedge product is not a vector or a scalar, but a plane. Specifically it’s the plane spanned by the two vectors. This plane is called a bivector because it’s the result of the wedge product of two vectors. There is also a trivector which describes a volume. The general principle is that the wedge product increases the dimension of the vectors by one. Vectors (lines) turn into bivectors (planes), and bivectors turn into trivectors (volumes). When we do math in more than 3 dimensions, we can go even higher, but I’ll stick to 2D and 3D for this blog post.

Before I tell you how to actually evaluate the wedge product, I first have to tell you the properties that it has:

- It’s anti-commutative:
- The wedge product of a vector with itself is 0:

The first property will make sense when we talk about rotations. The second product should already make sense if we just think of a bivector as a plane. There is no plane between a vector and itself, so it’s 0.

The other thing I have to explain is how vector multiplication works: In geometric algebra, the vector product is defined as the dot product plus the wedge product:

The result of the dot product is a scalar, and the result of the wedge product is a bivector. So how do we add a scalar to a bivector? We don’t, we just leave them as is. It works the same way as when adding polynomials or when adding apples and oranges or when working with complex numbers: . We just leave both terms.

Note that usually I will leave out the star and just write .

In 3D space we have three basis vectors:

When multiplying these with each other we notice three properties of this new way of multiplying:

So when multiplying the basis vectors with each other, either the dot product or the wedge product is zero. We are left only with one of the two.

All other vectors can be expressed using the basis vectors. So the vector can also be written as and I will use the second notation more often, because it makes multiplication easier.

With that out of the way, we can finally give one real example of how vector multiplication works in geometric algebra. It’s actually pretty simple because we just multiply every component with every other component:

Let’s walk through a few of the steps I did there:

- because .
- because , so the scalar part is zero, and can write the wedge product of basis-vectors shorter as . This short-hand notation is only valid for vectors which are orthogonal to each other.
- because

So as promised the result of multiplying two vectors is a scalar () and a bivector (). A sum of different components like this is called a multivector.

When doing these multiplications you quickly notice that just as all vectors can be represented as combinations of , and , all bivectors can be represented as combinations of , and . So I’ll just use these as my basis-bivectors. We could make different choices here, for example we could use instead of but I like how the bivectors circle around like that. The choice of bivectors doesn’t really matter, just as the choice of basis-vectors doesn’t really matter. We could for example have also chosen , and as our basis vectors. All the math works out the same, we just get different signs in a few places.

Once we have three basis-vectors and three basis-bivectors, we notice that we can represent all 3D multivectors as combinations of 8 numbers: 1 scalar, 3 vector-coefficients, 3 bivector-coefficients and 1 trivector-coefficient. If we did the same exercise in a different number of dimensions, we would find similar sets of numbers. In 2D space for example we have 1 scalar, 2 vector-coefficients and 1 bivector-coefficient. That makes sense, because in 2D there are only 2 directions, only 1 plane and no trivector because there is no volume. If we went to 4D we would have 1 scalar, 4 vector-coefficients, 6 bivector-coefficients, 4 trivector-coefficients and 1 quadvector-coefficient. I’m sure you can spot the pattern that would allow you to go to any number of dimensions. (but really these come out naturally depending on how many orthogonal basis-vectors you start with)

We’re almost finished with our introduction to geometric algebra, so I need to mention one final important property: vector multiplication is associative. Meaning so we can choose which multiplication we want to do first.

OK with that we’re finished with the introduction, but I want to practice a few more multiplications so that you get the hang of it. Maybe do a few yourself. It takes a couple minutes, but then you have the rules ingrained into muscle memory. This practice section is optional though.

Let’s do some practice runs to build up an intuition for how these vectors and bivectors behave. You can skip this section entirely if you don’t care about geometric algebra and just want to get to rotations.

What happens if we multiply two similar bivectors?

So what I did there is I used to re-order the basis-elements. Then everything collapses down because . So what we see here is that the dot product of a bivector is a negative number. Isn’t that interesting? In particular if we have a bivector of length 1 and multiply it with itself: we see that . Remember how in quaternions there are these three components , and which have ? We’re going to be using the bivectors for that. However it just so happens that the bivector is a mathematical construct whose square is -1. That does not mean that it is the result of . I could build any number of mathematical constructs that square to -1, (for example trivectors also square to minus one) that doesn’t mean that they are all the square root of -1. How many square roots is -1 supposed to have?

Speaking of squaring a trivector, let’s try that to get practice at re-ordering these components:

Getting the hang of it yet? It’s all about re-ordering components until things collapse.

Let’s try multiplying two different bivectors:

The result of two bivectors is another bivector. If we have more complicated bivectors that are made up of multiple basis-bivectors, the result is a scalar plus a bivector:

So this is a scalar () plus quite a complicated bivector ().

What happens if we multiply across dimension. Like multiplying a vector with a bivector?

If we multiply the plane with a vector that’s on the plane, we get another vector on the plane. In fact if we do this a few more times:

We notice that after four multiplications we are back at the original vector . So every multiplication with a bivector rotates by 90 degrees. If we multiply on the left side instead of multiplying on the right side, we would rotate in the other direction.

What if we multiply the plane with a vector that’s orthogonal to it?

Well that’s disappointing, we just get the trivector. What if we multiply the trivector with the plane?

If we multiply the trivector with the plane, the plane collapses and we’re left with just the vector that’s normal to the plane. This works even for more complicated bivectors:

Which is the normal of the original plane. What if we multiply a vector with the trivector?

If we multiply a vector with the trivector, the vector part collapses out and we’re left with the plane that the vector is normal to. This works even for more complicated vectors:

And with that we’re back at the original plane. Almost. The sign got flipped. If we had multiplied by we would have been back at the original plane.

So multiplying with the trivector turns planes into normals and normals into planes, because the other dimensions collapse out. This also allows us to define the cross product in geometric algebra: . So first we build a plane by doing the wedge product, then we get the normal by multiplying with the trivector.

If you went through the practice chapter you will have already seen places where geometric algebra does rotations: bivectors rotate vectors on their plane by 90 degrees. It’s not quite clear how we can build arbitrary rotations with that though.

One thing that’s a little bit easier to do is reflections, and we will see that we can get from reflections to rotations.

Let’s say we want to reflect the vector a in the picture below on the normalized vector r, to get the resulting vector b:

To do that it’s useful to break the vector a into two parts: The part that’s parallel to r, and the part that’s perpendicular to r, :

(forgive my crappy graphing skills)

These have a few properties:

(the result is a scalar and we can flip the order)

(the result is a bivector and flipping the order flips the sign)

From the picture it should be clear that if we subtract instead of adding it, we should get to . Or in other words:

So how do we get these and vectors? You may already know how to do it, but we actually never need to explicitly calculate them. Because we can actually represent this reflection as

How do we get to that magical formula? Let’s multiply it out:

The important step is that , allowing us to re-order the elements until we’re left with which is just 1, as long as is normalized.

The reflections above look kinda like rotations. In fact if all we want to do is rotate a single vector, we can always do that with a reflection. The problem is if we want to rotate multiple vectors, like in a 3d model, then the rotated model would be a mirror version of the original model.

The solution to that is to do a second reflection. There are many possible pairs of reflections that we could choose, but here is an easy one. First we reflect on the half-way vector between and , (where writing pipes around a vector like is the length of the vector, so is a normalized vector):

So in this picture I am reflecting on the vector , which is half-way between and , landing us at . To get from to we just have to do a second reflection with the vector itself. (which is a bit weird, but if you follow the equations it works out) Given that is one reflection, is two reflections. First we reflect on , then we reflect on .

Earlier we chose . We can multiply this out and define

Then the rotation is written as (where you could work out by multiplying out the other side, or you can just flip the sign on the bivector parts of ), and the inverse is written as .

And just like that we have quaternions. How? Where? I hear you asking. That part in the last equation is a quaternion. If you multiply it all out, you will find that all the vector parts and trivector parts collapse to 0, and you’re just left with the scalar part and the bivector coefficients. And it just so happens that if you have a multivector which consists of only a scalar and the bivectors, multiplication behaves exactly like multiplication of quaternions.

Now isn’t that interesting? All we did was we did the math for reflections, and if we do two of those we get quaternions? No imaginary numbers, no fourth dimension, just 3d vector math. All we had to do was introduce that wedge product .

And you’ll notice that the way we apply , by doing looks an awful lot like how we multiply quaternions with vectors. To multiply a quaternion with a vector we do .

OK so let’s convince ourselves that these really are quaternions and work out the quaternion equations. They are . Our quaternion consists of a scalar and three bivectors, , , and . (I use them in this order because the plane rotates around the x axis, so it should come first). So let’s try this:

.

Seems to work so far. But I actually don’t fulfill the equation because for me . I could fix that by choosing a different set of basis-bivectors. For example if I chose , and , then this would work out because . But I kinda like my choice of basis vectors and all the rotations work out the same way. If this bothers you, just choose different basis bivectors.

One super cool thing is that when doing the derivations using reflections, I never had to specify the number of dimensions. We could use 3D vectors or 2D vectors or any number of dimensions. So if we work out the math in 2D, what do you think we get? That’s right, we get complex numbers: One scalar and one bivector. Because that’s how you do rotations in 2D. But we could go to any number of dimensions using this method. (except in 1D this kinda collapses, because you can’t really rotate things in 1D)

Also we didn’t specify what we are rotating. We assumed that it was a vector, but we never required that. So this can rotate bivectors and it can rotate other quaternions.

So we found a new way to derive quaternions. This new way is neat because we don’t need 4 dimensions and we don’t need imaginary numbers. But can we learn anything new from this? Already we have two possible new interpretations:

- A quaternion is the result of two reflections
- A quaternion is a scalar plus three bivectors

Maybe one of these has some interesting conclusions.

Before that I want to kill the 4D interpretation properly: There are two reasons why people say quaternions are 4D: The fact that quaternions have four numbers, and the fact that quaternions have double cover. I’ll talk about the double cover separately later, but here I briefly want to talk about the four numbers thing. There are lots of 3D constructs that have more than three numbers. For example a plane equation has four numbers: . Or if we want to do rotations using matrices in 3D, we need a 3×3 matrix. That’s 9 numbers. But nobody would ever suggest that we should think of a rotation matrix as a 9 dimensional hyper-cube with rounded edges of radius 3. So don’t think of quaternions as a 4 dimensional hypersphere of radius 1. Yes, there are some useful conclusions to draw from that interpretation (for example it explains why we have to use slerp instead of lerp) but it’s such a weird interpretation that it should come up very rarely.

With that out of the way let’s get to these two new interpretations:

1. Interpreting quaternions as two reflections. I couldn’t get much useful out of this. The first reflection is always on the vector half-way between the start of the rotation and the end of the rotation. The second reflection is always on the end of the rotation. I’ve played around with visualizing that, but the visualizations always looked predictable and didn’t offer any insights.

2. Interpreting quaternions as a scalar plus three bivectors. This interpretation on the other hand turned out to be a goldmine. Not only can you get an intuitive feeling for how this behaves, you can also get visualizations from this. This interpretation also allowed me to get rid of the double cover of quaternions.

So even though we have derived quaternions using reflections above, I will actually spend the rest of the blog post talking about quaternions as scalars and bivectors.

A quaternion is made up of a scalar and three bivectors. We all know what a scalar does: Multiplying with a scalar makes a vector longer or shorter. I said above that multiplying with a bivector rotates a vector by 90 degrees on the plane of the bivector.

So how can we build up all possible rotations if all we have is a scalar and three rotations of exactly 90 degrees? The answer is that a bivector actually does slightly more: It rotates by 90 degrees, and then scales the vector.

I said that a bivector is a plane. But because of its rotating behavior, I actually like to visualize it as a curved line. So I visualize a vector as a straight line, and a bivector as a 90 degree curve. So here is a visualization of three different bivectors:

These are the bivectors (bottom), (middle) and (top). It’s a 90 degree rotation followed by a scale. I find this visualization particularly useful when chaining a bunch of operations together.

For example let’s say we want to rotate by 45 degrees on the xy plane. To do that we can multiply a vector with the quaternion . (that 0.707 is actually , but I’ll truncate it to 0.707 here) Now let’s multiply the vector with that quaternion. That gives us

Here’s how I would visualize that:

First we rotate by the bivector to get :

So the bivector is a rotation by 90 degrees followed by a scale of 0.707.

Next we multiply the original vector with the scalar to get the vector , which we add to the previous result:

Which then gives us the final vector of :

Which is the original vector rotated by 45 degrees.

This way of visualizing makes it very clear that multiplication with a quaternion is just multiplication with a scalar and multiplication with a bivector. And this also shows how we got a 45 degree rotation, even though all we can do is 90 degree rotations followed by scaling. It also explains why we need the single scalar value, and why the three bivectors are not enough: We sometimes want to add some of the original vector back in to get the desired rotation.

One thing to note is that in here I chose to do the bivector multiplication first, and the scalar multiplication second. But the choice is kinda arbitrary as both of these happen at the same time, and they don’t depend on each other.

Let’s rotate that same vector again to show what this looks like when we didn’t start off with one of our basis vectors:

So let’s visualize that:

First we rotate with the bivector, which puts us at :

So once again this does a 90 degree rotation followed by a scale of 0.707.

Next we multiply the original vector by 0.707 and add the resulting vector :

Which then gives us the final vector of :

Which is exactly what we would expect after rotating by 45 degrees twice.

I think these visualizations also explain how we can get arbitrary rotations: For bigger rotations we just have to make the scalar component smaller as the bivector component gets bigger.

So far we have only looked at the xy plane. To visualize this in 3D, I wrote a small program in Unity that can do the above visualization for all three bivectors. Here is what that looks like for rotating from the vector to the vector . That gives me the particularly nice quaternion .

This is going to be hard to do in pictures because it’s a 3D construct, but I’ll give it a shot. Here is what the two vectors look like:

So I want to rotate from the vector on the left to the vector on the right.

Here is what the contribution of the bivector looks like:

So this bivector is rotating on the xy plane. It takes the end point of the vector and rotates it 90 degrees down on the xy plane. It may be a bit hard to see, but imagine all the yellow lines lying on a xy plane.

The result of that 90 degree rotation is the vector . (the lower edge of the plane) I used the end of that rotation to start our result vector. (see how I have a third short vector sticking out at the bottom now? That’s )

Next I’m doing the contribution of the bivector:

The original vector was already rotated 45 degrees on the yz plane, so this rotation started off at a 45 degree angle and it rotated 90 degrees on the yz plane. Then it scaled the result by 0.5, giving us the result vector . (the bottom of the teal plane)

I also added the result of that rotation to the result vector. (the shorter vector that was sticking out now has a corner in it, indicating that I added the new )

Next we add the contribution of the bivector:

This took the end point of the original vector, and rotated it by 90 degrees on the zx plane. Then it scaled the result by 0.5, giving us the new vector (the end of the purple plane). The reason why the purple plane is floating above the other planes is an artifact of my visualization: I start at the end point and then I only move on the zx plane, so I end up floating above everything else. I also added this to our result vector at the bottom there.

Finally I’m going to add the scalar component into this:

This just took the original vector and scaled it by 0.5, giving us . I then added that to the results of the three bivector rotations. And as we can see, if we add up the contributions of the three bivectors and of the scalar part, we end up exactly at the end point of the vector that we were rotating into. (it may look like the last part is longer than 0.5 times the original vector, but that’s a trick of the perspective. The reason I picked this perspective is that you can see all three rotations from this angle)

So the rotation happened by doing three bivector multiplications and one scalar multiplication and adding all the results up.

Once again I want to point out that the order in which I added these up is arbitrary. All of these multiplications happen at the same time and don’t depend on each other, since they all just use the original vector as input. I chose to do this in the order xy, yz, zx, scalar, because that gave me a nice visualization.

I wanted to make the above visualization available for you to play with. I thought I could be really cool and upload a webgl version so that you can just play with it in your browser. So I built a webgl version, but then I found out that I can’t upload that to my wordpress account. So… I just put it in a zip file which you have to download and then open locally… Here it is.

There is an alternate visualization for the above rotation: Just as we would think of the vector as a single vector, we can also think of the bivector as a single bivector. It’s the plane with the normal , which is the plane spanned between the start vector and the end vector of the rotation. Then the visualization shows a 90 degree rotation on that plane, followed by a scaling of the length of this bivector. (which is ) That visualization looks like this:

So we rotate on this shared plane, then scale by 0.866, and finally add the original vector scaled by 0.5. This visualization as a single 90 degree rotation by the sum-bivector is equally valid as the visualization of the component bivectors. Just as we can visualize vectors either by their components, or as one line, we can visualize bivectors either by their components or as a single plane.

That finishes the part about visualization. As far as I know this is the first quaternion visualization that doesn’t try to visualize them as 4D constructs, and I think that really helps. Every component now has a distinct meaning and a picture. And we can see how the behavior of the whole quaternion is a sum of the behavior of its components.

One quick aside I want to make is that sometimes people say that quaternions are related to the axis/angle representation of rotations. That is a good way to get people started with quaternions, but then it breaks down relatively quickly because the equations don’t make sense and the numbers behave weirdly. The scalar & bivector interpretation is actually related to the axis/angle interpretation, and it explains what’s really going on here. Because when I say that something rotates 90 degrees on a plane, we can also say that it rotates 90 degrees around the normal of the plane. So in this interpretation quaternions first: rotate 90 degrees around the normal, followed by being scaled down, and second: multiply the original vector times a scalar and add that. It’s not quite axis/angle, but we can see how it’s related and why the axis/angle interpretation sometimes seems to work.

With the scalar & bivector interpretation of quaternions, we have a good idea of what quaternions do. With that, we’re ready to tackle the final quaternion mystery:

When I was working on this, a few friends asked me how the “scalar and bivector” explanation explains the double cover of quaternions. If you’re not familiar, the double cover means that for any desired rotation, there are actually two quaternions that represent that rotation. For example the quaternions that have 1 or -1 in the scalar part, and 0 for all the bivectors both represent a rotation by 0 degrees. (or by 360 degrees depending on how you look at it)

At first I responded that I hadn’t gotten to that part yet, but as I was working on this, the double cover just never came up. So eventually I decided to go looking for it, and… I couldn’t find it. It seemed like my quaternions didn’t have double cover. So I double checked everything and noticed that I have one difference: Remember how in order to multiply a quaternion with a vector we did this multiplication: . I accidentally didn’t do that. I just did .

And the simple multiplication actually works as long as you’re only rotating vectors on a plane that they actually lie on. For example rotating the vector on the plane works out: . The problems start if we’re rotating a vector that doesn’t completely lie on the plane that you’re rotating on. So let’s say I’m rotating the vector on the plane:

That’s strange: Some of our vector part has disappeared, and instead we have a trivector. This is not good. You don’t want part of the vector to disappear after a rotation. Rotating with fixes the problem, because the trivector part cancels out:

So now the part that’s on the plane (the component) got rotated, but the part that’s not on the plane (the component) was left unchanged. This is exactly what we want.

But look at what happened: The first rotation was a 90 degree rotation and the part that’s on the plane ended up at . And now we did a full 180 degree rotation and that part ended up at . How did that happen?

Well it actually makes sense. We are multiplying with the quaternion twice after all. Of course it would do a double rotation. It’s clearest if you multiply it all out, but the short explanation is that the conjugate allows us to rotate roughly in the same direction while multiplying from the other side: , and we went ahead and just multiplied on both sides . So if we multiply on both sides of course we get twice the rotation.

This is literally where the half-angles of quaternions and the double cover come from: From the way we multiply quaternions with vectors. Internally quaternions actually don’t have double cover. If you multiply one 90 degree quaternion with a different quaternion, then after four rotations that second quaternion will end up exactly where it started. But then we chose a vector multiplication function that applies the quaternion twice. So we have to change the interpretation and that 90 degree quaternion becomes a 180 degree quaternion. And actually my visualizations above don’t make sense any more because the vector multiplication always does that operation twice.

So if the vector multiplication is the problem, could we define a vector multiplication that doesn’t lead to double cover? That would make quaternions much simpler.

And the answer is that yes, we can. Remember that rotating vectors that lie on the plane already worked correctly. The problem was that rotating an orthogonal vector would turn into a trivector. (but rotations should leave orthogonal vectors unchanged) The solution is that we have to first project the vector down onto the plane, then rotate within the plane, and then apply the original offset again. Here is an outline of the algorithm:

- Compute the normal of the plane by multiplying with the trivector (very fast)
- Project the vector onto that normal (fast, as long as you use the version without a square root)
- Subtract that projected part (very fast)
- Multiply the vector with the quaternion
- Add the projected part (very fast)

So now we only have to do a single multiplication instead of two multiplications. And since all other operations are fast, this might even be faster than the double-cover-giving quaternion/vector multiplication.

And yes, this totally works and it’s faster and it’s less confusing. But you don’t want to use it. The reason is that as soon as I didn’t have double cover in my quaternions, I discovered why double cover is actually awesome.

Double cover is what makes quaternion interpolation so great. (by interpolation I mean getting from rotation a to rotation b in multiple small steps as opposed to one large step) Without double cover, there are some quaternions that you can not interpolate between. Having to worry about those special cases makes interpolation a giant pain and defeats the whole point of why we used quaternions to begin with.

To explain what the problem is, let’s do a couple 90 degree rotations on the plane, once using double cover and once not using double cover:

Rotation | Single Cover | Double Cover |
---|---|---|

If we interpreted these two numbers as vectors, the double cover version would do a 45 degree rotations of the vector each time. But since the double cover quaternion will rotate twice, this will actually give us a 90 degree rotation from one row to the next.

Here is a visualization of the same numbers. The idea here is that I put the scalar value on the x axis and the bivector on the y axis:

I drew the double cover as two lines, and the single cover as one line. Once again we see that a quaternion that uses double cover rotation is simply half-way towards the quaternion that uses single cover rotation.

I said that double cover is what makes quaternion interpolation so great. To see why, let’s try interpolating between these. To keep it simple I won’t do a slerp, but I’ll just try to find the rotation half-way between any of these rotations. We do that by adding the quaternions and then renormalizing them. Interpolating from the rotation to the rotation is pretty easy in both cases:

For single cover: and after normalization that comes out to be which is a 45 degree rotation.

For double cover: and after normalization that comes out to be , which is a 22.5 degree rotation, or with the double cover it’s a 45 degree rotation.

So interpolating a 90 degree rotation works just fine in both cases.

However we run into problems when interpolating from the rotation to the rotation:

For single cover: . Huh. We can’t find the half-way rotation between these two because we just get 0, which we can’t normalize. You may think that this is just a problem because I chose to find the exact midpoint between these two vectors. But this is also a problem if we want to slerp from one to the other. It all collapses and we’re left with a zero vector.

So let’s reason through this manually. How would we interpolate from +1 to -1? We could rotate on the xy plane or on the yz plane or on the zx plane, or on any combined bivector. How do we know which bivector to choose? They’re all zero in both of our inputs. We’re missing information. In order to interpolate between two rotations, we need to know a plane on which we want to interpolate.

Let’s see how the double cover solves this: and after normalization we’re left with which was our 90 degree rotation, which is exactly the half-way point between the 0 degree rotation and the 180 degree rotation.

Isn’t that neat? In the double cover version one of our quaternions had a component, so we could interpolate on that plane. In fact you could build many possible 180 degree rotations in the double cover version. We could build a 180 degree rotation that rotates on the plane or on a linear combination of the and planes, or on any arbitrary plane. They all look different and they all interpolate differently. That’s a great property because we want to be able to interpolate on any plane of our choosing. In the single cover version however we only have one way to rotate 180 degrees and it looks the same no matter which plane you’re on. Which works fine if all you want to do is rotate 180 degrees, but it doesn’t work if you want to interpolate from one rotation to the other.

One way of thinking of this is that the trick of double cover is that you can express any rotation as a rotation of less than 90 degrees. We already saw that if we want to go 180 degrees, we just go 90 degrees twice. Want to go 270 degrees? Just go -45 degrees twice. Like that we can always stay far away from the problem point of the 180 degree rotation that we would run into often if we used the single cover version of quaternions. And like that we always keep the information of which plane we are rotating on, making interpolation easy.

Another way of thinking of this is that the double cover version always gives us a midpoint of the rotation which we can use to interpolate. For some pairs of rotations, there are a lot of possible midpoints depending on which plane we want to interpolate on. Double cover solves that problem by giving us one midpoint, which narrows our choices down to one plane. And we can derive any other desired interpolation if we have the midpoint.

You may be wondering if there is a problem point where the double cover breaks down. Looking at the table above, we can find one: Rotating by 360 degrees: . Which we can not renormalize. But that case is easy to handle, and in fact every slerp implementation already handles this: We detect if the dot product of the quaternions is negative, and if it is we flip the target quaternion. So then we interpolate from to which is just a 0 degree rotation. Which is exactly what we wanted. So as long as we handle the “negative dot product” case in our interpolation function, we can handle all possible rotations. Because there are two possible ways to express every rotation, and if we run into one that’s inconvenient, we just switch to the other one.

So I hope I have convinced you that you want to have double cover. It’s a neat trick that makes interpolation easy. Quaternions do not “naturally” have double cover, but the double cover comes from the way we define the vector multiplication. If we used a different algorithm to multiply a quaternion with a vector (I outlined one above) then we could get rid of the double cover, but we would be making interpolation more difficult. I actually think that the double cover trick is not unique to quaternions. I think we could also apply it to rotation matrices to make them easier to interpolate. I haven’t done the math for that though.

So in summary I hope that I was able to make quaternions a whole lot less weird. The geometric algebra interpretation of quaternions shows us that they are normal 3D constructs, not weird four-dimensional beasts. They consist of a scalar and three bivectors. Bivectors do 90 degree rotations followed by scaling, and we saw how we can create any rotation just from those 90 degree rotations and linear scaling. The rules that govern these constructs are simple, making the equations easy to derive and understand. (as opposed to the quaternion equations which can only be memorized) Also quaternions do not naturally have a double cover. The double cover comes from the way we define the multiplication of vectors and quaternions. We could get rid of it, but the double cover is a great trick for making interpolations easier.

Unfortunately this still only makes it slightly easier to understand the numbers in quaternion. The double cover makes it so that each rotation actually gets applied twice, so my visualizations above only show half of what’s going on. This also makes it difficult to interpret the numbers because you have to know what happens if a rotation gets applied twice, which is a whole lot harder to do in your head than doing a single rotation. But still I now have a picture of quaternions, and I know what each component means, and why they behave the way they do. I hope I was able to do something similar for you.

I also think that Geometric Algebra is a very interesting field that merits further study. The fact that quaternions came out so naturally (in fact they almost don’t even need a special name) and that if we do the same derivation in 2D we end up with complex numbers is fascinating to me. The paper I linked at the beginning, Imaginary Numbers are not Real, spends a lot of time talking about how various equations in physics come out much simpler if we use geometric algebra instead of imaginary numbers and matrices. Simplicity like that is a good hint that there is something good going on here. If you’re interested in this for doing 3D math, there is something called Conformal Geometric Algebra which adds translation to quaternions. I didn’t look too much into it, but a brief glance shows that it might be related to dual quaternions. So there’s much more to discover.

]]>

The trick is to use Robin Hood hashing with an upper limit on the number of probes. If an element has to be more than X positions away from its ideal position, you grow the table and hope that with a bigger table every element can be close to where it wants to be. Turns out that this works really well. X can be relatively small which allows some nice optimizations for the inner loop of a hashtable lookup.

If you just want to try it, here is a download link. Or scroll down to the bottom of the blog post to the section “Source Code and Usage.” If you want more details read on.

There are many types of hashtables. For this one I chose

- Open addressing
- Linear probing
- Robing hood hashing
- Prime number amount of slots (but I provide an option for using powers of two)
- With an upper limit on the probe count

I believe that the last of these points is a new contribution to the world of hashtables. This is the main source of my speed up, but first I need to talk about all the other points.

Open addressing means that the underlying storage of the hashtable is a contiguous array. This is not how std::unordered_map works, which stores every element in a separate heap allocation.

Linear probing means that if you try to insert an element into the array and the current slot is already full, you just try the next slot over. If that one is also full, you pick the slot next to that etc. There are known problems with this simple approach, but I believe that putting an upper limit on the probe count resolves that.

Robin Hood hashing means that when you’re doing linear probing, you try to position every element such that it is as close as possible to its ideal position. You do this by moving objects around whenever you insert or erase an element, and the method for doing that is that you take from rich elements and give to poor elements. (hence the name Robin Hood hashing) A “rich” element is an element that received a slot close to its ideal insertion point. A “poor” element is one that’s far from its ideal insert point. When you insert a new element using linear probing you count how far you are from your ideal position. If you are further from your ideal position than the current element, you swap the new element with the existing element and try to find a new spot for the existing element.

The prime number amount of slots means that the underlying array has a prime number size. Meaning it grows for example from 5 slots to 11 slots to 23 slots to 47 slots etc. Then to find the insertion point you simply use the modulo operator to assign the hash value of an element to a slot. The other most common choice is to use powers of two to size your array. Later in this blog post I will go more into why I chose prime numbers by default and when you want to use which.

With the basics out of the way, let’s talk about my new contribution: Limiting how many slots the table will look at before it gives up and grows the underlying array.

My first idea was to set this to a very low number, say 4. Meaning when inserting I try the ideal slot and if that doesn’t work I try the next slot over, the next slot over, the slot after that and if all of them are full I grow the table and try inserting again. This works great for small tables, but when I insert random values into a large table, I would get unlucky all the time and hit four probes and I would have to grow the table even though it was mostly empty.

Instead I found that using log2(n) as the limit, where n is the number of slots in the table, makes it so that the table only has to reallocate once it’s roughly two thirds full. That is when inserting random values. When inserting sequential values the table can be filled up completely before it needs to reallocate.

Even though I found that the table can fill up to roughly two thirds, every now and then it would have to reallocate when it’s only 60% full. Or rarely even when it’s only 55% full. So I set the max_load_factor of the table to 0.5. Meaning the table will grow when its half full, even when it hasn’t reached the limit of the probe count. The reason for that is that I want a table that you can trust to reallocate only when you actually grow it: If you insert a thousand elements, then erase a couple elements and then insert that same number of elements again, you can be almost certain that the table won’t reallocate. I can’t put a number on the certainty, but I ran a simple test where I built thousands of tables of all kinds of sizes and filled them with random integers. Overall I inserted hundreds of billions of integers into the tables, and they only reallocated at a load factor of less than 0.5 once. (that time the table grew when it was 48% full, so it grew slightly too soon) So I think you can trust that this will very, very rarely reallocate when you weren’t expecting it.

That being said if you don’t need control over when the table grows, feel free to set the max_load_factor higher. It’s totally safe to set it to 0.9: Robin hood hashing combined with the maximum probe count will ensure that all operations remain fast. Don’t set it to 1.0 though: You can get into bad situations when inserting then because you might hit a case where every single element in the table has to be shifted around when inserting the last element. (say every element is in the slot it wants to be, except the very last slot is empty. Then you insert an element that wants to be in the first slot, but the first slot is already full. So it will go into the second slot, pushing the second element one over, which will push the third element one over etc. all the way through the table until the element in the second to last slot gets pushed into the last slot. You now have a table where every element except for the first is one slot from its ideal slot so lookups are still really fast, but that last insert took a long time) By keeping a few empty slots around you can ensure that newly inserted elements only have to move a few elements over until one of them finds an empty slot.

So if I set the max_load_factor so low that I never reach the probe count limit anyway, why have the limit at all? Because it allows a really neat optimization: Let’s say you rehash the table to have 1000 slots. My hashtable will then grow to 1009 slots because that’s the closest prime number. The log2 of that is 10, so I set the probe count limit to 10. The trick now is that instead of allocating an array of 1009 slots, I actually allocate an array of 1019 slots. But all other hash operations will still pretend that I only have 1009 slots. Now if two elements hash to index 1008, I can just go over the end and insert at index 1009. I never have to do any bounds checking because the probe count limit ensures that I will never go beyond index 1018. If I ever have eleven elements that want to go into the last slot, the table will grow and all those elements will hash to different slots. Without bounds checking, my inner loops are tiny. Here is what the find function looks like:

iterator find(const FindKey & key) { size_t index = hash_policy.index_for_hash(hash_object(key)); EntryPointer it = entries + index; for (int8_t distance = 0;; ++distance, ++it) { if (it->distance_from_desired < distance) return end(); else if (compares_equal(key, it->value)) return { it }; } }

It’s basically a linear search. The assembly of this code is beautiful. This is better than simple linear probing in two ways: 1. No bounds checking. Empty slots have -1 in their distance_from_desired value so the empty case is the same case as finding a different element. 2. This will do at most log2(n) iterations through the loop. Normally the worst case for looking things up in a hashtable is O(n). For me it’s O(log n). This makes a real difference. Especially since linear probing actually makes it pretty likely that you will hit the worst case since linear probing tends to bunch elements together.

My memory overhead on this is one byte per element. I store the distance_from_desired in an int8_t. That being said that one byte will be padded out to the alignment of the type that you insert. So if you insert ints, the one byte will get three bytes of padding so there is four bytes of overhead per element. If you insert pointers there will be 7 bytes of padding so you get eight bytes of overhead per element. I’ve thought about changing my memory layout to solve this, but my worry is that then I would have two cache misses for each lookup instead of one cache miss. So the memory overhead is one byte per element plus padding. Oh and with a max_load_factor of 0.5 (which is the default) your table will only be between 25% and 50% full, so there is more overhead there. (but once again it’s safe to increase the max_load_factor to 0.9 to save memory while only suffering a small decrease in speed)

Measuring hash tables is actually not easy. You need to measure at least these cases:

- Looking up an element that’s in the table
- Looking up an element that can not be found in the table
- Inserting a bunch of random numbers
- Inserting a bunch of random numbers after calling reserve()
- Erasing elements

And you need to run each of these with different keys and different sizes of values. I use an int or a string as the key, and I use value types of size 4, 32 and 1024. I will prefer to use int keys because with strings you’re mostly measuring the overhead of the hash function and the comparison operator, and that overhead is the same for all hash tables.

The reason for testing both successful lookups and unsuccessful lookups is that for some tables there is a huge difference in performance between these cases. For example I came across a really bad case when inserting all the numbers from 0 to 500000 into a google::dense_hash_map (meaning they were not random numbers) and then did unsuccessful lookups: The hashtable suddenly was five hundred times slower than it usually is. This is a edge case of using a power of two for the size of the table. I’ll go more into when you should pick powers of two and when you should pick prime numbers below. This example suggests that maybe I should measure each of these with random numbers and with sequential numbers, but that ended up being too many graphs. So I will only test tables with random numbers which should prevent bad cases caused by specific patterns.

The first graph is looking up an element that’s in the table:

This is a pretty dense graph so let’s spend some time on this one. flat_hash_map is the new hash table I’m presenting in this blog post. flat_hash_map_power_of_two is that same hash table but using powers of two for the array size instead of prime numbers. You can see that it’s much faster and I’ll explain why that is below. dense_hash_map is google::dense_hash_map which is the fastest hashtable I could find. sherwood_map is my old hashtable from my “I Wrote a Faster Hashtable” blog post. It’s embarrassingly slow… std::unordered_map and boost::unordered_map are self-explanatory. multi_index is boost::multi_index.

I want to talk about this graph a little. The Y-axis is the number of nanosecons that it takes to look up a single item. I use google benchmark and that calls table.find() over and over again for half a second, and then counts how many times it was able to call that function. You get nanoseconds by dividing the time that all iterations took together by the loop count. All the keys I’m looking for are guaranteed to be in the table. I chose to use a log scale for the X axis because performance tends to change on a log scale. Also this makes it possible to see the performance at different scales: If you care about small tables, you can look at the left side of the graph.

The first thing to notice is that all of the graphs are spiky. This is because all hashtables have different performance depending on the current load factor. Meaning depending on how full they are. When a table is 25% full lookups will be faster than when it’s 50% full. The reason for this is that there are more hash collisions when the table is more full. So you can see the cost go up until at some point the table decides that it’s too full and that it should reallocate, which makes lookups fast again.

This would be very clear if I were to plot the load factors of each of these tables. One more thing that would be clear is that the tables at the bottom have a max_load_factor of 0.5, the tables at the top have a max_load_factor of 1.0. This immediately raises the question of “wouldn’t those other tables also be faster if they used a max_load_factor of 0.5?” The answer is that they would only be a little faster, but I will answer that question more fully with a different graph further down. (but just from the graph above you can see that the lowest point of the upper graph, when those tables have just reallocated and have a load factor of just over 0.5 is far over the highest point of the lower graphs, just before they reallocate because their load factor is just under 0.5)

Another thing that we notice is that all the graphs are essentially flat on the left half of the screen. This is because the table fits entirely into the cache. Only when we get to the point where the data doesn’t fit into my L3 cache do we see the different graphs really diverge. I think this is a big problem. I think the numbers on the right are far more realistic than the numbers on the left. You will only get the numbers on the left if the element you’re looking for is already in the cache.

So I tried to come up with a test that would measure how fast the table would be if it wasn’t in the cache: I create enough tables so that they don’t all fit into L3 cache and I use a different table for every element that I look up. Let’s say I want to measure a table that has 32 elements in it and the elements in the table are 8 bytes in size. My L3 cache is 6 mebibytes so I can fit roughly 25000 of these tables into my L3 cache. To be sure that the tables won’t be in the cache I actually create three times that number, meaning 75000 tables. And each lookup is from a different table. That gives me this graph:

First, I removed a couple of the lines because they didn’t add much information. boost::unordered_map is usually the same speed as std::unordered_map (sometimes it’s a little faster, but it’s still always above everything else) and nobody cares about my old slow hash table sherwood_map. So now we’re left with just the important ones: std::unordered_map as a normal node based container, boost::multi_index as a really fast node based container, (I believe that std::unordered_map could be this fast) google::dense_hash_map as a fast open addressing container, and my new container in its prime number version and its power of two version.

So in this new benchmark, where I try to force a cache miss, we can see big differences very early on. What we find is that the pattern that we saw at the end of the last graph emerges very early in this graph: Starting at ten elements in the table there are clear winners in terms of performance. This is actually pretty impressive: All of these hash tables maintain consistent performance across many different orders of magnitude.

Let’s also look at the graph for unsuccessful lookups: Meaning trying to find an item that is not in the table:

When it comes to unsuccessful lookups the graph is even more spiky: The load factor really matters here. The more full a table is the more elements the search has to look at before it can conclude that an item is not in the table. But I’m actually really happy about how my new table is doing here: Limiting the probe count seems to work. I get more consistent performance than any other table.

What I take from these graphs is that my new table is a really big improvement: The red line, with the powers of two, is my table configured the same way as dense_hash_map: With max_load_factor 0.5 and using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits. The only big difference is that my table requires one byte of extra storage (plus padding) per slot in the table. So my table will use slightly more memory than dense_hash_map.

The surprising thing is that my table is as fast as dense_hash_map even when using prime numbers to size the table. So let me talk about that.

There are three expensive steps in looking up an item in a hashtable:

- Hashing the key
- Mapping the key to a slot
- Fetching the memory for that slot

Step 1 can be really cheap if your key is an integer: You just cast the int to a size_t. But this step can be more expensive for other types, like strings.

Step 2 is just an integer modulo.

Step 3 is a pointer dereference, for std::unordered_map it’s actually multiple pointer dereferences.

Intuitively you would expect that if you don’t have a very slow hash function, step 3 is the most expensive of these three. But if you’re not getting cache misses for every single lookup, chances are that the integer modulo will end up being your most expensive operation. Integer modulo is really slow, even on modern hardware. The Intel manual lists it as taking between 80 and 95 cycles.

This is the main reason why really fast hash tables usually use powers of two for the size of the array. Then all you have to do is mask off the upper bits, which you can do in one cycle.

There is however one big problem with using a power of two: There are many patterns of input data that result in lots of hash collisions when using powers of two. For example here is that last graph again, except I didn’t use random numbers:

Yes you see correctly that google::dense_hash_map just takes off into the stratosphere. What pattern of inputs did I have to use to get such poor performance out of dense_hash_map? It’s just sequential numbers. Meaning I insert all numbers [0, 1, 2, …, n – 2, n – 1]. If you do that, trying to look up a key that’s not in the table will be super slow. Successful lookups will still be fine. But if some of your lookups are for keys that are in the table and some are for keys that are not, then you might find that some of your lookups are a thousand times slower than others.

Another example of bad performance due to using powers of two is how the standard hashtable in Rust was accidentally quadratic when inserting keys from one table into another. So using powers of two can bite you in non-obvious ways.

It just so happens that my hashtable doesn’t suffer from either of these problems: The limit of the probe count resolves both of these problems in the best way possible. The table doesn’t even have to reallocate unnecessarily. Does that mean that I’m immune against problems that come from using powers of two? No. For example one problem that I have personally experienced in the past is that when you insert pointers into a hash table that uses powers of two, some slots will never be used. The reason is heap allocations in my program were sixteen byte aligned and I used a hash function that just reinterpret_casted the pointer to a size_t. Because of that only one out of sixteen slots in my table was ever used. You would run into the same problem if you use the power of two version of my new hashtable.

All of these problems are solvable if you’re careful about choosing a hash function that’s appropriate for your inputs. But that’s not a good user experience: You now always have to be vigilant when using hashtables. Sometimes that’s OK, but sometimes you just want to not have to think too much about this. You just want something that works and doesn’t randomly get slow. That’s why I decided to make my hashtable use prime number sizes by default and to only give an option for using powers of two.

Why do prime numbers help? I can’t quite explain the math behind that, but the intuition is that since the prime number doesn’t share common divisors with anything, all numbers get different remainders. For example let’s say I’m using powers of two, my hashtable has 32 slots, and I am trying to insert pointers which are all sixteen byte aligned. (meaning all my numbers are multiples of sixteen) Now using integer modulo to find a slot in the table will only ever give me two possible slots: slot 0 or slot 16. Since 32 is divisible by 16, you simply can’t get more possible values than that. If I use a prime numbered size instead I don’t run into that problem. For example if I use the prime number 37, then all divisions using multiples of sixteen give me different slots in the table, and I will use all 37 slots. (try doing the math and you will see that the first 37 multiples of 16 all would end up in different slots)

So then how do we solve the problem of the slow integer modulo? For this I’m using a trick that I copied from boost::multi_index: I make all integer modulos use a compile time constant. I don’t allow all possible prime numbers as sizes for the table. Instead I have a selection of pre-picked prime numbers and will always grow the table to the next largest one out of that list. Then I store the index of the number that your table has. When it later comes time to do the integer modulo to assign the hash value to a slot, you will see that my code does this:

switch(prime_index) { case 0: return 0llu; case 1: return hash % 2llu; case 2: return hash % 3llu; case 3: return hash % 5llu; case 4: return hash % 7llu; case 5: return hash % 11llu; case 6: return hash % 13llu; case 7: return hash % 17llu; case 8: return hash % 23llu; case 9: return hash % 29llu; case 10: return hash % 37llu; case 11: return hash % 47llu; case 12: return hash % 59llu; // // ... more cases // case 185: return hash % 14480561146010017169llu; case 186: return hash % 18446744073709551557llu; }

Each of these cases is a integer modulo by a compile time constant. Why is this a win? Turns out if you do a modulo by a constant, the compiler knows a bunch of tricks to make this fast. You get custom assembly for each of these cases and that custom assembly will be much faster than an integer modulo would be. It looks kinda crazy but it’s a huge speed up.

You can see the difference in the graphs above: Using prime numbers is a little bit slower, but it’s still really fast when compared to other hash tables, and you’re immune against most bad cases of hash tables. Of course you’re never immune against all bad cases. If you really worry about that, you should use std::map with its strict upper bounds. But the difference is that when using powers of two, there are many bad cases and you have to be careful not to stumble into them. When using prime numbers you will basically only hit a bad case if you intentionally create bad keys.

That brings up security: A clever attack that you can do on hash tables is that you insert keys that all collide with each other. How might you do that? If you know that a website uses a hashtable in its internal caches, then you could engineer website requests such that all your requests will collide in that hashtable. Like that you can really slow down the internal cache of a webserver and possibly bring down the website. So for example if you know that google uses dense_hash_map internally and you see the graph above where it gets really slow if you insert sequential numbers, you could just request sequential websites and hope that that pollutes their cache. You might think that setting an upper limit on the probe count prevents attackers from filling up your table with bad keys. That is true: My hashtable will not suffer from this problem. However a new attack immediately presents itself: If you know which prime numbers I use internally you could insert keys in an order so that my table repeatedly hits the limit of the probe count and has to repeatedly reallocate. So the new attack is that you can make the server run out of memory. You can solve this by using a custom hash function, but I can’t give you advice for what such a hash function should look like. All I can tell you is that if you use the hashtable in an environment where users can insert keys, don’t use std::hash as your hash function and use something stateful instead that can’t be predicted ahead of time. On the other hand if you don’t think that people will be malicious, you can be confident that using the prime number version of my hashtable will result in an even spread of values and there will be no problems.

But let’s say you know that your hash function returns numbers that are well distributed and that you’re rarely going to get hash collisions even if you use powers of two. Then you should use the power of two version of my table. To do that you have to typedef a hash_policy in your hash function object. I decided to put this customization point into the hash function object because that is the place that actually knows the quality of the returned keys.

So you put this typedef into your custom hash function object:

struct CustomHashFunction { size_t operator()(const YourStruct & foo) { // your hash function here } typedef ska::power_of_two_hash_policy hash_policy; }; // later: ska::flat_hash_map<YourStruct, int, CustomHashFunction> your_hash_map;

In your custom hash function you typedef ska::power_of_two_hash_policy as hash_policy. Then my flat_hash_map will switch to using powers of two. Also if you know that std::hash is good enough in your case, I provide a type called power_of_two_std_hash that will just call std::hash but will use the power_of_two_hash_policy:

ska::flat_hash_map<K, V, ska::power_of_two_std_hash<K>> your_hash_map;

With either of these you can get a faster hashtable if you know that you won’t be getting too many hash collisions.

After that lengthy detour of talking about hash table theory let’s get back to the performance of my table. Here is a graph that measures how long it takes to insert an item into a map. The way I measure this is that I measure how long it takes to insert N elements, then I divide the time by N to get the time that the average element took. The first graph is the speed if I do not call reserve() before inserting:

This graph is also spiky, but the spikes point in the other direction. Any time that the table has to reallocate the average cost shoots up. Then that cost gets amortized until the table has to reallocate again.

The other point about this graph is that on the left half you once again only have tables that fit entirely in the L3 cache. I decided to not write a cache-miss-triggering test for this one because that would take time and we learned above that just looking at the right half is a good approximation for a cache miss.

Here google::dense_hash_map beats my new hash table, but not by much. My table is still very fast, just not quite as fast as dense_hash_map. The reason for this is that dense_hash_map doesn’t move elements around when inserting. It simply looks for an empty slot and inserts the element. The Robin Hood hashing that I’m using requires that I move elements around when inserting to keep the property that every node is as close to its ideal position as possible. It’s a trade-off where insertion becomes more expensive, but lookups will be faster. But I’m happy with how it seems to only have a small impact.

Next is the time it takes to insert elements if the table had reserve() called ahead of time:

I don’t know what’s happening with the node-based containers at the end there. It might be fun to investigate what’s going on there, but I didn’t do that. I actually have a suspicion that that’s due to the malloc call in my standard library. (Linux gcc) I had several problems with it while measuring this graph and others because some operations would randomly take a long time.

But overall this graph looks similar to the last one, except less spiky because the reserve removes the need for reallocations. I have fast inserts, but they’re not as fast as those of google::dense_hash_map.

Finally let’s look at how long it takes to erase elements. For this I built a map of N elements, then measured how long it takes to erase every element in the map in a random order. Then I would divide the time it takes to erase all elements by N to get the cost per element:

The node based containers are slow once again, and the flat containers are all roughly equally fast. dense_hash_map is slightly faster than my hash table, but not by much: It takes roughly 20 nanoseconds to erase something from dense_hash_map and it takes roughly 23 nanoseconds to erase something from my hash table. Overall these are both very fast.

But there is one big difference between my table and dense_hash_map: When dense_hash_map erases an element, it leaves behind a tombstone in the table. That tombstone will only be removed if you insert a new element in that slot. A tombstone is a requirement of the quadratic probing that google::dense_hash_map does on lookup: When an element gets erased, it’s very difficult to find another element to take its slot. In Robin Hood hashing with linear probing it’s trivial to find an element that should go into the now empty slot: just move the next element one forward if it isn’t already in its ideal slot. In quadratic probing it might have to be an element that’s four slots over. And when that one gets moved you have to again solve the problem of finding a node to insert into the newly vacated slot. So instead you insert a tombstone and then the table knows to ignore tombstones on lookup. And they will be replaced on the next insert.

What this means though is that the table will get slightly slower once you have tombstones in your table. So dense_hash_map has a fast erase at the cost of slowing down lookups after an erase. Measuring the impact of that is a bit difficult, but I believe I have found a test that works for this purpose: I insert and erase elements over and over again:

The way this test works is that I first generate a million random ints. Then I insert these into the hashtable and erase them again and insert them again. The trick is that I do this in a random order: So let’s say I only had the four integers 1, 2, 3 and 4. Then a valid order for “insert, erase and insert again” would be insert 1, insert 3, erase 1, insert 2, insert 4, erase 4, insert 4, insert 1, erase 2, erase 3, insert 3, insert 2. Every element gets inserted, erased and inserted again. But the order is random. The graph above counts the number of inserts and measures how long the insert takes per element. The first data point, all the way on the left is just inserting a million elements. The second data point is inserting a million elements, erasing them and inserting them again in a random order like I explained. The next data point does insert, erase, insert, erase, insert. So three inserts in total. You get the idea.

What we see is that at first dense_hash_map is faster because its inserts are faster. At the second data point my hashtable has already caught up to it. At the third data point my hashtable is winning, and at six million inserts even my prime number version is winning. The reason why my tables keep on getting faster is that as there are more erases, you would expect the average load factor of the table to go down. If you insert and erase a million items often enough, the table will always have close to 500,000 elements in it. So as you get further to the right in this graph, the table will be less full on average. My hash table can take advantage of that, but dense_hash_map has a bunch of tombstones in the table which prevent it from going faster. That being said if we compare dense_hash_map against other hash tables, it’s still very fast:

So from this angle it totally makes sense for dense_hash_map to use quadratic probing, even if that requires inserting tombstones into the table on erase. The table is still very fast, certainly much faster than any node based container. But the point remains that Robin Hood linear probing gives me a more elegant way of erasing elements because it’s easy to find which element should go into the empty slot. And if you have a table where you often erase and insert elements, that’s an advantage.

One final graph that I promised above is a way to resolve the problem that std::unordered_map and boost::multi_index use a max_load_factor of 1.0, while my table and google::dense_hash_map use 0.5. Wouldn’t the other tables also be faster if they used a lower max_load_factor? To determine that I ran the same benchmark that I used to generate the very first graph (successful lookups) but I set the max_load_factor to 0.5 on each table. And then I took measurements just before a table reallocates. I’ll explain it a bit better after the graph:

This is the same graph as the very first graph in this blog post, except all the tables use a max_load_factor of 0.5. And then I wanted to only measure these tables when they really do have the same load factor, so I measured each table just before it would reallocate its internal storage. So if you look back at the very first graph in this blog post, imagine that I drew lines from one peak to the next. If we want to directly compare performance of hashtables and we want to eradicate the effect of different hash tables using different max_load_factor values and different strategies for when they reallocate, I think this is the right graph.

In this graph we see that flat_hash_map is faster than dense_hash_map, just as it was in the initial graph. It’s now much clearer though because all the noisiness is gone. Btw that brief time where dense_hash_map is faster is a result of dense_hash_map using less memory: At that point dense_hash_map still fits in my L3 cache but my flat_hash_map does not. Knowing this I can also see the same thing in the first graph, but it’s much clearer here.

But the main point of this was to compare boost::multi_index and std::unordered_map, which use a max_load_factor of 1.0 to my flat_hash_map and dense_hash_map which use a max_load_factor of 0.5. As you can see even if we use the same max_load_factor for every table, the flat tables are faster.

This was expected, but I still think this was worth measuring. In a sense this is the truest measure of hash table performance because here all hash tables are configured the same way and have the same load factor: Every single data point has a current load factor of 0.5. That being said I did not use this method of measuring for my other graphs, because in the real world you probably will never change the max_load_factor. And in the real world you will see the spiky performance of the initial graph where similar tables can have very different performance, depending on how many hash collisions there are. (and the load factor is actually only one part of that, as I also discussed above when talking about powers of two vs prime numbers) And also this graph hides one benefit of my table: Limiting the probe count leads to more consistent performance, making the lines of my hash_map less spiky than the lines of other tables.

So far every graph was measuring performance of a map from int to int. However there might be differences in performance when using different keys or larger values. First, here are the graphs for successful lookups and unsuccessful lookups when using strings as keys:

Yes, I went with the version of the graph where the table is already in the cache. It’s easier to generate. What we see here is that using a string is just moving all lines up a little. Which is expected, because the main cost here is that the hash function changed and the comparison is more expensive. Let’s look at unsuccessful lookups, too:

This is very interesting: It looks like looking for an element that’s not in the table is more expensive in google::dense_hash_map than in boost::multi_index. The reason for this is interesting: When creating a dense_hash_map you have to provide a special key that indicates that a slot is empty, and a special key that indicates that a slot is a tombstone. I used std::string(1, 0) and std::string(1, 255) respectively. But what this means is that the table has to do a string comparison to see that the slot is empty. All the other tables just do an integer comparison to see that a slot is empty.

That being said a string comparison that only compares a single character should be really cheap. And indeed the overhead is not that big. It just looks big above because every lookup is a cache hit. The cache miss picture looks different:

In this we can see that when the table is not already in the cache, dense_hash_map remains faster. Except it gets slower when the table gets very big. (more than a million entries) I didn’t find out why that is.

The next thing I tried to vary was the size of the value. What happens if I don’t have a map from an int to an int, but from an int to a 32 byte struct? Or from an int to a 1024 byte struct. So for the lookups I have 12 graphs in total ([int value, 32 byte value, 1024 byte value] x [int key, string key] x [successful lookup, unsuccessful lookup]) and most of them look exactly like the graphs above: All string lookups look the same independent of value size, and most int lookups also look the same. Except for one: Unsuccessful lookup of an int key and a 1024 byte value:

What we see here is that at a 1024 byte value, multi_index is actually competitive to the flat tables. The reason for this is that in an unsuccessful lookup you have to do the maximum number of probes, and with a value type that’s as huge as 1024 bytes, your prefetcher has to work hard. My table still seem to be winning, but for a value that’s this large, everything is essentially a node based container.

The reason why all other lookup graphs looked the same (and why I don’t show them) is this: For the node-based containers you don’t care how big the value is. Everything is a separate heap allocation anyway. For the flat containers you would expect that you would get more cache misses. But since the max_load_factor is 0.5, the element is usually found in the table pretty quickly. The most common case is exactly one lookup: Either you find it in the first probe or you know with the first probe that it won’t be in the table. Two probes also happen pretty often, but three probes are rare. Also at least in my table the lookups are just a linear search. CPUs are great at prefetching the next element in a linear search, no matter how big the item is.

So lookups mostly don’t change with the size of the type, the graph for inserts and erases changes a lot though. Here is inserting with an int as a key and a 32 byte struct as a value:

All the graphs have moved up a bit, but the graphs of the flat tables have moved up the most and have become more spiky: Reallocations hurt more when you have to move more data. The node based containers are not affected by this, and boost::multi_index stays competitive for a very long time. Let’s see what this looks like for a really large type, a 1024 byte struct:

Now the order has flipped completely: The flat containers are more expensive and very spiky, the node based containers keep their speed. At this point reallocation cost dominates completely.

One oddity is that it’s really expensive to insert a single element into a dense_hash_map. (all the way on the left of the yellow line) The reason for this is that dense_hash_map allocates 32 slots at first and it fills all of them with a default constructed value type. Since my value type is 1024 bytes in size, it has to set 32 kib of data to 0. This probably won’t affect you, but I felt like I should explain the strange shape of the line.

The other thing that happened is that dense_hash_map is now slower than my hashtable. I didn’t look into why that is, but I would assume it’s for the same reason as the above paragraph: dense_hash_map fills every slot with a default constructed value type, so reallocation is even more expensive because all the slots have to be initialized, even the ones that will never be used.

If reallocation is expensive, the solution is to call reserve() on the container ahead of time so that no reallocation has to happen. Let’s see what happens when we insert the same elements but call reserve first:

When calling reserve first, my container is faster than the node based containers at first, but at some point boost::multi_index is still faster. dense_hash_map is still slower and again I think that’s because it initializes more elements than necessary and with a value this big, even just initializing the whole table to the “empty” key/value pair takes a lot of time. They could probably optimize this by only initializing the key to the “empty” key and not initializing the value, but then again how often do you insert a value that’s 1024 bytes? It’s neat as a benchmark to test the behavior of containers as the stored values grow very large, but it might not happen in the real world.

My containers are faster until they get large: at exactly 16385 elements there is a sudden jump in the cost. At 16384 elements things are still at the normal speed. Since every element in the container is 1028 bytes, that means that if your container is more than 16 megabytes, it can suddenly get slower. At first I thought this was a random reallocation because I hit the probe count limit, which would have been embarrassing because I explained further up in this blog post about how rare that is, but luckily that’s not the case. The reason for this is interesting: At exactly that measuring point the amount of time I spend in clear_page_c_e goes up drastically. It’s not easy to find out what that is, but luckily Bruce Dawson wrote a blog post where he mentions the cost of zeroing out memory and that this happens in a function called clear_page_c_e. So for some reason at exactly that measuring point it takes the OS a lot longer to provide me with cleared pages of memory. So depending on your memory manager and your OS, this may or may not happen to you.

That also means though that this is a one time cost. Once you’ve grown the container, you will not hit that spike in cost again. So if your container is long lived this cost will be amortized.

Let’s try inserting strings:

dense_hash_map is surprisingly slow in this benchmark. The reason for that is that my version of dense_hash_map doesn’t support move semantics yet. So it makes unnecessary string copies. I’m using the version of dense_hash_map that comes with Ubuntu 16.04, which is probably a bit out of date. But I also can’t seem to find a version that does support move semantics, so I’ll stick with this version.

So we’ll use this graph mostly to compare my table against the node base containers, and my table loses. Once again I blame this on the higher cost of reallocation. So let’s try what happens if I reserve first:

… Honestly, I can’t read anything from this. The cost of the string copy dominates and all tables look the same. The main lesson to learn from this is that when your type is expensive to copy, that will dominate the insertion time and you should probably pick your hash table based on the other measures, like lookup time. We can see a similar picture when I insert strings as key with a large value type:

Once again dense_hash_map is slow because it initializes all those bytes. The other tables are pretty much the same because the copying cost dominates. Except that my flat_hash_map_power_of_two has that same weird spike at exactly 16385 elements due to increased time spent in clear_page_c_e that I also had when inserting ints with a 1024 byte value.

Lesson learned from this: If you have a large type, inserts will be equally slow in all tables, you should call reserve ahead of time, and the node based containers are a much more competitive option for large types than they are for small types.

Let’s also measure erasing elements. Once again I ran three tests for ints as keys and three tests for strings as keys: Using a 4 byte value, a 32 byte value and a 1024 byte value. The four byte value picture is shown above. The 32 byte value picture looks identical, so I’m not even going to show it. The 1024 byte value picture looks like this:

The main difference is that dense_hash_map got a lot slower. This is the same problem as in the others pictures with large value types: The other tables just consider an item deleted and call the destructor which is a no-op for my struct. dense_hash_map will overwrite the value with the “empty” key/value pair which is a large operation if you have 1024 bytes of data.

Otherwise the main difference here is that erasing from flat_hash_map has gotten much more spiky than it was in the other erase picture above, and the line has moved up considerably, getting almost as expensive as in the node based containers. I think the reason for this is that the flat_hash_map has to move elements around when an item gets erased, and that is expensive if each element is 1028 bytes of data.

The graphs for erasing strings look similar enough to the graphs for erasing ints that they’re almost not worth showing. Here is one though:

Looks very similar to erasing ints. If I make the value size 1024 bytes, the graph looks very similar to the one above this one, so just look at that one again.

The final test is the “insert, erase and insert again” test I ran above where I do inserts and erases in random orders. I reduced the number to 10,000 elements because I ran out of memory when running this with ten million 1028 byte elements. It’s also much faster to generate these graphs when I’m using fewer elements. Let’s start with a 32 byte value type:

flat_hash_map actually beats dense_hash_map now. The difference gets bigger if we increase the size of the value even more:

With a really large value type my table beats dense_hash_map. However now the node based containers beat my hash table at first, but over time my table seems to catch up. The reason for this is that I’m not reserving in these graphs. So in the first insert the table has to reallocate a bunch of times and that is very expensive in the flat containers, and it’s cheaper for the node based containers. However as we erase a few elements and insert a few elements, the reallocation cost gets amortized and my tables beat unordered_map, and they would probably also beat multi_index at some point. If I reserve ahead of time they beat multi_index right away:

I actually didn’t expect this because even though I reserve ahead of time, my container still has to move elements around, and that should be really expensive for 1028 bytes of data. My only explanation for why my container remains fast is that the load factor is pretty low and that collisions are rare. When I measure this test with strings, I do see that my container slows down as expected and multi_index is competitive:

The other pictures for inserting strings look similar to the above pictures: When inserting strings with a 32 byte value type, the graph looks like this last one. When inserting with a 1024 byte value type it looks like the graphs where I did the same thing with ints as a key, both for the case where I do reserve and the case where I don’t reserve.

That was quite a lot of measurements. Measuring hashtables is surprisingly complex. I’m still not entirely sure if I should measure all tables with the same load factors or with their default setting. I chose to go for the default setting here. And then there are so many different cases: Different keys, different value sizes, different table sizes, reserve or not etc. And there are a lot of different tests to do. I could have put thousands of graphs into this blog post, but at some point it just gets too much, so let me summarize:

- My new table has the fastest lookups of any table I could find
- It also has really fast insert and erase operations. Especially if you reserve the correct size ahead of time.
- For large types the node based containers can be faster if you don’t know ahead of time how many elements there will be. The cost of reallocations kills the flat containers. Without reallocation my flat container is the fastest in all benchmarks for large types.
- When inserting strings the cost of the string hashing, comparison and copy dominate and the choice of hashtable doesn’t matter much.
- google::dense_hash_map has some surprising cases where it slows down.
- boost::multi_index is a really impressive hash table. It has very good performance for a node based container.
- If you know that your hash function returns a good distribution of values, you can get a significant speed up by using the power_of_two version of my hashtable.

When using my table it’s safe for you to throw exceptions in your constructor, your copy constructor, in your hash function, in your equality function and in your allocator. You are not allowed to throw exceptions in a move constructor or in a destructor. The reason for this is that I have to move elements around and maintain invariants. And if you throw in a move constructor, I don’t know how to do that.

I’ve uploaded the source code to github. You can download it here. It’s licensed under the boost license. It’s a single header that contains both ska::flat_hash_map and ska::flat_hash_set. The interface is the same as that of std::unordered_map and std::unordered_set.

There is one complicated bit if you want to use the power_of_two version of the table: I explain how to do that further up in this blog post. Search for “ska::power_of_two_hash_policy” to get to the explanation.

Also I want to point out that my default max_load_factor is 0.5. It’s safe to set it as high as 0.9. Just be aware that your table will probably reallocate before it hits that number. It tends to reallocate before it’s 70% full because it hits the probe count limit. But if you don’t care that your table might reallocate when you’re not expecting it, you can save a bit of memory by using a higher max_load_factor while only suffering a tiny loss of performance.

I think I wrote the fastest hash table there is. It’s definitely the fastest for lookups, and it’s also really fast for insert and erase operations. The main new trick is to set an upper limit on the probe count. The probe count limit can be set to log2(n) which makes the worst case lookup time O(log(n)) instead of O(n). This actually makes a difference. The probe count limit works great with Robin Hood hashing and allows some neat optimizations in the inner loop.

The hash table is available under the boost license as both a hash_map and a hash_set version. Enjoy!

]]>

Somebody was nice enough to link my blog post on Hacker News and Reddit. While I didn’t do that, I still read most of the comments on those website. For some reasons the comments I got on my website were much better than the comments on either of those websites. But there seem to be some common misunderstandings underlying the bad comments, so I’ll try to clear them up.

The top comment on Hacker News essentially says “meh, this can’t sort everything and we already knew that radix sort was faster.” Firstly, I don’t understand that negativity. My blog post was essentially “hey everyone, I am very excited to share that I have optimized and generalized radix sort” and your first response is “but you didn’t generalize it all the way.” Why the negativity? I take one step forward and you complain that I didn’t take two steps? Why not be happy that I made radix sort apply to more domains than where it applied before?

So I want to talk about generalizing radix sort even more: The example of something that I don’t handle is sorting a vector of std::sets. Say a vector of sets of ints. The reason why I can’t sort that is that std::set doesn’t have an operator[]. std::sort does not have that problem because std::set provides comparison operators.

There are two possible solutions here:

- Where I currently use operator[], I could use std::next instead. So instead of writing container[index] I could write *std::next(container.begin(), index).
- I could not use an index to indicate the current position, and only use iterators instead. For that I would have to allocate a second buffer to store one iterator per collection to sort.

Both of these approaches have problems. The first one is obviously slow because I have to iterate over the collection from the beginning every time that I want to look up the current element that I want to sort by. Meaning if I need to look at the first n elements to tell two lists apart, I need to walk over those elements n times, resulting in O(n^2) pointer dereferences. The normal set comparison operators don’t have that problem because when they compare two sets, they can iterate over both in parallel. So when they need to look at n elements to tell two lists apart, they can do that in O(n).

I also didn’t want to allocate the extra memory that would be required for the second approach because I didn’t want ska_sort to sometimes require heap allocations, and sometimes not require heap allocations depending on what type it is sorting.

The point is: I could easily generalize radix sort even more so that it can handle this case as well, but it doesn’t seem interesting. Both approaches here have clear problems. I think you should just use std::sort here. So I’ll limit ska_sort to things that can be accessed in constant time.

The other question is why you would want me to handle this. I stopped generalizing when I thought that I could handle all real use cases. Radix sort can be generalized more so that it can sort everything if you want it to. If you really need to sort a vector of std::sets or a vector of std::lists, then I can probably implement the second solution for you. But **the real question isn’t whether ska_sort can sort everything, but whether it can sort your use case. And the answer is almost certainly yes.** If you really have a use case that ska_sort can not sort, then I can understand the criticism. But what do you want to sort that can not be reached in constant time?

That being said one thing that still needs to be done is that I need to allow customization of sorting behavior. Which is also what I wrote in my last blog post. Especially when sorting strings there are good use cases for wanting custom sorting behavior. Like case insensitive sorting. Or number aware sorting so that “foo100” comes after “foo99”. I’ll present an idea for that further down in this blog post. But the work there is not to generalize ska_sort further so that it can sort more data, but instead to give more customization options for the data that it can sort.

Before finishing this section, I actually quickly implemented solution 1 from the approaches for sorting sets above, and the graph is interesting:

ska_sort actually beats std::sort for quite a while there. std::sort is only faster when there are a lot of sets. If I construct the sets such that there is very little overlap between them, ska_sort is actually always faster. Does that mean that I should provide this code? I decided against it for now because it’s not a clear win. I think if I did handle this, I would want to use the allocating solution because I expect a bigger win from that one.

One criticism that I didn’t understand at first was that I am comparing apples to oranges when I’m comparing my ska_sort to std::sort. You can see that same criticism voiced in that top Hacker New comment mentioned above. To me they are both sorting algorithms and who cares how they work internally? If you want to sort things, the only thing you care about is speed, not what the algorithm does internally.

A friend of mine had a good analogy though: **Comparing a better radix sort to std::sort is like writing a faster hash table and saying “look at how much faster this is than std::map.”**

However I contest that **what I did is not equivalent to writing a better hash table, but it is equivalent to writing the first general purpose hash table**. Imagine a parallel world where people have used hash tables for a while, but only ever for special purposes. Say everybody knows that you can only use hash tables if your key is an integer, otherwise you have to use a search tree. And then somebody comes along with the realization that you can store anything in a hash table as long as you provide a custom hash function. In that case it doesn’t make sense to compare this new hash table to older hash tables, because older hash tables simply can’t run most of your benchmarks because they only support integer keys.

Similarly the only thing that I could compare my radix sort to was std::sort, because older radix sort implementations couldn’t run my benchmarks because they could literally only sort ints, floats and strings.

However the above argument doesn’t make sense for me because I made two claims: I claimed that I generalized radix sort, and also that I optimized radix sort. For the second claim I should have provided benchmarks against other radix sort implementations. And also even though something like boost::spreadsort can’t run all my benchmarks, I should have still compared it in the benchmarks that it can run. Meaning for sorting of ints, floats and strings. So yeah, I don’t know what I was thinking there… Sometimes your brain just skips to the wrong conclusions and you never think about it a second time…

So anyways, here is ska_sort compared to boost::spreadsort when sorting uniformly distributed integers:

What we see here is that ska_sort is generally faster than boost::spreadsort. Except for that area between 2048 elements and 16384 elements. The reason for this is mainly that spreadsort picks a different number of partitions than I do. In each recursive step I split the input into 256 partitions. spreadsort uses more. It doesn’t use a fixed amount like I do, so I can’t tell you a simple number, except that it usually picks more than 256.

I had played around with using a different number of partitions in my first blog post about radix sort, but I didn’t have good results that time. That time I found that if I used 2048 partitions, sorting would be faster if the collection had between 1024 and 4096 elements. In other cases using 256 partitions was faster. It’s probably worth trying a variable amount of partitions like spreadsort uses. Maybe I can come up with an algorithm that’s always faster than spreadsort. So there you go, a real benefit just from comparing against another radix sort implementation.

Let’s also look at the graph for sorting uniformly distributed floats:

This graph is interesting for two reasons: 1. spreadsort has much smaller waves than when it sorts ints. 2. All of the algorithms seem to suddenly speed up when there are a lot of elements.

I have no good explanation for the second thing. But it is reproducible so I could do more investigation. My best guess is that this is because uniform_real_distribution just can’t produce this many unique values. (I’m just asking for floats in the range from 0 to 1) So I’m getting more duplicates back there. I tried switching to a exponential_distribution, but the graph looked similar.

The reason for why spreadsort has smaller waves seems to be that spreadsort adjusts its algorithm based on how big the range of inputs is. When sorting ints it could get ints over the entire 32 bit range. When sorting floats it only gets values form zero to one. I need to look more into what spreadsort actually does with that information, but it does compute it and use it to determine how many partitions to use. But there’s no time for looking into that. Instead let’s look at sorting of strings:

This is the “sorting long strings” graph from my last blog post. And oops, that’s embarrassing: spreadsort seems to be better at sorting strings than ska_sort. Which is surprising because I copied parts of my algorithm from spreadsort, so it should work similarly. Stepping through it, there seem to be two main reasons:

- When sorting strings, I have to subdivide the input into 257 partitions: One partition for all the strings that are shorter than my current index, and 256 partitions for all the possible character values. I do that in two passes over the input: First split off all the shorter ones, second run my normal sorting algorithm which splits the remaining data into 256 partitions. spreadsort does this in one pass. There is no reason why I couldn’t do the same thing. Except that it would complicate my algorithm even more because I’d need two slightly different versions of my inner loop. I’ll try to do it when I get to it.
- Spreadsort takes advantage of the fact that strings are always stored in a single chunk of memory. When it tries to find the longest common prefix between all the strings, it uses memcmp which will internally compare several bytes at a time. In my algorithm I have no special treatment for strings: It’s the same algorithm for strings, deques, vectors, arrays or anything else with operator[]. This means I have to compare one element at a time because if you pass in a std::deque, memcmp wouldn’t work. I could solve that by specializing for containers that have a .data() function. I would run into a second problem though: You might be sorting a vector of custom types, in which case memcmp would once again be the wrong thing. It still seems solvable: I just need even more template special cases for when the thing to be sorted is a character pointer in a container that has a data() member function. Doable, but adds more complexity.

So in conclusion spreadsort will stay faster than ska_sort at sorting strings for now. The reason for that is simply that I don’t want to spend the time to implement the same optimizations at the moment.

The top reddit comment talked about something I wrote about the recursion count: It quotes from a part where I make two statements about the recursion count: 1. If I sort a million ints or a thousand ints, I always have to recurse at most four times. 2. If I can tell all the values apart in the first byte, I can stop recursing right there. The comment points out that these two statements apply to different ranges of inputs. Which, yes, they do. The comment makes fun of me for not stating that these apply to different ranges, then it contains some bike shedding about internal variable names and some wrong advice about merging loops that ping-pong between the buffers in ska_sort_copy. (when ping-ponging between two buffers A and B, you can’t start the loop that reads B until the loop that reads A is finished writing to buffer B. Otherwise you read uninitialized data) I really don’t understand why this is the top comment…

But I’ll use this as an excuse to talk in detail about the recursion count and about big O complexity because that was a common topic in the comments. (including in the responses to that comment)

The point where I have to recurse into the second byte is actually more complex than you might think: I fall back to std::sort if a partition has fewer than 128 elements in it. That means that if the inputs are uniformly distributed on the first byte, I can handle up to 127*256 = 32512 values without any recursive calls. The 256 comes from the number of possible values for the first byte, and the 127 comes from the fact that if I create 256 partitions of 127 elements each, I will fall back to std::sort within each of those partitions instead of recursing to a second call of ska_sort.

Now in reality things are not that nicely distributed. Let me insert the graph again about sorting uniformly distributed ints:

The “waves” that you can see on ska_sort happen every time that I have to do one more recursive call. So what we see here is that in that middle wave, from 4096 to 16384 items, is when the pass that looks at the first byte creates more and more partitions that are large enough to require a recursive call. For example let’s say that at 2048 elements I randomly get 80 items with the value 62 in the first byte. Then at 4096 bytes I randomly get 130 elements with the value 62 in the first byte. At 2048 elements I call std::sort directly, at 4096 elements I will do one more recursive call, splitting those 130 elements into another 256 partitions and then call std::sort on each of those.

Then after 16384 what happens is that those partitions are big enough that I can do nice large loops through them, and the algorithm speeds up again. That is until I have to recurse a second time starting at 512k items and I slow down again.

For integers there is a natural limit to these waves: There can be at most four of these waves because there are only four bytes in an int.

That brings us to the discussion about big O complexity. A funny thing to observe in the comments was that the more confident somebody was in claiming that I got my big O complexity wrong, the more likely they were to not understand big O complexity. But I will admit that the big O complexity for radix sort is confusing because it depends on the type of the data that you’re sorting.

To start with I claim that a single pass over the data for me is O(n). This is not obvious from the source code because there are five loops in there, three of which are nested. But after looking at it a bit you find that those loops depends on two things: 1. The number of partitions, 2. the number of elements in the collection. So a first guess for the complexity would be O(n+p) where p is the number of partitions. That number is fixed in my algorithm to 256, so we end up with O(n+256) which is just O(n). But that 256 is the reason why ska_sort slows down when it has lots of small partitions.

Now every time that I can’t separate the elements into partitions of fewer than 128 elements, I have to do a recursive call. So what is the impact of that on the complexity? A simple way of looking at that is to say it’s O(n*b) where b is the number of bytes I need to look at until I can tell all the elements apart. When sorting integers, in the worst case this would be 4, so we end up with O(n*4) which is just O(n). When sorting something with a variable number of bytes, like strings, that b number could be arbitrarily big though. One trick I do to reduce the risk of hitting a really bad case there is that I skip over common prefixes. Still it’s easy to create inputs where b is equal to n. So the algorithm is O(n^2) for sorting strings. But I actually detect that case and fall back to std::sort for the entire range then. So ska_sort is actually O(n log n) for sorting strings.

I like the O(n*b) number better though because the graph doesn’t look like a O(n) graph. (ska_sort_copy however does look like a O(n) graph) The O(n*b) number gives a better understanding of what’s going on. Then we can look at the waves in the graph and can say that at that point b increased by 1. And we can also see that b is not independent of n. (it will become independent of n once n is large enough. Say I’m sorting a trillion ints. But for small numbers b increases together with n)

From this analysis you would think that my algorithm is slowest when all numbers are very close to each other. Say they’re all close to 0. Because then I would have to look at all four bytes until I can tell all the numbers apart. In fact the opposite happens: My algorithm is fastest in these cases. The reason is that the first three passes over the data are very fast in this case because all elements have the same value for the first three bytes. Only the last pass actually has to do anything. (this is the “sorting geometric_distribution ints” graph from my last blog post where ska_sort ends up more than five times faster than std::sort)

Finally when looking at the complexity we have to consider the std::sort fallback. I will only ever call std::sort on partitions of fewer than 128 items. That means that the complexity of the std::sort fallback is not O(n log n) but it’s O(n log 127), which is just O(n). It’s O(n log 127) because a) I call std::sort on every element, so it has to be at least O(n), b) each of those calls to std::sort only sees at most 127 elements, so the recursive calls in quick sort are limited, and those recursive calls are responsible for the log n part of the complexity. If this sounds weird, it’s the same reason that Introsort (which is used in std::sort) is O(n log n) even though it calls insertion sort (an O(n^2) algorithm) on every single element.

Some of the best comments I got were about other good sorting algorithms. And it turns out that other people have generalized radix sort before me.

One of those is BinarSort which was written by William Gilreath who commented on my blog post. BinarSort basically says “everything is made out of bits, so if we sort bits, we can sort everything.” Which is a similar line of thinking that lead to me generalizing radix sort. The big downside with looking at everything as bits is that it leads to a slow sorting algorithm: For an int with 32 bits you have to do up to 31 recursive calls. Running BinarSort through my benchmark for sorting ints looks like this:

The first thing to notice is that BinarSort looks an awful lot as if it’s O(n log n). The reason for that is the same reason that ska_sort doesn’t look like a true O(n) graph: The number of recursive calls is related to the number of elements. BinarSort has to do up to 31 recursive calls. At the point where it reaches that number of recursive calls, you would expect the graph to flatten out. The quick sort which is used in std::sort would continue to grow even then. However it looks like you need to sort a huge number of items to get to that point in the graph. Instead you see an algorithm that keeps on getting slower and slower as it has to do more and more recursive calls, never reaching the point where it would turn linear.

The other big problem with BinarSort is that even though it claims to be general, it only provides source code for sorting ints. It doesn’t provide a method for sorting other data. For example it’s easy to see that you can’t sort floats with it directly, because if you sort floats one bit at a time, positive floats come before negative floats. I now know how to sort floats using BinarSort because I did that work for ska_sort, but if I had read the paper a while ago, I wouldn’t have believed the claim that you can sort everything with it. If you only provide source code for sorting ints, I will believe that you can only sort ints.

A much more promising approach is this paper by Fritz Henglein. I didn’t read all of the paper but it looks like he did something very similar to what I did, except he did it five years ago. According to his graphs, his sorting algorithm is also much faster than older sorting algorithms. So I think he did great work, but for some reason nobody has heard about it. The lesson that I would take from that is that if you’re doing work to improve performance of algorithms, don’t do it in Haskell. The problem is that sorting is slow in Haskell to begin with. So he is comparing his algorithm against slow sorting algorithms and it’s easy to beat slow sorting algorithms. I think that his algorithm would be faster than std::sort if he wrote it in C++, but it’s hard to tell.

A great thing that happened in the comments was that Morwenn adapted his sorting algorithm Vergesort to run on top of ska_sort. The result is an algorithm that performs very well on pre-sorted data while still being fast in random data.

This is the graph that he posted. ska_sort is just ska_sort by itself, verge_sort is a combination of verge sort and ska_sort. Best of all, he posted a comment explaining how he did it.

So that is absolutely fantastic. I’ll definitely attempt to bring his changes into the main algorithm. There might even be a way to merge his loop over the data with my first loop over the data, so that I don’t even have to do an extra pass.

This brings me to future work:

Custom sorting behavior is my next task. I don’t have a full solution yet, but I have something that can handle case-insensitive sorting of ASCII characters and it can do number-aware sorting. The hope is that something like Unicode sorting could be done with the same approach. The idea is that I expose a customization point where you can change how I use iterators in collections. You can change the return value from the iterator, and you can change how far the iterator will advance. So for case insensitive sorting you could simply return ‘a’ from the iterator when the actual value was ‘A’.

The tricky part is number aware sorting. My current idea is that you could return an int instead of a char, and then advance the iterator several positions. You would have to be a bit tricky with the int that you return because you would want to return either a character or an int. I could add support for std::variant (should probably do that anyway) but we can also just say that for characters, we just cast it to an int, and for numbers we return the lowest int plus the number. So for “foo100” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+100. And for “foo99” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+99. Then you would advance the iterator one position for the first three characters, and three positions for the number 100 and two positions for the number 99. One tricky part on this is that you have to always move the iterator forward by the same distance when elements have the same value. Meaning if two different strings have the value INT_MIN+100 for the current index, they both have to advance their iterators by three elements. Can’t have one of them advancing it by four elements. I need that assumption so that for recursive calls, I only need to advance a single index. So I won’t actually store the iterators that you return, but only a single index that I can use for all values that fell into the same partition. I think it’s a promising idea. It might also work for sorting Unicode, but that is such a complicated topic that I have no idea if this will work or not. I think the only way to find out is to start working on this and see if I run into problems.

The other task that I want to do is to merge my algorithm with verge sort so that I can also be fast for pre-sorted ranges.

The big problem that I have right now is that I actually want to take a break from this. I don’t want to work on this sorting algorithm for a while. I was actually already kinda burned out on this before I even wrote that first blog post. At that point I had spent a month of my free time on this and I was very happy to finally be done with this when I hit “Publish” on that blog post. So sorry, but the current state is what it’s going to stay at. I’m doing this in my spare time, and right now I’ve got other things that I would like to do with my spare time. (Dark Souls III is pretty great, guys)

That being said I do intend to use this algorithm at work, and I do expect some small improvements to come out of that. These things always get improved as soon as you actually start using them. Also I’ll probably get back to this at some point this year.

Until then you should give ska_sort a try. This can literally make your sorting two times faster. Also if you have data that this can’t sort, I am very curious to hear about that. Here’s a link to the source code and you can find instructions on how to use it in my last blog post.

]]>

The easiest way to quickly generate truly random numbers is to use a std::random_device to seed a std::mt19937_64. That way we pay a one-time cost of using random device to generate a seed, and then have quick random numbers after that. Except that the standard doesn’t provide a way to do that. In fact it’s more dangerous than that: It provides an easy wrong way to do it (use a std::random_device to generate a single int and use that single int as the seed) and it provides a slow, slightly wrong way to do it. (use a std::random_device to fill a std::seed_seq and use that as the seed) There’s a proposal to fix this, (that link also contains reasons for why the existing methods are wrong) but I’ve actually been using a tiny class for this:

struct random_seed_seq { template<typename It> void generate(It begin, It end) { for (; begin != end; ++begin) { *begin = device(); } } static random_seed_seq & get_instance() { static thread_local random_seed_seq result; return result; } private: std::random_device device; };

(the license for the code in this blog post is the Unlicense)

This class has the same generate() function that std::seed_seq has and can be used to initialize a std::mt19937_64. The static get_instance() function is a small convenience to make initialization easier so that you can write this:

std::mt19937_64 random_source{random_seed_seq::get_instance()};

Without the get_instance() function this would have to be a two-liner.

Finally a lot of code doesn’t care where their random numbers come from. Sometimes you just want a random float in the range from zero to one and you don’t want to have to set up a random engine and a random distribution. In that case you can write something like this:

float random_float_0_1() { static thread_local std::mt19937_64 randomness(random_seed_seq::get_instance()); static thread_local std::uniform_real_distribution<float> distribution; return distribution(randomness); }

And just like that we have easy, fast, high quality floating point numbers. Well we do if your compiler is GCC. On my machine this last function is slightly faster than the old-school “rand() * (1.0f / RAND_MAX)”. This function takes 11ns, and the old-school method takes 14 nanoseconds. (measured with Google Benchmark) I attribute most of that to the Mersenne Twister being a very fast random number generator.

When I compiled it with Clang however this new function takes 80ns. Stepping through the assembly generated by both compilers reveals that the problem is that Clang doesn’t inline aggressively enough. There are some calls to compute the logarithm of the upper bound and lower bound in the uniform_real_distribution. GCC inlines those expensive calls away, Clang does not.

Not sure what to do about that last problem: The problem is with how std::uniform_real_distribution is defined: It takes the upper bound and lower bound as runtime arguments. In my code listing above they are the default arguments of 0 and 1, but since Clang doesn’t inline the call, it doesn’t know that they are constants. The only way I see around that is to re-implement std::uniform_real_distribution with constants. But that’s beyond the scope of this blog post.

This blog post was only supposed to be about the random_seed_seq. The other code are just examples showing how you could use it. So let’s not worry about the details of std::uniform_real_distribution, and end this by saying that you should probably use random_seed_seq to seed your random number generators.

It’s a tiny class that I find myself needing all the time. Hopefully it will also be useful for you.

]]>

Why is that an unfortunate claim? Because I’ll probably have a hard time convincing you that I did speed up sorting by a factor of two. But this should turn out to be quite a lengthy blog post, and all the code is open source for you to try out on whatever your domain is. So I might either convince you with lots of arguments and measurements, or you can just try the algorithm yourself.

Following up from my last blog post, this is of course a version of radix sort. Meaning its complexity is lower than O(n log n). I made two contributions:

- I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for a relatively small collections. (starting at 128 elements)
- I generalized in-place radix sort to work on arbitrary sized ints, floats, tuples, structs, vectors, arrays, strings etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

If you just want to try the algorithm, jump ahead to the section “Source Code and Usage.”

To start off with, I will explain how you can build a sorting algorithm that’s O(n). If you have read my last blog post, you can skip this section. If you haven’t, read on:

If you are like me a month ago, you knew for sure that it’s proven that the fastest possible sorting algorithm has to be O(n log n). There are mathematical proofs that you can’t make anything faster. I believed that until I watched this lecture from the “Introduction to Algorithms” class on MIT Open Course Ware. There the professor explains that sorting has to be O(n log n) when all you can do is compare items. But if you’re allowed to do more operations than just comparisons, you can make sorting algorithms faster.

I’ll show an example using the counting sort algorithm:

template<typename It, typename OutIt, typename ExtractKey> void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key) { size_t counts[256] = {}; for (It it = begin; it != end; ++it) { ++counts[extract_key(*it)]; } size_t total = 0; for (size_t & count : counts) { size_t old_count = count; count = total; total += old_count; } for (; begin != end; ++begin) { std::uint8_t key = extract_key(*begin); out_begin[counts[key]++] = std::move(*begin); } }

This version of the algorithm can only sort unsigned chars. Or rather it can only sort types that can provide a sort key that’s an unsigned char. Otherwise we would index out of range in the first loop. Let me explain how the algorithm works:

We have three arrays and three loops. We have the input array, the output array, and a counting array. In the first loop we fill the counting array by iterating over the input array, counting how often each element shows up.

The second loop turns the counting array into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |

prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up five times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the key to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the type that you provided, not the key returned from that function. So this can be any custom struct. For example you could sort strings by length. Just use the size() function in extract_key, and clamp the length to at most 255. You could write a modified version of counting_sort that tells you where the position of the last partition is, so that you can then sort all long strings using std::sort. (which should be a small subset of all your strings so that the second pass on those strings should be fast)

The above algorithm stores the sorted elements in a separate array. But it doesn’t take much to get an in-place sorting algorithm for unsigned chars: One thing we could try is that instead of moving the elements, we swap them.

The most obvious problem that we run into with that is that when we swap the first element out of the first spot, the new element probably doesn’t want to be in the first spot. It might want to be at position 10 instead. The solution for that is simple: Keep on swapping the first element until we find an element that actually wants to be in the first spot. Only when that has happened do we move on to the second item in the array.

The second problem that we then run into is that we’ll find a lot of partitions that are already sorted. We may not know however that those are already sorted. Imagine if we have the number 3 two times and it wants to be in positions six and seven. And lets say that as part of swapping the first element into place, we swap the first 3 to slot six, and the second 3 to slot seven. Now these are sorted and we don’t need to do anything with them any more. But when we advance on from the first element, we will at some point come across the 3 in slot six. And we’ll swap it to spot eight, because that’s the next spot that a 3 would go to. Then we find the next 3 and swap it to spot nine. Then we find the first 3 again and swap it to spot ten etc. This keeps going until we index out of bounds and crash.

The solution for the second problem is to keep a copy of the initial prefix array around so that we can tell when a partition is finished. Then we can skip over those partitions when advancing through the array.

With those two changes we have an in-place sorting algorithm that sorts unsigned chars. This is the American Flag Sort algorithm as described on Wikipedia.

Radix sort takes the above algorithm, and generalizes it to integers that don’t fit into a single unsigned char. The in-place version actually uses a fairly simple trick: Sort one byte at a time. First sort on the highest byte. That will split the input into 256 partitions. Now recursively sort within each of those partitions using the next byte. Keep doing that until you run out of bytes.

If you do the math on that you will find that for a four byte integer you get 256^3 recursive calls: We subdivide into 256 partitions then recurse, subdivide each of those into 256 partitions and recurse again and then subdivide each of the smaller partitions into 256 partitions again and recurse a final time. If we actually did all of those recursions this would be a very slow algorithm. The way to get around that problem is to stop recursing when the number of items in a partition is less than some magic number, and to use std::sort within that sub-partition instead. In my case I stop recursing when a partition is less than 128 elements in size. When I have split an array into partitions that have less than that many elements, I call std::sort within these partitions.

If you’re curious: The reason why the threshold is at 128 is that I’m splitting the input into 256 partitions. If the number of partitions is k, then the complexity of sorting on a single byte is O(n+k). The point where radix sort gets faster than std::sort is when the loop that depends on n starts to dominate over the loop that depends on k. In my implementation that’s somewhere around 0.5k. It’s not easy to move it much lower than that. (I have some ideas, but nothing has worked yet)

It should be clear that the algorithm described in the last section works for unsigned integers of any size. But it also works for collections of unsigned integers, (including pairs and tuples) and strings. Just sort by the first element, then by the next, then by the next etc. until the partition sizes are small enough. (as a matter of fact the paper that Wikipedia names as the source for its American Flag Sort article intended the algorithm as a sorting algorithm for strings)

But it’s straightforward to generalize this to work on signed integers: Just shift all the values up into the range of the unsigned integer of the same size. Meaning for an int16_t, just cast to uint16_t and add 32768.

Michael Herf has also discovered a good way to generalize this to floating point numbers: Reinterpret cast the float to a uint32, then flip every bit if the float was negative, but flip only the sign bit if the float was positive. The same trick works for doubles and uint64s. Michael Herf explains why this works in the linked piece, but the short version of it is this: Positive floating point numbers already sort correctly if we just reinterpret cast them to a uint32. The exponent comes before the mantissa, so we would sort by the exponent first, then by the mantissa. Everything works out. Negative floating point numbers however would sort the wrong way. Flipping all the bits on them fixes that. The final remaining problem is that positive floating point numbers need to sort as bigger than negative numbers, and the easiest way to do that is to flip the sign bit since it’s the most significant bit.

Of the fundamental types that leaves only booleans and the various char types. Chars can just be reinterpret_casted to the unsigned types of the same size. Booleans could also be turned into a unsigned char, but we can also use a custom, more efficient algorithm for booleans: Just use std::partition instead of the normal sorting algorithm. And if we need to recurse because we’re sorting on more than one key, we can recurse into each of the partitions.

And just like that we have generalized in-place radix sort to all types. Now all it takes is a bunch of template magic to make the code do the right thing for each case. I’ll spare you the details of that. It wasn’t fun.

The brief recap of the sorting algorithm for sorting one byte is:

- Count elements and build the prefix sum that tells us where to put the elements
- Swap the first element into place until we find an item that wants to be in the first position (according to the prefix sum)
- Repeat step 2 for all positions

I have implemented this sorting algorithm using Timo Bingmann’s Sound of Sorting. Here is a what it looks (and sounds) like:

As you can see from the video, the algorithm spends most of its time on the first couple elements. Sometimes the array is mostly sorted by the time that the algorithm advances forward from the first item. What you can’t see in the video is the prefix sum array that’s built on the side. Visualizing that would make the algorithm more understandable, (it would make clear how the algorithm can know the final position of elements to swap them directly there) but I haven’t done the work of visualizing that.

If we want to sort multiple bytes we recurse into each of the 256 partitions and do a sort within those using the next byte. But that’s not the slow part of this. The slow part is step 2 and step 3.

If you profile this you will find that this is spending all of its time on the swapping. At first I thought that that was because of cache misses. Usually when the line of assembly that’s taking a lot of time is dereferencing a pointer, that’s a cache miss. I’ll explain what the real problem was further down, but even though my intuition was wrong it drove me towards a good speed up: If we have a cache miss on the first element, why not try swapping the second element into place while waiting for the cache miss on the first one?

I already have to keep information about which elements are done swapping, so I can skip over those. So what I do is that I Iterate over all elements that have not yet been swapped into place, and I swap them into place. In one pass over the array, this will swap at least half of all elements into place. To see why, let’s think how this works in this list: { 4, 3, 1, 2 }: We look at the first element, the 4, and swap it with the 2 at the end, giving us this list: { 2, 3, 1, 4 }, then we look at the second element, the 3, and swap it with the 1, giving us this list: { 2, 1, 3, 4 } then we have iterated half-way through the list and find that all the remaining elements are sorted, (we do this by checking that the offset stored in the prefix sum array is the same as the initial offset of the next partition) so we’re done, but our list is not sorted. The solution for that is to say that when we get to the end of the list, we just start over from the beginning, swapping all unsorted elements into place. In that case we only need to swap the 2 into place to get { 1, 2, 3, 4 } at which point we know that all partitions are sorted and we can stop.

In Sound of Sorting that looks like this:

This is what the above algorithm looks like in code:

struct PartitionInfo { PartitionInfo() : count(0) { } union { size_t count; size_t offset; }; size_t next_offset; }; template<typename It, typename ExtractKey> void ska_byte_sort(It begin, It end, ExtractKey & extract_key) { PartitionInfo partitions[256]; for (It it = begin; it != end; ++it) { ++partitions[extract_key(*it)].count; } uint8_t remaining_partitions[256]; size_t total = 0; int num_partitions = 0; for (int i = 0; i < 256; ++i) { size_t count = partitions[i].count; if (count) { partitions[i].offset = total; total += count; remaining_partitions[num_partitions] = i; ++num_partitions; } partitions[i].next_offset = total; } for (uint8_t * last_remaining = remaining_partitions + num_partitions, * end_partition = remaining_partitions + 1; last_remaining > end_partition;) { last_remaining = custom_std_partition(remaining_partitions, last_remaining, [&](uint8_t partition) { size_t & begin_offset = partitions[partition].offset; size_t & end_offset = partitions[partition].next_offset; if (begin_offset == end_offset) return false; unroll_loop_four_times(begin + begin_offset, end_offset - begin_offset, [partitions = partitions, begin, &extract_key, sort_data](It it) { uint8_t this_partition = extract_key(*it); size_t offset = partitions[this_partition].offset++; std::iter_swap(it, begin + offset); }); return begin_offset != end_offset; }); } }

The algorithm starts off similar to counting sort above: I count how many items fall into each partition. But I changed the second loop: In the second loop I build an array of indices into all the partitions that have at least one element in them. I need this because I need some way to keep track of all the partitions that have not been finished yet. Also I store the end index for each partition in the next_offset variable. That will allow me to check whether a partition is finished sorting.

The third loop is much more complicated than counting sort. It’s three nested loops, and only the outermost is a normal for loop:

The outer loop iterates over all of the remaining unsorted partitions. It stops when there is only one unsorted partition remaining. That last partition does not need to be sorted if all other partitions are already sorted. This is an important optimization because the case where all elements fall into only one partition is quite common: When sorting four byte integers, if all integers are small, then in the first call to this function, which sorts on the highest byte, all of the keys will have the same value and will fall into one partition. In that case this algorithm will immediately recurse to the next byte.

The middle loop uses std::partition to remove finished partitions from the list of remaining partitions. I use a custom version of std::partition because std::partition will unroll its internal loop, and I do not want that. I need the innermost loop to be unrolled instead. But the behavior of custom_std_partition is identical to that of std::partition. What this loop means is that if the items fall into partitions of different sizes, say for the input sequence { 3, 3, 3, 3, 2, 5, 1, 4, 5, 5, 3, 3, 5, 3, 3 } where the partitions for 3 and 5 are larger than the other partitions, this will very quickly finish the partitions for 1, 2 and 4, and then after that the outer loop and inner loop only have to iterate over the partitions for 3 and 5. You might think that I could use std::remove_if here instead of std::partition, but I need this to be non-destructive, because I will need the same list of partitions when making recursive calls. (not shown in this code listing)

The innermost loop finally swaps elements. It just iterates over all remaining unsorted elements in a partition and swaps them into their final position. This would be a normal for loop, except I need this loop unrolled to get fast speeds. So I wrote a function called “unroll_loop_four_times” that takes an iterator and a loop count and then unrolls the loop.

This new algorithm was immediately much faster than American Flag Sort. Which made sense because I thought I had tricked the cache misses. But as soon as I profiled this I noticed that this new sorting algorithm actually had slightly more cache misses. It also had more branch mispredictions. It also executed more instructions. But somehow it took less time. This was quite puzzling so I profiled it whichever way I could. For example I ran it in Valgrind to see what Valgrind thought should be happening. In Valgrind this new algorithm was actually slower than American Flag Sort. That makes sense: Valgrind is just a simulator, so something that executes more instructions, has slightly more cache misses and slightly more branch mispredictions would be slower. But why would it be faster running on real hardware?

It took me more than a day of staring at profiling numbers before I realized why this was faster: It has better instruction level parallelism. You couldn’t have invented this algorithm on old computers because it would have been slower on old computers. The big problem with American Flag Sort is that it has to wait for the current swap to finish before it can start on the next swap. It doesn’t matter that there is no cache-miss: Modern CPUs could execute several swaps at once if only they didn’t have to wait for the previous one to finish. Unrolling the inner loop also helps to ensure this. Modern CPUs are amazing, so they could actually run several loops in parallel even without loop unrolling, but the loop unrolling helps.

The Linux perf command has a metric called “instructions per cycle” which measures instruction level parallelism. In American Flag Sort my CPU achieves 1.61 instructions per cycle. In this new sorting algorithm it achieves 2.24 instructions per cycle. It doesn’t matter if you have to do a few instructions more, if you can do 40% more at a time.

And the thing about cache misses and branch mispredictions turned out to be a red herring: The numbers for those are actually very low for both algorithms. So the slight increase that I saw was a slight increase to a low number. Since there are only 256 possible insertion points, chances are that a good portion of them are always going to be in the cache. And for many real world inputs the number of possible insertion points will actually be much lower. For example when sorting strings, you usually get less than thirty because we simply don’t use that many different characters.

All that being said, for small collections American Flag Sort is faster. The instruction level parallelism really seems to kick in at collections of more than a thousand elements. So my final sort algorithm actually looks at the number of elements in the collection, and if it’s less than 128 I call std::sort, if it’s less than 1024 I call American Flag Sort, and if it’s more than that I run my new sorting algorithm.

std::sort is actually a similar combination, mixing quick sort, insertion sort and heap sort, so in a sense those are also part of my algorithm. If I tried hard enough, I could construct an input sequence that actually uses all of these sorting algorithms. That input sequence would be my very worst case: I would have to trigger the worst case behavior of radix sort so that my algorithm falls back to std::sort, and then I would also have to trigger the worst case behavior of quick sort so that std::sort falls back to heap sort. So let’s talk about worst cases and best cases.

The best case for my implementation of radix sort is if the inputs fit in few partitions. For example if I have a thousand items and they all fall into only three partitions, (say I just have the number 1 a hundred times, the number 2 four hundred times, and the number 3 five hundred times) then my outer loops do very little and my inner loop can swap everything into place in nice long uninterrupted runs.

My other best case is on already sorted sequences: In that case I iterate over the data exactly twice, once to look at each item, and once to swap each item with itself.

The worst case for my implementation can only be reached when sorting variable sized data, like strings. For fixed size keys like integers or floats, I don’t think there is a really bad case for my algorithm. One way to construct the worst case is to sort the strings “a”, “ab”, “abc”, “abcd”, “abcde”, “abcdef” etc. Since radix sort looks at one byte at a time, and that byte only allows it to split off one item, this would take O(n^2) time. My implementation detects this by recording how many recursive calls there were. If there are too many, I fall back to std::sort. Depending on your implementation of quick sort, this could also be the worst case for quick sort, in which case std::sort falls back to heap sort. I debugged this briefly and it seemed like std::sort did not fall back to heap sort for my test case. The reason for that is that my test case was sorted data and std::sort seems to use the median-of-three rule for pivot selection, which selects a good pivot on already sorted sequences. Knowing that, it’s probably possible to create sequences that hit the worst case both for my algorithm and for the quick sort used in std::sort, in which case the algorithm would fall back to heap sort. But I haven’t attempted to construct such a sequence.

I don’t know how common this case is in the real world, but one trick I took from the boost implementation of radix sort is that I skip over common prefixes. So if you’re sorting log messages and you have a lot of messages that start with “warning:” or “error:” then my implementation of radix sort would first sort those into separate partitions, and then within each of those partitions it would skip over the common prefix and continue sorting at the first differing character. That behavior should help reduce how often we hit the worst case.

Currently I fall back to std::sort if my code has to recurse more than sixteen times. I picked that number because that was the first power of two for which the worst case detection did not trigger when sorting some log files on my computer.

The sorting algorithm that I provide as a library is called “Ska Sort”. Because I’m not going to come up with new algorithms very often in my lifetime, so might as well put my name on one when I do. The improved algorithm for sorting bytes that I described above in the sections “Optimizing the Inner Loop” and “Implementation Details” is only a small part of that. That algorithm is called “Ska Byte Sort”.

In summary, Ska Sort:

- Is an in-place radix sort algorithm
- Sorts one byte at a time (into 256 partitions)
- Falls back to std::sort if a collection contains less than some threshold of items (currently 128)
- Uses the inner loop of American Flag Sort if a collection contains less than a larger threshold of items (currently 1024)
- Uses Ska Byte Sort if the collection is larger than that
- Calls itself recursively on each of the 256 partitions using the next byte as the sort key
- Falls back to std::sort if it recurses too many times (currently 16 times)
- Uses std::partition to sort booleans
- Automatically converts signed integers, floats and char types to the correct unsigned integer type
- Automatically deals with pairs, tuples, strings, vectors and arrays by sorting one element at a time
- Skips over common prefixes of collections. (for example when sorting strings)
- Provides two customization points to extract the sort key from an object: A function object that can be passed to the algorithm, or a function called to_radix_sort_key() that can be placed in the namespace of your type

So Ska Sort is a complicated algorithm. Certainly more complicated than a simple quick sort. One of the reasons for this is that in Ska Sort, I have a lot more information about the types that I’m sorting. In comparison based sorting algorithms all I have is a comparison function that returns a bool. In Ska Sort I can know that “for this collection, I first have to sort on a boolean, then on a float” and I can write custom code for both of those cases. In fact I often need custom code: The code that sorts tuples has to be different from the code that sorts strings. Sure, they have the same inner loop, but they both need to do different work to get to that inner loop. In comparison based sorting you get the same code for all types.

If you’ve got enough time on your hands that you clicked on the pieces I linked above, you will find that there are two optimizations that are considered important in my sources that I didn’t do.

The first is that the piece that talks about sorting floating point numbers sorts 11 bits at a time, instead of one byte at a time. Meaning it subdivides the range into 2048 partitions instead of 256 partitions. The benefit of this is that you can sort a four byte integer (or a four byte float) in three passes instead of four passes. I tried this in my last blog post and found it to only be faster for a few cases. In most cases it was slower than sorting one byte at a time. It’s probably worth trying that trick again for in-place radix sort, but I didn’t do that.

The second is that the American Flag Sort paper talks about managing recursions manually. Instead of making recursive calls, they keep a stack of all the partitions that still need to be sorted. Then they loop until that stack is empty. I didn’t attempt this optimization because my code is already far too complex. This optimization is easier to do when you only have to sort strings because you always use the same function to extract the current byte. But if you can sort ints, floats, tuples, vectors, strings and more, this is complicated.

Finally we get to how fast this algorithm actually is. Since my last blog post I’ve changed how I calculate these numbers. In my last blog post I actually made a big mistake: I measured how long it takes to set up my test data and to then sort it. The problem with that is that the set up can actually be a significant portion of the time. So this time I also measure the set up separately and subtract that time from the measurements so that I’m left with only the time it takes to actually sort the data. With that let’s get to our first measurement: Sorting integers: (generated using std::uniform_int_distribution)

This graph shows how long it takes to sort various numbers of items. I didn’t mention ska_sort_copy before, but it’s essentially the algorithm from my last blog post, except that I changed it so that it falls back to ska_sort instead of falling back to std::sort. (ska_sort may still decide to fall back to std::sort of course)

One problem I have with this graph that even though I made the scale logarithmic, it’s still very difficult to see what’s going on. Last time I added another line at the bottom that showed the relative scale, but this time I have a better approach. Instead of a logarithmic scale, I can divide the total time by the number of items, so that I get the time that the sort algorithm spends per item:

With this visualization, we can see much more clearly what’s going on. All pictures below use “nanoseconds per item” as scale, like in this graph. Let’s analyze this graph a little:

For the first couple items we see that the lines are essentially the same. That’s because for less than 128 elements, I fall back to std::sort. So you would expect all of the lines to be exactly the same. Any difference in that area is measurement noise.

Then past that we see that std::sort is exactly a O(n log n) sorting algorithm. It goes up linearly when we divide the time by the number of items, which is exactly what you’d expect for O(n log n). It’s actually impressive how it forms an exactly straight line once we’re past a small number of items. ska_sort_copy is truly an O(n) sorting algorithm: The cost per item stays mostly constant as the total number of items increases. But ska_sort is… more complicated.

Those waves that we’re seeing in the ska_sort line have to do with the number of recursive calls: ska_sort is fastest when the number of items is large. That’s why the line starts off as decreasing. But then at some point we have to recurse into a bunch of partitions that are just over 128 items in size, which is slow. Then those partitions grow as the number of items increase and the algorithm is faster again, until we get to a point where the partitions are over 128 elements in size again, and we need to add another recursive step. One way to visualize this is to look at the graph of sorting a collection of int8_t:

As you can see the cost per item goes down dramatically at the beginning. Every time that the algorithm has to recurse into other partitions, we see that initial part of the curve overlaid, giving us the waves of the graph for sorting ints.

One point I made above is that ska_sort is fastest when there are few partitions to sort elements into. So let’s see what happens when we use a std::geometric_distribution instead of a std::uniform_int_distribution:

This graph is sorting four byte ints again, so you would expect to see the same “waves” that we saw in the uniformly distributed ints. I’m using a std::geometric_distribution with 0.001 as the constructor argument. Which means it generates numbers from 0 to roughly 18000, but most numbers will be close to zero. (in theory it can generate numbers that are much bigger, but 18882 is the biggest number I measured when generating the above data) And since most numbers are close to zero, we will see few recursions and because of that we see few waves, making this many times faster than std::sort.

Btw that bump at the beginning is surprising to me. For all other data that I could find, ska_sort starts to beat std::sort at 128 items. Here it seems like ska_sort only starts to win later. I don’t know why that is. I might investigate it at a different point, but I don’t want to change the threshold because this is a good number for all other data. Changing the threshold would move all other lines up by a little. Also since we’re sorting few items there, the difference in absolute terms is not that big: 15.8 microseconds to 16.7 microseconds for 128 items, and 32.3 microseconds to 32.9 microseconds for 256 items.

Let’s look at some more use cases. Here is my “real world” use case that I talked about in the last blog post, where I had to sort enemies in a game by distance to the player. But I wanted all enemies that are currently in combat to come first, sorted by distance, followed by all enemies that are not in combat, also sorted by distance. So I sort by a std::pair:

This turned out to be the same graph as sorting ints, except every line is shifted up by a bit. Which I guess I should have expected. But it’s good to see that the conversion trick that I have to do for floats and the splitting I have to do for pairs does not add significant overhead. A more interesting graph is the one for sorting int64s:

This is the point where ska_sort_copy is sometimes slower than ska_sort. I actually decided to lower the threshold where ska_sort_copy falls back to ska_sort: It will now only do the copying radix sort when it has to do less than eight iterations over the input data. Meaning I have changed the code, so that for int64s ska_sort_copy actually just calls ska_sort. Based on the above graph you might argue that it should still do the copying radix sort, but here is a measurement of sorting an 128 byte struct that has an int64 as a sort key:

As the structs get larger, ska_sort_copy gets slower. Because of this I decided to make ska_sort_copy fall back to ska_sort for sort keys of this size.

One other thing to notice from the above graph is that it looks like std::sort and ska_sort get closer. So does ska_sort ever become slower? It doesn’t look like it. Here’s what it looks like when I sort a 1024 byte struct:

Once again this is a very interesting graph. I wish I could spend time on investigating where that large gap at the end comes from. It’s not measurement noise. It’s reproducible. The way I build these graphs is that I run Google Benchmark thirty times to reduce the chance of random variation.

Talking about large data, in my last blog post my worst case was sorting a struct that has a 256 byte sort key. Which in this case means using a std::array as a sort key. This was very slow on copying radix sort because we actually have to do 256 passes over the data. In-place radix sort only has to look at enough bytes until it’s able to tell two pieces of data apart, so it might be faster. And looking at benchmarks, it seems like it is:

ska_sort_copy will fall back to ska_sort for this input, so its graph will look identical. So I fixed the worst case from my last blog post. One thing that I couldn’t profile in my last blog post was sorting of strings, because ska_sort_copy simply can not sort strings because it can not sort variable sized data.

So let’s look at what happens when I’m sorting strings:

The way I build the input data here is that I take between one and three random words from my words file and concatenate them. Once again I am very happy to see how well my algorithm does. But this was to be expected: It was already known that radix sort is great for sorting strings.

But sorting strings is also when I can hit my worst case. In theory you might get cases where you have to do many passes over the data, because there simply are a lot of bytes in the input data and a lot of them are similar. So I tried what happens when I sort strings of different length, concatenating between zero and ten words from my words file:

What we see here is that ska_sort seems to become a O(n log n) algorithm when sorting millions of long strings. However it doesn’t get slower than std::sort. My best guess for the curve going up like that is that ska_sort has to do a lot of recursions on this data. It doesn’t do enough recursions to trigger my worst case detection, but those recursions are still expensive because they require one more pass over the data.

One thing I tried was lowering my recursion limit to eight, in which case I do hit my worst case detection starting at a million items. But the graph looks essentially unchanged in that case. The reason is that it’s a false positive: I didn’t actually hit my worst case. The sorting algorithm still succeeded at splitting the data into many smaller partitions, so when I fall back to std::sort, it has a much easier time than it would have had sorting the whole range.

Finally, here is what it looks like when I sort containers that are slightly more complicated than strings:

For this I generate vectors with between 0 and 20 ints in them. So I’m sorting a vector of vectors. That spike at the end is very interesting. My detection for too many recursive calls does not trigger here, so I’m not sure why sorting gets so much more expensive. Maybe my CPU just doesn’t like dealing with this much data. But I’m happy to report that ska_sort is faster than std::sort throughout, like in all other graphs.

Since ska_sort seems to always be faster, I also generated input data that intentionally triggers the worst case for ska_sort. The below graph hits the worst case immediately starting at 128 elements. But ska_sort detects that and falls back to std::sort:

For this I’m sorting random combinations of the vectors {}, { 0 }, { 0, 1 }, { 0, 1, 2 }, … { 0, 1, 2, … , 126, 127 }. Since each element only tells my algorithm how to split off 1/128th of the input data, it would have to recurse 128 times. But at the sixteenth recursion ska_sort gives up and falls back to std::sort. In the above graph you see how much overhead that is. The overhead is bigger than I like, especially for large collections, but for smaller collections it seems to be very low. I’m not happy that this overhead exists, but I’m happy that ska_sort detects the worst case and at least doesn’t go O(n^2).

Ska_sort isn’t perfect and it has problems. I do believe that it will be faster than std::sort for nearly all data, and it should almost always be preferred over std::sort.

The biggest problem it has is the complexity of the code. Especially the template magic to recursively sort on consecutive bytes. So for example currently when sorting on a std::pair<int, int> this will instantiate the sorting algorithm eight times, because there will be eight different functions for extracting a byte out of this data. I can think of ways to reduce that number, but they might be associated with runtime overhead. This needs more investigation, but the complexity of the code is also making these kinds of changes difficult. For now you can get slow compile times with this if your sort key is complex. The easiest way to get around that is to try to use a simpler sort key.

Another problem is that I’m not sure what to do for data that I can’t sort. For example this algorithm can not sort a vector of std::sets. The reason is that std::set does not have random access operators, and I need random access when sorting on one element at a time. I could write code that allows me to sort std::sets by using std::advance on iterators, but it might be slow. Alternatively I could also fall back to std::sort. Right now I do neither: I simply give a compiler error. The reason for that is that I provide a customization point, a function called to_radix_sort_key(), that allows you to write custom code to turn your structs into sortable data. If I did an automatic fallback whenever I can’t sort something, using that customization point would be more annoying: Right now you get an error message when you need to provide it, and when you have provided it, the error goes away. If I fall back to std::sort for data that I can’t sort, your only feedback for would be that sorting is slightly slower. You would have to either profile this and compare it to std::sort, or you would have to step through the sorting function to be sure that it actually uses your implementation of to_radix_sort_key(). So for now I decided on giving an error message when I can’t sort a type. And then you can decide whether you want to implement to_radix_sort_key() or whether you want to use std::sort.

Another problem is that right now there can only be one sorting behavior per type. You have to provide me with a sort key, and if you provide me with an integer, I will sort your data in increasing order. If you wanted it in decreasing order, there is currently no easy interface to do that. For integers you could solve this by flipping the sign in your key function, so this might not be too bad. But it gets more difficult for strings: If you provide me a string then I will sort the string, case sensitive, in increasing order. There is currently no way to do a case-insensitive sort for strings. (or maybe you want number aware sorting so that “bar100” comes after “bar99”, also can’t do that right now) I think this is a solvable problem, I just haven’t done the work yet. Since the interface of this sorting algorithm works differently from existing sorting algorithms, I have to invent new customization points.

I have uploaded the code for this to github. It’s licensed under the boost license.

The interface works slightly differently from other sorting algorithms. Instead of providing a comparison function, you provide a function which returns the sort key that the sorting algorithm uses to sort your data. For example let’s say you have a vector of enemies, and you want to sort them by distance to the player. But you want all enemies that are in combat with the player to come first, sorted by distance, and then all enemies that are not in combat, also sorted by distance. The way to do that in a classic sorting algorithm would be like this:

std::sort(enemies.begin(), enemies.end(), [](const Enemy & lhs, const Enemy & rhs) { return std::make_tuple(!is_in_combat(lhs), distance_to_player(lhs)) < std::make_tuple(!is_in_combat(rhs), distance_to_player(rhs)); });

In ska_sort, you would do this instead:

ska_sort(enemies.begin(), enemies.end(), [](const Enemey & enemy) { return std::make_tuple(!is_in_combat(enemy), distance_to_player(enemy)); });

As you can see the transformation is fairly straightforward. Similarly let’s say you have a bunch of people and you want to sort them by last name, then first name. You could do this:

ska_sort(contacts.begin(), contacts.end(), [](const Contact & c) { return std::tie(c.last_name, c.first_name); });

It is important that I use std::tie here, because presumably last_name and first_name are strings, and you don’t want to copy those. std::tie will capture them by reference.

Oh and of course if you just have a vector of simple types, you can just sort them directly:

ska_sort(durations.begin(), durations.end());

In this I assume that “durations” is a vector of doubles, and you might want to sort them to find the median, 90th percentile, 99th percentile etc. Since ska_sort can already sort doubles, no custom code is required.

There is one final case and that is when sorting a collection of custom types. ska_sort only takes a single customization function, but what do you do if you have a custom type that’s nested? In that case my algorithm would have to recurse into the top-level-type and would then come across a type that it doesn’t understand. When this happens you will get an error message about a missing overload for to_radix_sort_key(). What you have to do is provide an implementation of the function to_radix_sort_key() that can be found using ADL for your custom type:

struct CustomInt { int i; }; int to_radix_sort_key(const CustomInt & i) { return i.i; } //... later somewhere std::vector<std::vector<CustomInt>> collections = ...; ska_sort(collections.begin(), collections.end());

In this case ska_sort will call to_radix_sort_key() for the nested CustomInts. You have to do this because there is no efficient way to provide a custom extract_key function at the top level. (at the top level you would have to convert the std::vector<CustomInt> to a std::vector<int>, and that requires a copy)

Finally I also provide a copying sort function, ska_sort_copy, which will be much faster for small keys. To use it you need to provide a second buffer that’s the same size as the input buffer. Then the return value of the function will tell you whether the final sorted sequence is in the second buffer (the function returns true) or in the first buffer (the function return false).

std::vector<int> temp_buffer(to_sort.size()); if (ska_sort_copy(to_sort.begin(), to_sort.end(), temp_buffer.begin())) to_sort.swap(temp_buffer);

In this code I allocate a temp buffer, and if the function tells me that the result ended up in the temp buffer, I swap it with the input buffer. Depending on your use case you might not have to do a swap. And to make this fast you wouldn’t want to allocate a temp buffer just for the sorting. You’d want to re-use that buffer.

I’ve talked to a few people about this, and the usual questions I get are all related to people not believing that this is actually faster.

Q: Isn’t Radix Sort O(n+m) where m is large so that it’s actually slower than a O(n log n) algorithm? (or alternatively: Isn’t radix sort O(n*m) where m is larger than log n?)

A: Yes, radix sort has large constant factors, but in my benchmarks it starts to beat std::sort at 128 elements. And if you have a large collection, say a thousand elements, radix sort is a very clear winner.

Q: Doesn’t Radix Sort degrade to a O(n log n) algorithm? (or alternatively: Isn’t the worst case of Radix Sort O(n log n) or maybe even O(n^2)?)

A: In a sense Radix Sort has to do log(n) passes over the data. When sorting an int16, you have to do two passes over the data. When sorting an int32, you have to do four passes over the data. When sorting an int64 you have to do eight passes etc. However this is not O(n log n) because this is a constant factor that’s independent of the number of elements. If I sort a thousand int32s, I have to do four passes over that data. If I sort a million int32s, I still have to do four passes over that data. The amount of work grows linearly. And if the ints are all different in the first byte, I don’t even have to do the second, third or fourth pass. I only have to do enough passes until I can tell them all apart.

So the worst case for radix sort is O(n*b) where b is the number of bytes that I have to read until I can tell all the elements apart. If you make me sort a lot of long strings, then the number of bytes can be quite large and radix sort may be slow. That is the “worst case” graph above. If you have data where radix sort is slower than std::sort (something that I couldn’t find except when intentionally creating bad data) please let me know. I would be interested to see if we can find some optimizations for those cases. When I tried to build more plausible strings, ska_sort was always clearly faster.

And if you’re sorting something fixed size, like floats, then there simply is no accidental worst case. You are limited by the number of bytes and you will do at most four passes over the data.

Q: If those performance graphs were true, we’d be radix sorting everything.

A: They are true. Not sure what to tell you. The code is on github, so try it for yourself. And yes, I do expect that we will be radix sorting everything. I honestly don’t know why everybody settled on Quick Sort back in the day.

There are a couple obvious improvements that I may make to the algorithm. The algorithm is currently in a good state, but if I ever feel like working on this again, here are three things that I might do:

As I said in the problems section, there is currently no way to sort strings case-insensitive. Adding that specific feature is not too difficult, but you’d want some kind of generic way to customize sorting behavior. Currently all you can do is provide a custom sort key. But you can not change how the algorithm uses that sort key. You always get items sorted in increasing order by looking at one byte at a time.

When I fall back to std::sort, I re-start sorting from the beginning. As I said above I fall back to std::sort when I have split the input into partitions of less than 128 items. But let’s say that one of those partitions is all the strings starting with “warning:” and one partition is all the strings starting with “error:” then when I fall back to std::sort, I could skip the common prefix. I have the information of how many bytes are already sorted. I suspect that the fact that std::sort has to start over from the beginning is the reason why the lines in the graph for sorting strings are so parallel between ska_sort and std::sort. Making this optimization might make the std::sort fallback much faster.

I might also want to write a function that can either take a comparison function, or an extract_key function. The way it would work is that if you pass a function object that takes two arguments, this uses comparison based sorting, and if you pass a function object that takes one argument, this uses radix sorting. The reason for creating a function like that is that it could be backwards compatible to std::sort.

In Summary I have a sorting algorithm that’s faster than std::sort for most inputs. The sorting algorithm is on github and is licensed under the boost license, so give it a try.

I mainly did two things:

- I optimized the inner loop of in-place radix sort, resulting in the ska_byte_sort algorithm
- I provide an algorithm, ska_sort, that can perform Radix Sort on arbitrary types or combinations of types

To use it on custom types you need to provide a function that provides a “sort key” to ska_sort, which should be a int, float, bool, vector, string, or a tuple or pair consisting of one of these. The list of supported types is long: Any primitive type will work or anything with operator[], so std::array and std::deque and others will also work.

If sorting of data is critical to your performance (good chance that it is, considering how important sorting is for several other algorithms) you should try this algorithm. It’s fastest when sorting a large number of elements, but even for small collections it’s never slower than std::sort. (because it uses std::sort when the collection is too small)

The main lessons to learn from this are that even “solved” problems like sorting are worth revisiting every once in a while. And it’s always good to learn the basics properly. I didn’t expect to learn anything from an “Introduction to Algorithms” course but I already wrote this algorithm and I’m also tempted to attempt once again to write a faster hashtable.

If you do use this algorithm in your code, let me know how it goes for you. Thanks!

]]>

But first an explanation of what radix sort is: **Radix sort is a O(n) sorting algorithm working on integer keys.** I’ll explain below how it works, but the claim that there’s an O(n) searching algorithm was surprising to me the first time that I heard it. I always thought there were proofs that sorting had to be O(n log n). Turns out sorting has to be O(n log n) if you use the comparison operator to sort. Radix sort does not use the comparison operator, and because of that it can be faster.

The other reason why I never looked into radix sort is that it only works on integer keys. Which is a huge limitation. Or so I thought. Turns out all this means is that your struct has to be able to provide something that acts somewhat like an integer. **Radix sort can be extended to floats, pairs, tuples and std::array**. So if your struct can provide for example a std::pair<bool, float> and use that as a sort key, you can sort it using radix sort.

I actually do this somewhat often when I write C++ code nowadays. One recent example was that I had to sort enemies in a game that I was working on. I wanted to sort enemies by distance, but I wanted all enemies that were already fighting with the player to come first. So here is what the comparison function looked like:

bool operator<(const Enemy & a, const Enemy & b) { return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a)) < std::make_tuple(!IsInCombat(b), DistanceToPlayer(b)); }

Using that comparison operator will sort the enemies so that all enemies that are in combat with the player come first, (and they’re sorted by distance) and then there will be all enemies that are not in combat with the player. (also sorted by distance)

Except that by using this comparison operator I have to use an O(n log n) sorting algorithm. But you can use radix sort to sort tuples, so I could sort this in O(n). All I have to do is provide this function

auto sort_key(const Enemy & a) { return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a)); }

If I use that sort_key function as input to radix sort, I can sort in O(n) instead of O(n log n). Neat, huh? So how does radix sort work?

Radix sort builds on top of an algorithm called counting sort, so I’ll explain that one first. Counting sort is also a O(n) sorting algorithm that works on integer keys. The big trick is that instead of using the comparison operator, we use integers as indices into an array. The big downside is that we need an array big enough that the largest integer can index into it. For a uint32 that’s 4 gigabytes of memory… But radix sort will overcome that downside, so for now let’s just look at counting sort on bytes. Then all we need is an array of size 256, because that’s big enough that any byte can index into it. I’ll start off by dumping in a full implementation in C++, then I’ll explain how this works.

template<typename It, typename OutIt, typename ExtractKey> void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key) { size_t counts[256] = {}; for (It it = begin; it != end; ++it) { ++counts[extract_key(*it)]; } size_t total = 0; for (size_t & count : counts) { size_t old_count = count; count = total; total += old_count; } for (; begin != end; ++begin) { std::uint8_t key = extract_key(*begin); out_begin[counts[key]++] = std::move(*begin); } }

There are three loops here: We iterate over the input array, then we iterate over our buffer, then we iterate over the input array a second time and write the sorted data to the output array:

The first loop counts how often each byte comes up. Remember, we can only sort bytes using this version because we only have an array of size 256. But that array is big enough to hold the information of how often each byte shows up.

The second loop turns that buffer into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |

prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up 5 times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the number to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the type that you provided, not the key returned from that function. So this can be any custom struct. For example you could sort strings by length. Just use the size() function in extract_key, and clamp the length to at most 255. You could write a modified version of counting_sort that tells you where the position of the last bucket is, so that you can then sort all long strings using std::sort. (which should be a small subset of all your strings so that the second pass on those strings should be fast) I could also get my “enemy sorting” example from above to work: Store the boolean in the highest bit, and use the remaining bits to sort all enemies that are within 127 meters of the player. In my example 1 meter resolution would have been fine, (if one enemy is 1.1 meters away and the other is 1.2 meters away, I don’t care which comes first) and I really don’t care about enemies that are hundreds of meters away.

Counting sort is crazy fast and it really should be used more widely. But it sure would be nice if we could use keys bigger than a uint8_t.

Radix Sort builds on top of counting sort. The big problem with counting sort is that we need that buffer that counts how often every input comes up. If our input contains the number ten million, then our buffer has to be ten million items large because we need to increment the count at position ten million. Not good.

Radix sort builds on top of two neat principles:

1. counting sort is a stable sort. If two entries have the same number, they will stay in the same order.

2. If you sort a numbers by their lowest digit first, and then do a stable sort on higher digits, the result will be a sorted list.

Point 2 is not obvious, so let me walk through an example. Let’s sort the integers {11, 55, 52, 61, 12, 73, 93, 44 } first by their lowest digit. What we get is the list { 11, 61, 52, 12, 73, 93, 44, 55 }. You could try it using counting sort using “i % 10” as the extract_key function. Note that this is a stable sort, so for example 52 stays before 12. If we now do a second counting sort on this using the higher digit, we get the list { 11, 12, 44, 52, 55, 61, 73, 93 }. Which is a sorted list! Try it with counting sort using “i / 10” as the extract_key function.

This is a super neat observation. As long as you use a stable sorting algorithm, you can sort the low digits first and then sort the high digits after that.

So with that the implementation of radix sort is obvious: Just sort using one byte at a time, going from the lowest byte to the highest byte.

Now it should also be clear how to generalize radix sort to pairs, tuples and arrays. For a pair sort using the .second member first, and then sort using the .first member. For tuples and fixed size arrays use every element in decreasing order. (Unfortunately we can not use dynamically sized arrays as keys using this method, so we can for example not use strings as sort keys)

And with that we’re also coming to the biggest downside that radix sort has: This means that if we want to sort a two byte integer, we have to go over the input list four times. (counting sort goes over the list twice, and we have to call counting sort twice) For four bytes we have to go over the input list eight times, and for eight bytes we have to traverse sixteen times. For pairs and tuples this gets even bigger.

So radix sort is O(n), but it’s a large O(n). Counting sort is crazy fast, radix sort is not.

But still there should be some number for n where radix sort is faster than a sorting algorithm with O(n log n) complexity. Let’s find out where that is!

(oh but before we move on I should briefly mention how to make it work for signed integers and floats. Signed integers are somewhat straightforward: Just cast to the unsigned version and offset the values so that every value is positive. So for example for a int8_t, cast to uint8_t and add 128 so that -128 turns into 0, 0 turns into 128 and 127 turns into 255. For floats you have to reinterpret_cast to uint32_t then flip every bit if the float was negative, but flip only the sign bit if the float was positive. Michael Herf explains it here. The same approach works for doubles and uint64_t)

To start off with, lets measure how fast it is to sort a single byte using counting sort:

I got a little bit creative on the scales here, so this graph needs some explaining. I measured how fast radix sort (which for one byte is just doing counting sort) and std::sort can sort an array. I measure for each power of two from 2 to 2^30. Since my data growth exponentially, I had to use a logarithmic scale. Then I had a problem because on the logarithmic scale it was difficult to see how big the difference between the two sorting algorithms was, so I added another line that follows a linear scale which shows the relative speed. **That dotted line at the bottom follows a different, linear scale.** But the numbers on it show you how big the relative speed is between the two algorithms.

With that explanation the first thing we notice is that both std::sort and radix sort seem to grow almost linearly. But then the second thing we notice is that even for fairly small numbers, counting sort beats std::sort handily. And as our data set grows, counting sort is between four and six times faster!

Next, let’s see how this holds up when we go from counting sort to radix sort:

When sorting four bytes, radix sort needs to do several passes over the data, and because of that it takes longer for it to beat std::sort. But even for relatively small data sets with a thousand elements, radix sort is several times faster than std::sort.

One interesting thing is that dip at the end: That is me running out of memory. Counting sort is not an in-place sort. It stores the results in a different buffer than the input buffer. Radix sort on an int32 will shuffle the data back and forth between the two buffers four times, so that the results actually do end up in the same original buffer, but it still needs all that extra storage. At the last data point in that graph I’m sorting one billion elements, which are four bytes each, and I need two buffers. That gives me eight gigabytes of RAM. My machine has sixteen gigabytes of RAM. In theory there should be some more space left, but my machine starts slowing down once you use more than half of the available RAM. If I double the size again, radix sort never finishes because it starts swapping memory.

The big surprise from these measurements is that radix sort stays much faster than std::sort even though it now has to go over the input data four more times than when sorting a single byte. It’s hard to see in the graph, but in the underlying numbers it looks like **running radix sort on four bytes is two times slower than running radix sort on one byte**. And apparently std::sort also gets slower when sorting a bigger chunk of data, so radix sort still beats it.

In theory though there should be some data size where radix sort is slower than std::sort. Let’s try increasing the data size some more:

When sorting an int64, radix sort is “only” two to three times faster than std::sort. At least once you have more than 500 elements in your array. Since the difference in scale is linear though, we should be able to decrease the gap further by sorting bigger input data:

Aha, it looks like when we use sixteen bytes of data as the sort key, radix sort is finally slower than std::sort. At this point my implementation of radix sort has to do 18 passes over the input data, shuffling back and forth between the two buffers sixteen times. At some point that had to be slow. Note though that this does not mean that you can not sort large structs using radix sort. It only means that the sort key that you provide to radix sort has to be small. The size of the struct matters less. To prove that point here are the measurements for sorting a sixteen byte struct using an eight byte key:

When using a smaller key, radix sort is faster again. One thing to note though is that it’s not as fast as as when we were just sorting an int64. That suggests that maybe radix sort gets slower relative to std::sort as the size of the struct increases. The performance depends on the key size and the data size. So I decided to calculate the relative speed for a vector of size 2048. Meaning I did the above measurements with 2048 for “number of elements” and varied the key size and the data size, and plotted that in a table:

Time (in microseconds) to sort 2048 elements | key size | ||||||
---|---|---|---|---|---|---|---|

1 | 4 | 16 | 64 | 256 | |||

data size | 1 | radix sort | 16 | ||||

std::sort | 81 | ||||||

relative speed | 5.2 |
||||||

4 | radix sort | 18 | 24 | ||||

std::sort | 88 | 87 | |||||

relative speed | 4.8 |
3.7 |
|||||

16 | radix sort | 24 | 40 | 123 | |||

std::sort | 100 | 97 | 112 | ||||

relative speed | 4.1 |
2.4 |
0.9 |
||||

64 | radix sort | 57 | 119 | 347 | 1881 | ||

std::sort | 141 | 138 | 150 | 254 | |||

relative speed | 2.4 |
1.2 |
0.4 |
0.1 |
|||

256 | radix sort | 144 | 341 | 1195 | 5501 | 17657 | |

std::sort | 413 | 443 | 459 | 577 | 698 | ||

relative speed | 2.9 |
1.3 |
0.4 |
0.1 |
0.04 |

One thing I should note is that my benchmark loop also generated 2048 random numbers. So the measurements above are really for generating 2048 random numbers using std::mt19937_64, and then sorting those random numbers. For the key size of 64 I had to generate eight random numbers and for the key size of 256 I had to generate 32 random numbers, so the overhead for the random number generation is larger in those columns.

So what can we read from this table? There are two main things to notice:

- As the key size increases (reading from left to right), radix sort gets much slower. std::sort also slows down, but not by as much. When sorting one byte (two passes) radix sort is always faster. Same thing when sorting four bytes (five passes). At sixteen bytes (eighteen passes in my implementation, but you could do it in seventeen) radix sort starts to lose, especially when the data to move around is large. Moving the data back and forth sixteen times is just slow.
- When the data size increases (reading from top to bottom) radix sort also gets slower relative to std::sort. However it looks like a data size increase does not cause radix sort to switch from being faster to being slower. In fact the gap in absolute terms actually widens every time that the data size increases.

The main reason why std::sort is not affected as much by an increase in key size is that it uses std::lexicographical_compare. Meaning if I have a key of size 256, which in my case was just a std::array<uint64_t, 32> and **if the first entry in the key differs, then std::sort can early out** and doesn’t even have to look at the remaining bytes. Since radix sort starts sorting at the least significant digit, it has to actually look at every single byte in the key. There is a variant of radix sort that looks at the most significant digit first, so it should perform better for larger keys, but I won’t talk about that too much in this piece.

All of this being said, how does radix sort perform on my initial use case of sorting a std::pair<bool, float>?

Radix sort performs very well on my initial use case. It’s faster starting at 64 elements in the array. That’s because this is the same speed as sorting by a four byte int, and then by a boolean. And sorting by a boolean is the fastest possible version of counting sort. You don’t even need the buffer of 256 counts, you only need to count how many “false” elements there are in the array. So adding a boolean to something will barely slow it down when using radix sort. Actually let’s talk about some more optimizations:

- As just mentioned, you can write a faster version of counting sort for booleans. You don’t need to keep track of 256 counts for booleans, you just need one: How many “false” elements there were. Then you write all “true” elements starting at that offset, and all “false” elements starting at offset 0.
- When sorting multiple bytes, you can combine the first two loops of all of them. For example when sorting four bytes, the straightforward implementation is to just call counting_sort four times. Then you would get eight passes over the input data. But if you allocate four counting buffers of size 256 on the stack, you can initialize all of them in one loop, and turn all of them into prefix sums in one loop. Then you only have to do five passes in total over the data.
- The article that explains how to sort floating point numbers using radix sort also has a trick of sorting 11 bits at a time. Instead of sorting one byte at a time. The benefit of that is that you can sort a 32 bit number in four passes instead of five. I tried that, and for me it only gave me performance benefits if the input data is between 1024 and 4096 elements large. For any input sizes larger or smaller than that, sorting one byte at a time was faster. The reason for these numbers is that when sorting 11 bits, the counting array is of size 2048, and apparently if you do the math, the algorithm is fastest when the counting array is roughly the same size as the input data. I haven’t looked too much into that.
- In my implementation of counting_sort above I use an array of type size_t[256]. If you know that each of the buckets in there is less than four billion elements in size, you could also use a uint32_t[256]. In fact I use a different type depending on the size of the input data. This does actually help because the main cost in counting_sort is cache misses. So if your count array is small, that means more of the other arrays can be in the cache.

Now that we know that radix sort can be fast, we can write a sorting algorithm that has O(n) for many inputs. I think that std::sort should be implemented like this:

template<typename It, typename OutIt, typename ExtractKey> bool linear_sort(It begin, It end, OutIt buffer_begin, ExtractKey && key) { std::ptrdiff_t num_elements = end - begin; auto compare_keys = [&](auto && lhs, auto && rhs) { return key(lhs) < key(rhs); }; if (num_elements <= 16) { insertion_sort(begin, end, compare_keys); return false; } else if (num_elements <= 1024 || radix_sort_pass_count<ExtractKey, It>::value > 10) { std::sort(begin, end, compare_keys); return false; } else return radix_sort(begin, end, buffer_begin, std::forward<ExtractKey>(key)); }

First, the interface: Since this calls radix_sort, you have to provide a buffer that has the same size as the input array and a function to extract the sort key from the object. There could be a second version of this function with a default argument for the extract key function that just returns the value directly. So you sort can any type that radix sort supports. You would only have to provide an ExtractKey function for custom structs.

Next we decide which algorithm to use based on the number of elements. For a small number of elements, insertion_sort is generally thought to be the fastest algorithm. And for that I build a comparison function from the ExtractKey object. For a medium number of elements I would call std::sort. And for a large number of elements I would call radix_sort.

There is one more case where I call std::sort instead of radix_sort, and that is when radix sort would have to do a lot of passes over the input data. I can calculate how many passes radix sort has to do at compile time.

And finally the return value is a boolean that says whether the result was stored in the input buffer or in the output buffer. Depending on how many passes radix_sort has to do, the result could end up in either. So for example when sorting an int32, the function would return false because radix sort does four passes and the data ends up back in the input array, but when sorting a std::pair<bool, float> the function would return true because radix sort does five passes and the data ends up in the output array. The calling function then has to do something sensible with this information. If the two buffers are std::vectors, it could just swap them afterwards to get the data where it wants it to be.

Based on the benchmarks above, this algorithm would be several times faster than current implementations of std::sort for many inputs, and it would never be slower than std::sort.

How would we go about getting something like this into the standard? Well clearly we can’t change the interface of std::sort at this point. We could provide a function called std::sort_copy though that would have the above interface and could call radix_sort when that makes sense.

There is an in-place version of radix sort. If we used that, we could even use radix sort in std::sort. Except that we can’t get the ExtractKey function because std::sort takes a comparison functor. One solution for that would be to provide a customization point called std::sort_key which would work similar to std::hash. If your class provides a specialization for std::sort_key, std::sort is allowed to use an in-place version of radix sort when it makes sense, or it could build a comparison operator using std::sort_key and fall back to the old behavior.

This entire time we were building on top of counting_sort which needs to copy results to a different buffer. But if we could provide a version of radix sort that does all operations in one buffer, we could get that version into std::sort.

The in-place version of radix sort has one other very nice benefit: It starts sorting at the most significant digit. The version of radix sort that we used above started sorting at the least significant digit. This made radix sort slow for large keys because it always had to look at every byte of the key. **The in-place version could early out after looking at the first byte, which would potentially make it much faster for large keys**.

I will sketch out how in-place radix sort works, but I’ll leave the work of implementing it and measuring it to “future work.” I’ll explain why I didn’t implement it after I explain how it works.

We can’t build on top of counting sort because counting sort needs to copy results into a new buffer. But there is an in-place O(n) sorting algorithm called American Flag Sort. It works similar to counting sort in that you need an array to count how often each element appears. So if we sort bytes, we also need a count array of 256 elements. Then we also compute the prefix sum of this count array, like we did in counting sort. Only the final loop is different:

In the final loop of counting sort, we would directly move elements into the position that they need to be at. The prefix sum would tell us directly what the right position is. Since American Flag Sort is in-place, we need to swap instead. So let’s say the first element in the array actually wants to be at position 5. We swap it with whatever was at position 5. If the new element actually wants to be at position 3, we swap it with whatever was at position 3. We keep doing this until we find an element that actually wants to be the first element of the array. Only then do we move on to the second element in the array.

What tends to happen is that all the swapping at the first element moves a lot of elements into the right positions. Then all the swapping at the second element moves a lot more elements into the right position. So by the time that we’re a third of the way through the array, most elements are actually already sorted. So a lot of work happens on the first few elements, but at the later elements you mostly just determine that the elements are already where they want to be.

If you implement this you will need two copies of the prefix array. One copy that you change as you swap elements into place, (so that if two elements want to be in the bucket starting at position 5, the first one gets moved to position 5, and the second one to position 6) and one copy that you leave unchanged so that you can determine whether the element is already in the bucket that it wants to be in. (otherwise the element that you swapped into position 5 would think that its bucket now starts at position 6)

Now that we know how American Flag Sort works, we can implement radix sort on top of this. For that, American Flag Sort has to return the 256 offsets into the 256 buckets that it created. Then we call American Flag Sort again to sort within each of those 256 buckets, using the next byte in the key as the byte that we want to sort on. Meaning for a four byte integer, we have to sort recursively within a smaller bucket three more times after the initial sort. Since there’s 256 buckets and each of those gets split into 256 buckets after the second iteration and each of those gets split again after the third iteration, that means that we’ll call the function 256^3 times. Since that is a crazy number, we can just call insertion_sort for any bucket that is less than 16 elements in size, which will be most buckets. And actually since the in-place radix sort isn’t stable, we can also just call std::sort for any bucket that is less than 1024 elements in size. That gets the number of recursive calls down by a lot.

This sounds simple: Use American Flag Sort to subdivide into 256 buckets using the first byte, then sort each of those buckets recursively using the remaining bytes as sort key. The problem that I ran into was that I had generalized radix sort to work on std::pair, std::tuple and std::array. On the in-place version these were far more complicated because you have to pass the logic for advancing and the next comparison function through all recursion layers. Especially the std::tuple code drove me template-crazy. Since American Flag Sort was also significantly slower than counting sort, I abandoned the in-place version for now and decided to leave that for future work.

So for now the takeaway is this: There is an in-place version of radix sort, but for now I decided that it’s too much work to implement. The part that I did implement looked several times slower than the copying radix sort. It might still be faster than std::sort, but I haven’t measured that.

In this article I found that radix sort is several times faster than std::sort for what seem like pretty normal use cases. So why isn’t it used all over the place? I did a quick poll at work, and many people had heard of it, but didn’t know how to use it or whether it was even worth using at all. Which is exactly what I was like before I started this investigation. So why isn’t it popular? I have a few explanations

- Size overhead: Radix sort requires a second array of the same size to store the data in. If your data is small you may not want to pay for the overhead of allocating and freeing a second buffer. If your data is large you may not have enough memory to have that second buffer. Radix sort may still be a great option if your data is sized somewhere in the middle, but using radix sort means that you have to worry about these things.
- Can’t use radix sort on variable sized keys. The version of radix sort presented here only works on fixed sized keys. So it can’t sort strings for example. The in-place version of radix sort can sort strings, but I didn’t look into that too much.
- We used to always write custom comparison functions. If you don’t use std::make_tuple or std::tie to implement your comparison function, it may not be obvious how to use radix sort for your class. You need to know that you can sort tuples using radix sort, and you need to notice that you’re using tuples in your comparison functions already.
- I can’t find any place that generalizes radix sort to std::pair, std::tuple and std::array. So this might actually be an original contribution of mine. Googling for it I can find mentions of using radix sort on tuples of ints, but it seems like those people don’t realize that you can generalize beyond that. Certainly nobody suggests that you could use radix sort on a std::pair<bool, float>. (for example the boost version of radix sort can not sort a std::pair<bool, float>, and boost code is usually way too generic) If you think that radix sort is only for integers, it’s not very useful.
- Radix sort can not take advantage of already sorted data. It always has to do the same operations, no matter what the data looks like. std::sort can be much faster on already sorted data.

So there are certainly some good reasons for not using radix sort. There simply can’t be one best sorting algorithm. However I also think that radix sort lost some popularity due to historical accidents. People often don’t seem to think that it applies to their data even though it does.

Radix sort is an old algorithm, but I think it should be used more. Much more. Depending on what your sort key is, it will be several times faster than comparison based sorting algorithms.

Radix sort can be used to implement a general purpose O(n) sorting algorithm that automatically falls back to std::sort when that would be faster. I think the standard library should be modified so that it can provide this behavior. I think this is possible by offering an extension point called std::sort_key which would work similar to std::hash. Even without that the standard could provide std::sort_copy, which would promise O(n) sorting on small keys.

The final conclusion is that it’s worth learning something about algorithms even if you’ve programmed for a while. I learned how radix sort works because I’ve been watching an Introduction to Algorithms course by MIT. I didn’t expect to learn anything new in that course, but it’s already caused me to write this blog post, and it’s inspired me to take another stab at writing the worlds fastest hashtable. (in progress) So never stop learning, and try to fill in the gaps in your knowledge of the basics. You never know when it will be useful.

I have uploaded my implementation of radix sort here, licensed under the boost license.

]]>