The idea is that this should be something similar as George Polya’s “How to Solve It” but for doing research instead of solving problems. There is a lot of overlap between those two ideas, so I will quote a lot from Polya, but I will also add ideas from other sources. I should say though that my sources are mostly from Computer Science, Math and Physics. So this list will be biased towards those fields.

My other background here is that I work in video game AI so I’ve read a lot of AI literature and have found parallels between solving AI problems and solving research problems. So I will try to generalize patterns that AI research has found about how to solve hard problems.

A lot of practical advice will be for getting you unstuck. But there will also be advice for the general approach to doing research.

The general framework is that of exploration and exploitation. Exploitation means you are getting more out of old ideas. Exploration means you are looking for entirely new ideas. You may be thinking that doing research is more exploration than exploitation, but it’s actually a mix which contains more exploitation than exploration. Really new ideas get discovered rarely, and most of the work is to realize all the consequences of existing ideas.

The two analogies I like for this are hill climbing and exploring the ocean.

Exploitation is as an effort in hill climbing. Hill climbing comes from the family of AI problems that deal with search, where search means “I’m at point A, I want to get to point B.” It’s a very general problem that applies to more than just trying to find your way on a map. For research point A is “here is what I know now” and point B is “here is what I would like to find out/prove/demonstrate/get to work/make happen etc.”

There is a large number of search algorithms, and you only really use hill climbing if your problem has the following criteria: You can’t see very far, the problem is very complex, you don’t know where the goal is and progress is slow. Meaning there is heavy fog in the mountains, the mountains are crazy complex, you may end up at a different peak than what you had planned (or maybe you just want to get out of a valley and don’t know ahead of time where that will take you) and to top it off it’s all covered in snow, making progress very slow. So you can’t just say “I’m going to explore a thousands paths” because exploring one path might take you a week before you find out that it leads to a dead end.

At that point all fancy AI techniques are out of the door and we’re left with simple hill climbing. Luckily AI has several improvements over the simple “go up” approach that will just get you stuck on the first small hill.

For the “exploration” part of exploration and exploitation I want to use the analogy of exploring the ocean. You obviously can’t do hill climbing there. You could try to bring a long rope and measure the depth of the ocean, but then you would always just move straight back to the island that you came from. Because if you just left the harbor, in which direction does the floor of the ocean go up? Back into the harbor. You have to cover a large distance before you can do hill climbing.

The “exploring the ocean” analogy is not perfect, because there is a property to this kind of research where the more you’re trying to reach a goal, the less likely you’re going to get there. I guess it works if you have a wrong idea of where the goal is. Like Columbus thinking that India was much closer, and accidentally discovering America.

The best explanation I have found for this is by Kenneth Stanley in his talk The Myth of the Objective – Why Greatness Cannot be Planned. I recommend watching the talk, but if you don’t want to do that I will mention the main points further down.

For now the main point is that there are some discoveries that can only come from free exploration. You find a topic that’s interesting and you go and explore in that direction, without any specific aim other than to find what’s over there. Then at some point you start doing hill climbing to actually get results, but you can’t start off with it.

In this section I’ll mention the general approach to doing research. You’re probably doing many of these things already because they’re common sense but it’s still worth pointing these things out once. Especially students often get these things wrong, and then it’s good to be able to recognize what they do differently than you, and then it’s good to have the words for the common sense.

When doing research it’s easy to fool yourself. So it is very important that you go out of your way to prove yourself wrong. Feynman thought this was very important when talking about Cargo Cult science. I’m slightly misquoting him here because he doesn’t just talk about proving yourself wrong, but about a broader scientific honesty:

But there is

onefeature I notice that is generally missing in Cargo Cult Science. That is the idea that we all hope you have learned in studying science in school – we never explicitly say what thisis, but just hope that you catch on by all the examples of scientific investigation. It is interesting, therefore, to bring it out now and speak of it explicitly. It’s a kind of scientific integrity, a principle of thought that corresponds to a kind of utter honesty – a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid – not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked – to make sure the other fellow can tell they have been eliminated.Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can – if you know anything at all wrong, or possibly wrong – to explain it. If you make a theory, for example, and advertise it, put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it.

He goes on to talk about more things that you should do, but for now I just want to talk about the part of proving yourself wrong, because it is genuinely helpful when doing research.

It’s also the main difference between real research and pseudo-science. People in pseudo-sciences never try to prove themselves wrong. It’s also the main difference between real medicine and alternative medicine. People who promote crystal healing never try to prove themselves wrong. At least not seriously. Same thing between real journalism and conspiracy theorists. A real journalist will try to prove his theories wrong many times before publishing. Especially if it’s about a conspiracy.

But there is a more pragmatic reason: Proving yourself wrong early will save you from wasting time. Of course if you try to do research into a crazy theory like crystal healing, you’re wasting time. But there are also many reasonable cases where the fastest way to find out if an approach will work is to try to prove it wrong.

Now this can be a little bit tricky, because as Feynman also said “The first principle is that you must not fool yourself – and you are the easiest person to fool.” Meaning it’s hard for you to prove yourself wrong. It’s easier for you to fool yourself. But once you get better at proving yourself wrong, you tend to find shortcuts. Ways to rule out in a day what would have taken you a month to confirm. It turns out that often you only need rough heuristics to prove yourself wrong where proving yourself right requires actually working all the way through the problem.

You want to be a little bit careful with this because sometimes good ideas lurk in areas that most people stay far away from because of some “probably won’t work” heuristic, but usually those heuristics are a good idea. And even if you don’t want to use heuristics, it’s still often faster to prove yourself wrong than to prove yourself right, so the advice stands.

For some people it’s very hard to prove themselves wrong. There is one final trick that can even help those people: For some reason we are really good at proving other people wrong. If somebody else comes to you with a crazy idea you can immediately tell that it’s a crazy idea. Much more quickly than if it was your own idea. So the final trick is to ask others to prove you wrong. Meaning just ask a colleague to run an idea by them. And then listen to what they say.

I don’t know a good name for this, so I’ll use the name of the AI technique. You probably do this automatically, but it’s worth pointing out, because sometimes I see people who don’t do this, and they are really screwed.

Simulated Annealing is a very general approach, which roughly says that you should figure out the big picture out before you figure out the details. And it does that by prescribing what your response should be when you’ve walked all the way up in hill climbing and gotten stuck. Getting stuck means you’re at some local peak and it seems like you can’t see any paths that take you any higher. It seems like there are only steep cliffs or options that take you back down the hill. And ideally you’re not just stuck for five minutes, but you’re stuck for an hour or a day or more.

The general pattern is that every time you get stuck, you do a reset. But the size of the reset becomes smaller and smaller. At first you reset your progress completely and start over from the very beginning and try an entirely different approach. After you’ve tried a few different approaches, the next time that you reset yourself, do a smaller reset so that you stay in the area that took you the furthest. Don’t try new approaches any more, but try different variations of one approach. Later you do smaller resets still and maybe just try a few different solutions to specific problems. And at the end you do really small resets and just tweak some numbers.

What this means is that when you first work on a problem, you shouldn’t spend too much time fiddling around with the details. Instead try a different approach.

And then later, when you’re fiddling around with the details, you should not go back and try a whole different approach.

There is a progression to this. You often see inexperienced researches spend too much time on the details early on. Or maybe they have come very far and are already twiddling with the numbers when they find that their whole approach was wrong and they have to start over with an entirely different approach. That is very demoralizing. You have to do that exploration early on before you ever get to the details.

Or sometimes people are just trying lots of different approaches and are never actually doing one approach seriously. Simulated Annealing says that you should walk up until you’re stuck. Don’t switch to a different approach until you’ve gotten stuck. (sometimes that takes too long and you end up spending weeks on one approach. In that case set a time limit and do a reset once a week or so)

So when you first get stuck, do big resets and try entirely different approaches. Then over time do smaller and smaller resets.

This also gives a natural end to your research. Once you’re done twiddling with the details you’re done, period. (you don’t need to go back and try an entirely different approach since you already explored those earlier)

This one is not an AI technique but my own observation. It’s also something that all good researches do automatically, but it’s worth pointing out explicitly.

You want to be incremental. In the hill climbing analogy, imagine that there are several paths already carved into the mountains where previous researchers have made progress before you. You almost always want to start off from one of those paths. In fact some of those paths have become very wide because there are lots of researches doing work up at the end of those paths, so the path is well-trod.

You may actually want to avoid paths that are too wide, but only if you are experienced already. If you are a grad student doing your first research, don’t stray too far from where others are.

The advice to be incremental may be disappointing because you want to invent the next big theory like general relativity or the next Internet or whatever. But actually the more you read about those and how they came about, the more you realize that they were actually quite incremental. There are really very few inventions which can not be traced back to ideas that slowly accumulated and evolved over many years. Sometimes an idea seems really impressive to the outside world because to them it’s all new. But then you look at the author’s work and find that they had been silently working on it incrementally for the last ten years.

You may think that the “be incremental” advice does not apply to the “ocean explorer” analogy of research, but you’d be wrong. Few good things have come from just setting off into completely uncharted territory. Usually you want to hop from island to island. The “Myth of the Objective” talk that I talked about above strongly emphasizes how important stepping stones are for this kind of research. The results in their program couldn’t have come about if people couldn’t have built on top of each other’s results.

The AI technique for this is called Local Beam Search (the link is to “Beam Search” because it seems like Local Beam Search is never mentioned online…) which is a variant of hill climbing where we do several searches at the same time. That’s the whole trick. Programmers are not good at naming things.

Doing several searches at the same time is an easy thing to do for a computer, but it’s hard to do for a person. But I think we can get the same benefits without literally doing the searches at the same time. I’m going to quote from the book “Artificial Intelligence – A Modern Approach” (second edition) by Russel & Norvig to list the benefits:

In a local beam search, useful information is passed among the parallel search threads. […] The algorithm quickly abandons unfruitful searches and moves its resources to where the most progress is being made.

So how do we get these benefits as a simple human who can’t do multiple searches at the same time? One thing we can do is keep track of where you walked down one path but you could have walked down the other path. Explore the other paths every once in a while. Since humans have to do this sequentially, it’s actually similar to the simulated annealing I mentioned above. But the idea would be to do multiple searches at once, where each of them follows the simulated annealing approach.

One other approach to this is to always have more than one project. Here is Robert Gallager talking about this in the context of a talk about Claude Shannon:

Be interested in several interesting problems at all times. Instead of just working intensely on one problem, have ten problems in the back of your mind. Be thinking about them, be reading things about them, wake up in the morning, and review whether there’s anything interesting about any of them. And invariably […] something triggers one of those problems and you start working on it.

Why is that so important? I would say that one of the most difficult things in trying to do research is “how do you give up on a problem?” So many students doing a thesis just beat themselves over the head for year after year after year saying “I’ve got to finish this problem, I said I was going to do it and I’m going to do it.” If it’s an experiment you’re going to do, yes you can do that. You can do it more and more carefully if something isn’t working you can fix it to the point where it works. [But] if you’re trying to prove some theorem and the theorem happens to be not true, then your chances of success are very low.

If you have these ten problems in the back of your mind and there is one problem that’s been driving you crazy, what are you going to do? It’s going to sink further and further back in your mind and just because you’ve had more interesting things to do, you’ve gotten away from it. After a year you won’t even remember you were working on it. It will have disappeared. That’s a far better thing than to reason it out and say “I don’t think I can go on any further with this because of this, this and this reason.” Because your problem is you don’t understand the problem well enough to understand why you oughta give up on it. So you just find that you can do other things which become more interesting temporarily.

I think this quote is spot on, but there is an additional benefit to working on several things at the same time: There is lots of cross pollination between ideas. Also talking about Claude Shannon, this article has another good quote about this:

His information theory paper drew on his fascination with codebreaking, language, and literature. As he once explained to Bush:

“I’ve been working on three different ideas simultaneously, and strangely enough it seems a more productive method than sticking to one problem.”

The final angle in which this is helpful is that research just takes time. Some things can’t be hurried. If you work on multiple things at the same time, then that allows you to work on one thing for a longer time. If one project needs to take ten years, then there is no way that you can work on it full time for ten years. But if you also work on other things during those ten years, all of a sudden it’s doable.

I’m using Barbara Oakley’s terms from her Learning How to Learn online course.

The idea is that the brain has two distinct ways of working: The focused mode where you actively work on a problem, and the diffuse mode where you’re doing something else entirely but your subconscious is working on the problem. The diffuse mode is responsible for a lot of eureka stories, including the original one: Archimedes was stuck on a problem, trying to figure out whether a crown was pure gold or not. Then on a trip to a bath he is relaxing, mind drifting off, watching the water move, when suddenly the answer jumps into his head.

For me I also often get ideas like this while taking a shower. Some people say that physical exercise helps them get into the right mode. That doesn’t work for me, but long walks certainly do work. Some people say that they only get into this mode when sleeping and that they wake up with good ideas. That works very rarely for me, but maybe you have more luck with it. It seems like you need to be somewhat relaxed, your mind mostly idle. Then the background processes in your brain get to work and can form new connections that weren’t clear before. You have to stop thinking about the problem for a while and then an answer may drift back up from some deeper part of your brain that you don’t have direct access to.

The tricky part is that you can’t easily schedule what your brain is going to work on in the diffuse mode. Distractions like smart phones are really harmful, but even if you turn your phone off you can easily get into a mode where all you can think of is the latest controversy in the news. Rich Hickey talks about how he deals with this issue in his talk Hammock Driven Development: (he talks about sleep because for him a lot of this thinking happens while sleeping)

So imagine somebody says “I have this problem…” and you look at it for ten minutes and go to sleep. Are you going to solve that problem in your sleep? No. Because you didn’t think about it hard enough while you were awake for it to become important to your mind when you’re asleep. You really do have to work hard, thinking about the problem during the day, so that it becomes an agenda item for your background mind. That’s how it works.

For me sleeping doesn’t work, but I’ve certainly found the same thing to be true when going for long walks. If the last thing that I did before the walk was check the news, my brain will keep on going back to whatever I was reading. If the last thing was that I worked really hard on a problem, I may have a chance of finding a solution to the problem while going for a walk. (can’t force it though, you need to allow your mind to drift off and then drift back to the topic)

This is no guarantee for success. Oftentimes the solution of the diffuse mind doesn’t actually work. Or it’s just one step of the solution and after you take that one step you’re just as stuck as you were before. But anytime that I’m making really good progress, it’s a combination of focused mode and diffuse mode work.

This is another one one that is so basic that it’s rarely stated, but I sometimes see people confused about this. For example I was going through the book reviews of Kenneth Stanley’s book about how objectives can be harmful, and one person says that the book is clearly wrong because there are studies about how helpful goal setting is, with a link to this article. And since I believe in proving myself wrong I naturally read that article. It turns out that the professor mentioned in that course, Jordan Peterson, has made the course available on Youtube. And if you listen to what he actually says about setting goals, it’s a lot more compatible with Kenneth Stanley’s idea:

You don’t get something you don’t aim at. That just doesn’t work out. So lots of people aim at nothing and that’s what they get. So if you aim at something you have a reasonable crack at getting it. You tend to change what you’re aiming at a bit along the way, because like, what do you know? You aim there, you’re wrong. But you get a little closer. And then you aim there, and you’re still wrong. You get a little closer and you aim there, and as you move towards what you’re aiming at, your characterization of what to aim at becomes more and more sophisticated. So it doesn’t really matter if you’re wrong to begin with as long as you’re smart enough to learn on the way, and as long as you specify a goal.

This is spot on as far as I’m concerned. You need a goal to start with. But as you travel towards that goal, you may find reasons to change the goal and you shouldn’t be afraid to do that if you have a good reason. He goes on to say that it’s OK to specify a vague goal as long as you’re going to refine that goal along the way. (I think there is also some connection here to Scott Adams’ theory of “using systems instead of goals” but I haven’t thought that through)

Kenneth Stanley developed an algorithm called “Minimal Criterion Novelty Search” after his discovery about how harmful it can be to aim for a goal too rigidly. Novelty Search just tries visit as many different places in the search space as possible. Meaning it generates novel approaches to whatever problem you’re working on. Doesn’t matter if those novel approaches don’t look like they would solve the problem. “Minimal Criterion” says that the novel behaviors should still behave above some minimum threshold like “don’t get eaten by predators before you reproduce.” You can define your own minimal criterion for your problem, but it shouldn’t be very challenging to overcome. He has then shown that for tricky problems, novelty search is better than goal oriented search because novelty search doesn’t go for the goal and doesn’t get stuck on whatever the “trick” is. It just tries to reach as many different points as possible and will eventually automatically find its way around the trick.

The diffuse mode from the last section is a good way to get unstuck. But it’s a bit unreliable, and it needs to be fed. You need to work on the problem hard before the diffuse mode can provide you with a solution. But how do you do that if you’re stuck? And are there any more direct ways to get unstuck? In the hill climbing analogy, imagine you have come across a steep cliff and all you can see is ways back down or sideways.

Some of the advice here is to find good sideways steps that have helped others, other advice is for discovering as many sideways steps as you can. Others are about finding new starting points. It’s all about making movement in the hope that you will come across some hidden path that can take you up again.

This is the Feynman Algorithm for solving problems:

- Write down the problem.
- Think real hard.
- Write down the solution.

The algorithm is of course a joke because Richard Feynman made physics look so easy. Except that a friend of mine once said that the Feynman algorithm actually worked for him. Since then I have tried it a few times and it has actually really helped me, too. The important step seems to be step 1: Write down the problem. Sometimes we seem to be stuck, but we’re not actually all that clear of what exactly we’re stuck on. Putting it into writing forces us to consider what exactly the problem is, and sometimes just doing that is enough. If it’s not, step 2 has also brought me to the solution. Literally just sitting there staring at the formulation of the problem on the paper. Seems unlikely, but sometimes it works. (because you never actually thought about the explicitly stated problem)

A related solution from computer science is Rubber Duck Debugging. The idea is that if you’re completely stumped on trying to figure out a bug in your code, sometimes it helps to explain it to somebody else. That other person doesn’t actually have to understand what you’re talking about. It just helps talking through the problem. So a rubber duck is good enough for this.

I have to confess that I don’t find it easy to talk to a rubber duck, so I usually try explaining my current problem to my girlfriend. She doesn’t know a whole lot about computer science, but if I say that “I just need to talk through this problem once” then she will usually make an effort to listen. It’s also a good exercise to try to explain the problem in a way that somebody who is not familiar with algorithms and data structures can understand. The goal isn’t really to get her to understand it, but to get myself to talk about the problem fully enough that she could understand it.

Oftentimes that’s all it takes to find the thing that you forgot to check.

One thing that has really helped me on this is George Polya’s book “How to Solve It” which makes you ask yourself these questions:

What is the unknown? What are the data? What is the condition? Is it possible to satisfy the condition? Is the condition sufficient to determine the unknown? Or is it insufficient? Or redundant? Or contradictory?

Draw a figure. Introduce suitable notation.

Separate the various parts of the condition. Can you write them down?

It’s an exercise in getting good about “stating the problem.” And going through it explicitly somehow helps. Polya also points out that sometimes you’re stuck simply because you lost sight of the goal, which is another explanation for why stating the problem helps you get unstuck.

Polya also has a list of proverbs in his book that I will quote sometimes, here are his proverbs for this one:

Who understands ill, answers ill. (who understands the problem badly, answers it badly)

Think on the end before you begin.

A fool looks to the beginning, a wise man regards the end.

A wise man begins in the end, a fool ends in the beginning.

Here is Claude Shannon about simplifying problems:

Almost every problem that you come across is befuddled with all kinds of extraneous data of one sort or another; and if you can bring this problem down into the main issues, you can see more clearly what you’re trying to do and perhaps find a solution. Now, in so doing, you may have stripped away the problem that you’re after. You may have simplified it to a point that it doesn’t even resemble the problem that you started with; but very often if you can solve this simple problem, you can add refinements to the solution of this until you get back to the solution of the one you started with.

And here is Robert Galager again about an experience when he had a complex problem and asked Claude Shannon for help:

He looked at it, sort of puzzled, and said, ‘Well, do you really need this assumption?’ And I said, well, I suppose we could look at the problem without that assumption. And we went on for a while. And then he said, again, ‘Do you need this other assumption?’ And he kept doing this, about five or six times. At a certain point, I was getting upset, because I saw this neat research problem of mine had become almost trivial. But at a certain point, with all these pieces stripped out, we both saw how to solve it. And then we gradually put all these little assumptions back in and then, suddenly, we saw the solution to the whole problem.

Another thing I like to do for this is to solve the problem for one case. Instead of trying to attack the general problem, pick a simple case and solve it. Then another. Then another. Then another. Then look for patterns. Don’t look for patterns until you’ve solved three or four specific cases. The cases I usually look at are “what if these are all zero?” Or “what if this always takes the same amount of time?” Or “what if everybody wants the exact same thing?” And then further questions are small variations on that like “what if these are all 1? Or what if these are all zero except for that variable?” Or “what if these take different amounts of time but they start at regular intervals?” Or “what if everybody wants the exact same thing except for that one special case?”

Polya in “How to Solve It” of course also has questions for this:

If you cannot solve the proposed problem try to solve first some related problem. Could you imagine a more accessible related problem? A more general problem? A more special problem? Keep only a part of the condition, drop the other part; how far is the unknown then determined, how can it vary? Could you change the unknown or the data, or both if necessary, so that the new unknown and the new data are nearer to each other?

There is a subtle benefit to simplifying the problem which I’ll explain using the concept of “overfitting” from machine learning. Overfitting happens when your algorithm didn’t really learn the underlying pattern, but just memorized all the training examples. (and then doesn’t work on new examples) Overfitting means that you learned both the signal and the noise. One way to make overfitting less likely is to simplify or to generalize because simplifying the problem reduces the noise in the problem. (I will talk about generalizing further down) This is a bit of an abstract concept and probably deserves a fuller discussion (particularly because some simplifications actually increase your risk of overfitting) but for now I just want to say that solving a simplified problem can reveal broader truths than solving a complex problem, so don’t feel bad for simplifying. It can have real benefits.

We already had Polya’s advice of “Draw a picture. Introduce suitable notations” above, but this goes further. We can often use the visual processes in our brain to solve problems.

This applies to many more problems than math problems. Lots of math has geometric interpretations, but so do other fields. You can draw diagrams or plots or maps or simplified sketches or any number of other things.

One trick is to try to visualize as much data as possible. Draw scatter plots. Then draw small multiples of scatter plots. Add layers, colors, work at different scales, anything that allows you to show more data without confusion. Let your eye do the filtering later. While we would have a hard time dealing with thousands of numbers in writing, we have a very easy time finding patterns in thousands of numbers in a scatter plot. Here is a quote from Edward Tufte’s Envisioning Information (page 50):

We thrive in information-thick worlds because of our marvelous and everyday capacities to select, edit, single out, structure, highlight, group, pair, merge, […] focus, organize, condense, reduce, […] categorize, catalog, […] isolate, discriminate, distinguish, […] filter, lump, skip, smooth, chunk, average, approximate, cluster, aggregate, outline, summarize, itemize, review, dip into, flip through, browse, […].

Visual displays rich with data are not only an appropriate and proper complement to human capabilities, but also such designs are frequently optimal. If the visual task is contrast, comparison and choice – as it so often is – then the more relevant information within eyespan, the better. Vacant, low-density displays, the dreaded posterization of data spread over pages and pages, require viewers to rely on visual memory – a weak skill – to make a contrast, a comparison, a choice.

A common theme of Tufte is that we are really good at looking at lots of data. It’s not good to only show parts of the data at a time. Better to show as much as possible, and people will focus on what they want.

Except of course it’s not quite that simple, because if you just show as much as possible, you often have an unreadable mess. Edward Tufte’s books are all about showing as much as possible without having a mess on your hand. But really you can get pretty far by just trying and iterating on your visualizations. Try combining visualizations, then try separating them. Try looking at multiple things next to each other. Try zooming out or try zooming in etc.

This is one of the main points in Polya’s “How to Solve It.” He thinks mobilizing prior knowledge is one of the most important things you can do. To do this you of course have to be fluent in the field that you’re researching. Here are his questions related to this:

Have you seen it before? Or have you seen the same problem in a slightly different form?

Do you know a related problem? Do you know a theorem that could be useful?

Look at the unknown! And try to think of a familiar problem having the same or a similar unknown.

Here is a problem related to yours and solved before. Could you use it? Could you use its result? Could you use its method? Should you introduce some auxiliary element in order to make its use possible?

This is also a point where it’s useful to work on several things at the same time. Because somehow it seems that formulas or methods or insights from one area often apply in a different area. I don’t know why that is. Maybe there are only a finite number of concepts and connections between them, so we see the same concepts in several fields. (whatever explanation we come up with would also have to explain why garbage-can decision making works so well, so my explanation isn’t very good…) Here is Feynman talking about this:

[After deriving the conservation of angular momentum from the laws of gravity]. And thus we can roughly understand the qualitative shape of the spiral nebulae. We can also understand in the same way the way a skater spins when he starts with his leg out, moving slowly, and as he pulls the leg in he spins faster.

But I didn’t prove it for the skater. The skater uses muscle force. Gravity is a different force. Yet it’s true for the skater. Now we have a problem. We can deduce, often, from one part of physics, like the law of gravitation, a principle which turns out to be much more valid than the derivation.

[…]

So we have these wide principles which sweep across all the different laws. And if one takes too seriously these derivations, and feels that “this is only valid because this is valid” you can not understand the interconnections of the different branches of physics. Some day, when physics is complete, then all the deductions will be made. But while we don’t know all the laws, we can use some to make guesses at the theorems which extend beyond the proof. So in order to understand the physics one must always have a neat balance and contain in his head all the various propositions and their interrelationships because the laws often extend beyond the range of their deductions.

(edited heavily for brevity)

This is true in across parts of physics and it’s also true across entirely different fields, but I should also state that most of the time, the “related problems” you want to look at are going to be pretty close by. Polya gives examples like “to find the center of mass of a tetrahedon, see if you can use the method of the simpler related problem of finding the center of mass of a triangle.”

But I also want to bring it back to Polya’s idea of mobilizing prior knowledge: There is a lot of evidence that in most fields, the main difference between experts and novices is how much experience or knowledge of the field they have, and how good they are at organizing this knowledge. This comes out of Kahneman’s and Tversky’s work with expert firemen, but also from research about chess grandmasters. The better somebody gets at chess, the more they use their memory. (as measured by brain activity) So you have to build that pool of knowledge. You have to know lots of related problems and you have to be able to draw connections to them.

This is very related to the previous problem and it’s also something that I can find plenty of quotes for. Here is Claude Shannon for example:

Another approach for a given problem is to try to restate it in just as many different forms as you can. Change the words. Change the viewpoint. Look at it from every possible angle. After you’ve done that, you can try to look at it from several angles at the same time and perhaps you can get an insight into the real basic issues of the problem, so that you can correlate the important factors and come out with the solution.

Polya’s questions about this topic are simpler in that they are simply “Can you restate the problem? Could you restate it still differently? Go back to definitions.”

Polya then goes on to list several reasons for why this helps. One is that a different approach to the problem might reveal different associations, allowing us to find other related problems. (see the point above) A second reason I will just quote:

We cannot hope to solve any worth-while problem without intense concentration. But we are easily tired by intense concentration of our attention upon the same point. In order to keep the attention alive, the object on which it is directed must unceasingly change.

If our work progresses, there is something to do, there are new points to examine, our attention is occupied, our interest is alive. But if we fail to make progress, our attention falters, our interest fades, we get tired of the problem, our thoughts begin to wander, and there is danger of losing the problem altogether. To escape from this danger we have to

set ourselves a new questionabout the problem.The new question unfolds untried possibilities of contact with our previous knowledge, it revives our hope of making useful contacts. The new question reconquers our interest by varying the problem, by showing some new aspect of it.

See also the point about Simulated Annealing above which says that you should frequently try new approaches. But the size of the change that you make should differ over time.

And finally here is Feynman talking about the same idea. When talking about what if several theories have the same mathematical consequences, he says that “every theoretical physicist that’s any good knows six or seven different theoretical representations for exactly the same physics and knows that they’re all equivalent, and that nobody is ever going to be able to decide which one is right, but he keeps them in his head hoping that they will give him different ideas.” As for how they may help he says that a simple change in one approach may be a very different theory than a simple change in a different approach. And that changes which look natural in one theory may not look natural in another.

This one is related to some of the points that I made in “draw a picture” above, but it’s also worth talking about the data separately, without the context of a picture. To start off with here are Polya’s questions related to this topic:

Did you use all the data? Did you use the whole condition? Have you taken into account all essential notions involved in the problem?

Could you derive something useful from the data? Could you think of other data appropriate to determine the unknown? Could you change the unknown or the data, or both if necessary, so that the new unknown and the new data are nearer to each other?

One thing I would like to point out here is that there are many ways to organize data. I have literally given talks where all I did was take existing data and organized it in a different way to put emphasis on different conclusions. The original authors had organized their data by categories. I had organized it by strength of correlation. There are many ways to sort, filter, group or abstract data, and there are often many different insights to be gained depending on how you go about doing this.

Polya is referring to something else here though. For him the “data” are the information given for a problem. His example problem for this question is “We are given three points A, B and C. Draw a line through A which passes between B and C and is at equal distance to B and C.” And his point is that after drawing a picture of the dots with the desired line, the solution comes almost automatically if you just draw lines using all the available data. (the points A, B and C, as well as the desired line) So for your problem the data may just be any available information.

Changing the data can mean a lot of things from “collect more information” to “if I assume that this variable is always 0, would that simplify the problem?” And changing the unknown means that if the data suggests a different goal, at least consider that other goal. Maybe it’s a better goal than what you were looking for.

Richard Feynman was famous for this because he said that his Nobel prize came directly from playing around with physics. Here is the quote for that and it’s a great read. (unfortunately too long to be included in this blog post)

In the Feynman quote playing around means investigating a problem that has no practical applications. But you can even do this within a problem. You can play around with equations. Do random substitutions. See what the consequences would be if you cube a variable rather than squaring it. Take the equations in a circle and back to where they started. Do anything that you are curious about. You can play around with experiments. If you are working on some variable and normal values are in the range from 20 to 30, then try the values 1 and 100, just to see what happens. If nothing bad happens, try the values 0.1 and 1000. Antibiotics were discovered because an experiment went wrong and Alexander Fleming reacted with curiosity rather than frustration.

Here is a quote from Carver Mead that is in a similar vein to the Feynman story above:

John Bardeen was just the most unassuming guy. I remember the second seminar Bardeen gave at Caltech — I think it was just after he got his second Nobel Prize for the BCS theory, and it was some superconducting thing he was doing. He had one grad student working on it and they were working on this little thing, and he gave his whole talk on this little dipshit phenomenon that was just this little thing. I was sitting there in the front row being very jazzed about this, because it was great; he was still going strong.

So on the way out, people were talking and one of my colleagues was saying, “I can’t imagine, here’s this guy that has two Nobel Prizes and he’s telling us about this dipshit little thing.” I said, “Don’t you understand? That’s how he got his second Nobel Prize.” Because most people, one Nobel Prize will kill them for life, because nothing would be good enough for them to work on because it’s not Nobel Prize–quality stuff, whereas if you’re just doing it because it’s interesting, something might be interesting enough that it’s going to be another home run. But you’re never going to find out if all you think about is Nobel prizes.

This one is connected to the point about “use a related problem” above, but there is additional value to be gained from reading a related paper that I haven’t talked about above.

Reading a related paper is especially valuable if you try to reproduce the related paper. For me that’s often easy to do in computer science because I can implement the program. If it’s hard to do in your field, don’t be afraid to take shortcuts. (potentially huge shortcuts) You’re not trying to verify the paper, the value of the exercise actually comes from walking in other people’s shoes for a while. See what they did and why they did it. Criticize their ideas and their approach.

If you start from the other paper’s starting point, you will come across plenty of opportunities to do things differently. Maybe one of those different paths can give you an idea for your problem. And different starting points run into different problems, which sometimes allows you to dodge a problem that you ran into. Meaning the problem literally doesn’t even show up just because you came from a different angle.

Another thing I like to do is read old papers. You will be surprised at which alternatives they explored back then. (whatever “back then” means for your field) When a field is young, people are more open-minded. Often, the old alternative theories are obviously ridiculous now, but sometimes there are ideas there that should be revisited. Even if you don’t come across anything like that, I still just get random ideas from exposing myself to naive (but smart) ways of thinking about the problem.

Just as reading a paper is a good exercise for getting a different view point, so is starting from the end. The AI method for this is called bidirectional search, and there are real mathematical reasons for why this helps. Here is the picture for bidirectional search from Russel & Norvig’s “Artificial Intelligence – A Modern Approach”:

To explain this image, imagine we have no idea where the goal is. So we start branching out from the start point, exploring all directions. The longer this keeps on going, the bigger area we have to explore and the more we’ll slow down. If we also search from the goal, we can cut that time down dramatically. Instead of having to make one very big circle, we can make two small circles. In this picture the circles are about to touch, and as soon as they touch it’s an easy exercise to connect the circles and to draw a single path connecting the start to the goal.

With this picture in mind you can also see why so much of the advice above is about finding different starting points: If we had multiple starting points, chances are good that the circles can be even smaller. The further we move on from a starting point the more expansion slows (because the number of paths grows proportional to the area which grows at the square of the radius) so you want to be incremental (pick a goal that’s not too far away) and you may want to try multiple starting points.

Now strictly speaking bidirectional search is not a valid thing to do when doing hill climbing, because in hill climbing we have no idea where the goal is. But usually when doing research you have at least some idea of what you’re looking for or what you expect to find. Or you have some idea of what would overcome the current thing that you’re stuck on. Sometimes it helps to just make goals up. Meaning literally say “it would be really helpful if X was true” and then work backwards, try to figure out what you would need to make X true. Making good guesses as to what are good points to work backwards from is something that takes practice.

We all work at some level of abstraction, but sometimes you need to dive deeper and get into the lower levels. Meaning you need to take apart the machine that you’re working with and put it back together. Hook the sensors up directly to your computer instead of a separate display. (so you can write your own display code) Step through the lower level code. Write your own version of the lower level code. Multiply the equations all the way out. Run through them with real world numbers instead of using abstract symbols.

Meaning do the work that the people who provided you with your tools did.

Here is Bob Johnstone talking about Nobel laureate Shuji Nakamura:

Modifying the equipment was the key to his success… For the first three months after he began his experiments, Shuji tried making minor adjustments to the machine. It was frustrating work… Nakamura eventually concluded that he was going to have to make major changes to the system. Once again he would have to become a tradesman, or rather, tradesmen: plumber, welder, electrician — whatever it took. He rolled up his sleeves, took the equipment apart, then put it back together exactly the way he wanted it. …

Elite researchers at big firms prefer not to dirty their hands monkeying with the plumbing: that is what technicians are paid for. If at all possible, most MOCVD researchers would rather not modify their equipment. When modification is unavoidable, they often have to ask the manufacturer to do it for them. That typically means having to wait for several months before they can try out a new idea.

The ability to remodel his reactor himself thus gave Nakamura a huge competitive advantage. There was nothing stopping him; he could work as fast as he wanted. His motto was: Remodel in the morning, experiment in the afternoon. …

Previously he had served a ten-year self-taught apprenticeship in growing LEDs. Now he had rebuilt a reactor with his own hands. This experience gave him an intimate knowledge of the hardware that none of his rivals could match. Almost immediately, Nakamaura was able to grow better films of gallium nitride than anyone had ever produced before.

(from Brilliant!: Shuji Nakamura And the Revolution in Lighting Technology, p 107)

I want to caution against being too eager about this. You can waste huge amounts of time diving into the deeper levels. There is infinite amount of work down there, and there are reasons why we work at higher levels. The approach for this is to do the smallest dive possible. Only if that doesn’t work should you dive into the lower levels for longer amount of times. (the quote above mentions how Shuji Nakamura was frustrated for three months before he decided to dive deeper. That sounds like a reasonable amount of time)

A related problem is that sometimes you need to doubt the lower levels, but you have to be especially careful about this. But it does happen that the lower level formulas are wrong about something. Even the laws of physics still have holes in them which we have to fill up with Dark Matter and Dark Energy. That doesn’t mean that you should immediately question those laws of physics. You should do the smallest intervention possible and dive one level down. Don’t ever skip levels. Meaning first question whether something in your experiment is wrong. Then question whether your equipment is wrong, then maybe question if a formula from a previous paper is wrong, then slowly work your way down. Only if no higher level mistake can explain your observations should you keep on diving deeper. Think of it as detective work. There are heuristics for what to doubt (“how many known problems does this have?” “how much would break if this changed?”) but you will often follow the heuristics automatically if you just work one level at a time. In computer science this still happens with some regularity, and here is a good read about somebody who did this properly and worked his way slowly through every level until they could conclude that they found a hardware bug.

Sometimes it helps to show your unfinished idea to someone who is going to hate it. It’s a very unpleasant experience to do this. But if you do this you will hear all the many reasons why your idea can’t possibly work and why you should just abandon it right now. This can do two things: 1. It can actually increase your resolve to fix this problem. (I’ll show that idiot who thinks this can’t be done) 2. It brings up areas that you have avoided so far. Somehow, people who hate your idea are really good at finding open wounds that they can drive their thumb into to hurt you. Often times those open wounds are what you have avoided even though it’s exactly what you should be working on, as unpleasant as that may be. It sucks when somebody tells you “your idea sucks because it can’t deal with X” because you suspect that it’s true and you have unconsciously avoided dealing with X so far. But it can feel great when you then go back and finally tackle X and it turns out that you find a really elegant way to solve that problem, proving the idiot hater wrong and making some progress while you were at it.

The Internet is a great source for this kind of negativity. Sometimes coworkers and friends can identify your problem spots in a nicer way, but the problem with coworkers and friends is that they often have the same mindset as you. You can avoid that by asking new people in your group for advice. You have to get them when they’re still in the “why the heck do we do it like this?” stage, before they have advanced to the acceptance of the “this is just how we do things here” stage. So it’s tricky. (the two stages may not be this obvious) The most reliable way to get criticism is to ask someone who will hate your idea.

I started this section off with making this sound totally sucky. Because it usually is, and to do this you have to be ready for the unpleasant emotions. But this can actually be a more or less sucky experience, depending on who you get the criticism from. When you’re on one side of an argument, it’s easy to find someone on the other side who is a bit of an idiot and then you point and laugh and say “look at how much of an idiot they are on the other side.” That is easier for you to do, but it’s harder to learn from that. You have to put in more work to understand their point. And even though you’ll dismiss it, it will still negatively affect your mood. The better way to go is to find a smart person on the other side who can articulate themselves well. Ideally they can even state your viewpoint pretty well and can still tell you why their side is right. It’s easier to learn from those people, but you won’t naturally seek them out because you’ll learn all the parts where you are wrong.

If you’re out of all other options, sometimes you should just do something stupid.

Do something that would never work. Do something that might work, but it’s obviously inefficient or inelegant. Add five special cases. Do something hand-wavy that would never survive peer-review. Assume something that you can’t justify assuming. Do something where you already know three cases where it won’t work. Sometimes those surprise you by unexpectedly working or by giving you an answer that is almost right.

Do you have an idea that probably won’t help and it involves going through fifty cases that take an hour of tedious work each? Sometimes you just gotta do it, even if it probably won’t help. Repetition helps understanding, so maybe you will discover a new angle. Or maybe you will find ways to automate the work.

If all of the other advice for getting unstuck hasn’t helped, doing something stupid can help. In the hill climbing analogy it’s taking a step downhill. Or spending way too much work on a sideways step. The idea is to specifically do what you have tried to avoid doing. Obviously don’t do this as your first attempt at getting unstuck.

If after this last point you’re still stuck, maybe try being more incremental. Maybe the thing you’re trying to do is just not ready to be tackled yet. Find a half-way goal and aim for that. Otherwise I’ll talk about making progress next, and there may be more hints there.

In this part I will talk about the normal day to day things that you should do all the time. Why didn’t I put this before the “getting unstuck” section? Because getting unstuck is more interesting and now that I have your attention, I can spend it on making you read things that you should do every day.

Polya’s book “How to Solve It” has a chapter called “Wisdom of Proverbs” in which he talks about some of these always applicable things using proverbs. I kinda like that. It’s cute. So I will quote his proverbs when appropriate.

This one is trivial, because this is what we have been talking about for the whole list. Going up means taking one step towards your goal.

Even though this is obvious, I often catch myself doing this wrong. I’ll be thinking way too much about all the possible paths I could take and which problem I would encounter where, that I never actually end up doing a step. For me as a programmer a step may just mean “start writing some code.” (and don’t worry too much about organizing for now) Or it may just mean “work through a few cases” or just doing anything that gets you to actually do something as opposed to just thinking about it. Doing helps with thinking. I’ve found that solutions just come automatically as soon as I start working. Half the problems I worried about never actually show up. Half of the remaining problems end up being simple. Just start doing a step that seems to go uphill. (there’s actually an AI technique for this called Stochastic Hill Climbing which relies on the same insight that sometimes it’s too much work to find the best path and you should just choose any path)

Being lucky is a skill that you can learn. And it’s actually a fairly easy skill to learn. That may sound surprising to some people (especially to unlucky people) but it’s true. Here, Richard Wiseman writes about his research into luck. What he did is he found people who thought of themselves as especially lucky or especially unlucky and he asked them a lot of questions. Here are a few excerpt from the article that gives you a good idea for what he found:

I gave both lucky and unlucky people a newspaper, and asked them to look through it and tell me how many photographs were inside. On average, the unlucky people took about two minutes to count the photographs, whereas the lucky people took just seconds. Why? Because the second page of the newspaper contained the message: “Stop counting. There are 43 photographs in this newspaper.” This message took up half of the page and was written in type that was more than 2in high. It was staring everyone straight in the face, but the unlucky people tended to miss it and the lucky people tended to spot it.

For fun, I placed a second large message halfway through the newspaper: “Stop counting. Tell the experimenter you have seen this and win £250.” Again, the unlucky people missed the opportunity because they were still too busy looking for photographs.

[…]

And so it is with luck – unlucky people miss chance opportunities because they are too focused on looking for something else. They go to parties intent on finding their perfect partner and so miss opportunities to make good friends. They look through newspapers determined to find certain types of job advertisements and as a result miss other types of jobs. Lucky people are more relaxed and open, and therefore see what is there rather than just what they are looking for.

My research revealed that lucky people generate good fortune via four basic principles. They are skilled at creating and noticing chance opportunities, make lucky decisions by listening to their intuition, create self-fulfilling prophesies via positive expectations, and adopt a resilient attitude that transforms bad luck into good.

[…]

In the wake of these studies, I think there are three easy techniques that can help to maximise good fortune:

- Unlucky people often fail to follow their intuition when making a choice, whereas lucky people tend to respect hunches. Lucky people are interested in how they both think and feel about the various options, rather than simply looking at the rational side of the situation. I think this helps them because gut feelings act as an alarm bell – a reason to consider a decision carefully.
- Unlucky people tend to be creatures of routine. They tend to take the same route to and from work and talk to the same types of people at parties. In contrast, many lucky people try to introduce variety into their lives. For example, one person described how he thought of a colour before arriving at a party and then introduced himself to people wearing that colour. This kind of behaviour boosts the likelihood of chance opportunities by introducing variety.
- Lucky people tend to see the positive side of their ill fortune. They imagine how things could have been worse. In one interview, a lucky volunteer arrived with his leg in a plaster cast and described how he had fallen down a flight of stairs. I asked him whether he still felt lucky and he cheerfully explained that he felt luckier than before. As he pointed out, he could have broken his neck.

I can’t overstate how important this stuff is. Half of the advice from this blog post is due to me being lucky. For example the way I found Kenneth Stanley’s great talk “Why Greatness Cannot Be Planned: The Myth of the Objective” was that I was following Bret Victor on Twitter (or maybe it was from the RSS feed of his quotes page) because he is a constant source of new perspectives. That lead me to watch this talk by Carver Mead about a new theory of gravity. Which I watched even though I have no reason at all to look into this. I barely know any physics. But come on, a new theory of gravity. And it’s supposed to be simpler than Einstein’s theory while still making all the same predictions. That’s interesting. Then I went to find out more about the conference that that talk was given at and finally stumbled onto Kenneth Stanley’s talk.

None of these steps have any obvious practical benefit for me, but they lead me to a great talk, which coincidentally has the best demonstration I have ever seen about why you should behave in exactly this way.

Being lucky can mean that you never actually find what you’re looking for. You may find something else entirely. The list of scientific discoveries that were made “accidentally” is long. But you need to learn to be lucky, otherwise you will miss those chances when you encounter them.

Here are Polya’s proverbs for this topic:

Arrows are made of all sorts of wood.

As the wind blows you must set your sail.

Cut your cloak according to the cloth.

We must do as we may if we can’t do as we would.

A wise man changes his mind, a fool never does.

Have two strings in your bow.

A wise man will make more opportunities than he finds.

A wise man will make tools of what comes to hand.

A wise man turns chance into good fortune.

The title of this section is referring to the Woody Allen quote “80 percent of success is showing up”. This means showing up to work every day and working on a problem. Thomas Edison is supposed to have said that “ninety per cent of a man’s success in business is perspiration.”

“Showing up” can be more broadly applied: Show up to conferences. Show up to lunch with coworkers because that’s where you will have good discussions. Show up to dinner parties because that’s where you might meet people who can give you fresh ideas. Write the papers you’re supposed to write. Read the papers you’re supposed to read.

Part of this is to “be lucky” as in the point above. You can’t be lucky if you don’t show up. So you also want to get yourself into environments where you can show up to all these events. There is a reason why good research rarely comes out of some small town in the middle of nowhere: There are not enough opportunities to show up to out there. You want to at least live in a college town or a big city.

Here is Polya’s list of proverbs for this section:

Diligence is the mother of good luck.

Perseverance kills the game.

An oak is not felled at one stroke.

If at first you don’t succeed, try, try again.

Try all the keys in the bunch.

One final thing that I should point out is that I intentionally didn’t call this section “work hard.” I think that “show up” is a better advice. This is not about working 80 hour weeks. It’s about showing up to work on a problem every day.

The term “iteration time” is a standard term in video game development which roughly measures how much time passes between being finished with a change and seeing the change in the game. So for me as a programmer I make a change, then I have to compile the code, launch the game, get to a point where I can test my change and then test my change. Let’s say compiling takes ten seconds, launching the game takes twenty seconds, and getting to my test setup takes another ten seconds, then my iteration time is 40 seconds. So if I decide to make another small change, I have to wait another 40 seconds before I can see the result. If I can cut the compile time in half then my iteration time is just 35 seconds, which is a good improvement. If I can create a test setup that doesn’t require the whole game to boot then maybe I can get my iteration time down to just 15 seconds.

At the beginning of this blog post I talked about how research is often characterized by slow progress. Exploring one path might take you a week before you find out that it’s a dead end. You shouldn’t just accept that. You should find ways to reduce that time.

Improving iteration time helps in many non-obvious ways: If you can improve iteration times, you can make it cheaper to make mistakes. If an experiment takes you two hours, you probably don’t want to make a mistake and you’ll be very careful. If you can do the experiment in a minute, then some mistakes are OK and you can play around more. But even if you just reduce it from two hours to one hour and 45 minutes, that will still improve your work a little bit. And maybe you can find more improvements after that.

Now you have to invest time to save time, so sometimes it’s not worth it. But sometimes you’ll be surprised. I’ve had an argument about improving iteration times below two seconds. The other person argued that if your iteration time is only two seconds, how much time are you going to save by reducing the iteration time to one second? (and how much effort do you need to invest to achieve a 50% reduction?) But what happens is that when you reduce iteration times, you work differently. If your iteration time is milliseconds, all of a sudden you can work entirely differently. You can try several alternatives per second and create an interactive animation showing the alternatives. You can try different parameters in real time and see what happens. You can show several different variations of the problem on the screen at the same time. At some point you can write a program that just explores a million options and gives out the best one. (but then ironically that program would have slow iteration times, so maybe an interactive tool would be better)

Improving iteration times is a lot about automation, but often it’s also just about being observant as to where you are losing time. You can apply a lot of lessons from factories here. Standardize processes, specialize, batch your work etc. Also if you don’t know how to program, then you should probably learn how to. It’s easier now than it ever was. And to automate simple tasks like “entering numbers into an excel sheet” you don’t need a full computer science education.

Here is something you should do whenever you’re finished with a step: See what the implications of that step are beyond the specific step. See if it has broader applications. Here is Claude Shannon talking about this:

Another mental gimmick for aid in research work, I think, is the idea of generalization. This is very powerful in mathematical research. The typical mathematical theory developed in the following way to prove a very isolated, special result, particular theorem – someone always will come along and start generalizing it. He will leave it where it was in two dimensions before he will do it in N dimensions; or if it was in some kind of algebra, he will work in a general algebraic field; if it was in the field of real numbers, he will change it to a general algebraic field or something of that sort. This is actually quite easy to do if you only remember to do it. If the minute you’ve found an answer to something, the next thing to do is to ask yourself if you can generalize this anymore – can I make the same, make a broader statement which includes more – there, I think, in terms of engineering, the same thing should be kept in mind. As you see, if somebody comes along with a clever way of doing something, one should ask oneself “Can I apply the same principle in more general ways? Can I use this same clever idea represented here to solve a larger class of problems? Is there any place else that I can use this particular thing?”

In this talk Clay Christensen points out that generalizing makes it easier to prove yourself wrong. When you generalize your concept, you have more examples to test it against, and you can use those examples to improve your theory. If you test it against a new example and your theory doesn’t work, you have to either define the limits of your theory, or you have to explain why it sometimes behaves differently. (and then sometimes these new explanation help explain oddities in your original data)

There is also a quote by Feynman which I can’t find right now where he essentially says “if you’re not generalizing, then what’s the point?” With the reason being that the only way that science progresses is to make guesses beyond the specifics of what we observed.

One word of caution about this is that you can also be over-eager about this. Don’t try to find a pattern if all you have is two examples. (or god forbid only one example) You usually want to generalize after you’ve seen three or four examples of something. Of course the trick is in recognizing that three apparently different things are actually examples of the same thing.

This is another thing that you probably do automatically, but it’s worth pointing out: When you’re moving forward, you should try hard to keep moving forward. The usual example of this is when you were stuck for a while: Once you’re over the hurdle, you should keep working on the thing that got you over the hurdle, because you can probably make more progress there.

Csikszentmihalyi talks about the concept of “Flow” in relation to this, which is a highly focused mental state that you enter when you’re doing concentrated work. You want to stay in that state.

The easiest way to get this wrong is to get stuck on small bumps. There are plenty of small speed bumps along the way that will just slow you down. If there is something in a paper that you don’t understand, ignore it and keep reading. Maybe it will become clear later. If you already know something to be true, but proving it is tricky, skip over the proof. You can fill in the gaps later. If you’re working an algorithm but an edge case is driving you nuts, don’t handle the edge case now. Just solve the cases that you actually need.

It’s important that you revisit each of these points later to fill in the gaps (because sometimes good discoveries hide in small irregularities) but you shouldn’t let a small speed bump stop you when you were making good progress before.

Here are Polya’s proverbs for this, the first one being ironic:

Do and undo, the day is long enough.

If you will sail without danger you must never put to sea.

Do the likeliest and hope the best.

Use the means and God will give the blessing.

This is the opposite advice of the previous point, but what can I say. Sometimes you gotta keep on moving forward, sometimes you have to be careful. Often you have to do both.

You can waste a huge amount of time if you mess up a step and never notice.

Polya’s questions for this are “Carrying out the plan of the solution, check each step. Can you see clearly that the step is correct? Can you prove it?”

You get better at this with experience. As you gain more experience, you will just intuitively avoid problems. So if it looks like a very experienced person isn’t checking every step, it may just be that they have taken steps like this a thousand times before.

On the other hand for me personally I feel like I’ve become more and more careful the more I have programmed. My changelists these days tend to be smaller than they used to be. I rarely make huge changes nowadays. Instead I try to make many smaller steps, each of which I can reason about.

The other thing I’d like to mention in this context is that sometimes slowing down can help. Sometimes if you have to make a decision, it’s best to wait for a while before making it. Try to work around it and get a better lay of the land. This is why procrastination sometimes works. Sometimes with delay the correct choice becomes clear. Sometimes all you’re doing is delaying though…

Polya’s proverbs for this section are these:

Look before you leap.

Try before you trust.

A wise delay makes the road safe.

Step after step the ladder is ascended.

Little by little as the cat ate the flickle.

Do it by degrees.

This is a point that I can’t possibly do justice to. Whole books have been written about how to form effective teams, so my advice in a blog post like this has to be hopelessly incomplete.

Research has the curious character where it’s often better when done by yourself. Kenneth Stanley has an amazing illustration of the damage that committees can do to research in his talk. (same talk that I keep referring to) If you have to constantly justify what you’re doing, you won’t do the exploration that’s necessary to actually get anywhere. Yet at the same time none of the pictures that he shows in his talk are the result of people working alone. So how do we square that circle?

Research about effective teams has shown that one of the most important things is emotional safety. You should be safe to speak up, safe to ask stupid questions, safe to follow hunches, safe to take a risk, and safe to admit mistakes. If you make decisions by committee, none of these things are true because you have to constantly justify what you are doing, and you have to constantly compete with others to make sure that your priority is still everyone’s priority.

One piece of advice that I like for this is the practice of “Yes, And” from improv comedy. If somebody has an idea, you can’t say “no that’s stupid.” (or use a more subtle way to shut it down) You have to say “yes”, and you have to add something to it to keep the idea alive. I have the idea from this talk by Uri Alon, who gives the following example:

We were stuck for a year trying to understand the intricate biochemical networks inside our cells, and we said, “We are deeply in the cloud,” and we had a playful conversation where my student Shai Shen Orr said, “Let’s just draw this on a piece of paper, this network,” and instead of saying, “But we’ve done that so many times and it doesn’t work,” I said, “Yes, and let’s use a very big piece of paper,” and then Ron Milo said, “Let’s use a gigantic architect’s blueprint kind of paper, and I know where to print it,” and we printed out the network and looked at it, and that’s where we made our most important discovery, that this complicated network is just made of a handful of simple, repeating interaction patterns like motifs in a stained glass window.

(the term “being in the cloud” is what I would call being stuck in a local maximum using the hill climbing analogy)

Other important things are having a clear, well communicated vision for what you’re trying to do. This doesn’t have to be a specific goal, but it should at least be a direction. That way all the creative attempts that people are taking in your group (because it’s safe for them to do so) will automatically work together. Competing goals within the group can be really harmful here, so you want to resolve disagreements. And changing the vision can also be really harmful. If you have to change direction, you have to communicate that very well.

The final thing is that diversity has been shown to help. Which makes sense if you look at how much of the advice above is about finding different view points.

Whew, you’ve made it to the end and you’ve made a discovery. Now make sure to look back. Polya has these questions for you:

Can you derive the result differently? Can you see it at a glance? Can you use the result, or the method, for some other problem?

The last question aims at the “generalizing” point I have talked about above. But the moment just after you have finished is often the moment where you can do your best work. You can flatten out all the bumps that accumulated in your work over time. You can straighten out the lines, clean up the formulas. Maybe something that seemed odd before now makes a lot of sense and offers a hint for further research. This is the time where you can turn this result into something really good that others will actually want to use. Take some extra time here.

Polya’s proverbs are “He thinks not well that thinks not again.” And “Second thoughts are best.” He also says that it’s really good if you can, with the benefit of hindsight, find a second way to derive the result. “It is safe riding at two anchors.”

You have reached the end of my list. If you still haven’t had enough, here are some of my sources. Otherwise the conclusion is below.

George Polya – How to Solve It – A New Aspect of Mathematical Method

This book is written by someone who was thought carefully about how we solve problems. If I hadn’t read this book, I couldn’t have noticed other patterns.

Kenneth Stanley – Why Greatness Can Not Be Planned – The Myth of the Objective

I love this talk because he explains everything with pictures. For example when he shows the pictures that you get from voting compared to the pictures you get from individual exploration, it really is better than a thousand words about the subject could be.

Rich Hickey – Hammock Driven Development

This has more insights about focused mode and diffuse mode than I actually used in this blog post. I think this is also the place where I first heard about “How to Solve It.”

Claude Shannon – Creative Thinking

This is a transcript of a talk that Claude Shannon gave. The good section is the part about his tricks for doing research. I suspect that the text got messed up by some kind of automatic digitization method, so if somebody has a better source, I would be very thankful.

Robert Galager talking about Claude Shannong

This talk builds on the above list and adds more tricks that Claude Shannon used. Some of those I didn’t mention because I didn’t talk about how to find good topics.

One thing I wish I could link to is a talk or article that generalizes from AI methods to scientific research. I did some of that above, but I have no sources for that other than my own interpretations. I could link you to AI books but they typically spend a very small amount of time on hill climbing.

I don’t think my list is complete, but I think I have a pretty good sample. For example I have not read any of Csikszentmihalyi’s work. I’m sure I could add at least one or two points to my list if I did. But as I kept adding things over the years, I was frustrated by how few people seem to know these things. For example I referred to a TED talk above that talks about being stuck, and the guy doesn’t refer to Polya. And Polya’s “How to Solve It” simply has the best list for getting unstuck, so it should always be mentioned when you’re talking about being stuck. After I saw a few incomplete opinions like that, I decided I had to write this blog post, even if my own list was also incomplete.

The list is necessarily short because it’s a blog post and it’s intended as something that you can re-read the next time that you’re having problems.

There are several directions that a list of “advice for doing research” could be expanded. For example I could talk about heuristics for identifying good research, (it seems solvable, the old theory has known problems, it would simplify things, the underlying conditions have changed, it would help someone, your unconscious keeps on drawing you back to it…) or I could talk about progress and about what you should do in which stage of research (Clay Christensen talks about that here) but I had to stop at some point, and having a list of tricks and habits seems like a good thing to have.

If you’ve made it to the end of this blog post, then I thank you very much for reading. I recommend that you come back here every once in a while to re-read the list. It’s what I’m doing with Polya’s book.

]]>

That is until recently, when I came across the paper Imaginary Numbers are not Real – the Geometric Algebra of Spacetime which arrives at quaternions using only 3D math, using no imaginary numbers, and in a form that generalizes to 2D, 3D, 4D or any other number of dimensions. (and quaternions just happen to be a special case of 3D rotations)

In the last couple weeks I finally took the time to work through the math enough that I am convinced that this is a much better way to think of quaternions. So in this blog post I will explain…

- … how quaternions are 3D constructs. The 4D interpretation just adds confusion
- … how you don’t need imaginary numbers to arrive at quaternions. The term will not come up (other than to point out the places where other people need it, and why we don’t need it)
- … where the double cover of quaternions comes from, as well as how you can remove it if you want to (which makes quaternions a whole lot less weird)
- … why you actually want to keep the double cover, because the double cover is what makes quaternion interpolation great

Unfortunately I will have to teach you a whole new algebra to get there: Geometric Algebra. I only know the basics though, so I’ll stick to those and keep it simple. You will see that the geometric algebra interpretation of quaternions is much simpler than the 4D interpretation, so I can promise you that it’s worth spending a little bit of time to learn the basics of Geometric Algebra to get to the good stuff.

OK so what is this Geometric Algebra? It’s an alternative to linear algebra. Instead of matrices, there are multiple kinds of vectors, and there is a more powerful vector multiplication.

Let’s start with vector multiplication. In linear algebra we know two ways to multiply vectors: The dot product (producing a scalar) and the cross product (producing a vector). Where the dot product works for any number of dimensions, and the cross product only works in 3D. Geometric algebra also uses the dot product, but it adds a new product, the wedge product: . The result of the wedge product is not a vector or a scalar, but a plane. Specifically it’s the plane spanned by the two vectors. This plane is called a bivector because it’s the result of the wedge product of two vectors. There is also a trivector which describes a volume. The general principle is that the wedge product increases the dimension of the vectors by one. Vectors (lines) turn into bivectors (planes), and bivectors turn into trivectors (volumes). When we do math in more than 3 dimensions, we can go even higher, but I’ll stick to 2D and 3D for this blog post.

Before I tell you how to actually evaluate the wedge product, I first have to tell you the properties that it has:

- It’s anti-commutative:
- The wedge product of a vector with itself is 0:

The first property will make sense when we talk about rotations. The second product should already make sense if we just think of a bivector as a plane. There is no plane between a vector and itself, so it’s 0.

The other thing I have to explain is how vector multiplication works: In geometric algebra, the vector product is defined as the dot product plus the wedge product:

The result of the dot product is a scalar, and the result of the wedge product is a bivector. So how do we add a scalar to a bivector? We don’t, we just leave them as is. It works the same way as when adding polynomials or when adding apples and oranges or when working with complex numbers: . We just leave both terms.

Note that usually I will leave out the star and just write .

In 3D space we have three basis vectors:

When multiplying these with each other we notice three properties of this new way of multiplying:

So when multiplying the basis vectors with each other, either the dot product or the wedge product is zero. We are left only with one of the two.

All other vectors can be expressed using the basis vectors. So the vector can also be written as and I will use the second notation more often, because it makes multiplication easier.

With that out of the way, we can finally give one real example of how vector multiplication works in geometric algebra. It’s actually pretty simple because we just multiply every component with every other component:

Let’s walk through a few of the steps I did there:

- because .
- because , so the scalar part is zero, and can write the wedge product of basis-vectors shorter as . This short-hand notation is only valid for vectors which are orthogonal to each other.
- because

So as promised the result of multiplying two vectors is a scalar () and a bivector (). A sum of different components like this is called a multivector.

When doing these multiplications you quickly notice that just as all vectors can be represented as combinations of , and , all bivectors can be represented as combinations of , and . So I’ll just use these as my basis-bivectors. We could make different choices here, for example we could use instead of but I like how the bivectors circle around like that. The choice of bivectors doesn’t really matter, just as the choice of basis-vectors doesn’t really matter. We could for example have also chosen , and as our basis vectors. All the math works out the same, we just get different signs in a few places.

Once we have three basis-vectors and three basis-bivectors, we notice that we can represent all 3D multivectors as combinations of 8 numbers: 1 scalar, 3 vector-coefficients, 3 bivector-coefficients and 1 trivector-coefficient. If we did the same exercise in a different number of dimensions, we would find similar sets of numbers. In 2D space for example we have 1 scalar, 2 vector-coefficients and 1 bivector-coefficient. That makes sense, because in 2D there are only 2 directions, only 1 plane and no trivector because there is no volume. If we went to 4D we would have 1 scalar, 4 vector-coefficients, 6 bivector-coefficients, 4 trivector-coefficients and 1 quadvector-coefficient. I’m sure you can spot the pattern that would allow you to go to any number of dimensions. (but really these come out naturally depending on how many orthogonal basis-vectors you start with)

We’re almost finished with our introduction to geometric algebra, so I need to mention one final important property: vector multiplication is associative. Meaning so we can choose which multiplication we want to do first.

OK with that we’re finished with the introduction, but I want to practice a few more multiplications so that you get the hang of it. Maybe do a few yourself. It takes a couple minutes, but then you have the rules ingrained into muscle memory. This practice section is optional though.

Let’s do some practice runs to build up an intuition for how these vectors and bivectors behave. You can skip this section entirely if you don’t care about geometric algebra and just want to get to rotations.

What happens if we multiply two similar bivectors?

So what I did there is I used to re-order the basis-elements. Then everything collapses down because . So what we see here is that the dot product of a bivector is a negative number. Isn’t that interesting? In particular if we have a bivector of length 1 and multiply it with itself: we see that . Remember how in quaternions there are these three components , and which have ? We’re going to be using the bivectors for that. However it just so happens that the bivector is a mathematical construct whose square is -1. That does not mean that it is the result of . I could build any number of mathematical constructs that square to -1, (for example trivectors also square to minus one) that doesn’t mean that they are all the square root of -1. How many square roots is -1 supposed to have?

Speaking of squaring a trivector, let’s try that to get practice at re-ordering these components:

Getting the hang of it yet? It’s all about re-ordering components until things collapse.

Let’s try multiplying two different bivectors:

The result of two bivectors is another bivector. If we have more complicated bivectors that are made up of multiple basis-bivectors, the result is a scalar plus a bivector:

So this is a scalar () plus quite a complicated bivector ().

What happens if we multiply across dimension. Like multiplying a vector with a bivector?

If we multiply the plane with a vector that’s on the plane, we get another vector on the plane. In fact if we do this a few more times:

We notice that after four multiplications we are back at the original vector . So every multiplication with a bivector rotates by 90 degrees. If we multiply on the left side instead of multiplying on the right side, we would rotate in the other direction.

What if we multiply the plane with a vector that’s orthogonal to it?

Well that’s disappointing, we just get the trivector. What if we multiply the trivector with the plane?

If we multiply the trivector with the plane, the plane collapses and we’re left with just the vector that’s normal to the plane. This works even for more complicated bivectors:

Which is the normal of the original plane. What if we multiply a vector with the trivector?

If we multiply a vector with the trivector, the vector part collapses out and we’re left with the plane that the vector is normal to. This works even for more complicated vectors:

And with that we’re back at the original plane. Almost. The sign got flipped. If we had multiplied by we would have been back at the original plane.

So multiplying with the trivector turns planes into normals and normals into planes, because the other dimensions collapse out. This also allows us to define the cross product in geometric algebra: . So first we build a plane by doing the wedge product, then we get the normal by multiplying with the trivector.

If you went through the practice chapter you will have already seen places where geometric algebra does rotations: bivectors rotate vectors on their plane by 90 degrees. It’s not quite clear how we can build arbitrary rotations with that though.

One thing that’s a little bit easier to do is reflections, and we will see that we can get from reflections to rotations.

Let’s say we want to reflect the vector a in the picture below on the normalized vector r, to get the resulting vector b:

To do that it’s useful to break the vector a into two parts: The part that’s parallel to r, and the part that’s perpendicular to r, :

(forgive my crappy graphing skills)

These have a few properties:

(the result is a scalar and we can flip the order)

(the result is a bivector and flipping the order flips the sign)

From the picture it should be clear that if we subtract instead of adding it, we should get to . Or in other words:

So how do we get these and vectors? You may already know how to do it, but we actually never need to explicitly calculate them. Because we can actually represent this reflection as

How do we get to that magical formula? Let’s multiply it out:

The important step is that , allowing us to re-order the elements until we’re left with which is just 1, as long as is normalized.

The reflections above look kinda like rotations. In fact if all we want to do is rotate a single vector, we can always do that with a reflection. The problem is if we want to rotate multiple vectors, like in a 3d model, then the rotated model would be a mirror version of the original model.

The solution to that is to do a second reflection. There are many possible pairs of reflections that we could choose, but here is an easy one. First we reflect on the half-way vector between and , (where writing pipes around a vector like is the length of the vector, so is a normalized vector):

So in this picture I am reflecting on the vector , which is half-way between and , landing us at . To get from to we just have to do a second reflection with the vector itself. (which is a bit weird, but if you follow the equations it works out) Given that is one reflection, is two reflections. First we reflect on , then we reflect on .

Earlier we chose . We can multiply this out and define

Then the rotation is written as (where you could work out by multiplying out the other side, or you can just flip the sign on the bivector parts of ), and the inverse is written as .

And just like that we have quaternions. How? Where? I hear you asking. That part in the last equation is a quaternion. If you multiply it all out, you will find that all the vector parts and trivector parts collapse to 0, and you’re just left with the scalar part and the bivector coefficients. And it just so happens that if you have a multivector which consists of only a scalar and the bivectors, multiplication behaves exactly like multiplication of quaternions.

Now isn’t that interesting? All we did was we did the math for reflections, and if we do two of those we get quaternions? No imaginary numbers, no fourth dimension, just 3d vector math. All we had to do was introduce that wedge product .

And you’ll notice that the way we apply , by doing looks an awful lot like how we multiply quaternions with vectors. To multiply a quaternion with a vector we do .

OK so let’s convince ourselves that these really are quaternions and work out the quaternion equations. They are . Our quaternion consists of a scalar and three bivectors, , , and . (I use them in this order because the plane rotates around the x axis, so it should come first). So let’s try this:

.

Seems to work so far. But I actually don’t fulfill the equation because for me . I could fix that by choosing a different set of basis-bivectors. For example if I chose , and , then this would work out because . But I kinda like my choice of basis vectors and all the rotations work out the same way. If this bothers you, just choose different basis bivectors.

One super cool thing is that when doing the derivations using reflections, I never had to specify the number of dimensions. We could use 3D vectors or 2D vectors or any number of dimensions. So if we work out the math in 2D, what do you think we get? That’s right, we get complex numbers: One scalar and one bivector. Because that’s how you do rotations in 2D. But we could go to any number of dimensions using this method. (except in 1D this kinda collapses, because you can’t really rotate things in 1D)

Also we didn’t specify what we are rotating. We assumed that it was a vector, but we never required that. So this can rotate bivectors and it can rotate other quaternions.

So we found a new way to derive quaternions. This new way is neat because we don’t need 4 dimensions and we don’t need imaginary numbers. But can we learn anything new from this? Already we have two possible new interpretations:

- A quaternion is the result of two reflections
- A quaternion is a scalar plus three bivectors

Maybe one of these has some interesting conclusions.

Before that I want to kill the 4D interpretation properly: There are two reasons why people say quaternions are 4D: The fact that quaternions have four numbers, and the fact that quaternions have double cover. I’ll talk about the double cover separately later, but here I briefly want to talk about the four numbers thing. There are lots of 3D constructs that have more than three numbers. For example a plane equation has four numbers: . Or if we want to do rotations using matrices in 3D, we need a 3×3 matrix. That’s 9 numbers. But nobody would ever suggest that we should think of a rotation matrix as a 9 dimensional hyper-cube with rounded edges of radius 3. So don’t think of quaternions as a 4 dimensional hypersphere of radius 1. Yes, there are some useful conclusions to draw from that interpretation (for example it explains why we have to use slerp instead of lerp) but it’s such a weird interpretation that it should come up very rarely.

With that out of the way let’s get to these two new interpretations:

1. Interpreting quaternions as two reflections. I couldn’t get much useful out of this. The first reflection is always on the vector half-way between the start of the rotation and the end of the rotation. The second reflection is always on the end of the rotation. I’ve played around with visualizing that, but the visualizations always looked predictable and didn’t offer any insights.

2. Interpreting quaternions as a scalar plus three bivectors. This interpretation on the other hand turned out to be a goldmine. Not only can you get an intuitive feeling for how this behaves, you can also get visualizations from this. This interpretation also allowed me to get rid of the double cover of quaternions.

So even though we have derived quaternions using reflections above, I will actually spend the rest of the blog post talking about quaternions as scalars and bivectors.

A quaternion is made up of a scalar and three bivectors. We all know what a scalar does: Multiplying with a scalar makes a vector longer or shorter. I said above that multiplying with a bivector rotates a vector by 90 degrees on the plane of the bivector.

So how can we build up all possible rotations if all we have is a scalar and three rotations of exactly 90 degrees? The answer is that a bivector actually does slightly more: It rotates by 90 degrees, and then scales the vector.

I said that a bivector is a plane. But because of its rotating behavior, I actually like to visualize it as a curved line. So I visualize a vector as a straight line, and a bivector as a 90 degree curve. So here is a visualization of three different bivectors:

These are the bivectors (bottom), (middle) and (top). It’s a 90 degree rotation followed by a scale. I find this visualization particularly useful when chaining a bunch of operations together.

For example let’s say we want to rotate by 45 degrees on the xy plane. To do that we can multiply a vector with the quaternion . (that 0.707 is actually , but I’ll truncate it to 0.707 here) Now let’s multiply the vector with that quaternion. That gives us

Here’s how I would visualize that:

First we rotate by the bivector to get :

So the bivector is a rotation by 90 degrees followed by a scale of 0.707.

Next we multiply the original vector with the scalar to get the vector , which we add to the previous result:

Which then gives us the final vector of :

Which is the original vector rotated by 45 degrees.

This way of visualizing makes it very clear that multiplication with a quaternion is just multiplication with a scalar and multiplication with a bivector. And this also shows how we got a 45 degree rotation, even though all we can do is 90 degree rotations followed by scaling. It also explains why we need the single scalar value, and why the three bivectors are not enough: We sometimes want to add some of the original vector back in to get the desired rotation.

One thing to note is that in here I chose to do the bivector multiplication first, and the scalar multiplication second. But the choice is kinda arbitrary as both of these happen at the same time, and they don’t depend on each other.

Let’s rotate that same vector again to show what this looks like when we didn’t start off with one of our basis vectors:

So let’s visualize that:

First we rotate with the bivector, which puts us at :

So once again this does a 90 degree rotation followed by a scale of 0.707.

Next we multiply the original vector by 0.707 and add the resulting vector :

Which then gives us the final vector of :

Which is exactly what we would expect after rotating by 45 degrees twice.

I think these visualizations also explain how we can get arbitrary rotations: For bigger rotations we just have to make the scalar component smaller as the bivector component gets bigger.

So far we have only looked at the xy plane. To visualize this in 3D, I wrote a small program in Unity that can do the above visualization for all three bivectors. Here is what that looks like for rotating from the vector to the vector . That gives me the particularly nice quaternion .

This is going to be hard to do in pictures because it’s a 3D construct, but I’ll give it a shot. Here is what the two vectors look like:

So I want to rotate from the vector on the left to the vector on the right.

Here is what the contribution of the bivector looks like:

So this bivector is rotating on the xy plane. It takes the end point of the vector and rotates it 90 degrees down on the xy plane. It may be a bit hard to see, but imagine all the yellow lines lying on a xy plane.

The result of that 90 degree rotation is the vector . (the lower edge of the plane) I used the end of that rotation to start our result vector. (see how I have a third short vector sticking out at the bottom now? That’s )

Next I’m doing the contribution of the bivector:

The original vector was already rotated 45 degrees on the yz plane, so this rotation started off at a 45 degree angle and it rotated 90 degrees on the yz plane. Then it scaled the result by 0.5, giving us the result vector . (the bottom of the teal plane)

I also added the result of that rotation to the result vector. (the shorter vector that was sticking out now has a corner in it, indicating that I added the new )

Next we add the contribution of the bivector:

This took the end point of the original vector, and rotated it by 90 degrees on the zx plane. Then it scaled the result by 0.5, giving us the new vector (the end of the purple plane). The reason why the purple plane is floating above the other planes is an artifact of my visualization: I start at the end point and then I only move on the zx plane, so I end up floating above everything else. I also added this to our result vector at the bottom there.

Finally I’m going to add the scalar component into this:

This just took the original vector and scaled it by 0.5, giving us . I then added that to the results of the three bivector rotations. And as we can see, if we add up the contributions of the three bivectors and of the scalar part, we end up exactly at the end point of the vector that we were rotating into. (it may look like the last part is longer than 0.5 times the original vector, but that’s a trick of the perspective. The reason I picked this perspective is that you can see all three rotations from this angle)

So the rotation happened by doing three bivector multiplications and one scalar multiplication and adding all the results up.

Once again I want to point out that the order in which I added these up is arbitrary. All of these multiplications happen at the same time and don’t depend on each other, since they all just use the original vector as input. I chose to do this in the order xy, yz, zx, scalar, because that gave me a nice visualization.

I wanted to make the above visualization available for you to play with. I thought I could be really cool and upload a webgl version so that you can just play with it in your browser. So I built a webgl version, but then I found out that I can’t upload that to my wordpress account. So… I just put it in a zip file which you have to download and then open locally… Here it is.

There is an alternate visualization for the above rotation: Just as we would think of the vector as a single vector, we can also think of the bivector as a single bivector. It’s the plane with the normal , which is the plane spanned between the start vector and the end vector of the rotation. Then the visualization shows a 90 degree rotation on that plane, followed by a scaling of the length of this bivector. (which is ) That visualization looks like this:

So we rotate on this shared plane, then scale by 0.866, and finally add the original vector scaled by 0.5. This visualization as a single 90 degree rotation by the sum-bivector is equally valid as the visualization of the component bivectors. Just as we can visualize vectors either by their components, or as one line, we can visualize bivectors either by their components or as a single plane.

That finishes the part about visualization. As far as I know this is the first quaternion visualization that doesn’t try to visualize them as 4D constructs, and I think that really helps. Every component now has a distinct meaning and a picture. And we can see how the behavior of the whole quaternion is a sum of the behavior of its components.

One quick aside I want to make is that sometimes people say that quaternions are related to the axis/angle representation of rotations. That is a good way to get people started with quaternions, but then it breaks down relatively quickly because the equations don’t make sense and the numbers behave weirdly. The scalar & bivector interpretation is actually related to the axis/angle interpretation, and it explains what’s really going on here. Because when I say that something rotates 90 degrees on a plane, we can also say that it rotates 90 degrees around the normal of the plane. So in this interpretation quaternions first: rotate 90 degrees around the normal, followed by being scaled down, and second: multiply the original vector times a scalar and add that. It’s not quite axis/angle, but we can see how it’s related and why the axis/angle interpretation sometimes seems to work.

With the scalar & bivector interpretation of quaternions, we have a good idea of what quaternions do. With that, we’re ready to tackle the final quaternion mystery:

When I was working on this, a few friends asked me how the “scalar and bivector” explanation explains the double cover of quaternions. If you’re not familiar, the double cover means that for any desired rotation, there are actually two quaternions that represent that rotation. For example the quaternions that have 1 or -1 in the scalar part, and 0 for all the bivectors both represent a rotation by 0 degrees. (or by 360 degrees depending on how you look at it)

At first I responded that I hadn’t gotten to that part yet, but as I was working on this, the double cover just never came up. So eventually I decided to go looking for it, and… I couldn’t find it. It seemed like my quaternions didn’t have double cover. So I double checked everything and noticed that I have one difference: Remember how in order to multiply a quaternion with a vector we did this multiplication: . I accidentally didn’t do that. I just did .

And the simple multiplication actually works as long as you’re only rotating vectors on a plane that they actually lie on. For example rotating the vector on the plane works out: . The problems start if we’re rotating a vector that doesn’t completely lie on the plane that you’re rotating on. So let’s say I’m rotating the vector on the plane:

That’s strange: Some of our vector part has disappeared, and instead we have a trivector. This is not good. You don’t want part of the vector to disappear after a rotation. Rotating with fixes the problem, because the trivector part cancels out:

So now the part that’s on the plane (the component) got rotated, but the part that’s not on the plane (the component) was left unchanged. This is exactly what we want.

But look at what happened: The first rotation was a 90 degree rotation and the part that’s on the plane ended up at . And now we did a full 180 degree rotation and that part ended up at . How did that happen?

Well it actually makes sense. We are multiplying with the quaternion twice after all. Of course it would do a double rotation. It’s clearest if you multiply it all out, but the short explanation is that the conjugate allows us to rotate roughly in the same direction while multiplying from the other side: , and we went ahead and just multiplied on both sides . So if we multiply on both sides of course we get twice the rotation.

This is literally where the half-angles of quaternions and the double cover come from: From the way we multiply quaternions with vectors. Internally quaternions actually don’t have double cover. If you multiply one 90 degree quaternion with a different quaternion, then after four rotations that second quaternion will end up exactly where it started. But then we chose a vector multiplication function that applies the quaternion twice. So we have to change the interpretation and that 90 degree quaternion becomes a 180 degree quaternion. And actually my visualizations above don’t make sense any more because the vector multiplication always does that operation twice.

So if the vector multiplication is the problem, could we define a vector multiplication that doesn’t lead to double cover? That would make quaternions much simpler.

And the answer is that yes, we can. Remember that rotating vectors that lie on the plane already worked correctly. The problem was that rotating an orthogonal vector would turn into a trivector. (but rotations should leave orthogonal vectors unchanged) The solution is that we have to first project the vector down onto the plane, then rotate within the plane, and then apply the original offset again. Here is an outline of the algorithm:

- Compute the normal of the plane by multiplying with the trivector (very fast)
- Project the vector onto that normal (fast, as long as you use the version without a square root)
- Subtract that projected part (very fast)
- Multiply the vector with the quaternion
- Add the projected part (very fast)

So now we only have to do a single multiplication instead of two multiplications. And since all other operations are fast, this might even be faster than the double-cover-giving quaternion/vector multiplication.

And yes, this totally works and it’s faster and it’s less confusing. But you don’t want to use it. The reason is that as soon as I didn’t have double cover in my quaternions, I discovered why double cover is actually awesome.

Double cover is what makes quaternion interpolation so great. (by interpolation I mean getting from rotation a to rotation b in multiple small steps as opposed to one large step) Without double cover, there are some quaternions that you can not interpolate between. Having to worry about those special cases makes interpolation a giant pain and defeats the whole point of why we used quaternions to begin with.

To explain what the problem is, let’s do a couple 90 degree rotations on the plane, once using double cover and once not using double cover:

Rotation | Single Cover | Double Cover |
---|---|---|

If we interpreted these two numbers as vectors, the double cover version would do a 45 degree rotations of the vector each time. But since the double cover quaternion will rotate twice, this will actually give us a 90 degree rotation from one row to the next.

Here is a visualization of the same numbers. The idea here is that I put the scalar value on the x axis and the bivector on the y axis:

I drew the double cover as two lines, and the single cover as one line. Once again we see that a quaternion that uses double cover rotation is simply half-way towards the quaternion that uses single cover rotation.

I said that double cover is what makes quaternion interpolation so great. To see why, let’s try interpolating between these. To keep it simple I won’t do a slerp, but I’ll just try to find the rotation half-way between any of these rotations. We do that by adding the quaternions and then renormalizing them. Interpolating from the rotation to the rotation is pretty easy in both cases:

For single cover: and after normalization that comes out to be which is a 45 degree rotation.

For double cover: and after normalization that comes out to be , which is a 22.5 degree rotation, or with the double cover it’s a 45 degree rotation.

So interpolating a 90 degree rotation works just fine in both cases.

However we run into problems when interpolating from the rotation to the rotation:

For single cover: . Huh. We can’t find the half-way rotation between these two because we just get 0, which we can’t normalize. You may think that this is just a problem because I chose to find the exact midpoint between these two vectors. But this is also a problem if we want to slerp from one to the other. It all collapses and we’re left with a zero vector.

So let’s reason through this manually. How would we interpolate from +1 to -1? We could rotate on the xy plane or on the yz plane or on the zx plane, or on any combined bivector. How do we know which bivector to choose? They’re all zero in both of our inputs. We’re missing information. In order to interpolate between two rotations, we need to know a plane on which we want to interpolate.

Let’s see how the double cover solves this: and after normalization we’re left with which was our 90 degree rotation, which is exactly the half-way point between the 0 degree rotation and the 180 degree rotation.

Isn’t that neat? In the double cover version one of our quaternions had a component, so we could interpolate on that plane. In fact you could build many possible 180 degree rotations in the double cover version. We could build a 180 degree rotation that rotates on the plane or on a linear combination of the and planes, or on any arbitrary plane. They all look different and they all interpolate differently. That’s a great property because we want to be able to interpolate on any plane of our choosing. In the single cover version however we only have one way to rotate 180 degrees and it looks the same no matter which plane you’re on. Which works fine if all you want to do is rotate 180 degrees, but it doesn’t work if you want to interpolate from one rotation to the other.

One way of thinking of this is that the trick of double cover is that you can express any rotation as a rotation of less than 90 degrees. We already saw that if we want to go 180 degrees, we just go 90 degrees twice. Want to go 270 degrees? Just go -45 degrees twice. Like that we can always stay far away from the problem point of the 180 degree rotation that we would run into often if we used the single cover version of quaternions. And like that we always keep the information of which plane we are rotating on, making interpolation easy.

Another way of thinking of this is that the double cover version always gives us a midpoint of the rotation which we can use to interpolate. For some pairs of rotations, there are a lot of possible midpoints depending on which plane we want to interpolate on. Double cover solves that problem by giving us one midpoint, which narrows our choices down to one plane. And we can derive any other desired interpolation if we have the midpoint.

You may be wondering if there is a problem point where the double cover breaks down. Looking at the table above, we can find one: Rotating by 360 degrees: . Which we can not renormalize. But that case is easy to handle, and in fact every slerp implementation already handles this: We detect if the dot product of the quaternions is negative, and if it is we flip the target quaternion. So then we interpolate from to which is just a 0 degree rotation. Which is exactly what we wanted. So as long as we handle the “negative dot product” case in our interpolation function, we can handle all possible rotations. Because there are two possible ways to express every rotation, and if we run into one that’s inconvenient, we just switch to the other one.

So I hope I have convinced you that you want to have double cover. It’s a neat trick that makes interpolation easy. Quaternions do not “naturally” have double cover, but the double cover comes from the way we define the vector multiplication. If we used a different algorithm to multiply a quaternion with a vector (I outlined one above) then we could get rid of the double cover, but we would be making interpolation more difficult. I actually think that the double cover trick is not unique to quaternions. I think we could also apply it to rotation matrices to make them easier to interpolate. I haven’t done the math for that though.

So in summary I hope that I was able to make quaternions a whole lot less weird. The geometric algebra interpretation of quaternions shows us that they are normal 3D constructs, not weird four-dimensional beasts. They consist of a scalar and three bivectors. Bivectors do 90 degree rotations followed by scaling, and we saw how we can create any rotation just from those 90 degree rotations and linear scaling. The rules that govern these constructs are simple, making the equations easy to derive and understand. (as opposed to the quaternion equations which can only be memorized) Also quaternions do not naturally have a double cover. The double cover comes from the way we define the multiplication of vectors and quaternions. We could get rid of it, but the double cover is a great trick for making interpolations easier.

Unfortunately this still only makes it slightly easier to understand the numbers in quaternion. The double cover makes it so that each rotation actually gets applied twice, so my visualizations above only show half of what’s going on. This also makes it difficult to interpret the numbers because you have to know what happens if a rotation gets applied twice, which is a whole lot harder to do in your head than doing a single rotation. But still I now have a picture of quaternions, and I know what each component means, and why they behave the way they do. I hope I was able to do something similar for you.

I also think that Geometric Algebra is a very interesting field that merits further study. The fact that quaternions came out so naturally (in fact they almost don’t even need a special name) and that if we do the same derivation in 2D we end up with complex numbers is fascinating to me. The paper I linked at the beginning, Imaginary Numbers are not Real, spends a lot of time talking about how various equations in physics come out much simpler if we use geometric algebra instead of imaginary numbers and matrices. Simplicity like that is a good hint that there is something good going on here. If you’re interested in this for doing 3D math, there is something called Conformal Geometric Algebra which adds translation to quaternions. I didn’t look too much into it, but a brief glance shows that it might be related to dual quaternions. So there’s much more to discover.

]]>

The trick is to use Robin Hood hashing with an upper limit on the number of probes. If an element has to be more than X positions away from its ideal position, you grow the table and hope that with a bigger table every element can be close to where it wants to be. Turns out that this works really well. X can be relatively small which allows some nice optimizations for the inner loop of a hashtable lookup.

If you just want to try it, here is a download link. Or scroll down to the bottom of the blog post to the section “Source Code and Usage.” If you want more details read on.

There are many types of hashtables. For this one I chose

- Open addressing
- Linear probing
- Robing hood hashing
- Prime number amount of slots (but I provide an option for using powers of two)
- With an upper limit on the probe count

I believe that the last of these points is a new contribution to the world of hashtables. This is the main source of my speed up, but first I need to talk about all the other points.

Open addressing means that the underlying storage of the hashtable is a contiguous array. This is not how std::unordered_map works, which stores every element in a separate heap allocation.

Linear probing means that if you try to insert an element into the array and the current slot is already full, you just try the next slot over. If that one is also full, you pick the slot next to that etc. There are known problems with this simple approach, but I believe that putting an upper limit on the probe count resolves that.

Robin Hood hashing means that when you’re doing linear probing, you try to position every element such that it is as close as possible to its ideal position. You do this by moving objects around whenever you insert or erase an element, and the method for doing that is that you take from rich elements and give to poor elements. (hence the name Robin Hood hashing) A “rich” element is an element that received a slot close to its ideal insertion point. A “poor” element is one that’s far from its ideal insert point. When you insert a new element using linear probing you count how far you are from your ideal position. If you are further from your ideal position than the current element, you swap the new element with the existing element and try to find a new spot for the existing element.

The prime number amount of slots means that the underlying array has a prime number size. Meaning it grows for example from 5 slots to 11 slots to 23 slots to 47 slots etc. Then to find the insertion point you simply use the modulo operator to assign the hash value of an element to a slot. The other most common choice is to use powers of two to size your array. Later in this blog post I will go more into why I chose prime numbers by default and when you want to use which.

With the basics out of the way, let’s talk about my new contribution: Limiting how many slots the table will look at before it gives up and grows the underlying array.

My first idea was to set this to a very low number, say 4. Meaning when inserting I try the ideal slot and if that doesn’t work I try the next slot over, the next slot over, the slot after that and if all of them are full I grow the table and try inserting again. This works great for small tables, but when I insert random values into a large table, I would get unlucky all the time and hit four probes and I would have to grow the table even though it was mostly empty.

Instead I found that using log2(n) as the limit, where n is the number of slots in the table, makes it so that the table only has to reallocate once it’s roughly two thirds full. That is when inserting random values. When inserting sequential values the table can be filled up completely before it needs to reallocate.

Even though I found that the table can fill up to roughly two thirds, every now and then it would have to reallocate when it’s only 60% full. Or rarely even when it’s only 55% full. So I set the max_load_factor of the table to 0.5. Meaning the table will grow when its half full, even when it hasn’t reached the limit of the probe count. The reason for that is that I want a table that you can trust to reallocate only when you actually grow it: If you insert a thousand elements, then erase a couple elements and then insert that same number of elements again, you can be almost certain that the table won’t reallocate. I can’t put a number on the certainty, but I ran a simple test where I built thousands of tables of all kinds of sizes and filled them with random integers. Overall I inserted hundreds of billions of integers into the tables, and they only reallocated at a load factor of less than 0.5 once. (that time the table grew when it was 48% full, so it grew slightly too soon) So I think you can trust that this will very, very rarely reallocate when you weren’t expecting it.

That being said if you don’t need control over when the table grows, feel free to set the max_load_factor higher. It’s totally safe to set it to 0.9: Robin hood hashing combined with the maximum probe count will ensure that all operations remain fast. Don’t set it to 1.0 though: You can get into bad situations when inserting then because you might hit a case where every single element in the table has to be shifted around when inserting the last element. (say every element is in the slot it wants to be, except the very last slot is empty. Then you insert an element that wants to be in the first slot, but the first slot is already full. So it will go into the second slot, pushing the second element one over, which will push the third element one over etc. all the way through the table until the element in the second to last slot gets pushed into the last slot. You now have a table where every element except for the first is one slot from its ideal slot so lookups are still really fast, but that last insert took a long time) By keeping a few empty slots around you can ensure that newly inserted elements only have to move a few elements over until one of them finds an empty slot.

So if I set the max_load_factor so low that I never reach the probe count limit anyway, why have the limit at all? Because it allows a really neat optimization: Let’s say you rehash the table to have 1000 slots. My hashtable will then grow to 1009 slots because that’s the closest prime number. The log2 of that is 10, so I set the probe count limit to 10. The trick now is that instead of allocating an array of 1009 slots, I actually allocate an array of 1019 slots. But all other hash operations will still pretend that I only have 1009 slots. Now if two elements hash to index 1008, I can just go over the end and insert at index 1009. I never have to do any bounds checking because the probe count limit ensures that I will never go beyond index 1018. If I ever have eleven elements that want to go into the last slot, the table will grow and all those elements will hash to different slots. Without bounds checking, my inner loops are tiny. Here is what the find function looks like:

iterator find(const FindKey & key) { size_t index = hash_policy.index_for_hash(hash_object(key)); EntryPointer it = entries + index; for (int8_t distance = 0;; ++distance, ++it) { if (it->distance_from_desired < distance) return end(); else if (compares_equal(key, it->value)) return { it }; } }

It’s basically a linear search. The assembly of this code is beautiful. This is better than simple linear probing in two ways: 1. No bounds checking. Empty slots have -1 in their distance_from_desired value so the empty case is the same case as finding a different element. 2. This will do at most log2(n) iterations through the loop. Normally the worst case for looking things up in a hashtable is O(n). For me it’s O(log n). This makes a real difference. Especially since linear probing actually makes it pretty likely that you will hit the worst case since linear probing tends to bunch elements together.

My memory overhead on this is one byte per element. I store the distance_from_desired in an int8_t. That being said that one byte will be padded out to the alignment of the type that you insert. So if you insert ints, the one byte will get three bytes of padding so there is four bytes of overhead per element. If you insert pointers there will be 7 bytes of padding so you get eight bytes of overhead per element. I’ve thought about changing my memory layout to solve this, but my worry is that then I would have two cache misses for each lookup instead of one cache miss. So the memory overhead is one byte per element plus padding. Oh and with a max_load_factor of 0.5 (which is the default) your table will only be between 25% and 50% full, so there is more overhead there. (but once again it’s safe to increase the max_load_factor to 0.9 to save memory while only suffering a small decrease in speed)

Measuring hash tables is actually not easy. You need to measure at least these cases:

- Looking up an element that’s in the table
- Looking up an element that can not be found in the table
- Inserting a bunch of random numbers
- Inserting a bunch of random numbers after calling reserve()
- Erasing elements

And you need to run each of these with different keys and different sizes of values. I use an int or a string as the key, and I use value types of size 4, 32 and 1024. I will prefer to use int keys because with strings you’re mostly measuring the overhead of the hash function and the comparison operator, and that overhead is the same for all hash tables.

The reason for testing both successful lookups and unsuccessful lookups is that for some tables there is a huge difference in performance between these cases. For example I came across a really bad case when inserting all the numbers from 0 to 500000 into a google::dense_hash_map (meaning they were not random numbers) and then did unsuccessful lookups: The hashtable suddenly was five hundred times slower than it usually is. This is a edge case of using a power of two for the size of the table. I’ll go more into when you should pick powers of two and when you should pick prime numbers below. This example suggests that maybe I should measure each of these with random numbers and with sequential numbers, but that ended up being too many graphs. So I will only test tables with random numbers which should prevent bad cases caused by specific patterns.

The first graph is looking up an element that’s in the table:

This is a pretty dense graph so let’s spend some time on this one. flat_hash_map is the new hash table I’m presenting in this blog post. flat_hash_map_power_of_two is that same hash table but using powers of two for the array size instead of prime numbers. You can see that it’s much faster and I’ll explain why that is below. dense_hash_map is google::dense_hash_map which is the fastest hashtable I could find. sherwood_map is my old hashtable from my “I Wrote a Faster Hashtable” blog post. It’s embarrassingly slow… std::unordered_map and boost::unordered_map are self-explanatory. multi_index is boost::multi_index.

I want to talk about this graph a little. The Y-axis is the number of nanosecons that it takes to look up a single item. I use google benchmark and that calls table.find() over and over again for half a second, and then counts how many times it was able to call that function. You get nanoseconds by dividing the time that all iterations took together by the loop count. All the keys I’m looking for are guaranteed to be in the table. I chose to use a log scale for the X axis because performance tends to change on a log scale. Also this makes it possible to see the performance at different scales: If you care about small tables, you can look at the left side of the graph.

The first thing to notice is that all of the graphs are spiky. This is because all hashtables have different performance depending on the current load factor. Meaning depending on how full they are. When a table is 25% full lookups will be faster than when it’s 50% full. The reason for this is that there are more hash collisions when the table is more full. So you can see the cost go up until at some point the table decides that it’s too full and that it should reallocate, which makes lookups fast again.

This would be very clear if I were to plot the load factors of each of these tables. One more thing that would be clear is that the tables at the bottom have a max_load_factor of 0.5, the tables at the top have a max_load_factor of 1.0. This immediately raises the question of “wouldn’t those other tables also be faster if they used a max_load_factor of 0.5?” The answer is that they would only be a little faster, but I will answer that question more fully with a different graph further down. (but just from the graph above you can see that the lowest point of the upper graph, when those tables have just reallocated and have a load factor of just over 0.5 is far over the highest point of the lower graphs, just before they reallocate because their load factor is just under 0.5)

Another thing that we notice is that all the graphs are essentially flat on the left half of the screen. This is because the table fits entirely into the cache. Only when we get to the point where the data doesn’t fit into my L3 cache do we see the different graphs really diverge. I think this is a big problem. I think the numbers on the right are far more realistic than the numbers on the left. You will only get the numbers on the left if the element you’re looking for is already in the cache.

So I tried to come up with a test that would measure how fast the table would be if it wasn’t in the cache: I create enough tables so that they don’t all fit into L3 cache and I use a different table for every element that I look up. Let’s say I want to measure a table that has 32 elements in it and the elements in the table are 8 bytes in size. My L3 cache is 6 mebibytes so I can fit roughly 25000 of these tables into my L3 cache. To be sure that the tables won’t be in the cache I actually create three times that number, meaning 75000 tables. And each lookup is from a different table. That gives me this graph:

First, I removed a couple of the lines because they didn’t add much information. boost::unordered_map is usually the same speed as std::unordered_map (sometimes it’s a little faster, but it’s still always above everything else) and nobody cares about my old slow hash table sherwood_map. So now we’re left with just the important ones: std::unordered_map as a normal node based container, boost::multi_index as a really fast node based container, (I believe that std::unordered_map could be this fast) google::dense_hash_map as a fast open addressing container, and my new container in its prime number version and its power of two version.

So in this new benchmark, where I try to force a cache miss, we can see big differences very early on. What we find is that the pattern that we saw at the end of the last graph emerges very early in this graph: Starting at ten elements in the table there are clear winners in terms of performance. This is actually pretty impressive: All of these hash tables maintain consistent performance across many different orders of magnitude.

Let’s also look at the graph for unsuccessful lookups: Meaning trying to find an item that is not in the table:

When it comes to unsuccessful lookups the graph is even more spiky: The load factor really matters here. The more full a table is the more elements the search has to look at before it can conclude that an item is not in the table. But I’m actually really happy about how my new table is doing here: Limiting the probe count seems to work. I get more consistent performance than any other table.

What I take from these graphs is that my new table is a really big improvement: The red line, with the powers of two, is my table configured the same way as dense_hash_map: With max_load_factor 0.5 and using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits. The only big difference is that my table requires one byte of extra storage (plus padding) per slot in the table. So my table will use slightly more memory than dense_hash_map.

The surprising thing is that my table is as fast as dense_hash_map even when using prime numbers to size the table. So let me talk about that.

There are three expensive steps in looking up an item in a hashtable:

- Hashing the key
- Mapping the key to a slot
- Fetching the memory for that slot

Step 1 can be really cheap if your key is an integer: You just cast the int to a size_t. But this step can be more expensive for other types, like strings.

Step 2 is just an integer modulo.

Step 3 is a pointer dereference, for std::unordered_map it’s actually multiple pointer dereferences.

Intuitively you would expect that if you don’t have a very slow hash function, step 3 is the most expensive of these three. But if you’re not getting cache misses for every single lookup, chances are that the integer modulo will end up being your most expensive operation. Integer modulo is really slow, even on modern hardware. The Intel manual lists it as taking between 80 and 95 cycles.

This is the main reason why really fast hash tables usually use powers of two for the size of the array. Then all you have to do is mask off the upper bits, which you can do in one cycle.

There is however one big problem with using a power of two: There are many patterns of input data that result in lots of hash collisions when using powers of two. For example here is that last graph again, except I didn’t use random numbers:

Yes you see correctly that google::dense_hash_map just takes off into the stratosphere. What pattern of inputs did I have to use to get such poor performance out of dense_hash_map? It’s just sequential numbers. Meaning I insert all numbers [0, 1, 2, …, n – 2, n – 1]. If you do that, trying to look up a key that’s not in the table will be super slow. Successful lookups will still be fine. But if some of your lookups are for keys that are in the table and some are for keys that are not, then you might find that some of your lookups are a thousand times slower than others.

Another example of bad performance due to using powers of two is how the standard hashtable in Rust was accidentally quadratic when inserting keys from one table into another. So using powers of two can bite you in non-obvious ways.

It just so happens that my hashtable doesn’t suffer from either of these problems: The limit of the probe count resolves both of these problems in the best way possible. The table doesn’t even have to reallocate unnecessarily. Does that mean that I’m immune against problems that come from using powers of two? No. For example one problem that I have personally experienced in the past is that when you insert pointers into a hash table that uses powers of two, some slots will never be used. The reason is heap allocations in my program were sixteen byte aligned and I used a hash function that just reinterpret_casted the pointer to a size_t. Because of that only one out of sixteen slots in my table was ever used. You would run into the same problem if you use the power of two version of my new hashtable.

All of these problems are solvable if you’re careful about choosing a hash function that’s appropriate for your inputs. But that’s not a good user experience: You now always have to be vigilant when using hashtables. Sometimes that’s OK, but sometimes you just want to not have to think too much about this. You just want something that works and doesn’t randomly get slow. That’s why I decided to make my hashtable use prime number sizes by default and to only give an option for using powers of two.

Why do prime numbers help? I can’t quite explain the math behind that, but the intuition is that since the prime number doesn’t share common divisors with anything, all numbers get different remainders. For example let’s say I’m using powers of two, my hashtable has 32 slots, and I am trying to insert pointers which are all sixteen byte aligned. (meaning all my numbers are multiples of sixteen) Now using integer modulo to find a slot in the table will only ever give me two possible slots: slot 0 or slot 16. Since 32 is divisible by 16, you simply can’t get more possible values than that. If I use a prime numbered size instead I don’t run into that problem. For example if I use the prime number 37, then all divisions using multiples of sixteen give me different slots in the table, and I will use all 37 slots. (try doing the math and you will see that the first 37 multiples of 16 all would end up in different slots)

So then how do we solve the problem of the slow integer modulo? For this I’m using a trick that I copied from boost::multi_index: I make all integer modulos use a compile time constant. I don’t allow all possible prime numbers as sizes for the table. Instead I have a selection of pre-picked prime numbers and will always grow the table to the next largest one out of that list. Then I store the index of the number that your table has. When it later comes time to do the integer modulo to assign the hash value to a slot, you will see that my code does this:

switch(prime_index) { case 0: return 0llu; case 1: return hash % 2llu; case 2: return hash % 3llu; case 3: return hash % 5llu; case 4: return hash % 7llu; case 5: return hash % 11llu; case 6: return hash % 13llu; case 7: return hash % 17llu; case 8: return hash % 23llu; case 9: return hash % 29llu; case 10: return hash % 37llu; case 11: return hash % 47llu; case 12: return hash % 59llu; // // ... more cases // case 185: return hash % 14480561146010017169llu; case 186: return hash % 18446744073709551557llu; }

Each of these cases is a integer modulo by a compile time constant. Why is this a win? Turns out if you do a modulo by a constant, the compiler knows a bunch of tricks to make this fast. You get custom assembly for each of these cases and that custom assembly will be much faster than an integer modulo would be. It looks kinda crazy but it’s a huge speed up.

You can see the difference in the graphs above: Using prime numbers is a little bit slower, but it’s still really fast when compared to other hash tables, and you’re immune against most bad cases of hash tables. Of course you’re never immune against all bad cases. If you really worry about that, you should use std::map with its strict upper bounds. But the difference is that when using powers of two, there are many bad cases and you have to be careful not to stumble into them. When using prime numbers you will basically only hit a bad case if you intentionally create bad keys.

That brings up security: A clever attack that you can do on hash tables is that you insert keys that all collide with each other. How might you do that? If you know that a website uses a hashtable in its internal caches, then you could engineer website requests such that all your requests will collide in that hashtable. Like that you can really slow down the internal cache of a webserver and possibly bring down the website. So for example if you know that google uses dense_hash_map internally and you see the graph above where it gets really slow if you insert sequential numbers, you could just request sequential websites and hope that that pollutes their cache. You might think that setting an upper limit on the probe count prevents attackers from filling up your table with bad keys. That is true: My hashtable will not suffer from this problem. However a new attack immediately presents itself: If you know which prime numbers I use internally you could insert keys in an order so that my table repeatedly hits the limit of the probe count and has to repeatedly reallocate. So the new attack is that you can make the server run out of memory. You can solve this by using a custom hash function, but I can’t give you advice for what such a hash function should look like. All I can tell you is that if you use the hashtable in an environment where users can insert keys, don’t use std::hash as your hash function and use something stateful instead that can’t be predicted ahead of time. On the other hand if you don’t think that people will be malicious, you can be confident that using the prime number version of my hashtable will result in an even spread of values and there will be no problems.

But let’s say you know that your hash function returns numbers that are well distributed and that you’re rarely going to get hash collisions even if you use powers of two. Then you should use the power of two version of my table. To do that you have to typedef a hash_policy in your hash function object. I decided to put this customization point into the hash function object because that is the place that actually knows the quality of the returned keys.

So you put this typedef into your custom hash function object:

struct CustomHashFunction { size_t operator()(const YourStruct & foo) { // your hash function here } typedef ska::power_of_two_hash_policy hash_policy; }; // later: ska::flat_hash_map<YourStruct, int, CustomHashFunction> your_hash_map;

In your custom hash function you typedef ska::power_of_two_hash_policy as hash_policy. Then my flat_hash_map will switch to using powers of two. Also if you know that std::hash is good enough in your case, I provide a type called power_of_two_std_hash that will just call std::hash but will use the power_of_two_hash_policy:

ska::flat_hash_map<K, V, ska::power_of_two_std_hash<K>> your_hash_map;

With either of these you can get a faster hashtable if you know that you won’t be getting too many hash collisions.

After that lengthy detour of talking about hash table theory let’s get back to the performance of my table. Here is a graph that measures how long it takes to insert an item into a map. The way I measure this is that I measure how long it takes to insert N elements, then I divide the time by N to get the time that the average element took. The first graph is the speed if I do not call reserve() before inserting:

This graph is also spiky, but the spikes point in the other direction. Any time that the table has to reallocate the average cost shoots up. Then that cost gets amortized until the table has to reallocate again.

The other point about this graph is that on the left half you once again only have tables that fit entirely in the L3 cache. I decided to not write a cache-miss-triggering test for this one because that would take time and we learned above that just looking at the right half is a good approximation for a cache miss.

Here google::dense_hash_map beats my new hash table, but not by much. My table is still very fast, just not quite as fast as dense_hash_map. The reason for this is that dense_hash_map doesn’t move elements around when inserting. It simply looks for an empty slot and inserts the element. The Robin Hood hashing that I’m using requires that I move elements around when inserting to keep the property that every node is as close to its ideal position as possible. It’s a trade-off where insertion becomes more expensive, but lookups will be faster. But I’m happy with how it seems to only have a small impact.

Next is the time it takes to insert elements if the table had reserve() called ahead of time:

I don’t know what’s happening with the node-based containers at the end there. It might be fun to investigate what’s going on there, but I didn’t do that. I actually have a suspicion that that’s due to the malloc call in my standard library. (Linux gcc) I had several problems with it while measuring this graph and others because some operations would randomly take a long time.

But overall this graph looks similar to the last one, except less spiky because the reserve removes the need for reallocations. I have fast inserts, but they’re not as fast as those of google::dense_hash_map.

Finally let’s look at how long it takes to erase elements. For this I built a map of N elements, then measured how long it takes to erase every element in the map in a random order. Then I would divide the time it takes to erase all elements by N to get the cost per element:

The node based containers are slow once again, and the flat containers are all roughly equally fast. dense_hash_map is slightly faster than my hash table, but not by much: It takes roughly 20 nanoseconds to erase something from dense_hash_map and it takes roughly 23 nanoseconds to erase something from my hash table. Overall these are both very fast.

But there is one big difference between my table and dense_hash_map: When dense_hash_map erases an element, it leaves behind a tombstone in the table. That tombstone will only be removed if you insert a new element in that slot. A tombstone is a requirement of the quadratic probing that google::dense_hash_map does on lookup: When an element gets erased, it’s very difficult to find another element to take its slot. In Robin Hood hashing with linear probing it’s trivial to find an element that should go into the now empty slot: just move the next element one forward if it isn’t already in its ideal slot. In quadratic probing it might have to be an element that’s four slots over. And when that one gets moved you have to again solve the problem of finding a node to insert into the newly vacated slot. So instead you insert a tombstone and then the table knows to ignore tombstones on lookup. And they will be replaced on the next insert.

What this means though is that the table will get slightly slower once you have tombstones in your table. So dense_hash_map has a fast erase at the cost of slowing down lookups after an erase. Measuring the impact of that is a bit difficult, but I believe I have found a test that works for this purpose: I insert and erase elements over and over again:

The way this test works is that I first generate a million random ints. Then I insert these into the hashtable and erase them again and insert them again. The trick is that I do this in a random order: So let’s say I only had the four integers 1, 2, 3 and 4. Then a valid order for “insert, erase and insert again” would be insert 1, insert 3, erase 1, insert 2, insert 4, erase 4, insert 4, insert 1, erase 2, erase 3, insert 3, insert 2. Every element gets inserted, erased and inserted again. But the order is random. The graph above counts the number of inserts and measures how long the insert takes per element. The first data point, all the way on the left is just inserting a million elements. The second data point is inserting a million elements, erasing them and inserting them again in a random order like I explained. The next data point does insert, erase, insert, erase, insert. So three inserts in total. You get the idea.

What we see is that at first dense_hash_map is faster because its inserts are faster. At the second data point my hashtable has already caught up to it. At the third data point my hashtable is winning, and at six million inserts even my prime number version is winning. The reason why my tables keep on getting faster is that as there are more erases, you would expect the average load factor of the table to go down. If you insert and erase a million items often enough, the table will always have close to 500,000 elements in it. So as you get further to the right in this graph, the table will be less full on average. My hash table can take advantage of that, but dense_hash_map has a bunch of tombstones in the table which prevent it from going faster. That being said if we compare dense_hash_map against other hash tables, it’s still very fast:

So from this angle it totally makes sense for dense_hash_map to use quadratic probing, even if that requires inserting tombstones into the table on erase. The table is still very fast, certainly much faster than any node based container. But the point remains that Robin Hood linear probing gives me a more elegant way of erasing elements because it’s easy to find which element should go into the empty slot. And if you have a table where you often erase and insert elements, that’s an advantage.

One final graph that I promised above is a way to resolve the problem that std::unordered_map and boost::multi_index use a max_load_factor of 1.0, while my table and google::dense_hash_map use 0.5. Wouldn’t the other tables also be faster if they used a lower max_load_factor? To determine that I ran the same benchmark that I used to generate the very first graph (successful lookups) but I set the max_load_factor to 0.5 on each table. And then I took measurements just before a table reallocates. I’ll explain it a bit better after the graph:

This is the same graph as the very first graph in this blog post, except all the tables use a max_load_factor of 0.5. And then I wanted to only measure these tables when they really do have the same load factor, so I measured each table just before it would reallocate its internal storage. So if you look back at the very first graph in this blog post, imagine that I drew lines from one peak to the next. If we want to directly compare performance of hashtables and we want to eradicate the effect of different hash tables using different max_load_factor values and different strategies for when they reallocate, I think this is the right graph.

In this graph we see that flat_hash_map is faster than dense_hash_map, just as it was in the initial graph. It’s now much clearer though because all the noisiness is gone. Btw that brief time where dense_hash_map is faster is a result of dense_hash_map using less memory: At that point dense_hash_map still fits in my L3 cache but my flat_hash_map does not. Knowing this I can also see the same thing in the first graph, but it’s much clearer here.

But the main point of this was to compare boost::multi_index and std::unordered_map, which use a max_load_factor of 1.0 to my flat_hash_map and dense_hash_map which use a max_load_factor of 0.5. As you can see even if we use the same max_load_factor for every table, the flat tables are faster.

This was expected, but I still think this was worth measuring. In a sense this is the truest measure of hash table performance because here all hash tables are configured the same way and have the same load factor: Every single data point has a current load factor of 0.5. That being said I did not use this method of measuring for my other graphs, because in the real world you probably will never change the max_load_factor. And in the real world you will see the spiky performance of the initial graph where similar tables can have very different performance, depending on how many hash collisions there are. (and the load factor is actually only one part of that, as I also discussed above when talking about powers of two vs prime numbers) And also this graph hides one benefit of my table: Limiting the probe count leads to more consistent performance, making the lines of my hash_map less spiky than the lines of other tables.

So far every graph was measuring performance of a map from int to int. However there might be differences in performance when using different keys or larger values. First, here are the graphs for successful lookups and unsuccessful lookups when using strings as keys:

Yes, I went with the version of the graph where the table is already in the cache. It’s easier to generate. What we see here is that using a string is just moving all lines up a little. Which is expected, because the main cost here is that the hash function changed and the comparison is more expensive. Let’s look at unsuccessful lookups, too:

This is very interesting: It looks like looking for an element that’s not in the table is more expensive in google::dense_hash_map than in boost::multi_index. The reason for this is interesting: When creating a dense_hash_map you have to provide a special key that indicates that a slot is empty, and a special key that indicates that a slot is a tombstone. I used std::string(1, 0) and std::string(1, 255) respectively. But what this means is that the table has to do a string comparison to see that the slot is empty. All the other tables just do an integer comparison to see that a slot is empty.

That being said a string comparison that only compares a single character should be really cheap. And indeed the overhead is not that big. It just looks big above because every lookup is a cache hit. The cache miss picture looks different:

In this we can see that when the table is not already in the cache, dense_hash_map remains faster. Except it gets slower when the table gets very big. (more than a million entries) I didn’t find out why that is.

The next thing I tried to vary was the size of the value. What happens if I don’t have a map from an int to an int, but from an int to a 32 byte struct? Or from an int to a 1024 byte struct. So for the lookups I have 12 graphs in total ([int value, 32 byte value, 1024 byte value] x [int key, string key] x [successful lookup, unsuccessful lookup]) and most of them look exactly like the graphs above: All string lookups look the same independent of value size, and most int lookups also look the same. Except for one: Unsuccessful lookup of an int key and a 1024 byte value:

What we see here is that at a 1024 byte value, multi_index is actually competitive to the flat tables. The reason for this is that in an unsuccessful lookup you have to do the maximum number of probes, and with a value type that’s as huge as 1024 bytes, your prefetcher has to work hard. My table still seem to be winning, but for a value that’s this large, everything is essentially a node based container.

The reason why all other lookup graphs looked the same (and why I don’t show them) is this: For the node-based containers you don’t care how big the value is. Everything is a separate heap allocation anyway. For the flat containers you would expect that you would get more cache misses. But since the max_load_factor is 0.5, the element is usually found in the table pretty quickly. The most common case is exactly one lookup: Either you find it in the first probe or you know with the first probe that it won’t be in the table. Two probes also happen pretty often, but three probes are rare. Also at least in my table the lookups are just a linear search. CPUs are great at prefetching the next element in a linear search, no matter how big the item is.

So lookups mostly don’t change with the size of the type, the graph for inserts and erases changes a lot though. Here is inserting with an int as a key and a 32 byte struct as a value:

All the graphs have moved up a bit, but the graphs of the flat tables have moved up the most and have become more spiky: Reallocations hurt more when you have to move more data. The node based containers are not affected by this, and boost::multi_index stays competitive for a very long time. Let’s see what this looks like for a really large type, a 1024 byte struct:

Now the order has flipped completely: The flat containers are more expensive and very spiky, the node based containers keep their speed. At this point reallocation cost dominates completely.

One oddity is that it’s really expensive to insert a single element into a dense_hash_map. (all the way on the left of the yellow line) The reason for this is that dense_hash_map allocates 32 slots at first and it fills all of them with a default constructed value type. Since my value type is 1024 bytes in size, it has to set 32 kib of data to 0. This probably won’t affect you, but I felt like I should explain the strange shape of the line.

The other thing that happened is that dense_hash_map is now slower than my hashtable. I didn’t look into why that is, but I would assume it’s for the same reason as the above paragraph: dense_hash_map fills every slot with a default constructed value type, so reallocation is even more expensive because all the slots have to be initialized, even the ones that will never be used.

If reallocation is expensive, the solution is to call reserve() on the container ahead of time so that no reallocation has to happen. Let’s see what happens when we insert the same elements but call reserve first:

When calling reserve first, my container is faster than the node based containers at first, but at some point boost::multi_index is still faster. dense_hash_map is still slower and again I think that’s because it initializes more elements than necessary and with a value this big, even just initializing the whole table to the “empty” key/value pair takes a lot of time. They could probably optimize this by only initializing the key to the “empty” key and not initializing the value, but then again how often do you insert a value that’s 1024 bytes? It’s neat as a benchmark to test the behavior of containers as the stored values grow very large, but it might not happen in the real world.

My containers are faster until they get large: at exactly 16385 elements there is a sudden jump in the cost. At 16384 elements things are still at the normal speed. Since every element in the container is 1028 bytes, that means that if your container is more than 16 megabytes, it can suddenly get slower. At first I thought this was a random reallocation because I hit the probe count limit, which would have been embarrassing because I explained further up in this blog post about how rare that is, but luckily that’s not the case. The reason for this is interesting: At exactly that measuring point the amount of time I spend in clear_page_c_e goes up drastically. It’s not easy to find out what that is, but luckily Bruce Dawson wrote a blog post where he mentions the cost of zeroing out memory and that this happens in a function called clear_page_c_e. So for some reason at exactly that measuring point it takes the OS a lot longer to provide me with cleared pages of memory. So depending on your memory manager and your OS, this may or may not happen to you.

That also means though that this is a one time cost. Once you’ve grown the container, you will not hit that spike in cost again. So if your container is long lived this cost will be amortized.

Let’s try inserting strings:

dense_hash_map is surprisingly slow in this benchmark. The reason for that is that my version of dense_hash_map doesn’t support move semantics yet. So it makes unnecessary string copies. I’m using the version of dense_hash_map that comes with Ubuntu 16.04, which is probably a bit out of date. But I also can’t seem to find a version that does support move semantics, so I’ll stick with this version.

So we’ll use this graph mostly to compare my table against the node base containers, and my table loses. Once again I blame this on the higher cost of reallocation. So let’s try what happens if I reserve first:

… Honestly, I can’t read anything from this. The cost of the string copy dominates and all tables look the same. The main lesson to learn from this is that when your type is expensive to copy, that will dominate the insertion time and you should probably pick your hash table based on the other measures, like lookup time. We can see a similar picture when I insert strings as key with a large value type:

Once again dense_hash_map is slow because it initializes all those bytes. The other tables are pretty much the same because the copying cost dominates. Except that my flat_hash_map_power_of_two has that same weird spike at exactly 16385 elements due to increased time spent in clear_page_c_e that I also had when inserting ints with a 1024 byte value.

Lesson learned from this: If you have a large type, inserts will be equally slow in all tables, you should call reserve ahead of time, and the node based containers are a much more competitive option for large types than they are for small types.

Let’s also measure erasing elements. Once again I ran three tests for ints as keys and three tests for strings as keys: Using a 4 byte value, a 32 byte value and a 1024 byte value. The four byte value picture is shown above. The 32 byte value picture looks identical, so I’m not even going to show it. The 1024 byte value picture looks like this:

The main difference is that dense_hash_map got a lot slower. This is the same problem as in the others pictures with large value types: The other tables just consider an item deleted and call the destructor which is a no-op for my struct. dense_hash_map will overwrite the value with the “empty” key/value pair which is a large operation if you have 1024 bytes of data.

Otherwise the main difference here is that erasing from flat_hash_map has gotten much more spiky than it was in the other erase picture above, and the line has moved up considerably, getting almost as expensive as in the node based containers. I think the reason for this is that the flat_hash_map has to move elements around when an item gets erased, and that is expensive if each element is 1028 bytes of data.

The graphs for erasing strings look similar enough to the graphs for erasing ints that they’re almost not worth showing. Here is one though:

Looks very similar to erasing ints. If I make the value size 1024 bytes, the graph looks very similar to the one above this one, so just look at that one again.

The final test is the “insert, erase and insert again” test I ran above where I do inserts and erases in random orders. I reduced the number to 10,000 elements because I ran out of memory when running this with ten million 1028 byte elements. It’s also much faster to generate these graphs when I’m using fewer elements. Let’s start with a 32 byte value type:

flat_hash_map actually beats dense_hash_map now. The difference gets bigger if we increase the size of the value even more:

With a really large value type my table beats dense_hash_map. However now the node based containers beat my hash table at first, but over time my table seems to catch up. The reason for this is that I’m not reserving in these graphs. So in the first insert the table has to reallocate a bunch of times and that is very expensive in the flat containers, and it’s cheaper for the node based containers. However as we erase a few elements and insert a few elements, the reallocation cost gets amortized and my tables beat unordered_map, and they would probably also beat multi_index at some point. If I reserve ahead of time they beat multi_index right away:

I actually didn’t expect this because even though I reserve ahead of time, my container still has to move elements around, and that should be really expensive for 1028 bytes of data. My only explanation for why my container remains fast is that the load factor is pretty low and that collisions are rare. When I measure this test with strings, I do see that my container slows down as expected and multi_index is competitive:

The other pictures for inserting strings look similar to the above pictures: When inserting strings with a 32 byte value type, the graph looks like this last one. When inserting with a 1024 byte value type it looks like the graphs where I did the same thing with ints as a key, both for the case where I do reserve and the case where I don’t reserve.

That was quite a lot of measurements. Measuring hashtables is surprisingly complex. I’m still not entirely sure if I should measure all tables with the same load factors or with their default setting. I chose to go for the default setting here. And then there are so many different cases: Different keys, different value sizes, different table sizes, reserve or not etc. And there are a lot of different tests to do. I could have put thousands of graphs into this blog post, but at some point it just gets too much, so let me summarize:

- My new table has the fastest lookups of any table I could find
- It also has really fast insert and erase operations. Especially if you reserve the correct size ahead of time.
- For large types the node based containers can be faster if you don’t know ahead of time how many elements there will be. The cost of reallocations kills the flat containers. Without reallocation my flat container is the fastest in all benchmarks for large types.
- When inserting strings the cost of the string hashing, comparison and copy dominate and the choice of hashtable doesn’t matter much.
- google::dense_hash_map has some surprising cases where it slows down.
- boost::multi_index is a really impressive hash table. It has very good performance for a node based container.
- If you know that your hash function returns a good distribution of values, you can get a significant speed up by using the power_of_two version of my hashtable.

When using my table it’s safe for you to throw exceptions in your constructor, your copy constructor, in your hash function, in your equality function and in your allocator. You are not allowed to throw exceptions in a move constructor or in a destructor. The reason for this is that I have to move elements around and maintain invariants. And if you throw in a move constructor, I don’t know how to do that.

I’ve uploaded the source code to github. You can download it here. It’s licensed under the boost license. It’s a single header that contains both ska::flat_hash_map and ska::flat_hash_set. The interface is the same as that of std::unordered_map and std::unordered_set.

There is one complicated bit if you want to use the power_of_two version of the table: I explain how to do that further up in this blog post. Search for “ska::power_of_two_hash_policy” to get to the explanation.

Also I want to point out that my default max_load_factor is 0.5. It’s safe to set it as high as 0.9. Just be aware that your table will probably reallocate before it hits that number. It tends to reallocate before it’s 70% full because it hits the probe count limit. But if you don’t care that your table might reallocate when you’re not expecting it, you can save a bit of memory by using a higher max_load_factor while only suffering a tiny loss of performance.

I think I wrote the fastest hash table there is. It’s definitely the fastest for lookups, and it’s also really fast for insert and erase operations. The main new trick is to set an upper limit on the probe count. The probe count limit can be set to log2(n) which makes the worst case lookup time O(log(n)) instead of O(n). This actually makes a difference. The probe count limit works great with Robin Hood hashing and allows some neat optimizations in the inner loop.

The hash table is available under the boost license as both a hash_map and a hash_set version. Enjoy!

]]>

Somebody was nice enough to link my blog post on Hacker News and Reddit. While I didn’t do that, I still read most of the comments on those website. For some reasons the comments I got on my website were much better than the comments on either of those websites. But there seem to be some common misunderstandings underlying the bad comments, so I’ll try to clear them up.

The top comment on Hacker News essentially says “meh, this can’t sort everything and we already knew that radix sort was faster.” Firstly, I don’t understand that negativity. My blog post was essentially “hey everyone, I am very excited to share that I have optimized and generalized radix sort” and your first response is “but you didn’t generalize it all the way.” Why the negativity? I take one step forward and you complain that I didn’t take two steps? Why not be happy that I made radix sort apply to more domains than where it applied before?

So I want to talk about generalizing radix sort even more: The example of something that I don’t handle is sorting a vector of std::sets. Say a vector of sets of ints. The reason why I can’t sort that is that std::set doesn’t have an operator[]. std::sort does not have that problem because std::set provides comparison operators.

There are two possible solutions here:

- Where I currently use operator[], I could use std::next instead. So instead of writing container[index] I could write *std::next(container.begin(), index).
- I could not use an index to indicate the current position, and only use iterators instead. For that I would have to allocate a second buffer to store one iterator per collection to sort.

Both of these approaches have problems. The first one is obviously slow because I have to iterate over the collection from the beginning every time that I want to look up the current element that I want to sort by. Meaning if I need to look at the first n elements to tell two lists apart, I need to walk over those elements n times, resulting in O(n^2) pointer dereferences. The normal set comparison operators don’t have that problem because when they compare two sets, they can iterate over both in parallel. So when they need to look at n elements to tell two lists apart, they can do that in O(n).

I also didn’t want to allocate the extra memory that would be required for the second approach because I didn’t want ska_sort to sometimes require heap allocations, and sometimes not require heap allocations depending on what type it is sorting.

The point is: I could easily generalize radix sort even more so that it can handle this case as well, but it doesn’t seem interesting. Both approaches here have clear problems. I think you should just use std::sort here. So I’ll limit ska_sort to things that can be accessed in constant time.

The other question is why you would want me to handle this. I stopped generalizing when I thought that I could handle all real use cases. Radix sort can be generalized more so that it can sort everything if you want it to. If you really need to sort a vector of std::sets or a vector of std::lists, then I can probably implement the second solution for you. But **the real question isn’t whether ska_sort can sort everything, but whether it can sort your use case. And the answer is almost certainly yes.** If you really have a use case that ska_sort can not sort, then I can understand the criticism. But what do you want to sort that can not be reached in constant time?

That being said one thing that still needs to be done is that I need to allow customization of sorting behavior. Which is also what I wrote in my last blog post. Especially when sorting strings there are good use cases for wanting custom sorting behavior. Like case insensitive sorting. Or number aware sorting so that “foo100” comes after “foo99”. I’ll present an idea for that further down in this blog post. But the work there is not to generalize ska_sort further so that it can sort more data, but instead to give more customization options for the data that it can sort.

Before finishing this section, I actually quickly implemented solution 1 from the approaches for sorting sets above, and the graph is interesting:

ska_sort actually beats std::sort for quite a while there. std::sort is only faster when there are a lot of sets. If I construct the sets such that there is very little overlap between them, ska_sort is actually always faster. Does that mean that I should provide this code? I decided against it for now because it’s not a clear win. I think if I did handle this, I would want to use the allocating solution because I expect a bigger win from that one.

One criticism that I didn’t understand at first was that I am comparing apples to oranges when I’m comparing my ska_sort to std::sort. You can see that same criticism voiced in that top Hacker New comment mentioned above. To me they are both sorting algorithms and who cares how they work internally? If you want to sort things, the only thing you care about is speed, not what the algorithm does internally.

A friend of mine had a good analogy though: **Comparing a better radix sort to std::sort is like writing a faster hash table and saying “look at how much faster this is than std::map.”**

However I contest that **what I did is not equivalent to writing a better hash table, but it is equivalent to writing the first general purpose hash table**. Imagine a parallel world where people have used hash tables for a while, but only ever for special purposes. Say everybody knows that you can only use hash tables if your key is an integer, otherwise you have to use a search tree. And then somebody comes along with the realization that you can store anything in a hash table as long as you provide a custom hash function. In that case it doesn’t make sense to compare this new hash table to older hash tables, because older hash tables simply can’t run most of your benchmarks because they only support integer keys.

Similarly the only thing that I could compare my radix sort to was std::sort, because older radix sort implementations couldn’t run my benchmarks because they could literally only sort ints, floats and strings.

However the above argument doesn’t make sense for me because I made two claims: I claimed that I generalized radix sort, and also that I optimized radix sort. For the second claim I should have provided benchmarks against other radix sort implementations. And also even though something like boost::spreadsort can’t run all my benchmarks, I should have still compared it in the benchmarks that it can run. Meaning for sorting of ints, floats and strings. So yeah, I don’t know what I was thinking there… Sometimes your brain just skips to the wrong conclusions and you never think about it a second time…

So anyways, here is ska_sort compared to boost::spreadsort when sorting uniformly distributed integers:

What we see here is that ska_sort is generally faster than boost::spreadsort. Except for that area between 2048 elements and 16384 elements. The reason for this is mainly that spreadsort picks a different number of partitions than I do. In each recursive step I split the input into 256 partitions. spreadsort uses more. It doesn’t use a fixed amount like I do, so I can’t tell you a simple number, except that it usually picks more than 256.

I had played around with using a different number of partitions in my first blog post about radix sort, but I didn’t have good results that time. That time I found that if I used 2048 partitions, sorting would be faster if the collection had between 1024 and 4096 elements. In other cases using 256 partitions was faster. It’s probably worth trying a variable amount of partitions like spreadsort uses. Maybe I can come up with an algorithm that’s always faster than spreadsort. So there you go, a real benefit just from comparing against another radix sort implementation.

Let’s also look at the graph for sorting uniformly distributed floats:

This graph is interesting for two reasons: 1. spreadsort has much smaller waves than when it sorts ints. 2. All of the algorithms seem to suddenly speed up when there are a lot of elements.

I have no good explanation for the second thing. But it is reproducible so I could do more investigation. My best guess is that this is because uniform_real_distribution just can’t produce this many unique values. (I’m just asking for floats in the range from 0 to 1) So I’m getting more duplicates back there. I tried switching to a exponential_distribution, but the graph looked similar.

The reason for why spreadsort has smaller waves seems to be that spreadsort adjusts its algorithm based on how big the range of inputs is. When sorting ints it could get ints over the entire 32 bit range. When sorting floats it only gets values form zero to one. I need to look more into what spreadsort actually does with that information, but it does compute it and use it to determine how many partitions to use. But there’s no time for looking into that. Instead let’s look at sorting of strings:

This is the “sorting long strings” graph from my last blog post. And oops, that’s embarrassing: spreadsort seems to be better at sorting strings than ska_sort. Which is surprising because I copied parts of my algorithm from spreadsort, so it should work similarly. Stepping through it, there seem to be two main reasons:

- When sorting strings, I have to subdivide the input into 257 partitions: One partition for all the strings that are shorter than my current index, and 256 partitions for all the possible character values. I do that in two passes over the input: First split off all the shorter ones, second run my normal sorting algorithm which splits the remaining data into 256 partitions. spreadsort does this in one pass. There is no reason why I couldn’t do the same thing. Except that it would complicate my algorithm even more because I’d need two slightly different versions of my inner loop. I’ll try to do it when I get to it.
- Spreadsort takes advantage of the fact that strings are always stored in a single chunk of memory. When it tries to find the longest common prefix between all the strings, it uses memcmp which will internally compare several bytes at a time. In my algorithm I have no special treatment for strings: It’s the same algorithm for strings, deques, vectors, arrays or anything else with operator[]. This means I have to compare one element at a time because if you pass in a std::deque, memcmp wouldn’t work. I could solve that by specializing for containers that have a .data() function. I would run into a second problem though: You might be sorting a vector of custom types, in which case memcmp would once again be the wrong thing. It still seems solvable: I just need even more template special cases for when the thing to be sorted is a character pointer in a container that has a data() member function. Doable, but adds more complexity.

So in conclusion spreadsort will stay faster than ska_sort at sorting strings for now. The reason for that is simply that I don’t want to spend the time to implement the same optimizations at the moment.

The top reddit comment talked about something I wrote about the recursion count: It quotes from a part where I make two statements about the recursion count: 1. If I sort a million ints or a thousand ints, I always have to recurse at most four times. 2. If I can tell all the values apart in the first byte, I can stop recursing right there. The comment points out that these two statements apply to different ranges of inputs. Which, yes, they do. The comment makes fun of me for not stating that these apply to different ranges, then it contains some bike shedding about internal variable names and some wrong advice about merging loops that ping-pong between the buffers in ska_sort_copy. (when ping-ponging between two buffers A and B, you can’t start the loop that reads B until the loop that reads A is finished writing to buffer B. Otherwise you read uninitialized data) I really don’t understand why this is the top comment…

But I’ll use this as an excuse to talk in detail about the recursion count and about big O complexity because that was a common topic in the comments. (including in the responses to that comment)

The point where I have to recurse into the second byte is actually more complex than you might think: I fall back to std::sort if a partition has fewer than 128 elements in it. That means that if the inputs are uniformly distributed on the first byte, I can handle up to 127*256 = 32512 values without any recursive calls. The 256 comes from the number of possible values for the first byte, and the 127 comes from the fact that if I create 256 partitions of 127 elements each, I will fall back to std::sort within each of those partitions instead of recursing to a second call of ska_sort.

Now in reality things are not that nicely distributed. Let me insert the graph again about sorting uniformly distributed ints:

The “waves” that you can see on ska_sort happen every time that I have to do one more recursive call. So what we see here is that in that middle wave, from 4096 to 16384 items, is when the pass that looks at the first byte creates more and more partitions that are large enough to require a recursive call. For example let’s say that at 2048 elements I randomly get 80 items with the value 62 in the first byte. Then at 4096 bytes I randomly get 130 elements with the value 62 in the first byte. At 2048 elements I call std::sort directly, at 4096 elements I will do one more recursive call, splitting those 130 elements into another 256 partitions and then call std::sort on each of those.

Then after 16384 what happens is that those partitions are big enough that I can do nice large loops through them, and the algorithm speeds up again. That is until I have to recurse a second time starting at 512k items and I slow down again.

For integers there is a natural limit to these waves: There can be at most four of these waves because there are only four bytes in an int.

That brings us to the discussion about big O complexity. A funny thing to observe in the comments was that the more confident somebody was in claiming that I got my big O complexity wrong, the more likely they were to not understand big O complexity. But I will admit that the big O complexity for radix sort is confusing because it depends on the type of the data that you’re sorting.

To start with I claim that a single pass over the data for me is O(n). This is not obvious from the source code because there are five loops in there, three of which are nested. But after looking at it a bit you find that those loops depends on two things: 1. The number of partitions, 2. the number of elements in the collection. So a first guess for the complexity would be O(n+p) where p is the number of partitions. That number is fixed in my algorithm to 256, so we end up with O(n+256) which is just O(n). But that 256 is the reason why ska_sort slows down when it has lots of small partitions.

Now every time that I can’t separate the elements into partitions of fewer than 128 elements, I have to do a recursive call. So what is the impact of that on the complexity? A simple way of looking at that is to say it’s O(n*b) where b is the number of bytes I need to look at until I can tell all the elements apart. When sorting integers, in the worst case this would be 4, so we end up with O(n*4) which is just O(n). When sorting something with a variable number of bytes, like strings, that b number could be arbitrarily big though. One trick I do to reduce the risk of hitting a really bad case there is that I skip over common prefixes. Still it’s easy to create inputs where b is equal to n. So the algorithm is O(n^2) for sorting strings. But I actually detect that case and fall back to std::sort for the entire range then. So ska_sort is actually O(n log n) for sorting strings.

I like the O(n*b) number better though because the graph doesn’t look like a O(n) graph. (ska_sort_copy however does look like a O(n) graph) The O(n*b) number gives a better understanding of what’s going on. Then we can look at the waves in the graph and can say that at that point b increased by 1. And we can also see that b is not independent of n. (it will become independent of n once n is large enough. Say I’m sorting a trillion ints. But for small numbers b increases together with n)

From this analysis you would think that my algorithm is slowest when all numbers are very close to each other. Say they’re all close to 0. Because then I would have to look at all four bytes until I can tell all the numbers apart. In fact the opposite happens: My algorithm is fastest in these cases. The reason is that the first three passes over the data are very fast in this case because all elements have the same value for the first three bytes. Only the last pass actually has to do anything. (this is the “sorting geometric_distribution ints” graph from my last blog post where ska_sort ends up more than five times faster than std::sort)

Finally when looking at the complexity we have to consider the std::sort fallback. I will only ever call std::sort on partitions of fewer than 128 items. That means that the complexity of the std::sort fallback is not O(n log n) but it’s O(n log 127), which is just O(n). It’s O(n log 127) because a) I call std::sort on every element, so it has to be at least O(n), b) each of those calls to std::sort only sees at most 127 elements, so the recursive calls in quick sort are limited, and those recursive calls are responsible for the log n part of the complexity. If this sounds weird, it’s the same reason that Introsort (which is used in std::sort) is O(n log n) even though it calls insertion sort (an O(n^2) algorithm) on every single element.

Some of the best comments I got were about other good sorting algorithms. And it turns out that other people have generalized radix sort before me.

One of those is BinarSort which was written by William Gilreath who commented on my blog post. BinarSort basically says “everything is made out of bits, so if we sort bits, we can sort everything.” Which is a similar line of thinking that lead to me generalizing radix sort. The big downside with looking at everything as bits is that it leads to a slow sorting algorithm: For an int with 32 bits you have to do up to 31 recursive calls. Running BinarSort through my benchmark for sorting ints looks like this:

The first thing to notice is that BinarSort looks an awful lot as if it’s O(n log n). The reason for that is the same reason that ska_sort doesn’t look like a true O(n) graph: The number of recursive calls is related to the number of elements. BinarSort has to do up to 31 recursive calls. At the point where it reaches that number of recursive calls, you would expect the graph to flatten out. The quick sort which is used in std::sort would continue to grow even then. However it looks like you need to sort a huge number of items to get to that point in the graph. Instead you see an algorithm that keeps on getting slower and slower as it has to do more and more recursive calls, never reaching the point where it would turn linear.

The other big problem with BinarSort is that even though it claims to be general, it only provides source code for sorting ints. It doesn’t provide a method for sorting other data. For example it’s easy to see that you can’t sort floats with it directly, because if you sort floats one bit at a time, positive floats come before negative floats. I now know how to sort floats using BinarSort because I did that work for ska_sort, but if I had read the paper a while ago, I wouldn’t have believed the claim that you can sort everything with it. If you only provide source code for sorting ints, I will believe that you can only sort ints.

A much more promising approach is this paper by Fritz Henglein. I didn’t read all of the paper but it looks like he did something very similar to what I did, except he did it five years ago. According to his graphs, his sorting algorithm is also much faster than older sorting algorithms. So I think he did great work, but for some reason nobody has heard about it. The lesson that I would take from that is that if you’re doing work to improve performance of algorithms, don’t do it in Haskell. The problem is that sorting is slow in Haskell to begin with. So he is comparing his algorithm against slow sorting algorithms and it’s easy to beat slow sorting algorithms. I think that his algorithm would be faster than std::sort if he wrote it in C++, but it’s hard to tell.

A great thing that happened in the comments was that Morwenn adapted his sorting algorithm Vergesort to run on top of ska_sort. The result is an algorithm that performs very well on pre-sorted data while still being fast in random data.

This is the graph that he posted. ska_sort is just ska_sort by itself, verge_sort is a combination of verge sort and ska_sort. Best of all, he posted a comment explaining how he did it.

So that is absolutely fantastic. I’ll definitely attempt to bring his changes into the main algorithm. There might even be a way to merge his loop over the data with my first loop over the data, so that I don’t even have to do an extra pass.

This brings me to future work:

Custom sorting behavior is my next task. I don’t have a full solution yet, but I have something that can handle case-insensitive sorting of ASCII characters and it can do number-aware sorting. The hope is that something like Unicode sorting could be done with the same approach. The idea is that I expose a customization point where you can change how I use iterators in collections. You can change the return value from the iterator, and you can change how far the iterator will advance. So for case insensitive sorting you could simply return ‘a’ from the iterator when the actual value was ‘A’.

The tricky part is number aware sorting. My current idea is that you could return an int instead of a char, and then advance the iterator several positions. You would have to be a bit tricky with the int that you return because you would want to return either a character or an int. I could add support for std::variant (should probably do that anyway) but we can also just say that for characters, we just cast it to an int, and for numbers we return the lowest int plus the number. So for “foo100” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+100. And for “foo99” you would return the integers ‘f’, ‘o’, ‘o’, INT_MIN+99. Then you would advance the iterator one position for the first three characters, and three positions for the number 100 and two positions for the number 99. One tricky part on this is that you have to always move the iterator forward by the same distance when elements have the same value. Meaning if two different strings have the value INT_MIN+100 for the current index, they both have to advance their iterators by three elements. Can’t have one of them advancing it by four elements. I need that assumption so that for recursive calls, I only need to advance a single index. So I won’t actually store the iterators that you return, but only a single index that I can use for all values that fell into the same partition. I think it’s a promising idea. It might also work for sorting Unicode, but that is such a complicated topic that I have no idea if this will work or not. I think the only way to find out is to start working on this and see if I run into problems.

The other task that I want to do is to merge my algorithm with verge sort so that I can also be fast for pre-sorted ranges.

The big problem that I have right now is that I actually want to take a break from this. I don’t want to work on this sorting algorithm for a while. I was actually already kinda burned out on this before I even wrote that first blog post. At that point I had spent a month of my free time on this and I was very happy to finally be done with this when I hit “Publish” on that blog post. So sorry, but the current state is what it’s going to stay at. I’m doing this in my spare time, and right now I’ve got other things that I would like to do with my spare time. (Dark Souls III is pretty great, guys)

That being said I do intend to use this algorithm at work, and I do expect some small improvements to come out of that. These things always get improved as soon as you actually start using them. Also I’ll probably get back to this at some point this year.

Until then you should give ska_sort a try. This can literally make your sorting two times faster. Also if you have data that this can’t sort, I am very curious to hear about that. Here’s a link to the source code and you can find instructions on how to use it in my last blog post.

]]>

The easiest way to quickly generate truly random numbers is to use a std::random_device to seed a std::mt19937_64. That way we pay a one-time cost of using random device to generate a seed, and then have quick random numbers after that. Except that the standard doesn’t provide a way to do that. In fact it’s more dangerous than that: It provides an easy wrong way to do it (use a std::random_device to generate a single int and use that single int as the seed) and it provides a slow, slightly wrong way to do it. (use a std::random_device to fill a std::seed_seq and use that as the seed) There’s a proposal to fix this, (that link also contains reasons for why the existing methods are wrong) but I’ve actually been using a tiny class for this:

struct random_seed_seq { template<typename It> void generate(It begin, It end) { for (; begin != end; ++begin) { *begin = device(); } } static random_seed_seq & get_instance() { static thread_local random_seed_seq result; return result; } private: std::random_device device; };

(the license for the code in this blog post is the Unlicense)

This class has the same generate() function that std::seed_seq has and can be used to initialize a std::mt19937_64. The static get_instance() function is a small convenience to make initialization easier so that you can write this:

std::mt19937_64 random_source{random_seed_seq::get_instance()};

Without the get_instance() function this would have to be a two-liner.

Finally a lot of code doesn’t care where their random numbers come from. Sometimes you just want a random float in the range from zero to one and you don’t want to have to set up a random engine and a random distribution. In that case you can write something like this:

float random_float_0_1() { static thread_local std::mt19937_64 randomness(random_seed_seq::get_instance()); static thread_local std::uniform_real_distribution<float> distribution; return distribution(randomness); }

And just like that we have easy, fast, high quality floating point numbers. Well we do if your compiler is GCC. On my machine this last function is slightly faster than the old-school “rand() * (1.0f / RAND_MAX)”. This function takes 11ns, and the old-school method takes 14 nanoseconds. (measured with Google Benchmark) I attribute most of that to the Mersenne Twister being a very fast random number generator.

When I compiled it with Clang however this new function takes 80ns. Stepping through the assembly generated by both compilers reveals that the problem is that Clang doesn’t inline aggressively enough. There are some calls to compute the logarithm of the upper bound and lower bound in the uniform_real_distribution. GCC inlines those expensive calls away, Clang does not.

Not sure what to do about that last problem: The problem is with how std::uniform_real_distribution is defined: It takes the upper bound and lower bound as runtime arguments. In my code listing above they are the default arguments of 0 and 1, but since Clang doesn’t inline the call, it doesn’t know that they are constants. The only way I see around that is to re-implement std::uniform_real_distribution with constants. But that’s beyond the scope of this blog post.

This blog post was only supposed to be about the random_seed_seq. The other code are just examples showing how you could use it. So let’s not worry about the details of std::uniform_real_distribution, and end this by saying that you should probably use random_seed_seq to seed your random number generators.

It’s a tiny class that I find myself needing all the time. Hopefully it will also be useful for you.

]]>

Why is that an unfortunate claim? Because I’ll probably have a hard time convincing you that I did speed up sorting by a factor of two. But this should turn out to be quite a lengthy blog post, and all the code is open source for you to try out on whatever your domain is. So I might either convince you with lots of arguments and measurements, or you can just try the algorithm yourself.

Following up from my last blog post, this is of course a version of radix sort. Meaning its complexity is lower than O(n log n). I made two contributions:

- I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for a relatively small collections. (starting at 128 elements)
- I generalized in-place radix sort to work on arbitrary sized ints, floats, tuples, structs, vectors, arrays, strings etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

If you just want to try the algorithm, jump ahead to the section “Source Code and Usage.”

To start off with, I will explain how you can build a sorting algorithm that’s O(n). If you have read my last blog post, you can skip this section. If you haven’t, read on:

If you are like me a month ago, you knew for sure that it’s proven that the fastest possible sorting algorithm has to be O(n log n). There are mathematical proofs that you can’t make anything faster. I believed that until I watched this lecture from the “Introduction to Algorithms” class on MIT Open Course Ware. There the professor explains that sorting has to be O(n log n) when all you can do is compare items. But if you’re allowed to do more operations than just comparisons, you can make sorting algorithms faster.

I’ll show an example using the counting sort algorithm:

template<typename It, typename OutIt, typename ExtractKey> void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key) { size_t counts[256] = {}; for (It it = begin; it != end; ++it) { ++counts[extract_key(*it)]; } size_t total = 0; for (size_t & count : counts) { size_t old_count = count; count = total; total += old_count; } for (; begin != end; ++begin) { std::uint8_t key = extract_key(*begin); out_begin[counts[key]++] = std::move(*begin); } }

This version of the algorithm can only sort unsigned chars. Or rather it can only sort types that can provide a sort key that’s an unsigned char. Otherwise we would index out of range in the first loop. Let me explain how the algorithm works:

We have three arrays and three loops. We have the input array, the output array, and a counting array. In the first loop we fill the counting array by iterating over the input array, counting how often each element shows up.

The second loop turns the counting array into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |

prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up five times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the key to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the type that you provided, not the key returned from that function. So this can be any custom struct. For example you could sort strings by length. Just use the size() function in extract_key, and clamp the length to at most 255. You could write a modified version of counting_sort that tells you where the position of the last partition is, so that you can then sort all long strings using std::sort. (which should be a small subset of all your strings so that the second pass on those strings should be fast)

The above algorithm stores the sorted elements in a separate array. But it doesn’t take much to get an in-place sorting algorithm for unsigned chars: One thing we could try is that instead of moving the elements, we swap them.

The most obvious problem that we run into with that is that when we swap the first element out of the first spot, the new element probably doesn’t want to be in the first spot. It might want to be at position 10 instead. The solution for that is simple: Keep on swapping the first element until we find an element that actually wants to be in the first spot. Only when that has happened do we move on to the second item in the array.

The second problem that we then run into is that we’ll find a lot of partitions that are already sorted. We may not know however that those are already sorted. Imagine if we have the number 3 two times and it wants to be in positions six and seven. And lets say that as part of swapping the first element into place, we swap the first 3 to slot six, and the second 3 to slot seven. Now these are sorted and we don’t need to do anything with them any more. But when we advance on from the first element, we will at some point come across the 3 in slot six. And we’ll swap it to spot eight, because that’s the next spot that a 3 would go to. Then we find the next 3 and swap it to spot nine. Then we find the first 3 again and swap it to spot ten etc. This keeps going until we index out of bounds and crash.

The solution for the second problem is to keep a copy of the initial prefix array around so that we can tell when a partition is finished. Then we can skip over those partitions when advancing through the array.

With those two changes we have an in-place sorting algorithm that sorts unsigned chars. This is the American Flag Sort algorithm as described on Wikipedia.

Radix sort takes the above algorithm, and generalizes it to integers that don’t fit into a single unsigned char. The in-place version actually uses a fairly simple trick: Sort one byte at a time. First sort on the highest byte. That will split the input into 256 partitions. Now recursively sort within each of those partitions using the next byte. Keep doing that until you run out of bytes.

If you do the math on that you will find that for a four byte integer you get 256^3 recursive calls: We subdivide into 256 partitions then recurse, subdivide each of those into 256 partitions and recurse again and then subdivide each of the smaller partitions into 256 partitions again and recurse a final time. If we actually did all of those recursions this would be a very slow algorithm. The way to get around that problem is to stop recursing when the number of items in a partition is less than some magic number, and to use std::sort within that sub-partition instead. In my case I stop recursing when a partition is less than 128 elements in size. When I have split an array into partitions that have less than that many elements, I call std::sort within these partitions.

If you’re curious: The reason why the threshold is at 128 is that I’m splitting the input into 256 partitions. If the number of partitions is k, then the complexity of sorting on a single byte is O(n+k). The point where radix sort gets faster than std::sort is when the loop that depends on n starts to dominate over the loop that depends on k. In my implementation that’s somewhere around 0.5k. It’s not easy to move it much lower than that. (I have some ideas, but nothing has worked yet)

It should be clear that the algorithm described in the last section works for unsigned integers of any size. But it also works for collections of unsigned integers, (including pairs and tuples) and strings. Just sort by the first element, then by the next, then by the next etc. until the partition sizes are small enough. (as a matter of fact the paper that Wikipedia names as the source for its American Flag Sort article intended the algorithm as a sorting algorithm for strings)

But it’s straightforward to generalize this to work on signed integers: Just shift all the values up into the range of the unsigned integer of the same size. Meaning for an int16_t, just cast to uint16_t and add 32768.

Michael Herf has also discovered a good way to generalize this to floating point numbers: Reinterpret cast the float to a uint32, then flip every bit if the float was negative, but flip only the sign bit if the float was positive. The same trick works for doubles and uint64s. Michael Herf explains why this works in the linked piece, but the short version of it is this: Positive floating point numbers already sort correctly if we just reinterpret cast them to a uint32. The exponent comes before the mantissa, so we would sort by the exponent first, then by the mantissa. Everything works out. Negative floating point numbers however would sort the wrong way. Flipping all the bits on them fixes that. The final remaining problem is that positive floating point numbers need to sort as bigger than negative numbers, and the easiest way to do that is to flip the sign bit since it’s the most significant bit.

Of the fundamental types that leaves only booleans and the various char types. Chars can just be reinterpret_casted to the unsigned types of the same size. Booleans could also be turned into a unsigned char, but we can also use a custom, more efficient algorithm for booleans: Just use std::partition instead of the normal sorting algorithm. And if we need to recurse because we’re sorting on more than one key, we can recurse into each of the partitions.

And just like that we have generalized in-place radix sort to all types. Now all it takes is a bunch of template magic to make the code do the right thing for each case. I’ll spare you the details of that. It wasn’t fun.

The brief recap of the sorting algorithm for sorting one byte is:

- Count elements and build the prefix sum that tells us where to put the elements
- Swap the first element into place until we find an item that wants to be in the first position (according to the prefix sum)
- Repeat step 2 for all positions

I have implemented this sorting algorithm using Timo Bingmann’s Sound of Sorting. Here is a what it looks (and sounds) like:

As you can see from the video, the algorithm spends most of its time on the first couple elements. Sometimes the array is mostly sorted by the time that the algorithm advances forward from the first item. What you can’t see in the video is the prefix sum array that’s built on the side. Visualizing that would make the algorithm more understandable, (it would make clear how the algorithm can know the final position of elements to swap them directly there) but I haven’t done the work of visualizing that.

If we want to sort multiple bytes we recurse into each of the 256 partitions and do a sort within those using the next byte. But that’s not the slow part of this. The slow part is step 2 and step 3.

If you profile this you will find that this is spending all of its time on the swapping. At first I thought that that was because of cache misses. Usually when the line of assembly that’s taking a lot of time is dereferencing a pointer, that’s a cache miss. I’ll explain what the real problem was further down, but even though my intuition was wrong it drove me towards a good speed up: If we have a cache miss on the first element, why not try swapping the second element into place while waiting for the cache miss on the first one?

I already have to keep information about which elements are done swapping, so I can skip over those. So what I do is that I Iterate over all elements that have not yet been swapped into place, and I swap them into place. In one pass over the array, this will swap at least half of all elements into place. To see why, let’s think how this works in this list: { 4, 3, 1, 2 }: We look at the first element, the 4, and swap it with the 2 at the end, giving us this list: { 2, 3, 1, 4 }, then we look at the second element, the 3, and swap it with the 1, giving us this list: { 2, 1, 3, 4 } then we have iterated half-way through the list and find that all the remaining elements are sorted, (we do this by checking that the offset stored in the prefix sum array is the same as the initial offset of the next partition) so we’re done, but our list is not sorted. The solution for that is to say that when we get to the end of the list, we just start over from the beginning, swapping all unsorted elements into place. In that case we only need to swap the 2 into place to get { 1, 2, 3, 4 } at which point we know that all partitions are sorted and we can stop.

In Sound of Sorting that looks like this:

This is what the above algorithm looks like in code:

struct PartitionInfo { PartitionInfo() : count(0) { } union { size_t count; size_t offset; }; size_t next_offset; }; template<typename It, typename ExtractKey> void ska_byte_sort(It begin, It end, ExtractKey & extract_key) { PartitionInfo partitions[256]; for (It it = begin; it != end; ++it) { ++partitions[extract_key(*it)].count; } uint8_t remaining_partitions[256]; size_t total = 0; int num_partitions = 0; for (int i = 0; i < 256; ++i) { size_t count = partitions[i].count; if (count) { partitions[i].offset = total; total += count; remaining_partitions[num_partitions] = i; ++num_partitions; } partitions[i].next_offset = total; } for (uint8_t * last_remaining = remaining_partitions + num_partitions, * end_partition = remaining_partitions + 1; last_remaining > end_partition;) { last_remaining = custom_std_partition(remaining_partitions, last_remaining, [&](uint8_t partition) { size_t & begin_offset = partitions[partition].offset; size_t & end_offset = partitions[partition].next_offset; if (begin_offset == end_offset) return false; unroll_loop_four_times(begin + begin_offset, end_offset - begin_offset, [partitions = partitions, begin, &extract_key, sort_data](It it) { uint8_t this_partition = extract_key(*it); size_t offset = partitions[this_partition].offset++; std::iter_swap(it, begin + offset); }); return begin_offset != end_offset; }); } }

The algorithm starts off similar to counting sort above: I count how many items fall into each partition. But I changed the second loop: In the second loop I build an array of indices into all the partitions that have at least one element in them. I need this because I need some way to keep track of all the partitions that have not been finished yet. Also I store the end index for each partition in the next_offset variable. That will allow me to check whether a partition is finished sorting.

The third loop is much more complicated than counting sort. It’s three nested loops, and only the outermost is a normal for loop:

The outer loop iterates over all of the remaining unsorted partitions. It stops when there is only one unsorted partition remaining. That last partition does not need to be sorted if all other partitions are already sorted. This is an important optimization because the case where all elements fall into only one partition is quite common: When sorting four byte integers, if all integers are small, then in the first call to this function, which sorts on the highest byte, all of the keys will have the same value and will fall into one partition. In that case this algorithm will immediately recurse to the next byte.

The middle loop uses std::partition to remove finished partitions from the list of remaining partitions. I use a custom version of std::partition because std::partition will unroll its internal loop, and I do not want that. I need the innermost loop to be unrolled instead. But the behavior of custom_std_partition is identical to that of std::partition. What this loop means is that if the items fall into partitions of different sizes, say for the input sequence { 3, 3, 3, 3, 2, 5, 1, 4, 5, 5, 3, 3, 5, 3, 3 } where the partitions for 3 and 5 are larger than the other partitions, this will very quickly finish the partitions for 1, 2 and 4, and then after that the outer loop and inner loop only have to iterate over the partitions for 3 and 5. You might think that I could use std::remove_if here instead of std::partition, but I need this to be non-destructive, because I will need the same list of partitions when making recursive calls. (not shown in this code listing)

The innermost loop finally swaps elements. It just iterates over all remaining unsorted elements in a partition and swaps them into their final position. This would be a normal for loop, except I need this loop unrolled to get fast speeds. So I wrote a function called “unroll_loop_four_times” that takes an iterator and a loop count and then unrolls the loop.

This new algorithm was immediately much faster than American Flag Sort. Which made sense because I thought I had tricked the cache misses. But as soon as I profiled this I noticed that this new sorting algorithm actually had slightly more cache misses. It also had more branch mispredictions. It also executed more instructions. But somehow it took less time. This was quite puzzling so I profiled it whichever way I could. For example I ran it in Valgrind to see what Valgrind thought should be happening. In Valgrind this new algorithm was actually slower than American Flag Sort. That makes sense: Valgrind is just a simulator, so something that executes more instructions, has slightly more cache misses and slightly more branch mispredictions would be slower. But why would it be faster running on real hardware?

It took me more than a day of staring at profiling numbers before I realized why this was faster: It has better instruction level parallelism. You couldn’t have invented this algorithm on old computers because it would have been slower on old computers. The big problem with American Flag Sort is that it has to wait for the current swap to finish before it can start on the next swap. It doesn’t matter that there is no cache-miss: Modern CPUs could execute several swaps at once if only they didn’t have to wait for the previous one to finish. Unrolling the inner loop also helps to ensure this. Modern CPUs are amazing, so they could actually run several loops in parallel even without loop unrolling, but the loop unrolling helps.

The Linux perf command has a metric called “instructions per cycle” which measures instruction level parallelism. In American Flag Sort my CPU achieves 1.61 instructions per cycle. In this new sorting algorithm it achieves 2.24 instructions per cycle. It doesn’t matter if you have to do a few instructions more, if you can do 40% more at a time.

And the thing about cache misses and branch mispredictions turned out to be a red herring: The numbers for those are actually very low for both algorithms. So the slight increase that I saw was a slight increase to a low number. Since there are only 256 possible insertion points, chances are that a good portion of them are always going to be in the cache. And for many real world inputs the number of possible insertion points will actually be much lower. For example when sorting strings, you usually get less than thirty because we simply don’t use that many different characters.

All that being said, for small collections American Flag Sort is faster. The instruction level parallelism really seems to kick in at collections of more than a thousand elements. So my final sort algorithm actually looks at the number of elements in the collection, and if it’s less than 128 I call std::sort, if it’s less than 1024 I call American Flag Sort, and if it’s more than that I run my new sorting algorithm.

std::sort is actually a similar combination, mixing quick sort, insertion sort and heap sort, so in a sense those are also part of my algorithm. If I tried hard enough, I could construct an input sequence that actually uses all of these sorting algorithms. That input sequence would be my very worst case: I would have to trigger the worst case behavior of radix sort so that my algorithm falls back to std::sort, and then I would also have to trigger the worst case behavior of quick sort so that std::sort falls back to heap sort. So let’s talk about worst cases and best cases.

The best case for my implementation of radix sort is if the inputs fit in few partitions. For example if I have a thousand items and they all fall into only three partitions, (say I just have the number 1 a hundred times, the number 2 four hundred times, and the number 3 five hundred times) then my outer loops do very little and my inner loop can swap everything into place in nice long uninterrupted runs.

My other best case is on already sorted sequences: In that case I iterate over the data exactly twice, once to look at each item, and once to swap each item with itself.

The worst case for my implementation can only be reached when sorting variable sized data, like strings. For fixed size keys like integers or floats, I don’t think there is a really bad case for my algorithm. One way to construct the worst case is to sort the strings “a”, “ab”, “abc”, “abcd”, “abcde”, “abcdef” etc. Since radix sort looks at one byte at a time, and that byte only allows it to split off one item, this would take O(n^2) time. My implementation detects this by recording how many recursive calls there were. If there are too many, I fall back to std::sort. Depending on your implementation of quick sort, this could also be the worst case for quick sort, in which case std::sort falls back to heap sort. I debugged this briefly and it seemed like std::sort did not fall back to heap sort for my test case. The reason for that is that my test case was sorted data and std::sort seems to use the median-of-three rule for pivot selection, which selects a good pivot on already sorted sequences. Knowing that, it’s probably possible to create sequences that hit the worst case both for my algorithm and for the quick sort used in std::sort, in which case the algorithm would fall back to heap sort. But I haven’t attempted to construct such a sequence.

I don’t know how common this case is in the real world, but one trick I took from the boost implementation of radix sort is that I skip over common prefixes. So if you’re sorting log messages and you have a lot of messages that start with “warning:” or “error:” then my implementation of radix sort would first sort those into separate partitions, and then within each of those partitions it would skip over the common prefix and continue sorting at the first differing character. That behavior should help reduce how often we hit the worst case.

Currently I fall back to std::sort if my code has to recurse more than sixteen times. I picked that number because that was the first power of two for which the worst case detection did not trigger when sorting some log files on my computer.

The sorting algorithm that I provide as a library is called “Ska Sort”. Because I’m not going to come up with new algorithms very often in my lifetime, so might as well put my name on one when I do. The improved algorithm for sorting bytes that I described above in the sections “Optimizing the Inner Loop” and “Implementation Details” is only a small part of that. That algorithm is called “Ska Byte Sort”.

In summary, Ska Sort:

- Is an in-place radix sort algorithm
- Sorts one byte at a time (into 256 partitions)
- Falls back to std::sort if a collection contains less than some threshold of items (currently 128)
- Uses the inner loop of American Flag Sort if a collection contains less than a larger threshold of items (currently 1024)
- Uses Ska Byte Sort if the collection is larger than that
- Calls itself recursively on each of the 256 partitions using the next byte as the sort key
- Falls back to std::sort if it recurses too many times (currently 16 times)
- Uses std::partition to sort booleans
- Automatically converts signed integers, floats and char types to the correct unsigned integer type
- Automatically deals with pairs, tuples, strings, vectors and arrays by sorting one element at a time
- Skips over common prefixes of collections. (for example when sorting strings)
- Provides two customization points to extract the sort key from an object: A function object that can be passed to the algorithm, or a function called to_radix_sort_key() that can be placed in the namespace of your type

So Ska Sort is a complicated algorithm. Certainly more complicated than a simple quick sort. One of the reasons for this is that in Ska Sort, I have a lot more information about the types that I’m sorting. In comparison based sorting algorithms all I have is a comparison function that returns a bool. In Ska Sort I can know that “for this collection, I first have to sort on a boolean, then on a float” and I can write custom code for both of those cases. In fact I often need custom code: The code that sorts tuples has to be different from the code that sorts strings. Sure, they have the same inner loop, but they both need to do different work to get to that inner loop. In comparison based sorting you get the same code for all types.

If you’ve got enough time on your hands that you clicked on the pieces I linked above, you will find that there are two optimizations that are considered important in my sources that I didn’t do.

The first is that the piece that talks about sorting floating point numbers sorts 11 bits at a time, instead of one byte at a time. Meaning it subdivides the range into 2048 partitions instead of 256 partitions. The benefit of this is that you can sort a four byte integer (or a four byte float) in three passes instead of four passes. I tried this in my last blog post and found it to only be faster for a few cases. In most cases it was slower than sorting one byte at a time. It’s probably worth trying that trick again for in-place radix sort, but I didn’t do that.

The second is that the American Flag Sort paper talks about managing recursions manually. Instead of making recursive calls, they keep a stack of all the partitions that still need to be sorted. Then they loop until that stack is empty. I didn’t attempt this optimization because my code is already far too complex. This optimization is easier to do when you only have to sort strings because you always use the same function to extract the current byte. But if you can sort ints, floats, tuples, vectors, strings and more, this is complicated.

Finally we get to how fast this algorithm actually is. Since my last blog post I’ve changed how I calculate these numbers. In my last blog post I actually made a big mistake: I measured how long it takes to set up my test data and to then sort it. The problem with that is that the set up can actually be a significant portion of the time. So this time I also measure the set up separately and subtract that time from the measurements so that I’m left with only the time it takes to actually sort the data. With that let’s get to our first measurement: Sorting integers: (generated using std::uniform_int_distribution)

This graph shows how long it takes to sort various numbers of items. I didn’t mention ska_sort_copy before, but it’s essentially the algorithm from my last blog post, except that I changed it so that it falls back to ska_sort instead of falling back to std::sort. (ska_sort may still decide to fall back to std::sort of course)

One problem I have with this graph that even though I made the scale logarithmic, it’s still very difficult to see what’s going on. Last time I added another line at the bottom that showed the relative scale, but this time I have a better approach. Instead of a logarithmic scale, I can divide the total time by the number of items, so that I get the time that the sort algorithm spends per item:

With this visualization, we can see much more clearly what’s going on. All pictures below use “nanoseconds per item” as scale, like in this graph. Let’s analyze this graph a little:

For the first couple items we see that the lines are essentially the same. That’s because for less than 128 elements, I fall back to std::sort. So you would expect all of the lines to be exactly the same. Any difference in that area is measurement noise.

Then past that we see that std::sort is exactly a O(n log n) sorting algorithm. It goes up linearly when we divide the time by the number of items, which is exactly what you’d expect for O(n log n). It’s actually impressive how it forms an exactly straight line once we’re past a small number of items. ska_sort_copy is truly an O(n) sorting algorithm: The cost per item stays mostly constant as the total number of items increases. But ska_sort is… more complicated.

Those waves that we’re seeing in the ska_sort line have to do with the number of recursive calls: ska_sort is fastest when the number of items is large. That’s why the line starts off as decreasing. But then at some point we have to recurse into a bunch of partitions that are just over 128 items in size, which is slow. Then those partitions grow as the number of items increase and the algorithm is faster again, until we get to a point where the partitions are over 128 elements in size again, and we need to add another recursive step. One way to visualize this is to look at the graph of sorting a collection of int8_t:

As you can see the cost per item goes down dramatically at the beginning. Every time that the algorithm has to recurse into other partitions, we see that initial part of the curve overlaid, giving us the waves of the graph for sorting ints.

One point I made above is that ska_sort is fastest when there are few partitions to sort elements into. So let’s see what happens when we use a std::geometric_distribution instead of a std::uniform_int_distribution:

This graph is sorting four byte ints again, so you would expect to see the same “waves” that we saw in the uniformly distributed ints. I’m using a std::geometric_distribution with 0.001 as the constructor argument. Which means it generates numbers from 0 to roughly 18000, but most numbers will be close to zero. (in theory it can generate numbers that are much bigger, but 18882 is the biggest number I measured when generating the above data) And since most numbers are close to zero, we will see few recursions and because of that we see few waves, making this many times faster than std::sort.

Btw that bump at the beginning is surprising to me. For all other data that I could find, ska_sort starts to beat std::sort at 128 items. Here it seems like ska_sort only starts to win later. I don’t know why that is. I might investigate it at a different point, but I don’t want to change the threshold because this is a good number for all other data. Changing the threshold would move all other lines up by a little. Also since we’re sorting few items there, the difference in absolute terms is not that big: 15.8 microseconds to 16.7 microseconds for 128 items, and 32.3 microseconds to 32.9 microseconds for 256 items.

Let’s look at some more use cases. Here is my “real world” use case that I talked about in the last blog post, where I had to sort enemies in a game by distance to the player. But I wanted all enemies that are currently in combat to come first, sorted by distance, followed by all enemies that are not in combat, also sorted by distance. So I sort by a std::pair:

This turned out to be the same graph as sorting ints, except every line is shifted up by a bit. Which I guess I should have expected. But it’s good to see that the conversion trick that I have to do for floats and the splitting I have to do for pairs does not add significant overhead. A more interesting graph is the one for sorting int64s:

This is the point where ska_sort_copy is sometimes slower than ska_sort. I actually decided to lower the threshold where ska_sort_copy falls back to ska_sort: It will now only do the copying radix sort when it has to do less than eight iterations over the input data. Meaning I have changed the code, so that for int64s ska_sort_copy actually just calls ska_sort. Based on the above graph you might argue that it should still do the copying radix sort, but here is a measurement of sorting an 128 byte struct that has an int64 as a sort key:

As the structs get larger, ska_sort_copy gets slower. Because of this I decided to make ska_sort_copy fall back to ska_sort for sort keys of this size.

One other thing to notice from the above graph is that it looks like std::sort and ska_sort get closer. So does ska_sort ever become slower? It doesn’t look like it. Here’s what it looks like when I sort a 1024 byte struct:

Once again this is a very interesting graph. I wish I could spend time on investigating where that large gap at the end comes from. It’s not measurement noise. It’s reproducible. The way I build these graphs is that I run Google Benchmark thirty times to reduce the chance of random variation.

Talking about large data, in my last blog post my worst case was sorting a struct that has a 256 byte sort key. Which in this case means using a std::array as a sort key. This was very slow on copying radix sort because we actually have to do 256 passes over the data. In-place radix sort only has to look at enough bytes until it’s able to tell two pieces of data apart, so it might be faster. And looking at benchmarks, it seems like it is:

ska_sort_copy will fall back to ska_sort for this input, so its graph will look identical. So I fixed the worst case from my last blog post. One thing that I couldn’t profile in my last blog post was sorting of strings, because ska_sort_copy simply can not sort strings because it can not sort variable sized data.

So let’s look at what happens when I’m sorting strings:

The way I build the input data here is that I take between one and three random words from my words file and concatenate them. Once again I am very happy to see how well my algorithm does. But this was to be expected: It was already known that radix sort is great for sorting strings.

But sorting strings is also when I can hit my worst case. In theory you might get cases where you have to do many passes over the data, because there simply are a lot of bytes in the input data and a lot of them are similar. So I tried what happens when I sort strings of different length, concatenating between zero and ten words from my words file:

What we see here is that ska_sort seems to become a O(n log n) algorithm when sorting millions of long strings. However it doesn’t get slower than std::sort. My best guess for the curve going up like that is that ska_sort has to do a lot of recursions on this data. It doesn’t do enough recursions to trigger my worst case detection, but those recursions are still expensive because they require one more pass over the data.

One thing I tried was lowering my recursion limit to eight, in which case I do hit my worst case detection starting at a million items. But the graph looks essentially unchanged in that case. The reason is that it’s a false positive: I didn’t actually hit my worst case. The sorting algorithm still succeeded at splitting the data into many smaller partitions, so when I fall back to std::sort, it has a much easier time than it would have had sorting the whole range.

Finally, here is what it looks like when I sort containers that are slightly more complicated than strings:

For this I generate vectors with between 0 and 20 ints in them. So I’m sorting a vector of vectors. That spike at the end is very interesting. My detection for too many recursive calls does not trigger here, so I’m not sure why sorting gets so much more expensive. Maybe my CPU just doesn’t like dealing with this much data. But I’m happy to report that ska_sort is faster than std::sort throughout, like in all other graphs.

Since ska_sort seems to always be faster, I also generated input data that intentionally triggers the worst case for ska_sort. The below graph hits the worst case immediately starting at 128 elements. But ska_sort detects that and falls back to std::sort:

For this I’m sorting random combinations of the vectors {}, { 0 }, { 0, 1 }, { 0, 1, 2 }, … { 0, 1, 2, … , 126, 127 }. Since each element only tells my algorithm how to split off 1/128th of the input data, it would have to recurse 128 times. But at the sixteenth recursion ska_sort gives up and falls back to std::sort. In the above graph you see how much overhead that is. The overhead is bigger than I like, especially for large collections, but for smaller collections it seems to be very low. I’m not happy that this overhead exists, but I’m happy that ska_sort detects the worst case and at least doesn’t go O(n^2).

Ska_sort isn’t perfect and it has problems. I do believe that it will be faster than std::sort for nearly all data, and it should almost always be preferred over std::sort.

The biggest problem it has is the complexity of the code. Especially the template magic to recursively sort on consecutive bytes. So for example currently when sorting on a std::pair<int, int> this will instantiate the sorting algorithm eight times, because there will be eight different functions for extracting a byte out of this data. I can think of ways to reduce that number, but they might be associated with runtime overhead. This needs more investigation, but the complexity of the code is also making these kinds of changes difficult. For now you can get slow compile times with this if your sort key is complex. The easiest way to get around that is to try to use a simpler sort key.

Another problem is that I’m not sure what to do for data that I can’t sort. For example this algorithm can not sort a vector of std::sets. The reason is that std::set does not have random access operators, and I need random access when sorting on one element at a time. I could write code that allows me to sort std::sets by using std::advance on iterators, but it might be slow. Alternatively I could also fall back to std::sort. Right now I do neither: I simply give a compiler error. The reason for that is that I provide a customization point, a function called to_radix_sort_key(), that allows you to write custom code to turn your structs into sortable data. If I did an automatic fallback whenever I can’t sort something, using that customization point would be more annoying: Right now you get an error message when you need to provide it, and when you have provided it, the error goes away. If I fall back to std::sort for data that I can’t sort, your only feedback for would be that sorting is slightly slower. You would have to either profile this and compare it to std::sort, or you would have to step through the sorting function to be sure that it actually uses your implementation of to_radix_sort_key(). So for now I decided on giving an error message when I can’t sort a type. And then you can decide whether you want to implement to_radix_sort_key() or whether you want to use std::sort.

Another problem is that right now there can only be one sorting behavior per type. You have to provide me with a sort key, and if you provide me with an integer, I will sort your data in increasing order. If you wanted it in decreasing order, there is currently no easy interface to do that. For integers you could solve this by flipping the sign in your key function, so this might not be too bad. But it gets more difficult for strings: If you provide me a string then I will sort the string, case sensitive, in increasing order. There is currently no way to do a case-insensitive sort for strings. (or maybe you want number aware sorting so that “bar100” comes after “bar99”, also can’t do that right now) I think this is a solvable problem, I just haven’t done the work yet. Since the interface of this sorting algorithm works differently from existing sorting algorithms, I have to invent new customization points.

I have uploaded the code for this to github. It’s licensed under the boost license.

The interface works slightly differently from other sorting algorithms. Instead of providing a comparison function, you provide a function which returns the sort key that the sorting algorithm uses to sort your data. For example let’s say you have a vector of enemies, and you want to sort them by distance to the player. But you want all enemies that are in combat with the player to come first, sorted by distance, and then all enemies that are not in combat, also sorted by distance. The way to do that in a classic sorting algorithm would be like this:

std::sort(enemies.begin(), enemies.end(), [](const Enemy & lhs, const Enemy & rhs) { return std::make_tuple(!is_in_combat(lhs), distance_to_player(lhs)) < std::make_tuple(!is_in_combat(rhs), distance_to_player(rhs)); });

In ska_sort, you would do this instead:

ska_sort(enemies.begin(), enemies.end(), [](const Enemey & enemy) { return std::make_tuple(!is_in_combat(enemy), distance_to_player(enemy)); });

As you can see the transformation is fairly straightforward. Similarly let’s say you have a bunch of people and you want to sort them by last name, then first name. You could do this:

ska_sort(contacts.begin(), contacts.end(), [](const Contact & c) { return std::tie(c.last_name, c.first_name); });

It is important that I use std::tie here, because presumably last_name and first_name are strings, and you don’t want to copy those. std::tie will capture them by reference.

Oh and of course if you just have a vector of simple types, you can just sort them directly:

ska_sort(durations.begin(), durations.end());

In this I assume that “durations” is a vector of doubles, and you might want to sort them to find the median, 90th percentile, 99th percentile etc. Since ska_sort can already sort doubles, no custom code is required.

There is one final case and that is when sorting a collection of custom types. ska_sort only takes a single customization function, but what do you do if you have a custom type that’s nested? In that case my algorithm would have to recurse into the top-level-type and would then come across a type that it doesn’t understand. When this happens you will get an error message about a missing overload for to_radix_sort_key(). What you have to do is provide an implementation of the function to_radix_sort_key() that can be found using ADL for your custom type:

struct CustomInt { int i; }; int to_radix_sort_key(const CustomInt & i) { return i.i; } //... later somewhere std::vector<std::vector<CustomInt>> collections = ...; ska_sort(collections.begin(), collections.end());

In this case ska_sort will call to_radix_sort_key() for the nested CustomInts. You have to do this because there is no efficient way to provide a custom extract_key function at the top level. (at the top level you would have to convert the std::vector<CustomInt> to a std::vector<int>, and that requires a copy)

Finally I also provide a copying sort function, ska_sort_copy, which will be much faster for small keys. To use it you need to provide a second buffer that’s the same size as the input buffer. Then the return value of the function will tell you whether the final sorted sequence is in the second buffer (the function returns true) or in the first buffer (the function return false).

std::vector<int> temp_buffer(to_sort.size()); if (ska_sort_copy(to_sort.begin(), to_sort.end(), temp_buffer.begin())) to_sort.swap(temp_buffer);

In this code I allocate a temp buffer, and if the function tells me that the result ended up in the temp buffer, I swap it with the input buffer. Depending on your use case you might not have to do a swap. And to make this fast you wouldn’t want to allocate a temp buffer just for the sorting. You’d want to re-use that buffer.

I’ve talked to a few people about this, and the usual questions I get are all related to people not believing that this is actually faster.

Q: Isn’t Radix Sort O(n+m) where m is large so that it’s actually slower than a O(n log n) algorithm? (or alternatively: Isn’t radix sort O(n*m) where m is larger than log n?)

A: Yes, radix sort has large constant factors, but in my benchmarks it starts to beat std::sort at 128 elements. And if you have a large collection, say a thousand elements, radix sort is a very clear winner.

Q: Doesn’t Radix Sort degrade to a O(n log n) algorithm? (or alternatively: Isn’t the worst case of Radix Sort O(n log n) or maybe even O(n^2)?)

A: In a sense Radix Sort has to do log(n) passes over the data. When sorting an int16, you have to do two passes over the data. When sorting an int32, you have to do four passes over the data. When sorting an int64 you have to do eight passes etc. However this is not O(n log n) because this is a constant factor that’s independent of the number of elements. If I sort a thousand int32s, I have to do four passes over that data. If I sort a million int32s, I still have to do four passes over that data. The amount of work grows linearly. And if the ints are all different in the first byte, I don’t even have to do the second, third or fourth pass. I only have to do enough passes until I can tell them all apart.

So the worst case for radix sort is O(n*b) where b is the number of bytes that I have to read until I can tell all the elements apart. If you make me sort a lot of long strings, then the number of bytes can be quite large and radix sort may be slow. That is the “worst case” graph above. If you have data where radix sort is slower than std::sort (something that I couldn’t find except when intentionally creating bad data) please let me know. I would be interested to see if we can find some optimizations for those cases. When I tried to build more plausible strings, ska_sort was always clearly faster.

And if you’re sorting something fixed size, like floats, then there simply is no accidental worst case. You are limited by the number of bytes and you will do at most four passes over the data.

Q: If those performance graphs were true, we’d be radix sorting everything.

A: They are true. Not sure what to tell you. The code is on github, so try it for yourself. And yes, I do expect that we will be radix sorting everything. I honestly don’t know why everybody settled on Quick Sort back in the day.

There are a couple obvious improvements that I may make to the algorithm. The algorithm is currently in a good state, but if I ever feel like working on this again, here are three things that I might do:

As I said in the problems section, there is currently no way to sort strings case-insensitive. Adding that specific feature is not too difficult, but you’d want some kind of generic way to customize sorting behavior. Currently all you can do is provide a custom sort key. But you can not change how the algorithm uses that sort key. You always get items sorted in increasing order by looking at one byte at a time.

When I fall back to std::sort, I re-start sorting from the beginning. As I said above I fall back to std::sort when I have split the input into partitions of less than 128 items. But let’s say that one of those partitions is all the strings starting with “warning:” and one partition is all the strings starting with “error:” then when I fall back to std::sort, I could skip the common prefix. I have the information of how many bytes are already sorted. I suspect that the fact that std::sort has to start over from the beginning is the reason why the lines in the graph for sorting strings are so parallel between ska_sort and std::sort. Making this optimization might make the std::sort fallback much faster.

I might also want to write a function that can either take a comparison function, or an extract_key function. The way it would work is that if you pass a function object that takes two arguments, this uses comparison based sorting, and if you pass a function object that takes one argument, this uses radix sorting. The reason for creating a function like that is that it could be backwards compatible to std::sort.

In Summary I have a sorting algorithm that’s faster than std::sort for most inputs. The sorting algorithm is on github and is licensed under the boost license, so give it a try.

I mainly did two things:

- I optimized the inner loop of in-place radix sort, resulting in the ska_byte_sort algorithm
- I provide an algorithm, ska_sort, that can perform Radix Sort on arbitrary types or combinations of types

To use it on custom types you need to provide a function that provides a “sort key” to ska_sort, which should be a int, float, bool, vector, string, or a tuple or pair consisting of one of these. The list of supported types is long: Any primitive type will work or anything with operator[], so std::array and std::deque and others will also work.

If sorting of data is critical to your performance (good chance that it is, considering how important sorting is for several other algorithms) you should try this algorithm. It’s fastest when sorting a large number of elements, but even for small collections it’s never slower than std::sort. (because it uses std::sort when the collection is too small)

The main lessons to learn from this are that even “solved” problems like sorting are worth revisiting every once in a while. And it’s always good to learn the basics properly. I didn’t expect to learn anything from an “Introduction to Algorithms” course but I already wrote this algorithm and I’m also tempted to attempt once again to write a faster hashtable.

If you do use this algorithm in your code, let me know how it goes for you. Thanks!

]]>

But first an explanation of what radix sort is: **Radix sort is a O(n) sorting algorithm working on integer keys.** I’ll explain below how it works, but the claim that there’s an O(n) searching algorithm was surprising to me the first time that I heard it. I always thought there were proofs that sorting had to be O(n log n). Turns out sorting has to be O(n log n) if you use the comparison operator to sort. Radix sort does not use the comparison operator, and because of that it can be faster.

The other reason why I never looked into radix sort is that it only works on integer keys. Which is a huge limitation. Or so I thought. Turns out all this means is that your struct has to be able to provide something that acts somewhat like an integer. **Radix sort can be extended to floats, pairs, tuples and std::array**. So if your struct can provide for example a std::pair<bool, float> and use that as a sort key, you can sort it using radix sort.

I actually do this somewhat often when I write C++ code nowadays. One recent example was that I had to sort enemies in a game that I was working on. I wanted to sort enemies by distance, but I wanted all enemies that were already fighting with the player to come first. So here is what the comparison function looked like:

bool operator<(const Enemy & a, const Enemy & b) { return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a)) < std::make_tuple(!IsInCombat(b), DistanceToPlayer(b)); }

Using that comparison operator will sort the enemies so that all enemies that are in combat with the player come first, (and they’re sorted by distance) and then there will be all enemies that are not in combat with the player. (also sorted by distance)

Except that by using this comparison operator I have to use an O(n log n) sorting algorithm. But you can use radix sort to sort tuples, so I could sort this in O(n). All I have to do is provide this function

auto sort_key(const Enemy & a) { return std::make_tuple(!IsInCombat(a), DistanceToPlayer(a)); }

If I use that sort_key function as input to radix sort, I can sort in O(n) instead of O(n log n). Neat, huh? So how does radix sort work?

Radix sort builds on top of an algorithm called counting sort, so I’ll explain that one first. Counting sort is also a O(n) sorting algorithm that works on integer keys. The big trick is that instead of using the comparison operator, we use integers as indices into an array. The big downside is that we need an array big enough that the largest integer can index into it. For a uint32 that’s 4 gigabytes of memory… But radix sort will overcome that downside, so for now let’s just look at counting sort on bytes. Then all we need is an array of size 256, because that’s big enough that any byte can index into it. I’ll start off by dumping in a full implementation in C++, then I’ll explain how this works.

template<typename It, typename OutIt, typename ExtractKey> void counting_sort(It begin, It end, OutIt out_begin, ExtractKey && extract_key) { size_t counts[256] = {}; for (It it = begin; it != end; ++it) { ++counts[extract_key(*it)]; } size_t total = 0; for (size_t & count : counts) { size_t old_count = count; count = total; total += old_count; } for (; begin != end; ++begin) { std::uint8_t key = extract_key(*begin); out_begin[counts[key]++] = std::move(*begin); } }

There are three loops here: We iterate over the input array, then we iterate over our buffer, then we iterate over the input array a second time and write the sorted data to the output array:

The first loop counts how often each byte comes up. Remember, we can only sort bytes using this version because we only have an array of size 256. But that array is big enough to hold the information of how often each byte shows up.

The second loop turns that buffer into a prefix sum of the counts. So let’s say the array didn’t have 256 entries, but only 8 entries. And let’s say the numbers come up this often:

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

count | 0 | 2 | 1 | 0 | 5 | 1 | 0 | 0 |

prefix sum | 0 | 0 | 2 | 3 | 3 | 8 | 9 | 9 |

So in this case there were nine elements in total. The number 1 showed up twice, the number 2 showed up once, the number 4 showed up 5 times and the number 5 showed up once. So maybe the input sequence was { 4, 4, 2, 4, 1, 1, 4, 5, 4 }.

The final loop now goes over the initial array again and uses the number to look up into the prefix sum array. And lo and behold, that array tells us the final position where we need to store the integer. So when we iterate over that sequence, the 4 goes to position 3, because that’s the value that the prefix sum array tells us. We then increment the value in the array so that the next 4 goes to position 4. The number 2 will go to position 2, the next 4 goes to position 5 (because we incremented the value in the prefix sum array twice already) etc. I recommend that you walk through this once manually to get a feeling for it. The final result of this should be { 1, 1, 2, 4, 4, 4, 4, 4, 5 }.

And just like that we have a sorted array. The prefix sum told us where we have to store everything, and we were able to compute that in linear time.

Also notice how this works on any type, not just on integers. All you have to do is provide the extract_key() function for your type. In the last loop we move the type that you provided, not the key returned from that function. So this can be any custom struct. For example you could sort strings by length. Just use the size() function in extract_key, and clamp the length to at most 255. You could write a modified version of counting_sort that tells you where the position of the last bucket is, so that you can then sort all long strings using std::sort. (which should be a small subset of all your strings so that the second pass on those strings should be fast) I could also get my “enemy sorting” example from above to work: Store the boolean in the highest bit, and use the remaining bits to sort all enemies that are within 127 meters of the player. In my example 1 meter resolution would have been fine, (if one enemy is 1.1 meters away and the other is 1.2 meters away, I don’t care which comes first) and I really don’t care about enemies that are hundreds of meters away.

Counting sort is crazy fast and it really should be used more widely. But it sure would be nice if we could use keys bigger than a uint8_t.

Radix Sort builds on top of counting sort. The big problem with counting sort is that we need that buffer that counts how often every input comes up. If our input contains the number ten million, then our buffer has to be ten million items large because we need to increment the count at position ten million. Not good.

Radix sort builds on top of two neat principles:

1. counting sort is a stable sort. If two entries have the same number, they will stay in the same order.

2. If you sort a numbers by their lowest digit first, and then do a stable sort on higher digits, the result will be a sorted list.

Point 2 is not obvious, so let me walk through an example. Let’s sort the integers {11, 55, 52, 61, 12, 73, 93, 44 } first by their lowest digit. What we get is the list { 11, 61, 52, 12, 73, 93, 44, 55 }. You could try it using counting sort using “i % 10” as the extract_key function. Note that this is a stable sort, so for example 52 stays before 12. If we now do a second counting sort on this using the higher digit, we get the list { 11, 12, 44, 52, 55, 61, 73, 93 }. Which is a sorted list! Try it with counting sort using “i / 10” as the extract_key function.

This is a super neat observation. As long as you use a stable sorting algorithm, you can sort the low digits first and then sort the high digits after that.

So with that the implementation of radix sort is obvious: Just sort using one byte at a time, going from the lowest byte to the highest byte.

Now it should also be clear how to generalize radix sort to pairs, tuples and arrays. For a pair sort using the .second member first, and then sort using the .first member. For tuples and fixed size arrays use every element in decreasing order. (Unfortunately we can not use dynamically sized arrays as keys using this method, so we can for example not use strings as sort keys)

And with that we’re also coming to the biggest downside that radix sort has: This means that if we want to sort a two byte integer, we have to go over the input list four times. (counting sort goes over the list twice, and we have to call counting sort twice) For four bytes we have to go over the input list eight times, and for eight bytes we have to traverse sixteen times. For pairs and tuples this gets even bigger.

So radix sort is O(n), but it’s a large O(n). Counting sort is crazy fast, radix sort is not.

But still there should be some number for n where radix sort is faster than a sorting algorithm with O(n log n) complexity. Let’s find out where that is!

(oh but before we move on I should briefly mention how to make it work for signed integers and floats. Signed integers are somewhat straightforward: Just cast to the unsigned version and offset the values so that every value is positive. So for example for a int8_t, cast to uint8_t and add 128 so that -128 turns into 0, 0 turns into 128 and 127 turns into 255. For floats you have to reinterpret_cast to uint32_t then flip every bit if the float was negative, but flip only the sign bit if the float was positive. Michael Herf explains it here. The same approach works for doubles and uint64_t)

To start off with, lets measure how fast it is to sort a single byte using counting sort:

I got a little bit creative on the scales here, so this graph needs some explaining. I measured how fast radix sort (which for one byte is just doing counting sort) and std::sort can sort an array. I measure for each power of two from 2 to 2^30. Since my data growth exponentially, I had to use a logarithmic scale. Then I had a problem because on the logarithmic scale it was difficult to see how big the difference between the two sorting algorithms was, so I added another line that follows a linear scale which shows the relative speed. **That dotted line at the bottom follows a different, linear scale.** But the numbers on it show you how big the relative speed is between the two algorithms.

With that explanation the first thing we notice is that both std::sort and radix sort seem to grow almost linearly. But then the second thing we notice is that even for fairly small numbers, counting sort beats std::sort handily. And as our data set grows, counting sort is between four and six times faster!

Next, let’s see how this holds up when we go from counting sort to radix sort:

When sorting four bytes, radix sort needs to do several passes over the data, and because of that it takes longer for it to beat std::sort. But even for relatively small data sets with a thousand elements, radix sort is several times faster than std::sort.

One interesting thing is that dip at the end: That is me running out of memory. Counting sort is not an in-place sort. It stores the results in a different buffer than the input buffer. Radix sort on an int32 will shuffle the data back and forth between the two buffers four times, so that the results actually do end up in the same original buffer, but it still needs all that extra storage. At the last data point in that graph I’m sorting one billion elements, which are four bytes each, and I need two buffers. That gives me eight gigabytes of RAM. My machine has sixteen gigabytes of RAM. In theory there should be some more space left, but my machine starts slowing down once you use more than half of the available RAM. If I double the size again, radix sort never finishes because it starts swapping memory.

The big surprise from these measurements is that radix sort stays much faster than std::sort even though it now has to go over the input data four more times than when sorting a single byte. It’s hard to see in the graph, but in the underlying numbers it looks like **running radix sort on four bytes is two times slower than running radix sort on one byte**. And apparently std::sort also gets slower when sorting a bigger chunk of data, so radix sort still beats it.

In theory though there should be some data size where radix sort is slower than std::sort. Let’s try increasing the data size some more:

When sorting an int64, radix sort is “only” two to three times faster than std::sort. At least once you have more than 500 elements in your array. Since the difference in scale is linear though, we should be able to decrease the gap further by sorting bigger input data:

Aha, it looks like when we use sixteen bytes of data as the sort key, radix sort is finally slower than std::sort. At this point my implementation of radix sort has to do 18 passes over the input data, shuffling back and forth between the two buffers sixteen times. At some point that had to be slow. Note though that this does not mean that you can not sort large structs using radix sort. It only means that the sort key that you provide to radix sort has to be small. The size of the struct matters less. To prove that point here are the measurements for sorting a sixteen byte struct using an eight byte key:

When using a smaller key, radix sort is faster again. One thing to note though is that it’s not as fast as as when we were just sorting an int64. That suggests that maybe radix sort gets slower relative to std::sort as the size of the struct increases. The performance depends on the key size and the data size. So I decided to calculate the relative speed for a vector of size 2048. Meaning I did the above measurements with 2048 for “number of elements” and varied the key size and the data size, and plotted that in a table:

Time (in microseconds) to sort 2048 elements | key size | ||||||
---|---|---|---|---|---|---|---|

1 | 4 | 16 | 64 | 256 | |||

data size | 1 | radix sort | 16 | ||||

std::sort | 81 | ||||||

relative speed | 5.2 |
||||||

4 | radix sort | 18 | 24 | ||||

std::sort | 88 | 87 | |||||

relative speed | 4.8 |
3.7 |
|||||

16 | radix sort | 24 | 40 | 123 | |||

std::sort | 100 | 97 | 112 | ||||

relative speed | 4.1 |
2.4 |
0.9 |
||||

64 | radix sort | 57 | 119 | 347 | 1881 | ||

std::sort | 141 | 138 | 150 | 254 | |||

relative speed | 2.4 |
1.2 |
0.4 |
0.1 |
|||

256 | radix sort | 144 | 341 | 1195 | 5501 | 17657 | |

std::sort | 413 | 443 | 459 | 577 | 698 | ||

relative speed | 2.9 |
1.3 |
0.4 |
0.1 |
0.04 |

One thing I should note is that my benchmark loop also generated 2048 random numbers. So the measurements above are really for generating 2048 random numbers using std::mt19937_64, and then sorting those random numbers. For the key size of 64 I had to generate eight random numbers and for the key size of 256 I had to generate 32 random numbers, so the overhead for the random number generation is larger in those columns.

So what can we read from this table? There are two main things to notice:

- As the key size increases (reading from left to right), radix sort gets much slower. std::sort also slows down, but not by as much. When sorting one byte (two passes) radix sort is always faster. Same thing when sorting four bytes (five passes). At sixteen bytes (eighteen passes in my implementation, but you could do it in seventeen) radix sort starts to lose, especially when the data to move around is large. Moving the data back and forth sixteen times is just slow.
- When the data size increases (reading from top to bottom) radix sort also gets slower relative to std::sort. However it looks like a data size increase does not cause radix sort to switch from being faster to being slower. In fact the gap in absolute terms actually widens every time that the data size increases.

The main reason why std::sort is not affected as much by an increase in key size is that it uses std::lexicographical_compare. Meaning if I have a key of size 256, which in my case was just a std::array<uint64_t, 32> and **if the first entry in the key differs, then std::sort can early out** and doesn’t even have to look at the remaining bytes. Since radix sort starts sorting at the least significant digit, it has to actually look at every single byte in the key. There is a variant of radix sort that looks at the most significant digit first, so it should perform better for larger keys, but I won’t talk about that too much in this piece.

All of this being said, how does radix sort perform on my initial use case of sorting a std::pair<bool, float>?

Radix sort performs very well on my initial use case. It’s faster starting at 64 elements in the array. That’s because this is the same speed as sorting by a four byte int, and then by a boolean. And sorting by a boolean is the fastest possible version of counting sort. You don’t even need the buffer of 256 counts, you only need to count how many “false” elements there are in the array. So adding a boolean to something will barely slow it down when using radix sort. Actually let’s talk about some more optimizations:

- As just mentioned, you can write a faster version of counting sort for booleans. You don’t need to keep track of 256 counts for booleans, you just need one: How many “false” elements there were. Then you write all “true” elements starting at that offset, and all “false” elements starting at offset 0.
- When sorting multiple bytes, you can combine the first two loops of all of them. For example when sorting four bytes, the straightforward implementation is to just call counting_sort four times. Then you would get eight passes over the input data. But if you allocate four counting buffers of size 256 on the stack, you can initialize all of them in one loop, and turn all of them into prefix sums in one loop. Then you only have to do five passes in total over the data.
- The article that explains how to sort floating point numbers using radix sort also has a trick of sorting 11 bits at a time. Instead of sorting one byte at a time. The benefit of that is that you can sort a 32 bit number in four passes instead of five. I tried that, and for me it only gave me performance benefits if the input data is between 1024 and 4096 elements large. For any input sizes larger or smaller than that, sorting one byte at a time was faster. The reason for these numbers is that when sorting 11 bits, the counting array is of size 2048, and apparently if you do the math, the algorithm is fastest when the counting array is roughly the same size as the input data. I haven’t looked too much into that.
- In my implementation of counting_sort above I use an array of type size_t[256]. If you know that each of the buckets in there is less than four billion elements in size, you could also use a uint32_t[256]. In fact I use a different type depending on the size of the input data. This does actually help because the main cost in counting_sort is cache misses. So if your count array is small, that means more of the other arrays can be in the cache.

Now that we know that radix sort can be fast, we can write a sorting algorithm that has O(n) for many inputs. I think that std::sort should be implemented like this:

template<typename It, typename OutIt, typename ExtractKey> bool linear_sort(It begin, It end, OutIt buffer_begin, ExtractKey && key) { std::ptrdiff_t num_elements = end - begin; auto compare_keys = [&](auto && lhs, auto && rhs) { return key(lhs) < key(rhs); }; if (num_elements <= 16) { insertion_sort(begin, end, compare_keys); return false; } else if (num_elements <= 1024 || radix_sort_pass_count<ExtractKey, It>::value > 10) { std::sort(begin, end, compare_keys); return false; } else return radix_sort(begin, end, buffer_begin, std::forward<ExtractKey>(key)); }

First, the interface: Since this calls radix_sort, you have to provide a buffer that has the same size as the input array and a function to extract the sort key from the object. There could be a second version of this function with a default argument for the extract key function that just returns the value directly. So you sort can any type that radix sort supports. You would only have to provide an ExtractKey function for custom structs.

Next we decide which algorithm to use based on the number of elements. For a small number of elements, insertion_sort is generally thought to be the fastest algorithm. And for that I build a comparison function from the ExtractKey object. For a medium number of elements I would call std::sort. And for a large number of elements I would call radix_sort.

There is one more case where I call std::sort instead of radix_sort, and that is when radix sort would have to do a lot of passes over the input data. I can calculate how many passes radix sort has to do at compile time.

And finally the return value is a boolean that says whether the result was stored in the input buffer or in the output buffer. Depending on how many passes radix_sort has to do, the result could end up in either. So for example when sorting an int32, the function would return false because radix sort does four passes and the data ends up back in the input array, but when sorting a std::pair<bool, float> the function would return true because radix sort does five passes and the data ends up in the output array. The calling function then has to do something sensible with this information. If the two buffers are std::vectors, it could just swap them afterwards to get the data where it wants it to be.

Based on the benchmarks above, this algorithm would be several times faster than current implementations of std::sort for many inputs, and it would never be slower than std::sort.

How would we go about getting something like this into the standard? Well clearly we can’t change the interface of std::sort at this point. We could provide a function called std::sort_copy though that would have the above interface and could call radix_sort when that makes sense.

There is an in-place version of radix sort. If we used that, we could even use radix sort in std::sort. Except that we can’t get the ExtractKey function because std::sort takes a comparison functor. One solution for that would be to provide a customization point called std::sort_key which would work similar to std::hash. If your class provides a specialization for std::sort_key, std::sort is allowed to use an in-place version of radix sort when it makes sense, or it could build a comparison operator using std::sort_key and fall back to the old behavior.

This entire time we were building on top of counting_sort which needs to copy results to a different buffer. But if we could provide a version of radix sort that does all operations in one buffer, we could get that version into std::sort.

The in-place version of radix sort has one other very nice benefit: It starts sorting at the most significant digit. The version of radix sort that we used above started sorting at the least significant digit. This made radix sort slow for large keys because it always had to look at every byte of the key. **The in-place version could early out after looking at the first byte, which would potentially make it much faster for large keys**.

I will sketch out how in-place radix sort works, but I’ll leave the work of implementing it and measuring it to “future work.” I’ll explain why I didn’t implement it after I explain how it works.

We can’t build on top of counting sort because counting sort needs to copy results into a new buffer. But there is an in-place O(n) sorting algorithm called American Flag Sort. It works similar to counting sort in that you need an array to count how often each element appears. So if we sort bytes, we also need a count array of 256 elements. Then we also compute the prefix sum of this count array, like we did in counting sort. Only the final loop is different:

In the final loop of counting sort, we would directly move elements into the position that they need to be at. The prefix sum would tell us directly what the right position is. Since American Flag Sort is in-place, we need to swap instead. So let’s say the first element in the array actually wants to be at position 5. We swap it with whatever was at position 5. If the new element actually wants to be at position 3, we swap it with whatever was at position 3. We keep doing this until we find an element that actually wants to be the first element of the array. Only then do we move on to the second element in the array.

What tends to happen is that all the swapping at the first element moves a lot of elements into the right positions. Then all the swapping at the second element moves a lot more elements into the right position. So by the time that we’re a third of the way through the array, most elements are actually already sorted. So a lot of work happens on the first few elements, but at the later elements you mostly just determine that the elements are already where they want to be.

If you implement this you will need two copies of the prefix array. One copy that you change as you swap elements into place, (so that if two elements want to be in the bucket starting at position 5, the first one gets moved to position 5, and the second one to position 6) and one copy that you leave unchanged so that you can determine whether the element is already in the bucket that it wants to be in. (otherwise the element that you swapped into position 5 would think that its bucket now starts at position 6)

Now that we know how American Flag Sort works, we can implement radix sort on top of this. For that, American Flag Sort has to return the 256 offsets into the 256 buckets that it created. Then we call American Flag Sort again to sort within each of those 256 buckets, using the next byte in the key as the byte that we want to sort on. Meaning for a four byte integer, we have to sort recursively within a smaller bucket three more times after the initial sort. Since there’s 256 buckets and each of those gets split into 256 buckets after the second iteration and each of those gets split again after the third iteration, that means that we’ll call the function 256^3 times. Since that is a crazy number, we can just call insertion_sort for any bucket that is less than 16 elements in size, which will be most buckets. And actually since the in-place radix sort isn’t stable, we can also just call std::sort for any bucket that is less than 1024 elements in size. That gets the number of recursive calls down by a lot.

This sounds simple: Use American Flag Sort to subdivide into 256 buckets using the first byte, then sort each of those buckets recursively using the remaining bytes as sort key. The problem that I ran into was that I had generalized radix sort to work on std::pair, std::tuple and std::array. On the in-place version these were far more complicated because you have to pass the logic for advancing and the next comparison function through all recursion layers. Especially the std::tuple code drove me template-crazy. Since American Flag Sort was also significantly slower than counting sort, I abandoned the in-place version for now and decided to leave that for future work.

So for now the takeaway is this: There is an in-place version of radix sort, but for now I decided that it’s too much work to implement. The part that I did implement looked several times slower than the copying radix sort. It might still be faster than std::sort, but I haven’t measured that.

In this article I found that radix sort is several times faster than std::sort for what seem like pretty normal use cases. So why isn’t it used all over the place? I did a quick poll at work, and many people had heard of it, but didn’t know how to use it or whether it was even worth using at all. Which is exactly what I was like before I started this investigation. So why isn’t it popular? I have a few explanations

- Size overhead: Radix sort requires a second array of the same size to store the data in. If your data is small you may not want to pay for the overhead of allocating and freeing a second buffer. If your data is large you may not have enough memory to have that second buffer. Radix sort may still be a great option if your data is sized somewhere in the middle, but using radix sort means that you have to worry about these things.
- Can’t use radix sort on variable sized keys. The version of radix sort presented here only works on fixed sized keys. So it can’t sort strings for example. The in-place version of radix sort can sort strings, but I didn’t look into that too much.
- We used to always write custom comparison functions. If you don’t use std::make_tuple or std::tie to implement your comparison function, it may not be obvious how to use radix sort for your class. You need to know that you can sort tuples using radix sort, and you need to notice that you’re using tuples in your comparison functions already.
- I can’t find any place that generalizes radix sort to std::pair, std::tuple and std::array. So this might actually be an original contribution of mine. Googling for it I can find mentions of using radix sort on tuples of ints, but it seems like those people don’t realize that you can generalize beyond that. Certainly nobody suggests that you could use radix sort on a std::pair<bool, float>. (for example the boost version of radix sort can not sort a std::pair<bool, float>, and boost code is usually way too generic) If you think that radix sort is only for integers, it’s not very useful.
- Radix sort can not take advantage of already sorted data. It always has to do the same operations, no matter what the data looks like. std::sort can be much faster on already sorted data.

So there are certainly some good reasons for not using radix sort. There simply can’t be one best sorting algorithm. However I also think that radix sort lost some popularity due to historical accidents. People often don’t seem to think that it applies to their data even though it does.

Radix sort is an old algorithm, but I think it should be used more. Much more. Depending on what your sort key is, it will be several times faster than comparison based sorting algorithms.

Radix sort can be used to implement a general purpose O(n) sorting algorithm that automatically falls back to std::sort when that would be faster. I think the standard library should be modified so that it can provide this behavior. I think this is possible by offering an extension point called std::sort_key which would work similar to std::hash. Even without that the standard could provide std::sort_copy, which would promise O(n) sorting on small keys.

The final conclusion is that it’s worth learning something about algorithms even if you’ve programmed for a while. I learned how radix sort works because I’ve been watching an Introduction to Algorithms course by MIT. I didn’t expect to learn anything new in that course, but it’s already caused me to write this blog post, and it’s inspired me to take another stab at writing the worlds fastest hashtable. (in progress) So never stop learning, and try to fill in the gaps in your knowledge of the basics. You never know when it will be useful.

I have uploaded my implementation of radix sort here, licensed under the boost license.

]]>

Shenzhen I/O shows you a histogram of all the scores that other people have reached. If my solution would fall on the right of the bell curve, I would optimize it until I was on the left. After a lot of work I would usually arrive at an “optimal” solution that puts me in the best bracket on the histogram. Those solutions were always far from optimal.

When you’re competing with another player, they will probably find a way to beat your score by just a few points. Let’s say my score is 340 and a friend beats me with a score of 335. (lower is better. The score is just the number of executed instructions) What follows is a bunch of head-scratching about how you could possibly get any more cycles out of the algorithm. After an hour of staring and trying different things you find a small improvement, and your new score is 332! Proudly you tell your friend that you beat their score. Soon after your friend will beat your score with 320. Such a big jump seems impossible. But your friend somehow did it. So now you need to think outside of the box. You’re thinking the only way that you could possibly achieve such a big jump is if you could somehow combine these two different parts of the algorithm, so that they can share this one part of the work. It doesn’t seem possible, and it’s not even clear that this will buy this much of a score improvement, but it’s the only thing you can think of. So after another hour of head-scratching about how you could possible achieve this you find a way to do it, and lo and behold it the wins are far bigger than expected, the new score is 310! And the next day your friend comes back with 290…

My friend and I have literally had cases where we went from a score of more than 500, where my friend thought that my score was impossible, down to a score of 202 for me, and 200 for my friend which put us completely off the charts. At that point a new patch hit that changed the level slightly so that our solutions didn’t work any more. (the game is still in early access) But if it hadn’t been for that, I wouldn’t have been surprised if we could have optimized this further. Almost every single time that I thought the limit was reached, we broke through it soon after.

I can now say for a fact that a lot of code out there is far from optimal. Even the code in our standard libraries that’s maintained by some of the best programmers and that’s used by millions is slower than it needs to be. It is simply faster than whatever code they compared it against.

On the second puzzle in the game, which serves as a kind of tutorial, the only possible score is 240. Except there were some people over on the left of the histogram. And wondering how to get over there, my friend somehow got to 180, telling me “I think this one is optimal.” The score seemed unreachable. With a few tricks I got it down to 232. After literally days of thinking about this problem I managed to think outside the box and match my friend’s score of 180. It wasn’t until we talked about it that we realized that we had used different solutions. It was crazy to realize that there were in fact two entirely different ways to reach 180. Once I realized that I had used a different solution, I also realized that the solution that my friend picked could not be optimized further, but mine could. It took me hours, but I got the score down to 156, and then very quickly down to 149. My friend then beat me with 148 using my technique, forcing me to find one last cycle.

If nobody had gotten to the score of 180 before me, I couldn’t have thought of any faster way of solving this puzzle. Without that piece of information, the brain just comes up with reasons why the score is already optimal. Only once you know for a fact that a better solution is possible can you actually think of that solution. If you now say “but how did the first person get to a lower score?” then the answer is that the technique that my friend used is actually useful in other levels, so they could have gotten the trick from one of those and then just applied it in earlier levels once they had come up with it in a later level. Or maybe somebody got to the score of 232 which is just the 240 score with a few dirty tricks, and somebody else thought about how to get to the “impossible” score of 232, and accidentally got to 180 instead.

Or Michael Abrash has told the story of how he was optimizing an inner loop, and he was asking a friend for help. The friend stayed long in the office and at night left a message for Abrash telling him that he had gotten two more instructions out of the seemingly optimal inner loop. Abrash didn’t think that was possible, but before the friend came into work the next day Abrash had already found how to reduce the loop by one instruction. At that point the friend told Abrash that the friend had actually made a mistake, and the two cycle optimization wasn’t valid. But just the thought that the friend could have gotten two more instructions out of the loop made it possible for Abrash to find another optimization.

In the puzzle above where my friend and I brought our score down from more than 500 to 200, the final solution was actually much cleaner than the solution that has a score of 500. Well my final 202 score solution is a dirty mess, but somewhere around 220 I had just the most beautiful code. It was much cleaner than the code I had for a score of 270, which in turn was much cleaner than the code I had for 340, which in turn was cleaner than the code I had for 410. But even though the fast solution is much simpler and cleaner than the bloated, slow solution, you have to write the bloated, slow solution first. It is a necessary step in getting familiar with the problem. Only once you’re familiar with it can you recognize the points where it could be cleaner. The only way to get to the good solution is to perform many steps of filing off the bumps and cleaning up the dust.

Even big, algorithmic improvements come from writing the bad solution first and then making many small improvements. At some points the small improvements clarify something in the solution. They reveal a symmetry or uncover that some work was done twice. Sometimes a new fact reveals itself very hazily, and only more work and thought on the problem can slowly make it clearer. Sometimes you don’t realize that you just made a big, algorithmic improvement until after you’re done. “Oh I can delete this entire chunk of code now. How did that happen?” And then after the fact you can reason through the steps that took you there.

For all of this you have to keep working on the problem and you have to keep it in your head. (partly so that it’s in the back of your head when you’re sleeping or taking a shower) You can’t come up with improvements if you’re not actively working on the problem.

This is obvious for people who have worked with tests, but in the videogame industry where I work, unit tests are still rare. In Shenzhen I/O you are so ridiculously productive thanks to the automated tests, that I point out to everyone who has played it “you could be this productive at work if you just wrote tests.”

Tests allow you to have a feedback loop of seconds. Manual testing requires launching the game, teleporting to the point you want to test something, waiting for loading, then manually doing your test. (say by killing a goblin and checking that the right effects play when the goblin dies) Not only does the automated test drastically improve iteration times, it will probably test more cases and provide more helpful error messages when something goes wrong.

I think the fast iteration times in Shenzhen I/O are one of the main reasons for why it is so much more fun than normal programming. Fast feedback and fast iteration times just make programming better. Suddenly I want to go back to old code to see if I can improve it, because if I get a few more cycles out of it I can find out very quickly. How long does it take you to set up a test case at work that measures performance and measures improvements? How long does it take you to make sure that your optimization didn’t break anything? Does that keep you from trying more risky optimizations?

Slow iteration times make you work differently. Not only do they drag the fun out of programming, but they make you spend less time on improvements. They hurt your code quality. It’s worth spending time on improving iteration times even if you did the math and figured that people didn’t spend a lot of time compiling. It’s not just about time spent.

If our libraries were set up like Shenzhen I/O puzzles, all of our code would run much faster. The way this could work is that the standard library would define an interface, tests, and a simple implementation. Then anyone could submit better implementations. And you could judge how fast each solution completes each test. You pick the test cases that you care about and pick the implementation that does best in those.

People could provide several different implementations that do better in different scenarios. (“this one does better if your data grows and shrinks a lot, this one does better if it’s mostly stable”) All you have to do is make sure that your implementation satisfies certain tests.

I think if we had this we would quickly find a new, faster sorting algorithm. The current favorites seem to be Introsort and Timsort, but I am confident that they would be beat immediately. The reason is simply that nobody has worked on sorting algorithms in an environment like the one in Shenzhen I/O.

Shenzhen I/O has impressively good writing, and it really enhances the game. The story is that you’re a programmer who moves to China for a job. It’s a simple story, entirely told in email conversations with your in-game coworkers, but the small story snippets really lighten up the game. Your coworkers have personalities that seem well-researched, almost as if the author has experience with working abroad himself. Each puzzle also has a little back story. I find that I use the back story to determine whether my solution is “cheating” or not. You can “cheat” by adapting your solution to the test cases, so that only those pass and other tests cases might fail. Usually if the device still fulfills its purpose according to the back story, I’m fine with taking a shortcut. (e.g. it’s fine to err on the side of false positives for the “spoiler-blocking headphones”, but not for the security card reader. For that puzzle however false negatives based on timing are OK because people can just swipe the card again)

The emails contain funny moments between coworkers, informative emails where you learn something about China, and emails that mirror your own emotions: When you get access to a new part that will make puzzles easier your coworkers are ecstatic and so are you. Or you are confused early in the game because you have to learn a lot, and the game acknowledges and plays with your confusion by making part of the documentation Chinese. This actually helps because it makes clear that you don’t have to learn everything to get started, and it’s OK to be a bit confused. It’s very impressive how all of this is told in very short email conversations that take maybe a minute or two between puzzles.

If this was just a series of programming puzzles, it wouldn’t work nearly as well. Before playing this game I wouldn’t have thought that programming puzzles need a story. The game would work without a story, it just wouldn’t be as good.

You should play Shenzhen I/O. It takes all the fun parts of programming and distills them into a game. If you can, convince a friend and start roughly at the same time.

The game teaches persistence and how to improve a solution with any means necessary. The game teaches out of the box thinking. When you have a tiny, constrained problem and somehow people are much faster than you, you have to think outside the box. (or sometimes you think outside the box for an hour only to realize that there was an obvious improvement left to do inside the box)

The game shows a great way to program, making you incredibly productive which will make your work feel sluggish in comparison. It’ll make you want to improve your tools at work.

Many of these lessons aren’t new, but since Shenzhen I/O is such a condensed experience, it makes these lessons clear and easy to acquire. It’s a great way to spend time as a programmer.

]]>

To illustrate let’s look at how objects were composed before C++11, what problems we ran into, and how everything just works automatically since C++11. Let’s build an example of three objects:

struct Expensive { std::vector<float> vec; }; struct Group { Group(); Group(const Group &); Group & operator=(const Group &); ~Group(); int i; float f; std::vector<Expensive *> e; }; struct World { World(); World(const World &); World & operator=(const World &); ~World(); std::vector<Group *> c; };

Before C++11 composition looked something like this. It was OK to have a vector of floats, but you’d never have a vector of more expensive objects because any time that that vector re-allocates, you’d have a very expensive operation on your hand. So instead you’d write a vector of pointers. Let’s implement all those functions:

Group::Group() : i() , f() { } Group::Group(const Group & other) : i(other.i) , f(other.f) { e.reserve(other.e.size()); for (std::vector<Expensive *>::const_iterator it = other.e.begin(), end = other.e.end(); it != end; ++it) { e.push_back(new Expensive(**it)); } } Group & Group::operator=(const Group & other) { i = other.i; f = other.f; for (std::vector<Expensive *>::iterator it = e.begin(), end = e.end(); it != end; ++it) { delete *it; } e.clear(); e.reserve(other.e.size()); for (std::vector<Expensive *>::const_iterator it = other.e.begin(), end = other.e.end(); it != end; ++it) { e.push_back(new Expensive(**it)); } return *this; } Group::~Group() { for (std::vector<Expensive *>::iterator it = e.begin(), end = e.end(); it != end; ++it) { delete *it; } } World::World() { } World::World(const World & other) { c.reserve(other.c.size()); for (std::vector<Group *>::const_iterator it = other.c.begin(), end = other.c.end(); it != end; ++it) { c.push_back(new Group(**it)); } } World & World::operator=(const World & other) { for (std::vector<Group *>::iterator it = c.begin(), end = c.end(); it != end; ++it) { delete *it; } c.clear(); c.reserve(other.c.size()); for (std::vector<Group *>::const_iterator it = other.c.begin(), end = other.c.end(); it != end; ++it) { c.push_back(new Group(**it)); } return *this; } World::~World() { for (std::vector<Group *>::iterator it = c.begin(), end = c.end(); it != end; ++it) { delete *it; } }

Oh god this is painful to do now, but this illustrates how people used to do composition. Or most of the time what people actually did is they just made their type non-copyable. Nobody would have wanted to maintain all this code (Too easy to make typos in this mindless code), so the easiest thing to do is to make the type non-copyable.

In fact oftentimes it just looked like types were non-copyable just because it’s difficult to reason through all these pointers. So in a sense it doesn’t matter that you could have implemented a copy constructor, the problem was that it’s difficult to reason through everything.

Nowadays I would write the above classes like this:

struct Expensive { std::vector<float> vec; }; struct Group { int i = 0; float f = 0.0f; std::vector<Expensive> e; }; struct World { std::vector<Group> c; };

This does everything that the above code did and it does it faster and with less heap allocations. The main feature in C++11 that made this possible was the addition of move semantics. Why isn’t this possible without move semantics? After all that last chunk of code would have compiled fine and run fine before C++11. But before C++11 people would have changed this code to look like the code further up. To see why imagine what happens when we add a new Group to the World.

If the vector in World reallocates its internal storage, we have to create temporary copies of our Groups and may have to allocate thousands of temporary vectors in the nested classes. Just to do an operation that’s internal to the vector. It’s terrible that we can randomly get slowdowns like this from harmless operations like a push_back.

The first time that somebody catches this in a profiler they will take a look at the codebase and find that we rarely copy Groups. So why don’t we just replace the internals with a pointer? That will make the copy more expensive but it will make growing and shrinking the vector practically free because we don’t have to copy in that case. We get a huge performance improvement and everyone is happy. And with that we’re back at the initial code.

Move semantics solve that problem. **With move semantics objects can re-organize their internals without having to copy everything that they own**. That’s obviously very useful for std::vector, but it turns out to be useful in a lot of classes.

Move semantics also gives composition to types that aren’t copyable. Before C++11 you could use RAII for non-copyable types, but then you couldn’t compose them as well as other classes. To illustrate let’s add some kind of OS handle to the Expensive struct. And let’s say that this OS handle requires manual clean-up:

struct Expensive { Expensive() : h(GetOsHandle()) { } ~Expensive() { FreeOsHandle(h); } HANDLE h; std::vector<float> vec; };

And just with that, everything is ruined. Expensive now can’t be copied and can’t be moved. That immediately breaks Group, which immediately breaks World. To fix this we could change Group to use a pointer to Expensive instead of using Expensive by value. But then Group has to be non-copyable, too and World is still broken. So now we also have to change World to store Group by pointer and we propagate our ugliness all the way through the codebase. A single type that requires manual clean-up makes us add the boilerplate code from C++98 to do composition to all other classes that use it directly or indirectly. It’s a mess.

Of course you know the solution already: Move semantics. If we just wrap the OS handle in a type that supports move semantics, everything continues to work:

struct WrappedOsHandle { WrappedOSHandle() : h() { } WrappedOSHandle(HANDLE h) : h(h) { } WrappedOSHandle(WrappedOSHandle && other) : h(other.h) { other.h = HANDLE(); } WrappedOSHandle & operator=(WrappedOSHandle && other) { std::swap(h, other.h); return *this; } ~WrappedOsHandle() { if (h) FreeOsHandle(h); } operator HANDLE() const { return h; } private: HANDLE h; }; struct Expensive { Expensive() : h(GetOsHandle()) { } WrappedOSHandle h; std::vector<float> vec; };

It’s a bit of boilerplate, but there are ways of avoiding it. (for example use a unique_ptr with a custom deleter). Now whenever we use this handle type, our class stays composable. Group keeps working and the World keeps working and everyone is happy.

There is a more fundamental reason why this works and why RAII is important for this: **Composing objects is a lot easier if certain operations are standardized**. If my object A consists of two objects B and C, it’s a lot easier to write the clean-up code for A if the clean-up code for all types is standardized. Otherwise B and C might have custom clean-up code and now A has to also have custom clean-up code. If everyone standardizes on one way to clean up objects, composition is easier.

The list of functions that make composition easier is long. It includes construction, copying, moving, assignment, swapping, destruction, reflection, comparison, hashing, checking for validity, pattern matching, interfacing with scripting languages, serialization in all its many forms and more. For example it’s a lot easier to write a hash function for my type if there is a standard way to hash my components. Or it’s a lot easier to copy my type if there is a standard way to copy my components. Not all types need all operations from this list, but if your type does need one of these, you’ll want a standard interface for your components. In fact once there is a standard way, you might as well automate this.

C++ has decided to automate the bare necessities out of that list: Construction, copying, moving, assignment and destruction. And it did this in the set of rules that we call RAII. If you use RAII, composition will be a lot easier for you. You’ll find that you’ll have a lot more types that just slot together and just work together. It’ll improve your code.

Oh and this is also another good reason to standardize reflection: With reflection, I can automate a lot of other elements in that list.

]]>

At least that’s my first impression. Still just dipping my toes in. But there is one thing I am very impressed with: How much data neural networks can express in how few connections.

To illustrate let me draw a very simple neural network. It’s not a very interesting neural network, I’m just connecting inputs to outputs:

And now let’s say that I want to teach this neural network the following pattern: Whenever input 1 fires, fire output 2. When input 2 fires, fire output 3. When input 3 fires, fire output 4. When input 4 fires, fire output 5. Output 1 never gets fired and input 5 never gets fired. To do that you use an algorithm called back propagation and repeatedly tell the network what output you expect for a given input, but that is not what I want to talk about here. I want to talk about the results. I’ll make the connections that the network learns stronger, and the connections that the network doesn’t learn weaker:

What happened here is that the weights for some connections have increased, while the weights for most connections have decreased. I haven’t talked about weights yet. Weights are what neural networks learn. When we draw networks, the nodes seem more important. But what the network actually learns and stores are weights on the connections. In the picture above the thick lines have a weight of 1, the others have a weight of 0. (weights can also go negative, and they can go up arbitrarily high)

A simple network like the one above can learn simple patterns like this. If we want to learn more complex patterns, we have to create a network that has more layers. Let’s insert a layer with two neurons in the middle:

There are many possible behaviors for those neurons in the middle, and that is the part where most of the magic happens in neural networks. It seems like picking what goes there just requires experimentation and experience. A simple neuron would be the tanh neuron which does these three steps:

- add up all of its inputs multiplied by their weights
- calculate tanh() on the sum
- fire the result to its outputs multiplied by their weights

Why tanh? Because it has a couple convenient properties, and it happens to work better than other functions that have the same properties. But mostly it’s “beause it works like this and it doesn’t work when we change it.”

If you count the number of connections on that last picture you will notice that there are fewer than there were in the first network. There were 5*5=25 at first, now there are 5*2*2=20. That reduction would be larger if I had more input and output nodes. For 10 nodes that reduction would be from 100 connections down to 40 when inserting two nodes in the middle. That’s where the compression in the title of this blog post comes from.

The question is whether we can still represent the original pattern in this new representation. To show why that is not obvious, I’ll explain why it doesn’t work if you just have one middle neuron:

If I initialize the weights on here so that the first node fires the second output, all nodes will fire the second output:

That one node in the middle messes up the ability for my network to learn my very simple rule. This network can not learn different behaviors for different input nodes because they all have to go through that one node in the middle. Remember that the node in the middle just adds all of its inputs, so it can not have different behaviors depending on which input it receives. The information of where a value came from gets lost in the summation.

So how many nodes do I need to put into the middle to still be able to learn my rule? In real neural networks you usually put hundreds of nodes into that middle layer, but what is the minimal number to learn my pattern? Before I get to the answer I need to explain one more type of neuron that enables my compression: The softmax layer. If I make my output layer a softmax layer, that means that it will look at all the activations on that layer, and that it will only fire the output with the highest activation. That’s where the “max” in the softmax comes from: It fires the max node. The “soft” part is also very important in other contexts because a softmax layer can fire more than one output, but in my case it will only ever fire one output so we’ll stay with this explanation. If I make my middle layer a tanh layer and my output layer a softmax layer, I can represent my pattern just by having two nodes in the middle:

Here I’ve made lines with positive weights blue, and lines with negative weights red. This means that if for example the first input fires, I will get these values on the nodes:

The first node activates both the middle nodes. That activates the second output node with weight two. The next two nodes are canceled out because they receive a positive weight from one of the middle nodes and a negative weight from the other. The last node is activated with weight -2. If I then run softmax on this only the top node will fire. Meaning the bottom node will not fire a negative output. Softmax suppresses everything except for the most active node.

This is a little bit simplified, because the tanh layer doesn’t give out nice round numbers and the softmax layer wants bigger numbers, but those complications are not necessary to understand what’s going on, so I’ll keep the numbers in this blog post simple. (the real weights in my test code are +9 and -9 on the connections going from the input layer to the middle layer, and +26 and -26 on the connections going from the middle layer to the output layer. I don’t know why those numbers specifically, that’s just where the network decided to stabilize)

Let’s quickly run through this for the other three cases as well:

As you can see there is always one clear winner and then the softmax layer at the end will make sure that only that one fires and the other outputs remain quiet. When I first saw this behavior I was quite impressed. In fact I saw this behavior on a network with eleven inputs, eleven outputs and just two middle nodes. Can you think of how the above layer would work with eleven inputs? It seems like there’s only four possible combinations for these weights and we’ve used all of them, right? You’re underestimating neural networks. It’s quite impressive how they will try to squeeze any pattern that you throw at them into what’s available. For eleven inputs and eleven outputs it looks like this:

That is a lot of connections and a lot of numbers. The network has now decided to make some connections stronger and other connections weaker. I couldn’t keep using colors and line-style to visualize different strengths because now there are a lot of different values. Whenever I tried to simplify and not use numbers I’d break the network. So instead I just show the numbers. The upper number on each node is the weight of the connection to/from the upper node in the middle layer, and the lower number is the weight of the connection to/from the lower node in the middle layer.

Let’s walk through a random example and see how it works:

The node with the largest output value (72 = -10 * -4 + 8 * 4) is the one we want to fire, and softmax will select that. In this picture I’m just multiplying the numbers linearly because the weights on the left actually already have tanh applied (and were then multiplied by 10 to get them into the range -10 to 10 which is slightly nicer for pictures than the range -1 to 1). I can do that in this case since I only ever fire one input. And if I can just do linear math the example is easier to follow.

I’ll post a second example to show that the weights will activate the desired output for any input:

Here also for my given input n, output n+1 had the highest value at the end and softmax will make that output fire. You could go through a couple more examples and convince yourself that this network has learned the pattern that I want it to learn.

I don’t know about you, but I think this is quite impressive. For the small first network I think I could have figured out how to put the numbers in manually. (once you know the trick that the nodes can cancel each other out it’s easy) For all the connections on the larger network you would have to be very good at balancing these weights, and I honestly don’t think I could have done this by hand. The neural network however seems to just figure this out. Or rather: it figures this out if you use a tanh layer in the middle and a softmax layer at the end. If you don’t, it will stop working.

Since I’m using real numbers now I might as well explain how softmax works. Softmax uses the number at the end as an exponent and then normalizes the column afterwards. If I use 2^x it should be obvious that 2^56, the highest number, is a lot bigger than 2^44, the next highest number. If I have 2^56 in the column and I normalize it, all the other activations will go very close to zero and the output that I want to fire will be close to 1. Also by using an exponent I get no negative numbers. 2^-96 is just a number that’s very close to zero. The standard is to use e^x instead of 2^x, and I also use that, it’s just that as a programmer I find it easier to explain and easier to understand using powers of two.

With that information I can give a bit of an intuition why this happens to work if you use tanh and softmax: When using softmax, the connections can keep on improving the chance of their desired output just by making their own weights bigger. In fact since I just used linear math in my explanation above and didn’t need to run tanh() on anything, couldn’t softmax have learned those weighs even without using a tanh layer? In theory it could, but in practice it will keep on bumping up the numbers to get small improvements and you will very quickly overflow floating point numbers. The tanh layer in the middle is then just responsible for keeping the numbers small: it clamps its outputs to the range -1 to 1, and the inputs won’t grow past a certain point either because at some point a bigger number just means that the output goes from 0.995 to 0.997. But since they can usually still get a small improvement you don’t lose that nice property of softmax where neurons can keep on edging out small improvements over each other.

So if you just use softmax your weights will explode and if you just use tanh your weights will stagnate too soon before a good solution emerges. If you use both you get nice greedy behavior where the connections keep on looking for better solutions, but you also keep the weights from exploding.

Now all I’ve shown is that my network can learn the rule that when input n fires, it needs to fire output n+1. It should be easy to see that just by creating a different permutation of the weights of the connections I can teach my network that for any input neuron x, it should fire any other output neuron y. The impressive part is that without the middle layer I would have needed 11*11 = 121 connections to have a network that can learn any combination, and with that middle layer I can do it in only 11*2*2 = 44 connections.

Eleven inputs and outputs is close to the limit for what you can do with two nodes in the middle. If you ask it to do much more the weights will fluctuate and they will keep on stepping on each others toes. But with three nodes in the middle I can learn these patterns for something like thirty inputs and outputs, and with four nodes, I can learn direct connections for more than a hundred inputs and outputs. It doesn’t grow linearly because the number of combinations doesn’t grow linearly. Luckily the back propagation algorithm scales with the number of connections, not with the number of combinations. So with a linear increase in computation cost I get a super-linear increase in the amount of stuff I can learn.

I’m still just getting started with neural networks, and real networks look nothing like my tiny examples above. I don’t know how many neurons people actually use nowadays, but even small examples talk about hundreds of neurons. At that point it’s more difficult to understand what your network has learned. You can’t visualize it as easily as the above pictures. In fact even the eleven neuron picture is difficult to understand because you can’t just look at the strong connections. Even the weak connections are important. If that middle layer had 700 neurons, then good luck getting a picture of which connections are actually important. Maybe a bunch of weak connections add up to build the output you wanted and one random strong connection suppresses all the unwanted outputs that you didn’t want.

But I hope I have given you an intuition for how neural networks can compress patterns in few weights. They use the full range of the weights to the point where a connection activated with a strong input can mean something entirely different than the same connection activated with a weak input. And best of all I didn’t have to teach them to do this. They just start behaving like this if you force them to express a complex pattern in few connections.

]]>