Interesting. I’ll try to investigate where the performance difference comes from.

There is a problem with that benchmark: repeatedly popping the top item and pushing a random item leads to a degenerate case where the new item will almost always be the top item. The reason is that the first few pops remove all the large items, so a random insert will usually be larger than the largest item remaining in the heap. (If it isn’t, the average just goes further down, and future items are even more likely to be the top item.)

So it could be that in this case all the branches are very predictable and all my branchless trickery ends up backfiring.
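A rough way to check the degenerate-case argument is a small simulation. The sketch below is not the benchmark under discussion: it assumes a max-heap (`std::priority_queue`), uniformly random doubles, and a hypothetical helper name, and simply counts how often the freshly pushed item ends up on top.

```cpp
#include <cstddef>
#include <queue>
#include <random>

// Hypothetical helper: fill a max-heap with random items, then repeatedly
// pop the top and push a new random item. Returns the fraction of pushes
// after which the new item is the top of the heap.
// Precondition: heap_size >= 1 and iterations >= 1.
inline double fraction_pushed_item_is_top(std::size_t heap_size,
                                          std::size_t iterations,
                                          unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::priority_queue<double> heap;  // max-heap by default
    for (std::size_t i = 0; i < heap_size; ++i) heap.push(dist(rng));

    std::size_t hits = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        heap.pop();                   // removes the current largest item
        const double x = dist(rng);
        heap.push(x);
        if (heap.top() == x) ++hits;  // the new item went straight to the top
    }
    return static_cast<double>(hits) / static_cast<double>(iterations);
}
```

With a heap of 1000 items and 100000 iterations, the fraction should come out well above one half, and it keeps growing the longer the loop runs.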

OK, I added some more benchmarks, which indeed show numbers similar to yours :). Only the first benchmark, extracting from and then pushing into the heap, is a bit odd (in particular, it seems to hit some performance bug in your binary heap implementation).

I think your trick might work. That’s pretty clever. My first solution would have involved much more work if I hadn’t been able to use bit scan reverse…

The talk about the interface of branch misprediction is about how C++20 decided to standardize access to the “bit scan reverse” instruction. I don’t know whether their way of standardizing it adds unnecessary overhead. So if you have the choice between using “bit scan reverse” directly and using the C++20 library function, you should probably try the C++20 library function first, but check that it isn’t doing more work than calling “bit scan reverse” directly. The reason for the possible overhead is that “bit scan reverse” returns nonsense when it’s called on 0, so the C++20 standard defines the behavior for that edge case. But I never call “bit scan reverse” on 0, so I don’t need the logic for handling that edge case.
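To make the comparison concrete, here is a minimal sketch of the two options. The comment does not name the C++20 function, so `std::bit_width` from `<bit>` is my assumption; the helper names are hypothetical, and `__builtin_clz` is a GCC/Clang builtin that may or may not compile down to the actual “bit scan reverse” instruction on a given target.

```cpp
#include <cstdint>
#if __cplusplus >= 202002L
#include <bit>  // std::bit_width (C++20)
#endif

// Hypothetical helper: index of the highest set bit of x.
// Precondition: x != 0. No zero check is performed, which matches the
// "I never call it on 0" situation described above.
inline int highest_set_bit_unchecked(std::uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    // __builtin_clz is undefined for 0, much like the raw instruction.
    return 31 - __builtin_clz(x);
#else
    // Portable fallback for other compilers: shift until nothing is left.
    int index = 0;
    while (x >>= 1) ++index;
    return index;
#endif
}

#if __cplusplus >= 202002L
// Same result via the C++20 library. std::bit_width(0) is defined to be 0,
// so this returns -1 for an input of 0 instead of nonsense.
inline int highest_set_bit_cxx20(std::uint32_t x) {
    return static_cast<int>(std::bit_width(x)) - 1;
}
#endif
```

Whether the library version costs anything extra depends on the compiler and target, so comparing the generated assembly of both functions is the way to find out.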

I have to admit that it’s a bit hard to figure out what exactly led to the speed-up. In my code, `min(min(1, 2), min(3, 4))` ended up branchless, and `min(1, min(2, min(3, 4)))` did not. So it could be either the shorter dependency chain or the branchless code that led to the speed-up. Or both. But in a binary heap all siblings are also always right next to each other in memory.

I did measure that for an octonary heap the chained version is faster, meaning `min(1, min(2, min(3, min(4, min(5, min(6, min(7, 8)))))))` is faster than `min(min(min(1, 2), min(3, 4)), min(min(5, 6), min(7, 8)))`. And I’m pretty sure the reason is what I said in the blog post: the longer the chain, the better the branch predictor gets. `min(1, 2)` is mostly unpredictable, but the last min in that long chain is very predictable, because on random items the chance that the last is the smallest is only 1/8.
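For reference, the two shapes being compared look roughly like this when written out for eight values. This is only a sketch with hypothetical function names, not the code from the repository; both functions return the same value, and whether either one compiles to branchless code depends on the compiler and flags.

```cpp
#include <algorithm>
#include <array>

// Chained shape: every min depends on the result of the one inside it,
// so the dependency chain is seven comparisons long.
inline int min8_chained(const std::array<int, 8>& v) {
    return std::min(v[0], std::min(v[1], std::min(v[2], std::min(v[3],
           std::min(v[4], std::min(v[5], std::min(v[6], v[7])))))));
}

// Tree shape: the four inner mins are independent of each other,
// so the dependency chain is only three comparisons long.
inline int min8_tree(const std::array<int, 8>& v) {
    return std::min(std::min(std::min(v[0], v[1]), std::min(v[2], v[3])),
                    std::min(std::min(v[4], v[5]), std::min(v[6], v[7])));
}
```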

Same comment as above: I uploaded the benchmarks to the same repository here:

https://github.com/skarupke/heap

I uploaded the benchmarks to the same repository here:

https://github.com/skarupke/heap

They’re not fancy, just calling the relevant instructions in a loop. The only difference is that I graph nanoseconds per item, where Google Benchmark gives you items per second.
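If you want to compare the two kinds of numbers, the conversion is just a reciprocal scaled to nanoseconds. A tiny sketch (the function name is hypothetical):

```cpp
// Convert a throughput in items per second into nanoseconds per item.
// Precondition: items_per_second > 0.
inline double nanoseconds_per_item(double items_per_second) {
    return 1e9 / items_per_second;
}
```

For example, 200 million items per second corresponds to 5 nanoseconds per item.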

That is, I wouldn’t be surprised to see a speedup here even without parallel comparison (doing `min(1, min(2, min(3, 4)))` rather than `min(min(1, 2), min(3, 4))`). But I also wouldn’t be surprised not to see it; I don’t have an intuition about how the two factors compare to each other.

But even when you’re not using a heap for a priority queue, and if you’re pushing completely unpredictable items, the branch predictor can do a good job: The first comparison has a 50/50 chance because it depends on if you’re in the upper half or lower half. But the next one can be predicted correctly 75% of the time because you only get to trickle up twice if you’re in the top 25% of items, and you only get to trickle up three times if you’re in the top 12.5% of items etc. So the branch predictor only gets better after that.

This is confused as written, but the idea’s right: branch prediction works better with heaps than trees.

Loosely, you expect the lowest level in a binary heap, where the item is initially placed, to be the bottom half of the item distribution, and that level’s direct parents are the third quarter of the item distribution. So the initial comparison is 5:3 against the newly inserted item trickling up.

This pattern is the same all the way up: given the comparisons that have gone before, the parent the item is being compared against is in the third quarter of the conditional item distribution, so the odds are again 5:3 against a trickle-up.

This isn’t *that* much, but it’s better than binary trees, which are exactly even odds each time (and never mind rebalancing operations), and it’s enough to give the branch predictor something to work with.
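The 5:3 figure can be sanity-checked with a quick Monte Carlo sketch of that loose model. The assumptions are mine: a max-heap, a new item uniform on [0, 1), and a parent uniform on the third quarter [0.5, 0.75); the function name is hypothetical. Under that model the chance of a trickle-up is 1 − 0.625 = 0.375, which is 3 in 8.

```cpp
#include <cstddef>
#include <random>

// Hypothetical helper: estimates how often a uniformly random new item
// beats a parent drawn from the third quarter of the distribution.
inline double estimate_trickle_up_probability(std::size_t trials,
                                              unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> item_dist(0.0, 1.0);
    std::uniform_real_distribution<double> parent_dist(0.5, 0.75);

    std::size_t trickle_ups = 0;
    for (std::size_t i = 0; i < trials; ++i) {
        const double item = item_dist(rng);
        const double parent = parent_dist(rng);
        if (item > parent) ++trickle_ups;  // new item would move up a level
    }
    return static_cast<double>(trickle_ups) / static_cast<double>(trials);
}
```

With a million trials the estimate should land close to 0.375, matching odds of 5:3 against a trickle-up.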