Why Does Integer Addition Approximate Float Multiplication?
by Malte Skarupke
Here is a rough approximation of float multiplication (source):
#include <bit>      // std::bit_cast
#include <cstdint>  // uint32_t

float rough_float_multiply(float a, float b) {
    // magic constant: the exponent bias plus a small mantissa adjustment
    constexpr uint32_t bias = 0x3f76d000;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(a) + std::bit_cast<uint32_t>(b) - bias);
}
We’re casting the floats to ints, adding them, adjusting the exponent, and returning the result as a float. If you think about it for a second you will realize that since the float contains the exponent, this won’t be too wrong: you can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this actually does much better and gets within 7.5% of the right answer. Why?

It’s not the magic number. Even if you only adjust the exponent (subtract the bias of 127, shifted into the exponent field, to get it back into range after the addition) you get within 12.5% of the right result. There is also a mantissa offset in that constant which helps a little, but 12.5% is surprisingly good as a default.
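For concreteness, here is a minimal sketch of that exponent-only variant (the function name is mine, not from the post):

#include <bit>
#include <cstdint>

float exponent_only_multiply(float a, float b) {
    // 0x3f800000 is the bias of 127 shifted into the exponent field;
    // subtracting it once undoes the doubled bias after the addition
    constexpr uint32_t exp_bias = 0x3f800000;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(a) + std::bit_cast<uint32_t>(b) - exp_bias);
}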
I should also say that the above fails catastrophically when you overflow or underflow the exponent. I think the source paper doesn’t handle that, even though underflow is really easy to trigger, e.g. with 0.5 * 0. It’s probably fine to ignore overflow, so here is a version that just handles underflow:
#include <bit>      // std::bit_cast
#include <cstdint>  // uint32_t

float custom_multiply(float a, float b) {
    constexpr uint32_t sign_bit = 0x8000'0000;
    // the exponent bias, 127, shifted into the exponent field
    constexpr uint32_t exp_offset = 0b0'01111111'0000000'00000000'00000000;
    // empirically chosen mantissa adjustment (explained below)
    constexpr uint32_t mantissa_bias = 0b0'00000000'0001001'00110000'00000000;
    constexpr uint32_t offset = exp_offset - mantissa_bias;
    uint32_t bits_a = std::bit_cast<uint32_t>(a);
    uint32_t bits_b = std::bit_cast<uint32_t>(b);
    // add the bit patterns without their sign bits
    uint32_t c = (bits_a & ~sign_bit) + (bits_b & ~sign_bit);
    // clamp to 0 instead of wrapping around when the exponent underflows
    if (c <= offset)
        c = 0;
    else
        c -= offset;
    // the sign of the result is the xor of the input signs
    c |= ((bits_a ^ bits_b) & sign_bit);
    return std::bit_cast<float>(c);
}
Clang compiles this to a branchless version whose performance is not too far off from float multiplication. Is this ever worth using? The paper talks about using this to save power, but that’s probably not worth it for a few reasons:
- Most of the power consumption comes from moving bits around; the actual float multiplication is a small drain compared to loading the inputs and storing the result
- You wouldn’t be able to use tensor cores
- I don’t think you can actually be faster than float multiplication because there are so many edge cases to handle
It feels close to being worth it though, so I wouldn’t be surprised if someone found a use case.
But there is still the question of why this works so well. The mantissa is not stored in log-space, it’s just stored in plain old linear space, where addition does not do multiplication. But let’s think about how to get the fractional part of the exponent from the mantissa.
In general, how do you get the remaining exponent-fraction from the remaining bits? This is easier to think about for integers, where you can get the log2 by determining the highest set bit:
log2(20) = log2(0b10100) ~= highest_set_bit(0b10100) = 4
The actual correct value is log2(20) = 4.322. The question we need to answer is: how do you get the remaining exponent-fraction, 0.322, from the remaining bits, 0b0100?
To make this work for any number of bits we should normalize the remaining bits into the range 0 to 1, which in this case means doing the division 0b0100/float(1 << 4) = 0.25. (In general you divide by the value of the highest set bit, which you already had to find for the previous step.)
Once the numbers are in the range from 0 to 1, you can get the remaining exponent-fraction with log2(1+x). In this case it’s log2(1+0.25) = 0.322.
If you plot y=log2(1+x) for the range from 0 to 1 you will find that it doesn’t deviate too far from y=x: the worst case is a deviation of about 0.086, at x = 1/ln(2) - 1 ≈ 0.44. So if you just want an approximate solution you might as well skip this step.
[Graph: y = log2(1+x) compared with y = x on the interval from 0 to 1]
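Putting those steps together, here is a sketch of the integer version (my own illustration, not from the post), using the y=x shortcut:

#include <bit>      // std::countl_zero (C++20)
#include <cmath>
#include <cstdint>
#include <cstdio>

float approx_log2(uint32_t v) {
    int high = 31 - std::countl_zero(v);                    // highest set bit: 4 for 20
    uint32_t rest = v ^ (uint32_t(1) << high);              // remaining bits: 0b0100
    float frac = float(rest) / float(uint32_t(1) << high);  // normalize into [0, 1)
    return float(high) + frac;                              // approximate log2(1+x) as x
}

int main() {
    std::printf("%f vs %f\n", approx_log2(20), std::log2(20.0)); // 4.25 vs 4.322
}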
And the mantissa is already interpreted as a fraction in floats, so you also don’t have to divide. So the whole operation cancels out and you can just add.
You still need to handle:
- The sign bit
- Overflowing mantissa
- Overflowing exponent
- Underflowing exponent
Numbers 1 and 2 also work out naturally using addition because of how floats are represented:
- Since the sign bit is the highest bit, the carry out of it is discarded, so adding the sign bits acts like xor, which is what you want
- When the mantissa overflows you end up increasing the exponent, which is what you want. (e.g. 1.5 * 1.5 = 2.25, which has a higher base-2 exponent)
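A quick bit-level check of the mantissa-carry point (the arithmetic is mine):

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    // 1.5f is 0x3FC00000: exponent 127, mantissa field 0x400000 (i.e. 0.5)
    uint32_t bits = std::bit_cast<uint32_t>(1.5f);
    // adding the two mantissa fields carries into the exponent field
    uint32_t sum = bits + bits; // 0x7F800000
    // subtracting just the exponent bias leaves 0x40000000 == 2.0f,
    // a decent approximation of the exact 1.5f * 1.5f == 2.25f
    std::printf("%f\n", std::bit_cast<float>(sum - 0x3f800000u));
}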
Number 3 can be ignored for most floats you care about.
Number 4 is the one that required me to write that more complicated version of the code. It’s really easy to underflow the exponent, which will wrap around and give you large or nonsensical numbers instead. In neural networks lots of activation functions like to return 0 or close to 0, and when you multiply with that you will underflow and get very wrong results. So you need to handle underflow. I have not found an elegant way of doing it, because you only have a few cycles to work with, otherwise you might as well use float multiplication.
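To see the failure mode concretely, here is a small demonstration (my own) of what the rough version does on 0.5 * 0:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    // 0.5f is 0x3F000000 and 0.0f is 0x00000000; their sum is below the
    // bias 0x3f76d000, so the subtraction wraps around to 0xFF893000,
    // which is a NaN with the sign bit set instead of something near 0
    uint32_t c = std::bit_cast<uint32_t>(0.5f) + std::bit_cast<uint32_t>(0.0f) - 0x3f76d000u;
    std::printf("0x%08X -> %f\n", (unsigned)c, std::bit_cast<float>(c));
}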
The last open question is the mantissa adjustment: you can see in the graph above that the approximation y=x is never too big, so by default you will always bias towards 0. But you can add a little bias to the mantissa to shift the whole line up. I tried a few analytic ways to arrive at a good constant, but they all gave terrible results when I actually tried them on a bunch of floats. So I just tried many different constants and stuck with the one that gave the least error on 10,000 randomly generated test floats.
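For illustration, here is a sketch of that kind of brute-force search (my own, not the uploaded test code, which uses different names and details): it measures the worst-case relative error of each candidate bias on random positive inputs and keeps the best one.

#include <algorithm>
#include <bit>
#include <cmath>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// the rough multiply with an adjustable bias; positive inputs only
static float rough_multiply_with_bias(float a, float b, uint32_t bias) {
    return std::bit_cast<float>(std::bit_cast<uint32_t>(a) + std::bit_cast<uint32_t>(b) - bias);
}

uint32_t find_best_bias() {
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dist(0.5f, 2.0f);
    std::vector<std::pair<float, float>> tests(10000);
    for (auto& t : tests)
        t = {dist(rng), dist(rng)};

    uint32_t best_bias = 0x3f800000;
    double best_err = 1e300;
    // scan mantissa offsets below the plain exponent bias 0x3f800000
    for (uint32_t offset = 0; offset < (1u << 23); offset += (1u << 12)) {
        uint32_t bias = 0x3f800000u - offset;
        double worst = 0.0;
        for (auto [a, b] : tests) {
            double exact = double(a) * double(b);
            double approx = rough_multiply_with_bias(a, b, bias);
            worst = std::max(worst, std::abs(approx - exact) / exact);
        }
        if (worst < best_err) {
            best_err = worst;
            best_bias = bias;
        }
    }
    return best_bias;
}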
So even though this is probably not useful and I haven’t found a really elegant way of doing it, it’s still neat how the whole thing almost works out as a one-liner because so many things cancel out once you approximate y=log2(1+x) as y=x.
edit: Since people liked playing around with this, I have uploaded my test code here.
Wouldn’t just masking the exponent & mantissa be enough to handle underflow?
If I understand correctly, the issue only happens when one of the multiplicands is zero. In that case the rough version will return a NaN instead of 0.
float rough_float_multiply2(float a, float b) {
    constexpr uint32_t bias = 0x3f76d000;
    uint32_t ia = bit_cast<uint32_t>(a), ib = bit_cast<uint32_t>(b);
    return ia & ib ? bit_cast<float>(ia + ib - bias) : 0.0f;
}

Thanks, I like the idea of just checking for zero. I didn’t go for it because there are also lots of ways to get numbers very close to zero.
This particular check doesn’t work for e.g. 0.1f * 2.0f because ia & ib will be 0. Also -0.0f exists.
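(For reference: 0.1f is 0x3DCCCCCD and 2.0f is 0x40000000; the two bit patterns share no set bits, so the check wrongly flushes the product to 0.)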
I have now uploaded my test code where you can try this.
Hi,
Thanks for this interesting article. I think you made a mistake, however: the bias specified in the paper is 0x3f780000 instead of 0x3f76d000.
Best
I didn’t understand where the offset in the paper came from, so I tried to come up with my own. I have now uploaded my test code, which shows how I arrived at this constant: it’s the function determine_constant_slow().
@sweethack,
your rough_float_multiply2() thing doesn’t seem to work well.
But I’ve come up with a faster/shorter function (well, compared to custom_multiply(), that is) that behaves the same as custom_multiply() (at least from all the testing I’ve done so far):
I’m calling it sharkautarch_float_multiply() because I can 🙂
https://godbolt.org/z/f6dfjEojM
Thanks! I have now uploaded my test code so you can try this on more floats. This one fails on 0.1 * 0 for me.
Hello again!
I have made a second version of my own function, `sharkautarch_float_multiply_v2`, which passes your test code.
Actually it’s two versions: `sharkautarch_float_multiply_v2_a` & `sharkautarch_float_multiply_v2_b`.
version b yields slightly more compact codegen than version a, but it relies on a 64-bit integer multiply to avoid having the compiler produce extra branches or conditional selects
version 2a is a teeny bit longer (in instruction count) than my original version, but I think version 2b is pretty much the same length as my original version of the function
version b (running on my Intel Alder Lake CPU, compiled w/ gcc -O3, and outperforming version 2a in the test) has surprisingly good performance, while (in terms of the test’s reported min_diff & max_diff) it seems to behave the same as `custom_multiply`
godbolt link w/ the updated versions:
https://godbolt.org/z/Gq6sYnhE6
A compelling use case is in graphics / GPU programming, in situations where you have atomic integer operations but not atomic floating point operations. Thanks for the article!