Source to Source Optimization

by Malte Skarupke

In April Olaf Krzikalla gave a talk titled “Performing Source-to-Source Transformations with Clang” (videos available here).

In that talk he showed an auto-vectorizer that doesn’t perform the optimization in the assembly, but instead spits out new source code that performs the optimized operations. Here is a picture from his talk:

[Image from the talk: the source code for a simple loop operating on an array of floats on the left-hand side, and the same loop rewritten with SIMD instructions on the right-hand side.]
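
I don’t have the exact code from that slide, so here is a sketch of the general shape of the transformation, using SSE intrinsics. (The function names are mine, not his, and his tool’s actual output will look different.)

    #include <cstddef>
    #include <xmmintrin.h> // SSE intrinsics

    // Before: a simple scalar loop over an array of floats.
    void scale(float * data, float factor, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            data[i] *= factor;
    }

    // After: the kind of code a source-to-source vectorizer could emit.
    // Four floats are processed per iteration, with a scalar loop for
    // the remainder when count is not a multiple of four.
    void scale_vectorized(float * data, float factor, size_t count)
    {
        __m128 factor4 = _mm_set1_ps(factor);
        size_t i = 0;
        for (; i + 4 <= count; i += 4)
        {
            __m128 loaded = _mm_loadu_ps(data + i);
            _mm_storeu_ps(data + i, _mm_mul_ps(loaded, factor4));
        }
        for (; i < count; ++i)
            data[i] *= factor;
    }

Note that the emitted version also has to handle the remainder when count isn’t a multiple of four, which is exactly the kind of detail that is invisible in the original loop.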

And I want to explain why all compiler optimizations should do this.

The reason Mr Krzikalla gives for preferring to do the optimization on the source code, as opposed to later during the compile step, is that it allows them to port their optimizations to new machines very quickly. They work in high performance computing, where they often don’t have a choice about which compiler to use when they get a new machine. That may sound familiar to people in gaming. (I’ve been told things like “we don’t use operator overloading because the compiler for platform X generated crappy assembly for it.”) But with this method you write your loop vectorizer once, and it works for all compilers. (Yes, you’ll have to change the intrinsics for different platforms, but that’s the easy part. The difficult part is writing the code that understands C++ loops.)

That is a very good reason to do this, but there are many more benefits. The most immediately important one for me is this:

You can actually debug this code.

Everyone working in C++ has had the experience that debugging optimized code can be very painful. It’s obvious with just the above example: imagine all you had was the code on the left while you were actually debugging the code on the right. (And yes, you can expect most modern compilers to perform the above optimization.) In more complex scenarios with many inlined functions, debugging becomes a tedious and error-prone process, because you first have to understand or guess what the compiler did before you can begin to debug.

But imagine if the compiler not only wrote the optimized assembly, but also a C++ file that more closely resembles the final executable. And imagine your debugger were smart enough to use that new text file instead of the original file while debugging, maybe even showing them side by side. That would make working with Release builds much easier. Common performance traps such as accidentally copying a string would immediately become obvious, because there’d be an unplanned call to operator new() in the middle of your code. You might even be able to debug STL code, because all those layers of abstraction would just be inlined away, exposing the algorithm underneath. (Note that I don’t expect this to be true in general: most code would become much less readable after optimizations. Only STL code and boost code could become more readable after inlining.)
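
For example, here is what such a file might reveal about an accidental string copy. This is purely a sketch of what I’d want the output to look like (no compiler emits anything like this today), and the expanded body is a big simplification of what std::string really does:

    #include <cstddef>
    #include <cstring>
    #include <new>
    #include <string>

    void print_name(const char * name, size_t length);

    // What you wrote: an innocent-looking call that copies the string.
    void greet(std::string user) // pass by value: a copy happens here
    {
        print_name(user.data(), user.size());
    }

    // What a hypothetical "optimized source" view might show instead.
    // (Illustrative only; real std::string internals are different.)
    void greet_expanded(const std::string & user)
    {
        size_t length = user.size();
        char * copy = static_cast<char *>(::operator new(length + 1)); // the hidden allocation
        std::memcpy(copy, user.data(), length + 1);
        print_name(copy, length);
        ::operator delete(copy);
    }

In the expanded version the call to operator new() is right there in the middle of your code, instead of being buried three layers deep in inlined library functions.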

Let’s take that idea one step further and give the compiler a GUI (by which I don’t mean an IDE, which just calls a command-line compiler that it treats like a black box, but an actual GUI for the compiler) where it can show you how the source code looks after each optimization pass. We could write an optimizer that’s not a black box. If you’ve ever been frustrated because the compiler generates assembly that’s obviously stupid, you know how valuable this could be. You could actually work together with the compiler, because it could explain the steps it’s taking and why it’s not optimizing away certain instructions. And you could add a “const” or a “restrict” to allow the compiler to do its job.
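
To make that back-and-forth concrete, here is a classic case where a single keyword unblocks the optimizer. This is my own example, not one from the talk; __restrict is a non-standard extension, but GCC, clang and MSVC all support it:

    #include <cstddef>

    // Without restrict the compiler has to assume that "to" and "from"
    // might overlap, which can prevent it from vectorizing this loop.
    void add_arrays(float * to, const float * from, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            to[i] += from[i];
    }

    // __restrict promises that the two arrays don't alias, so the
    // compiler is free to vectorize.
    void add_arrays_restrict(float * __restrict to, const float * __restrict from, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            to[i] += from[i];
    }

Today you have to guess that aliasing was the problem by staring at assembly. A compiler that showed its work could just tell you.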

It would also eradicate a whole category of bugs caused by an overly aggressive optimizer. Bugs caused by the optimizer exploiting undefined behaviour or re-ordering instructions in lockless code would become obvious if the compiler told you what it was doing during optimization.
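
The classic lockless example is a spin loop on a non-atomic flag. In an emitted “optimized source” file the hoisted load would be right there in front of you, instead of hiding in the assembly. (This is my illustration; it assumes C++11 atomics.)

    #include <atomic>

    bool done_flag = false; // plain bool, no ordering guarantees

    // The optimizer sees no way for done_flag to change inside the
    // loop, so it may legally hoist the load and spin forever.
    void wait_broken()
    {
        while (!done_flag)
        {
        }
    }

    std::atomic<bool> done_atomic(false);

    // With std::atomic the load has to happen on every iteration,
    // and the compiler is not allowed to reorder it away.
    void wait_correct()
    {
        while (!done_atomic.load(std::memory_order_acquire))
        {
        }
    }

wait_broken is a real bug even without the optimizer, but it’s the kind that only shows up in optimized builds, which is exactly what makes it so confusing to debug.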

And it’s not like these problems are going to get better by themselves. Optimized code will be even more difficult to debug in ten years than it is now. And the more an optimizer does without your input, the easier it is to break it. It’s cool when your loop gets auto-vectorized, but it’s not so cool when you later make an innocent-looking change that confuses the vectorizer and suddenly your code runs half as fast for no apparent reason.
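
Here is a made-up example of the kind of innocent change I mean. (get_factor() is hypothetical, and whether a given compiler’s vectorizer actually gives up here varies, but an opaque call in the loop body is a typical way to lose the optimization.)

    #include <cstddef>

    // Vectorizes nicely: simple, independent iterations.
    void double_all(float * data, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            data[i] *= 2.0f;
    }

    float get_factor(); // defined in another translation unit

    // An innocent-looking change: because get_factor() is opaque, the
    // compiler has to assume it could read or write the array, which
    // can be enough to stop the vectorizer.
    void scale_all(float * data, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            data[i] *= get_factor();
    }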

So how likely is it that every compiler writer will start doing this? This is unfortunately where reality gets in the way: it’s more difficult to write an optimizer that operates on source code than one that operates on assembly. The reason is that when dealing with source code you are at too high a level to reliably find out what the programmer’s intent is. You have to iterate over a tree structure that can be nested infinitely (for example with the “do {…} while (false)” that people like to put around macros, sketched below) and that has ten different ways of expressing the same thing. Some optimization steps can also only be performed on assembly: you can’t know whether you want to inline a function if you don’t know how big the resulting assembly is.
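
For illustration, this common macro pattern alone buys the optimizer an extra do/while node to see through (LOG_IF and log_message are made up for this sketch):

    void log_message(const char * message);

    // The do {...} while (false) idiom makes a multi-statement macro
    // behave like a single statement, but it is one more layer of
    // tree structure a source-level optimizer has to look past.
    #define LOG_IF(condition, message)  \
        do                              \
        {                               \
            if (condition)              \
                log_message(message);   \
        } while (false)

    void example(int error_code)
    {
        if (error_code != 0)
            LOG_IF(error_code > 100, "serious error"); // expands safely even here
        else
            log_message("ok");
    }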

So writing an optimizer that operates on source code is difficult. But maybe a hybrid approach can be found: I’m specifically thinking of adding callbacks to optimization passes and writing code that translates each optimization back to source code. This would be conceptually simple for inlining: every time you inline a function in the assembly, perform the same operation on the source code. There will be problems with that method, but it’s one step closer to having the best of both worlds.
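
A sketch of what that callback could produce for a trivial function (my example, not an existing tool; the second version is what I’d want the inliner to emit):

    // Before the inlining pass: a call to a small helper.
    inline float square(float x)
    {
        return x * x;
    }

    float length_squared(float a, float b)
    {
        return square(a) + square(b);
    }

    // After the pass: the same inlining decision, replayed on the
    // source. A callback from the assembly-level inliner could emit
    // this automatically.
    float length_squared_inlined(float a, float b)
    {
        float square_a = a * a; // inlined square(a)
        float square_b = b * b; // inlined square(b)
        return square_a + square_b;
    }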

I think that source-to-source optimization has many benefits over the classical approach of “don’t tell the programmer anything.” The main downside is that it takes more work than writing a black-box optimizer. I usually write about code that I have written, and this is the first time where I just have to say: I am excited about this, but I don’t have the time to do it myself. Sorry for the disappointment at the end.

But hey, maybe I can get people thinking about how to expose more of an optimizer’s internals in an easier way.