Source to Source Optimization
by Malte Skarupke
In April, Olaf Krzikalla gave a talk titled “Performing Source-to-Source Transformations with Clang” (videos available here).
In that talk he showed an auto vectorizer that doesn’t perform the optimization on the assembly, but which instead spits out new source code that performs the optimized operations. He illustrated it with a before-and-after picture, which isn’t reproduced here.
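The transformation has roughly this shape (my own reconstruction of the idea, assuming SSE as the target; this is not Krzikalla’s actual output): a plain scalar loop gets rewritten into an equivalent loop over SSE intrinsics, plus a scalar loop for the leftover elements.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// What you write:
void add(float * a, const float * b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

// Roughly what a source-to-source vectorizer would emit: the same
// computation, four floats at a time, with a scalar loop for leftovers.
void add_vectorized(float * a, const float * b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)
        a[i] += b[i];
}
```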
And I want to explain why all compiler optimizations should do this.
The reason Mr. Krzikalla gives for preferring to do the optimization on the source code, as opposed to later during compilation, is that it lets them port their optimizations to new machines very quickly. They work in high performance computing, where they often don’t have a choice about which compiler to use when they get a new machine. Which may sound familiar to people in gaming. (I’ve been told things like “we don’t use operator overloading because the compiler for platform X generated crappy assembly for it.”) But with this method you write your loop vectorizer once, and it works for all compilers. (Yes, you’ll have to change the intrinsics for different platforms, but that’s the easy part. The difficult part is writing the code that understands C++ loops.)
Which is a very good reason to do this, but there are many more benefits. And the most immediately important one for me is this:
You can actually debug this code.
Everyone working in C++ has had the experience that debugging optimized code can be very painful. It’s obvious with just the example above: imagine all you had was the plain loop while you were actually debugging the vectorized version. (And yes, you can expect most modern compilers to perform that optimization.) In more complex scenarios with many inlined functions, debugging becomes a tedious and error-prone process, because you first have to understand or guess what the compiler did before you can begin to debug.
But imagine if the compiler wrote not only the optimized assembly, but also a C++ file that more closely resembles the final executable. And imagine your debugger were smart enough to use that new file instead of the original while debugging, maybe even showing them side by side. That would make working with Release builds much easier. Common performance traps, such as accidentally copying a string, would immediately become obvious, because there would be an unplanned call to operator new() in the middle of your code. You might even be able to debug STL code, because all those layers of abstraction would just be inlined away, exposing the algorithm underneath. (Note that I don’t expect this to be true in general. In general I would expect code to become much less readable after optimizations. Only STL code and boost code could become more readable after inlining.)
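To make the string example concrete, here’s a sketch of what such a generated file might show (hand-written for illustration, not output from any real tool):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>

// What you wrote: looks harmless, but "auto entry" (instead of
// "const auto & entry") copies every key/value pair. Copying the
// std::string key calls operator new() for any key too long for the
// small-string buffer.
void print_counts(const std::map<std::string, int> & counts)
{
    for (auto entry : counts)
        std::cout << entry.first << ": " << entry.second << '\n';
}

// What a source-emitting optimizer might show you instead: the copy is
// spelled out, so the hidden allocation is impossible to miss.
void print_counts_expanded(const std::map<std::string, int> & counts)
{
    for (auto it = counts.begin(); it != counts.end(); ++it)
    {
        // copy constructor, which means operator new() for the key
        std::pair<const std::string, int> entry(*it);
        std::cout << entry.first << ": " << entry.second << '\n';
    }
}
```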
Let’s take that idea one step further and give the compiler a GUI. (By that I don’t mean an IDE, which just calls a command line compiler that it treats like a black box, but an actual GUI for the compiler.) It could show you how the source code looks after each optimization pass. We could write an optimizer that’s not a black box. If you’ve ever been frustrated because the compiler generates assembly that’s obviously stupid, you know how valuable this could be. You could actually work together with the compiler, because it could explain the steps it’s taking and why it’s not optimizing away certain instructions. And you could add a “const” or a “restrict” to allow the compiler to do its job.
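As a sketch of the kind of dialogue I mean (my own example, using the __restrict extension that GCC, Clang and MSVC all provide):

```cpp
// Without restrict the compiler has to assume that dst and src might
// overlap, so it will either give up on vectorizing this loop or guard
// it with a runtime overlap check.
void scale(float * dst, const float * src, float factor, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}

// If the compiler could tell you "I couldn't prove that dst and src
// don't alias," the fix is one keyword away:
void scale_restrict(float * __restrict dst, const float * __restrict src,
                    float factor, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * factor; // provably independent iterations
}
```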
It would also eradicate a whole category of bugs that is caused by an overly aggressive optimizer. Bugs caused by the optimizer exploiting undefined behaviour or re-ordering instructions in lockless code would become obvious if it told you what it was doing during optimization.
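A classic example of the kind of bug I mean (a textbook case, not one from the talk): signed overflow is undefined behaviour, so the optimizer is allowed to assume it never happens and delete the check entirely. If the compiler also emitted the transformed source, the vanished branch would be staring you in the face.

```cpp
#include <climits>
#include <cstdio>

// At -O2, GCC and Clang may fold "value + 1 < value" down to false,
// because signed overflow is undefined behaviour and therefore assumed
// never to happen.
bool will_overflow_on_increment(int value)
{
    return value + 1 < value; // undefined when value == INT_MAX
}

int main()
{
    // May print 0 in an optimized build, even though INT_MAX + 1 overflows.
    std::printf("%d\n", will_overflow_on_increment(INT_MAX));
}
```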
And it’s not like these problems are going to get better by themselves. Optimized code will be even more difficult to debug in ten years than it is now. And the more an optimizer does without your input, the easier it will be to break it. It’s cool when your loop gets auto vectorized, but it’s not so cool when you later make an innocent-looking change that confuses the vectorizer and suddenly your code runs half as fast for no apparent reason.
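One concrete way this happens (my example, not one from the talk): integer addition is associative, so an int reduction vectorizes freely. Change the accumulator to float, an innocent-looking edit, and strict IEEE evaluation order forbids the reordering the vectorizer needs, so the loop quietly goes scalar unless you compile with something like -ffast-math.

```cpp
// This loop auto vectorizes: integer additions can be reordered freely.
int sum(const int * data, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// The "innocent" change: same loop, float accumulator. Floating point
// addition is not associative, so without -ffast-math the compiler has
// to keep the exact left-to-right order and usually won't vectorize.
float sum(const float * data, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```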
So how likely is it that every compiler writer will start doing this? This is unfortunately where reality gets in the way: it’s more difficult to write an optimizer that operates on source code than one that operates on assembly. The reason is that source code can be too high level to recover the intent of the programmer. You have to iterate over a tree structure that can be nested arbitrarily deep (for example with the “do {…} while (false)” that people like to put around macros) and that has ten different ways to express the same thing. Also, some optimization steps can only be performed on assembly: you can’t know whether you want to inline a function if you don’t know how big its assembly is.
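For those who haven’t seen it, the “do {…} while (false)” idiom exists so that a multi-statement macro behaves like a single statement, and it’s exactly the kind of extra loop shape a source-level optimizer has to see through:

```cpp
#include <cstdio>

static int g_error_count = 0;

// Without the do/while wrapper, "if (x < 0) REPORT_ERROR(...); else ..."
// would not compile: the macro's braces plus the trailing semicolon
// would leave the else without a matching if.
#define REPORT_ERROR(msg)       \
    do                          \
    {                           \
        ++g_error_count;        \
        std::puts(msg);         \
    } while (false)

void check(int value)
{
    if (value < 0)
        REPORT_ERROR("negative value"); // expands to a single statement
    else
        ++value;
}
```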
So writing an optimizer that operates on source code is difficult. But maybe a hybrid approach can be found: I’m specifically thinking about adding callbacks to optimization passes and writing code that translates optimizations back to source code. This would conceptually be simple for inlining: Every time that you inline a function in assembly, perform the same operation on source code. There will be problems with that method, but it’s one step closer to having the best of both worlds.
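Here’s a minimal sketch of what I mean, with entirely hypothetical names (no real compiler exposes such a hook today): the optimizer reports every function it inlines, and a listener replays the same operation on the AST so the emitted source stays in sync with the assembly.

```cpp
// Hypothetical interface, invented purely for illustration.
struct SourceLocation
{
    const char * file;
    int line;
};

struct OptimizationListener
{
    virtual ~OptimizationListener() = default;
    // Called every time the optimizer inlines callee into caller.
    virtual void on_inline(SourceLocation call_site,
                           const char * caller,
                           const char * callee) = 0;
};

// A listener that mirrors each inlining decision into a source-to-source
// rewriter, so that the generated C++ matches what the assembly does.
struct SourceMirror final : OptimizationListener
{
    void on_inline(SourceLocation call_site,
                   const char * caller,
                   const char * callee) override
    {
        // A real implementation would splice the callee's AST into the
        // caller at call_site and pretty-print the updated file.
        (void)call_site; (void)caller; (void)callee;
    }
};
```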
I think that source-to-source optimization has many benefits over the classical approach of “don’t tell the programmer anything.” The main downside is that it takes more work than writing a black box optimizer. I usually write about code that I have written, and this is the first time where I just have to say: I am excited about this, but I don’t have the time to do it myself. Sorry for the disappointment at the end.
But hey, maybe I can get people thinking about how to expose more of an optimizer’s internals in an easier way.
A reader comments:

That would be pretty neat, but I have a suspicion that it would basically boil down to implementing the optimization twice, once on the IL and once on an AST. I’d rather just have it expose the intermediate form of each optimization step in whatever form it uses, rather than translating it back to the original language (which can only lose information). For example, after register allocation has been done, good compilers will arrange for a value to be stored in the right register to be used in the function call you make next. How do you reasonably see that that happened in C-like source code?
I also suspect that having the compiler tell you why it isn’t performing an optimization falls into the same category as the compiler warning you when you invoke undefined behavior: too noisy to be useful. But there are probably specific instances where it’d be very useful, so I guess I’d reserve judgement.
With Clang it probably wouldn’t even be that difficult to get it to emit its IL after each optimization pass. And if you made use of the same source location mapping that the debugger uses, it might even be fairly easy to interpret.
My reply:

While I agree that showing the optimization steps in the intermediate form in which they are performed would be an improvement for debugging the optimizer, I think that for debugging the generated program it has to be source code or nothing. Or you’d have to come up with some other form in which you can look at variable names with type information, and in which the variable names don’t change meaning all the time. (That is one of the main reasons why debugging assembly is frustrating: for a crash you often can’t even see what has happened so far in the function, because the meaning of “rbx” has changed three times already, and every time it was assigned from a name that has also changed its meaning since then, such as “rax + 12.” Which gives me an idea for greatly improving disassembly debugging: it should be simple to programmatically reason backwards and show the value of the registers for many of the instructions in the function so far. Just always display that information next to the disassembly lines for which you were able to reason backwards. And you should be able to recover type information for registers by checking whether they happen to point at a known vtable. It’s also helpful to see for which pointers that doesn’t work.)
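A toy sketch of that register-tracking idea (my own illustration, and real disassembly is far messier; here the propagation runs forward from the function entry, which yields the same per-line annotations): keep a symbolic expression for each register, so the debugger could annotate every line with what “rbx” currently means instead of just its bits.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A drastically simplified instruction: dst = src (+ offset).
struct Move
{
    std::string dst;
    std::string src;
    int offset;
};

// Propagate symbolic meanings through the moves executed so far. When a
// source register is later overwritten, expressions built from it keep
// referring to its old value, because they were resolved at assignment time.
std::map<std::string, std::string> register_meanings(const std::vector<Move> & code)
{
    std::map<std::string, std::string> meaning;
    for (const Move & move : code)
    {
        auto found = meaning.find(move.src);
        std::string source = found == meaning.end() ? move.src : found->second;
        meaning[move.dst] = move.offset == 0
                          ? source
                          : "(" + source + " + " + std::to_string(move.offset) + ")";
    }
    return meaning;
}

int main()
{
    std::vector<Move> code = { { "rbx", "rax", 12 }, { "rax", "rcx", 0 } };
    for (const auto & entry : register_meanings(code))
        std::cout << entry.first << " = " << entry.second << '\n';
    // prints: rax = rcx, then rbx = (rax + 12), where rax means the
    // value it had on function entry
}
```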
One thing that’s also worth remembering is that Olaf Krzikalla put the large amount of work into source-to-source optimization not because he wanted better debugging, but so that his optimizations could be used with other compilers. I know that doesn’t apply as much to you, but it’s another good reason why you have to map back to source code.
I do realize that there are optimizations that can’t be shown in source code, but that shouldn’t stop us from trying for the other optimizations. Luckily the optimizations that tend to confuse you most while debugging tend to be optimizations with a big impact, and big impacts are more likely to have a mapping back to source code. (I have no evidence for this but it sounds plausible)
As for the signal to noise ratio in the compiler telling you about optimizations that it couldn’t perform: that’s why I think it has to be interactive, and why I think you need a GUI. It usually wouldn’t be helpful to just dump all the optimization passes to disk. You need to be able to step across optimization passes, and then step through a specific pass in detail. It’s basically the difference between printf debugging and interactive debugging with breakpoints.