Ideas for a Programming Language Part 2: A more liberal call stack
by Malte Skarupke
One thing that came out of playing around with coroutines is that I eventually realized that callstacks are a bit annoying. It’s one of those things where it’s difficult to convince people, because callstacks are equivalent in power to any other form of organizing your program flow. But I think that callstacks constantly force us to solve two puzzles:

1. How to split some sequential, blocking code up into several smaller pieces of non-blocking code, because the function has to return quickly.
2. Where to store an object (or how to keep track of it) because it is long-lived but the current scope has to be short-lived.
If you’ve been programming for a while you are really good at solving both of these puzzles, because you solve them all the time. Which is another reason why it’s difficult to convince people that it’s worth looking into alternatives: if you are a programmer today, these puzzles don’t seem that complicated to you.
But I want to investigate how much can be gained by looking at different approaches to call stacks. I think the potential gains are huge, because this is such a core piece of programming.
Do you know that experience when you spend two weeks coding a new system, only to spend one day at the end simplifying all your new code and you realize that you could have written something much simpler in a single day? So you throw your two weeks of work away and keep the one day solution. That’s what happened to me with this blog post… I realized that I can probably get very far in getting rid of annoyances with two fairly small changes to the existing model.
1. Sequential Code
We forget that callstacks are not inevitable. It seems like you just need them to call functions. Where would a function store its temporaries if there was no callstack? But even fundamental things come from somewhere: here is a guy who points out that even functions were once a programming pattern. (search for “Recurring problem” on that page) In The Art of Computer Programming, functions are already accepted as a given, but there was no callstack: every function just had to have a bit of storage, and the programmer had to make sure that two functions didn’t get in each other’s way. Which seems strange at first, but that model has actually made a comeback in recent years in the scripting language Kismet, which pretty much every game engine has reimplemented. In that language every function has its storage allocated ahead of time, and it just reuses that storage every time it is called. If you want to call a function twice without the two calls getting in each other’s way, you have to place the node twice to allocate the necessary storage twice. Which leads to all the obvious problems that you would expect, but they actually don’t come up that often. And the benefit is that since every function has permanent storage, it’s easy to chain together sequential functionality such as “walk over there, then wait half a second, then play a sound and an animation, then start attacking the player.” Functions are just re-entrant, and you don’t have to think about how to make your temporaries survive from one frame to the next.
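To make that model concrete, here is a rough C++ sketch of what a Kismet-style node boils down to. (all the names here are made up for illustration)

struct Vector3 { float x, y, z; };

// A made-up "walk over there" node: instead of stack temporaries it has
// permanent storage, allocated once when the node is placed, so it can
// be interrupted and resumed every frame.
struct WalkToNode
{
    Vector3 target = {};
    float progress = 0.0f;
    bool active = false;

    void start(Vector3 where)
    {
        target = where;
        progress = 0.0f;
        active = true;
    }

    // called once per frame; picks up where the previous call left off
    void update(float dt)
    {
        if (!active)
            return;
        progress += dt;
        if (progress >= 1.0f)
            active = false; // arrived; a real node would fire its output here
    }
};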
In theory there is nothing stopping you from writing the same code in C++ that you’re building in Kismet, except it would be much too verbose. You could build up the same nodes and then define a state machine over those nodes, but connecting together all the pointers for the behaviors would just be ugly and error-prone code. What you want is to express the same thing in “normal” code. Let me illustrate by going through an example:
This is a picture of a Kismet script. I stole the picture from Wikipedia. (which surprisingly only has one paragraph about Kismet…) One quick note: I have never actually used Kismet. I have only used a clone of Kismet in another engine, but from what I hear that clone is pretty much identical to the old version of Kismet which this screenshot is from.
This script controls an elevator. Have a look at it to see if you can figure out how it works. (the Matinee plays animations on objects, in this case it moves the elevator) Now as an exercise let’s write the equivalent C++ code. All we’d have to do is set up all the nodes and all the connections in C++. There are a few possible syntaxes for that, but I came up with this pseudo-code:
bool bMoved = false;
CompareBoolNode * cmp = new CompareBoolNode(bMoved);
DelayNode * first_delay = new DelayNode(some_hardcoded_number);
DelayNode * second_delay = new DelayNode(another_hardcoded_number);
DelayNode * third_delay = new DelayNode(a_third_hardcoded_number);
MatineeNode * matinee = new MatineeNode(Matinee_0, Interior_Elevator_3);
BoolNode * set_moved = new BoolNode(bMoved, true);
BoolNode * clear_moved = new BoolNode(bMoved, false);
cmp->onFalse += &first_delay->start;
first_delay->onFinished += &matinee->play;
matinee->onCompleted += &second_delay->start;
second_delay->onFinished += &matinee->reverse;
second_delay->onFinished += &set_moved->in;
set_moved->out += &third_delay->start;
third_delay->onFinished += &clear_moved->in;
RTriggerVolume_0.onTouched += &cmp->in;
Which is terribly confusing. And I am already assuming a bunch of magic here. It’s easy to see that this code would be a giant source of maintenance problems. If you sat a new programmer in front of this, it would take them a while to figure out the structure, because all the connections have to be parsed and remembered. But that is the most equivalent code to how Kismet works that I could come up with. (or is it? Have a look to see what it actually does. Would you be able to tell if you didn’t have the picture above this?) Really what you want to do is write this:
bool bMoved = false;
RTriggerVolume_0.Touched = [bMoved]() mutable
{
    if (bMoved)
        return;
    Delay(some_hardcoded_delay);
    Matinee_0.Play(Interior_Elevator_3);
    Delay(another_hardcoded_delay);
    bMoved = true;
    Matinee_0.Reverse(Interior_Elevator_3);
    Delay(a_third_hardcoded_delay);
    bMoved = false;
};
This looks a lot more like normal C++ code: Progress along the program is determined simply by looking on which line you are. The early out for bMoved is denoted simply by a return statement. Would you have been able to see that there is an early out in the first code listing?
But I can’t write this code. You can’t have a function like “delay” in C++. And I don’t know about you, but it annoys me a lot that my designers can do this more easily than I can. And that if I want to write code like this I have to work in an inferior language.
So how do we get to this? Coroutines can get you there but then you need a stack for each coroutine. Which is not super expensive, but it’s expensive enough that you have to start thinking about how often you want to use this. Stackless coroutines are cheap enough that you could use them for a system like this, but they are not powerful enough because you can only yield execution on the top level. A custom scheduler with the proposed “await” syntax could also work, but it involves talking across threads and heap allocations for the shared state. I think there’d be too much overhead in that to use it for game code. (except where you’re IO bound like in a streamer, which is precisely what “await” is for)
So what is it fundamentally that makes it possible for Kismet to run efficiently? I believe the answer is that it has a custom compiler. The compiler knows how much memory it has to allocate for a given script, so it can allocate only exactly as much as it needs. At least that’s how it works in the clone that I am familiar with: The compiler computes the memory usage of each of the nodes, prepares a big block of memory that can fit all the nodes, and then that block of memory is allocated when the script is first loaded. So there are no heap allocations at runtime and you could even in theory reason about all the connections and make the transitions between the connections be as cheap as a function call. (Kismet could be as fast as my second code listing which has no function pointers, but I believe it is only as fast as my first code listing)
A C++ compiler doesn’t work like that though: The necessary memory for a function call is allocated at runtime. If I want to call a function called “Matinee::Play()”, then the compiler will place some instructions that allocate the necessary memory at runtime. In Kismet (or at least in the Kismet clone that I am familiar with) the necessary memory would already be ready at that point: The compiler would know that Matinee::Play() requires let’s say 200 bytes of scratch memory, so that memory (plus the memory for Matinee::Reverse() and the three calls to Delay()) would be available by the time that the lambda is called, because the compiler knows that the lambda will eventually call Matinee::Play().
This doesn’t make any difference in terms of performance (allocating stack memory is pretty much free) but what it does allow is that all functions in Kismet can be interrupted and resumed at any point.
OK so how would you actually implement this? I think all we have to do is make stackless coroutines where the compiler can see as deep as it has to. A naive implementation would require us to reserve enough space in the coroutine’s block of memory for all functions that we call from the resumable function, plus all functions that those call recursively, until there is space for every function. However a quick optimization is to only reserve space for the functions that we actually want to be able to resume in the middle of. So let’s say Matinee::Play() is resumable, and it uses a function called Matinee::CalculateLength() which is not resumable; then we just use a normal stack for the second function, and only put the stack memory for Matinee::Play() into the space of the stackless coroutine. It’s actually a tiny modification of stackless coroutines, but it may have a big impact.
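To make this concrete, here is roughly the state that a compiler could generate for a resumable function that calls another resumable function. All the names, numbers, and the one-second animation are made up; the point is just that the callee’s frame is embedded in the caller’s frame, so the total size is known at compile time:

// Hypothetical compiler output. No heap allocation, no separate stack:
// the resumable callee's locals live inside the resumable caller's frame.
struct PlayFrame
{
    float elapsed = 0.0f; // Matinee::Play()'s locals live here

    // advances to the next yield; returns true when finished
    bool step(float dt)
    {
        elapsed += dt;
        return elapsed >= 1.0f; // pretend the animation is 1 second long
    }
};

struct ElevatorFrame
{
    int resume_point = 0;
    float delay_remaining = 0.5f; // Delay()'s local, some hardcoded delay
    PlayFrame play; // space for the resumable callee, reserved at compile time

    bool step(float dt)
    {
        switch (resume_point)
        {
        case 0: // Delay(some_hardcoded_delay)
            delay_remaining -= dt;
            if (delay_remaining > 0.0f)
                return false;
            resume_point = 1;
            play = PlayFrame{};
            [[fallthrough]];
        case 1: // Matinee_0.Play(...), resumed in the middle every frame
            if (!play.step(dt))
                return false;
            resume_point = 2;
            [[fallthrough]];
        default:
            return true; // ...and so on for the rest of the script
        }
    }
};

A non-resumable callee like Matinee::CalculateLength() gets no frame in here; it just runs on the normal stack inside step().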
The goal is to make resumable functions be a normal part of the language so that nobody ever has to even think about it. They have to be as natural as in Kismet, so that if code is sequential, I can write it sequentially. All the downsides of Kismet apply. (don’t accidentally call the function two times at the same time etc.)
Did you notice how I left one feature out of the Kismet version in my C++ version? In the Kismet version two nodes actually run at the same time. That is trivial to do with stackless coroutines: you can simply run two at the same time; there are no problems. It is however a problem of syntax that I’ll have to think about first.
2. Object Lifetimes
The second part that annoys me about callstacks is object lifetimes. And for this let’s imagine I had this function:
std::vector<std::string> split_string(const std::string & to_split, const std::string & separator);
You would probably tell me that this is bad code, because I have to create both a vector and potentially many strings. (imagine reading a large file and splitting on newlines) So basically this is slow. If you have time, have a look at what the string split function looks like in your engine and how it is used. The fastest ones hand you iterators, maybe like this:
void split_string(const std::string & to_split, const std::string & separator, const std::function<bool (const char *, const char *)> & callback);
This puts the control in the hands of the calling code, so that it can decide how to allocate the results. If the calling code wants the results in a std::set instead of a std::vector, it now has the choice to do just that. Or if it doesn’t need to keep the split strings around at all, it can look at them without causing allocations. And if you’re a good C++ programmer, you’re probably writing code like this all the time. String splitting is an obvious example, but there are many other tricks for avoiding heap allocations, and you learn all of them over time and use them without even noticing.
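Just to be explicit about what such a function does internally, here is a minimal sketch of a possible implementation. (a hand-written guess, not code from any actual engine)

#include <algorithm>
#include <functional>
#include <string>

// Hands each piece to the callback as a [begin, end) range; the callback
// returns false to stop early. No heap allocation happens in here.
// (assumes a non-empty separator)
void split_string(const std::string & to_split, const std::string & separator,
                  const std::function<bool (const char *, const char *)> & callback)
{
    const char * begin = to_split.data();
    const char * end = begin + to_split.size();
    for (;;)
    {
        const char * found = std::search(begin, end,
                                         separator.data(),
                                         separator.data() + separator.size());
        if (!callback(begin, found))
            return; // the caller asked to stop
        if (found == end)
            return; // no separator left, we just handed out the last piece
        begin = found + separator.size();
    }
}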
And then one day you look at Python code and the code to read all the lines in a file into a set looks like this:
names = set(open(filename).readlines())
And that is just so nice. And you can write code like this in C++. You can write it in exactly the same syntax, and it would have exactly the same speed as in Python. But you don’t, because for a C++ programmer it is an insult to be merely “as fast as Python.” So our libraries don’t even contain the code that would allow us to write this. Our libraries contain the fast code from above. And people just use the code that is in the libraries, so a typical C++ programmer would write the above one-liner as a ten-liner.
Except that, fundamentally, there is nothing here that should be slow. The only heap allocations needed here are one for each read line, and one for the set. The function readlines() doesn’t need to allocate anything, because we know the lifetime of its objects: they don’t need to survive the current statement. So in theory everything that readlines does could happen on the stack. But you can’t do that in C++. To write the above code, this would have to be the declaration of readlines:
std::vector<std::string> File::readlines();
Which is obviously stupid because we just said that we want the results in a set. Why does readlines() decide to store things in a vector if we want them in a set? Shouldn’t the calling function decide how things should be stored, or whether they need to be stored at all? It seems like there should be a way to get the efficiency of iterators with the composability of returning fully formed objects.
So here is the idea: We allow the transformation from the iterator form to this form as syntactic sugar. So let’s go back to the string split example and let’s say the code looks like this:
std::set<std::string> as_set;
split_string(some_string, ",", [&](string_view substring)
{
    as_set.emplace(substring);
});
Then the language would contain syntactic sugar that recognizes that the last argument is a continuation function, and I am allowed to write this instead:
std::set<std::string> as_set;
for (string_view substring : split_string(some_string, ","))
{
    as_set.emplace(substring);
}
Which looks pretty similar, but it allows us to write the Python code:
std::set<std::string> as_set(split_string(some_string, ","));
Since std::set currently doesn’t have a range constructor, it would probably actually look like this:
auto split = split_string(some_string, ",");
std::set<std::string> as_set(split.begin(), split.end());
Now you only have to believe me that this could be done as syntactic sugar. This is using the callback-based version of the function, so it is as efficient as it can be. What happens here is that the begin() and end() iterators are forward iterators. Every time that begin() is incremented, it runs through one iteration of the internal loop inside of split_string. We never actually exit out of the scope of split_string in this. Its stack is only unwound when the calling scope ends. Determining when which scope gets unwound seems like it could be a hard problem: what if this is nested several layers deep? For example what if readlines() uses split_string() internally, so that the outer function actually has two stack frames alive? But all of that goes away when you do the transformation to the callback-based version manually: there is never any ambiguity about which stack you are in. If you’re wondering what this would look like in a debugger: you could show it as a stack, but I think that would be more confusing. I think I’d probably represent it as a call graph, with lines from small side stacks going back to their parent. (the graph would have cycles)
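To show what I mean by doing the transformation manually, here is roughly what the one-liner from above could desugar to. (hand-written; the exact rules are up for debate)

// The range form turns back into the callback form: the "return value"
// of split_string never exists as an object, and each piece goes
// straight into the set.
std::set<std::string> as_set;
split_string(some_string, ",", [&](const char * begin, const char * end)
{
    as_set.emplace(begin, end);
    return true; // keep iterating
});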
One interesting thing that comes out of this is that some things which would have been stupid in the past are now efficient. For example here is code for determining how many lines are in a file:
size_t num_lines = std::open(filename).readlines().size();
With the above change, this is as fast as you can get. Since this gets transformed to the callback based model behind the scenes, readlines() will never actually allocate anything. Everything here happens on the stack. Which is interesting, right? I don’t quite know what the consequences of this will be yet. Optional return values can be much faster because they only actually incur overhead if they are used. It’s like lazy evaluation with the performance of immediate evaluation.
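Again hand-written, and assuming readlines() hands out lines through a callback the same way split_string does, this could desugar to something like:

// Nothing is stored anywhere: readlines() feeds a counter, and every
// line's memory lives on the stack for one iteration.
size_t num_lines = 0;
std::open(filename).readlines([&](const char *, const char *)
{
    ++num_lines;
    return true;
});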
When reading all of this you may notice that this kinda looks like Haskell’s special ‘do’ syntax for monads. And that is exactly where the motivation for this came from: I first had this idea when trying to implement a parser monad in C++, where I was frustrated by all of the unnecessary heap allocations that a naive parser monad requires. (I actually never checked whether my solution allows a full, efficient parser monad to be implemented.) It is not as powerful as the Haskell syntax, because I don’t want to sacrifice efficiency: everything gets evaluated immediately. This only works as long as the lifetimes of all involved objects are known at compile time. The result type of the split_string function above is a special type that cannot be stored for later use. If you want the values to survive, you have to turn them into a std::vector<std::string>. Which is OK for simple cases like split_string, but it may be annoying for monads.
That being said, I think this is a good compromise between the composability of a language like Haskell and the efficiency of C++.
So in summary I think I want to experiment with the callstack a little. I don’t want to make super big changes, but I am convinced that the current model is flawed. I think either of these changes can have an enormous impact on how we code. If resumable functions are trivial with no loss in efficiency, we’d use them all the time. If objects can survive a scope without having to worry about efficiency or lifetimes, we’d write more reusable functions. But honestly, right now I would give it a 20% chance that I will actually end up with the solutions described above. There is also a 5% chance that I will end up with normal classical callstacks, but I’d say it’s most likely that I will end up with something different. You never know how these ideas work out until you start writing and using them. But for now it is important to decide that I do not want to accept callstacks with all their small annoyances as a given. It is clear to me that there are better models out there, and we just have to go looking for them.
Comments

The “compiler can see however deep” model is what GPUs use for lightweight scheduling, so it’s already proven to a degree. It kind of scares me to throw the complexity of average C++ code at a compiler that is never allowed to stop inlining, but it’s possible. (Though I still think that the unsolved problem of coroutines-for-game-logic is persisting them safely.)
As far as the string splitting example goes: do you think the compiler getting involved is significantly better than, say, the Boost range adaptor setup?
Here is a new reply, four years later. Java is experimenting with adding coroutines, and they are talking about making them Serializable by default. As far as I understand it, this means that all the stack variables are serialized, and you can then transfer that blob to another machine and continue running the coroutine there:
https://www.infoq.com/presentations/continuations-java/
This just confirms that it’s totally possible for the compiler to iterate through all stack variables and serialize them. And I still think (as mentioned in my other comments from four years ago) that it’s possible to write this in a way where you could do this in a library.
Ah I actually thought about persisting: The idea is that if you have stackless coroutines, they are essentially their own type anyway. Just like every lambda is a separate type in C++.
So let’s say I have a stackless coroutine that looks like this (sketched with a made-up resumable/yield syntax):
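// hypothetical syntax: "resumable" and "yield" are not real C++
resumable void delay(float seconds)
{
    // "remaining" has to survive across frames, so it becomes part of
    // the coroutine's state rather than a stack temporary
    float remaining = seconds;
    while (remaining > 0.0f)
    {
        remaining -= frame_dt(); // frame_dt() is a stand-in for the frame time
        yield;                   // suspend until the next frame
    }
}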
Then I’d make it so that this creates a new type called “delay”, of which this function is the constructor. And that type has all of its stack variables as members.
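Roughly like this (a hypothetical, hand-written sketch of what the compiler might generate):

struct delay
{
    float seconds;    // the constructor argument becomes a member
    float remaining;  // each stack variable becomes a member
    int resume_point; // where to continue; see below for persisting this

    explicit delay(float seconds);
    bool step(float dt); // resumes at resume_point, true when finished
};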
So I could then write something like this (again hypothetical):
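delay d(0.5f);
d.step(frame_dt());       // run it like before
float left = d.remaining; // but stack variables are just members now
// a resumable function called by delay would be a member too, so you
// can walk through all stack variables recursively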
Like that you can get to all stack variables recursively. (as long as they are named, but you could also automatically generate names for unnamed variables. Names don’t really matter, see part 1 of this series) And since the language will definitely have compile time reflection, it should be possible to write a library that persists this automatically. The biggest problem that I can think of right now is that the class will probably have some kind of magic instruction pointer as a member, which may be difficult to persist. But that’s probably solvable. Maybe just make it an enum where all possible yield points are valid values.
As for range adaptors: Once you have resumable functions, range adaptors should in theory be enough, because then a split_string function could just be a resumable function that acts as a range.
But the trick of transforming the remainder of a function into a continuation lambda is more widely applicable. Another example where lifetime issues are a bit annoying is partially applied functions. Since you can just say “auto” as your return type, curried functions are a bit less annoying than they used to be in C++, but even then you often run into cases where you just have to return a std::function, and then you have to worry about whether your functor is 16 bytes in size or more. If it’s more, then you have runtime overhead over the uncurried version, because the std::function will allocate from the heap. So in the end nobody uses curried functions in C++, because after all you don’t really need them, they are just a convenience. With this syntactic sugar you can get curried functions with zero overhead: the function doesn’t return the curried function, but takes a continuation lambda as the last argument, to which it makes the curried function available. As long as the lifetime of the curried function is known at compile time (and when is it not?) this will work.
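Here is a sketch of what I mean. (hand-written; with the syntactic sugar, the continuation lambda would be generated from the rest of the calling function)

#include <cstdio>

// Instead of returning the curried function, hand it to a continuation.
// Its type stays concrete, so there is no std::function and no heap.
template<typename Continuation>
void make_adder(int a, Continuation && continuation)
{
    continuation([a](int b) { return a + b; });
}

int main()
{
    // with the sugar, "auto add_one = make_adder(1);" would turn the
    // rest of this scope into the continuation automatically
    make_adder(1, [](auto add_one)
    {
        std::printf("%d\n", add_one(2)); // prints 3
    });
}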
This is why I want to try to see if this can be used for a parser monad.
I think this would be a feature that would be used all over the place, without people even realizing that they are using it. After all I can make it transparent in the IDE. You usually don’t care if a function returns a value or if it makes the value available to a continuation function. For most types it doesn’t make a difference anyway.
If you change the code inside your coroutine, change the compiler flags, change the compiler, or blink, it becomes much harder to load from a persisted coroutine correctly. Seems scary.
I don’t intend to say that transform-to-continuation-passing-style is useless, but I’m not convinced yet… curried functions can be implemented in a million different ways. Like a function returning a lambda, or just making a new lambda around a function with a few arguments applied, and trusting the optimizer… I’m not sure that a C++-like starting point is the right direction to approach these programming techniques from, if that makes sense.
Yeah, the changing coroutine is going to be hard, but only because it’s more likely to change than classes are. The way to deal with changing classes is to use serialization that doesn’t rely on memory layout. Like writing it out to JSON. (or using a robust binary representation that can skip members and read them in any order) All these factors are the same for coroutines, except coroutines are likely to change more often. You could even force people to write their own version handling code, like in boost.serialization.
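For the made-up delay type from my earlier comment, version handling in the style of boost.serialization could look roughly like this (a sketch, assuming the generated members are visible to the serializer):

template<typename Archive>
void serialize(Archive & ar, delay & d, const unsigned int version)
{
    ar & d.seconds;
    ar & d.remaining;
    if (version >= 1)
        ar & d.resume_point; // the yield point, added in a later version
}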
I don’t necessarily plan on supporting serialization out of the gate, but I do want to make sure that it’s somewhat easy to write a library like boost.serialization, and it should be possible to make it also work for resumable functions.
So the “making a new lambda around a function with a few arguments applied” approach usually gets handled very well by the optimizer. Unfortunately you can’t compose it: if you want to put that into a reusable piece of code, you have to return the lambda type. Which brings you to “function returning a lambda.” Which works, but every time that I have tried playing around with it, it falls apart somewhat quickly and you end up needing std::function. And then you’ve got something that the optimizer can’t reason about super well. (at least as soon as you need a heap allocation) And as soon as you have one function which can’t use the “function returning a lambda” approach but which has to be evaluated directly, you’re making your life a lot harder when doing for example a parser monad. Either everything is composable or nothing is. An in-between thing ends up working a lot like something where nothing is composable.
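A small hand-written example of what I mean by it falling apart:

#include <functional>

// This composes fine as long as everything stays "auto":
auto make_adder(int a)
{
    return [a](int b) { return a + b; };
}

// But as soon as the lambda has to be stored in a member, named in a
// header, or passed through a virtual interface, you need type erasure,
// and captures beyond the small-buffer size (commonly 16 bytes) make
// std::function allocate from the heap:
struct Calculator
{
    std::function<int (int)> operation = make_adder(1);
};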
That being said I did recently watch this video: https://www.youtube.com/watch?v=xKRndVoo2ms
And that guy has a clever piece in his framework with his “sequential_” function, which seems at first like it might not be needed, but it actually solves one issue that I ran into when trying something similar: the sequential_ function reduces your nesting to exactly one layer, whereas if you use operator overloading for chaining (which is what I did, and which is also used in range libraries) you end up getting recursive layers that can be annoying.
It’s not just layout: the optimizer has abilities with stack variables that it never has with member variables. Like finding one variable is unused and optimizing it out, or fusing two together, etc. etc.
I haven’t thought about it carefully, but it seems that “function returning a lambda” is very similar to continuation passing style, and they would both break down and need an std::function-like wrapper at the same point. If I curry a function, the compiler can only transform to continuation passing if the eventual function call is visible at that point, which seems very similar to the lambda restriction.