Ideas for a Programming Language Part 2: A more liberal call stack

by Malte Skarupke

One thing that came out of playing around with coroutines is that I eventually realized that callstacks are a bit annoying. It’s one of these things where it’s difficult to convince people because callstacks are equivalent in power to any other form of organizing your program flow. But I think that callstacks constantly force us into solving two puzzles: 1. How to split some sequential, blocking, code up into several smaller pieces of non-blocking code because the function has to return quickly. 2. Where to store an object (or keep track of an object) because it is long-lived but the current scope has to be short-lived.

If you’ve been programming for a while you are really good at solving both of these puzzles because you solve them all the time. Which is another reason for why it’s difficult to convince people that it’s worth looking into alternatives: If you are a programmer today, these don’t seem that complicated to you.

But I want to investigate how much can be gained by looking at different approaches to call stacks. I think the potential gains are huge, because this is such a core piece of programming.

Do you know that experience when you spend two weeks coding a new system, only to spend one day at the end simplifying all your new code and you realize that you could have written something much simpler in a single day? So you throw your two weeks of work away and keep the one day solution. That’s what happened to me with this blog post… I realized that I can probably get very far in getting rid of annoyances with two fairly small changes to the existing model.

1. Sequential Code

We forget that callstacks are not inevitable. It seems like you just need them to call functions. Where would a function store its temporaries if there was no callstack? But even fundamental things come from somewhere: here is a guy who points out that even functions were once a programming pattern. (search for “Recurring problem” on that page) In The Art of Computer Programming, functions are already accepted as a given, but there was no callstack: Every function just had to have a bit of storage, and the programmer had to make sure that two functions didn’t get in each others ways. Which seems strange at first, but that model has actually made a comeback in recent years in the scripting language Kismet which pretty much every game engine has reimplemented. In that language every function has its storage allocated ahead of time and it just reuses that storage every time that it is called. If you want to call a function twice without them getting in each others way, you have to place it twice to allocate the necessary storage twice. Which leads to all the obvious problems that you would expect, but they actually don’t come up that often. And the benefit is that since every function has permanent storage, it’s easy to chain together sequential functionality such as “walk over there, then wait half a second then play a sound and an animation, then start attacking the player.” Functions just are re-entrant and you don’t have to think about how to make your temporaries survive from one frame to the next.

In theory there is nothing stopping you from writing the same code in C++ that you’re building in Kismet, except it would be much too verbose. You could build up the same nodes and then define a state machine over those nodes, but connecting together all the pointers for the behaviors would just be ugly and error-prone code. What you want is to express the same thing in “normal” code. Let me illustrate by going through an example:


This is a picture of a Kismet script. I stole the picture from Wikipedia. (which surprisingly only has one paragraph about Kismet…) One quick note: I have never actually used Kismet. I have only used a clone of Kismet in another engine, but from what I hear that clone is pretty much identical to the old version of Kismet which this screenshot is from.

This script controls an elevator. Have a look at it to see if you can find out how it works. (the Matinee plays animations on objects, in this case it moves the elevator) Now as an exercise let’s write the equivalent C++ code. All we’d have to do is set up all the nodes and all the connections in C++. There’s a few possible syntaxes for that but I came up with this pseudo-code:

bool bMoved = false;
CompareBoolNode * cmp = new CompareBoolNode(bMoved);
DelayNode * first_delay = new DelayNode(some_hardcoded_number);
DelayNode * second_delay = new DelayNode(another_hardcoded_number);
DelayNode * third_delay = new DelayNode(a_third_hardcoded_number);
MatineeNode * matinee = new MatineeNode(Matinee_0, Interior_Elevator_3);
BoolNode * set_moved = new BoolNode(bMoved, true);
BoolNode * clear_moved = new BoolNode(bMoved, false);
cmp->onFalse += &first_delay->start
first_delay->onFinished += &matinee->play;
matinee->onCompleted += &second_delay->start;
second_delay->onFinished += &matinee->reverse;
second_delay->onFinished += &set_moved->in;
set_moved->out += &third_delay->start;
third->delay->onFinished = &clear_moved->in;

RTriggerVolume_0.onTouched += &cmp->in;

Which is terribly confusing. And I am already assuming a bunch of magic here. It’s obvious to see that this code would be a giant source of maintenance problems. If you sat a new programmer in front of this it would take them a while to figure out the structure because all the connections have to be parsed and remembered. But that is the most equivalent code to how Kismet works that I could come up with. (or is it? Have a look to see what it actually does. Would you be able to tell if you didn’t have the picture above this?) Really what you want to do is write this:

bool bMoved = false;
RTriggerVolume_0.Touched = [bMoved]() mutable
    if (bMoved)
    bMoved = true;
    bMoved = false;

This looks a lot more like normal C++ code: Progress along the program is determined simply by looking on which line you are. The early out for bMoved is denoted simply by a return statement. Would you have been able to see that there is an early out in the first code listing?

But I can’t write this code. You can’t have a function like “delay” in C++. And I don’t know about you, but it annoys me a lot that my designers can do this more easily than I can. And that if I want to write code like this I have to work in an inferior language.

So how do we get to this? Coroutines can get you there but then you need a stack for each coroutine. Which is not super expensive, but it’s expensive enough that you have to start thinking about how often you want to use this. Stackless coroutines are cheap enough that you could use them for a system like this, but they are not powerful enough because you can only yield execution on the top level. A custom scheduler with the proposed “await” syntax could also work, but it involves talking across threads and heap allocations for the shared state. I think there’d be too much overhead in that to use it for game code. (except where you’re IO bound like in a streamer, which is precisely what “await” is for)

So what is it fundamentally that makes it possible for Kismet to run efficiently? I believe the answer is that it has a custom compiler. The compiler knows how much memory it has to allocate for a given script, so it can allocate only exactly as much as it needs. At least that’s how it works in the clone that I am familiar with: The compiler computes the memory usage of each of the nodes, prepares a big block of memory that can fit all the nodes, and then that block of memory is allocated when the script is first loaded. So there are no heap allocations at runtime and you could even in theory reason about all the connections and make the transitions between the connections be as cheap as a function call. (Kismet could be as fast as my second code listing which has no function pointers, but I believe it is only as fast as my first code listing)

A C++ compiler doesn’t work like that though: The necessary memory for a function call is allocated at runtime. If I want to call a function called “Matinee::Play()”, then the compiler will place some instructions that allocate the necessary memory at runtime. In Kismet (or at least in the Kismet clone that I am familiar with) the necessary memory would already be ready at that point: The compiler would know that Matinee::Play() requires let’s say 200 bytes of scratch memory, so that memory (plus the memory for Matinee::Reverse() and the three calls to Delay()) would be available by the time that the lambda is called, because the compiler knows that the lambda will eventually call Matinee::Play().

This doesn’t make any difference in terms of performance (allocating stack memory is pretty much free) but what it does allow is that all functions in Kismet can be interrupted and resumed at any point.

OK so how would you actually implement this? I think all we have to do is make stackless coroutines where the compiler can see however deep that it has to see. A naive implementation would require us to reserve enough space in that segment for all functions that we call from the resumable function, plus all functions that we call from those recursively until there is space for every function. However a quick optimization is to only reserve space for the functions that we actually want to be able to resume in the middle of. So let’s say Matinee::Play() is resumable, and it uses a function called Matinee::CalculateLength() which is not resumable, then we just use a normal stack for the second function, and only put the stack memory for Matinee::Play() into the space of the stackless coroutine. It’s actually a tiny modification on stackless coroutines, but it may have a big impact.

The goal is to make resumable functions be a normal part of the language so that nobody ever has to even think about it. They have to be as natural as in Kismet, so that if code is sequential, I can write it sequentially. All the downsides of Kismet apply. (don’t accidentally call the function two times at the same time etc.)

Did you notice how I left one feature out of the Kismet version in my C++ version? In the Kismet version two nodes actually run at the same time. That is trivial to do when using stackless coroutines: You simply can run two at the same time, there are no problems. It is however a problem of syntax that I’ll have to think about first.

2. Object Lifetimes

The second part that annoys me about callstacks is object lifetimes. And for this let’s imagine I had this function:

std::vector<std::string> split_string(const std::string & to_split, const std::string & separator);

You would probably tell me that this is bad code because I have to create both a vector and potentially many strings. (imagine reading a large file and splitting on newlines) So basically this is slow. If you have time have a look at what the string split function looks like in your engine and how it is used. The fastest ones work on iterators, maybe like this:

void split_string(const std::string & to_split, const std::string & separator, const std::function<bool (const char *, const char *)> & callback);

This puts the control in the hands of the calling code so that it can decide how to allocate the results. So if the calling code wants the results in a std::set instead of a std::vector, it now has the choice to do just that. Or if it doesn’t need to keep the split strings around at all, it can look at them without causing allocations. And if you’re a good C++ programmer you’re probably writing code like this all the time. String splitting is an obvious example, but there are many other cases where we avoid heap allocations and many other tricks to save heap allocations, and you are learning all of them every day and you’re using them without even noticing.

And then one day you look at Python code and the code to read all the lines in a file into a set looks like this:

names = set(open(filename).readlines())

And that is just so nice. And you can write code like this in C++. You can write it in exactly the same syntax and it would have exactly the same speed as in Python. But you don’t, because for a C++ programmers it is an insult to be called as fast as Python. So our libraries don’t even contain the code that would allow us to write this. Our libraries contain the fast code from above. And people just use the code that is in the libraries, so a typical C++ programmer would write the above one liner as a ten liner.

Except that, fundamentally, there is nothing here that should be slow. The only heap allocations needed here are one for each read line, and one for the set. The function readlines() doesn’t need to allocate anything because we know the lifetime of its objects and they don’t need to survive the current statement. So in theory everything that readlines does could happen on the stack. But you can’t do that in C++. To write the above code this would have to be the declaration of readlines:

std::vector<std::string> File::readlines();

Which is obviously stupid because we just said that we want the results in a set. Why does readlines() decide to store things in a vector if we want them in a set? Shouldn’t the calling function decide how things should be stored, or whether they need to be stored at all? It seems like there should be a way to get the efficiency of iterators with the composability of returning fully formed objects.

So here is the idea: We allow the transformation from the iterator form to this form as syntactic sugar.  So let’s go back to the string split example and let’s say the code looks like this:

std::set<std::string> as_set;
split_string(some_string, ",", [&](string_view substring)

Then the language would contain syntactic sugar that recognizes that the last argument is a continuation function, and I am allowed to write this instead:

std::set<std::string> as_set;
for (string_view substring : split_string(some_string, ","))

Which looks pretty similar, but it allows us to write the Python code:

std::set<std::string> as_set(split_string(some_string, ","));

Since std::set currently doesn’t have a range constructor, it would probably actually look like this:

auto split = split_string(some_string, ",");
std::set<std::string> as_set(split.begin(), split.end());

Now you only have to believe me that this could be done as syntactic sugar. This is using the callback based version of the function, so it is as efficient as it can be. What happens here is that the begin() and end() iterators are forward iterators. Every time that begin() is incremented, it runs through one iteration in the internal loop inside of split_string. We never actually exit out of the scope of split_string in this. Its stack is only unwound when the calling scope ends. Which seems like it might be a hard problem to determine when which scope gets unwound: what if this is nested several layers deep? For example if readlines() uses split_string() internally so the outer function actually has two stack frames alive? But all of that goes away when you do the transformation to the callback based version manually: There is never any ambiguity about which stack you are in. If you’re wondering what this would look like in a debugger: You could show it as a stack but I think that would be more confusing. I think I’d probably represent it as a call-graph, with lines from small side stacks going back to their parent. (the graph would have circles)

One interesting thing that comes out of this is that some things which would have been stupid in the past are now efficient. For example here is code for determining how many lines are in a file:

size_t num_lines = std::open(filename).readlines().size();

With the above change, this is as fast as you can get. Since this gets transformed to the callback based model behind the scenes, readlines() will never actually allocate anything. Everything here happens on the stack. Which is interesting, right? I don’t quite know what the consequences of this will be yet. Optional return values can be much faster because they only actually incur overhead if they are used. It’s like lazy evaluation with the performance of immediate evaluation.

When reading all of this you may notice that this kinda looks like Haskell’s special ‘do’ syntax for monads. And that is exactly where the motivation for this came from: I first had this idea when trying to implement a parser monad in C++: I was frustrated by all of the unnecessary heap allocations that a naive parser monad requires. I actually never checked whether my solution does allow a full, efficient parser monad to be implemented. It is not as powerful as the Haskell syntax because I don’t want to sacrifice efficiency: Everything gets evaluated immediately. This only works as long as the lifetimes of all involved objects are known at compile time. The result type of the split_string function above is a special type that can not be stored for later use. If you want the values to survive, you have to turn them into a std::vector<std::string>. Which is OK for simple cases like split_string, but it may be annoying for monads.

That being said I think this is a good compromise towards the composability of a language like Haskell with the efficiency of C++.

So in summary I think I want to experiment with the callstack a little. I don’t want to make super big changes, but I am convinced that the current model is flawed. I think either of these changes can have an enormous impact on how we code. If resumable functions are trivial with no loss in efficiency, we’d use them all the time. If objects can survive a scope without having to worry about efficiency or lifetimes, we’d write more reusable functions. But honestly, right now I would give it a 20% chance that I will actually end up with the solutions described above. There is also a 5% chance that I will end up with normal classical callstacks, but I’d say it’s most likely that I will end up with something different. You never know how these ideas work out until you start writing and using them. But for now it is important to decide that I do not want to accept callstacks with all their small annoyances as a given. It is clear to me that there are better models out there, and we just have to go looking for them.