Probably Dance

How I use LLMs to program

Malte Skarupke — Wed, 10 Apr 2024 03:08:08 +0000

Studies have shown that LLMs help novice programmers more than experienced programmers. This matches my experience. At work I see that interns or new hires have some LLM window open almost all the time. I use them maybe once a week. But you could say the same thing about Stack Overflow. I used it all the time when I started programming. Now I use it occasionally. While it’s easy to point at their obvious issues, I think they are also clearly a net-positive on average. So how do LLMs help me?

Big plus: Languages that I don’t use as often

I don’t often write SQL statements. I can obviously write the simple ones, but SQL is a language that has all the features you could ever possibly want, and I don’t know how to use them and don’t know how to google for them. So I ask an LLM. The same goes for JavaScript/CSS/HTML programming. I used to hate doing web frontend work, but now it’s not so bad because LLMs can help me get out of the tricky edge cases.

I have also used LLMs to translate functionality from one language to another. E.g. if I know what a function is called in C++ but I can’t find an equivalent one in the standard library of another language, an LLM will often do a decent first pass of rewriting the C++ function in the other language.

Small minus: The code is overly generic

LLMs should probably ask for more information more often. At the moment an LLM is like a highly motivated programmer who doesn’t know enough but who can provide you with a piece of code that should work in most contexts. It won’t tell you that though, it’ll pretend that it gave you a good solution.

So for example when I ask it “I have this complicated SQL statement but I need to bring in information from one more table” it’ll give me back a nested select statement. I have to prompt it again and ask “isn’t there a way to do this with a join?” before it tells me that there is a kind of JOIN statement that does exactly what I want. Similarly for javascript it likes to just register global callbacks that listen to every mouseenter event on any element in the page.

This works fine for most languages. It leads to problems in strict languages like Rust. When someone asks me for help with LLM-assisted Rust code, usually we first have to ask “but what is this actually trying to do” a few times to undo the unnecessary genericness (RefCell) of the LLM code.

It also is going to lead to problems in the long term where you have a lot of code that’s overly broad. When you have too many pieces of code that register broad callbacks “just in case” or do nested SQL statements “just in case” you can’t build a mental model of what actually happens in your code base.

Neutral: It can fix its own mistakes

One thing I keep on forgetting to do is to ask it to fix its own mistakes. After writing the above, I realized I could just ask it how to avoid the global JavaScript callbacks. It told me about “mutation observers,” which are a kind of global observer that allows me to attach my other observers to only the nodes that I want. Which is better, maybe?

In general I find that you can ask it a few times “this part bothers me, can you do it better?” to get better quality. I wish it just did this on its own, but it is good that it’s happy to rewrite it for you as often as you ask. (though if you ask three times and there is no improvement, that means the LLM just can’t do what you’re asking)

Big plus: It can help you get unstuck

LLMs are undaunted. You can ask them any tricky question and they’ll provide an answer. Sometimes the answer is not bad and you immediately see how to make progress. Often the answer is pretty bad, but that doesn’t mean it’s useless. It still helps in at least two ways:

Rubber duck debugging. LLMs are pretty good for this. Often it helps me to just explain the problem, because it forces me to actually clearly state the problem. Also if the LLM comes back with an answer that’s totally unrelated, I have to go back and be even more clear. And then once I can clearly state the problem, the solution sometimes presents itself.
They provide another perspective. It’s not always a great perspective, but when you’re solving tricky problems you have to look at them from different angles, and LLMs are great at providing more angles. Even if I don’t go with any of the ideas that the LLM provided, it often helps me think of a good approach. I especially appreciate when the LLM provides me with something that’s simpler than what I thought of, which it does surprisingly often.

Small minus: It does not hold back even when producing garbage

E.g. there is this infamous example:

ceiling is being raised. cursor's copilot helped us write "superhuman code" for a critical feature. We can read this code, but VERY few engineers out there could write it from scratch.

Took lots of convincing too. "come on, this must be possible, try harder". and obviously- done… pic.twitter.com/rPHOUFbEyw
— Atai Barkai (@ataiiam) March 5, 2024

This is mostly nonsense where they wrote something that is much too complicated and then used AI to do… something with it. Any experienced programmer would have stopped halfway through writing this and thought “no, this is getting too messy, there must be another way.” But an AI will happily write this for you, and then you have to unhappily live with it.

Big plus: It knows all algorithms and libraries

Even the first early version of ChatGPT, which produced terrible code, was impressive when it came to one area: It understood what you wanted in a way that no search engine could. LLMs have only gotten better there. I love that I can ask “I have a problem that’s shaped like this and it feels like there should be a data structure or algorithm that could help here” and it’ll understand what I want and point me towards relevant algorithms. The first answer is almost never what I want, but I also love that I can have a conversation. It usually goes like this:

me: I have the following problem: … What algorithm or data structure can help with that?

LLM: Have you considered X or Y?

me: Well yes, obviously those are the first things I thought of, but they don’t work because of this part of the problem that I just told you about.

LLM: Oh I’m so sorry, you’re right. How about Z then?

me: I had heard about Z before but don’t really know about it. I thought it had the following problem: …?

LLM: Oh that can be overcome by doing Z* or with this other approach

And back and forth like that a few times. I love that it just knows this stuff and can point me at all the interesting things that exist. I also love that sometimes the first answer is “no there is no good algorithm here because this is actually really hard and you probably want to just go with this simple, partial solution that’s at least easy to understand.”

Small minus: It read the experts but it’s not an expert itself

For any algorithm there is the simple straightforward thing that everyone uses, then there is a slightly more complicated solution that is useful sometimes, and then there are a dozen really complicated solutions that only exist because someone needed to publish a paper and they found one specialized benchmark where their solution artificially looks good. Nobody actually uses them.

Much of the value of experts is that they can tell you which things you have to pay attention to and which things you can safely ignore. LLMs can not be trusted when it comes to that. They will believe the claims of the paper authors and confidently tell you all the benefits that certain methods have. If you believe the LLM, you’ll have to spend a lot of time, possibly days, rediscovering all the reasons why nobody is using this algorithm in practice.

The LLM makes this mistake in all directions. It will never tell you “I don’t know” or “I’m not confident.” Instead it will tell you confidently that the simple solution is good when you should use something more complicated, and it will also confidently tell you that the complicated solution is good when you should never be using it. You just have to take that into account.

Big plus: It’s competent for simple tasks

This is obvious but it’s worth pointing out. The output of LLMs is usually high quality. If you ask it to explain, say “how do heat pumps work?” and maybe ask some follow up question, it’ll do a better job than 99% of the population. This is also true of code. Of course most of the population doesn’t program so that 99% number is irrelevant, but for simple problems it does a better job than many professional programmers would. Sure, there will be bugs and you may have to adapt the code (depending on how you asked) but it’ll be competent. LLMs would pass coding interviews.

And while I could usually nitpick and find improvements, there are a lot of places where you have to solve the kinds of simple problems that LLMs are good at solving, and where it’s fine to use code at LLM quality, unmodified.

Small minus: I have no idea how to use LLMs for maintenance

Unfortunately I don’t spend most of my days writing code. I spend most of my days maintaining code. Meaning debugging and refactoring and adding small features to existing programs. LLMs just don’t fit into that. Maybe that’s just expected because it’s “generative AI” and if I don’t need to generate much code, it can’t help.

It sure would be nice though if I could point it at a directory of files and say “I need to change the following types, which will change some interfaces. Can you update all necessary files and point out any place where I need to pay attention?” But I currently would have no confidence that it would do a good job at that.

You need to be more of a critic, editor and reviewer

So when you use LLMs to generate new code, you mostly need to be a critic with good taste. You need to know when the quality of the LLM is appropriate. When is it OK to use overly generic code? Can you voice what bothers you and ask it to do better? Usually you have to ask the question “but what do we actually need to do here” a few times.

When you ask for advice on algorithms, the LLM will enthusiastically tell you that something is a great idea, and when you tell it that it’s not, it will enthusiastically tell you that you’re right and that it was a terrible idea. So you can’t rely on it for critical thinking. You can only use it to enumerate options.

When working with junior devs who use LLMs, code review is slightly easier because on average they write better code, but there are new failure modes. They don’t have the necessary experience to know when to accept the output of the LLM and when to doubt it. You need to watch out even more for code that was written without a mental model of how the existing code works.

On average LLMs are a clear positive. They’re not good enough yet where they’ll make you hugely more productive, but they’re already good enough where you’d miss out if you weren’t at least using them occasionally. Use one of the paid options, they’re much better than the free options. I hear Claude Opus is currently best. (I haven’t done comparisons) You just have to get experience with where they’re good and where they’re bad, and I’m hoping that sharing my experience can help you with that.

Transform Matrices are Great and You Should Understand Them

Malte Skarupke — Sun, 29 Oct 2023 18:39:03 +0000

The title of this blog post is obvious for any game programmer, but I notice that people outside of games often write clumsy code because they don’t know how transform matrices work. Especially when people do some simple 2D rendering code, like if you just want to quickly visualize some data in an HTML canvas. People get tripped up on transform math a lot. As an example imagine drawing this simple graph:

It’s just an arbitrary graph with arbitrary numbers, the point is all the layout decisions that happened here: E.g. “Long First Label” extends to the left and pushes everything else over to the right by a bit. If you aren’t organized about how to express your transforms, your code ends up with lots of arbitrary offsets and multipliers. You can’t even calculate where to draw the labels on the y-axis without including an offset for potentially long labels on the x-axis. (and long labels on the y-axis can push over the x-axis, too. Uh-oh) Your first choice for visualization should probably be an existing tool (like I used to generate the graph above) but surprisingly often you’ll want to do something custom, and then you have to worry about transforms.

Game programmers have to do complicated transforms all the time, so they had to get organized about this and the result is the transform matrix. It’s remarkably simple and every programmer should probably know it, just to appreciate its beauty. There are two tricks to transform matrices:

Trick 1: Add an Extra Column for Positions

Matrices are not an obvious choice for positioning of things. It’s easy to see how to scale things with matrices, and rotations are doable by putting the sin() and cos() of the angle in the right places. But how do you do positions? Matrices can only multiply numbers, so how do you add 10 to all x coordinates? You need one more column. For 2D rendering you need a 3×3 transformation matrix. For 3D rendering you need a 4×4 matrix. Then when multiplying your matrix with a position, you use $(x, y, 1)$ instead of just $(x, y)$:

$$\begin{pmatrix} 1 & 0 & 10 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} x + 10 \\ y \\ 1 \end{pmatrix}$$

This is how you can add numbers using matrices. Occasionally it’s also useful to multiply with $(x, y, 0)$ instead, which just gets the rotation and scaling parts, no translation.
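To make the extra-column trick concrete, here is a minimal C++ sketch. The names `Mat3`, `Vec3`, `translation`, `transform_point` and `transform_direction` are made up for this illustration and do not come from any particular library. It assumes column vectors multiplied on the right of a row-major 3×3 matrix.

```cpp
#include <cassert>

// Hypothetical minimal types for 2D transforms, row-major storage.
struct Mat3 { double m[3][3]; };
struct Vec3 { double x, y, w; };

// The translation lives in the extra third column.
inline Mat3 translation(double tx, double ty)
{
    return Mat3{{{1.0, 0.0, tx}, {0.0, 1.0, ty}, {0.0, 0.0, 1.0}}};
}

// Matrix times column vector.
inline Vec3 apply(const Mat3 & a, Vec3 v)
{
    return Vec3{a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.w,
                a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.w,
                a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.w};
}

// w = 1 marks a position, which picks up the translation.
inline Vec3 transform_point(const Mat3 & a, double x, double y)
{
    return apply(a, Vec3{x, y, 1.0});
}

// w = 0 marks a direction, which only sees rotation and scaling.
inline Vec3 transform_direction(const Mat3 & a, double x, double y)
{
    return apply(a, Vec3{x, y, 0.0});
}

// Usage: adding 10 to x moves the point but leaves the direction alone.
inline void check_translation()
{
    const Mat3 move_right = translation(10.0, 0.0);
    assert(transform_point(move_right, 3.0, 4.0).x == 13.0);
    assert(transform_direction(move_right, 3.0, 4.0).x == 3.0);
}
```

The only difference between the two helpers is the last component of the vector, which is what decides whether the third column takes part in the multiplication.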

Trick 2: Compression through Associativity

Matrix multiplication is associative: $(A B) C = A (B C)$. And, equally importantly, square matrices don’t grow after being multiplied. If you multiply two 3×3 matrices together, you get another 3×3 matrix. Meaning sizeof(a*b) == sizeof(a). This allows you to collapse multiple transform matrices into one.

For example in a video game you might express animation data as a hierarchy of matrices. (this used to be standard, now there are slightly better techniques, but they use the same “compression through associativity” trick so I’ll just stick with matrices for now) So when you animate a fingertip, your matrix would contain a translation plus a rotation, and maybe a scaling factor. But that matrix only tells you how the tip of the finger should move relative to the middle segment of the finger. And that middle segment only has a matrix that tells it how to move relative to the root of the finger, and so on all the way up the arm and down the spine until you get to the reference matrix that moves the whole body.

When rendering you have to determine the position of every vertex of every triangle on the body, so that’s going to be a lot of matrix multiplications. Instead of walking up that chain of matrices for every vertex, you collapse all the matrices down. If you have 100 matrices for your character’s skeleton, you’ll pre-multiply those 100 matrices in a topological-order loop to end up with one matrix for each bone that contains all of its parent matrices baked in. Meaning the matrix for the fingertip has in it the matrix for the middle joint of the finger, the wrist, elbow, shoulder, all the way to the reference bone. All of those translations and rotations can be collapsed down into a single 4×4 matrix that does all of the work in one step.
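The pre-multiplication loop could look roughly like the following C++ sketch. It uses 3×3 matrices instead of 4×4 to stay short, and `Bone`, `no_parent` and `collapse` are hypothetical names for this illustration. It also assumes the bones are stored parents-first, which is one simple way to get a topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Mat3 { double m[3][3]; };

inline Mat3 multiply(const Mat3 & a, const Mat3 & b)
{
    Mat3 r{}; // value-initialized: all zeros
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

inline Mat3 translation(double tx, double ty)
{
    return Mat3{{{1.0, 0.0, tx}, {0.0, 1.0, ty}, {0.0, 0.0, 1.0}}};
}

// Marker for the root bone, which has no parent.
constexpr std::size_t no_parent = static_cast<std::size_t>(-1);

// Hypothetical bone: a transform relative to its parent bone.
struct Bone
{
    std::size_t parent;
    Mat3 local;
};

// Returns one matrix per bone with all parent matrices baked in.
// Assumes every bone is stored after its parent.
inline std::vector<Mat3> collapse(const std::vector<Bone> & bones)
{
    std::vector<Mat3> world;
    world.reserve(bones.size());
    for (std::size_t i = 0; i < bones.size(); ++i)
    {
        const Bone & bone = bones[i];
        if (bone.parent == no_parent)
        {
            world.push_back(bone.local);
        }
        else
        {
            assert(bone.parent < i); // the parent must already be collapsed
            world.push_back(multiply(world[bone.parent], bone.local));
        }
    }
    return world;
}
```

Each bone costs exactly one matrix multiply, no matter how deep it sits in the hierarchy, because its parent's result already contains the whole chain above it.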

These matrices can be extremely complicated but it doesn’t matter because you always end up with 16 numbers in the end that contain all the collapsed information, leaving only transforms that didn’t cancel out. (and you don’t have to figure out what cancels out, it just happens)

My favorite example of this is shadow mapping. The fastest technique for getting accurate shadows is to draw the whole scene twice: Once from the perspective of the camera, and once from the perspective of the light. In both render passes you write the distance from the camera to the rendered surface into a buffer, the so-called z-buffer.

(picture from Wikipedia user -Zeus-, CC BY-SA 3.0)

Once you have a z-buffer for the camera and a z-buffer for the light, you can figure out whether a pixel in the camera-view can see the light or not. Just look in the light’s z-buffer and see if there was a closer surface blocking the light. But how do you figure out which pixel to look at in the other depth buffer? After all it was rendered from a whole different perspective.

Well the camera transform is encoded as a matrix. The perspective settings (field-of-view, aspect ratio, …) are also encoded as a matrix. The light cone is encoded in a matrix. All you have to do is pre-multiply all the relevant matrices for your camera, invert the result, and multiply with the light matrix. The result is a single 4×4 matrix which will transform any point that was rendered from the camera to the corresponding pixel coordinate in the z-buffer that encodes the distance to the light. If the value that you read in the light’s z-buffer is smaller than the z-value after the multiplication, you’re in shadow because there was some other surface that was closer to the light. It’s crazy complicated, but it’s just a 4×4 matrix multiply in the end.

Using this for 2D Rendering

Ok but how do you use this for simple things? The nice thing is that transform matrices are really simple at the core and simple operations remain simple. The main convenience is that you can always operate in a local frame and that you don’t have to care about the rest of the world.

E.g. when drawing the lines in the graph at the beginning of this blog post, you shouldn’t have to do any extra math to figure out where each vertex of the line should end up. A single matrix multiply should be enough. That matrix can contain all the offsets and scaling that you need to account for how big the canvas is, how big the labels are, what the range of your x- and y-values is, etc.

You still need to do the math to calculate those offsets, but the important part is that all the setup functions take in and return matrices. So here is a potential interface for the labels:

fn label_offsets(transform_so_far: &Matrix3x3, x_labels: &[&str], y_labels: &[&str]) -> Matrix3x3 {
    // Find the longest y_label and calculate half the length of the first x_label.
    // The longer of those two is the offset to the left. You may also want to add an
    // offset to the right if the last x_label is long.
    todo!("measure the labels, then return transform_so_far combined with the offsets")
}

Meaning you pass in the matrix that was calculated so far. So for example if you want to draw small-multiples, you can just pass a different matrix for each and nothing else has to change. Then you do whatever logic that you need to do to figure out what the resulting offsets need to be, and you return back out a new matrix that will be used by the rest of the rendering. You can decide what the format of that matrix should be. I would choose it so that you can calculate the positions for the x-labels by multiplying it with $(0, 0, 1)$, $(1, 0, 1)$, $(2, 0, 1)$ etc. and you can calculate the positions for the y-labels by multiplying it with $(0, 0, 1)$, $(0, 1, 1)$, $(0, 2, 1)$ etc. If that is awkward for drawing the lines (because the y-values in this example are in the range [0,100]), you can always add in one more scaling matrix afterwards. Or return two different matrices, one for label-drawing and one for line-drawing. These things are easy once you’re dealing with matrices because you can encapsulate so much information in a single object that can be passed around. (good luck trying to have different return values with custom hacked offset math)

Another thing that’s easy with matrices is scrolling, panning and zooming. Imagine implementing Google Maps, or just something that has scroll bars: You need one matrix to get the area within the page where the map should be rendered (subtracting out any side bars or top bars), and you need a second matrix that contains the current pan/zoom position. You can then multiply those two matrices together to figure out where to draw each tile.
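As a rough C++ sketch of the two-matrix setup: the 200px side bar, the 50px top bar, and the names `map_area` and `pan_zoom` are all invented for this example. It assumes column vectors on the right, so in a product the rightmost matrix is applied first.

```cpp
#include <cassert>

struct Mat3 { double m[3][3]; };
struct Vec3 { double x, y, w; };

inline Mat3 multiply(const Mat3 & a, const Mat3 & b)
{
    Mat3 r{}; // value-initialized: all zeros
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

inline Mat3 translation(double tx, double ty)
{
    return Mat3{{{1.0, 0.0, tx}, {0.0, 1.0, ty}, {0.0, 0.0, 1.0}}};
}

inline Mat3 scaling(double sx, double sy)
{
    return Mat3{{{sx, 0.0, 0.0}, {0.0, sy, 0.0}, {0.0, 0.0, 1.0}}};
}

inline Vec3 apply(const Mat3 & a, Vec3 v)
{
    return Vec3{a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.w,
                a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.w,
                a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.w};
}

// Hypothetical page layout: the map starts right of a 200px side bar
// and below a 50px top bar.
inline Mat3 map_area()
{
    return translation(200.0, 50.0);
}

// Current view state: zoom first, then pan.
inline Mat3 pan_zoom(double zoom, double pan_x, double pan_y)
{
    return multiply(translation(pan_x, pan_y), scaling(zoom, zoom));
}

// Usage: one combined matrix takes map coordinates to page coordinates.
inline void check_tile_position()
{
    const Mat3 map_to_page = multiply(map_area(), pan_zoom(2.0, -1.0, -2.0));
    const Vec3 corner = apply(map_to_page, Vec3{3.0, 4.0, 1.0});
    assert(corner.x == 205.0 && corner.y == 56.0);
}
```

The tile-drawing code only ever sees `map_to_page`. If the side bar changes width or the user zooms, only one of the two inputs changes and the drawing code stays the same.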

Oh and a last trick: If it ever bothers you that the y-axis in the HTML canvas goes down, (e.g. because the y-axis always goes up in graphs like the one above) you can flip that by chaining in a translation matrix and a flip on the y-axis so all your own math can assume that the y-axis goes up. And no other code has to care that there are now two more matrices in the chain because they all just see nine floats.
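A small C++ sketch of that flip, with `flip_y` as a hypothetical helper name and a canvas height of 300 picked purely for illustration. It assumes column vectors on the right, so the scale by -1 is applied before the translation.

```cpp
#include <cassert>

struct Mat3 { double m[3][3]; };
struct Vec3 { double x, y, w; };

inline Mat3 multiply(const Mat3 & a, const Mat3 & b)
{
    Mat3 r{}; // value-initialized: all zeros
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

inline Mat3 translation(double tx, double ty)
{
    return Mat3{{{1.0, 0.0, tx}, {0.0, 1.0, ty}, {0.0, 0.0, 1.0}}};
}

inline Mat3 scaling(double sx, double sy)
{
    return Mat3{{{sx, 0.0, 0.0}, {0.0, sy, 0.0}, {0.0, 0.0, 1.0}}};
}

inline Vec3 apply(const Mat3 & a, Vec3 v)
{
    return Vec3{a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.w,
                a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.w,
                a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.w};
}

// Maps "y goes up" coordinates onto a canvas where y goes down:
// mirror the y-axis, then shift by the canvas height.
inline Mat3 flip_y(double canvas_height)
{
    return multiply(translation(0.0, canvas_height), scaling(1.0, -1.0));
}

// Usage: the bottom of the graph lands at the bottom of the canvas.
inline void check_flip()
{
    const Mat3 flip = flip_y(300.0);
    assert(apply(flip, Vec3{10.0, 0.0, 1.0}).y == 300.0);
    assert(apply(flip, Vec3{10.0, 300.0, 1.0}).y == 0.0);
    // Chaining more transforms still yields a single 3x3 matrix.
    const Mat3 chained = multiply(flip, translation(40.0, 20.0));
    assert(apply(chained, Vec3{0.0, 0.0, 1.0}).y == 280.0);
}
```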

Tricky Parts

There are a few tricky parts with matrices:

When drawing with the HTML canvas you may be tempted to use its transform matrix API. You probably don’t want to do this because it scales everything, including the width of lines and the size of text. Instead you usually want to multiply the coordinates yourself and draw text at normal sizes.
Matrix multiplication is not commutative: $A B \neq B A$. Meaning if you translate first and then rotate, that is different from rotating first then translating. This is a feature, not a bug, but it can be tricky and requires thinking about what you actually want to do.
As a follow up for the above: There are two different ways of doing everything. You either multiply with a row vector on the left, or a column vector on the right. Whichever you choose means that “translate first then rotate” means that the rotation matrix has to be on the right or on the left of the translation matrix. (and the matrices also end up transposed) For some reason the standard way of doing this is backwards so that you have to read right-to-left. If you write your own, you’re free to choose. Oh and to make it more confusing there are two different ways of thinking about it: Are you moving the object, or are you moving the coordinate system? If you’re moving the object then “translate first, then rotate” means the object will orbit around the origin. If you’re moving the reference frame then “translate first, then rotate” means the object will rotate around the given point. (how do you render a planet that should orbit the sun and rotate around its own center? rotate then translate then rotate again) There are long arguments about which one is the sane choice, which I won’t get into here. Just prepare to be confused whenever you switch to a new library/framework because everyone does this differently.
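The ordering issue is easy to see with a quarter turn, which has exact matrix entries. This C++ sketch assumes the column-vector-on-the-right convention and the "moving the object" interpretation, so the rightmost matrix is applied first. `quarter_turn` is a made-up helper that rotates (1, 0) onto (0, 1).

```cpp
#include <cassert>

struct Mat3 { double m[3][3]; };
struct Vec3 { double x, y, w; };

inline Mat3 multiply(const Mat3 & a, const Mat3 & b)
{
    Mat3 r{}; // value-initialized: all zeros
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

inline Mat3 translation(double tx, double ty)
{
    return Mat3{{{1.0, 0.0, tx}, {0.0, 1.0, ty}, {0.0, 0.0, 1.0}}};
}

// Rotation by 90 degrees: (1, 0) goes to (0, 1).
inline Mat3 quarter_turn()
{
    return Mat3{{{0.0, -1.0, 0.0}, {1.0, 0.0, 0.0}, {0.0, 0.0, 1.0}}};
}

inline Vec3 apply(const Mat3 & a, Vec3 v)
{
    return Vec3{a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.w,
                a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.w,
                a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.w};
}

// Usage: the same two matrices give different results in different orders.
inline void check_order_matters()
{
    const Mat3 move = translation(10.0, 0.0);
    const Mat3 turn = quarter_turn();
    const Vec3 point{1.0, 0.0, 1.0};

    // Translate first, then rotate: the point swings around the origin.
    const Vec3 orbit = apply(multiply(turn, move), point);
    assert(orbit.x == 0.0 && orbit.y == 11.0);

    // Rotate first, then translate: the point turns in place, then moves.
    const Vec3 spin = apply(multiply(move, turn), point);
    assert(spin.x == 10.0 && spin.y == 1.0);
}
```

With row vectors on the left, the two `multiply` calls would swap roles, which is exactly the kind of convention difference to check when switching libraries.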

Generalizing

While matrices are probably the right choice to get you started, you can also use different solutions if you learned the right lessons. I’m actually not sure if you want to use matrices for 2D layouts. You may be better off using a custom type that just contains the current bounding-box in integer coordinates plus font size. (though if your matrices are sane, it should be easy to recover the current bounding box: Just multiply the matrix with the coordinates of the four corners) If you do write a custom type, it’s important that you learn the lessons from transformation matrices. Meaning you need the “compression through associativity” trick. You want objects that encapsulate all of the state that you need for layout information, and you want to be able to pass those objects around, and those objects need to always have the same size.

In video games there are many different ways to express transforms. Matrices were the best choice for a long time, but the current hotness is to use geometric algebra or something derived from geometric algebra. (quaternions or dual quaternions) The important parts are still there and they work better for animation: It’s hard to lerp matrices. One nice thing about geometric algebra is that it works in any number of dimensions, so you can also use it in 2D. It’s just much more complicated to explain and matrices are probably the better choice to get started. They will clean up your code a lot.

Two Kids Put Me on a Two Sleep Schedule

Malte Skarupke — Sat, 16 Sep 2023 07:11:49 +0000

I’m writing this at 2am, having already slept from 9pm to 1am. I will sleep again from 4am to 7am. This is apparently called “biphasic sleep”, which I first heard about in the 2022 article The forgotten medieval habit of ‘two sleeps’. I had read that article with interest, because this has happened to me occasionally over the years: I’d fall asleep early, then wake up to spend a few hours restless before finally falling back asleep in the early morning hours. After reading that this used to be normal, I finally decided to just let it happen when it does, which is about two or three times a week.

What really made it happen this often is having one-year-old twins. (almost two years actually, time flies…) After we put the kids to bed at 8pm, I’m often just exhausted. I lie down on the couch to rest a little and fall asleep within 20 minutes. I wake up again between 11pm and 1am. If it’s closer to 11pm I can go back to sleep and sleep through the night. If it’s closer to 1am, it’s impossible to fall asleep again. I’m wide awake.

On other nights I actually just rest for thirty minutes on the couch without falling asleep. I then spend the evening normally before falling asleep later, at 11pm or midnight.

In all cases I wake at 7am because that’s when the kids wake up. (yes, toddlers sleep from 8pm to 7am, plus a two hour nap in the middle of the day)

On “normal” nights my evenings are of uneven quality. Most evenings I’m tired and am not able to get anything done. I might just spend the evening reading or playing games, but I’m unable to program. On biphasic nights I’m wide awake and full of energy. It’s easy to be productive, doing programming, writing (this blog post), or even exercising. In fact it’s tempting to want to stay up too long. I have to force myself to go back to sleep, otherwise the next day is ruined.

On “normal” nights I don’t need an alarm to wake up at 7am. In fact I often wake up a little early, between 6am and 6:30am. On biphasic nights I need an alarm, but still wake up easily without feeling tired. (if I didn’t stretch the nightly wake time too long) I have been a morning person ever since I lived in an apartment with a south-east-facing bedroom where the sun often woke me at 5am. When I lived there I found that I enjoyed waking up early because the morning hours are often higher quality than the evening hours. After I moved from there I didn’t wake up that early any more, but I still woke up easily in the morning without feeling tired.

Now I find a similar quality that I had in the morning hours, except in the middle of the night. If I could choose between one of the two, I’d definitely choose sleeping fully through the night and having more time from 5am to 7am, but having the nighttime hours is also nice. Having two toddlers around means I don’t get much time to myself any more.

If you think this is for you, what do you have to do? Not much. When you feel tired in the evening, just have low lights and let yourself fall asleep. I recommend reading challenging texts or trying to solve challenging puzzles. I seem to fall asleep when I have to re-read a text multiple times or when I’m stuck on a puzzle for a while, looking for new angles. Then, when you wake up in the middle of the night and can’t go back to sleep, just get out of bed. Don’t toss and turn being unable to sleep. Just skip that phase. Then don’t stay up too long, because you do need more sleep in the morning.

What Helps or Hurts

Screens – I don’t think screen use affects whether I stay awake or not. I do make my screen as dark as possible in the evenings, but I also do that on evenings where I stay awake. On biphasic nights I’ll fall asleep equally easily whether I’m reading a book, on a phone, or on a computer.

Caffeine – This definitely affects biphasic sleep, but I’m not sure I have a clear conclusion yet. The main effect is that I have biphasic sleep less often on days when I have caffeine. But if I’m very tired, I might still fall asleep early and then caffeine instead stretches my nightly waking hours. E.g. recently I had very bad (normal) sleep the previous night, and drank matcha to make it through the day. Since I was so tired I still fell asleep before 9pm, but then I was unable to fall asleep at night again, so my second sleep was only from 6am to 7am, which is not enough.

Physical activity – This can definitely keep me awake. If something needs to get done in the evening, even if I’m tired, I won’t have biphasic sleep. But on nights where I fall asleep at 9pm, I’m often not able to do much around the house. The couch has a very strong pull.

Naps – They kill biphasic sleep. On weekends I often nap when the kids are napping. On those days it’s easy to stay up until 11pm. I don’t have that option when I have to work during the week.

The main thing that’s required is the right mindset. I used to fight falling asleep early because I thought it ruined my sleep for the night. I would worry about waking up at odd hours and not getting enough sleep and being tired the next day. I’d think that it’s better to stay awake until 11pm and then sleep the full night than fall asleep early and wake up at night. Now that I know of biphasic sleep, and I have figured out how to have it without being more tired the next day, I just listen to my body when I get tired.

Finally, how do you go back to sleep when you’re wide awake in the middle of the night? I find that after 4am, even if I feel totally awake, all it takes is to lie down and wait. I’ll feel wide awake, unable to sleep, and then the next thing I know it’s 7am and I get woken by my alarm.

Beautiful Branchless Binary Search

Malte Skarupke — Fri, 28 Apr 2023 03:09:49 +0000

I read a blog post by Alex Muscar, “Beautiful Binary Search in D”. It describes a binary search called “Shar’s algorithm”. I’d never heard of it and it’s impossible to google, but looking at the algorithm I couldn’t help but think “this is branchless.” And who knew that there could be a branchless binary search? So I did the work to translate it into an algorithm for C++ iterators, no longer requiring one-based indexing or fixed-size arrays.

In GCC it is more than twice as fast as std::lower_bound, which is already a very high quality binary search. The search loop is simple and the generated assembly is beautiful. I’m astonished that this exists and nobody seems to be using it…

Let’s start with the code:

template<typename It, typename T, typename Cmp>
It branchless_lower_bound(It begin, It end, const T & value, Cmp && compare)
{
    size_t length = end - begin;
    if (length == 0)
        return end;
    size_t step = bit_floor(length);
    if (step != length && compare(begin[step], value))
    {
        length -= step + 1;
        if (length == 0)
            return end;
        step = bit_ceil(length);
        begin = end - step;
    }
    for (step /= 2; step != 0; step /= 2)
    {
        if (compare(begin[step], value))
            begin += step;
    }
    return begin + compare(*begin, value);
}
template<typename It, typename T>
It branchless_lower_bound(It begin, It end, const T & value)
{
    return branchless_lower_bound(begin, end, value, std::less<>{});
}

I said the search loop is simple, but unfortunately the setup in lines 4 to 15 is not. Let’s skip it for now. Most of the work happens in the loop in lines 16 to 20.

Branchless

The loop may not look branchless because I clearly have a loop conditional and an if-statement in the loop body. Let me defend both of these:

The if-statement will be compiled to a CMOV (conditional move) instruction, meaning there is no branch. At least GCC does this. I could not get Clang to make this one branchless, no matter how clever I tried to be. So I decided to not be clever, since that works for GCC. I wish C++ just allowed me to use CMOV directly…
The loop condition is a branch, but it only depends on the length of the array. So it can be predicted very well and we don’t have to worry about it. The linked blog post fully unrolls the loop, which makes this branch go away, but in my benchmarks unrolling was actually slower because the function body became too big to be inlined. So I kept it as is.

Algorithm

So now that I’ve explained that the title refers to the fact that one branch is gone and the other is nearly free and could be removed if we wanted to, how does this actually work?

The important variable is the “step” variable, line 7. We’re going to jump in powers of two. If the array is 64 elements long, it will have the values 64, 32, 16, 8, 4, 2, 1. It gets initialized to the nearest smaller power-of-two of the input length. So if the input is 22 elements long, this will be 16. My compiler doesn’t have the new std::bit_floor function, so I wrote my own to round down to the nearest power of two. This should just be replaced with a call to std::bit_floor once C++20 is more widely supported.
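The hand-written rounding helpers are not shown in the post, so here is a minimal sketch of what they could look like. The names `floor_pow2` and `ceil_pow2` are made up for this example; with C++20 they would be replaced by `std::bit_floor` and `std::bit_ceil` from `<bit>`.

```cpp
#include <cstddef>

// Hypothetical stand-in for std::bit_floor: the largest power of two
// that is <= x. Returns 0 for x == 0.
inline std::size_t floor_pow2(std::size_t x)
{
    std::size_t result = x != 0;
    while (x >>= 1)
        result <<= 1;
    return result;
}

// Hypothetical stand-in for std::bit_ceil: the smallest power of two
// that is >= x. Returns 1 for x == 0. Assumes the result fits in size_t.
inline std::size_t ceil_pow2(std::size_t x)
{
    std::size_t result = 1;
    while (result < x)
        result <<= 1;
    return result;
}
```

For the 22-element example, `floor_pow2(22)` gives the initial step of 16. A loop like this is slower than a count-leading-zeros instruction, but it only runs once per search.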

We’re always going to do steps that are power-of-two sized, but that’s going to be a problem if the input length is not a power of two. So in lines 8 to 15 we check if the middle is less than the search value. If it is, we’re going to search the last elements. Or to make it concrete: If the input is length 22, and that boolean is false, we’ll search the first 16 elements, from index 0 to 15. If that conditional is true, we’ll search the last 8 elements, from index 14 to 21.

input          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
line 8 compare                                       16
when false     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
when true                                      14 15 16 17 18 19 20 21

Yes, that means that indices 14, 15 and 16 get included in the second half even though we already ruled them out with the comparison in line 8, but that’s the price we pay for having a nice loop. We have to round up to a power of two.
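To play with other lengths, the setup can be pulled out into a small function that only reports which window the loop will search. This is a sketch of the same arithmetic as lines 4 to 15, not code from the library; `initial_window`, `floor_pow2` and `ceil_pow2` are names invented for the example.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-ins for std::bit_floor and std::bit_ceil.
inline std::size_t floor_pow2(std::size_t x)
{
    std::size_t result = x != 0;
    while (x >>= 1)
        result <<= 1;
    return result;
}
inline std::size_t ceil_pow2(std::size_t x)
{
    std::size_t result = 1;
    while (result < x)
        result <<= 1;
    return result;
}

// Returns {first index, number of elements} of the window that the search
// loop will look at. probe_is_less is the result of the line 8 comparison,
// compare(begin[step], value); it is ignored when the length is already a
// power of two because the comparison is skipped in that case.
inline std::pair<std::size_t, std::size_t> initial_window(std::size_t length, bool probe_is_less)
{
    typedef std::pair<std::size_t, std::size_t> window;
    if (length == 0)
        return window(0, 0);
    std::size_t step = floor_pow2(length);
    if (step == length || !probe_is_less)
        return window(0, step);
    std::size_t remaining = length - (step + 1);
    if (remaining == 0)
        return window(length, 0); // nothing left to search, result is end
    std::size_t size = ceil_pow2(remaining);
    return window(length - size, size);
}
```

For length 22 this reproduces the picture above: indices 0 to 15 when the comparison is false, and indices 14 to 21 when it is true.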

Performance

How does it perform? It’s incredibly fast in GCC:

Somewhere around 16k elements in the array, it’s actually 3x as fast as std::lower_bound. Later, cache effects start to dominate so the reduced branch misses matter less.

Those spikes for std::lower_bound are on powers of two, where it is somehow much slower. I looked into it a little bit but can’t come up with an easy explanation. The Clang version has the same spikes even though it compiles to very different assembly.

In fact in Clang branchless_lower_bound is slower than std::lower_bound because I couldn’t get it to actually be branchless:

The funny thing is that Clang compiles std::lower_bound to be branchless. So std::lower_bound is faster in Clang than in GCC, and my branchless_lower_bound is not branchless. Not only did the red line move up, the blue line also moved down.

But that means if we compare the Clang version of std::lower_bound against the GCC version of branchless_lower_bound, we can compare two branchless algorithms. Let’s do that:

The branchless version of branchless_lower_bound is faster than the branchless version of std::lower_bound. On the left half of the graph, where the arrays are smaller, it’s 1.5x as fast on average. Why? Mainly because the inner loop is so tight. Here is the assembly for the two:

inner loop of std::lower_bound	inner loop of branchless_lower_bound
loop: mov %rcx,%rsi	loop: lea (%rdx,%rax,4),%rcx
mov %rbx,%rdx	cmp (%rcx),%esi
shr %rdx	cmovg %rcx,%rdx
mov %rdx,%rdi	shr %rax
not %rdi	jne loop
add %rbx,%rdi
cmp %eax,(%rcx,%rdx,4)
lea 0x4(%rcx,%rdx,4),%rcx
cmovge %rsi,%rcx
cmovge %rdx,%rdi
mov %rdi,%rbx
test %rdi,%rdi
jg loop

These are all pretty cheap operations with only a little bit of instruction-level parallelism (each loop iteration depends on the previous, so instructions-per-clock are low for both of these), so we can estimate their cost just by counting them. 13 vs 5 is a big decrease. Specifically two differences matter:

branchless_lower_bound only has to keep track of one pointer instead of two pointers
std::lower_bound has to recompute the size after each iteration. In branchless_lower_bound the size of the next iteration does not depend on the previous iteration

So this is great, except that the comparison function is provided by the user and, if it is much bigger, it can take many more cycles than we do. In that case branchless_lower_bound will be slower than std::lower_bound. Here is binary searching of strings, which gets more expensive once the container gets large:

More Comparisons

Why is it slower for strings? Because this does more comparisons than std::lower_bound. Splitting into powers of two is actually not ideal. For example if the input is the array [0, 1, 2, 3, 4] and we’re looking for the middle, element 2, this behaves pretty badly:

std::lower_bound	branchless_lower_bound
compare at index 2, not less	compare at index 4, not less
compare at index 1, less	compare at index 2, not less
done, found at index 2	compare at index 1, less
	compare at index 1, less
	done, found at index 2

So we’re doing four comparisons here where std::lower_bound only needs two. I picked an example where it’s particularly clumsy, starting far from the middle and comparing the same index twice. It seems like you should be able to clean this up, but when I tried I always ended up making it slower.
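The counts in the table can be checked by wrapping the comparison in a counter. The sketch below copies the search function so that it stands alone; `floor_pow2`, `ceil_pow2`, `comparison_counts` and `count_comparisons` are names made up for this example, and the exact count for std::lower_bound depends on your standard library.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for std::bit_floor and std::bit_ceil.
inline std::size_t floor_pow2(std::size_t x)
{
    std::size_t result = x != 0;
    while (x >>= 1)
        result <<= 1;
    return result;
}
inline std::size_t ceil_pow2(std::size_t x)
{
    std::size_t result = 1;
    while (result < x)
        result <<= 1;
    return result;
}

// The search from the top of the post, using the helpers above.
template<typename It, typename T, typename Cmp>
It branchless_lower_bound(It begin, It end, const T & value, Cmp && compare)
{
    std::size_t length = end - begin;
    if (length == 0)
        return end;
    std::size_t step = floor_pow2(length);
    if (step != length && compare(begin[step], value))
    {
        length -= step + 1;
        if (length == 0)
            return end;
        step = ceil_pow2(length);
        begin = end - step;
    }
    for (step /= 2; step != 0; step /= 2)
    {
        if (compare(begin[step], value))
            begin += step;
    }
    return begin + compare(*begin, value);
}

struct comparison_counts
{
    std::size_t std_count = 0;        // comparisons done by std::lower_bound
    std::size_t branchless_count = 0; // comparisons done by branchless_lower_bound
    std::size_t std_index = 0;        // index returned by std::lower_bound
    std::size_t branchless_index = 0; // index returned by branchless_lower_bound
};

// Runs both searches on the same sorted input and counts comparator calls.
inline comparison_counts count_comparisons(const std::vector<int> & sorted, int needle)
{
    comparison_counts counts;
    auto counting_std = [&counts](int l, int r) { ++counts.std_count; return l < r; };
    auto counting_branchless = [&counts](int l, int r) { ++counts.branchless_count; return l < r; };
    counts.std_index = std::lower_bound(sorted.begin(), sorted.end(), needle, counting_std) - sorted.begin();
    counts.branchless_index = branchless_lower_bound(sorted.begin(), sorted.end(), needle, counting_branchless) - sorted.begin();
    return counts;
}
```

Calling `count_comparisons({0, 1, 2, 3, 4}, 2)` reports four comparisons for branchless_lower_bound, and both searches return index 2.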

But it won’t be too much worse than an ideal binary search. For an array that’s less than 2^n elements big
– an ideal binary search will use n or fewer comparisons
– branchless_lower_bound will use n+1 or fewer comparisons.

Overall it’s worth it: We’re doing more iterations, but we’re doing those extra iterations so much more quickly that it comes out significantly faster in the end. You just need to keep in mind that if your comparison function is expensive, std::lower_bound might be a better choice.

Tracking Down the Source

I said at the beginning that “Shar’s algorithm” is impossible to google. Alex Muscar said he read it in a book written in 1982 by John L Bentley. Luckily that book is available to borrow online from the Internet Archive. Bentley provides the source code and says that he got the idea from Knuth’s “Sorting and Searching”. Knuth did not provide source code. He only sketched out the idea in his book, and says that it came from Leonard E Shar in 1971. I don’t know where Shar wrote up the idea. Maybe he just told it to Knuth.

This is the second time that I came across an algorithm in Knuth’s books that is brilliant and should be used more widely but somehow was forgotten. Maybe I should actually read the book… It’s just really hard to see which ideas are good and which ones aren’t. For example immediately after sketching out Shar’s algorithm, Knuth spends far more time going over a binary search based on the Fibonacci sequence. It’s faster if you can’t quickly divide integers by 2, and instead only have addition and subtraction. So it’s probably useless, but who knows? When reading Knuth’s book, you have to assume that most algorithms are useless, and that the good things have been highlighted by someone already. Luckily for people like me, there seem to still be a few hidden gems.

Code

The code for this is available here. It’s released under the boost license.

Also I’m trying out a donation button. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 (or your local cost for an item on a restaurant menu) for individuals, or $1000 for organizations. (or your local cost of hiring a contractor for a day) But any amount is appreciated:

Make a one-time donation

Thanks! I have no idea how much this is worth to people. Feedback appreciated.

Donate

Fine-grained Locking with Two-Bit Mutexes

Malte Skarupke — Tue, 06 Dec 2022 03:43:16 +0000

Let’s say you want to have a mutex for every item in a list with 10k elements. It feels a bit wasteful to use a std::mutex for each of those elements. In Linux std::mutex is 40 bytes, in Windows it’s 80 bytes. But mutexes don’t need to be that big. You can fit a mutex into two bits, and it’s going to be fast. So we could fit a mutex into the low bits of a pointer. If your container already stores pointers, you might be able to store a mutex for each element with zero space overhead, not even any extra operations during initialization. You’d pay no cost until you use a mutex for the first time.

Let’s start with a mutex that uses one byte. It’s easy now that C++ has added futexes to the standard:

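#include <atomic>
#include <cstdint>

struct one_byte_mutex
{
	void lock()
	{
		// fast path: the mutex was unlocked and now it is locked by us
		if (state.exchange(locked, std::memory_order_acquire) == unlocked)
			return;
		// slow path: always write "sleeper" because there may be other
		// sleeping threads. We own the mutex once we replaced "unlocked"
		while (state.exchange(sleeper, std::memory_order_acquire) != unlocked)
			state.wait(sleeper, std::memory_order_relaxed);
	}
	void unlock()
	{
		// only pay for a wake-up if somebody might be sleeping
		if (state.exchange(unlocked, std::memory_order_release) == sleeper)
			state.notify_one();
	}

private:
	std::atomic<uint8_t> state{ unlocked };

	static constexpr uint8_t unlocked = 0;
	static constexpr uint8_t locked  = 0b01;
	static constexpr uint8_t sleeper = 0b10;
};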

The futex operations in here are the calls to “wait()” and “notify_one()”. These are like simpler versions of condition variables. The “state.wait(sleeper)” call will put the thread to sleep only if state==sleeper. And “notify_one()” will wake one thread that went to sleep on this variable. You can do this on any atomic variable now. There are no special requirements. How can you sleep and notify on anything? The kernel has a hash table to store sleeping threads which is indexed by the address of the variable. So all you need is a unique address. Most operating systems support this, even Windows (who, when they copied this idea from Linux, had the chutzpah to patent futexes). If an operating system doesn’t have this, this is implemented with a global hash table in the standard library.

This mutex is optimized for the common case when it is unlocked and there is no contention. In that case we only have to do one exchange() when locking, and one exchange() when unlocking. It’s a bit tricky to convince yourself that this is correct. I verified the implementation in TLA+. The main trick to keeping it simple is that we only try to set the state to “locked” once. If that doesn’t work, we instead set it to “sleeper”. This is necessary because we don’t know how many threads are sleeping, and unlock() clears both bits. So if there was one sleeping thread, it has to set the “sleeper” flag just in case there are others.

One tricky interleaving is if a new thread comes in and does the initial “exchange()” call at a bad time, clearing the “sleeper” bit and causing an unlocking thread to not call notify_one(). In that case the newly entering thread also sets the sleeper flag, so it will call notify_one() when it unlocks.
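That interleaving is easier to follow when replayed step by step. The sketch below is a single-threaded model, not a test of the real mutex: a plain byte stands in for the atomic, and each helper (all names invented here) performs one of the three exchange() calls and reports what the real code would do next. Threads A, B and C are only labels in the comments.

```cpp
#include <cstdint>
#include <utility>

constexpr std::uint8_t unlocked = 0;
constexpr std::uint8_t locked = 0b01;
constexpr std::uint8_t sleeper = 0b10;

// First exchange in lock(): true means the caller now owns the mutex.
inline bool fast_lock_step(std::uint8_t & state)
{
    return std::exchange(state, locked) == unlocked;
}
// Exchange in the while loop of lock(): true means the caller now owns the
// mutex, false means the caller would go to sleep in wait().
inline bool slow_lock_step(std::uint8_t & state)
{
    return std::exchange(state, sleeper) == unlocked;
}
// Exchange in unlock(): true means notify_one() would be called.
inline bool unlock_step_notifies(std::uint8_t & state)
{
    return std::exchange(state, unlocked) == sleeper;
}

// Replays the interleaving from the text and returns true if every step
// behaved as described.
inline bool replay_tricky_interleaving()
{
    std::uint8_t state = unlocked;
    bool ok = fast_lock_step(state);            // A takes the lock
    ok = ok && !fast_lock_step(state);          // B misses the fast path
    ok = ok && !slow_lock_step(state);          // B sets "sleeper" and sleeps
    ok = ok && !fast_lock_step(state);          // C arrives and overwrites "sleeper" with "locked"
    ok = ok && !unlock_step_notifies(state);    // A unlocks, sees "locked", does not notify
    ok = ok && slow_lock_step(state);           // C gets the lock, but leaves "sleeper" set
    ok = ok && unlock_step_notifies(state);     // C unlocks and notifies, waking B
    ok = ok && slow_lock_step(state);           // B wakes up and gets the lock
    return ok && state == sleeper;
}
```

The missed notification from A is made up for by C, because C could only get the lock through the slow path, which always leaves the “sleeper” flag set.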

Two Bit Mutex

The one_byte_mutex is simple and storing a mutex in one byte is good. Storing it in two bits, say the low bits of a pointer, is trickier. Here is an implementation that does that:

#include <atomic>
#include <cstdint>
#include <type_traits>

template<typename T>
struct pointer_with_mutex
{
	T* get() const
	{
		// mask off the two mutex bits to recover the pointer
		uint64_t masked = pointer.load(std::memory_order_relaxed) & ~both_bits;
		return reinterpret_cast<T*>(masked);
	}
	void set(T* ptr)
	{
		// the two low bits are only free if T is at least four-byte-aligned
		static_assert(std::alignment_of<T>::value >= 4);
		uint64_t as_int = reinterpret_cast<uint64_t>(ptr);
		uint64_t old = pointer.load(std::memory_order_relaxed);
		// replace the pointer bits, keep whatever the mutex bits currently are
		while (!pointer.compare_exchange_weak(old, (old & both_bits) | as_int, std::memory_order_relaxed))
		{
		}
	}

	void lock()
	{
		uint64_t old = pointer.load(std::memory_order_relaxed);
		// fast path: both bits are clear, try to set "locked" once
		if (!(old & both_bits) && pointer.compare_exchange_strong(old, old | locked, std::memory_order_acquire))
			return;
		for(;;)
		{
			if (old & sleeper)
			{
				// "sleeper" is already set, so we can skip the compare_exchange
				pointer.wait(old, std::memory_order_relaxed);
				old = pointer.load(std::memory_order_relaxed);
			}
			else if (pointer.compare_exchange_weak(old, old | sleeper, std::memory_order_acquire))
			{
				// we set "sleeper". If the mutex was unlocked before, it's ours
				if (!(old & both_bits))
					return;
				pointer.wait(old | sleeper, std::memory_order_relaxed);
				old = pointer.load(std::memory_order_relaxed);
			}
		}
	}
	void unlock()
	{
		// clear both mutex bits, leave the pointer bits untouched
		uint64_t old = pointer.fetch_and(~both_bits, std::memory_order_release);
		if (old & sleeper)
			pointer.notify_one();
	}

private:
	std::atomic<uint64_t> pointer{ 0 };

	static constexpr uint64_t locked  = 0b01;
	static constexpr uint64_t sleeper = 0b10;
	static constexpr uint64_t both_bits = locked | sleeper;
};

This one is significantly larger, but it’s a direct translation from the above, just replacing “exchange” with “compare_exchange” to leave the other bits unaffected. Plus some extra conditionals to skip the compare_exchange when we can. (my first attempt was to just replace exchange() with fetch_or(), which would lead to simpler code, but that just uses compare_exchange internally, in a way that was noticeably slower)

The thing to point out is that none of the code depends on which bits we use, or on what the remaining bits are used for. In this case I use them for a pointer, which has to be at least four-byte-aligned, but you could use any two bits for the mutex and store anything in the remaining bits.
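Leaving the mutex out for a moment, the “two free bits” part can be sketched on its own. The helpers below (`pack`, `unpack_pointer`, `unpack_tag` are names invented for this example) store two bits in the low end of a pointer and get both back out. It relies on the same assumption as the code above, that round-tripping a pointer through an integer works, which holds on the usual platforms but is implementation-defined.

```cpp
#include <cstdint>

constexpr std::uintptr_t tag_mask = 0b11;

// Combines a pointer with a two-bit tag. The pointer must be at least
// four-byte-aligned so that its two low bits are known to be zero.
template<typename T>
std::uintptr_t pack(T * ptr, std::uintptr_t tag)
{
    static_assert(alignof(T) >= 4, "need two free low bits");
    return reinterpret_cast<std::uintptr_t>(ptr) | (tag & tag_mask);
}

// Recovers the pointer by masking off the tag bits.
template<typename T>
T * unpack_pointer(std::uintptr_t packed)
{
    return reinterpret_cast<T *>(packed & ~tag_mask);
}

// Recovers the two tag bits.
inline std::uintptr_t unpack_tag(std::uintptr_t packed)
{
    return packed & tag_mask;
}
```

Moving the mask to different bits would work the same way, as long as whatever is stored in the remaining bits never touches them.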

Performance

How does this perform? It’s a bit slow when the mutex is contended, but it’s actually faster than std::mutex for an unlocked mutex. Here are the timings on Windows (compiled with Visual Studio 2019):

	lock/unlock single-threaded	lock/unlock many threads
std::mutex	22ns	50ns
one_byte_mutex	10ns	67ns
pointer_with_mutex	18ns	94ns

Here are the timings on Linux (compiled with clang-12 against libc++-12):

	lock/unlock single-threaded	lock/unlock many threads
std::mutex	12ns	71ns
one_byte_mutex	8ns	228ns
pointer_with_mutex	15ns	255ns

This is running a benchmark where I call lock(); unlock(); in a loop. The first column is running the benchmark with a single thread, so we always hit the fast path where we don’t have to go to sleep. The second column is running with sixteen threads, so the mutex will often be locked. I then divide the length of the benchmark by the number of lock() unlock() calls to get the time for the average call. These are running on the same machine, booted into either Windows 10 or Ubuntu 20.04.
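The single-threaded column can be approximated with a few lines. This is not the benchmark that produced the numbers above, just a sketch of the same idea: time a loop of lock(); unlock(); and divide by the number of iterations. `average_lock_unlock_ns` and `null_mutex_sketch` are made-up names; the second one is a do-nothing lock that gives a rough idea of the loop overhead.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical do-nothing lock, only useful for measuring the loop itself.
struct null_mutex_sketch
{
    void lock() {}
    void unlock() {}
};

// Average wall-clock nanoseconds per lock()/unlock() pair on the calling
// thread. Works with anything that has lock() and unlock().
template<typename Mutex>
double average_lock_unlock_ns(Mutex & mutex, std::size_t iterations)
{
    if (iterations == 0)
        return 0.0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        mutex.lock();
        mutex.unlock();
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() / static_cast<double>(iterations);
}
```

Passing a std::mutex or a one_byte_mutex instead of the do-nothing lock gives numbers comparable to the first column. The contended column needs the same loop running on several threads at once, which this sketch doesn’t do.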

So if you expect that you will usually find your mutex unlocked, these will actually be both faster and smaller for you than std::mutex. But if you have lots of contention, you may have to put a bit more work into these to make them fast. (my first guess would be that this particular benchmark would benefit from spinning for a bit before sleeping, because the critical section is small. My second guess is that these futexes aren’t implemented efficiently in the standard library yet. I didn’t look into either of these guesses)

Four Mutexes per Byte

So if we only need two bits for a mutex, can we store four mutexes in a single byte? Maybe, but you can’t control which mutex you want to wake up. Futexes work by having a hash table with the address of the futex. And you can’t address bits, you can only address bytes. So all four mutexes would have the same address, so they would all be the same futex.

You could probably still make it work by using notify_all() instead of notify_one(), and that might be fine if you expect the mutex to almost always sit idle. Or alternatively you could have a look at the hash table that your standard library uses to implement futexes in userspace, copy it, and change the key to not just be a pointer. But I’ll leave that as an exercise for the reader.
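The bit layout for that is the easy part. As a sketch under the assumption that slot 0 uses the lowest two bits and slot 3 the highest two, the masks could be computed like this (all names are invented for the example, and none of the waiting or waking logic is included):

```cpp
#include <cstdint>

// Mask for the "locked" bit of one of the four slots (slot is 0 to 3).
constexpr std::uint8_t slot_locked(unsigned slot)
{
    return static_cast<std::uint8_t>(0b01u << (2 * slot));
}
// Mask for the "sleeper" bit of one of the four slots.
constexpr std::uint8_t slot_sleeper(unsigned slot)
{
    return static_cast<std::uint8_t>(0b10u << (2 * slot));
}
// Mask for both bits of one slot.
constexpr std::uint8_t slot_both(unsigned slot)
{
    return static_cast<std::uint8_t>(0b11u << (2 * slot));
}
// The value an unlock of one slot would leave behind: its two bits are
// cleared and the other three mutexes are left alone.
constexpr std::uint8_t with_slot_cleared(std::uint8_t state, unsigned slot)
{
    return static_cast<std::uint8_t>(state & ~slot_both(slot));
}
```

The hard part stays the same as described above: a wake-up on the byte can’t be aimed at one slot, so every sleeper on that byte would have to wake up and re-check its own two bits.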

Code

The code for this is available here. It’s released under the boost license.

Finding the “Second Bug” in glibc’s Condition Variable

Malte Skarupke — Sun, 18 Sep 2022 05:20:48 +0000

I continue to have no time for big programming projects, so here is a short blog post. Two years ago I looked into a bug in the glibc implementation of condition variables: Sometimes pthread_cond_signal() doesn’t do anything, which can easily hang your program. The bug is still not fixed, partially because a mitigation patch was available right away that seemed to make it go away. Except that people kept on showing up in the bug report saying that they still hit the bug sometimes, raising the suspicion that there might be a second bug. I finally got around to looking into this. I found that the mitigation patch only helps a little, it’s still the same bug, and the patch I submitted (unreviewed, don’t use yet) would actually fix it.

As I mentioned last time, one of the affected programming languages is Ocaml. Their master lock occasionally doesn’t notify a sleeper because sometimes pthread_cond_signal() doesn’t do anything. And then the whole process hangs forever, because that’s what happens when someone doesn’t get woken up in cooperative multithreading.

Checking this code in TLA+ happens to be easier than the code I was checking last time, because this results in a deadlock, which can be checked quickly. Last time I had to write temporal formulas, which make TLA+ run slowly, but a deadlock is easy to find: Find a state that has no successor states. All TLA+ has to do is enumerate all states. So whenever you can, try to write your TLA+ code so that it causes a deadlock on error.
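To make “find a state that has no successor states” concrete, here is a toy version of that search in C++. It is not how TLC is implemented, just a breadth-first walk over a made-up four-state graph. `find_deadlock` and `toy_successors` are names invented for this sketch, and states are plain integers.

```cpp
#include <functional>
#include <queue>
#include <unordered_set>
#include <vector>

// Breadth-first enumeration of all states reachable from `initial`.
// Returns true and stores the state in `deadlocked` as soon as it finds a
// state without successors. Returns false if every reachable state has one.
inline bool find_deadlock(int initial,
                          const std::function<std::vector<int>(int)> & successors,
                          int & deadlocked)
{
    std::unordered_set<int> seen;
    std::queue<int> frontier;
    seen.insert(initial);
    frontier.push(initial);
    while (!frontier.empty())
    {
        int state = frontier.front();
        frontier.pop();
        std::vector<int> next = successors(state);
        if (next.empty())
        {
            deadlocked = state;
            return true;
        }
        for (int s : next)
        {
            if (seen.insert(s).second)
                frontier.push(s);
        }
    }
    return false;
}

// Made-up example graph: 0 -> {1, 2}, 1 -> {0}, 2 -> {3}, 3 -> {} (stuck).
inline std::vector<int> toy_successors(int state)
{
    switch (state)
    {
    case 0: return { 1, 2 };
    case 1: return { 0 };
    case 2: return { 3 };
    default: return {};
    }
}
```

Each state is visited once and there is nothing to check besides “is the successor list empty”, which is why this kind of check is cheap compared to temporal formulas.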

The Ocaml masterlock can be directly translated into the PlusCal language of TLA+. It just wraps a normal mutex and adds a count of waiters, to make cooperative multithreading slightly more efficient: You can quickly check if anyone actually wants the lock, so you don’t have to give it up if nobody is waiting. For our purposes we don’t even need the yield function, just st_masterlock_acquire() and st_masterlock_release() ended up being enough. Then all we have to do is call those in a loop on multiple threads. Here is the code in PlusCal (it’s calling the mutex and condition-variable functions we wrote last time)

procedure acquire_masterlock()
{
acquire_masterlock_start:
  call lock_mutex();
acquire_masterlock_loop:
  while (busy)
  {
    waiters := waiters + 1;
    call cv_wait();
  acquire_lock_after_wait:
    waiters := waiters - 1;
  };
acquire_masterlock_after_loop:
  busy := TRUE;
  call unlock_mutex();
  return;
}

procedure release_masterlock()
{
release_masterlock_start:
  call lock_mutex();
release_masterlock_after_lock:
  busy := FALSE;
  call unlock_mutex();
release_masterlock_signal:
  call cv_signal();
  return;
};

fair process (Proc \in Procs)
variable num_loops = MaxNumLoops;
{
proc_start:
  while (num_loops > 0)
  {
    num_loops := num_loops - 1;
    either { call acquire_masterlock(); }
    or     { goto proc_done; };
  proc_loop_done:
    call release_masterlock();
  };
proc_done:
  skip;
}

If you clicked on the github link, you will see that this is a direct translation. But the point is that it’s fairly straightforward code. Now we just have to run this with the magic number 3, three Procs and MaxNumLoops=3, and three hours later we have a trace that shows how you can hang even with the mitigation patch. (full reproduction steps at the bottom of this blog post)
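For readers who find PlusCal hard to read, the two procedures above look roughly like this in C++. This is a sketch that mirrors the PlusCal, not the OCaml runtime’s actual C code; `masterlock_sketch` and its member names are made up here.

```cpp
#include <condition_variable>
#include <mutex>

struct masterlock_sketch
{
    // Mirrors acquire_masterlock(): wait until nobody holds the lock,
    // counting ourselves as a waiter while we sleep.
    void acquire()
    {
        std::unique_lock<std::mutex> guard(mutex);
        while (busy)
        {
            ++waiters;
            is_free.wait(guard);
            --waiters;
        }
        busy = true;
    }
    // Mirrors release_masterlock(): clear the flag under the mutex, then
    // signal after unlocking. This signal is the call that sometimes does
    // nothing when the glibc bug is hit.
    void release()
    {
        {
            std::lock_guard<std::mutex> guard(mutex);
            busy = false;
        }
        is_free.notify_one();
    }
    bool is_busy()
    {
        std::lock_guard<std::mutex> guard(mutex);
        return busy;
    }
    int num_waiters()
    {
        std::lock_guard<std::mutex> guard(mutex);
        return waiters;
    }

private:
    std::mutex mutex;
    std::condition_variable is_free;
    bool busy = false;
    int waiters = 0;
};
```

The waiter count is what lets a cooperative scheduler check cheaply whether giving up the lock is worth it. If the notify_one() in release() is lost, a thread stuck in acquire() never wakes up, which is the hang described above.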

One reason why I didn’t find this problem last time is that the trace is even more complicated. The mitigation patch does help for smaller traces, with fewer threads or smaller numbers for MaxNumLoops. To hit the bug you need a very specific long interleaving. This will be gibberish unless you have spent many hours studying the glibc condition variable code:

1. Find the magic interleaving that makes you hit the “potential steal” code path in pthread_cond_wait(), discussed last time
2. Run the pthread_cond_broadcast() from the mitigation patch while no other thread is waiting, so that it will early-out without doing anything
3. Then signal again and trigger a call to quiesce_and_switch_g1()
4. At the same time as step 3 have a thread wait, consuming the leftover signal from the “potential steal” in step 1, then wait again
5. Now have step 3 finish with g_size=2 (because of the two waits) for the new g1, then signal after the switch, leaving g_size=1
6. Then have step 4 finish the second wait, consuming the signal from step 5

You’re now left with g_size=1 for the new g1 even though nobody is waiting on it. The next call to pthread_cond_signal() just reduces g_size to 0 without waking anyone. Any signal after that will work again.

This is still really complicated for me, so I’m not sure exactly where the problem is, but it is suspicious that the pthread_cond_broadcast() in my trace would just early-out because nobody was waiting, so that the mitigation patch doesn’t result in any change to the state. If that can happen, TLA+ just had to find an interleaving where the lingering signal from the “potential steal” causes problems later.

Why didn’t I find this last time? Because I stopped looking after I had found the initial problem, and after the mitigation patch made the problem go away with my trace. Turns out I needed to run an even bigger job to find the problem with the mitigation patch.

So now the state of the glibc condition variable is

It mostly works except occasionally a thread can take too long to wake up, which will cause it to steal a signal from a later wakeup. (and then the later thread doesn’t wake)
To fix that the code tries to detect whether we potentially stole a signal. But the response to that can leave the condition variable in a broken state, so that a later pthread_cond_signal() will signal on the wrong futex, causing sleepers to not be woken.
To fix that there is a mitigation patch which various distributions are running. But that mitigation patch has two issues:
- it doesn’t work for the reasons that the author thinks it does (discussed last time)
- it sometimes doesn’t do anything, which allows the broken state to cause problems later. (discovered in this blog post)

Both of these fixes make the issue less likely to happen, but also make it harder to understand (and debug). What’s the solution? I submitted a patch for this already a year ago: Fix the start of the chain of errors by broadening the scope of grefs, so that no waiter can ever be missed when we close a group. Then the “potential stealing” code path becomes unnecessary, and the bug in that path goes away, and the mitigation patch becomes unnecessary. The implementation also becomes simpler to reason about: My “patch” is actually a series of six patches in which the first one fixes the bug and the remaining five just clean up code. (but a word of warning: this patch is unreviewed and I wouldn’t be at all surprised if there is still a problem with it. This is tricky code. See comment by Ricardo below)

If anyone has specific steps I can take to get this into glibc, I will try to do them. Last time I gave up a bit too easily because I got burned out on this problem after having spent way too many hours debugging it. Actually getting the fix in was too much extra work as a project to do in my spare time. But it’s been two years, so my burnout on this particular problem is gone and I’m game. As long as there are steps that actually work to get this in.

Appendix: Detailed Steps to Reproduce

1. Download TLA+ https://github.com/tlaplus/tlaplus/releases/tag/v1.7.1#latest-tla-files
2. Run the toolbox
3. Click “File -> Open Spec -> Add New Spec” then select ocaml_mutex.tla
4. Click “TLC Model Checker -> New Model”, click OK
5. In the model under
- What is the behavior spec: leave as default “Spec”
- What is the model?
  - UseSpinlockForCVMutex <- TRUE
  - AlwaysSetGrefs <- FALSE (this flag controls whether my patch should be used)
  - Procs <- {P0, P1, P2}, also select “Set of model values” and “Symmetry set”
  - UseSpinlockForMutex <- TRUE
  - UsePatch <- TRUE (this flag controls whether the mitigation patch should be used)
  - MaxNumLoops <- 3
- What to check?
  - Check “Deadlock” (should be checked by default)
  - Invariants: leave empty
  - Properties: leave empty
6. Click “TLC Model Checker -> Run Model”
7. Watch the progress in the “Model Checking Results” column. Once per minute it should populate the table “Time, Diameter, States Found, Distinct States, Queue Size.” After ~2.5 hours, when “Diameter” reaches 342, a new panel should open up titled “TLC Errors”, and it should say “Deadlock reached.” in the first window.
8. Now you have 341 states to look at in the Error-Trace, in full detail. The “pc” variable shows you which line each process was on for every state. All other variables are also visible. They are a pretty direct translation from the C code. It should be possible to figure out by having the C code and TLA+ code side-by-side to see the minor differences. I uploaded a table with only the relevant information here. (this was from a run with slightly different source code, with the AlwaysSetGrefs commented out. This means your trace will be slightly different and the line numbers of the spreadsheet won’t align if you try to add more information. You’ll need to create your own by exporting the trace)

(optional step before 6: Click on “Additional TLC Options” and increase “Number of worker threads” for your machine, also “Fraction of physical memory allocated to TLC” and set “Profiling” to “Off”, this should speed up the checking)

Sudoku Variants as Playful Proof Practice

Malte Skarupke — Mon, 13 Jun 2022 01:01:34 +0000

Doing mathematical proofs is kinda fun. Unfortunately they only make you do a few fun ones in school, then they get frustrating and tedious. So I have long been looking for a game that is about doing mathematical proofs. Euclidea was good, but eventually runs into the same problem as the hard proofs you do in school, so I never finished the game. But recently a lot of hard Sudoku variants have come along that feel exactly like doing a mathematical proof, but are designed to be fun.

The Sudoku world is currently going through an explosion of creativity and innovation, something which I have called a “Treasure Hunting System” before. It’s quite joyful to watch, especially since I never really got into Sudokus before. I found that when Sudokus get hard, they get more tedious instead of getting more interesting. They’re only fun until you get good enough to attempt the tedious ones. At least that’s what I thought until Youtube kept on recommending the Cracking the Cryptic channel, which currently features mostly Sudoku variants, and those are much more interesting.

Like many people I first came across their channel when they featured the “Miracle Sudoku” with only two given digits. It is a variant with the following rules:

Normal Sudoku rules apply. Any two cells separated by a knight’s move or a king’s move (in chess) cannot contain the same digit. Any two orthogonally adjacent cells cannot contain consecutive digits.

These are very restrictive rules. I recommend that you try to solve the puzzle here, or watch the video below. Or watch the first 4:40 minutes of the video up to the point where Simon says “If you want to have a go, click on the link” because he does a good job of introducing the puzzle.

Now if you’re like me, you will be pleasantly surprised by this puzzle, and then end slightly disappointed. To me it felt almost like a cheap trick. In hindsight it’s obvious that if you change the rules this much, it only takes two given digits to fill the board. There are probably only a few valid ways to set up a board with these rules, so you can only do this puzzle once, and there are no other interesting puzzles with these rules. And it’s disappointingly easy after the initial setup seemed to promise a challenge.

So I ignored the channel until years later, when some other blogger wrote about them and I found that there were much more interesting puzzles. For example this one:

This is a variant with the following rules:

Normal sudoku rules apply. The grey line in the central box is a thermometer. Digits along that thermometer must increase from the bulb to the tip, but the positions of the bulb and the tip are hidden. In the outer three rings, the sum of the digits along a ring between two 9s must always be the same throughout the puzzle. For the avoidance of doubt, those two 9s could be the same 9 if a ring has only a single 9 in it.

The “thermometer” wording is unnecessarily terse and confusing, but “thermometers” are standard building blocks in Sudoku variants. It just means that the ring in the inner box either has increasing numbers clockwise or counter-clockwise, and it won’t tell you where it wraps around. The constraint with the 9s is interesting and requires doing some proofs. So let’s do the start of the puzzle in this blog post. You can try it yourself here, you’ll just have to know one advanced technique which you will have no chance of deriving yourself: The Phistomefel Ring, so let’s explain that first, because it also involves proving something.

Phistomefel Ring

Let’s take a random Sudoku, say this one:

The contents don’t really matter, we’ll just use this board to explain the ring. The thing we’ll prove is that in the picture below the 16 digits that make up the blue ring are identical to the 16 digits that make up the orange squares:

This is true in any Sudoku, not just this one. To prove it let’s just highlight the top two rows and the bottom two rows:

These are 36 digits consisting of the numbers from 1 to 9, four times each. This is just given by the rules of Sudoku. Now let’s instead highlight column 3, column 7, box 2 and box 8:

This highlight also contains 36 digits, also the numbers from 1 to 9, four times each. (twice in the two columns and twice in the two boxes)

So the blue area and the orange area contain the same digits. This is not very interesting until we overlap them:

Now we have a shaded area that both of these regions share. The area contains a 7 and an 8 in this case, but also 14 unknown digits. It doesn’t really matter what’s in there though, we know it’s the same digits in both the orange set and the blue set. So if we just remove them:

We’re left with the ring and the corner squares. And since we removed the same digits from both sets, and both sets started off with the same digits, we know that whatever is left over must still be identical. q.e.d.
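If you’d rather see it on a concrete board than trust the proof, it’s easy to check in code. The sketch below builds one valid solved Sudoku with a simple row-shifting pattern (any valid grid would do) and compares the 16 ring digits with the 16 corner digits. `make_shifted_grid` and `ring_matches_corners` are names made up for this example, and rows and columns are zero-based.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using grid = std::array<std::array<int, 9>, 9>;

// Builds one valid solved Sudoku: every row is the first row shifted by
// 3 within a band, and by 1 more for each new band.
inline grid make_shifted_grid()
{
    grid g{};
    for (int r = 0; r < 9; ++r)
    {
        for (int c = 0; c < 9; ++c)
            g[r][c] = (3 * (r % 3) + r / 3 + c) % 9 + 1;
    }
    return g;
}

// Collects the 16 cells around the central box (the ring) and the four
// 2x2 corner squares, and checks that they hold the same digits.
inline bool ring_matches_corners(const grid & g)
{
    std::vector<int> ring;
    std::vector<int> corners;
    for (int r = 0; r < 9; ++r)
    {
        for (int c = 0; c < 9; ++c)
        {
            bool in_5x5 = r >= 2 && r <= 6 && c >= 2 && c <= 6;
            bool in_center_box = r >= 3 && r <= 5 && c >= 3 && c <= 5;
            bool corner_row = r <= 1 || r >= 7;
            bool corner_col = c <= 1 || c >= 7;
            if (in_5x5 && !in_center_box)
                ring.push_back(g[r][c]);
            if (corner_row && corner_col)
                corners.push_back(g[r][c]);
        }
    }
    std::sort(ring.begin(), ring.end());
    std::sort(corners.begin(), corners.end());
    return ring.size() == 16 && ring == corners;
}
```

For the generated board both sets come out as two each of 1, 2, 4, 5, 7, 8 and 9, plus a single 3 and a single 6.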

Starting on Archery Target

Let’s get back to the Archery Target puzzle. You’ll notice that one of the rings in there is the Phistomefel ring. It won’t be useful right away, but eventually you’ll need to use that fact. If you know the Sudoku rules, you now know everything you need to solve it. You can give it a go here. It took me several days to finish.

So how do we solve this? We have too few given digits so the extra rules must be helpful somehow. If you play around with this a little, you’ll find that the extra rule for the middle ring is useless at first. So let’s investigate the 9s. I started off by just randomly placing them to get a feeling for how this behaves:

Just counting the areas between nines, I count 8 areas. This makes sense because the outer three rings cover 8 boxes, so there must be 8 nines. I also tried the extreme case: If I have just one nine in a ring, there are still 8 areas between them:

I placed green lines here. There are 8 green lines because there are 8 areas between the nines. The digits along each of these green lines must add to the same number. Just playing around, I noticed that this is quite restrictive: This arrangement almost certainly doesn’t work because the line in the top left is too short. How could four numbers add up to the same sum as the long line in the middle ring?

Doing some math we find that the outer rings cover 8 boxes, each of which contains the numbers from 1 to 9, which add up to 45, so the overall sum is 8*45=360. Subtracting the nines, we’re left with 288, dividing by 8 gives us 36. So each of these lines must add up to 36. Which is the sum of the numbers from 1 to 8. With this we can already rule out a few placements. For example there can’t be a 9 in the bottom right corner:

If there was a 9 there, we’d have the digits from 1 to 8 on its left, which add up to 36, and then we’d get another 3 around the corner, adding up to 39, before we can place another 9. So this wouldn’t work.

If we try placing a 9 in any of the other corners, we run into a different impossible situation:

The 9 in the top right forces the other two 9s around the corner. This leaves too big a gap between those two 9s. We can’t make that add up to 36, it’s always going to be bigger. So we can rule out all four corners. Let’s mark them red to indicate that there won’t be a 9 in there:

This doesn’t help us much though, and it’s actually a detour on the ideal path for the proof, but I find that playing around and ruling out a few cases helps to become familiar with the problem. It’s worthwhile even if it turns out that ultimately we didn’t need to do it. It’s also necessary because when doing a difficult proof, you have to continually renew your interest in the problem. If you’re just staring at a mostly empty grid, that’s hard, but small wins like this help to maintain interest.

The actual way to work towards the solution is to try to figure out how many nines go on each ring. I start off by pencil-marking the numbers 1234 onto each ring, as our possible options for how many nines can be in each ring. Usually pencil-marking is used to indicate possible digit-values for a cell, but I just want to keep track of how many nines are still possible on each ring. As we eliminate options, we’ll remove numbers:

The easiest options to remove are the 1 options. It’s easy to see that we can’t place a single 9 on the outer ring or the middle ring because the sum for that 9 would be too big. We could in theory place digits that add up to less than 36 on the inner ring, but the 7 that’s already placed messes it up:

So with a single 9, we’d always be over 36 because of that 7. So there must be at least two 9s in each ring, which we’ll remember by eliminating 1 from our options in each ring:

The next-easiest options are on the outer ring. We can pretty easily rule out that there are two or three 9s on that outer ring. The sums always end up being too big because we have to use big digits on the outer ring. So there have to be four 9s on the outer ring:

Now after this we can eliminate the 3s from the other rings because our total has to be 8. So if we have 4 on the outer ring, we can only have 3 on another ring if we had 1 on the last ring, and we already ruled that out:
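That elimination is small enough to brute-force. Here is a sketch that assumes exactly the state reached so far: eight 9s in total, four of them on the outer ring, and 2, 3 or 4 still pencil-marked on the other two rings. The dictionary layout is just my own bookkeeping.

```python
from itertools import product

TOTAL_NINES = 8

# Remaining pencil marks: how many 9s each ring could still hold.
options = {
    "outer": [4],          # two and three were ruled out above
    "middle": [2, 3, 4],   # one was ruled out earlier
    "inner": [2, 3, 4],
}

# Keep only the combinations that use all eight 9s.
valid = [
    combo
    for combo in product(options["outer"], options["middle"], options["inner"])
    if sum(combo) == TOTAL_NINES
]

print(valid)
```

No surviving combination contains a 3, which matches the reasoning above: a 3 on one ring would force a 1 on the other.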

At this point I want to rule out the 4s on the inner two rings, but thinking about that actually reveals an exciting option: What if there was a 0 on any of the rings? In that case the rules change. We would no longer completely cover 8 boxes, so our overall sum is different, so the areas between 9s don’t have to add up to 36. So let’s add that option:

How would we prove this case? If we could eliminate the 2 from either of the two rings, that would also eliminate the 2 from the other ring, so there would have to be two 4s and a 0. We can do that on the middle ring. If we tried to place the smallest possible digits there, we find that we can never put digits small enough that would add to 36. The smallest possible digits, after placing two 9s, are the digits from 1 to 6 twice and the digits from 1 to 5 twice, which add up to 21+21+15+15=72=2*36. This seems to barely work, but the pre-placed 3 messes up that plan because it rules out 3 from one row. We have to use a larger digit at least once, so our sum will be too big.
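Here is a sketch of that "smallest possible digits" argument in Python. The helper `smallest_sum` is hypothetical, and the code does not know which of the four runs the pre-placed 3 actually affects, so it simply tries each one in turn.

```python
def smallest_sum(count, banned=()):
    """Sum of the `count` smallest distinct digits from 1 to 8, skipping banned ones."""
    allowed = [d for d in range(1, 9) if d not in banned]
    return sum(allowed[:count])

TARGET = 36

# With two 9s on the middle ring, the remaining cells form two runs of six
# distinct digits and two runs of five distinct digits.
run_lengths = [6, 6, 5, 5]

# Best case: every run uses its smallest possible digits.
best_case = sum(smallest_sum(n) for n in run_lengths)

# Now forbid the digit 3 in one run at a time and recompute the minimum.
with_three_blocked = [
    sum(
        smallest_sum(n, banned=(3,) if i == blocked else ())
        for i, n in enumerate(run_lengths)
    )
    for blocked in range(len(run_lengths))
]

print(best_case, 2 * TARGET, with_three_blocked)
```

The best case has zero slack, so whichever run loses its 3, the minimum total already exceeds 2*36.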

So that means there have to be zero nines on one of the rings, which will change what the sum between 9s has to be:

I will stop here. This should give a feeling for how you attack this Sudoku. There is a lot of proving left to do, even after you figure out which ring has zero nines on it.

Is this a real mathematical proof?

This is similar enough to a real mathematical proof that we can practice all of our techniques in a playful environment. Notice how nobody tells you what you have to prove. Your first goal is to just figure out how to even attack this puzzle. You have to figure out which approach gets you stuck and which approach allows you to make progress. It took me a while to understand that the important part is how many 9s are on each ring. And then it took me a while to understand that I have to prove that one of the rings has no 9 on it. And then nobody tells you how to do it. The rules of Sudoku constrain what you can do, and you can do some of the approaches that you would try in real proofs, like “try the smallest possible thing” or “try the largest possible thing” but you mostly have to do a lot of thinking, which is very unlike the normal Sudoku gameplay of “try to spot the number you missed.”

You can do all of the classic techniques for this like trying to first solve a simpler subproblem, or first try to solve a more general problem (like the Phistomefel Ring, which will become relevant, or you can see Simon try to think if he knows a general proof that each ring has to contain a 9, in the video below) and you can do that in a setup that’s been designed to be helpful (notice how the given digits have added just the right constraint in crucial moments) and interesting to work in.

My solve was noticeably different than Simon’s solve on the YouTube channel. He goes through the puzzle at a crazy high speed, solving it in just over 90 minutes. I don’t know how long it took me, but five to ten times that long is a good guess:

There’s More

Puzzles like this are very common on the channel at the moment. It seems like Simon is playing more puzzles like this, that require interesting proofs, and Mark is playing more puzzles that require advanced Sudoku techniques. So I’ve been playing more of the puzzles that Simon does.

Here are three more that I enjoyed:

Play it here

I’m sure I’m missing some really good ones, but I don’t have time to play most of the puzzles on the channel, especially since I’m so much slower than Simon is… But this beats playing randomly generated Slitherlinks (which is what I’d do before this) by a lot. If you know of some good puzzles that are like this, please share them because I don’t have time to catch up with the channel and necessarily have to miss most puzzles…

And just as I had this blog post typed up and ready to go, I noticed that today’s video is called “Do You Need Theorems To Be A Sudoku Expert?”, so I guess I’ll try that one next:

Reasons why Babies Cry in the First Three Months, How to Tell The Cries Apart, and What to Do

Malte Skarupke — Sat, 19 Feb 2022 21:08:26 +0000

I have twin daughters that just turned three months old. I decided to write up the list that I wish I had before they were born. Just reasons why they cry, how to tell the cries apart, and what to do in each case. Yes it’s hard to help babies because they cry for everything, but you can definitely tell the difference between an angry cry (“feed me”) or a pained cry (“I scratched myself”) or a sad cry. (“why did you wake me up?”) You can’t fix em all, but you can do a good amount.

This is just based on my experience of getting twins through three months. Every baby is different, but hopefully this still helps.

1. Hunger

The most obvious one. Babies get hungry a lot.

How to tell if it’s this: You will hear this a lot and will know it well. It escalates quickly if not addressed. Also they often give signs ahead of time, like movement of the lips and tongue. The bigger difficulty is knowing when it’s not hunger. At some point you’ll be able to say “this sounds weird, not like hunger” but I also recommend installing one of those baby tracker apps on your phone. (pen and paper works fine, too) We stopped tracking most things pretty quickly, but you have to keep track of when they last ate. If it’s been less than an hour, then maybe they didn’t have enough and it’s hunger. If it’s been two hours, then it’s probably not hunger. If it’s three and a half hours, it’s probably hunger. Also just pick the baby up, and if they try sucking on your arm, they’re hungry. (sometimes they’re not hungry even after three hours)

What to do: Feed them. Sounds easy, but at first you will spend a ridiculous amount of time on this. Our kids easily got distracted while eating and then got mad at you if you didn’t continue feeding after the distraction. Meaning this could happen: they drink, then take a break because they need to be burped, then drink a little more, then take a pause to poop in their diaper, then drink a little more, then get twitchy because they want the diaper changed, then drink a little more, and finally they’re happy. Early on it took us two hours to feed both twins. (while my example sounds extreme, we had few feedings where there wasn’t at least one interruption and they wanted more after)

The amount they eat changes a lot, especially in the first week. In the first couple days after birth, babies eat almost nothing and lose lots of weight. This is normal. When we came home from the hospital all the bottles we had at home seemed ridiculously large. The kids would drink ~10ml per feeding, but the lowest line on the bottle for measuring was at 30ml. But then the next day they ate twice as much, 20ml, then it doubled again the next day to 40ml, and then it increased again to 50ml. After that it increases more slowly, but it’s always increasing and you have to always adjust how much you think they should eat. (as I’m writing this the minimum they’ll eat is ~120ml, the maximum ~220ml. If they push the bottle out at less than 120, I know to wait five minutes and they’ll be hungry again)

2. Physical Discomfort

In the hospital one nurse told us that newborns really don’t want much. They just want to eat, sleep and poop. And be comfortable, which is a whole category in itself.

2.1 Dirty Diaper

No surprises here. You wouldn’t like wearing wet/dirty underwear either.

How to tell if it’s this: This cry starts light and escalates slowly. Often they just fuss a bit, going “eh, eh, eh”. Check the diaper. Don’t always trust the indicator. The indicator goes blue when they pee. If they only pooped, you can’t tell from the outside.

What to do: Change the diaper. When they’re very little you have to do this a lot. At the peak we went through roughly 25 diapers a day for our twins. Friends tell us similar numbers of 12 diapers per day for one kid. Another friend had a peak day of 20 diapers for one kid in one day. We probably overdid it, but at least if they have a clean diaper and they’re still crying, you can be sure that the problem is something else.

2.2. Pain

The most common reason for this is when they scratch themselves, or when something is wrong with the outfit. (like they somehow got both legs on one side or the diaper is on too tight)

How to tell if it’s this: A pain cry is super obvious. It sounds very different from other crying. It’s very clear they’re in pain. You’ll definitely hear this early on, like when they draw blood at the hospital to test for jaundice and other problems.

What to do: Usually you can address the source of the pain: Wrap their hands so they don’t scratch themselves, etc. Obviously go see a doctor if there is no clear reason. (though also read the point about gas below)

2.3 Too Hot / Too Cold

This one scared me a lot at first because how do you know how many layers they should wear when it’s freezing out? They’ll let you know if you get it wrong.

How to tell if it’s this: This one is a less intense version of the pain cry. It’s usually clear based on the context, you just have to remember that this could be the problem. E.g. we would dress the kids for freezing weather, then put them in a car seat in a car, and as soon as the car heats up, they’d get too hot and start crying.

What to do: Add/remove layers. If this was the problem, they will calm down very quickly once they are warmer/colder. The other point where this comes up is when giving them a bath. What I consider a nice hot bath is slightly too warm for a baby. But only slightly. They just want comfortable warm. (or get a bath thermometer)

2.4 Bright Lights at Night

Babies don’t like it. Took me too long to figure out. At a certain point in the evening you just want rooms to be somewhat dark.

How to tell if it’s this: I had a crying baby, picked her up, she was still crying, walked into a different room and she stopped. Turned the light on and she complained again. Easy.

What to do: Get smaller lights so you can control the brightness by turning most lights off.

3. Internal Issues

These are more difficult to deal with. You can’t always help them when it’s internal, but surprisingly often you can. One warning: This chapter contains cursed knowledge. There are words like “snot sucker” in here that you might not want to know about, but once you do know about them you can no longer ignore the crying of your baby. Where an innocent parent can just feel bad for their children and try to comfort them, you will know that there is this weird thing you could try that would help, and that knowledge won’t let you be…

3.1 Stuffy Nose

A booger can cause them to have trouble breathing, which is very upsetting for them.

How to tell if it’s this: If their nose is stuffy while they are crying, that’s a good hint, but if they didn’t have clear breathing before the crying, then it’s definitely this. In either case you can always try the snot sucker, it doesn’t hurt.

What to do: Saline spray and a snot sucker. Yes, you do weird things with babies. They can’t blow their own nose. Just spray saline in their nose, this will get them to sneeze which either ejects the snot right away, or at least moves it close to the surface where you can see it. Then you use the snot sucker.

3.2 Needing to Poop

Newborns often have trouble getting their poop out. They’re actually great at holding it in. It can be really obvious and the fix is easy, but somehow nobody tells you this… (or any of the other advice in this section…)

How to tell if it’s this: At least with our twins, it was very obvious. They’re clearly trying to push something out. Sometimes they’re grunting. Also, when they started sleeping a little longer at night, they almost always had to go in the morning. You’d always get this type of crying at some point between 5 and 8am.

What to do: Hold them over the sink. Take the diaper off, hands under the legs or butt, just hold them and wait. (the linked video shows holding the legs, but sometimes they like kicking while you hold them, and then it’s better to have the hands under the butt to leave the legs free) The first time I tried this I had a crying baby on my hand, the second time she was really calm and then the poop that she’d been struggling with for an hour came out easily. This was at 1 week old. It worked amazingly well. (she was still crying on the changing table both before and after, like they often did at that age, but while holding her over the sink she was very calm) It’s weird, but no more weird than the snot sucker. Apparently it’s just the normal thing to do with babies in some countries. The other thing that’s worked for friends of ours is a foot massage. Search Youtube, but I haven’t had luck with this myself, so no video recommendations.

3.3 Needing to Fart

You’d think this one is the same as needing to poop, but it’s completely different. I don’t know why. If it’s just a little fart, then they won’t cry. But if they’re really gassy, they will let you know.

How to tell if it’s this: This one sounds like they’re in pain. At least a little bit. The surest sign is when they cry, fart, and cry more. There are usually four farts or more behind that that you have to get out for them to calm down.

What to do: It took us way too long to realize that when they cry in pain, gas can be the problem. So I don’t have a fully established solution. If you try the “hold them over the sink” solution from the poop section, they’ll just be in pain as they try to get the gas out. Bicycle kicks and hip rotations worked sometimes, and some people say tummy massages are great. Unfortunately our kids resist the bicycle kicks once they start crying. Gas seems to be a really common problem so there are lots of products for sale that are supposed to help with this: Gas-X for babies, Gripe Water, Probiotics, tubes that you insert in the butt… I don’t like the Windi tube, but it works and I recommend using it once. It’s a hollow tube that you lubricate and then stick in their butt, and then the air has no choice but to come out. The reason why I recommend using it once is that you have to give a belly massage while doing it, which pushes the remaining air out, and it’s the only time you hear immediate results from a belly massage. So after that experience I got better at belly massages… Besides that my mom swears by fennel tea, and in the US Gripe Water seems to be the closest thing. I had one good experience with gripe water, but often it doesn’t seem to do anything. As I said, still trying to come up with a good solution since we realized this one so late.

3.4 Needing to Burp

If you didn’t burp enough after feeding, sometimes the baby will cry later.

How to tell if it’s this: There are two different sounds. If it’s not too bad, it sounds like they got something stuck in their throat. Not something big, it’s just a cry that’s a little raspy, where an adult would clear their throat and be done with it. The other one is if they managed to go to sleep, they will wake up with a loud cry. Which scared us a lot. Now we’re calm when that happens because we know it’s likely to just be a burp.

What to do: Pick them up and hold them vertical. High on your shoulder so there is some pressure on their chest but they have to hold their own head up. (you can’t do this when they’re too small) Patting the back is supposed to work but seems unreliable for us, even if we pat them really hard. Walking around with them or bouncing seems to help. When they were very young you could massage the lower part of their back, the soft spot between the ribs and the hips, and they would burp almost immediately. That stopped working once they got muscles there, but some people say that rubbing circles in that same area works…

3.5 Reflux

When you put a baby down too soon after eating, or just move it too much, sometimes the milk comes back up.

How to tell if it’s this: This only happens less than thirty minutes after feeding. They were calm, you put them down, they immediately go to a slightly pained cry. Sometimes followed by throwing up a lot of milk.

What to do: Give them more time upright or at an angle after feeding. After we saw this once or twice, we realized what to do and we would pick them up quickly enough again that they didn’t throw up.

4. Energy / Emotion

At first babies are really easy. They just want to sleep and are only awake long enough to eat and poop. But then they get more complicated.

4.1 Needing to Sleep

This one is weird. The kid needs to sleep but instead of calming down, they start to cry.

How to tell if it’s this: This happens once they stay awake longer. If they stay awake too long they get irritated very easily. The other thing that happened to us somewhat regularly with our kids is this: They’re tired, falling asleep in our arms, we put them to bed, but the act of doing that makes them become more awake. They get a bit fussy but only enough where we think they’ll soon calm down again. Then suddenly they get really angry and cry loudly. This goes on for a short time, say fifteen seconds, then they immediately fall into deep sleep. It’s almost as if they’re resisting the falling asleep.

What to do: Put them to bed and give them a minute. If it seems like they should be tired enough to fall asleep, then just watch a little. It sucks, but usually you don’t have to wait long to see which direction the crying goes. If the crying becomes less loud after a minute, give them another minute. If it doesn’t, it could be a different problem. This is not the “cry it out” sleep training. Really just wait a short time to see how it develops. As long as the trend is in the right direction, you can just let them be and it’ll end.

4.2 Scary New Situation

This happens for example the first few times when they take a bath, or when you undress them at the doctor.

How to tell if it’s this: It’s obvious from their facial expression. They are clearly concerned/scared.

What to do: Hold them securely and allow them to hold on. You don’t want their hands flailing around. In the bath I’d always hold them with my left arm behind their neck, curved so that they could grab onto the fingers of my left hand. Hold them close if you can. Also you just have to be calm. The goal here is to calm the baby down and if you are not calm and try to rush the bath, you’re just making things worse.

4.3 Being Bored

For us we first saw this at some point in the second month. Babies are usually happy to just be on their own, but at some point they want some attention.

How to tell if it’s this: This is a cry that’s not too loud and doesn’t escalate. The baby was alone for a while, then starts crying but seems happy when you get there. Definitely check the diaper first, but maybe they just want to be entertained.

What to do: At this point the baby can’t move around on its own and can’t even play. You can do some exercises with it like tummy time (it feels weird to have a crying baby and then make it do pushups in tummy time, but when they’re bored this sometimes helps. They stop crying, but may make “workout sounds” instead) or the bicycle kicks, or show it one of those high-contrast books like “Look, Look!”, (you have to hold it pretty close to their face) or a music toy. If the problem was something else, the baby will let you know. It certainly won’t sit there for 20 minutes happily looking at a book, and it will certainly get really loud if tummy time wasn’t the right choice.

4.4 Witching Hour

That’s what it’s called. For some reason it’s common for babies to get angry in the evening. This happened for us for at least half the days in the first three months. It sucks. It’s just very important that you know this exists. We were puzzled a lot early on before we knew this was a thing. (we’d wonder “you were doing so well all day, what’s up now?”)

How to tell if it’s this: This one sounds like they’re hungry. For us this happened after 6pm, and could go on until we were able to put them to bed. Which could be quite late. It’s the witching hour if you can’t figure out the problem. You change the diaper, they only eat a tiny amount, they don’t want to poop, they don’t want to sleep, there doesn’t seem to be anything you can do to help them.

What to do: Since this one sounds like they’re hungry, you’d think that feeding helps. It helps for like a minute, but they won’t eat much and then they get mad at you again. And if you try to feed them repeatedly they’ll drink too much and then throw up. (this literally happened three times to me…) We eventually got good at managing this, but it’s always draining. Step 1 is to strictly wait three hours between feedings during witching hour. Otherwise they either eat too much or too little and you’re creating more problems. Step 2 is to hold them and walk around with them. The movement is key. (when that gets exhausting they also like it if you bounce on a yoga ball while holding them) The way you hold them also matters. The superman hold works well, but I try to not overuse it so it doesn’t lose its power. Then, ~2 hours after their previous feeding, they sometimes fall asleep for 30 minutes to an hour. After that sleep, feed them and then put them to bed. This is the routine that we eventually figured out. I don’t know if it would have worked at the first witching hour. It never worked super well, but we managed. I just sometimes had to walk around while carrying a baby for 30 minutes or more so that they wouldn’t cry.

4.5. Cranky Tired

(I added this later in an edit) At some point (when they get closer to three months old) babies just want to stay awake. If you let them, they will get cranky. It won’t necessarily look like they’re tired.

How to tell if it’s this: This one sounds complainy and can get very loud. It seems like they have strong mood swings. One second they’re happy, the next they’re very upset at something, then they calm down again before getting upset at the next thing. It’s actually similar to the witching hour, so I buy the theory that the witching hour is related to being over-tired.

What to do: Once they’re over-tired, it’s too late and you will have a hard time getting them to sleep. You can sometimes still rock them to sleep, but they’ll complain even about that. The only fix is to get them to sleep before they’re cranky-tired. For us sleep training helped. Not so much the “cry it out” part of sleep training, but the “have a regular schedule with regular naps” part. We didn’t have this figured out at three months, that’s why I edited this in later. In the fourth month, managing their sleep was probably our biggest problem. But in hindsight this started before they were three months old. Once we had it worked out where they went down for three naps a day without complaint, this mostly went away. Getting there is a whole chapter on its own, which is why you can read so much about sleep training… Just remember that once you find something that works, they will cry a lot less. It’s just unfortunate that nobody seems to have found a really good sleep training method yet, so there is more crying during the training… But if you delay sleep training, you will just get more crying in the meantime than you would have gotten if you had just gotten it over with earlier. They really can sleep through the night at three months old. Getting the naps down actually took longer for us.

5. Illness

We got lucky and didn’t have this in the first three months. At all. So I can’t give recommendations here.

6. Easy Things

Here are some things that are easy to address but should be mentioned for completeness.

6.1. Woken Up From Sleep

Everyone knows to not wake a sleeping baby. Except sometimes they seem to sleep too much and you’re worried they’ll be off-schedule…

How to tell if it’s this: They just woke up from sleep, and it sounds sad, like “woe is me”.

What to do: We sometimes want to wake up the baby if they’re sleeping too long during the day. Like if it’s been almost five hours since their last feeding. Which is good at night, but not good during the day. Since they usually sleep on their tummy during the day (they just sleep better, and we can watch them, so it’s safe) we just flip them onto their back, and they’ll wake up sooner. Also, in the hospital we were told to feed them 8 to 12 times a day, and to wake them up if it’s been more than three hours since their last feeding. (I think because the twins were very little at first) This was bad advice. We did it for exactly long enough until they were back at their birth weight, then we let them sleep. They actually ate better and more often after that.

6.2 Workout Sounds

In the first months, babies should spend a lot of time on their tummy, (only while you can watch, due to SIDS) which makes no sense when you first hear it: Why does it matter if the baby is on their belly or on their back? It’s obvious the first time you try it: The baby immediately tries to raise its neck, working out the neck and arm. They need that exercise and they’ll often do a lot of it.

How to tell if it’s this: They’re in the middle of working out and are making sounds that sound like they are distressed, but are not cries. If they start crying, that’s not what I’m talking about here. In this video she makes a little bit of a sound at 20 seconds. She could get a lot louder than this, but it’s not crying.

What to do: Don’t do anything. When they’ve had enough they can just lie down and rest. For some reason everyone around me wanted to pick up the baby as soon as it made the slightest sound, even though it was clearly exercising. I’d stop them from picking up the baby, we’d watch for ten seconds, then the baby would decide that it had had enough and would rest its head on the mat and relax. There was no problem to begin with. Obviously if they started actually crying, we’d try to find out what the problem was.

7. Fixes for All Kinds of Crying

Here are some things that always make a baby stop crying. Don’t abuse these. Sometimes you need to listen to the cry to find out what the problem is. If you always try to shut them down, you might just be leaving them hungry.

7.1 Shushing While Patting on the Back

This works like magic in the first month: Lay back on a pillow, put the baby on your chest, facing you, then do a slow rhythmic pat on their back (60bpm or slower) while making a calm sssshhh sound. After 30 seconds they stop crying and calm down and stay calm. Keep making the sssshhh sound until they have slept for at least a minute.

7.2 Holding Them While Walking Around

A baby complains less while you hold it. This can make it hard to tell if there is a problem, but sometimes this is necessary, e.g. during witching hour. When holding them is not enough, hold them and walk around. When that is not enough, do the “superman”: hold them so that they’re lying flat on their belly, your hand between their legs, their head in the direction of your elbow, their hands on either side of your arm. Here is a picture of me holding my twins:

If I do this and walk around, that has literally calmed them from every single thing they’re complaining about, except when they get really hungry. Meaning if even this doesn’t work, they must be hungry.

7.3 Stroller Rides

Babies fall asleep in strollers. Unfortunately we couldn’t do as many stroller rides as we wanted because it was freezing out this winter. Apparently Aquaphor protects their skin against the cold. We got a cream from a German brand that helped, too. Still even with that we were hesitant to take them out below 0 degrees after they got really bad heat rash once. The other thing you can do is drive them around in a car, but we don’t have a car.

Overall Approach

The main thing to remember is that every baby is different and that you’ll have to adjust to them. Our twins are supposed to be identical twins but we’re not so sure. They’re easy to tell apart and had large differences in behavior from day one:

As I mentioned above, we tried sticking to a “feed every 3 hours” schedule in the first week, which worked well for one twin, but was bad for the other. She would end up too tired to eat.
One twin had no problem eating in one sitting, the other often needed to poop after eating a little bit, then wanted to eat more after a break.
One twin ate well on the breast, the other had to be bottle-fed often because she quickly got exhausted on the breast.
One twin would push out the bottle when she was done, the other would just hang around, nibbling a little but not drinking much more. (same on the breast)

There were more differences but the point is that even identical twins have large, noticeable differences from day one. You will have different experiences from us. You just have to notice these things and adapt to what works for the baby. There is no forcing them to do anything at this age, or trying to train them. I remember us being confused early on where one was crying and we said “she just ate, then pooped, then we changed her diaper, and she is still crying. What could possibly be bothering her?” Answer: She didn’t eat enough the first time because she needed to poop. We only noticed this pattern after a few days. There is nothing you can do about it. You just have to adjust to the kid. And then of course things change every week.

Discussion

How good can you get at managing a baby’s cry? We never got very good. There were days where it felt like we had it under control and we had happy babies all day and everything was great. And inevitably the next day would be bad. After I had mostly finished this blog post we had a day where both twins were complaining a lot, always eating slightly less than I expected them to eat, and not napping well. We thought it might be gas because one of them seemed slightly pained, so we gave her Simethicone. Which didn’t really help a lot, and bicycle kicks etc. also didn’t work, so we thought the problem was that she hadn’t slept enough all day because that makes them generally irritated. Or maybe they hadn’t eaten enough all day, for similar effect. But inevitably at the next feeding she’d eat slightly less than I expected again and wouldn’t nap well. We were busy all day just managing these babies. (luckily it was a Sunday) Then in the evening I carried one of the twins around in Superman pose and she finally farts for a very long time. After that she stops complaining. The other one also farts a little bit during the next feeding and also finally sleeps. We finally put them both to bed, and the second one again wakes up thirty minutes later, crying. We finally use the Windi which gets a lot of gas out of her. After that she falls asleep like a stone and sleeps well…

So even after having learned all of the above, we still had days like that where we just had to try things. But since most days weren’t like that, I still think overall we got really lucky with our twins. They didn’t cry nearly as much as we heard from other people. From week one there seemed to always be a reason. Maybe that made it easier to identify these issues. But it could also mean that I’m missing at least one big thing that we just didn’t run into. Also keep in mind that this is literally just from the experience of getting two kids through the first three months. I would have loved to have this list up front because we often felt like terrible parents when we finally figured something out and realized that our kids had suffered unnecessarily, so if you’re currently at zero kids, hopefully this helps. If you already have more, feel free to tell me in the comments about all the things I forgot or got wrong. I’m sure there are many.

Appendix: After Four and a Half Months

I’m writing an appendix after four and a half months just to say how much has changed. It really is true that at three months old, babies change completely. So many of the problems above no longer happen to us: We never hold them over the sink any more to poop, they just go in their diaper, no problem. Gas is also easy for them. They don’t complain as much about a wet diaper any more, so we use fewer diapers per day. Bath time is now a pleasure, they don’t cry when their nose is only a little stuffy, reflux hasn’t been an issue in a while and the witching hour is gone. The new biggest challenge is getting them to sleep enough. I added an edit above about “cranky tired” which became a big issue during the third month, and became the main issue in the fourth month. At four and a half months, it feels like we’re finally doing well on that. (though still with room for improvement)

All of this is just to say: Things quickly got better once we made it past three months. There are always new problems coming up, but the new ones are more fun, like when they want to keep on playing with you even though you’re exhausted. That’s a good problem to have.

Automated Game AI Testing

Malte Skarupke — Sat, 15 Jan 2022 18:37:19 +0000

In 2018 I wrote an article for the book “Game AI Pro 4” called “Automated AI Testing: Simple tests will save you time.” The book has since been canceled, but the article is now available online on the Game AI Pro website.

The history of this is that in 2017 there was a round table at the Game Developers Conference about AI testing. And despite it being the year 2017, automated testing was barely even mentioned. It was a terrible round table. A coworker who sat in the audience with me said to me that I could have given a better talk because I had invested a lot of work into automated testing. So next year I submitted a talk about automated AI testing and was rejected. But they asked me to write for the book instead. Now the book is canceled, too. A copy of the article is below, with some follow ups at the end. It’s written for people who have never done automated testing, like the AI programmers at the round table. But I think the core trick of doing fiber-based control flow, that can wait for things to happen, could be widely useful:

Introduction

Game programmers have been slow to adopt automated testing. There are many reasons for that, one of which is that games fall under the category of code that is “hard to test.” But inroads are being made: I have seen good test coverage for lower level libraries, and graphics programmers have figured out how to test their code well in recent years. (automated screenshot diffing with good visualization showing how two screenshots differ) I hope to include AI code in the group of code that is “easy to test,” and I will explain how to do that in this chapter.

The main insights are that you can reproduce a lot of AI bugs with no more than two characters, that fibers are a great match for writing sequential code that takes many seconds to run, (such as AI tests) and that most complicated failures are a result of simple underlying causes that can be tested in isolation. I will illustrate how to write a simple testing framework that not only helps us write tests, it also makes debugging of problems easier.

Getting Value from Automated Testing

These days it’s uncontroversial to say that automated tests are an accepted best practice for programmers. An informal survey among my coworkers found that most of them want to write more tests, they’re just finding it hard to do so. It’s understood that all code lies on a spectrum from “easy to test” to “hard to test” with things like std::vector and std::sort being on the “easy” side and video games being on the “hard” side.

But AI code is actually fairly easy to test manually: Often all we have to do is set up a test level with a few AI characters and watch what happens. We will try to convert that ease of manual testing to an ease of automatic testing. In fact we will start with the simplest, easiest automated test.

If you’re new to testing, you have to start testing the easy things. There is no shortcut to getting better at testing, and if you start with difficult tests, you will just get frustrated and give up. So if you are writing a small utility function or a container, (for example your own spatial data structure, which C++ has a shortage of) start with writing tests for those. Then as you get more experienced you should not move on to the things that are “hard to test,” but instead you will find that more things now seem “easy to test” than did at first. Maybe that big, hairy class actually has an ad-hoc implementation of a container inside of it. If you pull that out, you can test the code. Making the class less hairy, and giving you a safety net to make further changes. The more tests you write, the better you’ll get at this.

So the first lesson for getting value from automated tests is this: If something is hard to test, don’t write a test for it. It’s just not worth the time invested in writing and maintaining the test. Instead try to find a way to make it easier to test. But don’t fret if you can’t. You don’t need to write tests for everything.

To really get value out of tests though, we have to see that tests have a second dimension. Not only are some tests “easy to write” and others “hard to write”, there is also a dimension of specificity: Some tests tell you exactly what is wrong, others are only able to tell you that “something broke.” The first kind of test can significantly cut down on your debugging time. The second kind of test doesn’t shorten your debugging time, it merely points out that there is a problem. An example for a test that’s “easy to write but not specific” is a smoke test. A smoke test is usually a very simple test like “start the game, shut down the game, tell me if there were any errors.” It’s a good test to run automatically after every submit to version control. It finds problems surprisingly often. However all it can tell you is that there is a problem. It doesn’t help you narrow down where the problem came from: If a smoke test tells you that you get an error on shutdown, that will take exactly as long to debug as if another person tells you that they got an error on shutdown. (that’s why you have to run smoke tests on every submit, because that narrows down potential culprits)

With that, the best kind of tests are tests that are easy to write and tell you exactly what is wrong. The peak of that are unit tests: If a unit test for std::sort breaks, you know exactly where to start looking for the problem.

Since unit tests can be so valuable, I will talk a little bit about unit testing, but most of AI testing doesn’t fit into unit tests, so after a brief detour into unit tests, we will talk about how to write AI tests specifically.

Unit Testing

Unit tests are the best kinds of tests because they are easy to write and point at a specific area of code for the source of the problem. I will keep this section short because I have found it hard to write unit tests for most of my AI code, but I still felt it necessary to include this section because unit tests can be so valuable.

In AI code I have found that unit tests are appropriate in small utility functions like matrix math, or utility classes like spatial data structures. As an example you can see a simple unit test in the code below. The test is written in the Google Test library, which is the unit testing library I use most often.

struct VisionCone
{
    float cos_angle; // result of dot product
    float radius;
    float vision_score;
};
struct VisionConeCollection
{
    VisionConeCollection(std::vector<VisionCone> cones);
    float ScoreAt(Matrix4x4 head_matrix, Vec3 position) const;
    // ...
};

TEST(vision_cones, single_cone)
{
    VisionConeCollection c({ { 0.7071f, 10.0f, 0.5f } });
    Matrix4x4 id;
    // in cone
    ASSERT_EQ(0.5f, c.ScoreAt(id, { 0.0f, 0.0f, 5.0f }));
    // too far away
    ASSERT_EQ(0.0f, c.ScoreAt(id, { 0.0f, 0.0f, 15.0f }));
    // behind head matrix
    ASSERT_EQ(0.0f, c.ScoreAt(id, { 0.0f, 0.0f, -5.0f }));
}

I’ve included summaries of the VisionCone and VisionConeCollection classes so that you can understand what’s going on in the test, but we won’t be too concerned with those specific classes. The idea is to have multiple vision cones, each of which has a different “vision score.” But what’s important about this unit test is that:

1. It was fast to write. My rule of thumb is that it should be faster to write a test and to run it than it takes to test the same code in the game. If it takes me a minute to launch the game and to spawn AI characters on which I can test my vision code, then it should take less than a minute to write and run the test. If a test is the fastest way to run your code, you will write more tests. This is mostly a pipeline issue and you simply have to set up your dev environment to make it fast to run tests.
2. It runs fast. Google Test always shows how long a test takes, and this test always runs in “0ms.” If a test of mine takes longer than 10 milliseconds, I either rewrite it or delete it. If you have too many slow tests around, you will not use tests as often.

This simple example will be all I use to illustrate how to use unit tests. The earlier that you start to write tests for a piece of code, the better the interfaces of that piece of code will be for testing. From here it’s easy to extend the tests by adding more cases. Or let’s say you implement different behavior for the y-axis than for the xz-plane. You should add a test for it to get faster iteration times. Or if you want to add hysteresis so that visibility doesn’t flicker on and off when the player is standing right on a threshold, add a test for it. You can also add edge cases (like what if a character asks if it can see itself) and be sure that they never break. As the feature grows, the tests grow with it.
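To make the test above more concrete, here is a minimal sketch of what a ScoreAt implementation could look like. It is not the code from the game: Vec3, Sub and Dot are hypothetical helpers, the head is described by a position and a normalized forward vector instead of a Matrix4x4, and the rule that the highest-scoring matching cone wins is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical minimal vector type; a real engine would supply its own.
struct Vec3 { float x, y, z; };

inline Vec3 Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct VisionCone
{
    float cos_angle;    // cosine of the half-angle of the cone
    float radius;       // maximum viewing distance
    float vision_score; // score reported for positions inside the cone
};

class VisionConeCollection
{
public:
    explicit VisionConeCollection(std::vector<VisionCone> cones)
        : cones_(std::move(cones))
    {}

    // Assumption: head_forward is normalized, and if several cones contain
    // the position, the highest vision_score is returned.
    float ScoreAt(Vec3 head_pos, Vec3 head_forward, Vec3 position) const
    {
        Vec3 to_target = Sub(position, head_pos);
        float distance = std::sqrt(Dot(to_target, to_target));
        if (distance == 0.0f)
            return 0.0f; // edge case: a character asking whether it can see itself
        float cos_to_target = Dot(to_target, head_forward) / distance;
        float best = 0.0f;
        for (const VisionCone& cone : cones_)
        {
            if (distance <= cone.radius && cos_to_target >= cone.cos_angle)
                best = std::max(best, cone.vision_score);
        }
        return best;
    }

private:
    std::vector<VisionCone> cones_;
};
```

With an interface like this, the three cases from the test (in cone, too far away, behind the head) each exercise a different branch, and the “can it see itself” edge case gets an explicit answer.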

Confidence for More Complex Tests

I can sense your skepticism about the previous example: AI vision rarely breaks because you introduced a bug in the vision cone code. Instead AI vision probably breaks because AI characters are now wearing helmets, and the raycast is hitting the helmet. Or because an artist introduced a new glass model and forgot to set the see-through flag on it. And even if AI vision breaks, that’s an easy bug to fix. The hard bugs are results of several characters interacting in unexpected ways. How are you possibly supposed to test all that?

To start with remember the first lesson: We’re not going to write tests for things that are hard to test. But if we’re only going to write easy tests, how much value are we going to get? Research from other fields suggests that we’ll get a lot of value.

In a study on concurrency bugs, Lu et al. found that 96% of concurrency bugs can be reproduced by enforcing a certain execution order of just two threads. Similarly 97% of deadlock bugs are the result of two threads waiting for two resources. In a study on distributed computing, Yuan et al. found that 84% of bugs are reproducible with just two computers. 98% are reproducible with three computers. I claim that something similar is true for AI code: Most AI bugs are reproducible with two characters. I don’t have the percentage numbers on how many bugs exactly can be caught with simple tests, but the numbers from the other fields should give us some confidence that it’s a good amount.

I want to clarify that I’m not talking about simple bugs here. In a talk about Ding Yuan’s study on distributed computing, he emphasizes that a lot of the investigated bugs were really complicated. But when you trace the bug all the way back, you often find a root problem that would have been easy to detect. Similarly in AI we often see confusing behavior, and we have to look at several frames of history to find out that the AI is acting weird because of a simple root cause. Maybe it failed to play a certain animation, or the animation failed to move the character. Or maybe as soon as it got into a vehicle it stopped seeing its target because the raycasts collide with the vehicle.

When that happens wouldn’t you rather have a test that verifies that the AI can move to where it’s supposed to be able to move to? Or that an AI can still see when it’s sitting in a vehicle? If one of those tests breaks, the problem is a lot easier to fix than if you’re just seeing unexplained weird behavior in the game. Even if the problem ended up being caused by something else, it’s still useful to quickly rule out whole categories of problems: “it’s probably not an AI vision problem because the test for that is currently green.”

Testing in a Game Engine

You probably already have a bunch of manual tests for your AI. Test levels with gray boxes where you can spawn a few characters and observe them in simple scenarios. You probably also have a “free-fly mode” (or “ghost mode”) in which no player character spawns and you can just observe the AI. All we have to do is run those manual tests automatically and detect whether they behave the way we want or not.

How to jump into a test level and set the game into a mode where you can observe your AI will depend on your engine. You should set it up to be able to launch tests from a command line argument, or using an in-game command. The command line argument is for launching the test on an autobuilder. The in-game command is useful to quickly launch tests on other people’s computers to verify if something works or to show bugs to other programmers. But then what does writing a test actually look like?

After a few misguided attempts, I realized that fibers are a perfect match for writing the kind of logic that a test needs. The main benefit is the ability to suspend a fiber at any point in the function. That makes it possible to write similar checks as in Google Test, but to give the engine time to fulfill the criteria.

As an example here is a very simple test: I spawn two characters of opposing factions. One of them has a weapon, the other is brain-dead. The test simply asserts that the character with a weapon will defeat the brain-dead character.

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    Character* bd = GetObjectByAlias("braindead");
    JT_ASSERT(bd != nullptr);
    return test_runner.WaitUntil([&]
    {
        return !bd->IsAlive();
    },
    "Waiting for the character to be killed", 30.0f);
}

In this test most of the setup is done in content. The two characters are configured in the test level. I don’t have to refer to two characters in code because I only care about what happens to one of them. I get that character through its “alias” which is an engine-specific feature to refer to objects from code or script. Your engine probably has a similar feature. The TestRunner is a small wrapper around the fiber that’s used internally. It provides one important function: WaitForOneFrame(). All other test functionality can be built on that function. The WaitUntil function I use in the example just calls the condition-lambda and if it’s false, calls WaitForOneFrame. I pass a message into the function to display on the screen while the test runs. That message will also be displayed on the autobuilder if the test fails in this step. The final argument is a time-out in seconds. If after 30 seconds the lambda still returns false, the test fails. The other possible place where this test can fail is in the JT_ASSERT macro, which simply returns TEST_FAILED if the given condition is false.

Before we get into how this is implemented, I want to point out a few features that make this easy to implement. First, the test doesn’t have to establish preconditions. The RunTest function is actually a virtual function in a class, and there is a second virtual function called GetPreconditions, which the test framework calls before calling this function. Since many tests require a similar setup, I wrote the code for that once. In GetPreconditions you can indicate a test level that you want to have loaded, the position of the camera, whether you want to have a player character or be in ghost mode, and you have the ability to turn off some global engine features. (such as spawning of traffic on roads) The test framework then ensures that all your preconditions are true before it runs the test, so that the RunTest function really only contains what’s necessary for the test. The precondition code is engine-specific, so I won’t go into it further.

The second important feature is the ability to specify a time-out for a condition. In a normal testing framework like Google Test you only want to assert that something is true after a given sequence of calls, but when testing AI you often can’t be that precise. Depending on the length of animations and on details of your behavior tree, things can take a few frames more or less. So instead I found it useful to check whether something becomes true in less than X seconds.

The third important feature is that this is written in normal C++ and that I have access to all of the normal functions of the engine. I believe that it is very important that tests are easy to write, and that is only the case if I can access a function from a test without having to do any extra steps. (such as exposing the function to a scripting language)

To start implementing a test framework, you just need the ability to implement the magic WaitForOneFrame function. If you have that, you can implement all other utility functions that you may need on top of that. As an example, I sometimes need the ability to get an object by its alias (as in the test above) but the object may not have spawned yet, and I just want the test to wait until the object is spawned. That is easy to implement as a utility function: Try to get the character and call WaitForOneFrame if it doesn’t exist yet. If the character doesn’t exist after X seconds, I fail the test.
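As a rough sketch of how such utilities can be layered on top of a single waiting primitive: the class below is a hypothetical stand-in, not the real TestRunner. It assumes the frame wait is injected as a callback (in an engine it would yield the test fiber) and that every frame lasts a fixed number of seconds. WaitForObject and Message are names invented for this example.

```cpp
#include <functional>
#include <string>
#include <utility>

enum TestResult { TEST_SUCCEEDED, TEST_FAILED };

// Hypothetical stand-in for a test runner. Everything is built on
// WaitForOneFrame, which here just calls an injected callback.
class SketchTestRunner
{
public:
    SketchTestRunner(std::function<void()> wait_for_one_frame, float seconds_per_frame)
        : wait_for_one_frame_(std::move(wait_for_one_frame))
        , seconds_per_frame_(seconds_per_frame)
    {}

    void WaitForOneFrame() { wait_for_one_frame_(); }

    // Checks the condition once per frame until it becomes true
    // or the time-out has expired.
    TestResult WaitUntil(const std::function<bool()>& condition,
                         const std::string& message, float timeout_seconds)
    {
        message_ = message; // an engine would draw this on screen and log it on failure
        for (float waited = 0.0f; !condition(); waited += seconds_per_frame_)
        {
            if (waited >= timeout_seconds)
                return TEST_FAILED;
            WaitForOneFrame();
        }
        return TEST_SUCCEEDED;
    }

    // Retries a lookup every frame until it returns a non-null object.
    template<typename T, typename Lookup>
    TestResult WaitForObject(Lookup lookup, T*& out,
                             const std::string& message, float timeout_seconds)
    {
        return WaitUntil([&] { out = lookup(); return out != nullptr; },
                         message, timeout_seconds);
    }

    const std::string& Message() const { return message_; }

private:
    std::function<void()> wait_for_one_frame_;
    float seconds_per_frame_;
    std::string message_;
};
```

Because WaitForObject is just WaitUntil with a different condition, the time-out and message handling only have to be written once.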

So how do we implement the WaitForOneFrame function? I implemented it using a fiber, but you could also implement it using a thread. When using a fiber it’s a simple wrapper around the yield function provided by your fiber library. (for example it’s simply called yield in boost::coroutine)

When using threads you can implement this using two semaphores: The test runs in its own thread, the rest of the game gets controlled by a main thread. At a good point in the frame, when it’s safe for the test thread to access and mutate global state, the main thread signals on the first semaphore that the test thread should run. It then waits on the second semaphore. The test thread was waiting on the first semaphore and runs now. In WaitForOneFrame it signals the second semaphore and waits on the first semaphore again. With that the two threads take turns. All other threads are sleeping while this is happening. They get woken up by the main thread when it is done waiting.
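The two-semaphore handoff described above could be sketched roughly like this. The class and function names are made up for the example, and the small Semaphore class is built from a mutex and a condition variable so that it compiles with older compilers (with C++20, std::binary_semaphore could be used instead). Waking and sleeping the engine’s other threads is left out.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Minimal counting semaphore.
class Semaphore
{
public:
    void Signal()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }
    void Wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Hypothetical runner: the test body lives on its own thread but only
// executes while the main thread is blocked in RunTestUntilNextWait().
// The caller must keep pumping until IsFinished() before destroying it.
class ThreadedTestRunner
{
public:
    explicit ThreadedTestRunner(std::function<void(ThreadedTestRunner&)> body)
        : thread_([this, body = std::move(body)]
          {
              run_test_.Wait();   // wait for the first signal from the main thread
              body(*this);
              finished_ = true;
              run_main_.Signal(); // hand control back one last time
          })
    {}

    ~ThreadedTestRunner() { thread_.join(); }

    // Main thread: call at a point in the frame where the test may touch game state.
    void RunTestUntilNextWait()
    {
        run_test_.Signal(); // let the test thread run
        run_main_.Wait();   // sleep until it waits again or finishes
    }

    // Test thread: give control back to the main thread for one frame.
    void WaitForOneFrame()
    {
        run_main_.Signal();
        run_test_.Wait();
    }

    bool IsFinished() const { return finished_; }

private:
    Semaphore run_test_;
    Semaphore run_main_;
    bool finished_ = false; // only touched while the other thread is blocked
    std::thread thread_;    // declared last so the semaphores exist before it starts
};
```

Since exactly one of the two threads runs at any time, the test body can read and write shared state without extra locking, which is the property the paragraph above relies on.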

Making the main thread and all other threads sleep while the test thread is running slows down the engine a little bit, but I haven’t found that to be a problem. It’s a price I’m willing to pay to make the tests reliable. I simply don’t have to worry about which state I can access or mutate because I know that nothing else is running.

Pausing, Canceling, Restarting and Repeating a Test

Running tests with timeouts comes with a few problems, all related to time. The first is that the timeout can make things hard to debug. In my example above of giving one character thirty seconds to kill a different character, what happens if something goes wrong and I want to debug the problem? I open our developer menu, turn on some debug drawing, look at the internal state of the AI, and before I know it the thirty seconds are over, the test fails and everything de-spawns. Oops. So the first additional feature you’ll want is to be able to pause the test. This simply means not calling into the test fiber. The game simulation continues to run. So in my example if I pause the test, it merely pauses the time-out. (but the character continues fighting) But if the test had multiple steps, the test wouldn’t advance to the second or third step of the test as long as it’s paused. The pause feature is toggled using a global variable that’s easy for me to change from our debug menu.

The second problem is that if you accidentally launch the wrong test, you have to wait 30 seconds or a minute for the test to finish. The solution for that is the ability to cancel the test. For that I changed the TestResult enum to have a TEST_INTERRUPTED value. The function WaitForOneFrame will now return that value when I issue the command to cancel the test from our debug menu. There are two subtleties with this:

The first subtlety with interruptions is that you have to make sure that the return value from waiting calls is always handled the same way, so that the test returns should it be interrupted. It might be possible to implement that using exceptions, but we (like many game developers) compile without exceptions. The second best approach I have found is to implement a macro called JT_CHECK that returns on interrupts or failed tests. Any call to a waiting function has to be wrapped in JT_CHECK. Here is what the macro looks like:

#define JT_CHECK(...) do {\
::jt::TestResult wait_result = __VA_ARGS__;\
if (wait_result == ::jt::TEST_INTERRUPTED\
        || wait_result == ::jt::TEST_FAILED)\
    return wait_result;\
} while(false)

As usual in C++ macros, this isn’t pretty. (sorry) To use this in the test above, instead of returning the result of the wait function manually, I would wrap that call in a JT_CHECK macro. It doesn’t make a difference for a one-step test like that, but for a test consisting of multiple steps, every line that can wait has to be wrapped in this macro.

The second subtlety comes directly from this macro and from the JT_ASSERT macro: The test can return at any point. What this means is that all the state that the test creates has to be wrapped in RAII structs so that the state gets cleaned up at the end of the test. For example if the test spawns characters, they have to despawn when the test gets canceled, so there needs to be a RAII wrapper that despawns the character in its destructor. I avoided that in the test above by doing most of my setup in content, not in code, so the test framework handles this for me by unloading the test level.
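A small sketch of that RAII pattern, under made-up names: World stands in for whatever the engine uses to spawn and despawn characters, ScopedCharacter is the wrapper, and SKETCH_CHECK mirrors the early-return behavior of the JT_CHECK macro shown earlier. None of these are the real engine types.

```cpp
#include <algorithm>
#include <vector>

enum TestResult { TEST_SUCCEEDED, TEST_FAILED, TEST_INTERRUPTED };

// Hypothetical stand-in for the engine's world; it only tracks spawned ids.
struct World
{
    std::vector<int> alive;
    int next_id = 0;

    int Spawn()
    {
        alive.push_back(next_id);
        return next_id++;
    }
    void Despawn(int id)
    {
        alive.erase(std::remove(alive.begin(), alive.end(), id), alive.end());
    }
};

// Spawns a character and despawns it when the scope ends,
// including when the test returns early.
class ScopedCharacter
{
public:
    explicit ScopedCharacter(World& world) : world_(world), id_(world.Spawn()) {}
    ~ScopedCharacter() { world_.Despawn(id_); }
    ScopedCharacter(const ScopedCharacter&) = delete;
    ScopedCharacter& operator=(const ScopedCharacter&) = delete;
    int Id() const { return id_; }
private:
    World& world_;
    int id_;
};

// Returns from the enclosing function on interrupts or failures.
#define SKETCH_CHECK(...) do {\
    TestResult check_result = __VA_ARGS__;\
    if (check_result == TEST_INTERRUPTED\
            || check_result == TEST_FAILED)\
        return check_result;\
} while(false)

// first_wait_result stands in for the result of a waiting call.
TestResult RunSpawningTest(World& world, TestResult first_wait_result)
{
    ScopedCharacter attacker(world);
    ScopedCharacter target(world);
    SKETCH_CHECK(first_wait_result); // may return here; both characters still despawn
    // ... further steps of the test would go here ...
    return TEST_SUCCEEDED;
}
```

Whichever line the test returns from, the destructors run, so a canceled test leaves no characters behind.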

With that out of the way, we are able to interrupt tests. Which immediately allows us to implement the next feature: Restarting of tests. It’s an option in our debug menu that cancels the current test and immediately starts it again. This feature makes working with tests a pleasure because I can very quickly test a certain situation over and over again. If a problem happens one out of ten times, I create a test with the initial conditions of the problem, and restart the test until the problem occurs.

The final feature is the ability to run a test on repeat so that when it finishes, it automatically restarts itself. This is also now trivial to implement and is also done to reproduce rare issues.
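The repeat feature can be pictured as a loop around whatever starts a test. The sketch below assumes a hypothetical run_test callback that sets up the preconditions, runs the test to completion and returns its result. The max_runs limit is an addition for the example so that the loop always terminates.

```cpp
#include <functional>

enum TestResult { TEST_SUCCEEDED, TEST_FAILED, TEST_INTERRUPTED };

struct RepeatOutcome
{
    int runs;               // how many times the test was started
    TestResult last_result; // result of the final run
};

// Reruns the test until it fails, gets interrupted, or max_runs is reached.
RepeatOutcome RepeatUntilProblem(const std::function<TestResult()>& run_test, int max_runs)
{
    RepeatOutcome outcome{ 0, TEST_SUCCEEDED };
    while (outcome.runs < max_runs)
    {
        outcome.last_result = run_test();
        ++outcome.runs;
        if (outcome.last_result != TEST_SUCCEEDED)
            break; // stop here so the failing state can be inspected
    }
    return outcome;
}
```

Stopping on the first non-successful run keeps the game in the state where the rare problem just happened, which is the point of running on repeat.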

With these tools the testing framework is not only a useful tool to find issues, it also saves us time by giving us additional debugging features such as the ability to run a scenario on repeat until a problem occurs.

Simple Tests

We finally have everything in place to start writing tests. So what kind of tests should we write? Earlier in the chapter I said we should test things that are “easy to test” and we should write tests that tell us specifically what is wrong. An easy test might be to set up a fight scene between ten characters of one faction on one side and five characters of an opposing faction on the other side. Then we assert that the side with ten characters wins. But even though that’s an easy test to write, it’s not very specific. A test like that can fail in a million ways. Also who says that five characters can’t win against ten characters? What if one of them gets lucky with a well-placed grenade?

So we want tests that tell us much more directly what is wrong. The test of “one character with a weapon against one brain-dead character” that I used as an example above is an improvement, but we can be more specific still.

A good test to start is to test the perception of characters. Place two characters in an open field and assert that they can see each other. Place a wall between them and assert that they can not see each other. Then do the same thing for a pane of glass. Then put one of the characters in a vehicle. Then put one of the characters behind a mounted gun on a raised platform. Add more edge cases as you encounter them.

We can also make that test dynamic: Put two characters in the open and assert that they can see each other. Then make one of them walk behind a wall and assert that they can no longer see each other. This test can fail even if all the previous tests succeeded, for example if you have a bug in the logic for when to update vision information. (because you don’t want to do a raycast every frame) Next, if characters have an “investigate” mode you can add a third step to the test by testing that after one character has disappeared behind the wall, the other character “investigates” and catches up with the first character. That test can be seen below. I will use that test to illustrate a couple patterns that I often use.

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    Character* c = GetObjectByAlias("custom");
    Character* n = GetObjectByAlias("normal");
    JT_ASSERT(c != nullptr);
    JT_ASSERT(n != nullptr);
    c->SetFaction(PLAYER_FACTION);
    auto can_see = [&] { return n->CanSee(c); };
    JT_CHECK(test_runner.WaitUntil(can_see,
        "Waiting for normal to see custom", 1.0f));
    SendGlobalEvent("start_walking");
    JT_CHECK(test_runner.WaitWhile(can_see,
        "Waiting for custom to walk behind the wall", 10.0f));
    JT_CHECK(test_runner.WaitUntil(can_see,
        "Waiting for normal to investigate", 30.0f));
    return jt::TEST_SUCCEDED;
}

The first pattern is that I have one character with a custom behavior tree. That tree is written specifically for this test. All it does is it waits for the global event “start_walking” and then walks to a predetermined spot behind a wall. The other character is a completely normal enemy as it would appear in the game. Since I only care about the behavior of one of the two characters in this test, I like to have complete control over the other one. It also makes the test code easier because all I have to do is send a global event.

With that we can look at how the test works: Before the test starts, we enter a test level with two characters and a wall. I assert that the two characters exist, then I set the faction of the custom character to the PLAYER_FACTION. This is a second pattern: Our AI has different behavior in AI-vs-AI fights than when fighting the player. They never enter the “investigate” mode when fighting other AI. They only do that when fighting the player. So to test the investigate behavior, we simply set the faction of the custom character to the player faction. In all of our AI logic we only use the faction to determine whether an enemy is the player or another AI, so if I set the faction, I get player behavior.

Finally we see that the logic is actually quite simple: I wait one second for the characters to see each other. Then I give the signal to start walking. Then I wait ten seconds until the characters can no longer see each other. At this point the normal character should enter investigate mode. I now wait 30 seconds until they can see each other again. Since each of these waits is an upper bound, the test usually runs faster than that.

Even though this test is testing a complicated sequence, it’s actually very simple in the implementation. For example if there is a bug in the “investigation mode” then the last step would time out after 30 seconds and the test would fail.

Your tests shouldn’t get more complicated than this. I will list a few more tests to get you started, then I’ll explain how to select your own tests to write. First, test movement. If your characters can climb ladders, test that they can climb ladders. If your characters can enter vehicles, test that they can enter and exit vehicles, and that they can drive vehicles where they’re supposed to be able to drive to. You can also repeat my first example test for all kinds of situations: Make sure that a shooting helicopter can kill a brain-dead opponent. Make sure that a character at a mounted gun can kill a brain-dead opponent. Then test that in a combat scenario, all characters enter cover within X seconds. Test that all characters have shot at least once within Y seconds.

Overall I don’t recommend writing too many tests though. I recommend writing tests for one of two reasons only: 1. To more quickly iterate on a new feature. 2. To reproduce a bug. Those two reasons ensure that I have a small set of tests that runs somewhat quickly. Sometimes people try to write really slow tests like “spawn every vehicle and check for errors.” Those kinds of tests are just an invitation for lots of maintenance work. You will certainly run into situations where one vehicle doesn’t work, and when you tell the responsible person they answer “oh yeah we know. It’s only used in one mission, and that mission has bigger problems right now. We’ll fix it before alpha.” (where alpha is a year away) And now you have a broken test in your system that just always gets in the way. So don’t be too aggressive about your tests, and try to write specific tests.

The other open question is what to do about more complicated situations. Like what do we do if we have squad behavior for up to four characters? I would say that if something seems complicated to test, leave it alone for now. Tests don’t solve everything. We don’t lose anything by not writing a test for this. But we may lose something if we add an overly complicated test that requires lots of maintenance. So leave it alone, and just debug it the old-fashioned way. Maybe you will come up with nicely targeted tests later that can reproduce certain problems. Until then you still have a bedrock of simple tests that you can rely on. Don’t test things that are hard to test. And with experience you will be able to make more things easy to test.

Conclusion

I hope to have moved AI code from the things that are “hard to test” to the things that are “easy to test.” The biggest thing I didn’t show was the code for ensuring preconditions, but that code is mostly just code for loading test-levels. (or teleporting to test islands and waiting for streaming in our open world engine) Otherwise the tricks of using fibers to be able to write the logic in sequence, and the trick of using time-outs instead of asserts were most of what was required to make AI testable. That, and the realization that most AI bugs have simple causes that can be reproduced using two characters.

The testing framework that we ended up with not only helps us reproduce bugs, (and catch them early should they come back) it also saves us time while debugging issues by providing the pause, restart and repeat features. Also since all the tests are written in C++, we can call normal functions and we can step through the code using normal debuggers. While I am not yet able to write good tests for the most complicated AI bugs, I have caught many bugs with relatively simple tests. And the longer I’m doing this, the better I get at coming up with simple tests that catch sources of complex problems.

Postscript for 2022

That was the article. I wrote this testing framework for Just Cause 4, but that game is not exactly known for having bug-free AI. So what happened? I think the testing framework made it so that the AI had far fewer bugs than it would have had otherwise, and it allowed us to ship many more features than we could have otherwise shipped, but why wasn’t the overall result good? I can think of four reasons:

1. Not enough AI devs
2. Too many escort quests
3. General company culture of breaking things
4. My inability to convince anyone else to write tests

1. Not enough AI Devs

The first one is a lame excuse, but it has some truth to it. We had 2 AI designers and 2 AI programmers. As a comparison, let’s look at Horizon: Zero Dawn, which came out a year earlier. It had 11 AI devs, plus 6 combat designers, which is work that also fell on the AI team in Just Cause 4. So 4 people vs 17 people. That’s why Horizon: Zero Dawn has much more impressive AI.

2. Too Many Escort Quests

This was a design decision that the AI team objected to during all of development. Our AI designers told the other designers to not do any escort quests because players don’t like them and you really can’t do them well in a Just Cause game. There is way too much chaos. There are always helicopters falling from the sky or reinforcements driving in with a tank, blocking the way. But nobody listened to us and a year before release it became clear that roughly half of all missions would be escort quests. Escort quests are a great way to show off every single problem that your AI has. There was one mission early in the game where you have to help a car drive half-way across the world. There are roadblocks along the way that you have to clear and the AI has to drive around whatever obstacles remain. Also there is normal traffic and enemies are chasing you, so your ally’s car has to drive fast, swerving through traffic. Everything is systemic, almost nothing is scripted. We don’t just drive on a spline. It’s a total nightmare for an AI programmer. This mission was a huge time sink for us and the other AI programmer probably still has nightmares about that mission… You will notice that Horizon: Zero Dawn cleverly has none of those elements. (objects intentionally blocking the AI’s path, traffic, other cars chasing and ramming your ally that has to drive on a road)

Why did we have all those escort quests? In hindsight I realize that the problem is that you really can’t design levels for the main character in Just Cause 4. He is way too mobile. Want to set up a problem where an automated machine gun is blocking your forward progress? Rico Rodriguez can just fly over that. Want to have a gauntlet of difficult enemies blocking the way? You can just climb on any roof and run straight to the end of the level. So most of the missions somehow restrain Rico, just to allow level designers to design levels. So it’s lots of escort quests or “guard this point” objectives… JC3 had a design that worked better with the freeform movement, but we moved away from that to more designed levels because we wanted more variety. It didn’t work…

3. General Company Culture of Breaking Things

The half-life of a working feature was roughly six weeks. That was the pacing of our internal milestones. Of all the features that were working for milestone X, roughly half would be broken for milestone X+1. In hindsight it was incredible how much time we wasted on things that broke over and over again. Wanted to play a mission that we delivered two milestones ago? Good luck. You’d probably have to spend a day to a week to first get it to work again. (if you don’t believe me that it was this bad, listen to the round table about AI testing and think about what environment these people work in to talk like that. I think our environment was worse than industry-average) There were a few programmers whose things didn’t break this often. I was one of them, and my testing framework really helped me there.

4. My Inability to Convince Anyone Else to Write Tests

I don’t know why I was so bad at this. The testing framework really was easy to use. If you don’t believe me, we also had Google Test. When I joined the company fresh out of college, that was one of the first things I added: Google Test so that we could write automated tests. I think in the seven years that I was there, I managed to convince only one other programmer to write tests in game code, even though I remember giving talks about it and even walking some people through how to write a test. (and that one programmer then left the company to work at Google…) Google Test is even easier to use than my testing framework: In any file you invoke the “TEST” macro and that’s all you have to do. I made the tests always run on startup. You didn’t even have to compile and run a separate executable… So even though it was incredibly easy to write tests, explained in thirty seconds and done in a minute, nobody did it.

I did manage to convince some engine programmers, but the game code remained test-free except for my code. I think the main reason is that everyone was always underwater. People were constantly behind. They had no time to save themselves time by writing tests. Things were always breaking and needed fixing. No time for tests. Our company values were “Passion, Courage, Craftsmanship” and we gave awards for each. The first award for “Craftsmanship” was given to a guy who caused a large amount of our bugs and would then heroically step in to fix them. People saw how good he was at firefighting and rewarded him for it. Never mind that he was also causing most of those fires… In that company culture you can’t get people to write tests. (this programmer certainly would never write tests)

At some point it felt like I had convinced the company that we should really write more tests. Everyone gave lip service to it, and we would get around to writing more tests really soon. One concrete thing that happened was that the tech director ordered the most junior programmer to help write some tests. This guy was so junior we didn’t even hire him as a programmer. He was just a QA guy who knew a little bit of programming. I was glad that somebody else was writing tests, and didn’t realize quickly enough that he was causing more problems than he was preventing. He was writing terrible tests and breaking existing tests. I had originally written this testing framework to test our editor. It worked well for both the editor and for AI tests, and would have probably worked for other things, too. At some point the tools programmer from another team asked about the automated testing work I had done. Unfortunately at that point the editor tests had been broken beyond repair by the QA guy who was “helping out” so there was nothing I could point them at… (and I had no time to bring things back into order, see the points above about the small team and about things constantly breaking) So no other team adopted this testing framework either…

How the Story Ends

The testing framework got a couple more neat features by the end, like the ability to define tests only in content, without even needing to make code changes. (one of our AI designers used that) But as far as I know the testing framework is abandoned now. It was on a branch of our codebase that was abandoned: The Just Cause 4 codebase was not continued. theHunter: Call of the Wild became the new main branch that all future projects were branched off from. So it would have required someone porting the testing framework to that branch. Unfortunately a different programmer wrote a different testing framework in theHunter codebase. There was no coordination at all (that was another thing about Avalanche: Not much coordination, lots of repeated work. JC4 shipped with four different gameplay-scripting systems because people kept on writing new ones because the old ones were bad, but the new ones also didn’t have all the features of the old ones and couldn’t replace them, so all four were in use and our gameplay-scripting was an undebuggable mess…) which meant that I didn’t hear about this framework until it was too late. Someone made the decision to go with that one because it was already on the right branch, and mine was abandoned even though it had more features, actually made it easy to write and run tests, and was battle-tested. (Nobody thought to consult me in the conversation where this was decided, they only were kind enough to tell me afterwards)

So the only legacy of this testing framework is the document above. And I did at least succeed in convincing people to want more testing at Avalanche. I left the company at the end of 2019 so I don’t know what happened since. The other testing framework was probably “good enough” so maybe it has seen a lot of use since. I don’t know what the situation is at other game companies. The whole industry was oddly bad at keeping things working. Maybe this article will improve things slightly. I mostly used the testing framework to test AI code because I was an AI programmer in JC4, but I wrote the first version when I was a tools programmer in JC3, and I had also used it to test our editor. (that’s where it was important to call any C++ function. Those tests had a lot more C++ code in them) So I think it is more widely useful. For example if you had the ability to do player input from code, the fiber-based control flow would make it easy to control the player, too. (write functions to “aim at this point” and then “press the right trigger”) It’s unfortunate that I didn’t get a chance to grow it further. I also don’t have the source code, but I’d be happy to elaborate on any details in the comments.

C++ Coroutines Do Not Spark Joy

Malte Skarupke — Mon, 01 Nov 2021 02:57:51 +0000

C++20 added minimal support for coroutines. I think they’re done in a way that really doesn’t fit into C++, mostly because they don’t follow the zero-overhead principle. Calling a coroutine can be very expensive (requiring calls to new() and delete()) in a way that’s not entirely under your control, and they’re designed to make it extra hard for you to control how expensive they are. I think they were inspired by C# coroutines, and the design does fit into C# much better. But in C++ I don’t know who they are for, or who asked for this…

Before we get there I’ll have to explain what they are and what they’re useful for. Briefly, they’re very useful for code with concurrency. The classic example is if your code has multiple state machines that change state at different times: E.g. the state machine for reading data from the network is waiting for more bytes, and the code that provides bytes is also a state machine (maybe it’s decompressing) which in turn gets its bytes from another state machine (maybe it’s handling the TCP/IP layer). This is easier to do when all of these can pretend to operate on their own, as if in separate threads, maybe communicating through pipes. But the code looks nicer if the streamer can just call into the decompressor using a normal synchronous function call that returns bytes. Coroutines allow you to do that without blocking your entire system when more bytes aren’t immediately available, because code can pause and resume in the middle of the function.

One of the best things the C++ standard did is to define the word “coroutine” as different from related concepts like “fibers” or “green threads”. (this very much went against existing usage, so for example Lua coroutines are not the same thing as C++ coroutines. I think that’s fine, since the thing that was added to C++ could be useful, and there is a good case to be made that the existing usage was wrong) In the standard, a coroutine is simply a function that behaves differently when called multiple times: Instead of restarting from the start on each call, it will continue running after the return statement that last returned. To do this, it needs some state to store the information of where it last returned, and what the state of the local variables was at that point. In existing usage that meant that you need a program stack to store this state, but in C++ this is just stored in a struct.

To illustrate all of this, let’s build a coroutine in plain C++, without using the language coroutines:

Manually Building a Coroutine

We’ll take a very simple example. C++ coroutines transform this code:

generator_coro range(int stop, int step = 1)
{
    for (int i = 0; i < stop;)
    {
        co_yield i;
        i += step;
    }
}

Into this code:

struct range_struct
{
    int i;
    int stop;
    int step;
    enum {
        At_start,
        In_loop,
        Done
    } state;

    explicit range_struct(int stop, int step = 1)
        : stop(stop), step(step), state(At_start)
    {
    }

    int resume()
    {
    switch (state)
    {
    case At_start:  for (i = 0; i < stop;)
                    {
                        state = In_loop;
                        return i;
    case In_loop:       i += step;
                    }
                    state = Done;
    case Done:      return 0;
    }
    }
};

The “co_yield” in the first listing is a new keyword. I said that repeat calls to a coroutine resume after the last return statement, but they actually resume after the last “co_yield” statement. (and co_yield is syntactically the same as “return”) I think this was done because you want two different ways of returning from a function:

co_return, which means “return and don’t call me again”
co_yield, which means “return and resume here on the next call.”

The second listing is a struct that looks exactly like what the compiler generates for the first listing. The transformation is this:

1. Turn all stack variables into member variables of a struct.
2. Rename the function to resume() and make it a member function of the struct.
3. Put a switch/case around the entire body of the function, as in Duff’s Device.
4. Add a new case to your enum for every co_yield.
5. Add a Done case to indicate that the function is over. It doesn’t matter what I return at the end: the value returned from Done should never be used.

These transformations work even for functions that are much more complicated. This might be familiar from Simon Tatham’s classic article Coroutines in C which does this transformation using static variables instead of class member variables.
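
For comparison, here is a rough sketch of what the static-variable flavor of the same transformation could look like for the range example. This is not code from Simon Tatham’s article (which wraps the pattern in macros); the name range_static and the -1 “exhausted” return value are made up for illustration.

```cpp
// Sketch of the static-variable variant: the coroutine state lives in
// function-local statics instead of struct members.
// range_static and the -1 "exhausted" value are assumptions for this example.
int range_static(int stop, int step)
{
    static int state = 0; // 0 = at start, 1 = in loop, 2 = done
    static int i = 0;
    switch (state)
    {
    case 0: for (i = 0; i < stop;)
            {
                state = 1;
                return i;
    case 1:     i += step;
            }
            state = 2;
            [[fallthrough]];
    case 2: return -1;
    }
    return -1; // not reached: state is always 0, 1 or 2
}
```

Because the state is static, only one such sequence can be in flight at a time. Moving the variables into a struct, as in the listing above, is what lets several independent instances exist.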

Aside: As an example of how existing usage of the word “coroutine” was different, Tom Duff (of Duff’s Device) said that this is not a good way to implement coroutines, because when using this trick you can not yield from nested functions. The C++ standard turned this around and said “it’s only a coroutine if you can not yield from nested functions. If you could, it would be a fiber.”

Let’s continue to build this coroutine by making it do something useful. Say we want to get this test to work:

TEST(range, coro_and_struct)
{
    std::vector<int> range_result;
    std::vector<int> range_struct_result;
    for (int i : range(10, 2))
    {
        range_result.push_back(i);
    }
    for (int i : range_struct(10, 2))
    {
        range_struct_result.push_back(i);
    }
    ASSERT_EQ((std::vector<int>{0, 2, 4, 6, 8}), range_result);
    ASSERT_EQ((std::vector<int>{0, 2, 4, 6, 8}), range_struct_result);
}

So here I allocate two vectors, then I loop over the results of the range() function (from the first listing) and also loop over the results of the range_struct() constructor (from the second listing), and both should give the same result. Knowing how the range_struct is implemented so far, what would you add to it to make this work?

We need iterators. Here is the minimal implementation to make this work for range_struct:

struct range_struct
{
    // ... unchanged top half of the struct

    struct end_iterator {};
    struct iterator
    {
        range_struct * self;
        bool operator==(const end_iterator&) const
        {
            return self->state == Done;
        }
        void operator++()
        {
            self->resume();
        }
        int operator*() const
        {
            return self->i;
        }
    };

    iterator begin()
    {
        resume();
        return { this };
    }
    end_iterator end()
    {
        return {};
    }
};

The iterator needs operator++ to advance, operator* to get a value and operator== to check if we’re at the end. The implementation of each is trivial, but it might take some thinking to see that this does the right thing. One odd thing is that when constructing the iterator, in begin(), we also have to call resume() once. This is just an accident of the C++ iterator interface. These coroutines actually map better to the Rust iterator interface where next() would directly wrap resume(). To see the need for the call, think about what would happen if we didn’t call resume() once, and if stop==0. The first call to operator== would return false because the state would still be At_start, so the loop body would run once for a range that should be empty.
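
To make the comparison with a next()-style interface concrete, here is a minimal sketch in C++ (needs C++17 for std::optional). The helper next_value is hypothetical and not part of the code above; range_struct is repeated so that the snippet stands on its own.

```cpp
#include <optional>

struct range_struct
{
    int i;
    int stop;
    int step;
    enum { At_start, In_loop, Done } state;

    explicit range_struct(int stop, int step = 1)
        : stop(stop), step(step), state(At_start)
    {
    }

    int resume()
    {
        switch (state)
        {
        case At_start:  for (i = 0; i < stop;)
                        {
                            state = In_loop;
                            return i;
        case In_loop:       i += step;
                        }
                        state = Done;
        case Done:      return 0;
        }
        return 0; // not reached
    }
};

// Hypothetical helper: one call both advances the coroutine and reports
// whether a value was produced, so no priming call to resume() is needed.
std::optional<int> next_value(range_struct & r)
{
    int value = r.resume();
    if (r.state == range_struct::Done)
        return std::nullopt;
    return value;
}
```

A consumer would then loop with something like while (auto v = next_value(r)) and never has to ask “are we at the end?” separately from “give me a value”.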

It’s possible to avoid reading the internal state of the function. Instead of reading self->i, we could also store the return value of resume() and remember it. Then this would work for any function that followed the conversion steps I outlined above, but I wanted to keep the implementation trivial for educational purposes.
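
As a sketch of that alternative, the iterator below remembers whatever resume() returned instead of peeking at self->i. The member name current is my own choice, and the extra operator!= is only there so that the snippet also builds before C++20.

```cpp
struct range_struct
{
    int i;
    int stop;
    int step;
    enum { At_start, In_loop, Done } state;

    explicit range_struct(int stop, int step = 1)
        : stop(stop), step(step), state(At_start)
    {
    }

    int resume()
    {
        switch (state)
        {
        case At_start:  for (i = 0; i < stop;)
                        {
                            state = In_loop;
                            return i;
        case In_loop:       i += step;
                        }
                        state = Done;
        case Done:      return 0;
        }
        return 0; // not reached
    }

    struct end_iterator {};
    struct iterator
    {
        range_struct * self;
        int current; // last value returned by resume()

        bool operator==(const end_iterator &) const
        {
            return self->state == Done;
        }
        bool operator!=(const end_iterator & e) const // needed before C++20
        {
            return !(*this == e);
        }
        void operator++()
        {
            current = self->resume();
        }
        int operator*() const
        {
            return current; // never touches the loop variable 'i'
        }
    };

    iterator begin()
    {
        int first = resume();
        return { this, first };
    }
    end_iterator end()
    {
        return {};
    }
};
```

The iterator now only depends on resume() and on the Done state, which is the part that every struct produced by the conversion steps has in common.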

If this is all the compiler does, then what is the problem? The problem is that they didn’t stop there.

Heap Allocation

One problem with the above is step 1 of my conversion steps: “Turn all stack variables into member variables of a struct.” On more complex functions this can clearly be wasteful. If there was more work to be done after the end of the for-loop, the compiler could reuse the stack space that used to be taken up by the variable ‘i’. So the C++ standard doesn’t want to describe the layout of these coroutine structs. And because of that you’re not allowed to see this struct. The language hides it from you completely, even more than the types of lambda functions. You can’t even use sizeof() on it. How do you hide a type even more than lambda function types? They hide it behind a pointer. And that pointer points at a heap allocation.

So every call to a coroutine includes a call to new(). And at the end of the coroutine, a call to delete(). How did they get this into the standard? Who bought the story of “we wanted this compiler optimization, and all we had to do was introduce a call to new() and a call to delete()” which is clearly not an optimization? The way to get this into the standard is to promise that the coroutine can be inlined, and if it’s inlined, no calls to new() and delete() are necessary.

This sounds good, and the simple generator example does get inlined for me, but it didn’t seem to work for anything more complicated. I was trying to implement Differential Dataflow in a way that would allow the code to mostly look normal. So this was building up a graph of nested coroutines, each operating on pipes from other coroutines. This is similar to the simple range generator above, except every function is operating on infinite input pipes and there are joins and splits and aggregations. The lifetimes are often not clear, but sometimes they’re very clear and even then I didn’t get inlining.

This is less of a problem if your coroutines are long-lived. Suspending and resuming a coroutine is fast once it has been allocated. This makes sense if you think back to the struct we created above: The heap allocation only affects the constructor and destructor. But if you have small utility coroutines, a heap allocation can really cost you.

This is a big change in the language. The language has never wanted to give you control over inlining. The “inline” keyword is intentionally vague. There are non-standard ways of saying “don’t inline” and “please, try really hard to inline this” but no guarantees. In most code the difference between inlining and not inlining isn’t big. When it does matter, the difference can easily be 2x, 5x or more. But if not inlining means adding heap allocations, then suddenly not inlining can mean that your code runs 100x slower. With that big of a difference, we suddenly need real control over whether something is inlined or not. Since a coroutine is just a struct, we should already be able to control this: Just put it on the stack, or make it a member of a different struct, but the language intentionally forbids that, insisting that you leave it up to the compiler.
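
For reference, these are the non-standard spellings I have in mind, wrapped in two macros. The macro names NEVER_INLINE and FORCE_INLINE are made up for this sketch; the attributes themselves are the GCC/Clang and MSVC extensions, and even the “force” variants remain requests that the compiler can decline in some situations.

```cpp
// Compiler-specific inlining hints. None of this is standard C++.
// NEVER_INLINE and FORCE_INLINE are hypothetical names for this example.
#if defined(_MSC_VER)
#  define NEVER_INLINE __declspec(noinline)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define NEVER_INLINE __attribute__((noinline))
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define NEVER_INLINE
#  define FORCE_INLINE inline
#endif

// Small helper that we would like to disappear into its callers.
FORCE_INLINE int add_one(int x)
{
    return x + 1;
}

// Function that should stay a real call, e.g. to keep it visible in a profiler.
NEVER_INLINE int add_two(int x)
{
    return add_one(add_one(x));
}
```

For ordinary functions like these, the hints only move performance around a little. The point of the paragraph above is that for coroutines the same decision can also decide whether a heap allocation happens.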

So if something needs to be inlined, what do you use?

Macros

My coroutine code quickly accumulated macros. And not just for performance reasons. There are several ways in which coroutines don’t compose nicely. As an example, let’s stick to the generator code above and make it slightly more complex. Let’s say it first emits the numbers zero to nine, then calls something else that also returns a generator:

generator_coro foo()
{
    for (int i = 0; i < 10; ++i)
        co_yield i;
    generator_coro rest = bar();
    // ... how do I return all the values from rest?
    return rest;
}

This does not compile. The type for rest is generator_coro, and it looks like the function returns a generator_coro, but actually that’s the type of the coroutine wrapper, and the wrapper expects that we return ints. (as can be seen by us yielding ints before)

So instead you have to write this:

generator_coro foo()
{
    for (int i = 0; i < 10; ++i)
        co_yield i;
    generator_coro rest = bar();
    for (int i : rest)
        co_yield i;
}

OK that isn’t too bad. We just have to store the other coroutine on our stack and then yield all values from it. In my code I was composing coroutines like this a lot, which is all fine until I needed to make a change. I wanted to slightly change how to forward values like this. For the sake of example let’s say I want to set a flag “is_draining” before doing this loop. So it should look like this instead:

generator_coro foo()
{
    for (int i = 0; i < 10; ++i)
        co_yield i;
    generator_coro rest = bar();
    rest.set_is_draining(true);
    for (int i : rest)
        co_yield i;
}

Unfortunately the code was manually inlined in various places like this: to completely drain and yield a generator, I was always iterating through the list, so when I needed to set the flag before the loop, I had to change every place. How would I fix this in normal code? Write a function that encapsulates the repeated code:

generator_coro drain_and_yield_all(generator_coro & coro)
{
    coro.set_is_draining(true);
    for (int i : coro)
        co_yield i;
}

This seems reasonable: Now I can call this function to drain a generator. But you can’t do this. This is a new coroutine. If I try to use this in the outer function, I get this:

generator_coro foo()
{
    for (int i = 0; i < 10; ++i)
        co_yield i;
    generator_coro rest = bar();
    generator_coro rest2 = drain_and_yield_all(rest);
    // ... what now?
}

The function drain_and_yield_all returns a whole new coroutine, which I can’t return. So I’m right back at the problem where I started.

The only way to solve this is to use macros. Coroutines don’t compose without macros. I can’t force the function “drain_and_yield_all” to be inlined, therefore it generates a new coroutine (probably allocated on the heap) and therefore I can’t write this little reusable helper. If I want to have one standardized way of doing this that I can easily change, I need to use a macro:

#define DRAIN_AND_YIELD_ALL(x) if (true)\
{\
    auto && coro = x;\
    coro.set_is_draining(true);\
    for (auto && i : coro)\
        co_yield i;\
} else static_cast<void>(0)

Not pretty, but it’s the only way.

Yielding Values

In my manual code transformation above I was just reading the value ‘i’ from the stack of the coroutine. Why make it any more complicated?

The standard makes this slightly more abstract. When I returned the type “generator_coro” above, that is the same type no matter what the body of the function looks like. I haven’t shown what that type looks like because it’s messy, but I have to explain it a little bit. In theory it’s just a wrapper that provides the iterator interface so we can use the type in a loop, but it has to work for all functions. In my manual transformation above we saw that for each function we get a different struct, so how can one wrapper work for all the different structs you might get? The heap allocation makes it easier, but the main reason is that the wrapper can’t rely on anything in the function. We can’t access its stack variables. So how does co_yield work?

Between the coroutine and the caller there is a channel of communication called the “promise type”. It’s where co_yield can store a value, and it’s where the wrapper can read from. This promise_type is the same for all coroutines that return the same wrapper. So in co_yield we’re actually just storing the value in the promise_type, and then the wrapper reads it from the promise_type. This is clearly inefficient in my example where I’m just yielding a stack variable. Do we really have to make a copy of that?

One improvement is to store a pointer in the promise_type. That’s overkill for an int, but in general it’s faster. We can safely point at anything in the coroutine body, because those are all just members of the struct which lives on. This saves us a copy.

But we can do one step better: Turns out direct reading from the stack is allowed in one special case: The promise_type is stored on the stack of the coroutine. So all coroutines that return the same wrapper will generate a struct that has the promise_type as one of its members. And we are allowed to read from the struct as long as we’re reading from the promise_type. The problem is that the coroutine can’t write to the promise_type. If it could, I would just have put my loop variable, ‘i’ into the promise_type.

Luckily David Mazieres found a way to do the sane thing. Unfortunately the code looks a bit insane. There is another new operator, co_await, which I haven’t explained at all. It’s intended to be used when your coroutine waits for the result of a long operation, say a network call. In that case you can use co_await, which means “return for now and ask the object that I’m returning when it’s OK to call me again.” So in practice you co_await some kind of object that indicates when the network request is done. But we’re not going to use it for that. Turns out co_await can be used as a hack to get the promise object, because the awaitable object is allowed to read from the promise_type, just like the wrapper is allowed. So we’re going to create an awaitable object that looks like this:

template<typename PromiseType>
struct GetPromise
{
  PromiseType * p;
  bool await_ready()
  {
      return false; // says we're not ready, call await_suspend
  }
  bool await_suspend(std::coroutine_handle<PromiseType> h)
  {
    p = &h.promise();
    return false; // says no don't suspend coroutine after all
  }
  PromiseType * await_resume()
  {
      return p;
  }
};

Each of these functions has a meaning according to the standard. But really what’s relevant here is that in await_suspend we’re allowed to access the promise_type. We’re just going to stash a pointer away, and in await_resume we’re going to return it. And we never actually suspend the coroutine. So in the body of a coroutine you can do this:

generator_coro range(int stop, int step)
{
    generator_coro::promise_type * promise = co_await GetPromise<generator_coro::promise_type>{};
    for (promise->i = 0; promise->i < stop;)
    {
        co_yield promise->i;
        promise->i += step;
    }
}

It’s kinda gross, but we know for a fact that the promise is stored right next to “stop” and “step” in the auto-generated struct of this coroutine. So this is how you yield a value directly from the stack without making a copy and without having to use a pointer. The co_yield call should compile down to a self-assignment, i=i.

This looks insane, but if you think of the actual generated instructions, it’s the only sane way to pass data between the coroutine and its handle. Once you understand how coroutines work, it only makes sense for the wrapper to read values right from the stack, and for the coroutine to write values to the stack. And the only allowed place to do that is in the promise_type. We aren’t supposed to access it, even though it is on our stack, but that’s just the standard being weird.

One nice thing would be to pull this into a wrapper function, so that I could write this instead:

generator_coro range(generator_coro::promise_type * promise, int stop, int step)
{
    for (promise->i = 0; promise->i < stop;)
    {
        co_yield promise->i;
        promise->i += step;
    }
}

So I just receive the promise as one of my arguments. Alas, that wrapper is impossible to write, because we would have to force inlining to make it work, and we can’t. You can make it slightly less ugly with macros though…

Operators for Fibers

One thing I was looking forward to was to finally have the co_yield and co_await operators in the language so that I could use them with fibers. Everything is customizable enough that this should be possible. Before C++20 I would have to create a macro like this in my fibers:

#define FIBER_YIELD(x) if (fiber_yield(x)) static_cast<void>(0); else return false

In this I assume that “fiber_yield()” returns true if the fiber should continue running, and false if the fiber should early-out. (probably because it’s being destructed, so we need to tear down the stack, so it’s absolutely necessary to return immediately) The co_yield operator has this behavior built-in and I want to use it as well in fibers. Unfortunately as soon as you use the keyword, your function is no longer a function, it’s now a coroutine. Meaning it needs a wrapper and it needs to be heap-allocated. This is terrible because fibers don’t have the problems with composition I mentioned above: It’s much easier to write little helper functions because in a fiber you’re allowed to yield from a nested function. But if those little nested functions have to be coroutines, that ruins everything.

Conclusion and Advice

Overall I’m still happy that coroutines exist. I liked experimenting with them. But everyone’s reaction seems to be the same: These coroutines quickly deflate your energy. They don’t spark joy. Everything is just a bit clunky and gross. You always have to worry if something gets inlined, because the difference between an inlined coroutine and a not-inlined coroutine is a much bigger cliff than for a normal function. And they’re just not ergonomic to use.

If you’re working on coroutines for other languages, do a couple of simple things:

Give me access to the generated struct. Allow me to put it on the stack of another function. Or as a member of another struct. Then I can store it on the heap if I want, but don’t force me.
Think more about how these compose. This gets a lot better already if I have control over the struct, because then if I want to write a small utility function, I can put it on the stack and don’t have to worry about heap allocations
Allow me to use the same operators for fibers. Don’t turn my function into a coroutine just because I use “co_yield”
Give me a saner way to access stack variables. Maybe just allow me to write “public” on stack variables. So that the initial code would have looked like this:

generator_coro range(int stop, int step = 1)
{
    public int i;
    for (i = 0; i < stop;)
    {
        co_yield i;
        i += step;
    }
}

I pulled this out into a separate line to make this clearer, but it should be valid to make any stack variable public, no matter how it is declared. Since these just become struct members, using the “public” keyword even makes sense because it’s used for the same purpose: I make something accessible to users of the struct.
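
To show what the first two wishes would buy, here is a sketch using hand-written structs like the range_struct from the beginning. The outer foo_struct is hypothetical: it yields 0, 1, 2 and then forwards everything from an inner range_struct, which is stored as a plain member. No heap allocation is involved and sizeof works.

```cpp
struct range_struct
{
    int i;
    int stop;
    int step;
    enum { At_start, In_loop, Done } state;

    explicit range_struct(int stop, int step = 1)
        : stop(stop), step(step), state(At_start)
    {
    }

    int resume()
    {
        switch (state)
        {
        case At_start:  for (i = 0; i < stop;)
                        {
                            state = In_loop;
                            return i;
        case In_loop:       i += step;
                        }
                        state = Done;
        case Done:      return 0;
        }
        return 0; // not reached
    }
};

// Hypothetical outer coroutine, transformed by hand with the same steps.
struct foo_struct
{
    int i;
    range_struct rest; // inner coroutine stored inline, not behind a pointer
    enum { At_start, In_first_loop, In_rest_loop, Done } state;

    foo_struct(int stop, int step)
        : rest(stop, step), state(At_start)
    {
    }

    int resume()
    {
        switch (state)
        {
        case At_start:
            for (i = 0; i < 3;)
            {
                state = In_first_loop;
                return i;
        case In_first_loop:
                ++i;
            }
            state = In_rest_loop;
            [[fallthrough]];
        case In_rest_loop:
            {
                // Forward one value from the inner coroutine per call.
                int value = rest.resume();
                if (rest.state != range_struct::Done)
                    return value;
            }
            state = Done;
            [[fallthrough]];
        case Done:
            return 0;
        }
        return 0; // not reached
    }
};

// The size is known at compile time, so the struct can live anywhere.
static_assert(sizeof(foo_struct) >= sizeof(range_struct),
              "inner coroutine is stored inline");
```

Calling resume() on foo_struct(6, 2) produces 0, 1, 2, 0, 2, 4 and then switches to Done. Changing how the inner coroutine is drained is a local edit in one place, without macros.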

As the coroutines are, I don’t quite know what they’re for. They seem like they might be useful, but nobody seems excited by them. Anyone who has long-running concurrent tasks switched to using fibers ages ago. See this good talk from 2012, or this better talk from 2015. I don’t see how to use coroutines here. The threat of having a heap allocation when your inliner changes its mind ruins it. (or more likely for complex code like this, the inlining never works to begin with)

I’m curious if anyone actually finds something useful to do with these. Are they just for simple generator functions?