Automated Game AI Testing

by Malte Skarupke

In 2018 I wrote an article for the book “Game AI Pro 4” called “Automated AI Testing: Simple tests will save you time.” The book has since been canceled, but the article is now available online on the Game AI Pro website.

The history of this is that in 2017 there was a round table at the Game Developer Conference about AI testing. And despite it being the year 2017, automated testing was barely even mentioned. It was a terrible round table. A coworker who sat in the audience with me said that I could have given a better talk because I had invested a lot of work into automated testing. So the next year I submitted a talk about automated AI testing and was rejected. But they asked me to write for the book instead. Now the book is canceled, too. A copy of the article is below, with some follow-ups at the end. It’s written for people who have never done automated testing, like the AI programmers at the round table. But I think the core trick, fiber-based control flow that can wait for things to happen, could be widely useful:

Introduction

Game programmers have been slow to adopt automated testing. There are many reasons for that, one of which is that games fall under the category of code that is “hard to test.” But inroads are being made: I have seen good test coverage for lower level libraries, and graphics programmers have figured out how to test their code well in recent years. (automated screenshot diffing with good visualization showing how two screenshots differ) I hope to include AI code in the group of code that is “easy to test,” and I will explain how to do that in this chapter.

The main insights are that you can reproduce a lot of AI bugs with no more than two characters, that fibers are a great match for writing sequential code that takes many seconds to run, (such as AI tests) and that most complicated failures are a result of simple underlying causes that can be tested in isolation. I will illustrate how to write a simple testing framework that not only helps us write tests but also makes debugging easier.

Getting Value from Automated Testing

These days it’s uncontroversial to say that automated tests are an accepted best practice for programmers. An informal survey among my coworkers found that most of them want to write more tests; they’re just finding it hard to do so. It’s understood that all code lies on a spectrum from “easy to test” to “hard to test,” with things like std::vector and std::sort being on the “easy” side and video games being on the “hard” side.

But AI code is actually fairly easy to test manually: Often all we have to do is set up a test level with a few AI characters and watch what happens. We will try to convert that ease of manual testing to an ease of automatic testing. In fact we will start with the simplest, easiest automated test.

If you’re new to testing, you have to start testing the easy things. There is no shortcut to getting better at testing, and if you start with difficult tests, you will just get frustrated and give up. So if you are writing a small utility function or a container, (for example your own spatial data structure, which C++ has a shortage of) start with writing tests for those. Then as you get more experienced you should not move on to the things that are “hard to test,” but instead you will find that more things now seem “easy to test” than before. Maybe that big, hairy class actually has an ad-hoc implementation of a container inside of it. If you pull that out, you can test that code, making the class less hairy and giving you a safety net for further changes. The more tests you write, the better you’ll get at this.

So the first lesson for getting value from automated tests is this: If something is hard to test, don’t write a test for it. It’s just not worth the time invested in writing and maintaining the test. Instead try to find a way to make it easier to test. But don’t fret if you can’t. You don’t need to write tests for everything.

To really get value out of tests though, we have to see that tests have a second dimension. Not only are some tests “easy to write” and others “hard to write”, there is also a dimension of specificity: Some tests tell you exactly what is wrong, others are only able to tell you that “something broke.” The first kind of test can significantly cut down on your debugging time. The second kind of test doesn’t shorten your debugging time, it merely points out that there is a problem. An example of a test that’s “easy to write but not specific” is a smoke test. A smoke test is usually a very simple test like “start the game, shut down the game, tell me if there were any errors.” It’s a good test to run automatically after every submit to version control. It finds problems surprisingly often. However all it can tell you is that there is a problem. It doesn’t help you narrow down where the problem came from: If a smoke test tells you that you get an error on shutdown, that will take exactly as long to debug as if another person tells you that they got an error on shutdown. (that’s why you have to run smoke tests on every submit, because that narrows down potential culprits)

With that, the best kind of tests are those that are easy to write and tell you exactly what is wrong. At the peak of that are unit tests: If a unit test for std::sort breaks, you know exactly where to start looking for the problem.

Since unit tests can be so valuable, I will talk a little bit about unit testing, but most of AI testing doesn’t fit into unit tests, so after a brief detour into unit tests, we will talk about how to write AI tests specifically.

Unit Testing

Unit tests are the best kinds of tests because they are easy to write and point at a specific area of code for the source of the problem. I will keep this section short because I have found it hard to write unit tests for most of my AI code, but I still felt it necessary to include this section because unit tests can be so valuable.

In AI code I have found that unit tests are appropriate for small utility functions like matrix math, or for utility classes like spatial data structures. As an example, you can see a simple unit test in the code below. The test is written using Google Test, which is the unit testing library I use most often.

struct VisionCone
{
    float cos_angle; // result of dot product
    float radius;
    float vision_score;
};
struct VisionConeCollection
{
    VisionConeCollection(std::vector<VisionCone> cones);
    float ScoreAt(Matrix4x4 head_matrix, Vec3 position) const;
    // ...
};

TEST(vision_cones, single_cone)
{
    VisionConeCollection c({ { 0.7071f, 10.0f, 0.5f } });
    Matrix4x4 id;
    // in cone
    ASSERT_EQ(0.5f, c.ScoreAt(id, { 0.0f, 0.0f, 5.0f }));
    // too far away
    ASSERT_EQ(0.0f, c.ScoreAt(id, { 0.0f, 0.0f, 15.0f }));
    // behind head matrix
    ASSERT_EQ(0.0f, c.ScoreAt(id, { 0.0f, 0.0f, -5.0f }));
}

I’ve included summaries of the VisionCone and VisionConeCollection classes so that you can understand what’s going on in the test, but we won’t be too concerned with those specific classes. The idea is to have multiple vision cones, each of which has a different “vision score.” But what’s important about this unit test is that

  1. It was fast to write. My rule of thumb is that it should be faster to write a test and to run it than it takes to test the same code in the game. If it takes me a minute to launch the game and to spawn AI characters on which I can test my vision code, then it should take less than a minute to write and run the test. If a test is the fastest way to run your code, you will write more tests. This is mostly a pipeline issue and you simply have to set up your dev environment to make it fast to run tests.
  2. It runs fast. Google Test always shows how long a test takes, and this test always runs in “0ms.” If a test of mine takes longer than 10 milliseconds, I either rewrite it or delete it. If you have too many slow tests around, you will not use tests as often.

This simple example will be all I use to illustrate how to use unit tests. The earlier that you start to write tests for a piece of code, the better the interfaces of that piece of code will be for testing. From here it’s easy to extend the tests by adding more cases. Or let’s say you implement different behavior for the y-axis than for the xz-plane. You should add a test for it to get faster iteration times. Or if you want to add hysteresis so that visibility doesn’t flicker on and off when the player is standing right on a threshold, add a test for it. You can also add edge cases (like what if a character asks if it can see itself) and be sure that they never break. As the feature grows, the tests grow with it.
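As an illustration of how such a case might look, here is a hedged sketch that reuses the collection from above and adds a second cone. It assumes that when cones overlap, the narrower cone’s score wins, which may differ from your implementation:

TEST(vision_cones, two_cones)
{
    VisionConeCollection c({
        { 0.7071f, 10.0f, 0.5f },    // wide 45 degree cone, low score
        { 0.9659f, 10.0f, 1.0f } }); // narrow 15 degree cone, high score
    Matrix4x4 id;
    // straight ahead: inside both cones, the narrow cone's score wins
    // (assumption: overlapping cones resolve to the narrowest match)
    ASSERT_EQ(1.0f, c.ScoreAt(id, { 0.0f, 0.0f, 5.0f }));
    // off to the side: only inside the wide cone
    ASSERT_EQ(0.5f, c.ScoreAt(id, { 3.0f, 0.0f, 5.0f }));
}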

Confidence for More Complex Tests

I can sense your skepticism about the previous example: AI vision rarely breaks because you introduced a bug in the vision cone code. Instead AI vision probably breaks because AI characters are now wearing helmets, and the raycast is hitting the helmet. Or because an artist introduced a new glass model and forgot to set the see-through flag on it. And even if AI vision breaks, that’s an easy bug to fix. The hard bugs are results of several characters interacting in unexpected ways. How are you possibly supposed to test all that?

To start with, remember the first lesson: We’re not going to write tests for things that are hard to test. But if we’re only going to write easy tests, how much value are we going to get? Research from other fields suggests that we’ll get a lot of value.

In a study on concurrency bugs, Lu et al. found that 96% of concurrency bugs can be reproduced by enforcing a certain execution order of just two threads. Similarly 97% of deadlock bugs are the result of two threads waiting for two resources. In a study on distributed computing, Yuan et al. found that 84% of bugs are reproducible with just two computers. 98% are reproducible with three computers. I claim that something similar is true for AI code: Most AI bugs are reproducible with two characters. I don’t have the percentage numbers on how many bugs exactly can be caught with simple tests, but the numbers from the other fields should give us some confidence that it’s a good amount.

I want to clarify that I’m not talking about simple bugs here. In a talk about Ding Yuan’s study on distributed computing, he emphasizes that a lot of the investigated bugs were really complicated. But when you trace the bug all the way back, you often find a root problem that would have been easy to detect. Similarly in AI we often see confusing behavior, and we have to look at several frames of history to find out that the AI is acting weird because of a simple root cause. Maybe it failed to play a certain animation, or the animation failed to move the character. Or maybe as soon as it got into a vehicle it stopped seeing its target because the raycasts collide with the vehicle.

When that happens wouldn’t you rather have a test that verifies that the AI can move to where it’s supposed to be able to move to? Or that an AI can still see when it’s sitting in a vehicle? If one of those tests breaks, the problem is a lot easier to fix than if you’re just seeing unexplained weird behavior in the game. Even if the problem ended up being caused by something else, it’s still useful to quickly rule out whole categories of problems: “it’s probably not an AI vision problem because the test for that is currently green.”

Testing in a Game Engine

You probably already have a bunch of manual tests for your AI. Test levels with gray boxes where you can spawn a few characters and observe them in simple scenarios. You probably also have a “free-fly mode” (or “ghost mode”) in which no player character spawns and you can just observe the AI. All we have to do is run those manual tests automatically and detect whether they behave the way we want or not.

How to jump into a test level and set the game into a mode where you can observe your AI will depend on your engine. You should set it up to be able to launch tests from a command line argument, or using an in-game command. The command line argument is for launching the test on an autobuilder. The in-game command is useful to quickly launch tests on other people’s computers to verify if something works or to show bugs to other programmers. But then what does writing a test actually look like?

After a few misguided attempts, I realized that fibers are a perfect match for writing the kind of logic that a test needs. The main benefit is the ability to suspend a fiber at any point in the function. That makes it possible to write similar checks as in Google Test, but to give the engine time to fulfill the criteria.

As an example here is a very simple test: I spawn two characters of opposing factions. One of them has a weapon, the other is brain-dead. The test simply asserts that the character with a weapon will defeat the brain-dead character.

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    Character* bd = GetObjectByAlias<Character>("braindead");
    JT_ASSERT(bd != nullptr);
    return test_runner.WaitUntil([&]
    {
        return !bd->IsAlive();
    },
    "Waiting for the character to be killed", 30.0f);
}

In this test most of the setup is done in content. The two characters are configured in the test level. I don’t have to refer to two characters in code because I only care about what happens to one of them. I get that character through its “alias,” which is an engine-specific feature to refer to objects from code or script. Your engine probably has a similar feature. The TestRunner is a small wrapper around the fiber that’s used internally. It provides one important function: WaitForOneFrame(). All other test functionality can be built on that function. The WaitUntil function I use in the example just calls the condition-lambda and if it’s false, calls WaitForOneFrame. I pass a message into the function to display on the screen while the test runs. That message will also be displayed on the autobuilder if the test fails in this step. The final argument is a time-out in seconds. If after 30 seconds the lambda still returns false, the test fails. The other possible place where this test can fail is in the JT_ASSERT macro, which simply returns TEST_FAILED if the given condition is false.

Before we get into how this is implemented, I want to point out a few features that make this easy to implement. First, the test doesn’t have to establish preconditions. The RunTest function is actually a virtual function in a class, and there is a second virtual function called GetPreconditions, which the test framework calls before calling this function. Since many tests require a similar setup, I wrote the code for that once. In GetPreconditions you can indicate a test level that you want to have loaded, the position of the camera, whether you want to have a player character or be in ghost mode, and you have the ability to turn off some global engine features. (such as spawning of traffic on roads) The test framework then ensures that all your preconditions are true before it runs the test, so that the RunTest function really only contains what’s necessary for the test. The precondition code is engine-specific, so I won’t go into it further.

The second important feature is the ability to specify a time-out for a condition. In a normal testing framework like Google Test you only want to assert that something is true after a given sequence of calls, but when testing AI you often can’t be that precise. Depending on the length of animations and on details of your behavior tree, things can take a few frames more or less. So instead I found it useful to check whether something becomes true in less than X seconds.

The third important feature is that this is written in normal C++ and that I have access to all of the normal functions of the engine. I believe that it is very important that tests are easy to write, and that is only the case if I can access a function from a test without having to do any extra steps. (such as exposing the function to a scripting language)

To start implementing a test framework, you just need the ability to implement the magic WaitForOneFrame function. If you have that, you can implement all other utility functions that you may need on top of that. As an example, I sometimes need the ability to get an object by its alias (as in the test above) but the object may not have spawned yet, and I just want the test to wait until the object is spawned. That is easy to implement as a utility function: Try to get the character and call WaitForOneFrame if it doesn’t exist yet. If the character doesn’t exist after X seconds, I fail the test.
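As a sketch, both the WaitUntil call from the earlier test and this alias-waiting utility reduce to the same loop around WaitForOneFrame. The helpers SetStatusMessage and GetLastFrameDeltaSeconds are assumptions for illustration, and the interruption handling discussed below is left out:

#include <functional>

jt::TestResult jt::TestRunner::WaitUntil(
    const std::function<bool()>& condition,
    const char* message, float timeout_seconds)
{
    SetStatusMessage(message); // hypothetical: shown on screen and on the autobuilder
    float elapsed = 0.0f;
    while (!condition())
    {
        if (elapsed >= timeout_seconds)
            return TEST_FAILED;
        WaitForOneFrame(); // suspend the test until the next frame
        elapsed += GetLastFrameDeltaSeconds(); // hypothetical delta-time accessor
    }
    return TEST_SUCCEDED;
}

template<typename T>
T* WaitForObjectByAlias(jt::TestRunner& test_runner,
                        const char* alias, float timeout_seconds)
{
    float elapsed = 0.0f;
    for (;;)
    {
        if (T* object = GetObjectByAlias<T>(alias))
            return object;
        if (elapsed >= timeout_seconds)
            return nullptr; // the caller then fails the test, e.g. with JT_ASSERT
        test_runner.WaitForOneFrame();
        elapsed += GetLastFrameDeltaSeconds();
    }
}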

So how do we implement the WaitForOneFrame function? I implemented it using a fiber, but you could also implement it using a thread. When using a fiber it’s a simple wrapper around the yield function provided by your fiber library. (for example it’s simply called yield in boost::coroutine)
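As a rough sketch with boost::coroutine2 (the successor of boost::coroutine), the whole mechanism fits in a few lines. In a real engine the coroutine would be created when a test is launched rather than at static initialization:

#include <boost/coroutine2/all.hpp>

using coro_t = boost::coroutines2::coroutine<void>;

// the test body runs inside the coroutine; calling yield() is WaitForOneFrame
coro_t::pull_type test_fiber([](coro_t::push_type& yield)
{
    // ... test logic goes here, calling yield() whenever it needs to wait
    yield();
});

// called by the game loop once per frame, at a point where it's safe for
// the test to access and mutate game state
void RunTestForOneFrame()
{
    if (test_fiber) // false once the test function has returned
        test_fiber();
}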

When using threads you can implement this using two semaphores: The test runs in its own thread, the rest of the game gets controlled by a main thread. At a good point in the frame, when it’s safe for the test thread to access and mutate global state, the main thread signals on the first semaphore that the test thread should run. It then waits on the second semaphore. The test thread was waiting on the first semaphore and runs now. In WaitForOneFrame it signals the second semaphore and waits on the first semaphore again. With that the two threads take turns. All other threads are sleeping while this is happening. They get woken up by the main thread when it is done waiting.
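Here is a minimal sketch of that handoff, assuming C++20 std::binary_semaphore (any counting semaphore works); it exposes the same per-frame hook as the fiber version:

#include <semaphore>

std::binary_semaphore g_test_may_run{ 0 };
std::binary_semaphore g_test_is_done{ 0 };

// called by the test thread inside WaitForOneFrame()
void WaitForOneFrame()
{
    g_test_is_done.release(); // hand control back to the main thread
    g_test_may_run.acquire(); // sleep until the main thread signals the next frame
}

// called by the main thread once per frame, when it's safe to touch game state
void RunTestForOneFrame()
{
    g_test_may_run.release(); // wake the test thread
    g_test_is_done.acquire(); // wait until it calls WaitForOneFrame() again
}

// note: a real implementation also releases g_test_is_done when the test
// function returns, so that the main thread doesn't wait on it forever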

Making the main thread and all other threads sleep while the test thread is running slows down the engine a little bit, but I haven’t found that to be a problem. It’s a price I’m willing to pay to make the tests reliable. I simply don’t have to worry about which state I can access or mutate because I know that nothing else is running.

Pausing, Canceling, Restarting and Repeating a Test

Running tests with timeouts comes with a few problems, all related to time. The first is that the timeout can make things hard to debug. In my example above of giving one character thirty seconds to kill a different character, what happens if something goes wrong and I want to debug the problem? I open our developer menu, turn on some debug drawing, look at the internal state of the AI, and before I know it the thirty seconds are over, the test fails and everything de-spawns. Oops. So the first additional feature you’ll want is to be able to pause the test. This simply means not calling into the test fiber. The game simulation continues to run. So in my example if I pause the test, it merely pauses the time-out. (but the character continues fighting) But if the test had multiple steps, the test wouldn’t advance to the second or third step of the test as long as it’s paused. The pause feature is toggled using a global variable that’s easy for me to change from our debug menu.

The second problem is that if you accidentally launch the wrong test, you have to wait 30 seconds or a minute for the test to finish. The solution for that is the ability to cancel the test. For that I changed the TestResult enum to have a TEST_INTERRUPTED value. The function WaitForOneFrame will now return that value when I issue the command to cancel the test from our debug menu. There are two subtleties with this:

The first subtlety with interruptions is that you have to make sure that the return value from waiting calls is always handled the same way, so that the test returns should it be interrupted. It might be possible to implement that using exceptions, but we (like many game developers) compile without exceptions. The second-best approach I have found is to implement a macro called JT_CHECK that returns on an interrupt or a failed test. Any call to a waiting function has to be wrapped in JT_CHECK. Here is what the macro looks like:

#define JT_CHECK(...) do {\
    ::jt::TestResult wait_result = __VA_ARGS__;\
    if (wait_result == ::jt::TEST_INTERRUPTED\
        || wait_result == ::jt::TEST_FAILED)\
        return wait_result;\
} while(false)

As usual in C++ macros, this isn’t pretty. (sorry) To use this in the test above, instead of returning the result of the wait function manually, I would wrap that call in a JT_CHECK macro. It doesn’t make a difference for a one-step test like that, but for a test consisting of multiple steps, every line that can wait has to be wrapped in this macro.

The second subtlety comes directly from this macro and from the JT_ASSERT macro: The test can return at any point. What this means is that all the state that the test creates has to be wrapped in RAII structs so that the state gets cleaned up at the end of the test. For example if the test spawns characters, they have to despawn when the test gets canceled, so there needs to be a RAII wrapper that despawns the character in its destructor. I avoided that in the test above by doing most of my setup in content, not in code, so the test framework handles this for me by unloading the test level.
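For state that is created in code, a sketch of such a wrapper could look like this; SpawnCharacter and DespawnCharacter are placeholder engine calls, not part of the original framework:

struct ScopedCharacter
{
    explicit ScopedCharacter(const char* archetype)
        : character(SpawnCharacter(archetype)) // placeholder engine call
    {
    }
    ~ScopedCharacter()
    {
        // runs even when JT_CHECK or JT_ASSERT returns early from the test
        if (character)
            DespawnCharacter(character); // placeholder engine call
    }
    ScopedCharacter(const ScopedCharacter&) = delete;
    ScopedCharacter& operator=(const ScopedCharacter&) = delete;

    Character* character;
};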

With that out of the way, we are able to interrupt tests. Which immediately allows us to implement the next feature: Restarting of tests. It’s an option in our debug menu that cancels the current test and immediately starts it again. This feature makes working with tests a pleasure because I can very quickly test a certain situation over and over again. If a problem happens one out of ten times, I create a test with the initial conditions of the problem, and restart the test until the problem occurs.

The final feature is the ability to run a test on repeat so that when it finishes, it automatically restarts itself. This is also now trivial to implement and is also done to reproduce rare issues.

With these tools the testing framework is not only a useful tool to find issues, it also saves us time by giving us additional debugging features such as the ability to run a scenario on repeat until a problem occurs.

Simple Tests

We finally have everything in place to start writing tests. So what kind of tests should we write? Earlier in the chapter I said we should test things that are “easy to test” and we should write tests that tell us specifically what is wrong. An easy test might be to set up a fight scene between ten characters of one faction on one side and five characters of an opposing faction on the other side. Then we assert that the side with ten characters wins. But even though that’s an easy test to write, it’s not very specific. A test like that can fail in a million ways. Also who says that five characters can’t win against ten characters? What if one of them gets lucky with a well-placed grenade?

So we want tests that tell us much more directly what is wrong. The test of “one character with a weapon against one brain-dead character” that I used as an example above is an improvement, but we can be more specific still.

A good test to start with is testing the perception of characters. Place two characters in an open field and assert that they can see each other. Place a wall between them and assert that they cannot see each other. Then do the same thing for a pane of glass. Then put one of the characters in a vehicle. Then put one of the characters behind a mounted gun on a raised platform. Add more edge cases as you encounter them.
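Written against the same framework, one of those static checks could look roughly like the sketch below; the aliases and the level layout are assumptions for illustration:

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    // assumed aliases from a test level: one observer, one target in the
    // open, one target behind a wall
    Character* observer = GetObjectByAlias<Character>("observer");
    Character* in_open = GetObjectByAlias<Character>("target_open");
    Character* behind_wall = GetObjectByAlias<Character>("target_behind_wall");
    JT_ASSERT(observer && in_open && behind_wall);
    JT_CHECK(test_runner.WaitUntil(
        [&] { return observer->CanSee(in_open); },
        "Waiting for observer to see the target in the open", 1.0f));
    JT_CHECK(test_runner.WaitUntil(
        [&] { return !observer->CanSee(behind_wall); },
        "Checking that the wall blocks vision", 1.0f));
    return jt::TEST_SUCCEDED;
}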

We can also make that test dynamic: Put two characters in the open and assert that they can see each other. Then make one of them walk behind a wall and assert that they can no longer see each other. This test can fail even if all the previous tests succeeded, for example if you have a bug in the logic for when to update vision information. (because you don’t want to do a raycast every frame) Next, if characters have an “investigate” mode you can add a third step to the test by testing that after one character has disappeared behind the wall, the other character “investigates” and catches up with the first character. That test can be seen below. I will use that test to illustrate a couple patterns that I often use.

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    Character* c = GetObjectByAlias<Character>("custom");
    Character* n = GetObjectByAlias<Character>("normal");
    JT_ASSERT(c != nullptr);
    JT_ASSERT(n != nullptr);
    c->SetFaction(PLAYER_FACTION);
    auto can_see = [&] { return n->CanSee(c); };
    JT_CHECK(test_runner.WaitUntil(can_see,
        "Waiting for normal to see custom", 1.0f));
    SendGlobalEvent("start_walking");
    JT_CHECK(test_runner.WaitWhile(can_see,
        "Waiting for custom to walk behind the wall", 10.0f));
    JT_CHECK(test_runner.WaitUntil(can_see,
        "Waiting for normal to investigate", 30.0f));
    return jt::TEST_SUCCEDED;
}

The first pattern is that I have one character with a custom behavior tree. That tree is written specifically for this test. All it does is wait for the global event “start_walking” and then walk to a predetermined spot behind a wall. The other character is a completely normal enemy as it would appear in the game. Since I only care about the behavior of one of the two characters in this test, I like to have complete control over the other one. It also makes the test code easier because all I have to do is send a global event.

With that we can look at how the test works: Before the test starts, we enter a test level with two characters and a wall. I assert that the two characters exist, then I set the faction of the custom character to the PLAYER_FACTION. This is a second pattern: Our AI has different behavior in AI-vs-AI fights than when fighting the player. They never enter the “investigate” mode when fighting other AI. They only do that when fighting the player. So to test the investigate behavior, we simply set the faction of the other character to the player faction. In all of our AI logic we only use the faction to determine whether an enemy is the player or another AI, so if I set the faction, I get player behavior.

Finally we see that the logic is actually quite simple: I wait one second for the characters to see each other. Then I give the signal to start walking. Then I wait ten seconds until the characters can no longer see each other. At this point the normal character should enter investigate mode. I now wait 30 seconds until they can see each other again. Since each of these waits is an upper bound, the test usually finishes much faster than the sum of the timeouts.

Even though this test is testing a complicated sequence, it’s actually very simple in the implementation. For example if there is a bug in the “investigation mode” then the last step would time out after 30 seconds and the test would fail.

Your tests shouldn’t get more complicated than this. I will list a few more tests to get you started, then I’ll explain how to select your own tests to write. First, test movement. If your characters can climb ladders, test that they can climb ladders. If your characters can enter vehicles, test that they can enter and exit vehicles, and that they can drive vehicles where they’re supposed to be able to drive to. You can also repeat my first example test for all kinds of situations: Make sure that a shooting helicopter can kill a brain-dead opponent. Make sure that a character at a mounted gun can kill a brain-dead opponent. Then test that in a combat scenario, all characters enter cover within X seconds. Test that all characters have shot at least once within Y seconds.
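As a sketch of the last of those ideas, an assertion over every character in a combat scenario is still just one WaitUntil; GetCharactersInTest and IsInCover are hypothetical helpers:

jt::TestResult RunTest(jt::TestRunner& test_runner)
{
    SendGlobalEvent("start_combat"); // content-driven trigger, as in the earlier test
    return test_runner.WaitUntil([&]
    {
        for (Character* c : GetCharactersInTest()) // hypothetical helper
            if (c->IsAlive() && !c->IsInCover())   // IsInCover is hypothetical
                return false;
        return true;
    },
    "Waiting for all characters to enter cover", 20.0f);
}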

Overall I don’t recommend writing too many tests though. I recommend writing tests for one of two reasons only: 1. To more quickly iterate on a new feature. 2. To reproduce a bug. Those two reasons ensure that I have a small set of tests that runs somewhat quickly. Sometimes people try to write really slow tests like “spawn every vehicle and check for errors.” Those kinds of tests are just an invitation for lots of maintenance work. You will certainly run into situations where one vehicle doesn’t work, and when you tell the responsible person they answer “oh yeah we know. It’s only used in one mission, and that mission has bigger problems right now. We’ll fix it before alpha.” (where alpha is a year away) And now you have a broken test in your system that just always gets in the way. So don’t be too aggressive about your tests, and try to write specific tests.

The other open question is what to do about more complicated situations. Like what do we do if we have squad behavior for up to four characters? I would say that if something seems complicated to test, leave it alone for now. Tests don’t solve everything. We don’t lose anything by not writing a test for this. But we may lose something if we add an overly complicated test that requires lots of maintenance. So leave it alone, and just debug it the old-fashioned way. Maybe you will come up with nicely targeted tests later that can reproduce certain problems. Until then you still have a bedrock of simple tests that you can rely on. Don’t test things that are hard to test. And with experience you will be able to make more things easy to test.

Conclusion

I hope to have moved AI code from the things that are “hard to test” to the things that are “easy to test.” The biggest thing I didn’t show was the code for ensuring preconditions, but that code is mostly just code for loading test-levels. (or teleporting to test islands and waiting for streaming in our open world engine) Otherwise the trick of using fibers to write the logic in sequence, and the trick of using time-outs instead of asserts, were most of what was required to make AI testable. That, and the realization that most AI bugs have simple causes that can be reproduced using two characters.

The testing framework that we ended up with not only helps us reproduce bugs, (and catch them early should they come back) it also saves us time while debugging issues by providing the pause, restart and repeat features. Also since all the tests are written in C++, we can call normal functions and we can step through the code using normal debuggers. While I am not yet able to write good tests for the most complicated AI bugs, I have caught many bugs with relatively simple tests. And the longer I’m doing this, the better I get at coming up with simple tests that catch sources of complex problems.

Postscript for 2022

That was the article. I wrote this testing framework for Just Cause 4, but that game is not exactly known for having bug-free AI. So what happened? I think the testing framework made it so that the AI had far fewer bugs than it would have had otherwise, and it allowed us to ship many more features than we could have otherwise shipped, but why wasn’t the overall result good? I can think of four reasons:

  1. Not enough AI devs
  2. Too many escort quests
  3. General company culture of breaking things
  4. My inability to convince anyone else to write tests

1. Not enough AI Devs

The first one is a lame excuse, but it has some truth to it. We had 2 AI designers and 2 AI programmers. As a comparison, let’s look at Horizon: Zero Dawn, which came out a year earlier. It had 11 AI devs, plus 6 combat designers, which is work that also fell on the AI team in Just Cause 4. So 4 people vs 17 people. That’s why Horizon: Zero Dawn has much more impressive AI.

2. Too Many Escort Quests

This was a design decision that the AI team objected to during all of development. Our AI designers told the other designers to not do any escort quests because players don’t like them and you really can’t do them well in a Just Cause game. There is way too much chaos. There are always helicopters falling from the sky or reinforcements driving in with a tank, blocking the way. But nobody listened to us and a year before release it became clear that roughly half of all missions would be escort quests. Escort quests are a great way to show off every single problem that your AI has. There was one mission early in the game where you have to help a car drive half-way across the world. There are roadblocks along the way that you have to clear and the AI has to drive around whatever obstacles remain. Also there is normal traffic and enemies are chasing you, so your ally’s car has to drive fast, swerving through traffic. Everything is systemic, almost nothing is scripted. We don’t just drive on a spline. It’s a total nightmare for an AI programmer. This mission was a huge time sink for us and the other AI programmer probably still has nightmares about that mission… You will notice that Horizon: Zero Dawn cleverly has none of those elements. (objects intentionally blocking the AI’s path, traffic, other cars chasing and ramming your ally that has to drive on a road)

Why did we have all those escort quests? In hindsight I realize that the problem is that you really can’t design levels for the main character in Just Cause 4. He is way too mobile. Want to set up a problem where an automated machine gun is blocking your forward progress? Rico Rodriguez can just fly over that. Want to have a gauntlet of difficult enemies blocking the way? You can just climb on any roof and run straight to the end of the level. So most of the missions somehow restrain Rico, just to allow level designers to design levels. So it’s lots of escort quests or “guard this point” objectives… JC3 had a design that worked better with the freeform movement, but we moved away from that to more designed levels because we wanted more variety. It didn’t work…

3. General Company Culture of Breaking Things

The half-life of a working feature was roughly six weeks. That was the pacing of our internal milestones. Of all the features that were working for milestone X, roughly half would be broken for milestone X+1. In hindsight it was incredible how much time we wasted on things that broke over and over again. Wanted to play a mission that we delivered two milestones ago? Good luck. You’d probably have to spend a day to a week to first get it to work again. (if you don’t believe me that it was this bad, listen to the round table about AI testing and think about what environment these people work in to talk like that. I think our environment was worse than industry-average) There were a few programmers whose things didn’t break this often. I was one of them, and my testing framework really helped me there.

4. My Inability to Convince Anyone Else to Write Tests

I don’t know why I was so bad at this. The testing framework really was easy to use. If you don’t believe me, we also had Google Test. When I joined the company fresh out of college, that was one of the first things I added: Google Test so that we could write automated tests. I think in the seven years that I was there, I managed to convince only one other programmer to write tests in game code, even though I remember giving talks about it and even walking some people through how to write a test. (and that one programmer then left the company to work at Google…) Google Test is even easier to use than my testing framework: In any file you invoke the “TEST” macro and that’s all you have to do. I made the tests always run on startup. You didn’t even have to compile and run a separate executable… So even though it was incredibly easy to write tests, explained in thirty seconds and done in a minute, nobody did it.

I did manage to convince some engine programmers, but the game code remained test-free except for my code. I think the main reason is that everyone was always underwater. People were constantly behind. They had no time to save themselves time by writing tests. Things were always breaking and needed fixing. No time for tests. Our company values were “Passion, Courage, Craftsmanship” and we gave awards for each. The first award for “Craftsmanship” was given to a guy who caused a large amount of our bugs and would then heroically step in to fix them. People saw how good he was at firefighting and rewarded him for it. Never mind that he was also causing most of those fires… In that company culture you can’t get people to write tests. (this programmer certainly would never write tests)

At some point it felt like I had convinced the company that we should really write more tests. Everyone gave lip service to it, and we would get around to writing more tests really soon. One concrete thing that happened was that the tech director ordered the most junior programmer to help write some tests. This guy was so junior we didn’t even hire him as a programmer. He was just a QA guy who knew a little bit of programming. I was glad that somebody else was writing tests, and didn’t realize quickly enough that he was causing more problems than he was preventing. He was writing terrible tests and breaking existing tests. I had originally written this testing framework to test our editor. It worked well for both the editor and for AI tests, and would have probably worked for other things, too. At some point the tools programmer from another team asked about the automated testing work I had done. Unfortunately at that point the editor tests had been broken beyond repair by the QA guy who was “helping out” so there was nothing I could point them at… (and I had no time to bring things back into order, see the points above about the small team and about things constantly breaking) So no other team adopted this testing framework either…

How the Story Ends

The testing framework got a couple more neat features by the end, like the ability to define tests only in content, without even needing to make code changes. (one of our AI designers used that) But as far as I know the testing framework is abandoned now. It was on a branch of our codebase that was abandoned: The Just Cause 4 codebase was not continued. theHunter: Call of the Wild became the new main branch that all future projects were branched off from. So it would have required someone porting the testing framework to that branch. Unfortunately a different programmer wrote a different testing framework in theHunter codebase. There was no coordination at all (that was another thing about Avalanche: Not much coordination, lots of repeated work. JC4 shipped with four different gameplay-scripting systems because people kept on writing new ones because the old ones were bad, but the new ones also didn’t have all the features of the old ones and couldn’t replace them, so all four were in use and our gameplay-scripting was an undebuggable mess…) which meant that I didn’t hear about this framework until it was too late. Someone made the decision to go with that one because it was already on the right branch, and mine was abandoned even though it had more features, actually made it easy to write and run tests, and was battle-tested. (Nobody thought to consult me in the conversation where this was decided, they only were kind enough to tell me afterwards)

So the only legacy of this testing framework is the document above. And I did at least succeed in convincing people to want more testing at Avalanche. I left the company at the end of 2019, so I don’t know what happened since. The other testing framework was probably “good enough,” so maybe it has seen a lot of use since. I don’t know what the situation is at other game companies. The whole industry was oddly bad at keeping things working. Maybe this article will improve things slightly. I mostly used the testing framework to test AI code because I was an AI programmer in JC4, but I wrote the first version when I was a tools programmer in JC3, and I had also used it to test our editor. (that’s where it was important to be able to call any C++ function; those tests had a lot more C++ code in them) So I think it is more widely useful. For example if you had the ability to do player input from code, the fiber-based control flow would make it easy to control the player, too. (write functions to “aim at this point” and then “press the right trigger”) It’s unfortunate that I didn’t get a chance to grow it further. I also don’t have the source code, but I’d be happy to elaborate on any details in the comments.