Probably Dance

I can program and like games

Tag: coroutine

C++ Coroutines Do Not Spark Joy

C++20 added minimal support for coroutines. I think they’re done in a way that really doesn’t fit into C++, mostly because they don’t follow the zero-overhead principle. Calling a coroutine can be very expensive (requiring calls to new() and delete()) in a way that’s not entirely under your control, and they’re designed to make it extra hard for you to control how expensive they are. I think they were inspired by C# coroutines, and the design does fit into C# much better. But in C++ I don’t know who they are for, or who asked for this…

Before we get there I’ll have to explain what they are and what they’re useful for. Briefly, they’re very useful for code with concurrency. The classic example is if your code has multiple state machines that change state at different times: E.g. the state machine for reading data from the network is waiting for more bytes, and the code that provides bytes is also a state machine (maybe it’s decompressing) which in turn gets its bytes from another state machine (maybe it’s handling the TCP/IP layer). This is easier to do when all of these can pretend to operate on their own, as if in separate threads, maybe communicating through pipes. But the code looks nicer if the streamer can just call into the decompressor using a normal synchronous function call that returns bytes. Coroutines allow you to do that without blocking your entire system when more bytes aren’t immediately available, because code can pause and resume in the middle of the function.

One of the best things the C++ standard did is to define the word “coroutine” as different from related concepts like “fibers” or “green threads”. (this very much went against existing usage, so for example Lua coroutines are not the same thing as C++ coroutines. I think that’s fine, since the thing that was added to C++ could be useful, and there is a good case to be made that the existing usage was wrong) In the standard, a coroutine is simply a function that behaves differently when called multiple times: Instead of restarting from the start on each call, it will continue running after the return statement that last returned. To do this, it needs some state to store the information of where it last returned, and what the state of the local variables was at that point. In existing usage that meant that you need a program stack to store this state, but in C++ this is just stored in a struct.

To illustrate all of this, lets build a coroutine in plain C++, without using the language coroutines:

Read the rest of this entry »

Reinventing the Wheel for Better Compile Time

I have a new laptop on which everything compiles slowly. Which means I sometimes get really annoyed with C++ and its slow compile times. The two most common causes of slow compile times are large headers and complicated templates. Large headers are somewhat easy to identify, and you can usually improve your compile time by using precompiled headers, but it has traditionally been hard to find out which templates are hurting your compile time. Luckily there is a project called Templight which can tell you how much time Clang is spending on instantiating templates. Using Templight I have found that some templates take unnecessarily long to compile.

So I re-implemented std::unique_ptr, boost::flat_map, and await.

Read the rest of this entry »

Handmade Coroutines for Windows

In a previous post I implemented coroutines using the ucontext.h header. In this post I will show a replacement that should be much faster. I will also provide a Windows implementation.

I like to start off with some code, so here is the complete code for switching the stack in Linux:

switch_to_context:
    pushq %rbp
    movq %rsp, %rbp
    // store rbx and r12 to r15 on the stack. these will be restored
    // after we switch back
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, (%rdi) // store stack pointer
switch_point:
    // set up the other guy's stack pointer
    movq %rsi, %rsp
    // and we are now in the other context
    // restore registers
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp
    retq

Read the rest of this entry »

Variadic Coroutines in C++ and D

In my last post I implemented a tiny coroutine class that is fully functional, but it doesn’t support arguments and return values. This time I provide an implementation which adds those features as well as exception safety.

I also used this as an example to compare D and C++; as an excuse to learn C++’s variadic template syntax and as a reason to see how you can improve on boost’s implementation by using variadic templates.
Let’s start off with the punchline. Here is my coroutine implementation in C++:

#pragma once

#include <ucontext.h>
#include <cstdint>
#include <exception>
#include <functional>
#include <memory>
#include <tuple>

#ifndef CORO_DEFAULT_STACK_SIZE
#define CORO_DEFAULT_STACK_SIZE SIGSTKSZ
#endif

namespace coro
{
namespace detail
{

/**
 * the coroutine_context holds the stack context and the current state
 * of the coroutine. it also starts the coroutine
 */
struct coroutine_context
{
    coroutine_context(size_t stack_size, void (*coroutine_call)())
        : stack{new unsigned char[stack_size]}
    {
        // create a context for the callee
        getcontext(&callee);
        callee.uc_link = &caller; // tell it to return to me when done
        callee.uc_stack.ss_size = stack_size;
        callee.uc_stack.ss_sp = stack.get();
        // pass the this pointer as two arguments, because officially
        // makecontext will pass the arguments as ints (even though
        // inofficially it would work with a single argument)
        makecontext(&callee, coroutine_call, 2, reinterpret_cast<size_t>(this) >> 32, this);
    }

    void operator()()
    {
        if (returned) throw "This coroutine has already finished";

        swapcontext(&caller, &callee); // continue here if yielded or returned

        if (exception)
        {
            std::rethrow_exception(std::move(exception));
        }
    }

protected:
    ucontext_t caller;
    ucontext_t callee;
    std::unique_ptr<unsigned char[]> stack;
    std::exception_ptr exception;
    bool returned = false;
};

// a type that can store both objects and references
template<typename T>
struct any_storage
{
    any_storage() = default;
    any_storage(T to_store)
        : stored{std::move(to_store)}
    {
    }
    any_storage & operator=(T to_store)
    {
        stored = std::move(to_store);
        return *this;
    }
    operator T &&()
    {
        return std::move(stored);
    }

private:
    T stored = T{};
};
// specialization for void
template<>
struct any_storage<void>
{
};
// specialization for lvalue references
template<typename T>
struct any_storage<T &>
{
    any_storage() = default;
    any_storage(T & to_store)
        : stored{&to_store}
    {
    }
    any_storage & operator=(T & to_store)
    {
        stored = &to_store;
        return *this;
    }
    operator T &()
    {
        return *stored;
    }

private:
    T * stored = nullptr;
};
// specialization for rvalue references
template<typename T>
struct any_storage<T &&>
{
    any_storage() = default;
    any_storage(T && to_store)
        : stored{&to_store}
    {
    }
    any_storage & operator=(T && to_store)
    {
        stored = &to_store;
        return *this;
    }
    operator T &&()
    {
        return *stored;
    }

private:
    T * stored = nullptr;
};

// implements the shared code among all specializations of coroutine_yielder
template<typename Result, typename... Arguments>
struct coroutine_yielder_base
    : protected coroutine_context
{
    coroutine_yielder_base(size_t stack_size, void (*coroutine_call)())
        : coroutine_context{stack_size, coroutine_call}
    {
    }
    std::tuple<Arguments...> yield()
    {
        swapcontext(&this->callee, &this->caller);
        return std::move(this->arguments);
    }

protected:
    any_storage<Result> result;
    std::tuple<any_storage<Arguments>...> arguments;
};

/**
 * The coroutine_yielder is responsible for providing a yield
 * function
 */
template<typename Result, typename... Arguments>
struct coroutine_yielder
    : protected coroutine_yielder_base<Result, Arguments...>
{
    coroutine_yielder(size_t stack_size, void (*coroutine_call)())
        : coroutine_yielder_base<Result, Arguments...>{stack_size, coroutine_call}
    {
    }
    std::tuple<Arguments...> yield(Result && result)
    {
        this->result = std::forward<Result>(result);
        return coroutine_yielder_base<Result, Arguments...>::yield();
    }

    // copying yield
    std::tuple<Arguments...> yield(const Result & result)
    {
        this->result = result;
        return coroutine_yielder_base<Result, Arguments...>::yield();
    }
};
// specialization for void
template<typename... Arguments>
struct coroutine_yielder<void, Arguments...>
    : coroutine_yielder_base<void, Arguments...>
{
    coroutine_yielder(size_t stack_size, void (*coroutine_call)())
        : coroutine_yielder_base<void, Arguments...>{stack_size, coroutine_call}
    {
    }
};
// specialization for lvalue reference
template<typename Result, typename... Arguments>
struct coroutine_yielder<Result &, Arguments...>
    : protected coroutine_yielder_base<Result &, Arguments...>
{
    coroutine_yielder(size_t stack_size, void (*coroutine_call)())
        : coroutine_yielder_base<Result &, Arguments...>{stack_size, coroutine_call}
    {
    }
    std::tuple<Arguments...> yield(Result & result)
    {
        this->result = result;
        return coroutine_yielder_base<Result &, Arguments...>::yield();
    }
};
// specialization for rvalue reference
template<typename Result, typename... Arguments>
struct coroutine_yielder<Result &&, Arguments...>
    : protected coroutine_yielder_base<Result &&, Arguments...>
{
    coroutine_yielder(size_t stack_size, void (*coroutine_call)())
        : coroutine_yielder_base<Result &&, Arguments...>{stack_size, coroutine_call}
    {
    }
    std::tuple<Arguments...> yield(Result && result)
    {
        this->result = std::move(result);
        return coroutine_yielder_base<Result &&, Arguments...>::yield();
    }
};
}

template<typename Result, typename... Arguments>
struct coroutine;

template<typename Result, typename... Arguments>
struct coroutine<Result (Arguments...)>
    : detail::coroutine_yielder<Result, Arguments...>
{
private:
    template<int N, int... S>
    struct starter;
    typedef starter<sizeof...(Arguments)> Starter;
public:
    typedef typename detail::coroutine_yielder<Result, Arguments...> self;
    coroutine(std::function<Result (self &, Arguments...)> func, size_t stack_size = CORO_DEFAULT_STACK_SIZE)
        : detail::coroutine_yielder<Result, Arguments...>{stack_size, reinterpret_cast<void (*)()>(&Starter::coroutine_start)}
        , func{std::move(func)}
    {
    }
    // I don't need to specify these. the default behavior would result
    // in the same constructors and assignment operators being generated
    // but you get better error messages if I'm explicit about it
    coroutine(const coroutine &) = delete;
    coroutine & operator=(const coroutine &) = delete;
    coroutine(coroutine &&) = default;
    coroutine & operator=(coroutine &&) = default;

    operator bool() const
    {
        return !this->returned;
    }

    Result operator()(Arguments... args)
    {
        this->arguments = std::make_tuple(detail::any_storage<Arguments>{std::forward<Arguments>(args)}...);
        detail::coroutine_yielder<Result, Arguments...>::operator()();
        return Starter::Finisher::return_result(*this);
    }

private:
    // returning a value needs to be specialized for void return
    // this struct handles that
    template<typename R, int... S>
    struct finisher
    {
        static R return_result(coroutine & this_)
        {
            return std::forward<Result>(this_.result);
        }
        static void start_and_store_result(coroutine & caller)
        {
            caller.result = caller.coroutine_start<S...>();
        }
    };
    template<int... S>
    struct finisher<void, S...>
    {
        static void return_result(coroutine &)
        {
        }
        static void start_and_store_result(coroutine & caller)
        {
            caller.coroutine_start<S...>();
        }
    };

    template<int ...ArgCountList>
    Result coroutine_start()
    {
        return func(*this, std::forward<Arguments>(std::get<ArgCountList>(this->arguments))...);
    }

    // the calling of the coroutine needs to be specialized by
    // the number of arguments. this struct handles that
    template<int N, int... S>
    struct starter : starter<N - 1, N - 1, S...>
    {
    };

    template<int... S>
    struct starter<0, S...>
    {
        typedef finisher<Result, S...> Finisher;
        static void coroutine_start(uint32_t this_pointer_left_half, uint32_t this_pointer_right_half)
        {
            coroutine & this_ = *reinterpret_cast<coroutine *>((static_cast<size_t>(this_pointer_left_half) << 32) + this_pointer_right_half);
            try
            {
                Finisher::start_and_store_result(this_);
            }
            catch(...)
            {
                this_.exception = std::current_exception();
            }
            this_.returned = true;
        }
    };

    std::function<Result (self &, Arguments...)> func;
};

}

And here it is in D:

module coro.coroutine;

import coro.ucontext;
import std.algorithm;
import std.conv;
import std.typecons;

template VariadicForward(string before, string toforward, string after, size_t i)
{
    static if (i == 0)
    {
        immutable string VariadicForward = before ~ after;
    }
    else static if (i == 1)
    {
        immutable string VariadicForward = VariadicForward!(before, toforward, "move(" ~ toforward ~ "[" ~ to!string(i - 1) ~ "])" ~ after, i - 1);
    }
    else
    {
        immutable string VariadicForward = VariadicForward!(before, toforward, ", move(" ~ toforward ~ "[" ~ to!string(i - 1) ~ "])" ~ after, i - 1);
    }
}

class coroutine(Result, Arguments...)
{
	private alias Tuple!Arguments.Types Types;
	this(Result delegate(coroutine!(Result, Arguments), Types) func, size_t stack_size = 8192)
	{
		this.func = func;
		stack = new ubyte[stack_size];
		getcontext(&callee);
		callee.uc_link = &caller;
		callee.uc_stack.ss_size = stack_size;
		callee.uc_stack.ss_sp = stack.ptr;
		extern(C) void function() call = cast(void function())&coroutine_start;
		makecontext(&callee, call, 2, cast(size_t)(cast(void *)this) >> 32, this);
	}

	Result opCall(Types arguments)
	{
		if (returned) throw new Throwable("This coroutine has already finished");

		static if (arguments.length)
		{
			foreach(i, ref argument; arguments)
				this.arguments[i] = move(argument);
		}
	    swapcontext(&caller, &callee);

	    if (exception)
		{
			throw exception;
		}
		else static if (!is(Result == void))
		{
			return move(result);
		}
	}

	@property bool callable() const
	{
		return !returned;
	}

	static if (is(Result == void))
	{
		alias yield_common yield;
	}
	else
	{
		Tuple!Arguments yield(ref Result result)
		{
			this.result = result;
			return yield_common();
		}
		Tuple!Arguments yield(Result result)
		{
			this.result = move(result);
			return yield_common();
		}
	}

	private Tuple!Arguments yield_common()
	{
	    swapcontext(&callee, &caller);
    	mixin(VariadicForward!("return Tuple!Arguments(", "arguments", ");", Types.length));
	}

private:
	Result delegate(coroutine!(Result, Arguments), Types) func;
	ubyte[] stack;
	ucontext_t caller;
	ucontext_t callee;
	Throwable exception;
	bool returned = false;
	static if (!is(Result == void))
	{
		Result result;
	}
	Types arguments;

	extern(C) static void coroutine_start(uint this_pointer_left_half, uint this_pointer_right_half)
	{
		coroutine this_ = cast(coroutine)cast(void *)(((cast(size_t)this_pointer_left_half) << 32) + this_pointer_right_half);
	    try
	    {
	    	static if (is(Result == void))
	    	{
	    		mixin(VariadicForward!("this_.func(this_, ", "this_.arguments", ");", Types.length));
	    	}
	    	else
	    	{
	    		mixin(VariadicForward!("this_.result = this_.func(this_, ", "this_.arguments", ");", Types.length));
	    	}
	    }
	    catch(Throwable ex)
	    {
	    	this_.exception = ex;
	    }
	    this_.returned = true;
	}
}

The most obvious point first: D is much cleaner.

Read the rest of this entry »

Implementing coroutines with ucontext

Coroutines are a way of doing cooperative multithreading where each context has it’s own stack. And they can be incredibly useful for games. (I’ll use the word context as the coroutine equivalent of a thread) The basic idea of a coroutine is that you can yield from a context in the middle of a function, and then you will continue exactly at that point when the coroutine gets called the next time.
Imagine that you want something to happen in a sequence, for example you want to spawn an enemy, show a text for two seconds, then kill the enemy and show another text for two seconds. A trivial implementation would be this:

void sudden_exploding_unit(const Vector3 & pos)
{
    std::unique_ptr<Entity> guy = spawn_entity(pos, "grunt.entity");
    sleep(1.0f);
    show_text(pos, "oh no he is exploding!", 2.0f);
    sleep(0.5f);
    explode(*guy);
    sleep(1.0f);
    show_text(pos, "well that was underwhelming", 2.0f);
}

Even though this code is simple, you would probably have a large amount of pain to implement this in a game engine, because a game can never block, and this code is sleeping all over the place. You’d probably end up writing some kind of state machine (maybe using a kismet-like scripting system) or if you’re a terrible person you’d use asynchronous callbacks. State machines can be a fine solution, but they introduce way too much infrastructure for such a trivial piece of code.

Enter coroutines. If your engine has coroutines, you can write exactly the above code. With very little downsides. Because you could just sleep like this:

void sleep(float seconds)
{
    while (seconds >= 0.0f)
    {
        yield(); // resume here next frame
        seconds -= global_dt;
    }
}

Read the rest of this entry »