Handmade Coroutines for Windows

by Malte Skarupke

In a previous post I implemented coroutines using the ucontext.h header. In this post I will show a replacement that should be much faster. I will also provide a Windows implementation.

I like to start off with some code, so here is the complete code for switching the stack in Linux:

switch_to_context:
    pushq %rbp
    movq %rsp, %rbp
    // store rbx and r12 to r15 on the stack. these will be restored
    // after we switch back
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, (%rdi) // store stack pointer
switch_point:
    // set up the other guy's stack pointer
    movq %rsi, %rsp
    // and we are now in the other context
    // restore registers
    popq %r15
    popq %r14
    popq %r13
    popq %r12
    popq %rbx
    popq %rbp
    retq

All code in this post is in the public domain. It would be nice if you gave credit anyway.

That is surprisingly little code to switch a stack, right? It turns out all you have to do is move rsp, and then the return statement at the end will jump the instruction pointer correctly.

The registers that I store and restore are the nonvolatile registers of the SystemV x64 ABI. These need to be restored to the original value before you return from a function. Since coroutines have weird stack behavior where I have no idea ahead of time which function will return to the caller, I have to store and restore all of them.

The reason why this is faster than the functions provided by the ucontext.h header is that those functions also change which signals the current thread is listening to. I don’t know why it was decided that those functions would also do that, but it doesn’t seem useful for coroutines.

You will notice that if I switch from a stack using this function, I can switch back to it using the same code because I store rsp and the code will pop all the registers in the correct order.

Now all we need is a way to start in a new stack. For that I wrote a class which sets up a stack so that when you switch into it, it will call a function that you can give to it.

struct stack_context
{
    stack_context(void * stack, size_t stack_size, void (* function)(void *), void * argument);
    void switch_into();
    void switch_out_of();

private:
    void * caller_stack_top;
    void * my_stack_top;
#ifndef _WIN64
    void * rbp_on_stack;
#endif

    // intentionally left unimplemented. there is a this pointer stored on the
    // stack and as such not even moving makes sense, because then that this
    // pointer would point to the old address
    stack_context(const stack_context &);
    stack_context & operator=(const stack_context &);
    stack_context(stack_context &&);
    stack_context & operator=(stack_context &&);
};

void stack_context::switch_into()
{
    asm("call switch_to_callable_context"
        : : "D"(&caller_stack_top), "S"(my_stack_top), "d"(rbp_on_stack));
}
void stack_context::switch_out_of()
{
    asm("call switch_to_context"
        : : "D"(&my_stack_top), "S"(caller_stack_top));
}

As you can see this struct just calls the assembly from above, so there is nothing more complicated going on. switch_to_callable_context is the same thing as switch_to_context from above except it stores rbp at rbp_on_stack:

switch_to_callable_context:
    pushq %rbp
    movq %rsp, %rbp
    // store rbx and r12 to r15 on the stack. these will be restored
    // after we jump back
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq %rsp, (%rdi) // store stack pointer
    // set up the other guy's stack pointer to make
    // debugging easier
    movq %rbp, (%rdx)
    jmp switch_point

rbp_on_stack is only for debugging (it is never read by code) and I will explain it later.

Another benefit of this is that it is much smaller than ucontext_t, which is 936 bytes on x64. Plus you have to keep two ucontext_t’s around when all you really need is a pointer to the top of a stack so that you can resume where you left off.

The stack_context sets up the stack in such a way that the return statement at the end of switch_to_context will jump to this code:

callable_context_start:
    movq %r13, %rdi // function_argument
    callq *%r12 // function
    movq (%rbx), %rsi // caller_stack_top
    jmp switch_point

This is the start of the coroutine. All it does is call a function with one argument then jump back to the stack switch from above. The function and the function argument come from the stack: The stack switching code from above will pop them and place them in registers r12 and r13. I’ll place these values on the stack in the next function.

But first it’s interesting to think about what the code will switch to at the jump to switch_point at the end. Because it will probably not switch back to the original caller. It’s interesting how coroutines work like that: Between line 3 and 4 you could be in a completely different part of your program. The stack that you will jump back to may look nothing like what called you. You will jump to whatever code called your coroutine after you have yielded for the last time. It’s fun to tell the debugger to step over line 3 and watch all the things that happen while your function is executing.

OK now the last piece of the puzzle is how to set up a new stack so that that piece of code is run when you switch into it:

static void * ensure_alignment(void * stack, size_t stack_size)
{
    static const size_t CONTEXT_STACK_ALIGNMENT = 16;
    unsigned char * stack_top = static_cast<unsigned char *>(stack) + stack_size;
    // if the user gave me a non-aligned stack, just cut a couple bytes off from the top
    return stack_top - reinterpret_cast<size_t>(stack_top) % CONTEXT_STACK_ALIGNMENT;
}

stack_context::stack_context(void * stack, size_t stack_size, void (* function)(void *), void * arg)
    : caller_stack_top(nullptr), my_stack_top(nullptr)
    , rbp_on_stack(nullptr)
{
    unsigned char * math_stack = static_cast<unsigned char *>(ensure_alignment(stack, stack_size));
    my_stack_top = math_stack - sizeof(void *) * 9;
    void ** initial_stack = static_cast<void **>(my_stack_top);
    // store the return address here to make the debuggers life easier
    asm("movq $switch_point, %0\n\t" : : "m"(initial_stack[8]));
    initial_stack[7] = nullptr; // will store rbp here to make the debuggers life easier
    asm("movq $callable_context_start, %0\n\t" : : "m"(initial_stack[6]));
    rbp_on_stack = initial_stack[5] = &initial_stack[7]; // initial rbp
    initial_stack[4] = &caller_stack_top; // initial rbx
    initial_stack[3] = reinterpret_cast<void *>(function); // initial r12
    initial_stack[2] = arg; // initial r13
    initial_stack[1] = nullptr; // initial r14
    initial_stack[0] = nullptr; // initial r15
}

Which is the initial assembly code in reverse. In the first code listing in line 14, when I assign to rsp, I am assigning initial_stack from this function. It will then pop off all the values according to the comments in this function, and then call retq which will jump to callable_context_start, because that’s the address that I place there. (I place it there using inline assembly because in gcc you can only get a pointer to an assembly label from assembly)

Now I can also explain rbp_on_stack: The calling convention dictates that I store rbp and the instruction pointer of the calling function just above the stack of the current function. So the idea is that instead of wasting that space, you always write the rbp of the last function who called the coroutine there, and the debugger should be able to step the callstack across coroutine boundaries. Turns out that gdb does not do that, because it tries to use debug symbols and gets enormously confused. One day I will look into how to fix that properly. At the moment this at least makes it easier to step the stack manually, which may be useful for someone someday.

The reason why rbp_on_stack is not useful for the Win64 version is that the Windows x64 ABI doesn’t have the linked list of rbp any more. Instead you always need debug symbols if you want to figure out what the stack is frome some configuration of rsp and rip.

So that is all for Linux. You can take this code and put it into either of my previous coroutine implementations.

Given the above class, it shouldn’t be too much work to implement the stack_context class for Windows. After all the Windows x64 ABI is pretty similar, right? Well in Windows things are always more complicated. I’m going to start off with the code:

switch_to_context proc
    ; store NT_TIB stack info members
    push qword ptr gs:[8]
    push qword ptr gs:[16]
    ; store rbx and r12 to r15 on the stack. these will be restored
    ; after we switch back
    push rbx
    push rbp
    push rdi
    push rsi
    push r12
    push r13
    push r14
    push r15
    ; start at -24 instead of -16 because of alignment
    movaps [rsp - 24], xmm6
    movaps [rsp - 40], xmm7
    movaps [rsp - 56], xmm8
    movaps [rsp - 72], xmm9
    movaps [rsp - 88], xmm10
    movaps [rsp - 104], xmm11
    movaps [rsp - 120], xmm12
    movaps [rsp - 136], xmm13
    movaps [rsp - 152], xmm14
    movaps [rsp - 168], xmm15
    mov qword ptr [rcx], rsp ; store stack pointer
switch_to_context endp

stack_switch_finish proc
    ; set up the other guy's stack pointers
    mov rsp, rdx
    ; and we are now in the other context
    ; restore registers
    ; start at -168 instead of -160 because of alignment
    movaps xmm15, [rsp - 168]
    movaps xmm14, [rsp - 152]
    movaps xmm13, [rsp - 136]
    movaps xmm12, [rsp - 120]
    movaps xmm11, [rsp - 104]
    movaps xmm10, [rsp - 88]
    movaps xmm9, [rsp - 72]
    movaps xmm8, [rsp - 56]
    movaps xmm7, [rsp - 40]
    movaps xmm6, [rsp - 24]
    pop r15
    pop r14
    pop r13
    pop r12
    pop rsi
    pop rdi
    pop rbp
    pop rbx
    ; restore NT_TIB stack info members. this is needed because _chkstk will
    ; use these members to step the stack. so we need to make sure that it, and
    ; any other functions that use the stack information, get correct values
    pop qword ptr gs:[16]
    pop qword ptr gs:[8]
    ret ; go to whichever code is used by the other stack
stack_switch_finish endp

callable_context_start proc
    mov rcx, rdi ; function_argument
    call rbp ; function
    mov rdx, [rbx] ; caller_stack_top
    jmp stack_switch_finish
callable_context_start endp

So immediately this is a lot more code. Most of that is because the Windows x64 ABI has a lot more nonvolatile registers. I don’t know if that is good or bad in general, (I could see that it is an advantage to have at least some nonvolatile floating point registers) but in this case it bloats the code.

I decided to store all the XMM registers outside of the stack because it made the code a bit easier to reason about and because it saved two instructions. Otherwise it doesn’t really matter where they are stored.

And then there’s the part about “storing the NT_TIB stack info members.” NT_TIB is a global variable that keeps some information about the current thread. For example where its stack is. So when I switch stacks, I have to keep that information up to date, otherwise code that relies on that global variable could show strange behavior because my stack could be very far away from the normal one. Which usually I wouldn’t care about, (who knows what your code is doing: Maybe it wants information about the normal stack even if it’s called from a coroutine) except that a function called _chkstk that Visual Studio will insert in some of your code may incorrectly complain about the stack overflowing, after which it will crash your program.

If you are curious as to how you find out about things such as NT_TIB: I’d like to know as well. Here are the steps I took to find out: 1. Switch a stack to somewhere on the heap. 2. Call a function that calls _chkstk. Which means either alloca or a function which has a big array on the stack. Luckily writing to std::cout will call one such function. 3. Get lucky and get warnings about the stack overflowing from _chkstk. 4. While debugging _chkstk notice that it is reading values from an offset from the gs register. 5. Read up about the gs register and find that Windows 64 uses it to store a pointer to something called NT_TIB.

Those are the necessary steps written with perfect hindsight. I don’t think there is a shorter way to find out. You can imagine how long it took to debug this, especially since things will work fine most of the time and since you can spend a lot of time going down wrong paths when reading up about _chkstk.

And finally here is the code to set that stack up:

extern "C" void switch_to_context(void ** old_stack_top, const void * new_stack_top);
extern "C" void callable_context_start();

void stack_context::switch_into()
{
    switch_to_context(&caller_stack_top, my_stack_top);
}
void stack_context::switch_out_of()
{
    switch_to_context(&my_stack_top, caller_stack_top);
}
stack_context::stack_context(void * stack, size_t stack_size, void (* function)(void *), void * arg)
    : caller_stack_top(nullptr), my_stack_top(nullptr)
{
    unsigned char * math_stack = static_cast<unsigned char *>(ensure_alignment(stack, stack_size));
    my_stack_top = math_stack - sizeof(void *) // space for return address (initial call)
                              - sizeof(void *) * 2 // space for stack info
                              - sizeof(void *) * 4 // space for arguments
                              - sizeof(void *) * 8 // space for non-volatile integer registers
                            //- sizeof(void *) * 2 * 10 // space for non-volatile xmm registers
                            //- sizeof(void *) // stack alignment
                              ;
    void ** initial_stack = static_cast<void **>(my_stack_top);
    // initial_stack[11] to initial_stack[14] are space for arguments. I won't
    // use that space but the calling convention says it has to be there
    initial_stack[10] = &callable_context_start;
    initial_stack[9] = math_stack;
    initial_stack[8] = stack;
    initial_stack[7] = &caller_stack_top; // initial rbx
    initial_stack[6] = function; // initial rbp
    initial_stack[5] = arg; // initial rdi
    initial_stack[4] = nullptr; // initial rsi
    initial_stack[3] = nullptr; // initial r12
    initial_stack[2] = nullptr; // initial r13
    initial_stack[1] = nullptr; // initial r14
    initial_stack[0] = nullptr; // initial r15
    initial_stack[-1] = nullptr; // stack alignment
    initial_stack[-3] = initial_stack[-2] = nullptr; // initial xmm6
    initial_stack[-5] = initial_stack[-4] = nullptr; // initial xmm7
    initial_stack[-7] = initial_stack[-6] = nullptr; // initial xmm8
    initial_stack[-9] = initial_stack[-8] = nullptr; // initial xmm9
    initial_stack[-11] = initial_stack[-10] = nullptr; // initial xmm10
    initial_stack[-13] = initial_stack[-12] = nullptr; // initial xmm11
    initial_stack[-15] = initial_stack[-14] = nullptr; // initial xmm12
    initial_stack[-17] = initial_stack[-16] = nullptr; // initial xmm13
    initial_stack[-19] = initial_stack[-18] = nullptr; // initial xmm14
    initial_stack[-21] = initial_stack[-20] = nullptr; // initial xmm15
}

I don’t really need to initialize all that memory to null pointers, but I don’t like uninitialized variables.

And that is all. A complete stack switch library for Linux and Windows in 64 bit. It really is much less complicated than I thought it was. In summary: Change rsp and make sure that your nonvolatile registers are OK. And then put a function pointer and a function argument on a new stack and switch into it.

8 Comments to “Handmade Coroutines for Windows”

John Calsbeek (@jcalsbeek) says:

February 21, 2013 at 19:19

I believe you also have to save/restore the current SEH frame (gs:[0]) if you don’t want to screw up exception handling on Windows.

Malte Skarupke says:

February 24, 2013 at 02:26

Good point. After some debugging it seems though that that value is not being used for Windows x64 any more. Even if I use the SEH keywords __try and __except the value always stays at 0.

Which makes sense because Microsoft switched to zero overhead exceptions for x64. And the way that those work is that there is a global lookup table for exception handlers and every throw statement passes its index as the argument to the internal RaiseException function. So there is no need to update a global variable at runtime.

I haven’t tried a whole lot of test cases, but for those that I tried the stack swap does what I expect when I throw exceptions. Specifically I was worried that yielding from within a try block would give the wrong results. But that seems to work fine in that throwing a value outside the coroutine will never invoke an exception handler inside of it.

Malte Skarupke says:

February 24, 2013 at 03:08

Oh and one thing that I meant to complain about but that didn’t really fit is _chkstk:

I can’t think of a reason why _chkstk needs to use global variables. All it does is access the stack once per page to make sure that it grows in a controlled way to whichever size you requested. To do that it could just start off at the current value of the stack pointer. Which would actually make the code faster, less complex, and it would mean less use of global variables. So unless I missed something, I am once again upset at Microsoft for wasting large amounts of my time for no reason.

John Calsbeek (@jcalsbeek) says:

February 25, 2013 at 22:28

Hmm, that’s true about x64, that makes sense. And nobody writes 32-bit code on x86 anymore anyways. 🙂

_chkstk probably does a bit of overflow checking by itself (before bothering to allocate the stack one page at a time). And to know the stack size, you need to knows its base, which is presumably where the TEB comes in.

Malte Skarupke says:

March 3, 2013 at 06:16

No _chkstk is pretty stupid. I would like it if it did overflow checking because then it would at least have been consistent in giving me error messages. It’s simply a loop that touches the stack regularly. All the actual logic happens in the signal handler for when it triggers a page fault, but that’s OS internals.

John Calsbeek (@jcalsbeek) says:

March 3, 2013 at 21:49

I checked the source for _chkstk out of curiosity (as it’s included with Visual Studio as chkstk.asm), and you’re right, it’s very stupid, but it also does use esp directly, not the TEB. But of course fixing the TEB is the right thing to do anyway, if for no other reason than making debuggers happy.

John Calsbeek (@jcalsbeek) says:

March 3, 2013 at 21:53

Hmm, but you saw gs being used when debugging it. I wonder if the compiled version of _chkstk is built from different asm than they ship?

Malte Skarupke says:

March 4, 2013 at 20:54

That sounds like the 32 bit version though. I just checked myself and it looks like the 64 bit chkstk.asm doesn’t come with Visual Studio. But you can look at the disassembly if you call alloca or if you allocate an array that’s more than one page in size on the stack and step into the _chkstk call.

So for some reason Microsoft decided to make the 64 bit version of chkstk use global variables where the 32 bit version didn’t need it. For a moment I was tempted to compare the two line by line to see if maybe they managed to improve speed a little because of that, but then I thought that I didn’t want to spend my time on that.

Probably Dance