Coroutines in one page of C

ChuckMcM · on Aug 20, 2013

Wow, that takes me back. That was _exactly_ how the original Java Virtual Machine did threading. I ported it to the new model used by Solaris (from its SunOS 4.x roots) and that begat the term 'hello hello hello world world world' which is, as you might imagine, getting the threaded version of helloworld.c running :-)

NoodleIncident · on Aug 20, 2013

Better yet: coroutines in three macros. I found this page googling "yield in c++", not really expecting to find anything. I ended up not using it, but it's still my favorite bit of hackery.

http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

fanf2 · on Aug 20, 2013

Note that Simon's coroutines are more like Python generators than full-blown coroutines, since only the top-level function can yield.

You don't need asm to manipulate the stack and frame pointers in standard C - see http://fanf.livejournal.com/105413.html or for some slightly more fleshed-out variants that do not require C99 see http://dotat.at/cgi/git/picoro.git

yosefk · on Aug 20, 2013

So you allocate a coroutine stack by allocating a large array on the caller's stack, so that when you longjmp back to the setjmp'd context, you can safely overwrite that reserved space. F* me plenty.

Good stuff. One reason my inline asm serves my specific needs better is that my host OS has stack protection - a thread can only access its own stack; and I send coroutines around threads - yield in thread A, resume in thread B - and so I have to allocate my stacks outside the stacks of either A or B. I'll add a link to your implementation in my article, and perhaps Wikipedia should, too, if it didn't already (it has a lengthy list of coroutine implementations in C...)

iambvk · on Aug 20, 2013

Glad you mentioned this; to me, this is a very important requirement.

I don't see lot of people understand the difference between YIELD and full-blown coroutines.

IMO YEILD is not very useful; it doesn't compose well with non-yield function. OTOH full-blown coroutines are super awesome.

Too bad C++11 still doesn't give full-blown coroutines.

batguano · on Aug 21, 2013

Could you explain what you mean by full-blown coroutines? Thanks!

anonymoushn · on Aug 21, 2013

If you have a function that can suspend its own execution and return data, you have a Python generator.

If it can also receive data when it is resumed, you have an ES6 generator (or the C macros above!).

If its subroutines can cause it to suspend, you have a full-blown coroutine.

If it can be copied, you can build call/cc.

616c · on Aug 21, 2013

I had know idea call/cc was that powerful. Very cool, and thanks for the breakdown for someone who did not know the difference.

_ZeD_ · on Aug 21, 2013

but but... http://dabeaz.com/coroutines/

anonymoushn · on Aug 21, 2013

Oops! "Python generator" should be "Python 2.4 generator." Versions of Python people actually use go with ES6.

iambvk · on Aug 23, 2013

Didn't see your comment til now. So, here is the difference I see:

Full-blown coroutines give us a way to write fully asynchronous code without using callbacks. For example, I could define ASYNC macro as follows:

#define ASYNC(blocking_function_call) \ do { \ auto io_pool = GetIOThreadPool(); \ auto cpu_pool = GetCurrentThreadPool(); \ \ Detach_The_Thread_and_Enqueue_Coroutine_For_IO(io_pool); \ \ blocking_function_call; \ \ Detach_The_Thread_and_Enqueue_Coroutine_For_CPU(cpu_pool); \ } while (0)

Now, I can make any blocking operation asyncrhonous. For example, open() or readdir() functions do not have asynchronous interfaces, but I can make them asynchronous by simply wrap it with ASYNC(...) macro.

It works as follows: When we need to do a blocking operation, we suspend the coroutine and we enqueue it into a thread-pool that is designated to do blocking calls; the cpu thread is now free, so it can start executing other cpu-bound coroutines. When io thread has finished the job, it will enqueue the coroutine back to the CPU thread pool.

What we have now is a fully asynchronous system where cpu threads can be kept busy with only cpu-bound work and io threads can be kept with io-bound work. This gives us full control over the concurrency. We could be confident that CPU threads are always available for compute work.

We cannot do this with python "yield".

ricree · on Aug 21, 2013

If I had to guess, I think he's talking about coroutines that allow you to pass data as well as take it out.

Python's PEP 342, which expanded python generators into truer coroutines, has a decent discussion on the differences.

http://www.python.org/dev/peps/pep-0342/

catnaroek · on Aug 20, 2013

Boost.Asio has an implementation of cooperative multitasking via (hacked, really, really ugly) coroutines.

yosefk · on Aug 20, 2013

I think boost also has a standalone coroutines library.

rdpmc · on Aug 20, 2013

You can also implement a simple version of channels (go's concurrency model) on top of something like that, with just one more page of C: https://github.com/danluu/setjmp-longjmp-ucontext-snippets/b...

https://github.com/danluu/setjmp-longjmp-ucontext-snippets/b...

nanofortnight · on Aug 20, 2013

This can be implemented without assembly using setcontext/getcontext on POSIX-compatible systems, and GetThreadContext/SetThreadContext on Windows.

anon1385 · on Aug 20, 2013

Examples of that in the Io source:

https://github.com/stevedekorte/io/blob/master/libs/coroutin...

cygx · on Aug 20, 2013

Not if you care for performance.

nanofortnight · on Aug 20, 2013

get/setcontext do not copy stack, but use an allocated bit of memory space for the stack and swaps stack pointers just like this implementation. I don't see how that's a performance hit.

I do not pretend to know about Windows internals so I won't comment on Windows.

vinkelhake · on Aug 20, 2013

The author of boost.context did some benchmarking[1] and found that get/setcontext was an order of magnitude slower than a custom solution. Whether this matters is an entirely different question, but it's clear that get/setcontext has some overhead.

[1] http://www.boost.org/doc/libs/1_54_0/libs/context/doc/html/c...

cygx · on Aug 20, 2013

See also the notes on various implementation mechanisms of libcoro ( http://cvs.schmorp.de/libcoro/coro.h )

davidtgoldblatt · on Aug 20, 2013

getcontext and setcontext also keep track of signal state, which involves a syscall.

jvanenk · on Aug 20, 2013

I'm reminded of protothreads (http://dunkels.com/adam/pt/). I used protothreads some time ago on an embedded project to implement a (very small) HTTP server in only a few KB or RAM.

yosefk · on Aug 20, 2013

The trouble with them is they're stackless, which means you can't have arbitrary functions thoughtlessly coded by someone unaware of threading called inside many different protothreads. Coroutines are easier to combine with code not written to account for their existence (of course stack overflow is a big concern though, which can be somewhat mitigated through memory protection depending on the OS and the hardware).

enqk · on Aug 21, 2013

Care to explain further?

In my experience of using protothreads, only the functions using the protothread macros need to be written in a special way:

   - passing a context object
   - being careful of variable initialization/blocks 
     in order not to work on unitialized memory.

while all the other ones can stay plain C functions.

yosefk · on Aug 21, 2013

Sure, if none of these C functions tries to yield, or calls a library function trying to yield. In fact, the problem isn't that they can't be plain C functions, but that they must be plain C functions :)

Why would functions called by "lightweight threads"/"coroutines"/whatever want to yield? One example is, they want to wait for an event - an I/O request completion or a computation offloaded to an accelerator or whatever - without having to return all the way back to the event loop.

enqk · on Aug 21, 2013

I must admit that this property attracted me to protothreads, since there is a clear demarcation between coroutines (through the presence of a protothread argument) and plain C functions.

msteinert · on Aug 20, 2013

I've always liked Russ Cox's solution [1]

[1]: http://swtch.com/libtask/

_ak · on Aug 20, 2013

I concur. It's more code, but it's still really easy to understand, and basically precedes Go's concurrency model.

iambvk · on Aug 20, 2013

Also, it is more powerful. You can return/resume from arbitrary function depths.

gte910h · on Aug 20, 2013

Can people explain the fascination with Coroutines?

I don't see what problem domain hits solutions that are best done with them all the time.

ufo · on Aug 20, 2013

The main benefit of coroutines is avoiding inversion of control flow.

A classic example is iterators. Its very natural to write an iterator like

    for(i=0; i<10; i++){
       yield i
    }

But to program the same thing without coroutines you need to keep all the control flow state explicitly and write the logic inside out like a state machine:

    i = 0
    int getNext(){
        if(i >= 10){
           return Done
        }else{
           temp = i
           i = i + 1
           return temp
        }
    }

This inversion of control gets worse the more complicated your control flow is. For example if you have a recursive structure

    iterate_tree(t){
        if(t.isLeaf){
           yield t.value
        }else{
           iterate_tree(t.left)
           yield t.key
           iterate_tree(t.right)
        }
    }

then you suddenly need to also keep a bunch of stacks as your control flow state instead of just a single integer. I wont even try to write it because its just that ugly!

---------

One way to prevent the inversion of control is to use internal iterators, where you pass a code block to the iterator:

    void iterate(callback){
       for(i=0; i<10; i++){
          callback(i)
       }
    }

This works just fine sometimes but there are limitations: you cant't break out of the iteration at the middle and you cant interweave to iterations in parallel. Another situation where you see this kind of limitation is Javascript. Using callbacks to do async IO is better than writing everything inside out with all state in global variables (like you would do in C) but its a PITA to do simple things like chaining async calls one after the other inside a loop For example, with coroutines you could do something looking like

    foreach(url in list){
       var x = await getUrl(url)
       console.log(x)
    }
    restOfTheCode();

while in regular JS you need to use recursive functions instead:

    function go(i){
        if(i >= urls.length){
           restOfTheCode();
        }else{
           getUrl(function(x){
               print(x)
               go(i+1)
           })
        }
    }

jws · on Aug 20, 2013

all the time is strong.

I find them useful when conjoining function A with produces a sequence of results and function B which consumes the sequence†. Without coroutines you either have to turn one of the functions inside out to use it as a callback or break into threads and pass data between them.

With coroutines you can leave the functions in a more readable form and keep your cache footprint small by consuming the result as soon as it is produced.

␄

† I don't mean that I call A or B for each element of the sequence. A and B may have internal state and fill produce or consume the whole sequence in a single call. They could be organized to have a create function, a context structure, a "do one" call, and a destroy function… but that is a lot of noise and a lot of places for someone to make a mistake using them. A coroutine is cleaner.

gte910h · on Aug 20, 2013

Can I have an example here?

spc476 · on Aug 20, 2013

Here's a Lua-esque (since Lua supports coroutines) example. I want to run a simple TCP echo server, so the main code that handles a connection ideally would look like this:

    function echo(socket)
      local remote,data = socket:read()
      if remote == nil then -- no data
        socket:close()
        return 
      end
      socket:write(remote,data)
      return echo(socket) -- Tail call, this is a goto
    end

Normally, such code would need to be run in a separate process or thread but that's too much overhead. You can use an event-based method (say, with poll() or epoll()) but then the code becomes inverted (or involves callbacks---which in this case isn't too bad, but imagine trying to implement, say, SMTP). With coroutines, we can still use an event-based mainloop, but still keep the imperative style of the original function above. Our event loop would look something like:

    function mainloop()
      local events = select:events()
      for i = 1 , #events do
        local err,keepgoing = coroutine.resume(events[i].code,events[i].socket,events[i].remote)
        if not keepgoing then
          select:remove(events[i].socket)
          events[i].socket:close()
        end
      end
      return mainloop()
    end

and we would need to change our function just a bit:

    function echo()
      local socket,remote = coroutine.yield(true)
      local data = socket:read()
      if data == nil then
        return false
      end
      socket:write(remote,data)
      return echo()
    end

Oh, and don't forget the coroutine to accept new connections:

    function echoconnection()
      local socket = coroutine.yield(true)
      local connection,remote = socket:accept()
      select:insert(connection,{ read = true } ,{ code = echo,remote = remote })
      return echoconnection()
    end

You can, of course, rewrite it such that socket:read() does the call to coroutine.yield(), but here I wanted to make it more explicit (as well as a shorter example) to fully hide the implementation (this is based on a working implementation I have in about 224 lines of Lua, which also handles signals and hides the coroutine calls from the main functions---the first example in this post would run as-is in the fully-blown implementation).

616c · on Aug 21, 2013

Very cool stuff. I wanted to start learning Lua. Is there any way to see your code?

spc476 · on Aug 21, 2013

Well, it's finicky. You'll need to install several custom written Lua modules [1]. These have been tested under Linux and Solaris. You will definitely have to edit the Makefile to exclude building some of the modules like org.conman.tcc [2] because of dependency issues (TCC isn't for the faint of heart). If you are on another Unix system, you'll probably be able to get most of it built (at the least, you'll need org.conman.process, org.conman.net and org.conman.pollset). You will also need to install LPeg [3]. Once all those are installed, then you'll be able to run the code: http://pastebin.com/ZBq5gSbS

Or you can just look at the code: http://pastebin.com/ZBq5gSbS The coroutine stuff is there, just well hidden.

I will also link to a sample multiuser chat room (no security, very basic): http://pastebin.com/u6vSPc69

[1] https://github.com/spc476/lua-conmanorg

[2] It's a Lua interface to the TCC library, which is a C compiler-as-a-library. Not used in the same code I posted, but it is something I find immensely useful, but I would classify as "advanced."

[3] http://www.inf.puc-rio.br/~roberto/lpeg/

This is the Lua Parsing Expression Grammar library. If you are familiar with compiler tools, think lex and yacc (or flex and bison).

andrewcooke · on Aug 20, 2013

very informally, they're good for the kind of problem where it would be nice to structure your code as if you're using threads, but you have a simple sequence of actions so don't need the asynchronous aspect of threads or the worry that goes with them + mutable state.

for example, in the matasano crypto challenge i needed to simulate a message protocol between various actors (and then break it). i could have written client and server as separate threaded entities, but it was simpler to just have them each be a coroutine.

think of them like thread, state and bi-directional queue, in one easy package.

another example is lazy sequences. you could start up a separate thread that generates the sequences, but then you need to talk to it, throttle it in some way, and close it down. using a generator (a simple co-routine) you can still have your code for generating the sequence in one clean place, but it's easier to call and control.

if you've ever thought "it would be kinda cool if threads were simple, so i could structure my code like..." then you probably would have appreciated a coroutine.

[update: blog post on the protocol bit - http://www.acooke.org/cute/UsingCorou0.html - although really it doesn't say much more than the above]

tel · on Aug 20, 2013

Cool, I've been expecting a chance to do coroutine-based modeling of crypto protocols in those challenges, but haven't been able to devote any time to going through them.

anonymoushn · on Aug 20, 2013

Some people think that async code phrased as though it was blocking code is easier to read, write, and modify. This is because the workflow for writing async code with callbacks is basically

  1. Imagine the control flow you actually want.
  2. Manually perform some subset of a CPS transform around the parts that would block.
  3. Write the result of 2 and check it in.

And then the workflow for updating it is

  a. Read the code.
  b. Attempt to reconstruct a mental model of the originally-desired control flow from the mess.
  c. goto 1.

If you have some simulation/game in which many entities need to interact, you can use coroutines and write each entity's routine(s) imperatively. For example, BulletML lets you do this by embedding a suspendable domain-specific language in your project. If you check out http://www.youtube.com/watch?v=uK_YTyhj7T4, every bullet that either fires other bullets or changes velocity has one or more functions attached to it, and they are all running concurrently. It might be even better to do this without using an extra language, though.

vidarh · on Aug 20, 2013

In addition to inversion of control mentioned elsewhere, consider situations where you could have used threads and synchronization points (mutexes etc.). In such situations co-routines can offer a much lighter weight alternative to full blown threads (for single core execution), as the points where control should switch can be explicitly known, and the "context switch" can often be done with far less data shuffling.

jerf · on Aug 20, 2013

As with so many programming techniques, there is nothing that can only be done with a coroutine approach, so bear in mind that any example that fits in a little text box probably also fits into a little text box with conventional functions. Practical differences only appear at scale.

Further, "coroutine", like so many other programming terms, has suffered from a certain fuzziness. There's a reasonably precise term used by programming language researchers, which very few languages have. Things like Python's generators do not conform entirely to this definition, but you get people calling them coroutines anyhow, along with so many other things.

So, rather than try to justify the exact PL-research definition, let me try to explain the interesting thing about the more sloppy, broad definition. In a nutshell:

When you call a function, you pass in certain arguments. Then you start at the beginning, generating state or values in response to the input, before finally returning some more values. You then discard all the intermediate state the function generated. The idea here is that there are times when you may want to not discard the state, and/or you may want to be able to not start the function at the beginning. You may also want to pass fresh values into the middle of the function.

One of the most clear examples of where this is useful is when you create a pipeline of functions within your code. Using "functions" that can get more input as they go along, and generate output without discarding all their intermediate state, makes it very easy to compose functions together into a pipeline that only know they are in a pipeline. You can take input from a user, apply a "coroutine" that then parses their input into numbers, feed that into a "coroutine" that groups them together into a run-length delimited format, then feed that into a coroutine that writes the results to a file. Each of these routines holds on to a little bit of state, instead of discarding it. Each of these routines is "pluggable"; you can take the same pipeline, pop the "user input" bit off the front, and read from a file, or stream from the network, and the same for all the little bits. You get a composable system of functions that all know just barely enough to function (which is what makes composition easy), instead of a set of functions that knows "A ha, I'm going to get this sort of linked list, traverse it, do 'my thing', and then return a new sort of linked list", which doesn't work on network streams, or hash tables, etc. This also is an easy way to operate on a "stream" without having to have the whole stream loaded into memory, and is more cache/memory-bandwidth friendly because you're only traversing it once.

Of course, if you work at it enough, you can do this with conventional functions. But you're going to write a lot of boilerplate, you're going to get it wrong, and it probably still won't be as composable in the end. Or you'll end up manually re-implementing "coroutine"ish things yourself.

This is only one example, but it's one of the clearer ones where having things able to retain state without being forced into the strict call hierarchy that conventional function stacks require is obviously a good idea.

(Pendant note: I've tried to keep this comment very language agnostic. There's a world of difference between Haskell and Python here, for instance. I'm going for the essence of the pattern in general, not in a particular language.)

xargon7 · on Aug 20, 2013

> Pendant note

That should be "Pedant"

;)

jerf · on Aug 20, 2013

Damn. I know that, it just typed itself. Muphry's Law strikes again, sort of.

ricree · on Aug 21, 2013

>Things like Python's generators do not conform entirely to this definition, but you get people calling them coroutines anyhow

What's python still missing these days?

zzzcpan · on Aug 20, 2013

I guess familiarity is the reason. Most people know how to make a function call, but learning other ways to do concurrency takes time and effort.

brianpgordon · on Aug 20, 2013

> A coroutine is a function that you can jump back into after returning from it - and it remembers where it was in the code, and all the variables.

This is actually called a continuation. Coroutines are a control flow idiom for skipping back and forth between two blocks of code, which can depend on continuations.

tikhonj · on Aug 20, 2013

And, for a completely different take on the subject, coroutines in Haskell:

http://random.axman6.com/blog/?p=231

This is a nice article that explains how they work pretty well. It's very different from the C approach.

ginko · on Aug 20, 2013

Wouldn't you implement coroutines in Haskell by just generating a stream that is lazily evaluated? You can seed something like this with the iterate function for instance.

ufo · on Aug 20, 2013

One of the biggest reasons for using coroutines in Haskell is when you want to mix them with effects. For example, using lazy streams to do IO can lead to unpredictable resource usage so many people use coroutine libraries to do the IO in a more predictable manner (piper / conduits / io-streams / etc).

Finally, while lazy streams go a long way, in terms of preventing the inversion of control that you usually need to use coroutines for, there are still some situations where coroutines are useful. For example try writing that consumer/producer example from the blog post he linked using only lazy streams...

tel · on Aug 20, 2013

Laziness is pretty implicit and while it enables all kinds of optimizations and reasoning it can be hard to understand the sequencing of effects. In particular, laziness and IO lead to code where handle opening and closure is difficult to understand. For instance

    lines :: IO [String]
    lines = withFile path $ \h -> hReadLines h

will generate a lazy stream of strings in an IO action without actually reading those lines. You can either close the handle immediately and get a runtime error later when you force those lines OR leave it open and leak a file handle.

Explicit coroutines help to ensure that you know exactly what to expect from effect interleaving.

Mario Blazevic's article [0] is the quintessential treatment and forms the basis of the Pipes library today.

[0] http://themonadreader.files.wordpress.com/2011/10/issue19.pd...

zzzcpan · on Aug 20, 2013

If you are interested in coroutines, take a look at GNU Portable Threads http://www.gnu.org/software/pth/ and this paper: http://www.gnu.org/software/pth/rse-pmt.ps

imurray · on Aug 20, 2013

An alternative approach, arguably with too much magic: http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html

nonane · on Aug 20, 2013

Actually, the link you've posted is much simpler than the parent post: no assembly voodo required and no longjump shenanigans. It's also much easier to debug under a debugger. I've successfully used this in production code. It effectively boils down to these 3 defines:

  #define COR_BEGIN(stateVar)            switch(stateVar) { case 0:;
  #define COR_END(stateVar)              }; stateVar = -1;
  #define COR_RETURN(stateVar)           do { stateVar=__LINE__; return; case __LINE__:;} while (0)

  void dofunc(int* pstate)
  {
   COR_BEGIN(*pstate);

   while (computation1)
       COR_RETURN(*pstate);

   while (computation2)
       COR_RETURN(*pstate);

   COR_END(*pstate);
  }

yosefk · on Aug 20, 2013

No stack though, so you can't have arbitrary code running in several concurrently executing coroutines.

As to debugging, I don't think setjmp/longjmp gets in the way of debuggers at all.

halayli · on Aug 20, 2013

or use lthread.

http://github.com/halayli/lthread

ilink · on Aug 21, 2013

My coworker showed me this, great library!

bajsejohannes · on Aug 20, 2013

What's the runtime overhead on this? Aren't setjmp and longjmp considered slow?

ajross · on Aug 20, 2013

set/longjmp are full context switches, so they need to write out all relevant registers to memory. More to the point the "context" being saved here is the full C stack, so in addition to the space for the registers you also need to allocate memory for the stacks to which you are going to swtich.

With a more constrained language, the compiler would be able to do cleaner analysis to save off only the subset of closed-over data. Most functions like this don't ever need to return all the way up to the thread top, they just work with a few local variables.

That said, saving/restoring a few hundred bytes of registers and allocating a page for a small stack are hardly "slow" operations, so I wouldn't view performance as a real issue here.

chmike · on Aug 21, 2013

Libcoro[1] is a C library providing the primitives to implement yield.

[1] http://software.schmorp.de/pkg/libcoro.html