> I gave a talk (with transcript) about this Thanks. A fascinating history. Do k...

steveklabnik · on Sept 26, 2023

Thank you! Glad you enjoyed.

> I don't understand why C would care where you call it from.

The details here differ based on what kind of green threads you are implementing, but the core of it is, they're cheaper than regular threads because they do not use a normal stack. C expects a normal stack. Bridging this gap has a cost. You also have to manage the interaction between the GC and C, which can have a cost. If you're curious about specifics, one example of this is cgo: https://go.dev/src/runtime/cgocall.go Go has changed strategies here several times throughout its history (as did Rust when Rust had green threads), so you may find other information that's older as well.

rstuart4133 · on Sept 26, 2023

> one example of this is cgo:

I should reveal at this point I've created protected mode x86 OSes from scratch, written BIOSes and whatnot, all done in C, so I do know a bit about C and stacks.

As I expected there is nothing in cgo that suggests that C cares about a stack. That's not surprising as, with the exception of esoteric things like setjmp and backtraces, C doesn't care. You can happily malloc a block of memory and point the SP there, push the args and call a C function, and it will do its thing and return. It's vaguely possible the OS may get pissed off that the stack isn't where it thought it should be - but the user space C function won't notice.

What cgocall() (the function that handles Go calls to C) spends most of its time doing is telling the green thread scheduler what is happening. I'm guessing the reason for that is the C code is effectively code of a different colour - ie it's code that could be using blocking I/O calls. If the C function does block it won't stop just the green thread calling it, it will block all of them. I imagine that is not considered acceptable in Go. A workaround would be to move the green thread to a different native thread while the C function is running. Maybe that's what all that bookkeeping accomplishes. As you say, and as I can see in cgocall(), the overhead of the bookkeeping involved is literally orders of magnitude bigger than the overhead of the C call itself.

And as you also say, that overhead isn't acceptable for Rust. The solution Rust has implemented for async is effectively to ignore the problem, so if an async function calls a C function and that C function blocks, then every async task stops until that C function returns. It would have been a perfectly acceptable solution for green threads too. But I'm guessing the original Rust green thread implementation went for the Go "make the library hide the problem from the programmer" approach, and found itself stuck with a whole pile of overheads that ended up being unacceptable for a systems programming language.

If so, the solution wasn't to throw out green threads and adopt the async solution. That was akin to throwing the baby out with the bath water. The simple solution was to just take the async approach and make the issue of blocking C calls the programmer's problem, as opposed to hiding it with the runtime libraries.

If they had gone that route even handling blocking C calls could have been made relatively straightforward - just provide a library function that calls the function it's passed in its own thread. (Maybe async already provides a similar function now?) Effectively that lets the programmer choose when to take the C call overhead Go imposes on every call, and when to avoid it.
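
Something like the following is what I have in mind. It's only a sketch using the standard library, and `call_blocking` is a name I made up, not an existing API. It runs the closure on a fresh OS thread and hands back a channel for the result, so only that helper thread stalls if the call blocks.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: run a possibly blocking closure on its own OS thread
/// and hand back a channel that will receive the result.
pub fn call_blocking<F, T>(f: F) -> mpsc::Receiver<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Only this helper thread stalls if `f` blocks.
        // `send` fails only if the caller dropped the receiver, so ignore the error.
        let _ = tx.send(f());
    });
    rx
}
```

A caller would write `let rx = call_blocking(|| some_c_wrapper());` and pick up the value later with `rx.recv()`, where `some_c_wrapper` stands for whatever wraps the C function.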

Right now, it looks to me like my opening comment still stands - green threads (although not Rust's initial implementation) would have been a much better solution than async to the multitasking problem. At the 1000ft view, green threads and async are very similar. Both get their speed by using event driven I/O rather than blocking I/O, and thus avoid the overheads of OS task switching. The key difference is that where green threads store state on a separate stack (a technique so wonderfully efficient we use it everywhere), async stores it in a manually allocated block that must then have data copied into it, and later be freed. That manually allocated block creates a lot of overheads, both in code and at runtime, that green threads don't have.
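
To make the distinction concrete, here is a rough sketch with made-up names (`sum_two_direct`, `SumTwo`). The first function keeps its locals on the task's own stack across two reads. The second is the same logic written by hand as a state machine, roughly the shape an async function is turned into, where each suspension point has to carry along the locals still needed afterwards. It is an illustration of the idea only, not what the Rust compiler actually emits.

```rust
/// Green-thread style: `a` simply stays on the task's stack while the second
/// read is in progress. `read` is a stand-in for an I/O call that would suspend the task.
pub fn sum_two_direct(mut read: impl FnMut() -> u32) -> u32 {
    let a = read();
    let b = read();
    a + b
}

/// Async style, written out by hand: each suspension point becomes a state
/// holding the locals that must survive it.
pub enum SumTwo {
    Start,
    HaveFirst { a: u32 },
    Done,
}

impl SumTwo {
    /// Feed one completed read into the state machine.
    /// Returns the sum once both values have arrived, otherwise `None`.
    pub fn resume(&mut self, value: u32) -> Option<u32> {
        match std::mem::replace(self, SumTwo::Done) {
            SumTwo::Start => {
                // Save the first value inside the state so it survives the suspension.
                *self = SumTwo::HaveFirst { a: value };
                None
            }
            SumTwo::HaveFirst { a } => Some(a + value),
            SumTwo::Done => None,
        }
    }
}
```

An executor would keep a `SumTwo` somewhere (for instance in a `Box`) and call `resume` each time a read completes, whereas the direct version needs nothing beyond its stack.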

steveklabnik · on Sept 27, 2023

> As I expected there is nothing in cgo that suggests that C cares about a stack.

Okay well again, I'm trying to be very broad and vague here, because the details do actually matter but differ between systems. C in a general sense doesn't care, as you elaborate, sure, but because these stacks are so small, and C code doesn't know how to expand the stack (since there's no API to do so), you run the risk of overflowing the stack. So in practice, that stack usage does matter, and the way that you protect against this is to set up a regular sized stack, swap to it, and make the call. At least, in this specific implementation. http://manticore.cs.uchicago.edu/papers/pldi20-stacks-n-cont... talks about tradeoffs of six different ways of implementing this kind of thing, for example. (both Go and Rust tried the "segmented" strategy here and threw it out, for example.)

> (Maybe async already provides a similar function now?)

Many implementations provide a threadpool for you to throw blocking stuff onto, yes. That's up to the given runtime. But again, that's purely for the blocking semantics, it isn't about calling into C vs calling into Rust.
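
As a very rough illustration of the idea (a sketch using only the standard library; `BlockingPool` and its methods are hypothetical names, not any particular runtime's API), a fixed set of OS threads pulls closures off a shared queue and sends each result back over a channel:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical minimal pool: a fixed set of OS threads pulling jobs off a shared queue.
pub struct BlockingPool {
    sender: mpsc::Sender<Job>,
}

impl BlockingPool {
    /// Start `workers` threads (should be at least 1, or queued jobs never run).
    pub fn new(workers: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        for _ in 0..workers {
            let receiver = Arc::clone(&receiver);
            thread::spawn(move || loop {
                // The lock is held only while taking a job, not while running it.
                let job = receiver.lock().unwrap().recv();
                match job {
                    Ok(job) => job(),
                    Err(_) => break, // the pool was dropped, no more work will arrive
                }
            });
        }
        BlockingPool { sender }
    }

    /// Queue a blocking closure; the result arrives on the returned channel.
    pub fn spawn_blocking<F, T>(&self, f: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller no longer wants the result.
            let _ = tx.send(f());
        });
        self.sender.send(job).expect("worker threads have exited");
        rx
    }
}
```

Note that nothing here knows or cares whether the closure calls into C or into Rust; it only matters that it might block.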

Anyway if you truly want to understand this space I would encourage you to continue looking into it, but when it comes to demonstrated performance in the real world, the green thread strategy loses out. There are other great reasons to choose that model, but for Rust's systems language goals, as well as its performance goals, async/await is the only design that's made sense.

rstuart4133 · on Sept 27, 2023

Ahh, all those speculative words from me, and it turns out there is a Rust green thread implementation out there now. May: https://crates.io/crates/may

And it's included in a set of independent benchmarks of http servers written in a variety of languages: https://www.techempower.com/benchmarks/#section=data-r21&tes... May (and Rust) put in a very good showing there, may-minihttp taking out 2nd spot. Another Rust library, xitca-web, takes out 3rd spot. Neither may-minihttp nor xitca-web use async, but there are other Rust async implementations that come close to them. I'd call it a wash.

From that I'd say may's green thread implementation is on a par with async speed wise.

steveklabnik · on Sept 27, 2023

May is an unsound library; you can access TLS and it will cause UB, in purely safe code. I’m not familiar with the other one though, I’ll have to check it out, thanks!

rstuart4133 · on Sept 28, 2023

> you can access TLS and it will cause UB, in purely safe code.

Errrk. I was looking at using it (because async really does suck from a usability point of view compared to green threads). Do you have a link?

Hmmm. Is it TLS consuming too much stack? https://github.com/rust-lang/rust/issues/111272

That would be an issue for green threads. And other things, as I discovered when I took a brief look at the may code to see if they handled stack allocation. Turns out may doesn't handle it directly - the standard library (nightly) has a way of creating stacks for co-routines (generator::Gn). May's green threads are just co-routines, and the Rust nightly library provides the stack.

That means if it is the issue I linked to, it's a bit unfair to blame it on may. The same bug will manifest itself in any Rust nightly generator that uses TLS.

Probing further, generator::Gn creates its stack using stack::Stack, and stack::Stack allocates stacks using malloc. And yes, that guarantees stack overflow will cause UB of the worst sort because it just overwrites the next malloced block. Someone should look up "man 2 mmap" on Linux and BSD. Both have ways to create stacks that behave very nicely, including causing a hard fail if they overflow rather than UB. I presume Windows has a similar function.

To repeat the point I keep making: all these issues with green threads aren't intrinsic issues to the concept. They arise because the initial Rust implementation wasn't well designed, and not implemented particularly well either.

steveklabnik · on Sept 28, 2023

Another recent example of this problem: https://github.com/dotnet/runtimelab/issues/2398

rstuart4133 · on Oct 2, 2023

Looks like they made the same design decision as Rust's early green thread implementation. Quoting that link:

> The key benefit of green threads is that it makes function colors disappear and simplifies the programming model.

As a point of order, no, green threads don't make colours disappear. They can't, as the whole point is to run multiple tasks, so no green task can be allowed to make a blocking I/O call like native code does, so you have to redo every I/O library using non-blocking I/O. And thus green threads must use the non-blocking version of the library, aka differently coloured code.

Where green threads are different to async is that the language library can make the colouring disappear for green threads. It does that by, on every I/O call, checking if a green thread is making the call and switching between blocking and non-blocking I/O accordingly. That incurs a speed penalty of course. And it doesn't just hit green thread code, it slows down native threads too.
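
Roughly, the check looks like this. It's a sketch with hypothetical names (`ON_GREEN_THREAD`, `choose_io_path`), not any real runtime's code: a thread-local flag records whether the current OS thread is running green threads, and every I/O entry point consults it before deciding which path to take.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical flag a green thread scheduler would set on the OS threads it runs tasks on.
    static ON_GREEN_THREAD: Cell<bool> = Cell::new(false);
}

#[derive(Debug, PartialEq)]
pub enum IoPath {
    Blocking,
    NonBlocking,
}

/// Hypothetical hook the scheduler calls when it starts or stops running green threads here.
pub fn set_on_green_thread(on: bool) {
    ON_GREEN_THREAD.with(|flag| flag.set(on));
}

/// Stand-in for the check a library `read` would have to make on every call.
pub fn choose_io_path() -> IoPath {
    if ON_GREEN_THREAD.with(|flag| flag.get()) {
        IoPath::NonBlocking // would register with the event loop and yield to the scheduler
    } else {
        IoPath::Blocking // would issue a plain blocking syscall
    }
}
```

The point is that native threads pay for the thread-local lookup and branch too, even though they always end up on the blocking path.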

Looks like .NET decided that overhead is too high to bear. Fair enough - but that's a consequence of the decision to hide coloured code, not of green threads per se.

While you could do the same trick to hide blocking vs non-blocking for async code too of course, it wouldn't hide colouring. That's because async colours code in other ways too - for example it introduces a whole new call / return syntax. Unlike "not needing colours", not needing a new syntax is a real advantage of green threads over async. Another one is saving state on the stack rather than in a malloced block. (If writing function locals to a malloc'ed block was faster than pushing them on a stack, we would do it everywhere.)

rstuart4133 · on Sept 27, 2023

> http://manticore.cs.uchicago.edu/papers/pldi20-stacks-n-cont...

Odd they didn't compare the most common strategy used in practice, which is the one the Linux kernel uses. The technique is described in mmap(2), under the MAP_GROWSDOWN flag. Even if you allow for a 64 KB stack for each green thread, a 32-bit machine has enough virtual address space for thousands of stacks. If you need more, add an option to trim down the stack size.
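
The back-of-envelope arithmetic, as a sketch (`max_stacks` is just a made-up helper, and it ignores everything else that has to live in the address space, so real numbers would be lower):

```rust
pub const KIB: u64 = 1024;
pub const GIB: u64 = 1024 * 1024 * 1024;

/// Upper bound on how many fixed-size stacks fit in a given amount of
/// virtual address space, ignoring code, heap and guard pages.
pub const fn max_stacks(address_space_bytes: u64, stack_bytes: u64) -> u64 {
    address_space_bytes / stack_bytes
}
```

With the full 4 GiB of a 32-bit address space, `max_stacks(4 * GIB, 64 * KIB)` is 65,536. Assuming only 3 GiB is available to user space, which is my assumption about a typical 32-bit split, it is still 49,152.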

> But again, that's purely for the blocking semantics, it isn't about calling into C vs calling into Rust.

Yes, it's blocking semantics. But the reason given for abandoning green threads was that those calls from Rust to C were too slow in green threads, and the only reason I can see that would be is that the library is attempting to hide those blocking semantics by intercepting every C call. If it didn't there would be no speed disadvantage.

Yes, intercepting slows down the call by an order of magnitude. But there is another solution - don't intercept the calls, let the programmer handle it instead. That's the solution async adopts. If you are going to claim green threads are slower than async then it's only fair to compare apples with apples, and that means comparing implementations that do it the same way.

Mind you, it's purely a guess on my part that the old green threads implementation slowed C calls by intercepting them, so it's purely a guess we aren't comparing apples with apples. The guess is based on the fact there is no other reason green threads C calls should be slower, as C doesn't care one way or the other.

> There are other great reasons to choose that model, but for Rust's systems language goals, as well as its performance goals

I can't see what systems language goals would be broken by green threads - but then I'm not familiar with them. Apart from the C call thing, green threads should be faster as they are storing data on the stack rather than copying it into a manually allocated block. Since the C call thing looks to be a problem with the design choices of that early Rust green thread model, I don't trust the claim that an implementation of green threads that makes the same tradeoffs as async currently does would be slower. And green threads do provide a much cleaner API.

But I guess the response to my whinging at this point is "patches are welcome", or rather an appropriate green thread implementation.