The problem was caused by a memory management primitive. This wouldn't be an iss...

lowbloodsugar · on Dec 31, 2021

Wood for trees. The problem was caused by an ill-thought-out design. We can do similarly performance degrading things in GC languages, just the details will be different. However, at some extreme, which, to be fair, most systems won't hit, GC languages perform vastly worse than non GC languages. In one service I own, Java's G1GC uses 20x more CPU than Rust for an application specific benchmark. Most of this time is spent in the concurrent phase so Shenandoah and GenShen aren't going to make a dent (and we can't afford the RAM for Shendandoah). 20x CPU and 2x wall clock. The question we are looking at is, "Should we just continue to spend 20x $ on operating costs for the Java version to avoid writing it in Rust?"

dagmx · on Dec 31, 2021

How would the GC avoid the atomic lock and cache invalidation across numa boundaries? What language has a sufficiently lock less rw capable GC?

Edit: to clarify I was thinking about a mutable context share as shown in the code example, not solely about the ref counting.

Const-me · on Dec 31, 2021

> How would the GC avoid the atomic lock and cache invalidation across numa boundaries?

By not using reference counting. State of the art GCs don’t count references. They usually doing mark and sweep, implementing multiple generations, and/or doing a few other things.

Most of that overhead only happens while collecting. Merely referencing an object from another thread doesn’t modify any shared cache lines.

> What language has a sufficiently lock less rw capable GC?

Java, C#, F#, Golang.

dagmx · on Dec 31, 2021

Yeah I think my confusion was that I was thinking about a fully mutable context shared across threads based on the code example in the post.

But if it's just for the ref counting part of the Arc then I can see how a GC would solve it by not needing the RC

Const-me · on Dec 31, 2021

> I was thinking about a fully mutable context shared across threads

A quote from the article: “No locks, no mutexes, no syscalls, no shared mutable data here. There are some read-only structures context and unit shared behind an Arc, but read-only sharing shouldn’t be a problem.” As you see, the data shared across threads was immutable.

However, the library they have picked was designed around Rust’s ref.counting Arc<> smart pointers. Apparently for some other use cases, not needed by the OP, that library needs to modify these objects.

> I can see how a GC would solve it by not needing the RC

Interestingly enough, C++ would also solve that. The language does not stop programmers from changing things from multiple threads concurrently. For this reason, very few libraries have their public APIs designed around std::shared_ptr<> (C++ equivalent of Rust’s Arc<>). Instead, what usually happens, library authors write in the documentation things like “the object you pass to this API must be thread safe” and “it’s your responsibility to make sure the pointer you pass stays alive for as long as you using the API”, and call it a day.

dagmx · on Dec 31, 2021

To be fair, anything you can do in C++ can be done in Rust. The language just steers you away from it unless you enter into unsafe.

Const-me · on Jan 1, 2022

> To be fair, anything you can do in C++ can be done in Rust.

Technically, all programming languages are Turing-complete. Practically, various things can affect development cost by a factor of magnitude. The OP acknowledges that, they wrote "Rewriting Rune just for my tiny use case was out of the question".

Just because something can be done doesn't mean it's a good idea to do that. Programming is not science, it's engineering, it's all about various tradeoffs.

> The language just steers you away

Such steering caused unique performance issues missing from both safer garbage collected languages, and unsafe C++.

Note the OP was lucky to be able to workaround by cloning the data. If that context or unit objects would use a gigabyte of RAM, that workaround probably wouldn't work, too much RAM overhead.

dagmx · on Jan 1, 2022

Your comment said that C++ would solve it, I was merely pointing out that Rust can solve it identically to C++ by side stepping the borrow checker. You can do so without any performance penalties as well and the code would function identically to the way it does in C++.

Const-me · on Jan 1, 2022

> Rust can solve it identically to C++ by side stepping the borrow checker

Doing that is prohibitively expensive in this particular case. It would require patching a large third-party library who uses Arc for the API: https://docs.rs/rune/latest/rune/struct.Vm.html#method.new

And the reason that library uses Arc in the API is unique to Rust.

dagmx · on Jan 1, 2022

You're talking about the specifics of this library. As you said in your original comment, that would apply to any C++ library using shared_ptr too across the API boundary.

A different non-GC language wouldn't change things, because you'd have the exact same trade off if the same decision was made.

The only major difference is that Rust pushes you to Arc but C++ doesn't push you to a shared_ptr.

amelius · on Dec 31, 2021

GoLang. It was designed for highly concurrent loads. For an overview of history and technology involved, read e.g.:

https://blog.twitch.tv/en/2016/07/05/gos-march-to-low-latenc...

dagmx · on Dec 31, 2021

Edit: another comment explains that you're likely talking about just the ref counting aspect rather than the entire context sharing used by the rune code shown, in which case, yes I see why a concurrent GC would avoid the issue in that scenario.

----------

I'm familiar with Go's GC. Your linked post doesn't explain how it would avoid the hit mentioned by the cache invalidation across multiple clusters.

It'll either try and put multiple go routines on a single cluster (as listed in the link) or it'll need to copy the necessary stack per thread. Which is effectively what the original article ends up doing.

But if you encounter anything that needs to run concurrently across threads while using a single r/w object, you'll hit the same cliff surely?

YogurtFiend · on Dec 31, 2021

While this isn't true with Go's GC, a GC that's stop-the-world can avoid these cache coherency issues altogether. If you pause every thread before marking and sweeping, then you won't run into problems--only the GC is running. While this may sound silly, stopping the world will oftentimes lead to higher throughput than concurrent collectors, at the cost of higher latency (which is why it's not suitable for Go, which is often used for building servers where latency is a priority).

jhgb · on Dec 31, 2021

For one, I'd be surprised if Azul's C4 (Zing) didn't have these issues already solved (assuming anyone has solved that, they probably did).

yencabulator · on Dec 31, 2021

If the only purpose of the Arc was to be a reference count, a language with a (non-refcount) GC wouldn't need that atomic in the first place.

dagmx · on Dec 31, 2021

That's fair. It read to me, from his post, that Rune was using it for context sharing as well between threads (since that's the source of the highest level Arc in his code example) . If it's only for ref counts then it makes sense that a concurrent GC could avoid the issue