The culprit was a shared atomic reference counter (Arc). The resulting cache-line invalidation across cores eliminated any scaling from multiple threads.
The solution is to stop sharing. The author had to make changes specific to this codebase for each thread to have its own copy of the offending dataset.
The issue is that Rust doesn't allow static borrowing across thread boundaries (except for globals), which necessitates cloning the Arc. This is why they chose instead to copy the contents and use thread locals (I may be wrong; I only skimmed the article).
I've had this complaint for over a year, and nobody is addressing it. Arcs are absolutely ubiquitous, and this can come back and bite at any time; it's very hard to know upfront where a bottleneck is, and most people will never profile to this extent.
I really wish Rust would reconsider the decision to throw its hands up at non-lexical destruction guarantees. This is related to the thread::scoped removal from 2015 (IIRC), after which mem::forget was made safe. Anyway, I'm rambling. There's tons to read for the curious.
The simplest mitigation right now is to clone Arcs only when necessary, and instead pass references to those Arcs (or to their contents) whenever possible, especially on hot paths. It's a mess for refactors and easy to miss, but better than nothing.
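A minimal sketch of that mitigation (hypothetical types, not the article's code): clone the Arc once per thread, then borrow it on the hot path, so the shared refcount cache line is touched a handful of times instead of once per call.

    use std::sync::Arc;
    use std::thread;

    struct Dataset { values: Vec<u64> }

    // Hot path takes a plain reference: no refcount traffic at all here.
    fn sum(data: &Dataset) -> u64 {
        data.values.iter().sum()
    }

    fn main() {
        let data = Arc::new(Dataset { values: (0..1_000).collect() });
        let handles: Vec<_> = (0..4).map(|_| {
            // One clone per thread (one atomic increment), not one per call.
            let data = Arc::clone(&data);
            thread::spawn(move || {
                (0..1_000).map(|_| sum(&data)).sum::<u64>()
            })
        }).collect();
        let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("{total}");
    }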
> I really wish Rust would reconsider the decision to throw its hands up at non-lexical destruction guarantees. This is related to the thread::scoped removal from 2015 (IIRC), after which mem::forget was made safe. Anyway, I'm rambling. There's tons to read for the curious.
I think the original argument, that safe code can leak memory by building a reference cycle, is still valid. The language would have to change significantly to prevent that, so either reference counting must be unsafe, or we accept that destruction cannot be guaranteed just because something goes out of scope. That is why `forget` is safe.
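A minimal illustration of that argument (toy types): entirely safe code builds an Rc cycle, so neither destructor ever runs and both allocations leak.

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        next: RefCell<Option<Rc<Node>>>,
    }

    fn main() {
        let a = Rc::new(Node { next: RefCell::new(None) });
        let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
        // Close the cycle: a -> b -> a. Both strong counts stay at 2,
        // so neither allocation is ever freed, entirely in safe code.
        *a.next.borrow_mut() = Some(Rc::clone(&b));
    }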
The general solution here is to use scopes instead of RAII to ensure destruction. Rust lost its thread::scoped libstd API because that was based on RAII, but there are multiple crates in the ecosystem that provide the lexical, scope-based solution.
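For example, a sketch using the crossbeam crate (one such ecosystem crate, assumed as a dependency): the lexical scope, rather than RAII on a guard value, is what guarantees the spawned threads are joined before the borrowed data can be dropped.

    fn main() {
        let data = vec![1u64, 2, 3, 4];

        crossbeam::scope(|s| {
            for chunk in data.chunks(2) {
                // Borrows `data` from the enclosing stack frame; the scope
                // guarantees these threads finish before `data` can drop.
                s.spawn(move |_| {
                    let sum: u64 = chunk.iter().sum();
                    println!("{sum}");
                });
            }
        })
        .unwrap();
    }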
This is indeed the reason often cited, but it's more nuanced than that. If cycle-forming ref types had a lifetime parameter, you could guarantee destruction at the lifetime level, a weaker but highly useful guarantee. In fact, one proposal at the time was to add a lifetime parameter to Arc and Rc, but it was rejected because it had such a big impact and the team was stressed about the 1.0 release.
If you look at the various crates that offer static borrowing across threads today, they use lexical scope and they are perfectly safe and guarantee destruction IFF the inner scopes terminate (which is exactly the guarantee you want). However, they are too restrictive to be useful.
I don't know why you're being downvoted. Memory reclamation is hard in the face of parallelism. It's much easier to design lock-free algorithms in languages with a garbage collector, because memory reclamation issues are taken care of for you (and often in a batched manner, which won't affect the hot path of your code). There _are_ techniques to avoid using a garbage collector: atomic reference counting, epoch-based reclamation, hazard pointers, and more. But they can be quite difficult to implement and difficult to use.
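For a flavor of the epoch-based approach, a hedged sketch with the crossbeam-epoch crate (assumed as a dependency; the names are illustrative): a retired object isn't freed immediately, its destruction is deferred until no pinned thread can still reach it.

    use crossbeam_epoch::{self as epoch, Atomic, Owned};
    use std::sync::atomic::Ordering;

    fn main() {
        let slot = Atomic::new(1u64);

        // Pin the current thread to the current epoch.
        let guard = epoch::pin();
        // Publish a new value and get back the old pointer.
        let old = slot.swap(Owned::new(2u64), Ordering::AcqRel, &guard);
        unsafe {
            // Don't free `old` now: another pinned thread might still read it.
            // Destruction is deferred until all earlier epochs have passed.
            guard.defer_destroy(old);
        }
        drop(guard);
        // (The value still in `slot` is deliberately leaked in this sketch.)
    }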
There are very few GCs that are actually lock-free. In an on-the-fly collector, for example, uncooperative mutators cannot block other mutators, but the collector is blocked until all mutators cooperate. Nonetheless, I'd guess that a sufficiently concurrent collector is good enough.
Right. Plus GCs can compact memory, which we can't do with manual memory management. I don't understand why people fetishize manual memory management --- do you want to spend your short time on Earth twiddling references?
After 28 years of programming in C and 25 years programming in C++ (on and off), I am sick to death of managing memory. Programmers can't do it right or well. Let the super-intelligent idiot savant with billions of cycles of computation per second and trillions of bits of storage do the thinking for me.
Personally, I'm glad the world is moving away from unsafe manual memory management in C++, be that with better RAII and managed pointers, Rust's type system, etc. But those things are still breakable and, IMHO, ultimately a big waste of people's time. If Rust's lifetime annotations could be inferred by a compiler, then by all means. But if they can't, just GC already. All that extra thinking pushes people to think about the wrong things, and designs end up ossified, difficult to refactor because of the extra overhead of radically changing ownership. Forget it already! Let computers figure out the whole memory management problem.
Want your program to go fast? Don't allocate a bunch of garbage. Redesign it to reuse data structures carefully. Don't burden yourself forever twiddling with managing the large and small alike, obsessing over every dang byte.
> Let the super-intelligent idiot savant with billions of cycles of computation per second and trillions of bits of storage do the thinking for me.
Unfortunately, it can't. To be fair, GC does manage memory very well, but there is more to memory than handling allocation.
It becomes immediately obvious when you compare Go to Rust. Go strives for simplicity. Compare Go and Rust code, and Rust's overhead for making memory handling safe (type annotations, the type system exploding because it carries information about so many kinds of ownership) makes it horribly complex compared to Go. Go code expresses the same concepts in far fewer tokens, and is just as memory safe as the Rust code, provided there is only one thread.
Add a second thread, and the Rust code will continue to work as before. The Go code? Well, the GC manages memory allocation only; it does _not_ arbitrate how multiple threads access those objects. Screw up how multiple Go threads access that nice GC-managed memory and undefined, non-deterministic behaviour is the inevitable result. That IMO is the _worst_ type of bug. Most of the overhead Rust imposes (like ownership) has very little to do with managing memory lifetimes. It is about preventing two threads from stomping on each other. Managing memory lifetimes just comes for free.
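To make that concrete, a toy sketch (not from the article): to mutate one value from two threads, Rust's types force you to combine shared ownership (Arc) with arbitrated access (Mutex); drop either piece and you get a compile error rather than a heisenbug.

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // Arc = shared ownership across threads, Mutex = arbitrated access.
        let counter = Arc::new(Mutex::new(0u64));

        let handles: Vec<_> = (0..2).map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    *counter.lock().unwrap() += 1;
                }
            })
        }).collect();

        for h in handles {
            h.join().unwrap();
        }
        println!("{}", *counter.lock().unwrap()); // always 20000
    }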
I suspect the rise and rise of Rust is mostly due to its concurrency guarantees. Go back a couple of decades, when multiple CPUs were almost non-existent, and I suspect the complexity Rust imposes would have been laughed out of the room, given that all it really gave you was a very complex alternative to GC, which is what you are effectively saying is all it provides. Nowadays we have $1 computers with multiple ARM cores in a disposable COVID testing kit. Once you've been hit a couple of times by a concurrency heisenbug taking out your products in the field, you are screaming out for some tool to save your arse. Rust is one such tool.
Yeah, I recognize that ownership helps with concurrency, and TBH most of the code I write isn't highly concurrent, especially at a fine-grained level. Nevertheless, I think all that ownership is going overboard to solve what is in effect a niche problem. (After all, 30 years ago, engineers at Sun architected the Solaris kernel with very carefully designed fine-grained locking using little more than C. "Solaris Internals" is still highly recommended.)
There are a lot of other strategies for writing concurrent systems: for example, making most data immutable, coarse-grained sharing, shared-nothing multi-process architectures like actors, using race detectors and/or thread sanitizers, and using good concurrent libraries (with lock-free hashtables and other data structures).
The ratio of all other kinds of bugs causing catastrophic failure to concurrency bugs causing catastrophic failure has gotta be at least 100 or maybe 1000 to 1. So forcing everyone to think about ownership because maybe they are writing concurrent code (then again maybe they aren't) so that "congrats your memory management problems are solved" seems like a Pyrrhic victory: you've already blown their brain cells on the wrong problem. Worse, you've forced them to bake ownership into the seams in every part of the system, making it more difficult to refactor and evolve that system in the future. [1]
> Screw up how multiple Go threads access that nice GC-managed memory and undefined, non-deterministic behaviour is the inevitable result.
You get race conditions; you don't get undefined behavior (nasal demons). Go and Java have memory models that ensure your program at least doesn't violate the source type system, and you don't get out-of-thin-air results. You get tearing and deadlocks and race conditions, not undefined behavior. At Google we used TSAN extensively in Chrome and V8; similar tools exist for Java, and I assume for Go as well.
It's a broader conversation, but I don't think advocating that everyone write highly concurrent programs with shared mutable state, no matter what tools they have at their disposal, is pushing in the right direction. Erlang, actors, sharing less, immutable data, and finding better higher-level parallel programming constructs should be what we focus our and others' precious brain cycles on, not getting slapped around by a borrow checker or thread sanitizer. We gotta climb out of this muck one of these decades.
[1] If you are instead saying that people should think about the design of their system from the beginning so that they don't have to refactor as much, then I agree; but just don't waste time thinking about ownership, or worse, architecting ownership into the system; it will alter the design.
> So forcing everyone to think about ownership because maybe they are writing concurrent code (then again maybe they aren't) so that "congrats your memory management problems are solved" seems like a Pyrrhic victory: you've already blown their brain cells on the wrong problem.
https://manishearth.github.io/blog/2015/05/17/the-problem-wi... argues that "[a]liasing with mutability in a sufficiently complex, single-threaded program is effectively the same thing as accessing data shared across multiple threads without a lock". This is especially true in Qt apps which launch nested event loops, which can do anything and mutate data behind your back, and C++ turns it into use-after-free UB and crashing (https://github.com/Nheko-Reborn/nheko/issues/656, https://github.com/Nheko-Reborn/nheko/commit/570d00b000bd558...). I find Rust code easier to reason about than C++, since I know that unrelated function calls will never modify the target of a &mut T, and can only change the target of a &T if T has interior mutability.
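A small sketch of that reasoning guarantee (toy names): behind a unique &mut u32, nothing a callee does can change the value; behind a shared &Cell<u32> (interior mutability), the signature itself warns you that it can.

    use std::cell::Cell;

    // Holding the only &mut means nothing `log` does can change `counter`.
    fn bump(counter: &mut u32, log: impl Fn()) {
        let before = *counter;
        log();
        *counter = before + 1;
    }

    // With interior mutability, `log` may mutate `counter` through another
    // alias, and this function may clobber that write.
    fn bump_shared(counter: &Cell<u32>, log: impl Fn()) {
        let before = counter.get();
        log();
        counter.set(before + 1);
    }

    fn main() {
        let mut a = 0;
        bump(&mut a, || {});

        let b = Cell::new(0);
        bump_shared(&b, || b.set(41));
        println!("{a} {}", b.get()); // prints "1 1": the callee's write was lost
    }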
Nonetheless the increased complexity of Rust is a definite downside for simple/CRUD application code.
On the other hand, when a programmer does write concurrent code with shared mutability (in any language), in my experience, the only way they'll write correct and understandable code is if they've either learned Rust, or were tutored by someone at the skill level of a Solaris kernel architect. And learning Rust is infinitely more scalable.
Rust taught me to make concurrency tractable in C++. In Rust, it's standard practice to designate each piece of data as single-threaded, shared but immutable, atomic, or protected by a mutex, and to separate single-threaded data and shared data into separate structs. The average C++ programmer who hasn't studied Rust (e.g. the developers behind FamiTracker, BambooTracker, RtAudio, and RSS Guard) will write wrong and incomprehensible threading code which mixes atomic fields, data-raced fields, and accessing fields while holding a mutex, sometimes only holding a mutex on the writer but not the reader, sometimes switching back and forth between these modes ad hoc. Sometimes it only races on integer/flag fields and works most of the time on x86 (FamiTracker, BambooTracker, RtAudio), and sometimes it crashes due to a data race on collections (https://github.com/martinrotter/rssguard/issues/362).
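A sketch of that practice (a hypothetical audio-ish example, not taken from any of those projects): each field's concurrency role is stated in the type, and the thread-local data lives in a separate struct that is never shared.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, Mutex};

    // Reachable from multiple threads; each field's role is explicit.
    struct Shared {
        stop_requested: AtomicBool, // lock-free atomic flag
        queue: Mutex<Vec<f32>>,     // protected by a mutex
        sample_rate: u32,           // immutable after construction
    }

    // Owned by the worker thread only; never shared, so no synchronization.
    struct WorkerLocal {
        scratch: Vec<f32>,
        frames_rendered: u64,
    }

    fn worker(shared: Arc<Shared>, mut local: WorkerLocal) {
        while !shared.stop_requested.load(Ordering::Acquire) {
            local.scratch.resize(shared.sample_rate as usize / 100, 0.0);
            local.frames_rendered += local.scratch.len() as u64;
            shared.queue.lock().unwrap().extend_from_slice(&local.scratch);
        }
    }

    fn main() {
        let shared = Arc::new(Shared {
            stop_requested: AtomicBool::new(false),
            queue: Mutex::new(Vec::new()),
            sample_rate: 48_000,
        });
        let local = WorkerLocal { scratch: Vec::new(), frames_rendered: 0 };

        let handle = std::thread::spawn({
            let shared = Arc::clone(&shared);
            move || worker(shared, local)
        });
        shared.stop_requested.store(true, Ordering::Release);
        handle.join().unwrap();
    }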
Though I can't find it right now, there was a recent paper that did do some effective copying/compaction under a normal malloc/free regime. It did so by noting that once two pages were getting somewhat empty, and the allocations within them didn't overlap, it could use virtual memory to map them onto the same physical page. I believe it claimed effective results, though I cannot recall the scale.
What language wouldn't hit this issue? The second you have an atomic write across multiple NUMA clusters, you'll hit this.
Edit: it appears they're talking specifically about the ref counting, whereas I was considering the entirety of a shared context. That clarification helps me understand where their statement was coming from.
Wood for trees. The problem was caused by an ill-thought-out design. We can do similarly performance-degrading things in GC languages; just the details will be different. However, at some extreme, which, to be fair, most systems won't hit, GC languages perform vastly worse than non-GC languages. In one service I own, Java's G1GC uses 20x more CPU than Rust for an application-specific benchmark. Most of this time is spent in the concurrent phase, so Shenandoah and GenShen aren't going to make a dent (and we can't afford the RAM for Shenandoah). 20x CPU and 2x wall clock. The question we are looking at is, "Should we just continue to spend 20x $ on operating costs for the Java version to avoid writing it in Rust?"
> How would the GC avoid the atomic lock and cache invalidation across numa boundaries?
By not using reference counting. State-of-the-art GCs don't count references. They usually do mark and sweep, implement multiple generations, and/or do a few other things.
Most of that overhead only happens while collecting. Merely referencing an object from another thread doesn’t modify any shared cache lines.
> What language has a sufficiently lock less rw capable GC?
> I was thinking about a fully mutable context shared across threads
A quote from the article: “No locks, no mutexes, no syscalls, no shared mutable data here. There are some read-only structures context and unit shared behind an Arc, but read-only sharing shouldn’t be a problem.” As you see, the data shared across threads was immutable.
However, the library they picked was designed around Rust's ref-counting Arc<> smart pointers. Apparently, for some other use cases not needed by the OP, that library needs to modify these objects.
> I can see how a GC would solve it by not needing the RC
Interestingly enough, C++ would also solve that. The language does not stop programmers from changing things from multiple threads concurrently. For this reason, very few libraries have their public APIs designed around std::shared_ptr<> (the C++ equivalent of Rust's Arc<>). Instead, what usually happens is that library authors write things in the documentation like "the object you pass to this API must be thread safe" and "it's your responsibility to make sure the pointer you pass stays alive for as long as you're using the API", and call it a day.
> To be fair, anything you can do in C++ can be done in Rust.
Technically, all programming languages are Turing-complete. Practically, various things can affect development cost by an order of magnitude. The OP acknowledges that; they wrote "Rewriting Rune just for my tiny use case was out of the question".
Just because something can be done doesn't mean it's a good idea to do that. Programming is not science, it's engineering, it's all about various tradeoffs.
> The language just steers you away
Such steering caused unique performance issues missing from both safer garbage-collected languages and unsafe C++.
Note the OP was lucky to be able to work around the problem by cloning the data. If those context or unit objects used a gigabyte of RAM, that workaround probably wouldn't work; too much RAM overhead.
Your comment said that C++ would solve it; I was merely pointing out that Rust can solve it identically to C++ by sidestepping the borrow checker. You can do so without any performance penalty, and the code would function identically to the way it does in C++.
You're talking about the specifics of this library. As you said in your original comment, that would apply to any C++ library using shared_ptr across the API boundary too.
A different non-GC language wouldn't change things, because you'd have the exact same trade off if the same decision was made.
The only major difference is that Rust pushes you to Arc but C++ doesn't push you to a shared_ptr.
Edit: another comment explains that you're likely talking about just the ref counting aspect rather than the entire context sharing used by the rune code shown, in which case, yes I see why a concurrent GC would avoid the issue in that scenario.
----------
I'm familiar with Go's GC. Your linked post doesn't explain how it would avoid the hit from the cache invalidation across multiple clusters mentioned above.
It'll either try to put multiple goroutines on a single cluster (as listed in the link), or it'll need to copy the necessary stack per thread, which is effectively what the original article ends up doing.
But if you encounter anything that needs to run concurrently across threads while using a single r/w object, you'll hit the same cliff surely?
While this isn't true with Go's GC, a GC that's stop-the-world can avoid these cache coherency issues altogether. If you pause every thread before marking and sweeping, then you won't run into problems--only the GC is running. While this may sound silly, stopping the world will oftentimes lead to higher throughput than concurrent collectors, at the cost of higher latency (which is why it's not suitable for Go, which is often used for building servers where latency is a priority).
That's fair. It read to me, from his post, that Rune was using it for context sharing between threads as well (since that's the source of the highest-level Arc in his code example). If it's only for ref counts, then it makes sense that a concurrent GC could avoid the issue.
Immutable collections wouldn't really help here since Rust does not have a garbage collector, so they would still be wrapped in an `Arc<T>`, or use one internally. Unless you're giving each thread their own copy of the immutable collection, in which case the immutability doesn't really add much.
You're right if they're wrapped in an Arc<T>. But it's a lot easier to confidently clone an immutable data structure to threads/async functions because you know they won't be mutated by other cores. If you're cloning mutable data, you'll have to collect each "shard" at the end of whatever runtime loop your program has, and merge them back into the authoritative model. (Or drop the changes, but then why make the data mutable?) Gets hairy.
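A toy sketch of that shard-and-merge pattern (hypothetical, not from the article): each thread gets its own copy, works on it locally, and the results are merged back into the authoritative model afterwards.

    use std::thread;

    fn main() {
        // Authoritative model on the main thread.
        let model: Vec<u64> = (1..=8).collect();

        let handles: Vec<_> = model.chunks(2).map(|chunk| {
            let shard = chunk.to_vec(); // each thread gets its own copy
            thread::spawn(move || {
                // Work on the local shard freely; nothing is shared.
                shard.into_iter().map(|v| v * 2).collect::<Vec<u64>>()
            })
        }).collect();

        // Merge the shards back into the authoritative model at the end
        // of the "loop".
        let merged: Vec<u64> = handles.into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();
        println!("{merged:?}");
    }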