These are all tiny, targeted micro-optimizations worth a percent or three of benefit in specific tests. They're worth doing (or at least evaluating) in any mature product, and I have no complaint.
Nonetheless rustc remains really slow relative to other similar technologies, including C++ compilers. Is there any consensus as to why?
I mean, with C++, the answer is something to the effect of "template expansion happens syntactically and generally has to be expressed in headers, leading to many megabytes of code that has to be compiled repeatedly with every translation unit". And that isn't really amenable to microoptimization. We all agree that it sucks, and probably can't be fixed with the language as it's specified, and chalk it up to a design flaw.
What's the equivalent quip with rustc? I mean... is it going to get faster (in a real sense, not micro), or is it not? Is this fixable or not, and if not why?
> I mean, with C++, the answer is something to the effect of "template expansion happens syntactically and generally has to be expressed in headers, leading to many megabytes of code that has to be compiled repeatedly with every translation unit". And that isn't really amenable to microoptimization. We all agree that it sucks, and probably can't be fixed with the language as it's specified, and chalk it up to a design flaw.
Isn't this what happens with Rust too? It's called "monomorphization" there, but every time you use a Vec<i32> in Rust, I would guess that what happens is fairly similar to a template instantiation.
Most of the compilation time is spent in LLVM, since Rust generates a lot of IR. I haven't looked deeply into the issue, but it wouldn't surprise me if monomorphization combined with functional programming patterns is one cause of code bloat. For instance, every time you use `map` (or another HOF) with a closure, you end up with two generated functions per usage: one for the closure and one for `map` monomorphized for that specific closure.
Compare this to a `map` function which is only generic over the return type:
fn map<T>(self, f: &mut dyn FnMut(Self::Item) -> T)
This function would only be instantiated once per unique T, regardless of how many times you use it with different closures. This halves the number of trivial functions that LLVM needs to process and (usually) inline. The reason `map` isn't defined like that is that this version uses dynamic dispatch instead of static dispatch.
However, I don't actually know how much of a difference it would make, but it's my understanding that LLVM was never optimized for the use case where you have many tiny functions (which is more common in functional languages).
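To make the dispatch trade-off concrete, here's a free-standing sketch of such a dynamically-dispatched `map` (this is not how the standard library defines `Iterator::map`; the name `map_dyn` and the eager `Vec` return are illustrative choices):

```rust
// Hypothetical `map_dyn`: generic only over the iterator and the return
// type, taking the closure as a `dyn` trait object. Distinct closures with
// the same item/return types all share one instantiation of this body.
fn map_dyn<I: Iterator, T>(iter: I, f: &mut dyn FnMut(I::Item) -> T) -> Vec<T> {
    let mut out = Vec::new();
    for item in iter {
        out.push(f(item));
    }
    out
}

fn main() {
    // Two different closure types, but both calls reuse the same
    // map_dyn::<_, i32> machine code via dynamic dispatch.
    let doubled = map_dyn([1, 2, 3].into_iter(), &mut |x| x * 2);
    let squared = map_dyn([1, 2, 3].into_iter(), &mut |x| x * x);
    assert_eq!(doubled, vec![2, 4, 6]);
    assert_eq!(squared, vec![1, 4, 9]);
}
```

The cost, of course, is an indirect call per item that LLVM can't inline, which is exactly why the standard library chose static dispatch.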
The equivalent is doing more dead-code analysis at the MIR stage before sending it to LLVM. This is where rustc loses most of its perf: LLVM is usually 50-60% of execution time, mostly because rustc sends it reams of IR to chew through.
Serious question: is there any reason dead code detection would be any faster at the syntax level than the generated code? I mean, I'm no compiler expert but in a straightforward SSA representation "dead code" is just a register that gets assigned but never used, all of which can be detected and dropped globally with just one pass over the data.
I suspect the real reason is going to turn out more complicated, no? Like, the code isn't really "dead" yet, because we can't know if it will be used later without solving the halting problem.
Which, if you think about it, is pretty much exactly the situation C++ is in. All those expanded templates aren't "dead" yet at the point of expansion either, you just don't know if the code will use them. And in practice it won't, so you wasted a ton of cycles to find that out.
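For what it's worth, the "one pass over the data" idea from the question above can be sketched like this. The `Instr` shape and register numbers are made up for illustration (this is not LLVM's IR), and a real pass would iterate to a fixpoint, since removing one dead instruction can make its operands dead too:

```rust
use std::collections::HashSet;

// Toy single-sweep dead-code elimination over an SSA-ish instruction list:
// any register that is defined but never read (and not live-out) is dropped.
#[derive(Debug, PartialEq)]
struct Instr {
    dest: u32,      // register this instruction defines
    uses: Vec<u32>, // registers it reads
}

fn sweep_dead(instrs: Vec<Instr>, live_out: &[u32]) -> Vec<Instr> {
    // A register is live if it's in the live-out set or read somewhere.
    let mut used: HashSet<u32> = live_out.iter().copied().collect();
    for i in &instrs {
        used.extend(i.uses.iter().copied());
    }
    instrs.into_iter().filter(|i| used.contains(&i.dest)).collect()
}

fn main() {
    let instrs = vec![
        Instr { dest: 1, uses: vec![] },  // r1 = const
        Instr { dest: 2, uses: vec![1] }, // r2 = f(r1), live-out
        Instr { dest: 3, uses: vec![] },  // r3 = const, never read: dead
    ];
    let kept = sweep_dead(instrs, &[2]);
    assert_eq!(kept.len(), 2);
    assert!(kept.iter().all(|i| i.dest != 3));
}
```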
> Serious question: is there any reason dead code detection would be any faster at the syntax level than the generated code? I mean, I'm no compiler expert but in a straightforward SSA representation "dead code" is just a register that gets assigned but never used, all of which can be detected and dropped globally with just one pass over the data.
That is only one type of dead code elimination. It is not necessary to solve the halting problem to eliminate larger portions of code than just unused register assignments. There are plenty of ways to eliminate beyond that, such as proving that entire classes, functions, declarations, branches, loop iterations, and data are unreachable. Especially in a type-resolved format like MIR, some of these higher-level optimizations become really cheap and would allow LLVM to do a lot less work.
In general, you see inflation in code size as the representation becomes lower level and more explicit. Traversing a larger volume of code to detect dead code is generally more expensive than traversing a smaller volume of code. This might not be true if the deadness determination is really expensive, but it's almost always worth doing some kind of lightweight dead code elimination before each stage of lowering to save time in the subsequent stage.
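One of the cheaper whole-function eliminations mentioned above is plain call-graph reachability. Here's a toy sketch (the graph, the function names, and the idea of running this over a pre-lowering IR are all illustrative stand-ins, not rustc's actual pass):

```rust
use std::collections::{HashMap, HashSet};

// Toy reachability-based dead-function elimination: starting from the root
// (e.g. `main`), walk the call graph and keep only reachable functions.
// Anything unreachable never needs to be lowered or handed to LLVM at all.
fn live_functions(calls: &HashMap<&str, Vec<&str>>, root: &str) -> HashSet<String> {
    let mut live = HashSet::new();
    let mut stack = vec![root.to_string()];
    while let Some(f) = stack.pop() {
        if live.insert(f.clone()) {
            if let Some(callees) = calls.get(f.as_str()) {
                stack.extend(callees.iter().map(|c| c.to_string()));
            }
        }
    }
    live
}

fn main() {
    let mut calls = HashMap::new();
    calls.insert("main", vec!["helper"]);
    calls.insert("helper", vec![]);
    calls.insert("unused_instantiation", vec!["helper"]); // never called
    let live = live_functions(&calls, "main");
    assert!(live.contains("main") && live.contains("helper"));
    assert!(!live.contains("unused_instantiation")); // dropped before codegen
}
```

The key point is the cost model: this runs over a function count, while post-lowering DCE runs over an instruction count that is orders of magnitude larger.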
> I mean, with C++, the answer is something to the effect of "template expansion happens syntactically and generally has to be expressed in headers, leading to many megabytes of code that has to be compiled repeatedly with every translation unit".
That's not really the case in C++. Most code in headers is not compiled repeatedly in every translation unit. The vast majority of it is dead code, which compilers are very good at discarding at an early stage.
> We all agree that it sucks, and probably can't be fixed with the language as it's specified
A lot of the overhead can be fixed, and is fixed, via precompiled headers, which have been around for decades.
> What's the equivalent quip with rustc? I mean... is it going to get faster (in a real sense, not micro), or is it not? Is this fixable or not, and if not why?
At this point, rustc and C++ compiler performance problems mostly consist of little things that add up. As engineers we always want there to be a magic bullet that will solve the performance problems, but in mature technology like compilers, those magic bullets often don't exist.
Ironically, C++ often doesn't suffer from the "too much code" problem as badly, because in practice programmers don't write "modern C++" nearly as much as C++ fans think they do. Industrial C++ uses raw pointers and C idioms all the time, and these tend to compile faster than the equivalent "modern C++" idioms. Rust, on the other hand, forces you into the corresponding "modern" idioms, and the compilation time tends to be slower as a result.
To this end, probably the most fruitful area we can explore in rustc is fast MIR-level optimizations to reduce the amount of code that we send to LLVM. In effect, this would desugar the "modern" style to the "raw" C style before sending it off to the heavyweight optimization and code generation framework.
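To make the "modern vs. raw" contrast concrete, here's a sketch in Rust; the loop version is roughly the shape such a MIR pass would aim to lower the iterator chain to before LLVM sees it (neither function is rustc's actual output):

```rust
// "Modern" style: iterator adapters. In a debug build this produces many
// small closure and adapter functions for LLVM to process.
fn sum_modern(v: &[i32]) -> i32 {
    v.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

// "Raw" style: the desugared equivalent, far less IR per function.
fn sum_raw(v: &[i32]) -> i32 {
    let mut total = 0;
    let mut i = 0;
    while i < v.len() {
        let x = v[i];
        if x % 2 == 0 {
            total += x * x;
        }
        i += 1;
    }
    total
}

fn main() {
    let v = [1, 2, 3, 4];
    assert_eq!(sum_modern(&v), sum_raw(&v));
    assert_eq!(sum_raw(&v), 20); // 2*2 + 4*4
}
```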
In Swift, it's the combination of type inference and operator overloading that causes long compile times. The `+` operator, for example, can have a lot of implementations, and since types have to be inferred instead of being declared upfront, the combination creates slow compile times.
For example, there is this infamous error in Swift:
"Expression was too complex to be solved in reasonable time; consider breaking up the expression into distinct sub-expressions"
Exponential performance cliff compiling associated types with equality constraints[1]. It was first opened in 2015 and was only recently addressed. Rust certainly has the potential for exponential times when performing large inference.
IME compiling C++ in debug mode (-Og) is decently fast; faster than Rust.
Are there any languages with fast compilers that aren't designed from the start for fast compilation? When I think of "fast compilation" I think of Wirthian languages.
This particular article is about a Firefox developer deciding to spend a little bit of time applying some of his knowledge of profiling and optimization to the Rust compiler, without being a core compiler developer or being involved in any of the big ongoing performance-related initiatives.
So it's going to focus on a few micro-optimizations that someone can find by doing a little bit of profiling and making some small, isolated changes.
I think there are a couple of major reasons why rustc is slow, and then a lot of ones which just boil down to "we haven't gotten to that yet."
One of them is that rustc was originally written as a self-hosting compiler while the language itself was still in flux. So for the early years, there was a lot of churn just due to keeping up with the language changes, and just implementing the language, not being able to really focus on performance and optimizations.
Once the language stabilized, the compiler was not really the best written piece of code in Rust because it had been written before some features existed, before people had figured out idiomatic ways to do things in Rust, and because it had been written with the assumption that it could just work with the AST and a high-level IR and translate that directly to LLVM and let LLVM handle all of the optimizations.
Shortly after the 1.0 release, a big project began to define and use "MIR", a mid-level IR sitting between the high-level IR (which is much closer to surface syntax) and LLVM's low-level IR used for optimization and code generation. That project took a while, but it has finally been completed; it lays the groundwork for new language features, improvements to existing ones, and better optimization work, and it has already led to several major improvements in compiler performance.
Another issue is, pretty similar to the C++ issue; Rust has generics, which offer many of the same compilation issues as C++ templates. They are constrained a bit more than C++ templates, and don't have the SFINAE philosophy, so I think it's possible for the compiler to be smarter with those, avoid monomorphization in some cases, and so on, but it is still likely that in heavily generic code, you may not be able to do much better than heavily templated C++ code.
There are also some issues that are just holdovers from the fact that language evolution, correctness, and ease of development were prioritized over compiler performance in the initial versions; for instance, the desugaring of high-level constructs causes rustc to produce a lot of LLVM IR that LLVM winds up just optimizing away. Having rustc be smarter and produce less of it is a major optimization goal.
A few major examples of improvements that are underway or being considered: sharing generic code between crates (if two crates in your dependency graph each instantiate Vec<i32>, you shouldn't have to compile that twice), sharing closures between generic instances (if there's a function that depends on some generic type T, but a closure in it that does not, there's no need to duplicate that closure for both instances of that function), doing some inlining in MIR (lots of high-level Rust features depend on inlining to produce "zero-cost abstractions", so doing that earlier in the process may reduce the amount of work that LLVM has to do).
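The closure-sharing case can be illustrated with a minimal sketch (the function names are made up, and rustc's exact duplication behavior is the assumption being illustrated, not something this snippet can show at runtime):

```rust
use std::fmt::Debug;

// `log` below captures nothing and doesn't mention T, so in principle one
// compiled copy could serve both describe::<i32> and describe::<&str>.
// Under straightforward monomorphization it gets duplicated per instance.
fn describe<T: Debug>(value: T) -> String {
    let log = |prefix: &str| format!("[{}]", prefix); // independent of T
    format!("{} {:?}", log("value"), value)
}

fn main() {
    assert_eq!(describe(42), "[value] 42");
    assert_eq!(describe("hi"), "[value] \"hi\"");
}
```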
And then there are improvements for allowing for better parallelization and incremental compilation, better link time optimization, and so on.
tl;dr: it will get faster, in a real sense, not micro, but in the end it will be constrained by language design to be closer to the C++ compiler end of the spectrum than the Go end of the spectrum
It would be nice to be able to use something like Keith Winstein (Stanford) et al's gg.[1] Sort of `make -j1000` for 10 cents on Lambda. Linux kernel cold compilation in under a minute.
Clang has a profiler for why your C++ compiles slow. Last I saw templates were not responsible for much in a typical codebase.
For Rust, though, the answer is obvious - the borrow checker. The language is designed around the idea of the compiler proving whether or not the code is safe and inserting appropriate destructors where they should be. I think it's expected that this will be slower than a language that just doesn't do that.
The borrow checker is clearly not the issue, if you look at the numbers. It does take some time but is far from the most expensive section in every profile I’ve ever seen.
The NLL version is slower than today’s, at the moment, though.