In a register-allocating one-pass backend you necessarily decide the register/stack map for a given label the first time you emit a branch that targets it. As a result, critical edges are never a problem for two-way branches, because the fall-through cannot be targeted directly by other branches: you always have a (possibly trivial) block for the fall-through edge with exactly one predecessor. Example:
// The jump_if emits shuffle/spill/fill code before the
// conditional jump to conform to the L1 register/stack
// map. If the L1 map hasn't been created yet, we instead
// take the intersection of our current map with a
// conservative live set (perhaps from the lexical scope)
// as L1's map, in which case there's nothing to emit
// except the jump itself.
jump_if(value, L1);
// The fall-through block starts here.
// The join emits shuffle/spill/fill code to conform
// our current map (which is equal to the L1 map) to
// the L2 map. It then backpatches any existing refs
// to L2 and updates L2's address to point here. As with
// jump_if, if this is the first ref to L2 we just set
// our current map as L2's map and emit no code.
join(L2);
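The bind-on-first-reference protocol described in the comments above can be modeled in a few lines. This is a toy sketch with hypothetical names (`RegMap`, `Label`, `jump_if`), not any particular backend's API; it only shows the map-binding and shuffle-diffing logic, eliding spills, fills, and actual code emission:

```rust
use std::collections::BTreeMap;

/// A register/stack map: variable name -> assigned location (e.g. "r0").
/// Hypothetical representation for illustration only.
type RegMap = BTreeMap<&'static str, &'static str>;

/// A forward-referenced label: its map is fixed by the first branch
/// that targets it.
#[derive(Default)]
struct Label {
    map: Option<RegMap>,
}

/// The moves needed to make `cur` conform to `target`: (var, from, to).
fn shuffle(cur: &RegMap, target: &RegMap) -> Vec<(&'static str, &'static str, &'static str)> {
    target
        .iter()
        .filter_map(|(v, &to)| {
            cur.get(v)
                .filter(|&&from| from != to)
                .map(|&from| (*v, from, to))
        })
        .collect()
}

/// Emitting a conditional branch to `label`: if the label already has a
/// map, return the shuffle code needed to conform to it. Otherwise the
/// current map, restricted to the conservative live set, *becomes* the
/// label's map, and there is nothing to emit but the jump itself.
fn jump_if(
    cur: &RegMap,
    live: &[&'static str],
    label: &mut Label,
) -> Vec<(&'static str, &'static str, &'static str)> {
    match &label.map {
        Some(m) => shuffle(cur, m),
        None => {
            label.map = Some(
                cur.iter()
                    .filter(|(v, _)| live.contains(*v))
                    .map(|(&v, &l)| (v, l))
                    .collect(),
            );
            Vec::new()
        }
    }
}

fn main() {
    let mut l1 = Label::default();
    let cur: RegMap = [("x", "r0"), ("y", "r1")].into_iter().collect();
    // First reference to L1: its map becomes {x: r0} (y is dead here),
    // and no shuffle code is emitted.
    assert!(jump_if(&cur, &["x"], &mut l1).is_empty());
    assert_eq!(l1.map.as_ref().unwrap().get("x"), Some(&"r0"));
    // A later branch to L1 with x now in r1 must shuffle x back to r0.
    let cur2: RegMap = [("x", "r1")].into_iter().collect();
    assert_eq!(jump_if(&cur2, &["x"], &mut l1), vec![("x", "r1", "r0")]);
}
```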
Multi-way branches (typically from switch jump tables) with critical edges are much rarer, and a one-pass backend can handle them by splitting on demand (inserting jump pads) when a multi-way branch target has already been referenced and had its map assigned. That's rare enough in practice that doing something more global to decide where to place the splits probably isn't worth it in a one-pass backend.
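The on-demand split can be sketched as follows. This is a hypothetical model (the names `TableEntry` and `make_entry` are invented for illustration): shuffle code can't go before the multi-way branch itself, since that would affect every other edge, so an entry whose target already has a conflicting map gets routed through a freshly emitted pad that shuffles and then jumps:

```rust
/// One jump-table entry in a hypothetical one-pass backend.
#[derive(PartialEq, Debug)]
enum TableEntry {
    /// Maps already agree: the entry can target the label directly.
    Direct(&'static str),
    /// Critical edge: the entry targets a pad that shuffles, then jumps.
    Pad {
        moves: Vec<(&'static str, &'static str)>, // (from, to) register moves
        target: &'static str,
    },
}

/// `moves` stands in for the shuffle computed by diffing our current map
/// against the target label's already-assigned map (assumed done elsewhere).
fn make_entry(
    target: &'static str,
    moves: Vec<(&'static str, &'static str)>,
) -> TableEntry {
    if moves.is_empty() {
        TableEntry::Direct(target) // maps agree: no split needed
    } else {
        TableEntry::Pad { moves, target } // split on demand via a jump pad
    }
}

fn main() {
    // An edge whose maps already agree needs no pad...
    assert_eq!(make_entry("L3", vec![]), TableEntry::Direct("L3"));
    // ...while a conflicting edge is routed through a shuffle-then-jump pad.
    assert_eq!(
        make_entry("L4", vec![("r1", "r0")]),
        TableEntry::Pad { moves: vec![("r1", "r0")], target: "L4" }
    );
}
```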
Integrating regalloc with a more streaming approach is something we may want to look at for our backend in the future -- for now we have a more conventional regalloc that extracts live ranges and does either a backtracking allocation or linear scan. That happens between lowering, where the empty critical-edge blocks are created, and final emission, where the branch peepholing happens and empty blocks disappear -- so the blocks exist during regalloc if needed. Anyway, there's an interesting design space here!
Yeah, I'm happy to see more experimentation in the lightweight compiler space! My one-pass pipeline does forward greedy regalloc integrated with the approach I outlined in my post.
The next sweet spot seems to be a two-pass pipeline, so you get a full forward and a full backward dataflow pass: the forward pass does SSA generation integrated with the forward dataflow analyses, and the backward pass does machine code generation integrated with backward regalloc and dead code elimination. SSA with backward codegen/regalloc gives you an optimal coloring, since the backward order for structured code matches the reverse postorder of the corresponding reducible CFG, and the greedy backward approach gives you decent splitting/coalescing for free. This is basically LuaJIT's approach, but it can be extended beyond traces to reducible control flow if you replace the linear trace ordering with a dominator tree ordering. With two passes you should also be able to do something like Cliff Click-style global code motion (as used in the HotSpot server compiler), with the early-schedule/late-schedule steps integrated into the two passes.
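The backward-greedy idea can be sketched over a straight-line SSA trace. This is a toy model under strong assumptions (linear code, one def per value, no spilling, no register constraints); a real allocator handles all of those. Scanning in reverse, the first time a value appears as a use is its *last* use in forward order, so that's where its register is allocated; reaching its definition ends the (backward) live range and frees the register:

```rust
use std::collections::HashMap;

/// One SSA instruction: defines `def` (if any) from `uses`.
/// Hypothetical IR for illustration; values are just numbered.
struct Inst {
    def: Option<u32>,
    uses: Vec<u32>,
}

/// Backward greedy allocation over a linear (trace-like) ordering.
fn alloc_backward(code: &[Inst], num_regs: usize) -> HashMap<u32, usize> {
    let mut assigned: HashMap<u32, usize> = HashMap::new();
    let mut free: Vec<usize> = (0..num_regs).rev().collect(); // pop() yields r0 first
    for inst in code.iter().rev() {
        if let Some(d) = inst.def {
            // Going backward, the def ends the live range: the value is
            // dead above its def, so its register can be recycled.
            if let Some(&r) = assigned.get(&d) {
                free.push(r);
            }
        }
        for &u in &inst.uses {
            // First (reverse-order) sighting of a use = last use forward:
            // allocate here.
            assigned
                .entry(u)
                .or_insert_with(|| free.pop().expect("a real allocator would spill here"));
        }
    }
    assigned
}

fn main() {
    // v0 = ...; v1 = neg v0; v2 = add v0, v1
    let code = [
        Inst { def: Some(0), uses: vec![] },
        Inst { def: Some(1), uses: vec![0] },
        Inst { def: Some(2), uses: vec![0, 1] },
    ];
    let a = alloc_backward(&code, 2);
    // Two registers suffice: v0 and v1 are simultaneously live at the add.
    assert_eq!(a[&0], 0);
    assert_eq!(a[&1], 1);
    assert!(!a.contains_key(&2)); // v2 is never used, so it never gets a register
}
```

Note that freeing the def's register before visiting that instruction's uses lets a use operand inherit the def's register, which is where the free coalescing in the backward approach comes from.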
The main limitation of a pure two-pass approach is that you can't do precise SSA generation for loops at the same time as the forward dataflow pass, so you have to be conservative. That blocks most LICM opportunities, so an extra micro-pass per loop is easy to justify: either an SSA pre-pass per loop, or something like LuaJIT's loop peeling, where you fully peel off the first iteration while identifying loop-invariant variables before processing the loop a second time. LICM then happens automatically via CSE, since the peeled iteration dominates the loop body.
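The LICM-via-peeling effect can be illustrated with a minimal hash-consing value numberer. This is a sketch with invented names (`Expr`, `Cse`), not LuaJIT's actual implementation: because the peeled first iteration dominates the loop body, an invariant expression is freshly computed in the peeled copy, and the body's identical copy simply CSEs to it, so it is only ever computed before the loop:

```rust
use std::collections::HashMap;

/// A tiny expression IR. An op over loop-invariant operands hashes the
/// same in the peeled iteration and in the loop body.
#[derive(Hash, PartialEq, Eq, Clone, Debug)]
enum Expr {
    Var(&'static str),
    Const(i64),
    Op(&'static str, Box<Expr>, Box<Expr>),
}

/// Hash-consing value numbering: `number` returns the value number plus
/// whether the expression had to be freshly computed.
struct Cse {
    table: HashMap<Expr, usize>,
}

impl Cse {
    fn new() -> Self {
        Cse { table: HashMap::new() }
    }
    fn number(&mut self, e: Expr) -> (usize, bool) {
        if let Some(&n) = self.table.get(&e) {
            return (n, false); // already computed in a dominating position
        }
        let n = self.table.len();
        self.table.insert(e, n);
        (n, true)
    }
}

fn main() {
    let mut cse = Cse::new();
    let invariant = Expr::Op("mul", Box::new(Expr::Var("x")), Box::new(Expr::Const(4)));
    // Peeled first iteration: the invariant expression is computed once...
    let (n_peel, fresh_peel) = cse.number(invariant.clone());
    assert!(fresh_peel);
    // ...and the loop body's copy CSEs to the dominating peeled copy,
    // which is exactly the LICM effect.
    let (n_body, fresh_body) = cse.number(invariant);
    assert!(!fresh_body);
    assert_eq!(n_peel, n_body);
}
```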
For those not in the know, cranelift is primarily used as a wasm compiler backend, but there is also work being done on integrating it into the rust compiler.
You can be 1% faster and still be 'significantly' faster: 'significantly' just tells you about the variance of your samples. What matters is whether it's usefully faster.
I've been trying to catch myself claiming something is "significantly faster" and instead say it's "substantially faster" for this reason. As far as I know, "substantial" doesn't have the same kind of formal definition, and still captures the informal point I was trying to make.
I wish that we (collectively) did a better job of reporting statistically significant performance comparisons, and in that light using "substantially" is a reminder to myself to ask whether it would really be that hard to go the extra mile and collect the data to report significance properly.
Indeed - it was just recommended to me over on Reddit to try it out as the backend for my “rust as a scripting language” shebang header for faster first-time compiles.
If maximizing streaming behavior, minimizing indirection, and minimizing passes over data is what you're looking for, you could do worse than to read papers from the '50s and '60s. Expensive memory meant tiny memories, which pretty much forced streaming, and high latencies pretty much forced minimizing passes. (I've read about a two-pass PDP assembler that required physically feeding the program tape through the reader twice! Tape for data also meant that indirection was minimized, because everyone took great pains to bring relevant data together -- which, come to think of it, was probably already habitual from the card era.)