Common register allocation across different bytecode interpretation sequences wa...

Common register allocation across different bytecode interpretation sequences was one of the things specifically on my mind that could be tuned using a high-level assembler.

Very suboptimal might be a slight overstatement. I can see a way, given known register calling conventions, that you could write an interpreter written as tail calls and post-process the machine code to effectively JMP instead of CALL. Guaranteed tail calls would save you a bunch of effort, and register calling convention would give you some guarantees about consistent allocation.