Worth noting that IIRC, for a while LuaJIT in interpreted mode was able to beat V8 in optimized mode not all that infrequently (although it depended on the use case, they are different languages, and I do not know if this is still the case).
Common register allocation across different bytecode interpretation sequences was one of the things specifically on my mind that could be tuned using a high-level assembler.
Very suboptimal might be a slight overstatement. I can see a way, given known register calling conventions, that you could write an interpreter written as tail calls and post-process the machine code to effectively JMP instead of CALL. Guaranteed tail calls would save you a bunch of effort, and register calling convention would give you some guarantees about consistent allocation.
http://article.gmane.org/gmane.comp.lang.lua.general/75426
Worth noting that IIRC, for a while LuaJIT in interpreted mode was able to beat V8 in optimized mode not all that infrequently (although it depended on the use case, they are different languages, and I do not know if this is still the case).