When pigs fly: optimising bytecode interpreters

0xcde4c3db · on March 20, 2019

Although the article is framed around language implementation, the underlying principle is also valid for interpreting CPU instruction sets. I read quite a few years ago that Apple at one point abandoned a particular 68K->PowerPC JIT because an optimized interpreter was faster at running real-world applications for cache-related reasons (keep in mind that this was the early 90s, when mainstream L1 caches were 4/8/16KB and L2 cache was off-chip).

near · on March 20, 2019

The problem I face with my CPU interpreters (I've written about 15 of them) is in the synchronization between components.

If a CPU ran on its own, then there's no end to our optimization potential. If it were a strict Harvard architecture where instructions could only run in immutable (ROM) areas, we could translate the instructions, remove all unnecessary flag calculations, and then statically recompile the result to a native program.

But when the CPU needs to talk to the audio chip, the video chip, the other audio CPU running alongside it, etc ... the reason everything ends up so slow is that there's no viable path for speeding up that synchronization.

You try and do it with real hardware resources: a real CPU core/thread for each emulated chip, and performance falls off a cliff. Even with simple atomic reads/writes for one thread/chip to set a "waiting on you" flag, and another thread/chip to clear it, modern CPUs can only do this around 100,000 times a second ... and that's before the overhead of your emulation. So, if you want to emulate two CPUs that run at more than 0.1 MIPS (which even the NES surpasses), you're out of luck.

So you try and do it with a single thread, but you get destroyed through context switches. You're in the middle of executing a 68K instruction, but that instruction writes to RAM that the Z80 can read from. You don't know if the Z80 is going to read there, so you have to run the Z80 until it's "ahead" in time to the 68K. Switching into the Z80 interpreter absolutely murders your performance.

Pretty much the primary key to optimizing CPU emulators is to synchronize less often. Making assumptions like, "it's very unlikely the Z80 is going to write to the CPU instruction stream in the middle of it executing instructions ... the 68K is most likely executing code in ROM anyway" and not synchronizing the Z80 when the 68K fetches an opcode or operand byte.

But these optimizations always come at a cost. When you get a library of thousands of programs designed to eke out every last drop of performance from old, legacy 2MHz hardware, there's always that one program ... either by design or by fluke, does something crazy and relies on something you optimized away as extremely improbable, and breaks as a result.

In my view, the way to optimize real-world emulation of machines is that we need to be able to spawn lots of 1 core = 1 thread objects, and have their synchronizations be as cheap as humanly possible. The cores do not have to be lightning fast, they just have to be able to synchronize as quickly and as cheaply as real code running on a 90s era 68K+Z80 machine (eg a Sega Genesis) is able to. Many, many bonus points if the "waiting on another chip" operation can result in sapping less power for that thread, without increasing the latency necessary for it to resume operating once the condition is met.

The industry has, for decades, been moving in the complete opposite direction, so I don't have a lot of hope. At this point, I think it's more viable (but still very unlikely) to put FPGAs into home computers for this purpose.

0xcde4c3db · on March 21, 2019

I don't think of your emulators (thanks, by the way) as strictly being interpreters so much as splitting the difference between interpreters and HDL simulators. In some ways it's like you do high-level behavioral simulation of internal logic and RTL at the bus/package. That whole problem space is a far cry from running some Mac applications.

Architecturally, I think some of the things you want exist in some unusual microcontroller families (XMOS, GreenArrays, Cortex-R), but then if you get to pick the microcontroller I guess you might as well also pick the clock so that you can run in real time instead of simulated time.

vkazanov · on March 20, 2019

That's very true.

In fact, I took one of the techniques (traces) directly from a paper describing how the Bochs x86 virtual machine works :-)

But it goes deeper than that. I believe the whole "trace" thing in both jits and vms comes from a few papers describing trace-based instruction predecoding for hardware CPUs.

blattimwind · on March 20, 2019

Yes I believe that's true, with the idea of a trace cache originating in the 90s to work around perceived I-cache limitations, "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching" is the seminal paper on it.

cfallin · on March 20, 2019

The idea of straight-line traces goes back even further, to Josh Fisher's work on trace scheduling in compilers in the early 80s (linearizing one control-flow path gives a much wider scope for optimizations):

https://en.wikipedia.org/wiki/Josh_Fisher#Trace_Scheduling

He combined this with a VLIW processor architecture to build Multiflow, a hardware startup. (Interesting history tidbit: Robert Colwell, who architected the P6, the first out-of-order Intel core, started his career at Multiflow before joining Intel. The P6 didn't have any trace-cache influences, but the P4, a few years later, infamously did...)

vkazanov · on March 20, 2019

Interestingly trace caches are now gone in both hardware CPUs and major jit compilers :-) Correct me if I am wrong..?

blattimwind · on March 20, 2019

Yes, nowadays x86 cores have µop caches which store the decomposition of individual instructions and then other "optimizers" that target specific constructs (e.g. loop stream detector).

bogomipz · on March 20, 2019

Might you or anyone else have a link to that Bochs x86 virtual machine paper?

Similarly might you have links for those papers that describe "trace-based instruction predecoding for hardware CPUs."?

Cheers

naasking · on March 21, 2019

I think it's the first result when googling "bochs x86 paper".

bsder · on March 20, 2019

> I read quite a few years ago that Apple at one point abandoned a particular 68K->PowerPC JIT because an optimized interpreter was faster at running real-world applications for cache-related reasons (keep in mind that this was the early 90s, when mainstream L1 caches were 4/8/16KB and L2 cache was off-chip).

The big problem was that Apple was lousy at specifying how much cache they needed. IBM told Apple that it needed a LOT more cache on the 603 unless everything was running native. Apple ignored the advice.

The result was that the 603 and 604 were performance dogs because so much was still running in emulation.

So, IBM went back and bumped the cache for the 603e, and performance went up dramatically. This then led to the unfortunate situation wherein the "low-power" chip from IBM (the 603e) tended to be far more performant than the "high-performance" chip (the 604) from Motorola.

tasty_freeze · on March 20, 2019

I have written an emulation for an old microcoded machine. Although the straight-forward implementation was a multiple of times faster than the original machine, I couldn't resist going for a little more performance.

The microinstruction is 23 bits wide (plus one for parity). Rather than doing a lot of masking, shifting, and testing to extract fields (some fields are non-contiguous even), I instead use 64b per microword.

bits [23:0] -- original microinstruction bits [31:24] -- 8b op-dependent predecode value bits [39:32] -- 8b interpreter dispatch index bits [47:40] -- 8b op-dependent predecode value bits [63:48] -- 16b op-dependent predecode value

There are a few oddball instructions which could use more than the 8b and 16b predecode fields, and instead do the old shift and mask on the raw microword to figure out what is needed.

Overall, it nearly doubled my performance.

bogomipz · on March 20, 2019

>"The microinstruction is 23 bits wide (plus one for parity)."

I'm curious why does the microinstruction need a parity bit? What happens if the parity is wrong, a machine check exception?

>"bits [23:0] -- original microinstruction bits [31:24] -- 8b op-dependent predecode value bits [39:32]"

What do the pre-decode bits do exactly?

Which microcoded machine was this?

tasty_freeze · on March 29, 2019

bogomipz -- sorry, I didn't see you replied to this. The machine traps to a parity error handler (one for microcode RAM, and a different one for data ram) and prints out a message indicating the error code and the address being read when it was detected.

There are a number of microinstruction formats, and fields aren't always contiguous. Making up an example, say the instruction is "ADD R1,#imm8" to add an 8b immediate to the R1 register. But the 8b immediate value is stored in bits [14:10] and [4:2] of the microword. The straight-forward way would be to write "uint8_t imm = ((instr >> 7) & 0xF8) | ((instr >> 2) & 0x07;" Instead, when the writable control store is written, that quantity is decoded and stored in an 8b aligned predecode field, so getting the value is just "uint8_t imm = instr_struct.imm8;"

The machine is the Wang 2200. There were two architectures: the first used a 20b word in ROM, the second used a writable control store so the BASIC could have bug and feature updates by mailing out floppies.

rasz · on March 21, 2019

self modifying code? Doom/Quake are such examples in qemu, getting slower with every new jit update https://virtuallyfun.com/wordpress/2019/03/05/degrading-qemu... https://virtuallyfun.com/wordpress/2019/03/15/degrading-qemu...

dataflow · on March 20, 2019

There were JITs back then?!

dfox · on March 20, 2019

JIT taken on the face value is really old concept and when viewed from the right angle it was the original way how programming languages worked (in the sense that you brough stack of punch cards with your program to datacenter, it got compiled, resulting code was ran and the object file was simply thrown away).

There is then some marketing hype around JIT from late 90's that is mostly only relevant for Java, which implies that "true JIT" does things like hot spot detection, deoptimalization traps and trace-based program flow reconstruction. This is mostly only done in production only by JVM implementations and fallback interpreters in hypervisors and is based on 80's research projects.

Somewhat notably in early 90's there was HP Dynamo, which was userspace PA-RISC emulator that ran on PA-RISC host by means of trace based JIT which was able to agressively optimize the code to the extent that it was actually faster than running natively in not insignificant number of (real world!) cases.

blattimwind · on March 20, 2019

JIT in today's sense is at least 50 years old (translating regexes to machine code at runtime).

dataflow · on March 20, 2019

Damn!

vkazanov · on March 20, 2019

What's so surprising? Jit-like techniques are older than I am, and I am not that young anymore :-)

dataflow · on March 20, 2019

I guess I don't know what time period this was referring to exactly but it sounded like it'd have been the mid-1990s... is that correct? Idk I just didn't expect JITs to have existed back then for some reason.

dragontamer · on March 20, 2019

JITs were far more important in the 1980s when people regularly used grossly different computing systems.

Think of the popular 1980s computers: IBM PC (Intel 8086), Amiga (Motorola 68000), Commodore 64 (MOS Technology 6510), TRS-80 (Zilog Z80), Apple II (WDC 65C02), Acorn Electron (Synertek SY6502A).

The solution at that time was to use p-code ("portable" code), such as https://en.wikipedia.org/wiki/UCSD_Pascal

In fact, Pascal's popularity in the 1980s was probably due to the large number of p-code interpreters and Pascal -> p-code compilers.

--------------

Ironically, we still use "p-code", now called Bytecode, today. But we really don't move between systems aside from ARM and x86. GPU assembly is special, since its an entirely different model so you can't really port Java or Python to the GPU.

I guess LLVM shows that the high-level translation to LLVM-intermediate language just simplifies compiler optimization, to the point that its useful even if you're making code for a specific machine.

EDIT: I think the modern CPU have more or less settled on the same features. They're all Multithreaded, cache-coherent Modified 64-bit Harvard machines with out-of-order execution, super-scalar front-end with speculative branch prediction, with ~6 uop dispatch per clock tick and roughly 2 or 3 load/store units with 64kB L1 cache and 64-byte cache lines with a dedicated 128-bit SIMD unit

The above lines describes ARM Cortex-A72, Intel Skylake, AMD Zen, and POWER9... except Skylake has 256-bit SIMD units I guess, and Apple's A12 has 96kB L1 cache. Not very big differences anymore between CPUs.

nickpsecurity · on March 21, 2019

"By 1978, there existed over 80 distinct Pascal implementations on hosts ranging from the Intel 8080 microprocessor to the Cray-1 supercomputer. But Pascal’s usefulness was not restricted to educational institutions; by 1980 all four major manufacturers of workstations (Three Rivers, HP, Apollo, Tektronix) were using Pascal for system programming. Besides being the major agent for the spread of Pascal implementations, the P-system was significant in demonstrating how comprehensible, portable, and reliable a compiler and system program could be made. Many programmers learned much from the P-system, including implementors who did not base their work on the P-system, and others who had never before been able to study a compiler in detail. The fact that a compiler was available in source form caused the P-system to become an influential vehicle of extracurricular education." (Niklaus Wirth)

http://pascal.hansotten.com/niklaus-wirth/recollections-abou...

int_19h · on March 20, 2019

MIPS is still used in cheaper routers.

Was there a JIT compiler for p-code back in the day? I thought all implementations were interpreters or AOT compilers.

dragontamer · on March 20, 2019

Hmmm, the 1980s was before I was born. But according to records, Smalltalk-80 was implemented with a JIT machine.

See the paper: "Efficient Implementation of the Smalltalk-80 System", which suggests that code was generated on the fly.

int_19h · on March 20, 2019

Oh, there were definitely JITs back there. I meant the Pascal p-code specifically. UCSD Pascal, at least, was an interpreter.

gsnedders · on March 20, 2019

Most of the work over the past decade on dynamic language JITs (especially JS) really just builds on top of the work on the Self JIT in the mid-80s. JITs themselves are older still.

kwindla · on March 20, 2019

Lovely article!

Following up on the section about threaded code, Andrew W. Appel's book _Compiling With Continuations_ really blew my mind and changed how I think about the interconnections between compilation, optimization, and language design. There are many ways into that deep set of ideas; I recommend the Appel book as one of them!

  https://books.google.com/books/about/Compiling_with_Continuations.html?id=0Uoecu9ju4AC

And, of course, if there's an influential meme in CS, there's a Haskell paper with a clever title that references it. :-)

  https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/compiling-without-continuations.pdf

toolslive · on March 20, 2019

you should however also read the reply paper: https://www.microsoft.com/en-us/research/wp-content/uploads/...

with an even lovelier title.

naasking · on March 20, 2019

And yet another: https://www.cs.purdue.edu/homes/rompf/papers/cong-preprint20...

blattimwind · on March 20, 2019

Excellent article.

> Everyone knows that pigs can’t fly — just like everyone thinks they know that bytecode interpreters, as a technology for executing high-level languages, can’t be sped up without resorting to labour-intensive dynamic compilation.

[citation needed]

PHP (in recent versions) is a good example for a bytecode VM that's quite quick. So we already have a spectrum that probably covers an order of magnitude or so, just in the basic speed of bytecode execution.

vkazanov · on March 20, 2019

I am the author of the article :-)

This sentence is more like a description of the current state of affairs: people go for very complicated jits instead of trying to push the basic interpreter to its limits.

And jits are complicated, and are very hard to get right.

souprock · on March 21, 2019

I can confirm this.

At work I helped create a PowerPC interpreter running on x86-64 that was faster than anything else I could find. In many cases, it was faster than actual PowerPC hardware. Our first two attempts at a JIT couldn't beat the interpreter.

Eventually we did it. We had to, since the PowerPC system being emulated had real-time software running on a beast of a CPU. It was quad-core, 1.5 GHz, out-of-order, with multiple pipelines. For reasons, all 4 cores needed to be emulated on a single host core.

vkazanov · on March 22, 2019

Sounds like a very interesting project! Could you share some details? What techniques did u use?

I collect everything interpreter-related. But that's easy to see from the article :-)

willvarfar · on March 20, 2019

Some interpreters are faster than others. But lets list some interpreters and their (usually jit) compilation equivs:

    cpython vs pypy
    lua vs luajit
    dalvik vs art (which isn't jit)
    zend vs hhvm

Okay, that last one, php7 is reckoned to be faster than hhvm on some sites.

My gut feeling, though is that is because of the inefficiencies in php, many inherent in the language design, rather than because php7 were the only people to know how to write a fast interpreter.

It seems a sound generalisation that you get another order of magnitude by moving from interpretation to compilation.

tyingq · on March 20, 2019

I think it's pretty impressive that php7 was roughly on par with hhvm performance wise, yet a lot more backwards compatible. Without FB level funding and resources.

They've also been able to iterate where subsequent 7.x releases had notable improvements over prior ones: https://www.phoronix.com/scan.php?page=news_item&px=PHP-7.3-...

yellowapple · on March 20, 2019

I wonder how that parity might change now that HHVM is dropping support for PHP entirely and only supporting Hack?

tyingq · on March 20, 2019

Not sure how relevant it will be for anyone but FB. As far as I know, everyone that went to HHVM went back to PHP.

ufo · on March 20, 2019

For dynamic languages you can only get a 10x speedup if you have a JIT doing speculative type-based optimizations. If you replace the interpreter with a typical ahead of time compiler you will at most get around a 2x speedup (mostly from saving on opcode dispatch overhead). At worst it may not even be worth the extra compilation time and memory usage.

UncleEntity · on March 20, 2019

I'm not sure how you came up with those numbers but I disagree.

A while back I made a python module for libjit and during experimentation I easily got more than 10x speedups by simply translating the python code to the jit. Dusted off the code and ran the test on two different gcd algorithms (time in seconds for 100000 iterations):

  gcd:       13.303379424003651
  gcd2:      5.1671242880111095
  jit_gcd:   0.327398027991876
  jit_gcd2:  0.21529806801117957

The jitted functions would probably be a small bit faster if I did any sort of optimations other than a simple algorithmic translation.

ufo · on March 21, 2019

I am working on interpreters and compilers for dynamic languages for my PHD. I'm spitballing the numbers a bit but in my experience the exact speedups vary a lot depending on the particular choice of language and benchmark. I have more experience with Lua, for which I wrote a bytecode-to-C compiler once (gcc can do wonderful things if you throw some C code at it...). Most of my benchmarks got a speedup between 2x to 3x but some of them got up to 9x.

BTW, "number crunchy" benchmarks like GCD are the best case scenario for ahead of time compilation because the interpretation overhead for them is very high. The interpreter is running lots of bytecodes, and each bytecode is only doing one or two machine instructions of actual work. Did you happen to compare your results against a C implementation of the program (or PyPy JIT)? My experience with Lua was that despite the high speedups compared to the interpreter, the numeric benchmarks still ran much slower than something an AOT compiler for a statically-typed language (or a speculative JIT for a dynamic one) can produce. There is a high cost to the dynamic typing, as variables have to be stored in the interpreter stack instead of in machine registers, and the diamond-shaped control flow gets in the way of optimizations, etc. Getting a 10x speedup in a benchmark that runs 100x slower than C might mean that there is still a lot of fat left to cut :)

vkazanov · on March 20, 2019

So you did a wrapper around libjit and generated functions on-the-fly?

I am not sure it counts as a full bytecode jit. :-)

I did a very stupid jit for the PigletVM thing while working on the article and the difference was more like 1.2-1.5 compared to the first naive interpreter.

UncleEntity · on March 20, 2019

Wasn't saying it was a full bytecode jit (that's a full on big project in itself) I was just doubting the numbers.

Here's the (old, apparently before I figured out how to translate gcd2) file:

https://github.com/eponymous/libjit-python/blob/master/sampl...

vkazanov · on March 20, 2019

Well, this is the perfect case for jits: basic number crunching. Bytecode interpreters are pure overhead in this case.

Problems usually come from jumping between jitted/non-jitted code or, say, pointer chasing in complex objects.

Did you find any use for the libjit wrapper? is it usable practically?

UncleEntity · on March 20, 2019

> Did you find any use for the libjit wrapper? is it usable practically?

Other than it starting me off on a multi-year dive down the compiler/interpreter rabbit hole I didn't do too much with it since I didn't really have the knowledge to use it at the time. I also want to rewrite it because the C++ libjit API is kind of wonky and it really should be using the C-API but haven't gotten around to it yet.

Someday I'll get motivated and work on it again, have a couple half finished projects that could use it as a back end if I ever get around to finishing them.

pjmlp · on March 20, 2019

ART is surely a JIT after Android 7.

willvarfar · on March 20, 2019

Ah I thought it was compile-on-install?

pjmlp · on March 20, 2019

Only in Android 5 and 6.

Compile-on-install was taking so long that in Android 7, they re-introduced the JIT and have been improving it since.

So on Android 7, ART is an hand-optimized Assembly interpreter, which gathers data to a PGO JIT, which likewise gathers data to PGO AOT when the device is idle.

Afterwards they started to optimize this workflow and optimizations being done, and as of Android P, PGO files are uploaded to the Play Store, which are then delivered to identical devices when they fresh install the APK so that they can reach the AOT step with a good level of optimizations faster.

Just follow all the links available on the left sidebar here, https://source.android.com/devices/tech/dalvik

There are also a couple of Google IO talks for Android 7, 8 and 9 talking about some of these changes.

mratsim · on March 21, 2019

I'm surprised their threaded interpreter was slower than switch, this doesn't reflect my benchmark as well even for post-Haswell CPU.

Also this is not true on non-x86 hardware and I unfortunately didn't test on AMD.

The article should at least cite this paper regarding Haswell branch prediction: https://hal.inria.fr/hal-01100647/document "Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore", 2015, Rohou et al.

Also I'm leaving my interpreter optimization resources here with decades of research papers and highlights of the techniques I found most interesting when implementing a fast VM: https://github.com/status-im/nimbus/wiki/Interpreter-optimiz...

vkazanov · on March 21, 2019

Yeah, it's a very good collection you got there! some of the papers I read, others are in the queue.

I was also surprised to see the switch-based solution winning here. But I was even more surprised to see how a simple change (a primitive stack top caching) suggested by one of the readers radically changed my perf benchmarks: the threaded interpreter was the fastest interpreter again.

kayamon · on March 20, 2019

If you really want to speed up your bytecode interpreter, switch to a register-based VM instead of a stack-based one.

Animats · on March 20, 2019

"it is entirely possible to speed up the work of such interpreters by a factor of at least 1½."

Story of CPython.