The GMP (GNU Multiple Precision arithmetic library) guys found that the hand-written assembly sequences (which were already 2-6x faster than what a decent C compiler produced) could be sped up by another factor of 2x through superoptimisation, at least on out-of-order CPUs.
Superoptimisation isn't only feasible, it is actually quite practical. A factor of 2x across an already highly optimised library can cut the supercomputing budget in a grant by half. Whether you need 1 or 2 million CPU hours can make or break entire projects.
These days, however, one can often do better with SIMD instructions. These often don't superoptimise well because the individual atomic operations they perform can't be rearranged by the programmer. However, lots of existing supercomputers available for research still don't have chipset revisions with decent SIMD support.
SIMD auto-vectorization is an offshoot of this research.
But what excites me about super-optimization is making functional optimization repeatable.
Let me give you an example of why I'd use it for more than HPC libs.
Assume you have a firewall system which needs to determine whether a tuple of (srcip, dstip, dstport) is safe or not.
All 3 variables fit inside registers (assume xmm).
A traditionally written rules engine would be a hashtable of some sort, regenerated each time with a gperf hash (which already puts us partially into superoptimizer territory).
But what if going to RAM is 300x more expensive than a SIMD sequence over 3 registers run a few hundred times?
Then writing a functional chunk of code which produces the same outputs for the same inputs, but without any storage system would be a killer feature.
That is sort of the problem where a super-optimizer can take a huge amount of data and turn it into bit-twiddling checks which find hidden invariants in each data-set.
Being able to take a bunch of rules that describe subnet-to-subnet or port access patterns and compile them into a storage-less ACL mechanism (like "if bit 11 of the port is 1, then DENY") is exactly the kind of thing I mean.
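To make that concrete, here's a minimal sketch (in C) of what such a compiled, table-free check might look like. The subnets, ports and bit tests below are invented purely for illustration; in reality they'd be derived from the actual rule set, and the hope is that a superoptimizer would collapse the branches into a short branch-free sequence.

    #include <stdint.h>

    /* Hypothetical output of compiling a tiny rule set into pure
     * register arithmetic: no hash table, no memory lookups.
     * All constants here are made up for illustration. */
    static inline int is_allowed(uint32_t srcip, uint32_t dstip, uint16_t dstport)
    {
        /* "anything from 10.0.0.0/8 to 10.0.0.0/8 is allowed" */
        if ((srcip >> 24) == 10 && (dstip >> 24) == 10)
            return 1;

        /* "DENY if bit 11 of the destination port is set" */
        if (dstport & (1u << 11))
            return 0;

        /* "allow well-known ports (< 1024) into 192.168.0.0/16" */
        if (dstport < 1024 && (dstip >> 16) == 0xC0A8)
            return 1;

        return 0; /* default deny */
    }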
I'm almost certain these problems have been solved in the past using a super-optimized FPGA as well (well, I would have done it that way if I had the budget).
For me, being able to churn a dumb SWITCH loop through these systems is more interesting than chucking well-written code into them.
I don't think this will be as successful as you might hope. Consider how many possible sets of rules there are. Call that n. The instruction sequence will need to encode log2 n bits of data, but you have no ability to tune the representation to say which inputs are likely.
Instructions aren't free, they still need to be loaded from RAM like data.
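As a back-of-the-envelope illustration (my numbers, purely for scale): a (srcip, dstip, dstport) tuple is 32 + 32 + 16 = 80 bits, so an arbitrary blacklist of 2^20 such tuples carries roughly log2 C(2^80, 2^20) ≈ 2^20 × 60 bits ≈ 8 MB of irreducible information. No instruction sequence can encode that rule set in less, and those instructions have to be fetched from RAM just like a hash table would. Bit-twiddling checks only win when the rule set has structure (subnets, port ranges) that compresses well.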
TL;DR: it trades Turing completeness (no infinite loops) for the ability to decide "do these two differently coded functions compute the same thing".
Thus "if two programs are equal then they generate identical executables. Such a language would be immune to abstraction: no matter how many layers of indirection you might add the binary size and runtime performance would be unaffected." (quote from linked article)
To rephrase/expand your tl;dr slightly: you can express Turing-complete programs using Morte. But it only does a certain number of well-defined things (like inlining) to reduce an expression and compare it, so some programs that are 'the same' due to a more complex isomorphism could generate different executables. This is fine, because program isomorphism in general is basically the halting problem.
> program isomorphism in general is basically the halting problem.
How far does this go? On the face of it, having a system determine whether program X halts at all could be a different problem from determining whether program X halts given that some program Y halts. The fact that program Y halts would, I guess, count as an additional axiom in the system, and depending on the details of program Y it could be more or less helpful in determining that X halts?
Yep! If you want to learn a lot more about this stuff, search for the buzzword "codata." Basically, we can start to define type systems which "wall off" a section of the language (which may or may not contain all the things you want to compute in practice; we don't know!) that follows simple recursive rules built up from a base case. This section is called "data", acting upon it programmatically is called "recursion", and proofs about it are called "induction". This fragment is not Turing-complete, so the halting problem is decidable for it. Everything else is "codata", acted upon with "corecursion" and reasoned about with "coinduction".
One super-awesome thing about the "data" used in codata systems is that, once you've proven that something halts, there is no formal distinction between lazy and eager evaluation -- it simply becomes a choice that the compiler makes, perhaps guided by pragmas inserted by a smart programmer.
One not-so-awesome thing: consider the normal model of the natural numbers, `data Nat = Zero | Succ Nat` where every number is either 0 or the successor of another natural number. The obvious definition of recursion involves stripping off a `Succ` structure one level deeper. Anything which follows that pattern is proven to halt, so it's a whitelisted computation that can be used in other data-computations.
But now, consider the usual algorithm for GCD:
    gcd' m n | n == 0    = m
             | m < n     = gcd' n m
             | otherwise = gcd' n (mod m n)
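(For a concrete example: gcd' 8 12 → gcd' 12 8 → gcd' 8 4 → gcd' 4 0 → 4; the second argument goes 12, 8, 4, 0.)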
Is this corecursive or recursive? It's clearly recursive. But does the computer know that? We have to clue it in. We need to tell the computer that our definition of recursion here -- "it's recursive if we only go from n to n - 1" -- generalizes to a larger theorem, "it's recursive if we only go from n to some number strictly less than n". Even then, we have to prove that `mod m n < n`, and our system needs to be guided to understand that it's okay if `m` increases as long as `n` is strictly decreasing (the arguments could have been flipped, after all).
So these are your "additional axioms in the system." They come in when your "recursion" method is not the most obvious way to traverse the data structure; in this case we're using argument swaps and jumping around the graph of N to traverse two data structures simultaneously, and it creates a bunch of headaches.
The site appears to be down. It's in Google cache; it looks like just a project description with some linked reports. Here's a cache of the final report [1].
http://en.wikipedia.org/wiki/Superoptimization and/or the articles it cites should have been cited and consulted. Where mechanical discovery is not appropriate and you know what you want to accomplish (e.g., division by a constant integer at the cost of a multiplication, or computing the number of one bits in a word), the "tricks" described in Warren's Hacker's Delight can be applied. Some, but not all, optimizing compilers look for sequences for which there is some known "trick".
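For instance, one classic Hacker's Delight trick is counting the one bits in a 32-bit word with a handful of arithmetic and logical operations instead of a loop. A sketch (this is the standard textbook sequence, not necessarily what any particular compiler emits):

    #include <stdint.h>

    /* Branch-free population count for a 32-bit word. */
    static unsigned popcount32(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);                 /* sums of 2-bit fields */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of 4-bit fields */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums        */
        return (x * 0x01010101u) >> 24;                   /* add the four bytes   */
    }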
I think it's worthless to find the "optimal" sequence of instructions these days. The performance of a computer system depends far more on the runtime context than on anything a compiler can infer from compile-time information.
I worked on production compilers at a major chip company, and the best thing we can do is improve the context in which a compiler is deployed. Runtime systems and compilers have to cooperate better. The compiler needs to have an interface that is cheap (low overhead) for the runtime to call into and make changes to programs. The runtime needs to sample more architectural perf counters that matter and send them into the compiler so that it can make smarter decisions.
Also worked on compilers at a major chip company. Generally it's better to have clear code than fast code. For that reason I would say that the future of optimization is things like whole-program optimization that exploit language features to improve the asymptotic costs of mechanically provable algorithms. For a primitive example, consider taking a simple Haskell program that is provably correct but needs O(n^3). Put it through some 'equivalent optimal implementation' system that provably spits out an algorithm doing the same task in O(n lg n).
We did stuff like this for simple matrices in Fortran and C, but I think it is generalizable to more problems than that.
And of course, I agree that most of the time no one really cares except HPC.
The problem is that the subset of programs which can be proved correct (in 2015) is mostly restricted to problems that are uninteresting, already solved, or both.
Not always true. For a trivial example, in fortran code, we'd often see people wanting to write code that calls runtime methods directly because it leads to clearer implementations. The language rules were such that we could prove invariants on these runtime calls that were much stronger and farther reaching than the general bitcode optimizer could do, resulting in a class of matrix operations going to n*lg(n) from n^3, and we could also switch the stride direction to make better use of cache because of the algorithm change. This sort of thing was only possible because language knowledge was exploited — generic optimizers will miss this stuff.
I think the future of languages like Haskell is bright in terms of performance, because they are so loosey-goosey about what gets evaluated when, which gives language-aware optimizers a lot of meat to chew on. An aggressive optimizer could turn out versions of the same code that run screaming fast single-threaded, or that fully exploit massively parallel machines, or even versions for all of those.
>> I think it's worthless to find the "optimal" sequence of instructions these days.
That's funny; the performance difference between Intel and AMD processors is on the same order as what an optimizer can do. People are very disappointed in AMD these days because of it. Low-level optimization always has been and always will be important, and people will pay real money for that difference in performance. Getting the right algorithm and having good cache behavior are still important too, of course, but too many CS folks keep thinking that's all that matters.
I also think runtime information is critical. Integrated specialized JITs and hinted binary translation might be the solution, especially considering the pace at which new instructions have been added lately.
Maybe superoptimization databases for some common sequences.
If someone wants to try a real superoptimizer right now, there's the souper (https://github.com/google/souper) project which does this for LLVM IR code. Although the goal seems to be slightly different: "It uses an SMT solver to help identify missing peephole optimizations in LLVM's midend optimizers."
Still, it does compile some pretty big programs as long as you cache the results (in redis).
Probably best to stick to assembly for that, since the C version has lots of undefined behavior (division by zero is UB, as is left-shifting a negative number; right-shifting one is implementation-defined); an optimizing compiler can do whatever it wants with that code. :)
I'm not trying to be annoying, but I think there's value in not publishing broken code without at least noting that it is broken. I realize the code still communicates to humans how to compute signum(), but it's meaningless once compiled, which is kind of important sometimes too.
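For what it's worth, a branch-free signum can be written without invoking UB at all. This is not the article's version, just a standard well-defined formulation:

    /* Well-defined for every int x, including INT_MIN and 0:
     * each comparison yields 0 or 1, so the result is -1, 0 or 1. */
    int signum(int x)
    {
        return (x > 0) - (x < 0);
    }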
However, for an infinite domain I strongly suspect it's not possible, because it's impossible to tell in general (i.e. it's uncomputable) whether two programs compute the same function. This follows from Rice's theorem, and hence ultimately from the halting problem.
It's possible there may be some kind of constructive proof but I find it unlikely.
So in summary, general superoptimisation is probably impossible.
Domains are almost never truly infinite. You tend to be bound by at least the space of memory even in the brute-force case, and the meaningful domain for any meaningful function (especially one small enough to be optimized intensively) is much less.
Domains are not infinite in practice, but I think it's important to establish the theoretical behaviour and possibilities for the infinite case. They can for example give us a good indication of the 'asymptotic' behaviours of algorithms.
A finite domain isn't the same thing as a tractably small one, though.
Ever written a function that takes a list? You've got an infinite domain. (Well, generally in most programming languages not actually infinite, but the upper bound is so large that it might as well be.)
At first I thought the super-optimizer had made a mistake because the results are not the same, but I think the error is in the original, where line 6 should be
return -1;
Anyway, using assembler for a CPU I made up in my head just now, how about
I don't think that'll work. Consider the positive integer 0b01: the shift yields 0b10, then the and yields 0.
I've actually tried to come up with a simpler example. It's tougher than you'd think, and you pretty much have to do weird stuff with flags. The superoptimiser trick is pretty neat.
I wonder if superoptimization could be performed more efficiently by using statistical techniques based on the input code sequence. And of course, memoization could be used when compiling the same piece of code twice, e.g. during development (actually, I would expect that feature in today's optimizing compilers).