The GMP (GNU Multiple Precision arithmetic library) guys found that the hand-written assembly sequences (which were already 2-6x faster than what a decent C compiler produced) could be sped up by another factor of 2x through superoptimisation, at least on out-of-order CPUs.
Superoptimisation isn't only feasible, it is actually quite practical. A factor of 2x across an already highly optimised library can cut the supercomputing budget in a grant by half. Whether you need 1 or 2 million CPU hours can make or break entire projects.
These days, however, one can often do better with SIMD instructions. These often don't superoptimise well because the individual atomic operations they perform can't be rearranged by the programmer. However, lots of existing supercomputers available for research still don't have chipset revisions with decent SIMD support.
SIMD auto-vectorization is an offshoot of this research.
But what excites me about super-optimization is making functional optimization repeatable.
Let me give you an example of why I'd use it for more than HPC libs.
Assume you have a firewall system which needs to determine whether a tuple of (srcip, dstip, dstport) is safe or not.
All 3 variables fit inside registers (assume xmm).
A traditionally written rules engine would be a hashtable of some sort, regenerated each time with a gperf hash (which already puts us partially into superoptimizer territory).
But what if going to RAM is 300x more expensive than a SIMD sequence over 3 registers run a few hundred times?
Then writing a functional chunk of code which produces the same outputs for the same inputs, but without any storage system would be a killer feature.
That is sort of the problem where a super-optimizer can take a huge amount of data and turn it into bit-twiddling checks which find hidden invariants in each data-set.
Being able to take a bunch of rules that describe subnet-to-subnet or port access patterns and compile them into a storage-less ACL mechanism (like "if bit 11 of the port is 1, then DENY") is exactly the kind of thing I mean.
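To make that concrete, here's a minimal sketch (in C) of what such a compiled, table-free check might look like. The subnets, ports and bit tests below are invented purely for illustration; in reality they'd be derived from the actual rule set, and the hope is that a superoptimizer would collapse the branches into a short branch-free sequence.

    #include <stdint.h>

    /* Hypothetical output of compiling a tiny rule set into pure
     * register arithmetic: no hash table, no memory lookups.
     * All constants here are made up for illustration. */
    static inline int is_allowed(uint32_t srcip, uint32_t dstip, uint16_t dstport)
    {
        /* "anything from 10.0.0.0/8 to 10.0.0.0/8 is allowed" */
        if ((srcip >> 24) == 10 && (dstip >> 24) == 10)
            return 1;

        /* "DENY if bit 11 of the destination port is set" */
        if (dstport & (1u << 11))
            return 0;

        /* "allow well-known ports (< 1024) into 192.168.0.0/16" */
        if (dstport < 1024 && (dstip >> 16) == 0xC0A8)
            return 1;

        return 0; /* default deny */
    }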
I'm almost certain these problems have been solved in the past using a super-optimized FPGA as well (well, I would have done it that way if I had the budget).
For me, being able to churn a dumb SWITCH loop through these systems is more interesting than chucking well-written code into them.
I don't think this will be as successful as you might hope. Consider how many possible sets of rules there are. Call that n. The instruction sequence will need to encode log2 n bits of data, but you have no ability to tune the representation to say which inputs are likely.
Instructions aren't free, they still need to be loaded from RAM like data.
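As a back-of-the-envelope illustration (my numbers, purely for scale): a (srcip, dstip, dstport) tuple is 32 + 32 + 16 = 80 bits, so an arbitrary blacklist of 2^20 such tuples carries roughly log2 C(2^80, 2^20) ≈ 2^20 × 60 bits ≈ 8 MB of irreducible information. No instruction sequence can encode that rule set in less, and those instructions have to be fetched from RAM just like a hash table would. Bit-twiddling checks only win when the rule set has structure (subnets, port ranges) that compresses well.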
TL;DR: it trades Turing completeness (no infinite loops) for the ability to decide "do these two differently coded functions compute the same thing".
Thus "if two programs are equal then they generate identical executables. Such a language would be immune to abstraction: no matter how many layers of indirection you might add the binary size and runtime performance would be unaffected." (quote from linked article)
To rephrase/expand your tl;dr slightly: you can express Turing-complete programs using Morte. But it only does a certain number of well-defined things (like inlining) to reduce an expression and compare it, so some programs that are 'the same' due to a more complex isomorphism could generate different executables. This is fine, because program isomorphism in general is basically the halting problem.
> program isomorphism in general is basically the halting problem.
How far does this go? On the face of it, having a system determine whether program X halts at all could be a different problem from determining whether program X halts given that some program Y halts. The fact that program Y halts would, I guess, count as an additional axiom in the system, and depending on the details of program Y it could be more or less helpful in determining that X halts?
Yep! If you want to learn a lot more about this stuff, search for the buzzword "codata." Basically, we can start to define type systems which "wall off" a section of the language (which may or may not contain all the things you want to compute in practice; we don't know!) that follows simple recursive rules built up from a base case. This section is called "data", acting upon it programmatically is called "recursion", and proofs about it are called "induction". This fragment is not Turing-complete, so the halting problem is decidable for it. Everything else is "codata", acted upon with "corecursion" and reasoned about with "coinduction".
One super-awesome thing about the "data" used in codata systems is that, once you've proven that something halts, there is no formal distinction between lazy and eager evaluation -- it simply becomes a choice that the compiler makes, perhaps guided by pragmas inserted by a smart programmer.
One not-so-awesome thing: consider the normal model of the natural numbers, `data Nat = Zero | Succ Nat` where every number is either 0 or the successor of another natural number. The obvious definition of recursion involves stripping off a `Succ` structure one level deeper. Anything which follows that pattern is proven to halt, so it's a whitelisted computation that can be used in other data-computations.
But now, consider the usual algorithm for GCD:
    gcd' m n | n == 0    = m
             | m < n     = gcd' n m
             | otherwise = gcd' n (mod m n)
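(For a concrete example: gcd' 8 12 → gcd' 12 8 → gcd' 8 4 → gcd' 4 0 → 4; the second argument goes 12, 8, 4, 0.)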
Is this corecursive or recursive? It's clearly recursive. But does the computer know that? We have to clue it in. We need to tell the computer that our definition of recursion here -- "it's recursive if we only go from n to n - 1" -- generalizes to a larger theorem, "it's recursive if we only go from n to some number strictly less than n". Even then, we have to prove that `mod m n < n`, and our system needs to be guided to understand that it's okay if `m` increases as long as `n` is strictly decreasing (the arguments could have been flipped, after all).
So these are your "additional axioms in the system." They come in when your "recursion" method is not the most obvious way to traverse the data structure; in this case we're using argument swaps and jumping around the graph of N to traverse two data structures simultaneously, and it creates a bunch of headaches.
The site appears to be down. It's in Google cache; it looks like just a project description with some linked reports. Here's a cache of the final report [1].
http://en.wikipedia.org/wiki/Superoptimization and/or the articles it cites should have been cited and consulted. Where mechanical discovery is not appropriate and you know what you want to accomplish (e.g., division by a constant integer at the cost of a multiplication, or computing the number of one bits in a word), the "tricks" described in Warren's Hacker's Delight can be applied. Some, but not all, optimizing compilers look for sequences for which there is some known "trick".
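For instance, one classic Hacker's Delight trick is counting the one bits in a 32-bit word with a handful of arithmetic and logical operations instead of a loop. A sketch (this is the standard textbook sequence, not necessarily what any particular compiler emits):

    #include <stdint.h>

    /* Branch-free population count for a 32-bit word. */
    static unsigned popcount32(uint32_t x)
    {
        x = x - ((x >> 1) & 0x55555555u);                 /* sums of 2-bit fields */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of 4-bit fields */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* per-byte sums        */
        return (x * 0x01010101u) >> 24;                   /* add the four bytes   */
    }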
I think it's worthless to find the "optimal" sequence of instructions these days. The performance of a computer system depends far more on the runtime context than on anything a compiler can infer from compile-time information.
I worked on production compilers at a major chip company, and the best thing we can do is improve the context in which a compiler is deployed. Runtime systems and compilers have to cooperate better. The compiler needs to have an interface that is cheap (low overhead) for the runtime to call into and make changes to programs. The runtime needs to sample more architectural perf counters that matter and send them into the compiler so that it can make smarter decisions.
Also worked on compilers at a major chip company. Generally it's better to have clear code than fast code. For that reason I would say that the future of optimization is things like whole-program optimization that exploit language features to improve the asymptotic costs of mechanically provable algorithms. For a primitive example, consider taking a simple Haskell program that is provably correct but needs O(n^3). Put it through some 'equivalent optimal implementation' system that provably spits out an algorithm doing the same task in O(n lg n).
We did stuff like this for simple matrices in Fortran and C, but I think it is generalizable to more problems than that.
And of course, I agree that most of the time no one really cares except HPC.
The problem is that the subset of programs which can be proved correct (in 2015) is mostly restricted to problems that are uninteresting, already solved, or both.
Not always true. For a trivial example, in fortran code, we'd often see people wanting to write code that calls runtime methods directly because it leads to clearer implementations. The language rules were such that we could prove invariants on these runtime calls that were much stronger and farther reaching than the general bitcode optimizer could do, resulting in a class of matrix operations going to n*lg(n) from n^3, and we could also switch the stride direction to make better use of cache because of the algorithm change. This sort of thing was only possible because language knowledge was exploited — generic optimizers will miss this stuff.
I think the future of languages like Haskell is bright in terms of performance, because they are so loosey-goosey about what gets evaluated when, which gives language-aware optimizers a lot of meat to chew on. An aggressive optimizer could turn out versions of the same code that run screaming fast single-threaded, or that fully exploit massively parallel machines, or even versions for all of those.
>> I think it's worthless to find the "optimal" sequence of instructions these days.
That's funny; the performance difference between Intel and AMD processors is on the same order as what an optimizer can do. People are very disappointed in AMD these days because of it. Low-level optimization always has been and always will be important, and people will pay real money for that difference in performance. Getting the right algorithm and having good cache behavior are still important too, of course, but too many CS folks keep thinking that's all that matters.
I also think runtime information is critical. Integrated specialized JITs and hinted binary translation might be the solution, especially considering the pace at which new instructions have been added lately.
Maybe superoptimization databases for some common sequences.
If someone wants to try a real superoptimizer right now, there's the souper (https://github.com/google/souper) project which does this for LLVM IR code. Although the goal seems to be slightly different: "It uses an SMT solver to help identify missing peephole optimizations in LLVM's midend optimizers."
Still, it does compile some pretty big programs as long as you cache the results (in redis).
Probably best to stick to assembly for that, since the C version has lots of undefined behavior (division by zero is UB, as is left-shifting a negative number; right-shifting one is implementation-defined); an optimizing compiler can do whatever it wants with that code. :)
I'm not trying to be annoying, but I think there's value in not publishing broken code without at least noting that it is broken. I realize the code still communicates to humans how to compute signum(), but it's meaningless once compiled, which is kind of important sometimes too.
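For what it's worth, a branch-free signum can be written without invoking UB at all. This is not the article's version, just a standard well-defined formulation:

    /* Well-defined for every int x, including INT_MIN and 0:
     * each comparison yields 0 or 1, so the result is -1, 0 or 1. */
    int signum(int x)
    {
        return (x > 0) - (x < 0);
    }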
However, for an infinite domain I strongly suspect it's not possible, because it's impossible to tell in general (i.e. it's uncomputable) whether two programs compute the same function. This follows from Rice's theorem, and hence ultimately from the halting problem.
It's possible there may be some kind of constructive proof but I find it unlikely.
So in summary, general superoptimisation is probably impossible.
Domains are almost never truly infinite. You tend to be bound by at least the space of memory even in the brute-force case, and the meaningful domain for any meaningful function (especially one small enough to be optimized intensively) is much less.
Domains are not infinite in practice, but I think it's important to establish the theoretical behaviour and possibilities for the infinite case. They can for example give us a good indication of the 'asymptotic' behaviours of algorithms.
A finite domain isn't the same thing as a tractably small one, though.
Ever written a function that takes a list? You've got an infinite domain. (Well, generally in most programming languages not actually infinite, but the upper bound is so large that it might as well be.)
At first I thought the super-optimizer had made a mistake because the results are not the same, but I think the error is in the original, where line 6 should be
return -1;
Anyway, using assembler for a CPU I made up in my head just now, how about
I don't think that'll work. Consider the positive integer 0b01: the shift yields 0b10, then the and yields 0.
I've actually tried to come up with a simpler example. It's tougher than you'd think, and you pretty much have to do weird stuff with flags. The superoptimiser trick is pretty neat.
I wonder if superoptimization could be performed more efficiently by using statistical techniques based on the input code sequence. And of course, memoization could be used when compiling the same piece of code twice, e.g. during development (actually, I would expect that feature in today's optimizing compilers).