I'd rather say "don't use Intel x86 assembly unless you are measuring very, very carefully". First, few architectures have so many weird traps and interactions as x86 does. Second, most of these things become visible once you start measuring (although you have to be very careful when measuring performance, as it is an art in itself).
The x86 these days is effectively a poor-man's VLIW machine: you have a number of units with out-of-order execution. This means that you can look at your instruction stream as a stream of VLIW instructions for all of the CPU's units. The performance of your code might depend not just on the exact number of instructions, their execution times, and their latencies, but also on the interdependencies between your instructions and scheduling constraints. Getting that right is a nightmare, and even if you do get it right on your chip, there are at least several major x86 microarchitectures in use, each one with its own peculiarities.
As a counterexample, I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
I write assembly for DSPs on a near-daily basis. Up until a few months ago there didn't even exist a C compiler for the target architecture.
Even when you write C code for an embedded platform that does have a decent C toolchain, you cannot truly understand what you're doing without spending a lot of time looking at the generated assembly, and writing some of it yourself.
However, this kind of architecture has nothing to do with the x86. RISC, no cache, in most cases no MMU (and rarely any DMA), an extremely simple and straightforward pipeline, etc. I've written assembly for several such architectures, but compared to them I find x86 assembly intimidating.
I also do DSP development, and looking at the x86 architecture, it almost seems pointless to try to predict what the processor is doing. Out-of-order execution? Register renaming? The only way I can be sure assembly is faster than generated C is by profiling, which isn't even exact anyway without isolating the operating system.
For us, latency and determinism are the most important requirements of any processor, and it's very hard to get that on x86 without writing your own OS. But that negates the entire advantage of using the x86 platform with so much available code written on top of other kernels.
This brings back memories of when I wrote a lot of DSP code. The architecture I was on had a decent C compiler, but it wouldn't correctly take advantage of some of the special instructions. By rewriting a viterbi decoder in assembly to take advantage of their special subtract-compare-branch instructions, I recall getting a huge performance increase. But the compiler did a decent job at optimizing for multiply-accumulate instructions especially if you gave it the right hints with pragmas.
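For reference, the kind of loop that benefits from those hints looks something like this (a generic sketch, not the actual decoder; the pragma shown is a hypothetical, vendor-specific trip-count hint in the style of TI's MUST_ITERATE, and the function name is made up):

    #include <stdint.h>

    /* A plain multiply-accumulate loop; DSP compilers typically map the
       body onto a single MAC instruction. */
    int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        /* #pragma MUST_ITERATE(8, , 8) -- vendor-specific hint promising the
           trip count is a multiple of 8, which helps software pipelining */
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];   /* multiply-accumulate */
        return acc;
    }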
I wrote x86 assembly demos for fun in the golden age of asm between the eras of the 8088 and the P6... It became so much about putting random instructions here and there to get better performance that one almost needs a QuickCheck-like synthesizing+benchmarking hybrid compiler that automagically mutates code to find what is actually fastest via genetic-algorithm-like heuristics.
In my experience, performance is dominated almost completely by memory throughput and/or memory latency (depending on whether the prefetcher succeeds).
All the instruction latencies, VLIW optimizations, and so forth have a minor effect.
However, the cases where these things do have an effect are exactly the cases where it might make sense to go down to the ASM level.
The first avenue to improve performance, though, is improving memory access patterns. Compacting, aligning and rearranging data in cache windows. Even creating and maintaining redundant copies of data which is compacted or organized in a way that aids the cache can help.
Another big improvement I've witnessed is helping the prefetcher. For example, I've had code that had to iterate a few linked lists to work on all elements. When I reorganized that code to interleave the iteration of 8 linked lists at a time (rather than iterating them all serially), I went up from being memory-latency-bound, to being closer to memory-throughput-bound. The prefetcher could predict the next 8 ptrs I'll dereference, and prefetch 8 cache lines, instead of 1 at a time.
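A rough sketch of that interleaving, assuming eight independent singly linked lists and a hypothetical visit() callback:

    struct node { struct node *next; int payload; };

    /* Walk 8 independent lists in lockstep so there are always several
       pending loads for the memory system to overlap, instead of one
       serialized pointer chase at a time. */
    void visit_interleaved(struct node *heads[8], void (*visit)(struct node *))
    {
        struct node *cur[8];
        for (int i = 0; i < 8; i++)
            cur[i] = heads[i];

        int live = 8;
        while (live > 0) {
            live = 0;
            for (int i = 0; i < 8; i++) {
                if (cur[i]) {
                    visit(cur[i]);          /* work on this element */
                    cur[i] = cur[i]->next;  /* this load overlaps with the others */
                    live++;
                }
            }
        }
    }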
I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
Better in terms of code size, speed, or both? (Hm, does code size even matter anymore? Maybe for the CPU cache.)
Better in terms of both, and code size matters on all architectures. On microcontrollers it's because you really have limited memory, and on larger machines it's because your code cache footprint matters for performance.
That's what I thought too when I saw the numbers; OK, he managed to get it 18% faster, but how much bigger was the result? I'm guessing far more than 18% bigger. I've seen cases where replacing a function with a smaller one, but one which microbenchmarked slower, actually led to the application as a whole performing much faster due to more code fitting in the cache. Ditto for individual instructions; (unfortunately - but Intel seems to be slowly changing this) many of the shortest instructions aren't the fastest when considered in isolation, so compilers tend to avoid them, but the 2-3x difference there is far less than the cost of a single cache miss.
64K of L1 might seem like a huge amount for a function to fit into, but that's 64K shared among every other often-used piece of code. It's for this reason that loop unrolling is no longer recommended, either.
The first one is alignment of branches: the assembly code contains no directives to align basic blocks on particular boundaries, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself.
And if you were, you would likely not see much on a modern x86; in fact it would probably make things slower, as it may confuse the branch predictor --- instructions are variable-length, so their addresses naturally are not aligned, and that enables e.g. lower-order bits to be used as cache tags.
Yes, actually a better word would be code density. A weird encoding like x86's behaves like compressed code that gets decoded inside the CPU into another instruction set that actually gets executed. This property, along with backwards compatibility, is what has kept the x86 instruction set alive.
With the move to 64-bit, code had to be recompiled anyway; hence there was a good opportunity to reshape the x86 instruction set significantly. Instead, the number of changes was relatively small; the instruction set kept its flavour (variable-length encoding, non-orthogonal operands, ...). I believe there are many reasons for that choice (made by AMD), but code density was a key factor.
It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
This is at least partially because a lot more work has been put into gcc's x86 backend. If the compiled code was that bad then presumably you saw quite a bit of low-hanging fruit in terms of potential improvements to the optimiser.
A random thought just crossed my mind -- how about a genetic algorithm that experiments with all these options and evolves towards the best performance?
Run speed is objective and measurable, which is desirable for a fitness criterion.
Somehow, you'd have to specify which instructions can be moved around and which can't -- or maybe not, if you could also assess whether the program executed "successfully."
I don't really see it as practical, but it might be fun. Might uncover optimizations that no one has seen before.
Unfortunately in most modern OSes run speed on x86 is variable and profiling itself is intrusive. You might be able to get a general trendline toward a faster algorithm but it'll still be lost in statistical noise.
The second problem is that the underlying processor is a moving target. The x86 programming model doesn't directly map to hardware anymore, with out of order execution, branch predictors and register renaming. One variant of the processor with a particular caching algorithm might have totally different results than the next.
Writing SSE assembly (not intrinsics) by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem, like CPU bone animation of vertices.
And while it's convenient to use intrinsics to generate SSE (intrinsics look like ordinary C functions and save you from having to write actual assembly), it's impossible for them to always be as efficient as a clever programmer. I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
That said, the two biggest arguments against assembly are:
(a) performance matters much less in the modern day,
(b) you kill your portability by writing inline assembly. If you expect your code to run on Windows (and gamedevs expect their code to run on Windows, though this may be changing) you'll need to maintain two parallel, #ifdef'd hand written bits of assembly that do exactly the same thing: one for GCC, and one for MSVC. It's a Royal Pain.
Those are two damning arguments, and it's why SSE assembly crafting has become something of a lost art.
EDIT: I originally led with "It's disappointing that the author makes no mention of SSE assembly," but this was confusing due to the distinction between assembly vs intrinsics. They did briefly mention SSE intrinsics, but not that hand-written SSE assembly is sometimes superior for raw performance. Also, getting a strikethrough modifier on HN for edits would be wonderful.
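To make point (b) concrete, here is roughly what that duplication looks like for a trivial helper (a hedged sketch; bit_scan_forward is a made-up example, and the MSVC block only works in 32-bit builds):

    #include <stdint.h>

    /* The same single instruction, written twice: once in GCC's AT&T-style
       inline asm, once in MSVC's (32-bit only) inline asm. */
    static uint32_t bit_scan_forward(uint32_t x)   /* x must be non-zero */
    {
    #if defined(__GNUC__)
        uint32_t r;
        __asm__("bsf %1, %0" : "=r"(r) : "rm"(x));
        return r;
    #elif defined(_MSC_VER) && defined(_M_IX86)
        uint32_t r;
        __asm {
            bsf eax, x
            mov r, eax
        }
        return r;
    #else
    #   error "no inline asm variant for this compiler"
    #endif
    }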
> I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
Being NP-complete doesn't really have anything to do with whether humans can solve a given problem better than computers or not. It just means that it (probably) takes exponential time to solve.
I believe there is no distinction whatsoever, in principle, between computers and humans (our brains are just big, complicated calculators). As such, it's only a matter of time before computers beat humans in compilation at every point, just like the situation in chess. People claimed for years that humans will always have some edge in chess, because they "understand" the game, or have "ingenuity", or something like that. Nobody claims that anymore, since computers absolutely massacre humans. And similarly to chess, computers have been getting better and better at a very fast rate, while humans have stagnated more-or-less.
As an aside, computer chess got quite boring for me, due to the homogeneity of current programs, but it did help to show me the limitations in current programming languages and compilers. So I'd like to do my part to put humans in their place in another field. My still very-much-not-even-close-to-finished contribution, that I haven't found much time to work on in a while, the Mutagen programming language: https://github.com/zwegner/mutagen
Often, the compiler cannot produce better code than humans because of the failings/features of the language being used. Even C/C++ has many gotchas; for example, the assembly code might have to constantly dereference a pointer because the source code (in this case) does not guarantee that the data won't have been changed by others. Meanwhile, a programmer who hand-crafts the same assembly might know that this won't happen, so they can keep the value in a register.
In theory, some of these shortcomings could be solved by a well-written, aggressive whole-program optimiser that could deduce data usage and aliasing, but in practice they don't or can't: few people use WPO, and it fails when you use library code.
In short, compilers probably don't have enough info to beat a skilled assembly programmer.
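A tiny sketch of the aliasing problem being described (a hypothetical function, not anything from the article):

    /* Because *len could alias dst[], the compiler has to re-load *len on
       every iteration instead of keeping it in a register -- something a
       human writing the asm would know is unnecessary. */
    void scale(int *dst, const int *src, const int *len)
    {
        for (int i = 0; i < *len; i++)
            dst[i] = src[i] * 2;
    }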
Absolutely--I see this as a failure of current programming languages and compiler technology, though, and not a fundamental barrier. We should aim to take away all the impediments to the compiler's job, like the aliasing problem you point out. Then, we can just sit back and let the blinding speed and precision of computers take over, trying all the different possible compilations of a given program.
This is the main goal of the Mutagen language. See the link above for a bunch of hand-waving about compiler technology that might exist some day :)
It's quite doable in C nowadays. Just use the restrict keyword, which pretty much just gives the compiler permission to assume that nothing aliases the pointer in question.
It's not that hard to write loops in C that vectorize well. And you get the benefit that those same loops also vectorize for ARM neon.
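Something like this, for instance (a minimal sketch; with -O3, GCC and Clang will typically vectorize it for SSE or NEON):

    /* restrict promises the compiler that out, a and b never overlap, so it
       can keep values in registers and emit packed SSE/NEON instructions. */
    void add_arrays(float *restrict out, const float *restrict a,
                    const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }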
While the restrict keyword helps several cases, it's not a catch-all. Even worse, it is very easy to get wrong, leading to near-undetectable errors in your code at some future date.
Worse, it's almost impossible to retrofit into large existing code. Even turning on strict aliasing in the compiler might cause hideous problems. The compiler can't always warn you about these.
The 'holy grail' is still for a compiler that can infer this stuff from the code. With existing C code, that requires at least WPO and some very deep reasoning about pointer usage. The other approach is a different programming language that makes data/variable access control more explicit.
I agree that retrofitting existing codebase is a pain.
Turning on strict aliasing is a mixed bag, as programs that break when it's on are explicitly breaking the C spec. But one must make concessions to support badly written legacy code. The official way to handle aliasing across types is to use unions.
In my opinion, the default behaviour in C should have been what restrict does, with a keyword to allow aliasing and yet another to allow aliasing across types, because unions are a bit tedious to use. As an example: float *aliased foo. And warnings when you cast a normal pointer into an aliased one, or cast across types if the pointer is not aliased across types.
sillysaurus3 said "a clever programmer will always be able to boost performance more than a compiler." That's a much, much stronger claim than the mostly unobjectionable "a clever programmer can still typically boost performance more than a compiler (especially when they have the ability to look at the output of that compiler with various optimization settings)".
> Isn't that what LLVM tries to solve dynamically?
No. And in fact, even writing directly in LLVM IR, there are certain kinds of code (high-speed interpreters like LuaJIT) that cannot even be expressed.
LLVM is awesome, but it's a language (LLVM IR) that does not support the full range of what can be computed efficiently with a particular CPU.
The reason you're wrong is because the semantics of C restrict what compilers can do. Humans know the actual semantics they require, and thus, when writing assembly language, can retain only those semantics, dropping everything else. Thus, "ingenuity" as you put it is a real advantage that humans actually possess over what a compiler can do—even if we gave that compiler an infinite amount of time.
This is quite unlike chess, where the rules are known up front, and are the same for both the computer and the human.
The reason you're wrong is that I wasn't talking about C :)
Yes, for any current programming language, humans do have some advantage by being able to utilize both low- and high-level semantic information that simply cannot be communicated to the compiler. The trump card, I believe, is coming up with new languages where this is not the case.
The worst gift #ifdef gives to the world is processor specific code paths.
You should really be using conditionals in your make file to select source code files.
It is easier to test, easier to debug and you can have a portable fallback.
Plan9 uses this method, which is why porting it to a new architecture is orders of magnitude simpler than for other software platforms.
There is no #ifdef hell to work your way through.
For instance, plan9's libc: http://plan9.bell-labs.com/sources/plan9/sys/src/libc/ You can choose which functions you would like to hand-code and test them against portable C versions with the flick of an environment variable.
Actually you can do the following (the example is a hypothetical memory allocator implemented at least partially in asm; a rough sketch follows the list):
1) create a "meta header file" malloc.h, containing the function prototypes.
2) create a "meta source file" malloc.cpp, with #ifdef switches for e.g. x86, x86_64, arm, etc, and inside the ifdefs some #include "malloc_x86.cpp"
3) inside the malloc_x.cpp source files, implement your architecture-specific asm code.
4) if you're after portable code, just do a #else #include "malloc_c.cpp" with a pure-c/cpp implementation. This way everyone can use the malloc and those with matching CPUs can benefit from acceleration.
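A rough sketch of what the "meta source file" from step 2 might look like (the file names are the hypothetical ones above; the exact predefined macros vary by compiler):

    /* malloc.cpp -- dispatch to an architecture-specific implementation,
       with a portable fallback. */
    #include "malloc.h"

    #if defined(__x86_64__) || defined(_M_X64)
    #   include "malloc_x86_64.cpp"
    #elif defined(__i386__) || defined(_M_IX86)
    #   include "malloc_x86.cpp"
    #elif defined(__arm__) || defined(_M_ARM)
    #   include "malloc_arm.cpp"
    #else
    #   include "malloc_c.cpp"   /* pure C/C++ fallback */
    #endif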
You should really be using conditionals in your make file to select source code files.
If you start moving your programming logic (even compile-time logic) into any kind of makefile, you make life much harder for gamedevs, because it becomes harder to port your code to Windows. If you care about open source, then this is a less inclusive way to write code. You'll be excluding a segment of your programming community (gamedevs) because they're forced by the nature of the industry to compile their code on Windows.
That's changing, but it's changing slowly. In the meantime, some people don't care about that, but it would be nice to get some sympathy when it's so easy to write the code in a way that can include more groups of programmers, which is what open source is all about.
I doubt plan9 uses cmake, though one could presumably use cmake on plan9. The general point they were making was "use your tooling to build different files, rather than different versions of the same file." Nit-picking at particular tools misses the point. Building it that way means you just need to pick a compatible set of source files and build them. Writing your code with a bunch of #ifdefs set up with autotools is going to be harder to get working on Windows, not easier.
> (b) you kill your portability by writing inline assembly. If you expect your code to run on Windows (and gamedevs expect their code to run on Windows, though this may be changing) you'll need to maintain two parallel, #ifdef'd hand written bits of assembly that do exactly the same thing: one for GCC, and one for MSVC. It's a Royal Pain.
This is why, when I write assembly, I separate out entire functions, use an external assembler and link them into the final executable. You still have to worry about calling conventions, but in general your code is much easier to maintain than inline assembly.
Also, at least for the x86 and gcc, the inline assembly syntax is horrific, and I can't see how anyone can write software in it and still keep their sanity. You really want to use Intel syntax.
The disadvantage of writing externally-linked assembly code is that in general, x86 assemblers are crap. You don't even get register allocation, so when I program, I have those pieces of paper around with register allocation tables (1970s-style).
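For reference, the separate-function approach mentioned above looks roughly like this on x86-64 Linux (a hedged sketch with made-up names; assemble with nasm -f elf64 and link the object file in with the rest):

    ; add2.asm -- System V AMD64 calling convention:
    ; first two integer args in rdi/rsi, result in rax.
    global my_add
    section .text
    my_add:
        lea rax, [rdi + rsi]
        ret

    /* caller.c -- the C side only needs a prototype */
    extern long my_add(long a, long b);   /* implemented in add2.asm */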
> This is why, when I write assembly, I separate out entire functions, use an external assembler and link them into the final executable. You still have to worry about calling conventions, but in general your code is much easier to maintain than inline assembly.
This will only yield performance benefits if the functions you write in assembly are large enough to amortize the call overhead. Having a non-inlineable function call to a few instructions of assembly may yield worse performance than a few lines of C code that can be optimized. This is especially true if you write 32-bit code, where the ABI requires you to pass parameters on the stack.
> Also, at least for the x86 and gcc, the inline assembly syntax is horrific, and I can't see how anyone can write software in it and still keep their sanity. You really want to use Intel syntax.
This is just a matter of getting used to it. It's just an assembler syntax; get over it.
You're going to have to learn to read AT&T syntax assembler anyway when looking at disassembly in the debugger or inspecting object files with binutils.
I, too, learned assembly programming with the Intel syntax and I still kinda prefer it, but for the occasional few instructions of ASM code I need here and there, it's just easier to use inline asm.
> You're going to have to learn to read AT&T syntax assembler anyway when looking at disassembly in the debugger
Not anymore. I use `set disassembly-flavor intel` in my .gdbinit. Modern LLVM binutils clones also allow disassembly into Intel syntax with `-x86-asm-syntax=intel`.
Modern versions of GAS can also use the `.intel_syntax` directive, which carries over into GCC inline assembly.
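A small sketch of that directive in use inside GCC-style inline asm (the switch back to AT&T at the end matters, or the rest of the compiler's own output won't assemble; GCC also accepts -masm=intel to do this globally):

    #include <stdint.h>

    /* Read the time-stamp counter using Intel syntax inside inline asm. */
    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile(
            ".intel_syntax noprefix\n\t"
            "rdtsc\n\t"
            ".att_syntax prefix"
            : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }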
Because it sure feels backwards to be doing that. And the guys next door who program Texas Instruments DSPs roll on the floor laughing -- they write linear assembly for their VLIW chips and the assembler composes the VLIW instructions, takes care of register allocations, and even reorders branch instructions as needed. I feel like using a chisel next to guys with jackhammers.
Writing SSE by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem [..] a clever programmer will always be able to boost performance more than a compiler.
Right now, but I think future proofing is another issue to keep in mind alongside portability. CPUs in, say, 5 years are likely to have alternative and more powerful ways of dealing with certain things, and a good optimizing compiler should (optimistically!) be able to compile older code to run even more quickly on newer CPUs than hand coded assembly might allow.
I'm not sure how much impact this issue has in something like the Linux kernel, say, but I could imagine if significant parts of the kernel were written in i386-level x86, it would be a big issue now and demand much rewriting compared to C. (But I could be wrong as I'm not a kernel hacker, and would appreciate expert insight here ;-))
I think this is a really good point. It's worth adding that this has been much less of an issue in the past than it will be in the future, as until recently you could always count on increased clock speed. By not being architecture-flexible, your code is basically stuck with today's performance.
That said, for some algorithms there's a decent chance that "today's performance" is about what we can expect 20 years from now, so maybe hand-optimization is beneficial.
> Writing SSE assembly (not intrinsics) by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem, like CPU bone animation of vertices.
My experience with assembly code vs. intrinsics is just the opposite of yours. What I got out of some intrinsics-heavy C code from GCC and Clang was way better than I could ever write.
Secondly, using inlineable functions, I could get interprocedural optimization from the compiler. I only needed to write the low level primitives (4x4 matrix multiply, matrix inverse, etc) with intrinsics and the high level code would be inlined and optimized. Adding assembly code to the mix will inhibit compiler optimization.
Additionally, the code I wrote could be retargeted for ARM NEON just by switching compilers. I was using mostly the compilers' vector extensions instead of CPU-specific intrinsics (ie. __builtin_shuffle, not _mm_shuffle_ps).
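For illustration, that style looks something like this (a sketch using GCC's vector extensions; the function names are made up, and the shuffle typically lowers to shufps on x86 or a lane duplicate on NEON):

    /* GCC vector extensions: the same source compiles to SSE or NEON. */
    typedef float v4sf __attribute__((vector_size(16)));
    typedef int   v4si __attribute__((vector_size(16)));

    static inline v4sf broadcast_x(v4sf v)
    {
        const v4si lanes = {0, 0, 0, 0};      /* replicate lane 0 */
        return __builtin_shuffle(v, lanes);
    }

    static inline v4sf madd(v4sf a, v4sf b, v4sf c)
    {
        return a * b + c;                     /* element-wise, vectorized */
    }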
> And while it's convenient to use intrinsics to generate SSE (intrinsics look like ordinary C functions and save you from having to write actual assembly), it's impossible for them to always be as efficient as a clever programmer. I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
Looking at the assembly code generated by GCC and Clang, I can see that the compiler does near optimal instruction scheduling and register allocation. On the other hand, the compiler was rather poor in doing instruction selection. I needed to pay a little attention to the code I wrote to avoid GCC spilling registers to the stack, Clang was a bit better.
So in my experience, a clever programmer writes poorer assembly code than a clever programmer using a clever compiler and intrinsics. You might need to look at the emitted code from the compiler and do some fine adjustments, so assembly skills are still needed.
I bet there are still cases where you can beat the compiler in an isolated problem, but when it comes to maintainability, re-targeting and the amount of effort needed to write the code, writing assembly code isn't worth it most of the time.
Well, of course intrinsics perform just as well as inline assembly when you're doing a single matrix multiply or a single matrix inverse. You won't see any benefit unless you're doing high-throughput computation like thousands of consecutive multiplies. That's when the clever programmer will wipe the floor with your clever compiler. (Well, probably only by a factor of 2x. But still, 2x is better than not-2x in the exceptionally rare case when performance is critical.)
I wasn't arguing that people should generally do this. The tradeoffs are rarely worth it. What I was saying is that humans can be more capable than compiler magic, and when it matters, it can really matter. Sometimes by up to 600x, as the other commenter demonstrated.
> Well, of course intrinsics perform just as well as inline assembly when you're doing a single matrix multiply or a single matrix inverse. You won't see any benefit unless you're doing high-throughput computation like thousands of consecutive multiplies. That's when the clever programmer will wipe the floor with your clever compiler.
This was exactly what I was doing - high throughput inner loop code with lots of matrix multiplies, etc. I wrote the primitive operations (multiply, inverse, etc) with intrinsics but my high level code was readable and maintainable C code. The compiler-emitted assembly code was at least as good as I could have written by hand, pretty close to the speed of light (ie. pretty close to CPU peak flops and/or memory bandwidth, whichever was the relevant bottleneck).
It takes an exceptional programmer and a lot of time to "wipe the floor with the compiler" with hand written assembly code. The compiler's ability to come up with near-optimal instruction scheduling and the ability to quickly change register allocation is pretty hard to match - you can do it if you spend enough time with the Intel manuals but that's not time well spent.
By analyzing your compiler output and fine-tuning your intrinsics code, you should be able to get pretty close to the peak performance with less time spent than hand writing the asm code. I stand by my point - a clever compiler and a clever programmer together can get a better job done than either of those individually.
This, however, requires a programmer with assembly skills. It's still valuable to be able to write and especially read assembly code.
I wrote the primitive operations (multiply, inverse, etc) with intrinsics but my high level code was readable and maintainable C code.
What I'm saying is that you have to write an algorithm in assembly which has an inner loop. You can't have the body of your inner loop in assembly, but the looping mechanism in C. This is because you can pull out parts of the algorithm from the inside of the loop to outside of it, or you can prevent the compiler from constantly re-loading from memory into SSE registers inside the inner loop, etc.
In other words, what I think you're saying is:
    for (...) {
        /* inline assembly or intrinsics */
    }
What I'm saying is:
    /* start inline assembly
       preload anything that can be preloaded into SSE registers
       loop:
           achieve parts of the overall goal
           write any finished operations back to main memory
           prefetch N cachelines ahead
       end inline assembly */
It may seem like a small difference, but in the former case, you're still doing a "single matrix multiply" (you're doing them one at a time). I think that might be why you're not seeing much of a difference between intrinsics vs inline assembly.
Matrix multiply isn't such a good example... there are better cases like texture lookups in a software rasterizer.
Also, the prefetching step is vital. It may be you're not seeing performance gains due to lack of prefetching, which is orthogonal to the debate of SSE intrinsics vs SSE inline assembly. Something like VTune can help reveal precisely why your program is going slow. There are many pitfalls, and it's easy to accidentally optimize for the wrong thing and then trick yourself into believing that particular optimization never helps in any circumstance.
No, you didn't understand what I was doing or why I was getting good results.
My code was something like:
__builtin_prefetch(first cache line);
    for (a lot of 3d objects) {
        __builtin_prefetch(next cache line);
        vec4 q = read_quaternion(), p = read_position();
        mat4 m = matrix_product(translate(p), quat_to_mat(q));
        mat4 n = inverse_transpose(m);
        mat4 mvp = matrix_product(camera_matrix, m);
        // lots of more matrix-vector-quaternion math here
        stream_store(m); stream_store(n); ....
    }
In other words, my "high level" code was readable C code, only the primitives (matrix_product, etc) were intrinsics code.
What the compiler did was inline all the primitive ops, keep all the values in registers at all times (no loads, stores, or register spilling in the inner loop), and finally reorganize the instructions to get near-optimal scheduling.
In some simpler programs, I got even more benefit from the compiler doing some loop unrolling to keep loads/stores balanced with ALU ops to get very effective latency hiding.
Writing that whole loop in assembler would have yielded next to no improvement, but would have sacrificed readability, maintainability and portability.
> You can't have the body of your inner loop in assembly, but the looping mechanism in C.
You're right in that you can't mix inline assembler and C and get the compiler to optimize it correctly, but using intrinsics you can get the best of both worlds.
This is pretty much exactly what I had, except that my "inner loop" primitives were C code+intrinsics that looked like assembly code. Because they were not assembler code, the compiler was able to optimize the whole loop and the end result is something similar to what you show as an example of good performing asm code.
If you do end up writing assembler, you have to be sure that it is worth sacrificing the compiler optimizations that would take place otherwise.
And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.
You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!
> And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.
Yeah, at some point the compiler can't do any more magic and will start spilling. But it's not very often when a programmer intervention is required.
Quite often you can get the effect you need by looking at the compiler-emitted code, see where the spilling or other unwanted effects happen and do small tweaks to your C code. This is a bit annoying but it still beats hand writing asm code when it comes to time investment (it is not necessarily as fun, though).
> You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!
Yes, compilers have improved and will keep on improving. I was blown away by the quality of code I saw coming from GCC and Clang, in particular about how good the instruction scheduling was.
In any case - writing C code that looks like assembly code (ie. plain and simple) gives very good results and doesn't require you to sacrifice compiler optimizations like using real assembly code does. Read the output assembly code and revisit the C code if required; this way you should be able to get near-optimal code with less time investment.
Writing Assembler code is still really fun, though. Unfortunately it's not a good investment when it comes to achieving your goals in time.
I work on solving NP-complete problems by computer. The idea that humans are better at NP-complete problems than computers is laughable. As one-week programming tasks, I set 2nd-year students the job of writing programs that solve NP-complete instances no person will ever be able to solve by hand.
Sure, in special cases; and it's an important lesson for students to learn: at small sizes, problems that are theoretically intractable in the general case are perfectly attackable by brute-force enumeration, and that may be the right way to do it under the circumstances.
>So a clever programmer will always be able to boost performance more than a compiler.
I believe the term you're looking for there is "a sufficiently-clever programmer" :)
Arguably one is more likely to exist than a sufficiently-clever compiler, but it's exactly the same claim. Even the best humans will have sub-optimal edge cases just like compilers. Given perfect information, sure, I'll agree humans can do better. But it's probably impossible to have (much less make use of) perfect information.
He does mention SSE: "Replacing the _mm_andnot_si128 intrinsic with an actual and-not on vectors caused gcc to use other, slower instructions instead of the vmovq to move the result out of the SSE registers for reasons I don't particularly want to track down"
I'm miscommunicating, sorry. Let me try again: The author talks about SSE intrinsics, which masquerade like regular C functions. But this is a post saying "don't use assembly," and writing the actual SSE assembly can give higher performance than using SSE intrinsics.
I meant that I was disappointed he didn't talk about SSE assembly directly.
Don't follow this editorialized title's advice if you'd actually like to become an expert [1].
The actual article is quite good and makes a very valid point. Don't jump to machine code whenever things get a bit slow; chances are you're introducing more problems than you're fixing, and it's very difficult to outdo the compiler. Generally you should start by looking for macro-optimizations instead. But to add my own advice, it's still an incredibly worthwhile thing to learn, even if you don't expect to ever run into the types of scenarios which will benefit from optimization at this level.
Why? Programming languages do everything they can to abstract away the machine, and abstractions do leak [1]. Even mostly air-tight abstractions over machine code [2]. Because of this, learning the entire "stack" makes debugging higher-level code much easier. When you can understand what's happening with every level right down to the metal, what were once ridiculous off-the-wall problems become recognizable and much easier to reason about.
So do use machine code if you'd like to become an expert. Just use it on your own time, and don't use it under duress.
1: Presently reads "Don't use assembly unless you're an expert."
A trend which is both tempting and annoying. I have a feeling that this is causing some people to develop an analogue to "advertisement blindness" where people tend to ignore titles which fit into certain models. I think the filter gets tripped for me any time people use overly emotional modifier words, the word "this," or the phrase "what happens next." This title didn't fit any of those, however.
and then runs the function 1000 more times, measuring each run independently and reporting the average runtime at the end.
When it comes to measuring the performance of code like this, averaging run times is not the way to do it.
To remove the noise caused by context switching, just run the code many times and report the single fastest run you get. This should be the value closest to running the code on an OS without preemption (i.e. you want to measure how fast the code runs on the bare metal without interruption).
Even Facebook's Folly library [0] changed their benchmarking code from using statistics to just providing the fastest run. As the comments say:
// Current state of the art: get the minimum. After some
// experimentation, it seems taking the minimum is the best.
return *min_element(begin, end);
This is explained in the docs [1]:
Benchmark timings are not a regular random variable that fluctuates around an average. Instead, the real time we're looking for is one to which there's a variety of additive noise (i.e. there is no noise that could actually shorten the benchmark time below its real value). In theory, taking an infinite amount of samples and keeping the minimum is the actual time that needs measuring.
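In C, the whole idea fits in a few lines (a minimal sketch; the clock source and iteration count are arbitrary):

    #include <time.h>

    /* Run fn() many times and report the fastest run, on the theory that
       all measurement noise is additive and can only make a run slower. */
    static double bench_min_ns(void (*fn)(void), int runs)
    {
        double best = 1e30;
        for (int i = 0; i < runs; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            fn();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            if (ns < best)
                best = ns;
        }
        return best;
    }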
Sort of a tangent, but you know what I'd love to see? A compiler that could emit comments with the code it generates. They wouldn't have to be brilliant, just whenever it decides to optimize some bit it emits a generic "why". It'd be even cooler if it was a website that hooked up to something like GCC, and you could feed it some C code and get a hypothetical "what would the compiler do here" sort of deal.
I realize that's kind of crazy, but it'd be an awesome teaching and debugging tool. I don't look at disassembly a lot, but when I do it's not always particularly obvious why the compiler generated a thing in a certain way.
I can make things go more than 10 times faster in assembly. My main job is as a manager/entrepreneur, but I can read and write assembly as a result of my experience, and I help guide other people easily.
In the real world 10 times faster is nothing. You should spend the time understanding the problem in a mathematical way, and VERY IMPORTANT, documenting your work using images, text, voice and video.
This way you can make things go 100, 1,000, or 10,000 times faster, as most algorithms can be indexed or ordered in some way that makes them extremely fast, e.g. doing log(n) work instead of n squared, cubed, or raised to the fourth or fifth power (when you manage several dimensions, like 3D plus time, video analysis, or medical tomography).
More important than that, 10 years from now it will continue working on new devices and OSes, and it will be something that supports the company instead of a debt burden because the original developer is no longer around (or because you don't have the slightest idea what you did so far in the past and didn't document it).
The main problem is that people are not self-aware enough to realize they forget things. And your brilliant idea that makes everything go 3 times faster is nuts if it makes everything way harder to understand, or if it could be forgotten even by you.
That's not good advice. Whether you need to use assembly depends on the particular situation at hand.
Here's a practical example: as a result of redesigning the algorithm to use fixed-point and implementing it in assembly, I got it to run 600x faster than the initial C version. Big O complexity was the same, the difference was in the constant factor. But the constant factor matters! In my case, it meant that you could get your computation done in half a day instead of a year.
Yes, it took me 3 weeks to get the algorithm implemented, instead of a single day, but even so — it was definitely worth it. And in many cases even a 3-fold improvement in speed is important, if you have long-running calculations.
Not knowing too much about processor architecture, I don't understand how fixed point can be much faster, since floating-point ops are implemented in hardware. I presume you used integer operations on your fixed-point values, but could you explain a bit why it ends up being much faster than floating point?
It all depends on how precise your fixed point values need to be. If you can squeeze them into 8 bits (I could), you can use SSE 128-bit registers to operate on 16 values at a time. It gets even better with AVX, although that wasn't available to me at the time.
So the speedup is not just from going to fixed point, but from managing to use the vector instructions.
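For a sense of scale, a single SSE2 instruction handles 16 packed 8-bit values at once (a rough sketch with made-up names; the actual fixed-point scaling is application-specific):

    #include <emmintrin.h>   /* SSE2 */

    /* Add two arrays of 8-bit fixed-point values, 16 lanes per instruction.
       Assumes n is a multiple of 16. */
    void add_q8(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            __m128i vs = _mm_adds_epu8(va, vb);   /* 16 saturating adds */
            _mm_storeu_si128((__m128i *)(dst + i), vs);
        }
    }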
You're probably right that in most cases you should not write assembly code to try to make some code run faster. However, it is a very valuable skill to know how to read assembly code and spot the inefficiencies.
For a low level hacker, it is a very valuable skill to be able to write and especially read assembler code. I need that skill regularly in my day job. And I would have not acquired that skill if I had not written some assembly code. And besides, writing assembly code is fun!
Sometimes you need that 10x speed improvement to be able to do what you need to. To get your game running smoothly or your video playback work. You need to know when and how to optimize for performance.
> More important than that, 10 years from now it will continue working on new devices and OSes
They said that about x86... 20 years ago. I have applications written in Asm that still work on the latest CPUs today. The same binaries, not even needing recompilation, now run several orders of magnitude faster. I still see a lot of potential in extracting performance from x86 and although I hesitate slightly to make this prediction, I think it'll be the dominant architecture for at least 10 more years.
> I think it'll be the dominant architecture for at least 10 more years.
That needs to be qualified as "for desktops/servers" or similar. x86 hasn't been the dominant architecture for at least a decade, if ever, in terms of units shipped. It's being outsold in number of units by ARM at a 10:1 ratio, and MIPS and PPCs are shipped in higher volumes as well, or at least were as of a year or two ago. Possibly even 6502s and various microcontrollers, though getting numbers is harder.
Keep in mind how many CPUs are around you. Our servers have an ARM core per hard drive, and several of our RAID controllers have multiple PPC cores, for example. We have some servers with dozens of non-x86 CPUs per x86 CPU. Even some SD cards have ARM cores on them.
Now consider your car, microwave, washing machine, dish washer, tv, set-top box, phones, camera, music player, digital radio. A lot of stuff that was semi-mechanical or employed discrete logic a few years back now have CPUs that are ridiculous overkill, but used because they're so cheap there's no reason not to.
x86 is a diminishing niche if you look at electronics as a whole.
There are exceptions... things like color space conversions (though technically the last time I did that was in Cg) where that is about all you can get.
You forgot the whole quote. "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Knuth is saying that in most cases optimizing prematurely is not worth it, which is right, but just as critical is the part saying that there are parts where pays off to consider optimization early. Writing your logging function in assembly likely fits into the 97%. Writing your N-queens solver, if you're aiming for all-out speed, likely fits into the 3%.
Of course not thinking about optimization at all will lead you down a path where your software is unoptimizable, or at least very difficult to optimize. Ask anyone who's had to optimize a video game that manages all objects in a scene graph, where reorganizing a cache hostile tangled web of pointers to base classes is a herculean feat.
I upvoted this post (not enough do, IMO) when it was on the "new" page, and I'm glad it made it to the front page; many good posts don't.
A persistent peeve of mine is the vapid commentary that invariably shows up in programming-language-related discussions:
"what's the point of this language, if one needs speed, it will be written in C".
Some would replace "C" with "assembly" in that sentence.
The point is that if a piece of code is doing something non-trivial, it is extremely hard to have confidence in the correctness of handwritten code that is written entirely at a low level. Smart compilers apply drastic, and sometimes unintuitive, transformations to optimize code. To do the same manually, using low levels of abstraction, would require a tour de force to pull off correctly. This simply is not going to happen often. It's precisely for speed that we need a high-level, yet optimization-friendly, language, so that we can delegate the job of optimizing the code to the compiler. Compilers are much better at applying large correctness-preserving transformations than we humans are. The "write it in assembly" approach works in the small, not in the large.
Whenever I write anything non-trivial in assembly, I always go through a series of progressively lower-level implementations in C. The last ones emulate SSE instructions with for loops and correspond very closely to the assembly code. It isn't possible for me to write a serious chunk of assembly 'just like that'.
Please make a video of the various stages of you doing this! Or a tutorial pointing out how the various progressions look. That's awesome, and it's extremely difficult for others to learn that process except through years of trial and error.
I did not bookmark it, but there was this fun blog post about writing a number-crunching homework assignment in Scheme (much against contemporary wisdom). It ended up beating other implementations in C. The trick was to write it in CPS and keep hand-transforming it iteratively till it was efficient assembly in a different syntax. I would still argue that you are really writing in a high-level language/abstraction and the rest is mostly tedious transliteration, with special attention to keeping the FPUs busy, efficient memory loading, and register allocation.
No big deal. Just become an expert then. It is not that hard. Just have a look at Ian Piumarta's work [1,2]. Future systems will probably facilitate programming on every level from the bare metal to the highest abstraction. I.e. no strict boundaries between the implementation of the application and the implementation of the compiler/interpreter.
I feel like people are assuming that the only reason you write assembly is for performance. I try to stay as far away from it as possible, but sometimes there is no other way (barring finding a library that itself has written assembly). Some examples: finding a backtrace, getting the CPUID, or calling a function of arbitrary number of arguments (for some reflection like capabilities: imagine a function like "void foo(void (* func)(), ArrayList * args, char * arg_types)" and getting foo to call func with args passed as arguments, not an array). If you can, you should try and find a library to do those tasks, but the libraries themselves have no choice but to write assembly as you just cannot do those things in plain C.
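As an example of the "no choice but asm" category, CPUID boils down to something like this with GCC-style inline assembly (a sketch; GCC and Clang also ship a <cpuid.h> wrapper around the same instruction):

    #include <stdint.h>
    #include <stdio.h>

    /* Query CPUID leaf `leaf` on x86/x86-64. */
    static void cpuid(uint32_t leaf, uint32_t out[4])
    {
        __asm__ volatile("cpuid"
                         : "=a"(out[0]), "=b"(out[1]), "=c"(out[2]), "=d"(out[3])
                         : "a"(leaf), "c"(0));
    }

    int main(void)
    {
        uint32_t r[4];
        cpuid(0, r);
        /* The vendor string is the concatenation of ebx, edx, ecx. */
        printf("%.4s%.4s%.4s\n", (char *)&r[1], (char *)&r[3], (char *)&r[2]);
        return 0;
    }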
It's not about the application but about a man's need to be an animal again. Frankly, writing Ember.js apps in CoffeeScript, which has roughly 11 layers of abstraction, feels like a metrosexual man eating processed food with organic marketing on IKEA furniture (CoffeeScript -> Compiled JavaScript -> Ember.js -> jQuery -> JavaScript -> Chrome V8 Engine -> WebKit Layout Engine -> GTK+/Cocoa/Chrome Windows View -> Linux/OS X/Win32 display API -> C/C++ -> Assembly).
There was a time when a man could eat raw meat (machine code), but he can start again today by eating bloody beef.
Unless the man grows, feeds and slaughters the cow himself, that "bloody beef" is just as removed from animalistic experience as the organic food.
It's ridiculous to pretend that eating hormone-accelerated mass-processed meat is closer to nature. Sure, it's oozing blood, but that blood is tainted by an enormous industrial profit machine.
Only if he made the bow and arrows himself from sticks, fibre, and shale he found in the woods, and if he also purified/doped/etched his own silicon wafers.
Also the CPU re-orders everything. Also there are 4 CPUs and 8 threads. The threads run at different speeds. Also division will take a variable amount of time.
This has been true for a long time, even when the x86 architecture was much simpler. I worked on an SDK for an 80386-based platform back in '97, and it was unusual for me to do better than the Watcom C compiler.
Well the previous discussion was centered around the same assembly implementation as this one. The only difference is, in the previous discussion the fastest C/C++ implementation was 1.1x slower than the assembly. In this current discussion, the author managed to write a C++ implementation that runs faster than the assembly version.
Okay, but as far as I can tell he changed the algorithm (recursive C++, pop/push to a global array, etc.)... doesn't that kind of negate the argument unless you change it in the assembly version as well?
True enough. That reinforces a couple of other points that tend to become apparent from these sorts of comparisons and benchmarks:
1) High level languages are easier to refactor and maintain than lower level languages (changing the algorithm in the assembly's a big job).
2) For performance, the algorithm used is almost always more important than the language you choose to implement it in.
Both of those points back up the author's assertion, it's a shame he didn't specifically discuss them.
I used to hand-code assembly (or generate it, eg "compiled bitmaps") back in the 8086 days. For many situations there were easy gains to be had that couldn't be achieved so easily in high level languages. These days I wouldn't dream of attempting it other than perhaps for vector code in an inner loop. I'd far rather spend the time profiling and optimising the algorithms and data structures because that's where the big gains are going to be. To go ahead and implement something in assembly when the algorithm clearly isn't yet optimal is perhaps fun but also kinda insane.
The author of the (current title of) the post is presumably hinging the argument on the definition of 'expert' described in this video: http://www.youtube.com/watch?v=BKorP55Aqvg
Percentages are for suckers. If you're not optimizing your code against actual use cases, then you're just wasting your time. I'd say 90 percent of the code anyone will ever write would not provide any benefit if it "ran faster".
No one will ever know or care that a 5ms operation only takes 2ms because of how "savvy" you are. No one knows or cares what language you used. It's just a program.
If you can produce a working program in a high-level language but chose to use a low-level one... why?
If your operation is relaying a 20ms audio packet from one side of a call to another, 50µs to 20µs is the difference between handling 200 concurrent calls and handling 500 (per CPU, minus non-linear scaling with increasing CPUs and other overhead). Unlike the cost of hardware, which every customer would have to buy separately, the development cost can be spread out over many customers.
If $50,000 of developer effort can save 50 customers $1000 on hardware, you've already broken even in a sense.
This is nothing new; letting an optimizing compiler do the heavy lifting has been the general-case recommendation for a long time now.
But there are still circumstances where it's necessary to write assembly, particularly in things like real-time programming. Doesn't matter how good a job the compiler did on the overall program, if you can afford 16 cycles on an inner loop and the compiler spits out 24 then you've gotta hand-tune it.
I wish there was more ability to specify the paths I really care about being low latency (even at the expense of throughput) and the like. As it is, it's rare that I drop below C, but more common that my C winds up restructured for what it happens to do to the output code.
Even if you are an expert, consider whether the next person to maintain your code will be an expert. Assume they aren't and that they know where you live.
Something the author doesn't mention is how Intel tunes their architecture to be faster on code output by popular compilers. This means it is likely that N generations of x86 from now, the C code will run even faster relative to the hand-written code.