I'd rather say "don't use Intel x86 assembly unless you are measuring very, very carefully". First, few architectures have so many weird traps and interactions as x86 does. Second, most of these things become visible once you start measuring (although you have to be very careful when measuring performance, as it is an art in itself).
The x86 these days is effectively a poor-man's VLIW machine: you have a number of units with out-of-order execution. This means that you can look at your instruction stream as a stream of VLIW instructions for all of the CPU's units. The performance of your code might depend not just on the exact number of instructions, their execution times, and their latencies, but also on the interdependencies between your instructions and scheduling constraints. Getting that right is a nightmare, and even if you do get it right on your chip, there are at least several major x86 microarchitectures in use, each one with its own peculiarities.
As a counterexample, I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
I write assembly for DSPs on a near-daily basis. Up until a few months ago there didn't even exist a C compiler for the target architecture.
Even when you write C code for an embedded platform that does have a decent C toolchain, you cannot truly understand what you're doing without spending a lot of time looking at the generated assembly, and writing some of it yourself.
However, this kind of architecture has nothing to do with the x86. RISC, no cache, in most cases no MMU (and rarely any DMA), an extremely simple and straightforward pipeline, etc. I've written assembly for several such architectures, but compared to them I find x86 assembly intimidating.
I also do DSP development, and looking at the x86 architecture, it almost seems pointless to try to predict what the processor is doing. Out-of-order execution? Register renaming? The only way I can be sure assembly is faster than generated C is by profiling, which isn't even exact anyway without isolating the operating system.
For us, latency and determinism are the most important requirements of any processor, and it's very hard to get that on x86 without writing your own OS. But that negates the entire advantage of using the x86 platform with so much available code written on top of other kernels.
This brings back memories of when I wrote a lot of DSP code. The architecture I was on had a decent C compiler, but it wouldn't correctly take advantage of some of the special instructions. By rewriting a viterbi decoder in assembly to take advantage of their special subtract-compare-branch instructions, I recall getting a huge performance increase. But the compiler did a decent job at optimizing for multiply-accumulate instructions especially if you gave it the right hints with pragmas.
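For reference, the kind of loop that benefits from those hints looks something like this (a generic sketch, not the actual decoder; the pragma shown is a hypothetical, vendor-specific trip-count hint in the style of TI's MUST_ITERATE, and the function name is made up):

    #include <stdint.h>

    /* A plain multiply-accumulate loop; DSP compilers typically map the
       body onto a single MAC instruction. */
    int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        /* #pragma MUST_ITERATE(8, , 8) -- vendor-specific hint promising the
           trip count is a multiple of 8, which helps software pipelining */
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];   /* multiply-accumulate */
        return acc;
    }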
I wrote x86 assembly demos for fun in the golden age of asm between the eras of the 8088 and the P6... It became so much about putting random instructions here and there to get better performance that one almost needs a QuickCheck-like synthesizing+benchmarking hybrid compiler that automagically mutates code to find what is actually fastest via genetic-algorithm-like heuristics.
In my experience, performance is dominated almost completely by memory throughput and/or memory latency (depending on whether the prefetcher succeeds).
All the instruction latencies, VLIW optimizations, and so forth have a minor effect.
However, the cases where these things do have an effect are exactly the cases where it might make sense to go down to the ASM level.
The first avenue to improve performance, though, is improving memory access patterns. Compacting, aligning and rearranging data in cache windows. Even creating and maintaining redundant copies of data which is compacted or organized in a way that aids the cache can help.
Another big improvement I've witnessed is helping the prefetcher. For example, I've had code that had to iterate a few linked lists to work on all elements. When I reorganized that code to interleave the iteration of 8 linked lists at a time (rather than iterating them all serially), I went up from being memory-latency-bound, to being closer to memory-throughput-bound. The prefetcher could predict the next 8 ptrs I'll dereference, and prefetch 8 cache lines, instead of 1 at a time.
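A rough sketch of that interleaving, assuming eight independent singly linked lists and a hypothetical visit() callback:

    struct node { struct node *next; int payload; };

    /* Walk 8 independent lists in lockstep so there are always several
       pending loads for the memory system to overlap, instead of one
       serialized pointer chase at a time. */
    void visit_interleaved(struct node *heads[8], void (*visit)(struct node *))
    {
        struct node *cur[8];
        for (int i = 0; i < 8; i++)
            cur[i] = heads[i];

        int live = 8;
        while (live > 0) {
            live = 0;
            for (int i = 0; i < 8; i++) {
                if (cur[i]) {
                    visit(cur[i]);          /* work on this element */
                    cur[i] = cur[i]->next;  /* this load overlaps with the others */
                    live++;
                }
            }
        }
    }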
I recently wrote some ARM (Thumb-2, for the Cortex-M4) code. It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
Better in terms of code size, speed, or both? (Hm, does code size even matter anymore? Maybe for the CPU cache.)
Better in terms of both, and code size matters on all architectures. On microcontrollers it's because you really have limited memory, and on larger machines it's because your code cache footprint matters for performance.
That's what I thought too when I saw the numbers; OK, he managed to get it 18% faster, but how much bigger was the result? I'm guessing far more than 18% bigger. I've seen cases where replacing a function with a smaller one, but one which microbenchmarked slower, actually led to the application as a whole performing much faster due to more code fitting in the cache. Ditto for individual instructions; (unfortunately - but Intel seems to be slowly changing this) many of the shortest instructions aren't the fastest when considered in isolation, so compilers tend to avoid them, but the 2-3x difference there is far less than the cost of a single cache miss.
64K of L1 might seem like a huge amount for a function to fit into, but that's 64K shared among every other often-used piece of code. It's for this reason that loop unrolling is no longer recommended, either.
The first one is alignment of branches: the assembly code contains no directives to align basic blocks on particular boundaries, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself.
And if you were, you would likely not see much on a modern x86; in fact it would probably make things slower, as it may confuse the branch predictor --- instructions are variable-length, so their addresses naturally are not aligned, and that enables e.g. lower-order bits to be used as cache tags.
Yes, actually a better word would be code density. A weird encoding like x86's behaves like compressed code that gets decoded inside the CPU into another instruction set that actually gets executed. This property, along with backwards compatibility, is what has kept the x86 instruction set alive.
With the move to 64-bit, code had to be recompiled anyway; hence there was a good opportunity to reshape the x86 instruction set significantly. Instead, the number of changes was relatively small; the instruction set kept its flavour (variable-length encoding, non-orthogonal operands, ...). I believe there are many reasons for that choice (made by AMD), but code density was a key factor.
It wasn't very difficult, and the code I wrote was much better than what gcc could produce.
This is at least partially because a lot more work has been put into gcc's x86 backend. If the compiled code was that bad then presumably you saw quite a bit of low-hanging fruit in terms of potential improvements to the optimiser.
A random thought just crossed my mind -- how about a genetic algorithm that experiments with all these options and evolves towards the best performance?
Run speed is objective and measurable, which is desirable for a fitness criterion.
Somehow, you'd have to specify which instructions can be moved around and which can't -- or maybe not, if you could also assess whether the program executed "successfully."
I don't really see it as practical, but it might be fun. Might uncover optimizations that no one has seen before.
Unfortunately in most modern OSes run speed on x86 is variable and profiling itself is intrusive. You might be able to get a general trendline toward a faster algorithm but it'll still be lost in statistical noise.
The second problem is that the underlying processor is a moving target. The x86 programming model doesn't directly map to hardware anymore, with out of order execution, branch predictors and register renaming. One variant of the processor with a particular caching algorithm might have totally different results than the next.
Writing SSE assembly (not intrinsics) by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem, like CPU bone animation of vertices.
And while it's convenient to use intrinsics to generate SSE (intrinsics look like ordinary C functions and save you from having to write actual assembly), it's impossible for them to always be as efficient as a clever programmer. I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
That said, the two biggest arguments against assembly are:
(a) performance matters much less in the modern day,
(b) you kill your portability by writing inline assembly. If you expect your code to run on Windows (and gamedevs expect their code to run on Windows, though this may be changing) you'll need to maintain two parallel, #ifdef'd hand written bits of assembly that do exactly the same thing: one for GCC, and one for MSVC. It's a Royal Pain.
Those are two damning arguments, and it's why SSE assembly crafting has become something of a lost art.
EDIT: I originally led with "It's disappointing that the author makes no mention of SSE assembly," but this was confusing due to the distinction between assembly vs intrinsics. They did briefly mention SSE intrinsics, but not that hand-written SSE assembly is sometimes superior for raw performance. Also, getting a strikethrough modifier on HN for edits would be wonderful.
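To make point (b) concrete, here is roughly what that duplication looks like for a trivial helper (a hedged sketch; bit_scan_forward is a made-up example, and the MSVC block only works in 32-bit builds):

    #include <stdint.h>

    /* The same single instruction, written twice: once in GCC's AT&T-style
       inline asm, once in MSVC's (32-bit only) inline asm. */
    static uint32_t bit_scan_forward(uint32_t x)   /* x must be non-zero */
    {
    #if defined(__GNUC__)
        uint32_t r;
        __asm__("bsf %1, %0" : "=r"(r) : "rm"(x));
        return r;
    #elif defined(_MSC_VER) && defined(_M_IX86)
        uint32_t r;
        __asm {
            bsf eax, x
            mov r, eax
        }
        return r;
    #else
    #   error "no inline asm variant for this compiler"
    #endif
    }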
> I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
Being NP-complete doesn't really have anything to do with whether humans can solve a given problem better than computers or not. It just means that it (probably) takes exponential time to solve.
I believe there is no distinction whatsoever, in principle, between computers and humans (our brains are just big, complicated calculators). As such, it's only a matter of time before computers beat humans in compilation at every point, just like the situation in chess. People claimed for years that humans will always have some edge in chess, because they "understand" the game, or have "ingenuity", or something like that. Nobody claims that anymore, since computers absolutely massacre humans. And similarly to chess, computers have been getting better and better at a very fast rate, while humans have stagnated more-or-less.
As an aside, computer chess got quite boring for me, due to the homogeneity of current programs, but it did help to show me the limitations in current programming languages and compilers. So I'd like to do my part to put humans in their place in another field. My still very-much-not-even-close-to-finished contribution, that I haven't found much time to work on in a while, the Mutagen programming language: https://github.com/zwegner/mutagen
Often, the compiler cannot produce better code than humans because of the failings/features of the language being used. Even C/C++ has many gotchas; for example, the assembly code might have to constantly dereference a pointer because the source code (in this case) does not guarantee that the data won't have been changed by others. Meanwhile, a programmer who hand-crafts the same assembly might know that this won't happen, so they can keep the value in a register.
In theory, some of these shortcomings could be solved by a well-written, aggressive whole-program optimiser that could deduce data usage and aliasing, but in practice they don't or can't: few people use WPO, and it fails when you use library code.
In short, compilers probably don't have enough info to beat a skilled assembly programmer.
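A tiny sketch of the aliasing problem being described (a hypothetical function, not anything from the article):

    /* Because *len could alias dst[], the compiler has to re-load *len on
       every iteration instead of keeping it in a register -- something a
       human writing the asm would know is unnecessary. */
    void scale(int *dst, const int *src, const int *len)
    {
        for (int i = 0; i < *len; i++)
            dst[i] = src[i] * 2;
    }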
Absolutely--I see this as a failure of current programming languages and compiler technology, though, and not a fundamental barrier. We should aim to take away all the impediments to the compiler's job, like the aliasing problem you point out. Then, we can just sit back and let the blinding speed and precision of computers take over, trying all the different possible compilations of a given program.
This is the main goal of the Mutagen language. See the link above for a bunch of hand-waving about compiler technology that might exist some day :)
It's quite doable in C nowadays. Just use the restrict keyword, which pretty much just gives the compiler permission to assume that nothing aliases the pointer in question.
It's not that hard to write loops in C that vectorize well. And you get the benefit that those same loops also vectorize for ARM neon.
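Something like this, for instance (a minimal sketch; with -O3, GCC and Clang will typically vectorize it for SSE or NEON):

    /* restrict promises the compiler that out, a and b never overlap, so it
       can keep values in registers and emit packed SSE/NEON instructions. */
    void add_arrays(float *restrict out, const float *restrict a,
                    const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }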
While the restrict keyword helps several cases, it's not a catch-all. Even worse, it is very easy to get wrong, leading to near-undetectable errors in your code at some future date.
Worse, it's almost impossible to retrofit into large existing code. Even turning on strict aliasing in the compiler might cause hideous problems. The compiler can't always warn you about these.
The 'holy grail' is still for a compiler that can infer this stuff from the code. With existing C code, that requires at least WPO and some very deep reasoning about pointer usage. The other approach is a different programming language that makes data/variable access control more explicit.
I agree that retrofitting existing codebase is a pain.
Turning on strict aliasing is a mixed bag, as programs that break when it's on are explicitly breaking the C spec. But one must make concessions to support badly written legacy code. The official way to handle aliasing across types is to use unions.
In my opinion, the default behaviour in C should have been what restrict does, with a keyword to allow aliasing and yet another to allow aliasing across types, because unions are a bit tedious to use. As an example: float *aliased foo. And warnings when you cast a normal pointer into an aliased one, or cast across types if the pointer is not aliased across types.
sillysaurus3 said "a clever programmer will always be able to boost performance more than a compiler." That's a much, much stronger claim than the mostly unobjectionable "a clever programmer can still typically boost performance more than a compiler (especially when they have the ability to look at the output of that compiler with various optimization settings)".
> Isn't that what LLVM tries to solve dynamically?
No. And in fact, even writing directly in LLVM IR, there are certain kinds of code (high-speed interpreters like LuaJIT) that cannot even be expressed.
LLVM is awesome, but it's a language (LLVM IR) that does not support the full range of what can be computed efficiently with a particular CPU.
The reason you're wrong is because the semantics of C restrict what compilers can do. Humans know the actual semantics they require, and thus, when writing assembly language, can retain only those semantics, dropping everything else. Thus, "ingenuity" as you put it is a real advantage that humans actually possess over what a compiler can do—even if we gave that compiler an infinite amount of time.
This is quite unlike chess, where the rules are known up front, and are the same for both the computer and the human.
The reason you're wrong is that I wasn't talking about C :)
Yes, for any current programming language, humans do have some advantage by being able to utilize both low- and high-level semantic information that simply cannot be communicated to the compiler. The trump card, I believe, is coming up with new languages where this is not the case.
The worst gift #ifdef gives to the world is processor specific code paths.
You should really be using conditionals in your make file to select source code files.
It is easier to test, easier to debug and you can have a portable fallback.
Plan9 uses this method, which is why porting it to a new architecture is orders of magnitude simpler than for other software platforms.
There is no #ifdef hell to work your way through.
For instance, plan9's libc: http://plan9.bell-labs.com/sources/plan9/sys/src/libc/ You can choose which functions you would like to hand-code and test them against portable C versions with the flick of an environment variable.
Actually you can do the following (the example is a hypothetical memory allocator implemented at least partially in asm; a rough sketch follows the list):
1) create a "meta header file" malloc.h, containing the function prototypes.
2) create a "meta source file" malloc.cpp, with #ifdef switches for e.g. x86, x86_64, arm, etc, and inside the ifdefs some #include "malloc_x86.cpp"
3) inside the malloc_x.cpp source files, implement your architecture-specific asm code.
4) if you're after portable code, just do a #else #include "malloc_c.cpp" with a pure-c/cpp implementation. This way everyone can use the malloc and those with matching CPUs can benefit from acceleration.
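A rough sketch of what the "meta source file" from step 2 might look like (the file names are the hypothetical ones above; the exact predefined macros vary by compiler):

    /* malloc.cpp -- dispatch to an architecture-specific implementation,
       with a portable fallback. */
    #include "malloc.h"

    #if defined(__x86_64__) || defined(_M_X64)
    #   include "malloc_x86_64.cpp"
    #elif defined(__i386__) || defined(_M_IX86)
    #   include "malloc_x86.cpp"
    #elif defined(__arm__) || defined(_M_ARM)
    #   include "malloc_arm.cpp"
    #else
    #   include "malloc_c.cpp"   /* pure C/C++ fallback */
    #endif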
You should really be using conditionals in your make file to select source code files.
If you start moving your programming logic (even compile-time logic) into any kind of makefile, you make life much harder for gamedevs, because it becomes harder to port your code to Windows. If you care about open source, then this is a less inclusive way to write code. You'll be excluding a segment of your programming community (gamedevs) because they're forced by the nature of the industry to compile their code on Windows.
That's changing, but it's changing slowly. In the meantime, some people don't care about that, but it would be nice to get some sympathy when it's so easy to write the code in a way that can include more groups of programmers, which is what open source is all about.
I doubt plan9 uses cmake, though one could presumably use cmake on plan9. The general point they were making was "use your tooling to build different files, rather than different versions of the same file." Nit-picking at particular tools misses the point. Building it that way means you just need to pick a compatible set of source files and build them. Writing your code with a bunch of #ifdefs set up with autotools is going to be harder to get working on Windows, not easier.
> (b) you kill your portability by writing inline assembly. If you expect your code to run on Windows (and gamedevs expect their code to run on Windows, though this may be changing) you'll need to maintain two parallel, #ifdef'd hand written bits of assembly that do exactly the same thing: one for GCC, and one for MSVC. It's a Royal Pain.
This is why, when I write assembly, I separate out entire functions, use an external assembler and link them into the final executable. You still have to worry about calling conventions, but in general your code is much easier to maintain than inline assembly.
Also, at least for the x86 and gcc, the inline assembly syntax is horrific, and I can't see how anyone can write software in it and still keep their sanity. You really want to use Intel syntax.
The disadvantage of writing externally-linked assembly code is that in general, x86 assemblers are crap. You don't even get register allocation, so when I program, I have those pieces of paper around with register allocation tables (1970s-style).
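For reference, the separate-function approach mentioned above looks roughly like this on x86-64 Linux (a hedged sketch with made-up names; assemble with nasm -f elf64 and link the object file in with the rest):

    ; add2.asm -- System V AMD64 calling convention:
    ; first two integer args in rdi/rsi, result in rax.
    global my_add
    section .text
    my_add:
        lea rax, [rdi + rsi]
        ret

    /* caller.c -- the C side only needs a prototype */
    extern long my_add(long a, long b);   /* implemented in add2.asm */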
> This is why, when I write assembly, I separate out entire functions, use an external assembler and link them into the final executable. You still have to worry about calling conventions, but in general your code is much easier to maintain than inline assembly.
This will only yield performance benefits if the functions you write in assembly are large enough to amortize the call overhead. Having a non-inlineable function call to a few instructions of assembly may yield worse performance than a few lines of C code that can be optimized. This is especially true if you write 32-bit code, where the ABI requires you to pass parameters on the stack.
> Also, at least for the x86 and gcc, the inline assembly syntax is horrific, and I can't see how anyone can write software in it and still keep their sanity. You really want to use Intel syntax.
This is just a matter of getting used to it. It's just an assembler syntax; get over it.
You're going to have to learn to read AT&T syntax assembler anyway when looking at disassembly in the debugger or inspecting object files with binutils.
I, too, learned assembly programming with the Intel syntax and I still kinda prefer it, but for the occasional few instructions of ASM code I need here and there, it's just easier to use inline asm.
> You're going to have to learn to read AT&T syntax assembler anyway when looking at disassembly in the debugger
Not anymore. I use `set disassembly-flavor intel` in my .gdbinit. Modern LLVM binutils clones also allow disassembly into Intel syntax with `-x86-asm-syntax=intel`.
Modern versions of GAS can also use the `.intel_syntax` directive, which carries over into GCC inline assembly.
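A small sketch of that directive in use inside GCC-style inline asm (the switch back to AT&T at the end matters, or the rest of the compiler's own output won't assemble; GCC also accepts -masm=intel to do this globally):

    #include <stdint.h>

    /* Read the time-stamp counter using Intel syntax inside inline asm. */
    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile(
            ".intel_syntax noprefix\n\t"
            "rdtsc\n\t"
            ".att_syntax prefix"
            : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }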
Because it sure feels backwards to be doing that. And the guys next door who program Texas Instruments DSPs roll on the floor laughing -- they write linear assembly for their VLIW chips and the assembler composes the VLIW instructions, takes care of register allocations, and even reorders branch instructions as needed. I feel like using a chisel next to guys with jackhammers.
Writing SSE by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem [..] a clever programmer will always be able to boost performance more than a compiler.
Right now, but I think future proofing is another issue to keep in mind alongside portability. CPUs in, say, 5 years are likely to have alternative and more powerful ways of dealing with certain things, and a good optimizing compiler should (optimistically!) be able to compile older code to run even more quickly on newer CPUs than hand coded assembly might allow.
I'm not sure how much impact this issue has in something like the Linux kernel, say, but I could imagine if significant parts of the kernel were written in i386-level x86, it would be a big issue now and demand much rewriting compared to C. (But I could be wrong as I'm not a kernel hacker, and would appreciate expert insight here ;-))
I think this is a really good point. It's worth adding that this has been much less of an issue in the past than it will be in the future, as until recently you could always count on increased clock speed. By not being architecture-flexible, your code is basically stuck with today's performance.
That said, for some algorithms there's a decent chance that "today's performance" is about what we can expect 20 years from now, so maybe hand-optimization is beneficial.
> Writing SSE assembly (not intrinsics) by hand can sometimes give you over a 10x speedup in an inner loop of a computationally expensive problem, like CPU bone animation of vertices.
My experience with assembly code vs. intrinsics is just the opposite of yours. What I got out of some intrinsics-heavy C code from GCC and Clang was way better than I could ever write.
Secondly, using inlineable functions, I could get interprocedural optimization from the compiler. I only needed to write the low level primitives (4x4 matrix multiply, matrix inverse, etc) with intrinsics and the high level code would be inlined and optimized. Adding assembly code to the mix will inhibit compiler optimization.
Additionally, the code I wrote could be retargeted for ARM NEON just by switching compilers. I was using mostly the compilers' vector extensions instead of CPU-specific intrinsics (ie. __builtin_shuffle, not _mm_shuffle_ps).
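For illustration, that style looks something like this (a sketch using GCC's vector extensions; the function names are made up, and the shuffle typically lowers to shufps on x86 or a lane duplicate on NEON):

    /* GCC vector extensions: the same source compiles to SSE or NEON. */
    typedef float v4sf __attribute__((vector_size(16)));
    typedef int   v4si __attribute__((vector_size(16)));

    static inline v4sf broadcast_x(v4sf v)
    {
        const v4si lanes = {0, 0, 0, 0};      /* replicate lane 0 */
        return __builtin_shuffle(v, lanes);
    }

    static inline v4sf madd(v4sf a, v4sf b, v4sf c)
    {
        return a * b + c;                     /* element-wise, vectorized */
    }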
> And while it's convenient to use intrinsics to generate SSE (intrinsics look like ordinary C functions and save you from having to write actual assembly), it's impossible for them to always be as efficient as a clever programmer. I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is an NP-complete problem. So a clever programmer will always be able to boost performance more than a compiler.
Looking at the assembly code generated by GCC and Clang, I can see that the compiler does near optimal instruction scheduling and register allocation. On the other hand, the compiler was rather poor in doing instruction selection. I needed to pay a little attention to the code I wrote to avoid GCC spilling registers to the stack, Clang was a bit better.
So in my experience, a clever programmer writes poorer assembly code than a clever programmer using a clever compiler and intrinsics. You might need to look at the emitted code from the compiler and do some fine adjustments, so assembly skills are still needed.
I bet there are still cases where you can beat the compiler in an isolated problem, but when it comes to maintainability, re-targeting and the amount of effort needed to write the code, writing assembly code isn't worth it most of the time.
Well, of course intrinsics perform just as well as inline assembly when you're doing a single matrix multiply or a single matrix inverse. You won't see any benefit unless you're doing high-throughput computation like thousands of consecutive multiplies. That's when the clever programmer will wipe the floor with your clever compiler. (Well, probably only by a factor of 2x. But still, 2x is better than not-2x in the exceptionally rare case when performance is critical.)
I wasn't arguing that people should generally do this. The tradeoffs are rarely worth it. What I was saying is that humans can be more capable than compiler magic, and when it matters, it can really matter. Sometimes by up to 600x, as the other commenter demonstrated.
> Well, of course intrinsics perform just as well as inline assembly when you're doing a single matrix multiply or a single matrix inverse. You won't see any benefit unless you're doing high-throughput computation like thousands of consecutive multiplies. That's when the clever programmer will wipe the floor with your clever compiler.
This was exactly what I was doing - high throughput inner loop code with lots of matrix multiplies, etc. I wrote the primitive operations (multiply, inverse, etc) with intrinsics but my high level code was readable and maintainable C code. The compiler-emitted assembly code was at least as good as I could have written by hand, pretty close to the speed of light (ie. pretty close to CPU peak flops and/or memory bandwidth, whichever was the relevant bottleneck).
It takes an exceptional programmer and a lot of time to "wipe the floor with the compiler" with hand written assembly code. The compiler's ability to come up with near-optimal instruction scheduling and the ability to quickly change register allocation is pretty hard to match - you can do it if you spend enough time with the Intel manuals but that's not time well spent.
By analyzing your compiler output and fine-tuning your intrinsics code, you should be able to get pretty close to the peak performance with less time spent than hand writing the asm code. I stand by my point - a clever compiler and a clever programmer together can get a better job done than either of those individually.
This, however, requires a programmer with assembly skills. It's still valuable to be able to write and especially read assembly code.
I wrote the primitive operations (multiply, inverse, etc) with intrinsics but my high level code was readable and maintainable C code.
What I'm saying is that you have to write an algorithm in assembly which has an inner loop. You can't have the body of your inner loop in assembly, but the looping mechanism in C. This is because you can pull out parts of the algorithm from the inside of the loop to outside of it, or you can prevent the compiler from constantly re-loading from memory into SSE registers inside the inner loop, etc.
In other words, what I think you're saying is:
    for (...) {
        /* inline assembly or intrinsics */
    }
What I'm saying is:
    /* start inline assembly
       preload anything that can be preloaded into SSE registers
       loop:
           achieve parts of the overall goal
           write any finished operations back to main memory
           prefetch N cachelines ahead
       end inline assembly */
It may seem like a small difference, but in the former case, you're still doing a "single matrix multiply" (you're doing them one at a time). I think that might be why you're not seeing much of a difference between intrinsics vs inline assembly.
Matrix multiply isn't such a good example... there are better cases like texture lookups in a software rasterizer.
Also, the prefetching step is vital. It may be you're not seeing performance gains due to lack of prefetching, which is orthogonal to the debate of SSE intrinsics vs SSE inline assembly. Something like VTune can help reveal precisely why your program is going slow. There are many pitfalls, and it's easy to accidentally optimize for the wrong thing and then trick yourself into believing that particular optimization never helps in any circumstance.
No, you didn't understand what I was doing or why I was getting good results.
My code was something like:
__builtin_prefetch(first cache line);
    for (a lot of 3d objects) {
        __builtin_prefetch(next cache line);
        vec4 q = read_quaternion(), p = read_position();
        mat4 m = matrix_product(translate(p), quat_to_mat(q));
        mat4 n = inverse_transpose(m);
        mat4 mvp = matrix_product(camera_matrix, m);
        // lots of more matrix-vector-quaternion math here
        stream_store(m); stream_store(n); ....
    }
In other words, my "high level" code was readable C code, only the primitives (matrix_product, etc) were intrinsics code.
What the compiler did was inline all the primitive ops, keep all the values in registers at all times (no loads, stores, or register spilling in the inner loop), and finally reorganize the instructions to get near-optimal scheduling.
In some simpler programs, I got even more benefit from the compiler doing some loop unrolling to keep loads/stores balanced with ALU ops to get very effective latency hiding.
Writing that whole loop in assembler would have yielded next to no improvement, but would have sacrificed readability, maintainability and portability.
> You can't have the body of your inner loop in assembly, but the looping mechanism in C.
You're right in that you can't mix inline assembler and C and get the compiler to optimize it correctly, but using intrinsics you can get the best of both worlds.
This is pretty much exactly what I had, except that my "inner loop" primitives were C code+intrinsics that looked like assembly code. Because they were not assembler code, the compiler was able to optimize the whole loop and the end result is something similar to what you show as an example of good performing asm code.
If you do end up writing assembler, you have to be sure that it is worth sacrificing the compiler optimizations that would take place otherwise.
And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.
You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!
> And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.
Yeah, at some point the compiler can't do any more magic and will start spilling. But it's not very often when a programmer intervention is required.
Quite often you can get the effect you need by looking at the compiler-emitted code, see where the spilling or other unwanted effects happen and do small tweaks to your C code. This is a bit annoying but it still beats hand writing asm code when it comes to time investment (it is not necessarily as fun, though).
> You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!
Yes, compilers have improved and will keep on improving. I was blown away by the quality of code I saw coming from GCC and Clang, in particular about how good the instruction scheduling was.
In any case - writing C code that looks like assembly code (ie. plain and simple) gives very good results and doesn't require you to sacrifice compiler optimizations like using real assembly code does. Read the output assembly code and revisit the C code if required; this way you should be able to get near-optimal code with less time investment.
Writing Assembler code is still really fun, though. Unfortunately it's not a good investment when it comes to achieving your goals in time.
I work on solving NP-complete problems by computer. The idea that humans are better at NP-complete problems than computers is laughable. As one-week programming tasks, I set 2nd-year students the job of writing programs that solve NP-complete instances no person will ever be able to solve by hand.
Sure, in special cases; and it's an important lesson for students to learn: at small sizes, problems that are theoretically intractable in the general case are perfectly attackable by brute-force enumeration, and that may be the right way to do it under the circumstances.
>So a clever programmer will always be able to boost performance more than a compiler.
I believe the term you're looking for there is "a sufficiently-clever programmer" :)
Arguably one is more likely to exist than a sufficiently-clever compiler, but it's exactly the same claim. Even the best humans will have sub-optimal edge cases just like compilers. Given perfect information, sure, I'll agree humans can do better. But it's probably impossible to have (much less make use of) perfect information.
He does mention SSE: "Replacing the _mm_andnot_si128 intrinsic with an actual and-not on vectors caused gcc to use other, slower instructions instead of the vmovq to move the result out of the SSE registers for reasons I don't particularly want to track down"
I'm miscommunicating, sorry. Let me try again: The author talks about SSE intrinsics, which masquerade like regular C functions. But this is a post saying "don't use assembly," and writing the actual SSE assembly can give higher performance than using SSE intrinsics.
I meant that I was disappointed he didn't talk about SSE assembly directly.
Don't follow this editorialized title's advice if you'd actually like to become an expert [1].
The actual article is quite good and makes a very valid point. Don't jump to machine code whenever things get a bit slow; chances are you're introducing more problems than you're fixing, and it's very difficult to outdo the compiler. Generally you should start by looking for macro-optimizations instead. But to add my own advice, it's still an incredibly worthwhile thing to learn, even if you don't expect to ever run into the types of scenarios which will benefit from optimization at this level.
Why? Programming languages do everything they can to abstract away the machine, and abstractions do leak [1]. Even mostly air-tight abstractions over machine code [2]. Because of this, learning the entire "stack" makes debugging higher-level code much easier. When you can understand what's happening with every level right down to the metal, what were once ridiculous off-the-wall problems become recognizable and much easier to reason about.
So do use machine code if you'd like to become an expert. Just use it on your own time, and don't use it under duress.
1: Presently reads "Don't use assembly unless you're an expert."
A trend which is both tempting and annoying. I have a feeling that this is causing some people to develop an analogue to "advertisement blindness" where people tend to ignore titles which fit into certain models. I think the filter gets tripped for me any time people use overly emotional modifier words, the word "this," or the phrase "what happens next." This title didn't fit any of those, however.
and then runs the function 1000 more times, measuring each run independently and reporting the average runtime at the end.
When it comes to measuring the performance of code like this, averaging run times is not the way to do it.
To remove the noise caused by context switching, just run the code many times and report the single fastest run you get. This should be the value closest to running the code on an OS without preemption (i.e. you want to measure how fast the code runs on the bare metal without interruption).
Even Facebook's Folly library [0] changed their benchmarking code from using statistics to just providing the fastest run. As the comments say:
// Current state of the art: get the minimum. After some
// experimentation, it seems taking the minimum is the best.
return *min_element(begin, end);
This is explained in the docs [1]:
Benchmark timings are not a regular random variable that fluctuates around an average. Instead, the real time we're looking for is one to which there's a variety of additive noise (i.e. there is no noise that could actually shorten the benchmark time below its real value). In theory, taking an infinite amount of samples and keeping the minimum is the actual time that needs measuring.
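In C, the whole idea fits in a few lines (a minimal sketch; the clock source and iteration count are arbitrary):

    #include <time.h>

    /* Run fn() many times and report the fastest run, on the theory that
       all measurement noise is additive and can only make a run slower. */
    static double bench_min_ns(void (*fn)(void), int runs)
    {
        double best = 1e30;
        for (int i = 0; i < runs; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            fn();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            if (ns < best)
                best = ns;
        }
        return best;
    }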
Sort of a tangent, but you know what I'd love to see? A compiler that could emit comments with the code it generates. They wouldn't have to be brilliant, just whenever it decides to optimize some bit it emits a generic "why". It'd be even cooler if it was a website that hooked up to something like GCC, and you could feed it some C code and get a hypothetical "what would the compiler do here" sort of deal.
I realize that's kind of crazy, but it'd be an awesome teaching and debugging tool. I don't look at disassembly a lot, but when I do it's not always particularly obvious why the compiler generated a thing in a certain way.
I can make things go more than 10 times faster in assembly. My main job is as a manager/entrepreneur, but I can read and write assembly as a result of my experience, and I help guide other people easily.
In the real world 10 times faster is nothing. You should spend the time understanding the problem in a mathematical way, and VERY IMPORTANT, documenting your work using images, text, voice and video.
This way you can make things go 100, 1,000, or 10,000 times faster, as most algorithms can be indexed or ordered in some way that makes them extremely fast, e.g. doing log(n) work instead of n squared, cubed, or raised to the fourth or fifth power (when you manage several dimensions, like 3D plus time, video analysis, or medical tomography).
More important than that, 10 years from now it will continue working on new devices and OSes, and it will be something that supports the company instead of a debt burden because the original developer is no longer around (or because you don't have the slightest idea what you did so far in the past and didn't document it).
The main problem is that people are not self-aware enough to realize they forget things. And your brilliant idea that makes everything go 3 times faster is nuts if it makes everything way harder to understand, or if it could be forgotten even by you.
That's not good advice. Whether you need to use assembly depends on the particular situation at hand.
Here's a practical example: as a result of redesigning the algorithm to use fixed-point and implementing it in assembly, I got it to run 600x faster than the initial C version. Big O complexity was the same, the difference was in the constant factor. But the constant factor matters! In my case, it meant that you could get your computation done in half a day instead of a year.
Yes, it took me 3 weeks to get the algorithm implemented, instead of a single day, but even so — it was definitely worth it. And in many cases even a 3-fold improvement in speed is important, if you have long-running calculations.
Not knowing too much about processor architecture, I don't understand how fixed point can be much faster, since floating-point ops are implemented in hardware. I presume you used integer operations on your fixed-point values, but could you explain a bit why it ends up being much faster than floating point?
It all depends on how precise your fixed point values need to be. If you can squeeze them into 8 bits (I could), you can use SSE 128-bit registers to operate on 16 values at a time. It gets even better with AVX, although that wasn't available to me at the time.
So the speedup is not just from going to fixed point, but from managing to use the vector instructions.
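For a sense of scale, a single SSE2 instruction handles 16 packed 8-bit values at once (a rough sketch with made-up names; the actual fixed-point scaling is application-specific):

    #include <emmintrin.h>   /* SSE2 */

    /* Add two arrays of 8-bit fixed-point values, 16 lanes per instruction.
       Assumes n is a multiple of 16. */
    void add_q8(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            __m128i vs = _mm_adds_epu8(va, vb);   /* 16 saturating adds */
            _mm_storeu_si128((__m128i *)(dst + i), vs);
        }
    }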
You're probably right that in most cases you should not write assembly code to try to make some code run faster. However, it is a very valuable skill to know how to read assembly code and spot the inefficiencies.
For a low level hacker, it is a very valuable skill to be able to write and especially read assembler code. I need that skill regularly in my day job. And I would have not acquired that skill if I had not written some assembly code. And besides, writing assembly code is fun!
Sometimes you need that 10x speed improvement to be able to do what you need to. To get your game running smoothly or your video playback work. You need to know when and how to optimize for performance.
> More important than that, 10 years from now it will continue working on new devices and OSes
They said that about x86... 20 years ago. I have applications written in Asm that still work on the latest CPUs today. The same binaries, not even needing recompilation, now run several orders of magnitude faster. I still see a lot of potential in extracting performance from x86 and although I hesitate slightly to make this prediction, I think it'll be the dominant architecture for at least 10 more years.
> I think it'll be the dominant architecture for at least 10 more years.
That needs to be qualified as "for desktops/servers" or similar. x86 hasn't been the dominant architecture for at least a decade, if ever, in terms of units shipped. It's being outsold in number of units by ARM at a 10:1 ratio, and MIPS and PPCs are shipped in higher volumes as well, or at least were as of a year or two ago. Possibly even 6502s and various microcontrollers, though getting numbers is harder.
Keep in mind how many CPUs are around you. Our servers have an ARM core per hard drive, and several of our RAID controllers have multiple PPC cores, for example. We have some servers with dozens of non-x86 CPUs per x86 CPU. Even some SD cards have ARM cores on them.
Now consider your car, microwave, washing machine, dish washer, tv, set-top box, phones, camera, music player, digital radio. A lot of stuff that was semi-mechanical or employed discrete logic a few years back now have CPUs that are ridiculous overkill, but used because they're so cheap there's no reason not to.
x86 is a diminishing niche if you look at electronics as a whole.
There are exceptions... things like color space conversions (though technically the last time I did that was in Cg) where that is about all you can get.
You forgot the whole quote. "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Knuth is saying that in most cases optimizing prematurely is not worth it, which is right, but just as critical is the part saying that there are parts where pays off to consider optimization early. Writing your logging function in assembly likely fits into the 97%. Writing your N-queens solver, if you're aiming for all-out speed, likely fits into the 3%.
Of course not thinking about optimization at all will lead you down a path where your software is unoptimizable, or at least very difficult to optimize. Ask anyone who's had to optimize a video game that manages all objects in a scene graph, where reorganizing a cache hostile tangled web of pointers to base classes is a herculean feat.
I upvoted this post (not enough do, IMO) when it was on the "new" page, and I'm glad it made it to the front page; many good posts don't.
A persistent peeve of mine is the vapid commentary that invariably shows up in programming-language-related discussions:
"what's the point of this language, if one needs speed, it will be written in C".
Some would replace "C" with "assembly" in that sentence.
The point is that if a piece of code is doing something non-trivial, it is extremely hard to have confidence in the correctness of handwritten code that is written entirely at a low level. Smart compilers apply drastic, and sometimes unintuitive, transformations to optimize code. To do the same manually, using low levels of abstraction, would require a tour de force to pull off correctly. This simply is not going to happen often. It's precisely for speed that we need a high-level, yet optimization-friendly, language, so that we can delegate the job of optimizing the code to the compiler. Compilers are much better at applying large correctness-preserving transformations than we humans are. The "write it in assembly" approach works in the small, not in the large.
Whenever I write anything non-trivial in assembly, I always go through a series of progressively lower-level implementations in C. The last ones emulate SSE instructions with for loops and correspond very closely to the assembly code. It isn't possible for me to write a serious chunk of assembly 'just like that'.
Please make a video of the various stages of you doing this! Or a tutorial pointing out how the various progressions look. That's awesome, and it's extremely difficult for others to learn that process except through years of trial and error.
I did not bookmark it, but there was this fun blog post about writing a number-crunching homework assignment in Scheme (much against contemporary wisdom). It ended up beating other implementations in C. The trick was to write it in CPS and keep hand-transforming it iteratively till it was efficient assembly in a different syntax. I would still argue that you are really writing in a high-level language/abstraction and the rest is mostly tedious transliteration, with special attention to keeping the FPUs busy, efficient memory loading, and register allocation.
No big deal. Just become an expert then. It is not that hard. Just have a look at Ian Piumarta's work [1,2]. Future systems will probably facilitate programming on every level from the bare metal to the highest abstraction. I.e. no strict boundaries between the implementation of the application and the implementation of the compiler/interpreter.
I feel like people are assuming that the only reason you write assembly is for performance. I try to stay as far away from it as possible, but sometimes there is no other way (barring finding a library that itself has written assembly). Some examples: finding a backtrace, getting the CPUID, or calling a function of arbitrary number of arguments (for some reflection like capabilities: imagine a function like "void foo(void (* func)(), ArrayList * args, char * arg_types)" and getting foo to call func with args passed as arguments, not an array). If you can, you should try and find a library to do those tasks, but the libraries themselves have no choice but to write assembly as you just cannot do those things in plain C.
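As an example of the "no choice but asm" category, CPUID boils down to something like this with GCC-style inline assembly (a sketch; GCC and Clang also ship a <cpuid.h> wrapper around the same instruction):

    #include <stdint.h>
    #include <stdio.h>

    /* Query CPUID leaf `leaf` on x86/x86-64. */
    static void cpuid(uint32_t leaf, uint32_t out[4])
    {
        __asm__ volatile("cpuid"
                         : "=a"(out[0]), "=b"(out[1]), "=c"(out[2]), "=d"(out[3])
                         : "a"(leaf), "c"(0));
    }

    int main(void)
    {
        uint32_t r[4];
        cpuid(0, r);
        /* The vendor string is the concatenation of ebx, edx, ecx. */
        printf("%.4s%.4s%.4s\n", (char *)&r[1], (char *)&r[3], (char *)&r[2]);
        return 0;
    }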
It's not about the application but about a man's need to be an animal again. Frankly, writing Ember.js apps in CoffeeScript, which has roughly 11 layers of abstraction, feels like a metrosexual man eating processed food with organic marketing on IKEA furniture (CoffeeScript -> Compiled JavaScript -> Ember.js -> jQuery -> JavaScript -> Chrome V8 Engine -> WebKit Layout Engine -> GTK+/Cocoa/Chrome Windows View -> Linux/OS X/Win32 display API -> C/C++ -> Assembly).
There was a time when a man could eat raw meat (machine code), but he can start again today by eating bloody beef.
Unless the man grows, feeds and slaughters the cow himself, that "bloody beef" is just as removed from animalistic experience as the organic food.
It's ridiculous to pretend that eating hormone-accelerated mass-processed meat is closer to nature. Sure, it's oozing blood, but that blood is tainted by an enormous industrial profit machine.
Only if he made the bow and arrows himself from sticks, fibre, and shale he found in the woods, and if he also purified/doped/etched his own silicon wafers.
Also the CPU re-orders everything. Also there are 4 CPUs and 8 threads. The threads run at different speeds. Also division will take a variable amount of time.
This has been true for a long time, even when the x86 architecture was much simpler. I worked on an SDK for an 80386-based platform back in '97, and it was unusual for me to do better than the Watcom C compiler.
Well the previous discussion was centered around the same assembly implementation as this one. The only difference is, in the previous discussion the fastest C/C++ implementation was 1.1x slower than the assembly. In this current discussion, the author managed to write a C++ implementation that runs faster than the assembly version.
Okay, but as far as I can tell he changed the algorithm (recursive C++, pop/push to a global array, etc.)... doesn't that kind of negate the argument unless you change it in the assembly version as well?
True enough. That reinforces a couple of other points that tend to become apparent from these sorts of comparisons and benchmarks:
1) High level languages are easier to refactor and maintain than lower level languages (changing the algorithm in the assembly's a big job).
2) For performance, the algorithm used is almost always more important than the language you choose to implement it in.
Both of those points back up the author's assertion, it's a shame he didn't specifically discuss them.
I used to hand-code assembly (or generate it, eg "compiled bitmaps") back in the 8086 days. For many situations there were easy gains to be had that couldn't be achieved so easily in high level languages. These days I wouldn't dream of attempting it other than perhaps for vector code in an inner loop. I'd far rather spend the time profiling and optimising the algorithms and data structures because that's where the big gains are going to be. To go ahead and implement something in assembly when the algorithm clearly isn't yet optimal is perhaps fun but also kinda insane.
The author of the (current title of) the post is presumably hinging the argument on the definition of 'expert' described in this video: http://www.youtube.com/watch?v=BKorP55Aqvg
Percentages are for suckers. If you're not optimizing your code against actual use cases, then you're just wasting your time. I'd say 90 percent of the code anyone will ever write would not provide any benefit if it "ran faster".
No one will ever know or care that a 5ms operation only takes 2ms because of how "savvy" you are. No one knows or cares what language you used. It's just a program.
If you can produce a working program in a high-level language but chose to use a low-level one... why?
If your operation is relaying a 20ms audio packet from one side of a call to another, 50µs to 20µs is the difference between handling 200 concurrent calls and handling 500 (per CPU, minus non-linear scaling with increasing CPUs and other overhead). Unlike the cost of hardware, which every customer would have to buy separately, the development cost can be spread out over many customers.
If $50,000 of developer effort can save 50 customers $1000 on hardware, you've already broken even in a sense.
This is nothing new; letting an optimizing compiler do the heavy lifting has been the general-case recommendation for a long time now.
But there are still circumstances where it's necessary to write assembly, particularly in things like real-time programming. Doesn't matter how good a job the compiler did on the overall program, if you can afford 16 cycles on an inner loop and the compiler spits out 24 then you've gotta hand-tune it.
I wish there was more ability to specify the paths I really care about being low latency (even at the expense of throughput) and the like. As it is, it's rare that I drop below C, but more common that my C winds up restructured for what it happens to do to the output code.
Even if you are an expert, consider whether the next person to maintain your code will be an expert. Assume they aren't and that they know where you live.
Something the author doesn't mention is how Intel tunes their architecture to be faster on code output by popular compilers. This means it is likely that N generations of x86 from now, the C code will run even faster relative to the hand-written code.