
No, you didn't understand what I was doing or why I was getting good results.

My code was something like:

    __builtin_prefetch(first cache line);
    for(a lot of 3d objects) {
        __builtin_prefetch(next cache line);
        vec4 q = read_quaternion(), p = read_position();
        mat4 m = matrix_product(translate(p), quat_to_mat(q));
        mat4 n = inverse_transpose(m);
        mat4 mvp = matrix_product(camera_matrix, m);  // renamed: p is already the position
        // lots more matrix-vector-quaternion math here
        stream_store(m); stream_store(n); ....
    }

In other words, my "high level" code was readable C code; only the primitives (matrix_product, etc.) were intrinsics code.
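
A compilable sketch of that loop shape, assuming an x86 target with SSE and GCC or Clang (for `__builtin_prefetch`). `vec4_scale` and `scale_all` are made-up names standing in for the real matrix and quaternion primitives, and the prefetch distance is a guess, not a tuned value:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Hypothetical primitive: scale a 4-float vector.
 * Stands in for the matrix/quaternion helpers in the loop above. */
static inline __m128 vec4_scale(__m128 v, float s)
{
    return _mm_mul_ps(v, _mm_set1_ps(s));
}

/* Scale `count` 4-float vectors from `in` into `out`.
 * Assumes both pointers are 16-byte aligned, which _mm_load_ps and
 * _mm_stream_ps require. */
void scale_all(const float *in, float *out, size_t count, float s)
{
    __builtin_prefetch(in);                         /* first cache line */
    for (size_t i = 0; i < count; ++i) {
        if (i + 4 < count)                          /* stay inside the array */
            __builtin_prefetch(in + 4 * (i + 4));   /* 64 bytes ahead; only a hint */
        __m128 v = _mm_load_ps(in + 4 * i);
        _mm_stream_ps(out + 4 * i, vec4_scale(v, s));   /* non-temporal store */
    }
    _mm_sfence();   /* order the streaming stores before returning */
}
```

Calling `scale_all(in, out, 2, 2.0f)` on two aligned vectors doubles every component. The loop body itself stays ordinary C; only the leaf operations use intrinsics.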

What the compiler did was inline all the primitive ops, keep all the values in registers at all times (no loads, stores or register spilling in the inner loop) and finally reorder the instructions to get near-optimal scheduling.
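
To make "primitives" concrete, here is a rough sketch of what such helpers might look like, assuming SSE on x86. The `mat4` type, the column-major layout and all function names are my own illustration, not the original code. Because the helpers are `static inline` C functions rather than asm blocks, the compiler is free to inline them into the loop and keep the combined matrix in registers:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Hypothetical 4x4 matrix type, stored as four column vectors. */
typedef struct { __m128 col[4]; } mat4;

/* Primitive: matrix * vector, as a sum of columns weighted by x, y, z, w. */
static inline __m128 mat4_mul_vec4(const mat4 *m, __m128 v)
{
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  /* broadcast x */
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  /* broadcast y */
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));  /* broadcast z */
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));  /* broadcast w */
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(m->col[0], x), _mm_mul_ps(m->col[1], y)),
        _mm_add_ps(_mm_mul_ps(m->col[2], z), _mm_mul_ps(m->col[3], w)));
}

/* Primitive: matrix * matrix, one transformed column at a time. */
static inline mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 r;
    for (int i = 0; i < 4; ++i)
        r.col[i] = mat4_mul_vec4(a, b->col[i]);
    return r;
}

/* "High level" code: plain C that only calls the primitives. */
void transform_points(const mat4 *camera, const mat4 *model,
                      const float *in, float *out, size_t count)
{
    mat4 mvp = mat4_mul(camera, model);             /* computed once, outside the loop */
    for (size_t i = 0; i < count; ++i) {
        __m128 v = _mm_loadu_ps(in + 4 * i);        /* unaligned load, for simplicity */
        _mm_storeu_ps(out + 4 * i, mat4_mul_vec4(&mvp, v));
    }
}
```

Whether a given compiler actually keeps everything in registers depends on the compiler version and flags, so it is worth checking the generated assembly rather than taking it on faith.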

In some simpler programs, I got even more benefit from the compiler doing some loop unrolling to keep loads/stores balanced with ALU ops to get very effective latency hiding.

Writing that whole loop in assembler would have yielded next to no improvement, and it would have sacrificed readability, maintainability and portability.

> You can't have the body of your inner loop in assembly, but the looping mechanism in C.

You're right in that you can't mix inline assembler and C and get the compiler to optimize it correctly, but using intrinsics you can get the best of both worlds.

This is pretty much exactly what I had, except that my "inner loop" primitives were C code plus intrinsics that looked like assembly code. Because they were not assembler code, the compiler was able to optimize the whole loop, and the end result was something similar to what you show as an example of well-performing asm code.

If you do end up writing assembler, you have to be sure that it is worth sacrificing the compiler optimizations that would take place otherwise.



And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.

You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!


> And when the compiler wants to use more SSE registers than are available, it will register spill. A clever programmer may be able to avoid that, and that's where the performance boost comes from.

Yeah, at some point the compiler can't do any more magic and will start spilling. But it's not very often that programmer intervention is required.

Quite often you can get the effect you need by looking at the compiler-emitted code, seeing where the spilling or other unwanted effects happen and making small tweaks to your C code. This is a bit annoying, but it still beats hand-writing asm code when it comes to time investment (it is not necessarily as fun, though).

> You've convinced me to leave that as an ultra-last-resort though, and just try out intrinsics first. Dumb compilers are probably less of an issue nowadays than in ye olden days of ~5 years ago. Thank you for the thorough explanation!

Yes, compilers have improved and will keep on improving. I was blown away by the quality of code I saw coming from GCC and Clang, in particular by how good the instruction scheduling was.

In any case, writing C code that looks like assembly code (i.e. plain and simple) gives very good results and doesn't require you to sacrifice compiler optimizations the way real assembly code does. Read the generated assembly and revisit the C code if required; this way you should be able to get near-optimal code with less time investment.

Writing assembler code is still really fun, though. Unfortunately it's not a good investment when it comes to achieving your goals in time.


Btw, exDM69 has a project on GitHub written in this style (I'm almost sure he was in fact referring to this in his earlier posts): https://github.com/rikusalminen/threedee-simd



