That was my bible when I was a kid; I spent so much time reading and pondering over it. It really is a great book, and although it's getting a bit outdated, a lot of the concepts and ideas are still very useful.
In my experience (limited to single-threaded code, though), modern C++ compilers (especially MSVC, which generally does a better job than GCC) optimize really well. It's interesting to compare the assembly generated for the same operations by VC6 and VC2008/2010.
In line with what was mentioned below, VC6 used a series of lea instructions to compute the address of the (a*x+b)th byte in a structure, whereas VC2010 just uses a single multiply.
And something I was surprised by: VC6 always used fnstsw ax; test ax, 40h; jz $+16; to compare floating-point values. VC2010 uses jp instead, presumably for a good reason as well.
Anecdote: One time and one time only I witnessed GCC write more "clever" code than MSVC, though I haven't benchmarked this:
switch (val) {
case 1:
case 2:
case 6:
case 15:
    return foo;
default:
    return bar;
}
Both compilers handled val > 15 the normal way, but optimized the listed cases differently: MSVC compiled this into an ordinary two-step jump table:
movzx eax, byte ptr table1[edx] // assuming edx contains val; table1 holds 1 for bar, 0 for foo
jmp table2[eax*4] // table2 holds the addresses of the two branches
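The comment doesn't show what GCC emitted, but a common trick for sparse case labels like these (and a plausible candidate for the "clever" code) is a single bitmask test: all the labels are below 32, so they fit in one 32-bit constant. A hypothetical C rendering, with made-up names:

```c
/* Hypothetical sketch of the bitmask trick: case labels 1, 2, 6, 15
   all fit below bit 32, so one shift and one test can replace the
   two jump tables entirely. */
enum { BAR = 0, FOO = 1 };   /* stand-ins for the foo/bar returns */

int classify(unsigned val) {
    const unsigned mask = (1u << 1) | (1u << 2) | (1u << 6) | (1u << 15);
    if (val < 32 && ((mask >> val) & 1))
        return FOO;
    return BAR;
}
```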
I know that the classic SHL-instead-of-multiplication trick has been outdated for a number of years. My Intel IA-32 manuals confirm that it is now slower than the multiply operation in most cases.
A fast, variable shifter is a reasonably complicated slice of chip. Having a fast multiplier is one of the best improvements chips have been making for the last 20 years. At some point, you'll make the multiplier as fast as the fast shifter, and then you might as well slow the shifter down to save space, as no-one should be using it any more.
I think I'll have to take a look at state-of-the-art implementations of shifters and multipliers. Does anybody know of some resources regarding this topic (papers or reasonably optimized VHDL code)?
http://en.wikipedia.org/wiki/Wallace_tree is the implementation used on the Pentium III and http://en.wikipedia.org/wiki/Dadda_tree is an improvement on that. The Dadda tree was invented in 1965 [1], so I guess the issue is more in having enough transistors and joining them up fast than in designing the multiply algorithm.
Having said that, the following might be some possible references.
Shailendra Jain, Vasantha Erraguntla, Sriram R. Vangal, Yatin Hoskote, Nitin Borkar, Tulasi Mandepudi, Karthik VP, "A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm," 23rd International Conference on VLSI Design, pp. 252-257, 2010. -- http://www.computer.org/portal/web/csdl/doi/10.1109/VLSI.Des...
I would have thought the state-of-the-art is trade secret for Intel and AMD, unless anyone actually depackages their chips to look at the multipliers? Patent applications might be worth looking at, but they are probably incomprehensible.
The latency of an (integer) multiplication is 3 clock cycles, while (integer) addition and (integer) shift each take 1 clock cycle (according to the Intel® 64 and IA-32 Architectures Optimization Reference Manual [1]). So shifting is still faster than multiplication; however, an optimization that saves only 1 or 2 clock cycles doesn't make much sense anymore.
IMVU is proud to be (one of the last?) users of Pixomatic, Michael Abrash and Mike Sartain's commercial software 3D renderer. Even today there's surprising demand for software 3D rendering, especially in the laptop/netbook space: many (potential) users out there have terrible Intel graphics accelerators but fairly fast dual-core systems.