That was my bible when I was a kid; I spent so much time reading and pondering over it. It really is a great book, and although it's getting a bit outdated, a lot of the concepts and ideas are still very useful.
In my experience (limited to single-threaded code, though), modern C++ compilers (especially MSVC, which generally does a better job than GCC) optimize really well. It's interesting to compare the assembly generated for the same operations by VC6 and VC2008/2010.
In line with what was mentioned below, VC6 used a series of lea instructions to compute the address of the (a*x+b)th byte in a structure, whereas VC2010 just uses a single multiply.
And something I was surprised by: VC6 always used fnstsw ax; test ax, 40h; jz $+16; to compare floating-point values. VC2010 uses jp instead, presumably for a good reason as well.
Anecdote: One time and one time only I witnessed GCC write more "clever" code than MSVC, though I haven't benchmarked this:
switch (val) {
case 1:
case 2:
case 6:
case 15:
    return foo;
default:
    return bar;
}
Both compilers handled val > 15 the normal way, but optimized the listed cases differently: MSVC compiled this into an ordinary two-step jump table:
movzx eax, byte ptr table1[edx] // assuming edx contains val; table1 holds 1 for bar, 0 for foo
jmp table2[eax*4] // table2 holds the addresses of the two branches
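The comment doesn't show what GCC emitted, but a common trick for sparse case labels like these (and a plausible candidate for the "clever" code) is a single bitmask test: all the labels are below 32, so they fit in one 32-bit constant. A hypothetical C rendering, with made-up names:

```c
/* Hypothetical sketch of the bitmask trick: case labels 1, 2, 6, 15
   all fit below bit 32, so one shift and one test can replace the
   two jump tables entirely. */
enum { BAR = 0, FOO = 1 };   /* stand-ins for the foo/bar returns */

int classify(unsigned val) {
    const unsigned mask = (1u << 1) | (1u << 2) | (1u << 6) | (1u << 15);
    if (val < 32 && ((mask >> val) & 1))
        return FOO;
    return BAR;
}
```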
I know that the classic SHL-instead-of-multiplication trick has been outdated for a number of years. My Intel IA-32 manuals confirm that it is now slower than the multiply operation in most cases.
A fast, variable shifter is a reasonably complicated slice of chip. Having a fast multiplier is one of the best improvements chips have been making for the last 20 years. At some point, you'll make the multiplier as fast as the fast shifter, and then you might as well slow the shifter down to save space, as no-one should be using it any more.
I think I'll have to take a look at state-of-the-art implementations of shifters and multipliers. Does anybody know of some resources regarding this topic (papers or reasonably optimized VHDL code)?
http://en.wikipedia.org/wiki/Wallace_tree is the implementation used on the Pentium III and http://en.wikipedia.org/wiki/Dadda_tree is an improvement on that. The Dadda tree was invented in 1965 [1], so I guess the issue is more in having enough transistors and joining them up fast than in designing the multiply algorithm.
Having said that, the following might be some possible references.
Shailendra Jain, Vasantha Erraguntla, Sriram R. Vangal, Yatin Hoskote, Nitin Borkar, Tulasi Mandepudi, Karthik VP, "A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm," 23rd International Conference on VLSI Design, pp. 252-257, 2010. -- http://www.computer.org/portal/web/csdl/doi/10.1109/VLSI.Des...
I would have thought the state-of-the-art is trade secret for Intel and AMD, unless anyone actually depackages their chips to look at the multipliers? Patent applications might be worth looking at, but they are probably incomprehensible.
The latency of an (integer) multiplication is 3 clock cycles, while (integer) addition and (integer) shift each take 1 clock cycle (according to the Intel® 64 and IA-32 Architectures Optimization Reference Manual [1]). So shifting is still faster than multiplication; however, an optimization that saves only 1 or 2 clock cycles doesn't make much sense anymore.
IMVU is proud to be (one of the last?) users of Pixomatic, Michael Abrash and Mike Sartain's commercial software 3D renderer. Even today there's surprising demand for software 3D rendering, especially in the laptop/netbook space: many (potential) users out there have terrible Intel graphics accelerators but fairly fast dual-core systems.