(Disclaimer: I’m not a numerical analyst or hardware expert.) I’m not sure if you can use a dot product unit like the one in your link to make as much of a difference when handling recurrences like “Clenshaw’s algorithm” for evaluating Chebyshev polynomials (the method used in the OP, an analog of Horner’s algorithm), as you can get for taking a big dot product or computing an FFT or whatever. What I expect you really want is to keep your temporary variables in higher precision all the way through the recurrence.
You can probably get some accuracy improvement over separate floating-point addition and multiplication by using FMA instructions on existing standard hardware (and also a slight speed boost), since each fused multiply-add rounds once instead of twice. I haven’t ever carefully tested the accuracy improvement in practice though.
If you are willing to take a hit on performance, you can even save the lower-order bits of your accumulated values by representing each variable as a pair of floats (or doubles), getting almost double the precision on the same hardware (keywords: “compensated arithmetic”, “Kahan summation”).
[0] http://ieeexplore.ieee.org/document/6545890/?arnumber=654589...