Ah, my bad. I thought browser support for SVGs was pretty robust at this point. I've gone ahead and rasterized all the images (save for the inline equations), but clicking on the images will still take you to the original SVGs.
I've always found DCTs, and particularly Clenshaw-Curtis quadrature (arguably better than Gauss quadrature!), attractive: you can do an incredible amount of heavy lifting with this corner of approximation theory.
You can do a little better with a minimax polynomial that you can derive with the Remez algorithm. If you have Matlab, you can use the excellent Chebfun package's remez command to do this almost instantly for any well-behaved function.
The best approximation I could find calculated sin(x) and cos(x) for [-pi/4,pi/4] and then combined these to cover all other inputs. These functions need only 4 or 5 terms to get +/-1 ULP accuracy for 32-bit floats.
I thought the harder problem was the argument reduction to enable calculating something like sin(10^9). It turns out that you need a lot of bits of pi to do this accurately. While testing my implementation, I was surprised to learn that the x87 fsin implementation only has 66 bits of pi, so it gives extremely wrong answers for large inputs.
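For concreteness, here is a rough sketch of that reduce-then-select structure (mine, not any particular libm's code). The polynomial bodies are plain Taylor terms standing in for proper minimax coefficients, and the single-constant reduction is only trustworthy for modest |x|; handling something like sin(10^9) correctly needs the multi-word pi (Cody-Waite / Payne-Hanek) machinery alluded to above.
#include <math.h>

/* Placeholder polynomials on [-pi/4, pi/4]: Taylor terms here, where a real
   implementation would use minimax coefficients (4-5 terms for ~1 ulp in f32). */
static double sin_poly(double r) {
    double r2 = r * r;
    return r * (1.0 + r2 * (-1.0/6 + r2 * (1.0/120 + r2 * (-1.0/5040))));
}

static double cos_poly(double r) {
    double r2 = r * r;
    return 1.0 + r2 * (-0.5 + r2 * (1.0/24 + r2 * (-1.0/720)));
}

/* Reduce x to r in [-pi/4, pi/4], then pick sin/cos and the sign by quadrant. */
double my_sin(double x) {
    const double two_over_pi = 0.63661977236758134308;
    const double pi_over_2   = 1.57079632679489661923;
    int k = (int)nearbyint(x * two_over_pi);   /* nearest multiple of pi/2     */
    double r = x - k * pi_over_2;              /* naive single-constant reduce */
    switch (((k % 4) + 4) % 4) {
        case 0:  return  sin_poly(r);
        case 1:  return  cos_poly(r);
        case 2:  return -sin_poly(r);
        default: return -cos_poly(r);
    }
}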
> You can do a little better with a minimax polynomial that you can derive with the Remez algorithm.
Yup! This was something I meant to mention in a footnote. If I could get equivalent [infinite precision] error bounds using a minimax polynomial of a lower degree, I think this would make a difference. Otherwise, even if the theoretical error bounds are lower for a minimax polynomial of the same degree, it wouldn't do anything to combat the noise introduced by rounding errors, assuming you keep everything else the same - and that noise seems to be by far the limiting factor. But that's just conjecture - no way to know for certain without trying it.
I do still like the Chebyshev polynomial approach because it's the only non-iterative approach I know of. It's something that I could implement without the aid of a CAS program pretty trivially, if I wanted to.
> The best approximation I could find calculated sin(x) and cos(x) for [-pi/4,pi/4] and then combined these to cover all other inputs.
Yes, that also seems to be the way to go (at least for precision). I didn't mention it in the article, but even though the code I showed was rust, the actual implementation is in the form of an AST for a plugin system that only supports f32 addition, multiplication, division, and modulo as operations (no branching, f32<->int conversions, etc). My entire motivation was to create a sine function for this system. It's known that the inputs to the function will be in (-pi, pi), so I don't have to perform reduction over any larger range. I don't see a practical way to reduce down to (-pi/4, pi/4) using the operations available to me, but I could be overlooking something.
Under those constraints, Chebyshev approximation will probably outperform minimax. Minimax approximations have non-zero error at the endpoints, so they wouldn't quite hit zero at sin(pi) and your relative error would be horrible.
If you want exactly zero at the endpoints, you could do something like the post does and approximate sin(x) / (x (pi + x) (pi - x)), or similar. You can still do that with the Remez algorithm.
Also, a while ago I realized that you can tweak the Remez algorithm to minimize relative error (rather than absolute error) for strictly-positive functions - it's not dissimilar to how this blog post does it for Chebyshev polynomials, in fact. I should really write a blog post about it, but it's definitely doable.
So combining those two, you should be able to get a good "relative minimax" approximation for sin over (-pi, pi), which might be better than the Chebyshev approximation depending on your goals. Of course, you still need to worry about numerical error, and it looks like a lot of the ideas in the original post on how to deal with that would carry over exactly the same.
For nice enough smooth functions, the minimax approximation can only improve on Chebyshev by at most ~2 bits; this paper (http://www.uta.edu/faculty/rcli/papers/li2004.pdf) shows that the improvement is a fraction of a bit for elementary functions. I'm really not sure it's worth the effort, and incurring more error around every pole, to tamp things down around Chebyshev's worst case. For single floats, I did find it interesting to restrict the search to coefficients that can be represented exactly as single floats. For doubles, rounding of coefficients is probably noise unless you're writing a libm.
(Disclaimer: I’m not a numerical analyst or hardware expert.) I’m not sure if you can use a dot product unit like the one in your link to make as much of a difference when handling recurrences like “Clenshaw’s algorithm” for evaluating Chebyshev polynomials (the method used in the OP, an analog of Horner’s algorithm), as you can get for taking a big dot product or computing an FFT or whatever. What I expect you really want is to keep your temporary variables in higher precision all the way through the recurrence.
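For reference, a minimal sketch of that recurrence (mine, not the OP's exact code), with the running b-values kept in double even though the coefficients and argument are f32:
#include <stdio.h>

/* Clenshaw's recurrence for sum_{k=0}^{n-1} c[k]*T_k(x), x in [-1, 1].
   The temporaries b1/b2 are doubles even though the inputs are floats,
   per the "higher precision temporaries" idea above. */
float cheb_eval(const float *c, int n, float x) {
    double b1 = 0.0, b2 = 0.0, xd = x;
    for (int k = n - 1; k >= 1; k--) {
        double b0 = 2.0 * xd * b1 - b2 + c[k];
        b2 = b1;
        b1 = b0;
    }
    return (float)(xd * b1 - b2 + c[0]);
}

int main(void) {
    const float c[] = {1.0f, 1.0f, 1.0f};   /* T_0 + T_1 + T_2 */
    /* at x = 0.5: 1 + 0.5 + (2*0.25 - 1) = 1.0 */
    printf("%g\n", (double)cheb_eval(c, 3, 0.5f));
    return 0;
}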
You can probably get some improvement vs. separate floating point addition and multiplication with FMA instructions on existing standard hardware (and also a slight speed boost). I haven’t ever carefully tested the accuracy improvement in practice though.
If you are willing to take a hit on performance, you can even save the lower order bits of higher-precision accumulated floating point numbers by using two floats (doubles) per variable, and get almost double the precision with the same hardware (keywords: “compensated arithmetic”, “Kahan summation”).
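A minimal sketch of the compensated-summation idea (Kahan's version), just to show the shape of it:
#include <stdio.h>

int main(void) {
    /* Sum 0.1f a million times, naively and with Kahan compensation.
       The naive float accumulator drifts visibly away from 100000;
       the compensated one stays put. */
    float naive = 0.0f;
    float sum = 0.0f, c = 0.0f;        /* c carries the lost low-order bits  */
    for (int i = 0; i < 1000000; i++) {
        naive += 0.1f;

        float y = 0.1f - c;            /* corrected next term                */
        float t = sum + y;             /* low bits of y are lost in this add */
        c = (t - sum) - y;             /* ...and recovered here              */
        sum = t;
    }
    printf("naive: %.2f  compensated: %.2f\n", naive, sum);
    return 0;
}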
Here is approximating Fibonacci with polynomials. Evidently I wrote this in June, 2014:
#include <math.h>
#include <stdio.h>

/* Degree-7 interpolating polynomial through fib(0)..fib(7),
   evaluated with Horner's rule. */
int fib(int n)
{
    double out = 0.002579370;
    out *= n; out -= 0.065277776;
    out *= n; out += 0.659722220;
    out *= n; out -= 3.381944442;
    out *= n; out += 9.230555548;
    out *= n; out -= 12.552777774;
    out *= n; out += 7.107142857;
    out *= n;
    return floor(out + 0.5);
}

int main(void)
{
    int i;
    for (i = 0; i < 9; i++) {
        printf("fib(%d) = %d\n", i, fib(i));
        switch (i) {
        case 7:
            puts("^ So, is this calculation fib or not?");
            break;
        case 8:
            puts("^ Answer to life, universe & everything responds negatively!");
        }
    }
    return 0;
}
Run:
$ ./fib2
fib(0) = 0
fib(1) = 1
fib(2) = 1
fib(3) = 2
fib(4) = 3
fib(5) = 5
fib(6) = 8
fib(7) = 13
^ So, is this calculation fib or not?
fib(8) = 42
^ Answer to life, universe & everything responds negatively!
Notice that your program could use a fold [1]. Not sure it would be much more efficient, but that would give it a more functional-programming style, which is arguably more elegant.
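For what it's worth, a rough sketch of what that might look like in C, with the coefficients in an array and the Horner step as the folded operation (the fib_fold name is just mine):
#include <math.h>

/* Same polynomial as above, Horner's rule expressed as a single fold-like loop. */
static const double fib_coeffs[] = {
    0.002579370, -0.065277776,   0.659722220, -3.381944442,
    9.230555548, -12.552777774,  7.107142857,
};

int fib_fold(int n)
{
    double out = 0.0;
    for (int i = 0; i < 7; i++)
        out = out * n + fib_coeffs[i];
    return floor(out * n + 0.5);    /* the trailing "out *= n" folded in */
}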
This is silly. You should (almost) never optimize low-level numerical building-block routines for code style at the expense of performance.
The proper thing to do is make a generic library function for evaluating Chebyshev polynomials, and then pass in the polynomial coefficients and the values to evaluate in arrays. That’s a much more “functional” approach than hard-coding the coefficients could ever be.
Then if you want it to go fast, you can vectorize the code and use FMA instructions, and run it on multiple CPU cores, as in https://github.com/alkashani/numcrunch/blob/master/src/clens... (this is code my wife wrote for me; I think there’s still a factor of 2 or 4 being left on the table getting data flowing efficiently through the CPU caches, but we didn’t really deeply profile it to figure out what the issue was). Of course you can do even better by running this type of highly parallelizable code on a GPU.
>> This is silly. You should (almost) never optimize low-level numerical building-block routines for code style at the expense of performance.
>> The proper thing to do is make a generic library function for evaluating Chebyshev polynomials, and then pass in the polynomial coefficients and the values to evaluate in arrays.
It's funny that you express concern for performance and then advocate creating a generic function and passing everything as parameters. For things like the function in question, doing straight line code is usually the best way. Optimization should consist of what the author did - find the simplest mathematical expressions to calculate the answer and just write them down.
On the other hand, if for some reason a program is really doing this calculation over and over your approach (worrying about data flow and parallelism) may become preferable. I just haven't seen a case where sin(x) is used that much ;-)
I thought the OP’s code was fine. My main point was that the top-level poster’s argument about what counted as “functional” seemed silly, and that if he really wanted to go all the way he could write a more completely generic function which could also be made faster by getting data flowing through the CPU more efficiently or using a GPU. (I shouldn’t have phrased this as “the proper thing to do” without clarification, but I can’t edit it now.)
> if for some reason a program is really doing this calculation over and over
Most of the time when you want to evaluate sin(x) for one number x, you want to also evaluate it for a million other numbers. For example, you might be updating the velocity vectors of all the rockets in a space game, finding the map coordinates of a polyline representing the coastline of England, converting color space coordinates for every pixel in an image from CIELAB to CIELCh, or whatever.
In cases where you are just calling the sine function as a one-off, performance is pretty well irrelevant.
* * *
Since sin(x) has a few steps before and after the polynomial evaluation, you are right that it usually makes sense to hard-code the whole function with domain normalization etc. (ideally still vectorized), though those other steps could also be genericized if necessary, since you could also use them for various other special functions.
One last thing: I recommend trying to re-frame trigonometry problems as vector problems whenever possible, and thereby not needing the sine function at all. Code which avoids transcendental function evaluations altogether is usually simpler to reason about and faster.
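A toy illustration of what I mean (my own example, nothing from the OP): if the angle you care about is really the direction of some vector, normalize it and use the components directly as the cosine and sine, with no trig calls at all.
#include <math.h>
#include <stdio.h>

int main(void) {
    /* "Rotate (x, y) by the angle of the direction (dx, dy)" without ever
       computing that angle: normalize, then the components are cos and sin. */
    double dx = 3.0, dy = 4.0;
    double len = hypot(dx, dy);
    double c = dx / len, s = dy / len;

    double x = 1.0, y = 0.0;
    double xr = c * x - s * y;
    double yr = s * x + c * y;
    printf("rotated: (%g, %g)\n", xr, yr);   /* (0.6, 0.8) */
    return 0;
}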
>> One last thing: I recommend trying to re-frame trigonometry problems as vector problems whenever possible, and thereby not needing the sine function at all. Code which avoids transcendental function evaluations altogether is usually simpler to reason about and faster.
Your comment is beyond silly. The whole point of numerical code is to perform predictably and within the performance constraints of the specific problem. Style purism is the worst kind of elitism - it's braggadocio for its own sake without real-world merit. Every solution to a problem should be evaluated in relation to the problem domain and constraints given. Sometimes the more understandable version is better. Sometimes the functional version is better. But there is no stencil-based development methodology that would provide the best answer every time. And, yes, I like code too that does not have side-effects and reads more like math. That's not always the best code to solve a problem, though (unlike in problem specification, where striving for something like a TLA+-style presentation usually has its merits).
* OP uses standard optimization techniques. Their use has been justified over decades of actual research.
* OP uses a compiled language. OP can look at disassembly and see what's happening on the CPU.
* Your use of +reduce+ does not make sense at all - it's just syntactic sugar (you of course can use HOFs to handle constants, but that's completely missing the idea of function composition, which arguably is the only reason +reduce+ exists).
I hear your basic points, and this is a tangent on a tangent, but:
Rakudo, the existing implementation of standard Perl 6, is a compiler.
(I wouldn't be surprised if the code currently generated by Rakudo for this algorithm is far slower than the code generated by rustc but that isn't because it isn't a compiler.)
Loop unrolling as shown by the author is quite common in math functions. It is still quite readable and ensures good performance. OpenBLAS used similar tricks in the old days when there were no intrinsics. Not sure about now.
Doesn't Perl 6 have macros? If you have macros, you can do it both elegantly and efficiently by expanding Horner's rule at compile time. You can still use reduce; just have it return the form unevaluated.
In fact, because the coefficients are constant at compile time, you could precondition the polynomial and do even better than Horner's rule, with only three multiplications.
Perl 6 does have macros, but they are not quite mature, and that would probably be overkill for a six-coefficient polynomial. Reduce should be pretty fast anyway.
sin( ) is typically implemented with (more sophisticated versions of) exactly the same "tricks"; most library implementations trade off a little speed for more accuracy (and obviously support inputs from the whole number line, not just (-pi,pi)), but honestly they're pretty fast already.
There are also some libraries that provide vectorized implementations, often with a few options for different accuracy/performance tradeoffs.
No. Typical implementations use polynomial evaluation and range reduction. The advantage of CORDIC is that it only uses shifts and additions, and no multiplications. The disadvantage of CORDIC is that it converges very slowly. On modern hardware where multiplication is blazingly fast, CORDIC no longer makes sense.
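For anyone who hasn't seen it, the CORDIC recurrence is short enough to sketch. This version uses doubles for clarity; in a real fixed-point implementation every ldexp(v, -i) is just an arithmetic right-shift, which is the whole appeal.
#include <math.h>
#include <stdio.h>

#define ITERS 24

/* CORDIC in rotation mode: rotate the vector (K, 0) by angles of atan(2^-i)
   until the residual angle z reaches zero; the gain is pre-compensated in K.
   Valid for |angle| up to about 1.74 rad. */
static void cordic_sincos(double angle, double *s, double *c) {
    double atan_tab[ITERS], K = 1.0;
    for (int i = 0; i < ITERS; i++) {
        atan_tab[i] = atan(ldexp(1.0, -i));
        K /= sqrt(1.0 + ldexp(1.0, -2 * i));
    }
    double x = K, y = 0.0, z = angle;
    for (int i = 0; i < ITERS; i++) {
        double d = (z >= 0.0) ? 1.0 : -1.0;
        double xn = x - d * ldexp(y, -i);
        double yn = y + d * ldexp(x, -i);
        z -= d * atan_tab[i];
        x = xn;
        y = yn;
    }
    *c = x;
    *s = y;
}

int main(void) {
    double s, c;
    cordic_sincos(1.0, &s, &c);
    printf("sin(1) ~ %.8f  cos(1) ~ %.8f\n", s, c);  /* ~0.84147098, ~0.54030231 */
    return 0;
}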
For comparison, these are the results I got from benchmarking this implementation against Rust's standard .sin() on a Core i3 M 380 (fast_sine is the code from the blog, normal_sine is .sin()):
test fast_sine ... bench: 746,638 ns/iter (+/- 2,489)
test normal_sine ... bench: 1,535,147 ns/iter (+/- 13,731)
These ns/iter values seem crazy (a sin function that takes 1.5 ms is just really bad). There is an implicit loop when using Rust's benchmark feature; no need to put your own loop inside.
Which "few operations" do you have in mind? Accurately reducing an arbitrary argument into (-pi, pi) in double precision requires over 1000 bits of pi--you don't use all of them for any one input, but you need different bits depending on the exponent, and in the worst a bit more than 150 bits contribute meaningfully to the reduced result (see, e.g. https://www.csee.umbc.edu/~phatak/645/supl/Ng-ArgReduction.p...)
I'm confused. Doubles can only represent integers exactly up to 2^53 anyway. Any higher than that, and the rounding of the input is going to vastly overwhelm any inaccuracy of the sine function. E.g. if your input is 2^54+5, floating point will round it to the nearest representable value, 2^54+4, before it even goes into the sine function. That gives a totally wrong answer relative to the exact input, even if the sine function is 100% perfect! So there's no point in handling numbers that large.
If you care about writing standard library functions which conform to a certain error bound, always. If one does not care, one should reconsider whether their routine belongs in a standard library.
As far as I can tell, the standard library functions don't generally handle numbers that large. Try it yourself in the browser, e.g. `Math.sin(Math.pow(10, 15))`. Floating point rounding errors make computing the sine pointless.
When working in computer graphics, it is really easy to end up with some variant of taking the sin of a very large angle; if you imagine a simple cube spinning over time, it's easy to end up with the angle of the cube simply increasing without bound. I believe it's common practice to try to truncate the angles in that case, but in real programs, where you're dealing with arbitrarily complicated networks of operations involving billions of numbers, it's not that hard for them to slip through.
From my own experience, if you're using angles much at all you're doing it wrong. First, the cube angle should be reduced to reasonable range quite often. Better in 3d is to use a different representation for orientation - a matrix or a quaternion are common and both should be "normalized" regularly. These also have the advantage of eliminating trig functions from the code.
In my day job doing motor control, I track shaft angle using a 16 bit fixed-point integer. It wraps around perfectly and I've got an awesome sin(x) function that takes -32768 to 32767 as input (equivalent to -pi to +pi) and returns a fixed point value in a few operations.
It's all a matter of using a representation that make whatever you're doing easy to compute.
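Not your routine, obviously, but for readers who haven't seen one: a common shape for that kind of function is a quarter-wave table plus linear interpolation, something like the sketch below (the table size and scaling are arbitrary choices here).
#include <stdint.h>
#include <math.h>
#include <stdio.h>

#define TAB_BITS 8   /* 256-entry quarter-wave table */

/* sin_tab[i] ~ 32767*sin((pi/2)*i/256), padded so the interpolation below
   never indexes past the end. */
static int16_t sin_tab[(1 << TAB_BITS) + 2];

static void init_sin_tab(void) {
    for (int i = 0; i < (1 << TAB_BITS) + 2; i++) {
        double a = 1.57079632679489661923 * i / (1 << TAB_BITS);
        sin_tab[i] = (int16_t)lround(32767.0 * sin(a));
    }
}

/* angle: -32768..32767 maps to -pi..pi; returns sin scaled to -32767..32767 */
static int16_t fixed_sin(int16_t angle) {
    uint16_t a = (uint16_t)angle;              /* reinterpret as phase mod 2*pi */
    uint16_t q = a & 0x7FFFu;                  /* phase within one half turn    */
    if (q > 0x4000u)                           /* mirror the second quarter:    */
        q = (uint16_t)(0x8000u - q);           /*   sin(pi - t) = sin(t)        */
    unsigned idx  = q >> (14 - TAB_BITS);
    unsigned frac = q & ((1u << (14 - TAB_BITS)) - 1u);
    int32_t lo = sin_tab[idx], hi = sin_tab[idx + 1];
    int32_t v  = lo + (((hi - lo) * (int32_t)frac) >> (14 - TAB_BITS));
    return (int16_t)((a & 0x8000u) ? -v : v);  /* second half turn is negative  */
}

int main(void) {
    init_sin_tab();
    printf("%d %d %d %d\n", fixed_sin(0), fixed_sin(16384),
           fixed_sin(-16384), fixed_sin(8192));  /* 0 32767 -32767 ~23170 */
    return 0;
}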
This is pretty cool. I always find it interesting when cool mathematics ideas are used to help computer science.
I remember a while back I tried implementing something like this in a hobby operating system I was making, with constants similar to the ones in the post. I found them in some random paper that included constants for a bunch of different trig functions. I don't think it ended up working, and I ended up using a table of values that it interpolated between.
One of the major concerns I have with this code, and codes derived from this, has to do with a specific numerical underflow that can be seen happening before the series is summed.
1st:
let pi_major = 3.1415927f32;
let pi_minor = -0.00000008742278f32;
This is fine, pi_major is basically pi. pi_minor is a small constant.
One knows from basic numerical analysis as applied to floating point arithmetic [0] never to do additions/subtractions along the lines of x +/- c*epsilon and expect to get anything close to an accurate result.
What you are really doing is adding/subtracting numbers at widely different scales. This is a bad thing in general, and the only real way to handle it is to increase precision. Otherwise you completely swamp the small number with the larger one, and you lose all the precision of the smaller number.
Simple example in Julia (0.5.1 or higher)
julia> deltax = Float16(0.1)
Float16(0.1)
julia> sum = Float16(0)
Float16(0.0)
julia> for i=1:1000000
           sum += deltax
       end
julia> sum
Float16(256.0)
Yes, that is right, according to this, 1/10th as a Float16 summed 1 million times reports as 256. Which is wildly ... ridiculously ... wrong.
Then redo this as Float32
julia> deltax2 = Float32(0.1)
0.1f0
julia> sum2 = Float32(0)
0.0f0
julia> for i=1:1000000
           sum2 += deltax2
       end
julia> sum2
100958.34f0
Much closer, but you can still see the impact from the catastrophic loss of precision [1]. This is a very well known phenomenon, that people seem to forget about every now and then, until it comes back to bite you.
I illustrated this for an HPC/parallel programming class I taught a decade ago [2] using some simple C code. The math isn't the problem. It's the order of accumulation of error.
This should be fairly sobering to those who are happy running f16 codes ... you need to be very careful about the order of your operations (far more so than with f32, or f64), lest you suffer a catastrophic loss of precision, and wind up with a code that "computes faster", or, as we old HPC types like to say about codes that make these sorts of mistakes, "generates garbage at a tremendous rate".
During my class a decade ago, I had an astronomer who coded in Fortran complain to me bitterly that he's always done sums this way, and there was no way that this concept was correct. The C code was written to demonstrate the effect to him.
It is sobering how much garbage numerical code there is out there. It gets even harder to distinguish good numerical code from bad numerical code when people get lost in almost irrelevant things, like coding style, or language of implementation.
If you really want to freak someone out, and you have a calculator app, see if you can take 1/3 + 1/3 + 1/3 - 1 (or 1/7 + ... + 1/7 - 1). If they use less precision than f32, the 1/3 trick will get you a few ULP from zero. If they use something below f64, the 1/7 trick and a number of others will show this.
Basically, do a forward and reverse function application of a non-zero, non-pathological argument, and subtract the original value from this. A compiler may optimize this away. An interpreter won't. And you'll see the issue of precision right there in the non-zero result.
Note as well, some apps will try to be "smart" and handle that case (sort of like optimizing benchmark codes with specialized handling). This is more troubling than doing the right thing, but I've always been able to find ways to expose the precision limits with simple tests like this.
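The same kind of check in C, for anyone who wants to try it outside a calculator app (volatile keeps the compiler from constant-folding it away, per the caveat above):
#include <math.h>
#include <stdio.h>

int main(void) {
    /* forward (sqrt) and reverse (square) application, minus the original:
       zero in exact arithmetic, a nonzero value on the order of 1e-7 in f32 */
    volatile float x = 2.0f;
    float r = sqrtf(x);
    printf("sqrt(2)^2 - 2 = %g\n", (double)(r * r - x));
    return 0;
}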
The limited precision isn't a problem; it's how you make use of it, and how you use, compare, and add/subtract numbers with it, that matters.
These are good points in general, but I don't think they're relevant to the author's partitioning of PI to get extra precision near the zeros. The article does a good job of explaining exactly what's going on here, and has plots to prove that nothing is going catastrophically wrong.
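To back that up with a standalone check (mine, not from the article): for a float sitting right next to pi, the two-step subtraction recovers bits of pi that a single f32 constant can't hold, which is exactly what matters for relative error near the zero.
#include <stdio.h>

int main(void) {
    const double pi_d = 3.14159265358979323846;
    float pi_major = 3.1415927f;                 /* the post's split constants */
    float pi_minor = -0.00000008742278f;
    float x = 3.1415926f;                        /* a float right next to pi   */

    float single = x - (float)pi_d;              /* one rounded pi constant    */
    float split  = (x - pi_major) - pi_minor;    /* two-step subtraction       */
    double exact = (double)x - pi_d;             /* double-precision reference */

    /* single is ~ -2.4e-7, while split and exact are both ~ -1.51e-7:
       the one-constant version is off by over 50% this close to the zero */
    printf("single: %g  split: %g  exact: %g\n", (double)single, (double)split, exact);
    return 0;
}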
Anyway, anyone interested enough in this to be reading this comment thread should go look up Approximation Theory and Approximation Practice by Trefethen [0], which is a very, very good book that covers Chebyshev polynomials in a very clear way. The real party trick you get out of it, though, is that by taking samples of a function at the roots of the Chebyshev basis polynomials and applying the discrete cosine transform, you can get a set of Chebyshev coefficients to approximate the function in O(n log n) time.
[0] https://people.maths.ox.ac.uk/trefethen/ATAP/
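To make that party trick concrete, here's the plain O(N^2) version of the sum written out (just a sketch of mine); the inner loop is exactly a DCT-II of the samples, which is why handing it to a fast DCT (FFTW's REDFT10, for instance) gives the O(n log n) claim.
#include <math.h>
#include <stdio.h>

#define N 16

int main(void) {
    const double pi = 3.14159265358979323846;
    double samples[N], c[N];

    /* sample the target function (sin here) at the roots of T_N,
       x_j = cos(pi*(j+1/2)/N) */
    for (int j = 0; j < N; j++)
        samples[j] = sin(cos(pi * (j + 0.5) / N));

    /* c_k = (2/N) * sum_j f(x_j) * cos(k*pi*(j+1/2)/N) -- a DCT of the samples */
    for (int k = 0; k < N; k++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += samples[j] * cos(k * pi * (j + 0.5) / N);
        c[k] = 2.0 * s / N;
    }

    /* the approximation on [-1, 1] is then c[0]/2 + sum_{k>=1} c[k]*T_k(x) */
    for (int k = 0; k < N; k++)
        printf("c[%d] = % .3e\n", k, c[k]);
    return 0;
}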