How many floating-point numbers are in the interval [0,1]? (lemire.me)
200 points by mgdo on March 1, 2017 | 153 comments


The correct answer is 1,065,353,216.

This is easy to work out yourself if you remember one basic, handy property of floats: adjacent floats are adjacent in bit representation, except -0.0f and 0.0f.

For example, 0x00000000 is +0.0f. 0x00000001 is the smallest non-zero positive float. 0x00000002 is the second smallest.

The only exception to this is -0.0f and +0.0f, which are 0x80000000 and 0x00000000. The rule works with denormal floats, normal floats, and even right on the line between the two. If you want the next positive float, you always just add 0x00000001.

Now, +1.0f happens to be 0x3f800000. Recall +0.0f is 0x00000000. The number of values between the two is 0x3f800000 - 0x00000000 == 0x3f800000. Write that in decimal and you get 1065353216.
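
A quick way to check this yourself (a rough sketch, assuming IEEE 754 binary32 and a platform where float and uint32_t have the same size and endianness):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      float one = 1.0f, zero = 0.0f;
      uint32_t bits_one, bits_zero;
      memcpy(&bits_one, &one, sizeof bits_one);    /* 0x3f800000 */
      memcpy(&bits_zero, &zero, sizeof bits_zero); /* 0x00000000 */
      printf("0x%08x\n", (unsigned)bits_one);
      printf("%u\n", (unsigned)(bits_one - bits_zero));  /* 1065353216 */
      return 0;
  }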


The correct answer is 1,065,353,217. You have an off-by-one error in the following line of argument:

> The number of values between the two is 0x3f800000 - 0x00000000 == 0x3f800000

This should be corrected to:

> The number of values between the two is 0x3f800000 - 0x00000000 == 0x3f800001

In general, the number of numbers in the closed integral interval [a, b] is not b-a, but 1+b-a. For example, the closed interval [0,10] contains 11 integers, not 10.

(The above assumes that the original question assumes the floating point number -0 to be outside of [0,1]. Otherwise, you'll have to add 1 more.)


I'll not endorse your actual line of argument, but without knowing much about the internals of a 32-bit float, I can confirm that your final answer is correct.

Source: brute force counting in a small C program.

Naturally, this might not scale to 64-bits, unless you have lots of time and computing power to spare.
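
Something along these lines (just a sketch, assuming C99's nextafterf from <math.h>; it walks every float from +0.0 to 1.0 inclusive, denormals included, so it takes a little while):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      unsigned long count = 0;
      for (float x = 0.0f; x <= 1.0f; x = nextafterf(x, 2.0f))
          count++;                /* visits +0.0, every denormal, and every normal up to 1.0 */
      printf("%lu\n", count);     /* 1,065,353,217 on an IEEE 754 machine */
      return 0;
  }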


This type of off-by-one error is so common it has its own name: fencepost error[1].

One little note, is that I would disagree with your correction.

> This should be corrected to:

> > The number of values between the two is 0x3f800000 - 0x00000000 == 0x3f800001

I think the value originally computed was correct. It would make more sense to correct the wording rather than just the computed value.

i.e.

> The number of values in the range [0x00000000, 0x3f800000] is 0x3f800000 - 0x00000000 + 1 == 0x3f800001

[1]: https://en.wikipedia.org/wiki/Off-by-one_error#Fencepost_err...


I was under the impression (after several thousand attempts) that the correct answer is infinity, limited only by your addressable memory and bitspace.


I think people are assuming IEEE 754 floating point (as per the article). Your memory is finite, 32 bits. https://en.wikipedia.org/wiki/Floating_point


The discrepancy with the number in the article is that you're including 2^23 mantissae with exponent 0. In any case, the interesting bit is that half of floats are in the interval [-1, 1].


> the interesting bit is that half of floats are in the interval [-1, 1]

This is not so surprising once you realize that this just means you have the same precision for 1/x as for x, which makes a lot of sense for scientific calculations.


I'll clarify mostly for my sake (I hadn't thought about this before)

f(x) = 1/x takes numbers in the range (-1, 1) to numbers outside that range, and vice versa. The representable floats are split evenly between those two sets.


Yes, that was exactly my line of thinking.

One minor nitpick: f(x)=1/x maps [-1,1] to [-inf,-1]u[+1,+inf] ... that is, 1 and -1 are in both sets and mapped onto themselves. However, this detail doesn't affect the argument much.


the question is [0, 1] so wouldn't that include +/-0 and 1, so 3 more?


Probably shouldn’t include –0, which is used to represent underflow of negative numbers.


Why 3 more? Grandparent already counted one of the border points, so the correct number is either 1 more or 2 more, depending on whether you consider the floating point number -0 to be in the closed interval [0,1] or not.

See my other posting for more details: https://news.ycombinator.com/item?id=13772578


-0 is "below zero, but for so little that we cannot represent", so it's outside [0, 1].


Though for the purposes of comparisons, -0 and +0 are the same in IEEE 754.

So the expression "x >= 0.0f && x <= 1.0f" is true for x = -0.
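
A tiny check (assuming IEEE 754 comparison semantics):

  #include <stdio.h>

  int main(void) {
      float x = -0.0f;
      /* -0.0f compares equal to +0.0f, so the range test accepts it */
      printf("%d\n", x >= 0.0f && x <= 1.0f);  /* prints 1 */
      printf("%d\n", -0.0f == 0.0f);           /* prints 1 */
      return 0;
  }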


Honestly, I think IEEE 754 floating point numbers are a pretty bad way of dealing with real numbers and just cause tons of headaches that every math library has to deal with. Even high level programmers aren't shielded from the NaN nonsense. It would be great if we had an underlying implementation we could ignore and that allowed us to think at a more mathematical level.

I actually found something recently on this topic while googling for any alternatives, its called Unum and looks very interesting: https://en.wikipedia.org/wiki/Unum_(number_format)

Does anyone know more about this? Are there any chips out there that support it?


so I'm the author of one of the first usable unum implementations. We're actually moving to something called sigmoid numbers which are even better. The lecture doesn't cover "valid mode" but that's more like the "unum". At the end of the video, there's a demonstration where I show some very interesting results concerning machine learning.

https://www.youtube.com/watch?v=aP0Y1uAA-2Y

You can try out sigmoid numbers in julia here:

https://github.com/interplanetary-robot/SigmoidNumbers

Also, I am working on some verilog implementations, so we may have this in hardware rather soon.

A few things about sigmoid numbers: 1) they have fixed sizes, so you don't have to worry about keeping track of things (which was a pain). 2) they have a property I call "isomorphic structure" - which means that zero-padding does not affect the value, and truncation gives you the nearest value at a lower bit resolution. 3) they really do perform better than floating points in several applications. There's a higher bit-per-bit entropy for 'most things you want to do', so the information content is richer. The distribution of the numbers is "more human" - that is to say at a certain point you stop caring about the details around the number, and more care that it's 'really big' or 'really small'; and the distribution reflects that.


> very interesting results concerning machine learning

This is the big chance for alternative number formats. A whole lot of people are working on neural net ASICs these days and IEEE 754 is not a requirement. A new number format with a low-power/low-area hardware implementation could easily find adoption in this new area.


Thanks. I'm working on an Exciting architecture in that direction. We'll see if I can raise money.


Those sigmoid numbers look interesting. Is there an article somewhere that explains how they work?


no, not yet. john literally invented them either late december or early january. If you have any questions, feel free to ask me (contact info in profile). I'm also happy to send you a draft "article" that John is writing.


A cursory examination of the documentation for and against the unum stuff seems to indicate a vehement ad hominem argument running between Gustafson and Kahan, which makes it somewhat hard to find objective information.

The unum stuff does seem to be greatly oversold, though. Fundamentally, there is a problem in mathematics: some equations naturally blow up tiny uncertainties into major ones. Indeed, in chaotic systems, such uncertainties can be extremely difficult to describe accurately. Floating point numbers would give you an implausibly precise but meaninglessly inaccurate number, whereas unums would indicate that the answer is basically everything. The difference here basically boils down to Kahan arguing that you should do numerical analysis first to figure out if the result will be meaningful, while Gustafson wants to throw calculus and numerical analysis out the window.

Surprisingly, the dispute is less about the actual format than the style of numerical algorithms. Gustafson seems to be advocating an iterative, increasing-precision approach to solving problems like root finding in lieu of standard numerical techniques. Kahan's retort is that the state-of-the-art techniques are more than sufficient, and that Gustafson's techniques take longer to achieve acceptable results. I didn't see a response from Gustafson to the latter accusation--the slide deck I see has him shifting from an argument about the power usage involved in storing 64 bits of "unnecessary" precision to a commentary about the parallelizability of his approach (no comment about the work-efficiency!).

Personally, I think that the unum approach is not particularly satisfying as a better alternative to floating point. The advantages that are presented focus on storage space and parallelizability--and the fact that the precision is meant to slowly ("automatically", which I scare-quote because it doesn't appear to be done for the programmer) increase implies that the hardware has to deal with variable-sized data formats, which is not conducive to more efficient computing. It's only efficient to use more compressed formats if it's saving you a lot of work compared to a dense format--and it's not at all clear that unums are actually saving work.


> I didn't see a response from Gustafson to the latter accusation--the slide deck I see has him shifting from an argument about the power usage involved in storing 64 bits of "unnecessary" precision to a commentary about the parallelizability of his approach (no comment about the work-efficiency!).

Gustafson has addressed this. His argument essentially boils down to the fact that moving 64 bits to memory costs far more than doing the actual computation. Therefore, you have to spend more transistors doing the computation for unums, but you spend much less energy overall, since you move less data.


Wikipedia provides a pretty good summary:

William Kahan claims the following issues with unums:

1. Unum computation does not always deliver correct results.

2. The description of unums sidesteps using unum for solving calculus problems.

3. Unums can be expensive in terms of time and power consumption.

4. Each computation in unum space is likely to change the bit length of the structure. This requires either unpacking them into a fixed-size space, or data allocation, deallocation, and garbage collection during unum operations, similar to the issues for dealing with variable-length records in mass storage.

5. Unums provide only two kinds of numerical exception, quiet and signaling NaN (Not-a-Number).

6. Unum computation may deliver overly loose bounds from the selection of an algebraically correct but numerically unstable algorithm.

7. The costs and benefits of unum over short precision floating point for problems requiring low precision are not obvious.

8. Solving differential equations and evaluating integrals with unums guarantee correct answers but may not be as fast as methods that usually work.

Point #4 makes implementation tricky and slow. Point #6 (from what I understand) means unum operations tend to guarantee correctness by being overly vague - by denoting results that FP can produce using ranges that are just too big to be useful.


If NaN is nonsense, what would you have 1/0 or acos(2) produce? (An exception is the same NaN, only less conveniently packaged.)


Ideally I would restrict all functions to their natural domain using refinement types. This way 1/0 simply doesn't type check and you don't have to bother assigning it some nonsensical value. However, I can see something like this becoming much more of a hassle than a benefit when things get complicated enough.


I'm not very familiar with refinement types. How would that work?

For example, how would you write a function that returns the mean value of an array of numbers? You could define Array.Length to be of type NonNegativeInteger, but that still includes 0. Alternatively, you'd have to define a NonEmptyArray type that excludes arrays of length 0, but that seems less than useful.


> Alternatively, you'd have to define a NonEmptyArray type that excludes arrays of length 0, but that seems less than useful.

I think the usefulness of a specific NonEmptyArray type is controversial, and I would say such types could be extremely useful. AFAIK there have been discussions about whether Haskell's Prelude (part of the standard library providing a namespace of convenient functions and types, similar to the built-ins you get in e.g. Python, like map, zip, filter, range, etc.) should be made safer by not advocating overly generic types and functions like the standard `map`, `head` and `tail` implementations.

Functions like `head` and `tail` can be seen as problematic since they can fail and cause runtime errors (the head and tail of an empty list cause a runtime error), but I do not think they have changed that yet (or that it's even a topic of discussion at the moment).

It's a little bit a matter of taste. Think about Unix's "head" program, which by default prints at most 10 lines of a file. How should we deal with the following?

    $ touch empty.txt
    $ head -n1 empty.txt
Currently this works - you do not get output and $? == 0, but one could also say "head" should panic here.


There's no reason to "panic" about perfectly reasonable, valid use cases. `head` and `tail` in Haskell are badly designed (to address one of your examples), which is why every SO question about them and every question on haskell-beginners ends with the same answer: don't use them, pattern match instead.


Yes, that's exactly how it works. See this article for an example: http://nikita-volkov.github.io/refined/

The thing is that, instead of redefining everything yourself, you could use a library that provides you with a set of predicates to compose and build the refinement you need. There are also much more sophisticated ways to do this, like Liquid Haskell.


I wonder how this could be done at the hardware level.

You can look at the NaN value as an implementation of Option / Maybe, where it means None. It even behaves like it should, short-circuiting any normal operations with it to also produce NaN. Too bad Fortran and C lack sum types to surface that properly on the type level.
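
A little sketch of that short-circuiting behaviour in C (assuming the usual IEEE 754 quiet-NaN propagation):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double x = sqrt(-1.0);          /* NaN: the "None" case */
      double y = 2.0 * x + 1.0;       /* still NaN: ordinary operations pass it through */
      printf("%d\n", isnan(y) != 0);  /* prints 1 */
      return 0;
  }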


Agreed, NaN can be very useful indeed in certain circumstances. If nothing else, a NaN is still a floating point primitive, and can be stored in a large array with other floating point numbers, whereas an exception cannot. If you want an exception, just test Double.isNaN(), and throw one.

NaNs should not be scary.


> An exception is the same NaN, only less convenientlying packaged.

I prefer an exception because I'd rather have my code fail fast, and immediately point me close to the source of the bug, rather than letting a NaN or Infinity propagate through my code base and cause some harder-to-debug problems down the line.

That's the main thing I don't like about JavaScript. 1/0 is Infinity, Math.acos(2) is NaN, Object().foo is undefined, etc. I find myself wasting a lot of debugging time just figuring out where those (usually expected) values came from originally.


Exceptions are utterly painful for vectorized operations. Goodbye Tensorflow if you say goodbye to NaNs


That's an excellent explanation for why scientific computing has been using vectors for a long time and has traditionally faulted when hitting the first NaN. Is there something about Tensorflow that makes propagating NaNs useful? And if so, why are you generalizing from Tensorflow to all "vectorized operations"?


my guess is that this is a performance argument. If you don't have to check the results, your circuits to do huge vector calculations can be simpler.


I'm in favour of 1.0 / 0.0 = Infinity, as in the current IEEE floating point standard. It makes much more sense with many algorithms. Checking whether the denominator is 0 or not costs time and unnecessarily bloats the code.


The JS behavior regarding 1/0, etc, is the same as C: http://www.gnu.org/software/libc/manual/html_node/Infinity-a...


Yes, but the issue comes up more often in JS and PHP since they don't have native integer division.


> I'd rather have my code fail fast

Doesn't it depend on the product in question? In some cases, you want to fail fast and loud. In others, you want to recover at all costs. Sometimes a single calculation going 1/0 means the whole program is failing. Sometimes it's a completely ignorable, minor error, not even worth checking for.

Can we please not have a "default" way of handling errors? None fit all.


> Can we please not have a "default" way of handling errors? None fit all.

Well language designers have to make the choice. I think overall, the best approach is Python's behavior, where both 1/0 and 1.0/0.0 trigger a ZeroDivisionError.


They could give options. E.g. have a "throwing float" and a "NaN float" as distinct data types, or some other way to parametrize the behaviour. Is that a good idea? I don't know.


> Well language designers have to make the choice.

Of course. I meant, no "default" that we would consider the "right way" for all the languages.


For floats, 1.0 / 0.0 produces Infinity according to the IEEE standard. I prefer it that way because it makes perfect sense for the algorithms I usually deal with, and having to check whether the denominator is zero or not makes code slower, which isn't an option if you're crunching tens of millions of numbers a second. It also unnecessarily bloats code.


Whatever you choose for that, if it doesn't throw an exception, then given x = 1/0 (or x = something else NaNish) we really should have x == x. Basic equality really needs to be reflexive, or else you're asking for trouble.


you throw an error and stop the calculation, and let the programmer decide what to do.


"Throw an error" makes no sense at the hardware level. That's like saying a NIC should "throw an error" when it loses link. It's up to languages and libraries to turn signals/registers/flags into higher-level concepts like structured exceptions.


The IEEE standard contains a signalling NaN, which is intended to raise an error when an attempt is made to use it. It is not the result of any calculation, however, so typically it's only used as a marker to indicate that code is improperly attempting to access uninitialized data. There are such things as hardware exceptions and hardware interrupts on exceptions, and these will trigger on sNaN in a (compliant) hardware FPU implementation.

My contention is that all NaN-resulting operations (Inf * 0, sqrt(-), e.g.) should halt the program instead of propagating a silent, dud value. When you reach them, it means usually there is something wrong with your algorithm (or less commonly, your data) and it is better to know that sooner than later.


> Does anyone know more about this? Are there any chips out there that support it?

Not an answer to your question, but I found out recently that there is definitely significant interest in ideas like interval arithmetic in hardware among mathematicians. The application I am somewhat familiar with is "brute forcing" proofs of inequalities, which can get very annoying, even for apparently easy stuff like 2 variable ones.

Indeed, "The Princeton Companion to Applied Mathematics" edited by N. Higham et al has an entire article on this (in section II.20). In particular, the article has a reference to a library called filib++ (http://www2.math.uni-wuppertal.de/~xsc/software/filib.html). I unfortunately have no idea how it performs/how good it is.

By the way, thanks a lot for the Unum reference; seems like it is still in too early a stage for inclusion in the book above.


I emailed Nick Higham asking about Unums a year and a half ago. Hopefully he doesn’t mind me quoting from private email:

> I have Gustafson's book and have read a lot of it. I don't yet know what to make of it. [..] His Unum system is a completely different way of working to floating point, so all I can say is that standard error analysis techniques will probably not be applicable.

> The book does not yet seem to have attracted much attention in the numerical analysis community.


Floating point is Nick Higham's thing, isn't it? I have his book "Accuracy and Stability of Numerical Algorithms, Second Edition"

http://www.ma.man.ac.uk/~higham/asna/

but I haven't got very far with it yet.

Maybe the article in the Princeton Companion is a summarized version of his book?


You say it's pretty bad and nonsense, yet you haven't found (or even looked very hard for) alternatives. So maybe don't jump to the conclusion that floating point is terrible just yet :)


Just reading the list of disadvantages, the format sounds interesting but not that practical in a world where massive numbers of FPUs operating within a thermal limit is a primary design concern.


He references one of my favorite functions, 'nextafter'. http://en.cppreference.com/w/cpp/numeric/math/nextafter

I'm the kind of person who understands floats by looking at the underlying bit patterns, and nextafter makes it easy to traverse the integers.
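
For example (a small sketch using the C variant, nextafterf):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      float up   = nextafterf(1.0f, 2.0f);  /* smallest float above 1.0f */
      float down = nextafterf(1.0f, 0.0f);  /* largest float below 1.0f */
      printf("%.9g\n%.9g\n", up, down);     /* 1.00000012 and 0.99999994 */
      return 0;
  }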


> He references one of my favorite functions, 'nextafter'. http://en.cppreference.com/w/cpp/numeric/math/nextafter

It is a cool function. I remember using it for some testing I did while checking the accuracy of various implementations of basic math functions.


> Of all the floating-point numbers your computer can represent, a quarter of them lie in [0,1].

Many people think floating point numbers are magically precise. They're not. They're far more accurate at lower magnitudes and precision fades as you work with larger values.


Or you can think of it as precision in the number of significant digits. Each floating point number can only store a certain number of significant digits. The fractional precision of each float is the same, but the absolute magnitude of the error in the float value increases as the value increases.
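
A concrete way to see the absolute error growing with magnitude (a sketch, assuming IEEE 754 binary32):

  #include <stdio.h>

  int main(void) {
      float big = 16777216.0f;            /* 2^24: above this, adjacent floats are 2 apart */
      printf("%d\n", big + 1.0f == big);  /* prints 1: adding 1 no longer changes the value */
      printf("%.1f\n", big + 2.0f);       /* 16777218.0 */
      return 0;
  }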


That's how I learned it.


>Many people think floating point numbers are magically precise.

Even worse, it kicks in earlier than one might naively think. People are often surprised to hear a value as simple as 0.1 has no matching floating point representation.

(For those interested, http://www.exploringbinary.com/why-0-point-1-does-not-exist-... has a nicely illustrated explanation.)
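
Easy to see by asking printf for more digits than it normally shows (a sketch with C doubles):

  #include <stdio.h>

  int main(void) {
      printf("%.20f\n", 0.1);            /* 0.10000000000000000555 */
      printf("%d\n", 0.1 + 0.2 == 0.3);  /* prints 0 */
      return 0;
  }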


Being unable to represent 0.1 exactly is unrelated to any precision limitation. It's because the quantization of binary and decimal is different.


This isn't true with e.g. decimal64.


Those sorts of numbers aren't floating point, they're typically fixed point.


No, that's explicitly a floating point format. It's one of the representations in IEE-754-2008. The three decimal floating point formats are decimal32, decimal64, and decimal128. That people often use them where fixed-point math would work doesn't mean they're not floating point formats.


John Gustafson’s new “sigmoid unum”/“posit” proposal is kind of interesting; see his recent talk: http://web.stanford.edu/class/ee380/Abstracts/170201.html https://news.ycombinator.com/item?id=13562164

It does a variable number of fraction bits, so that the values near 1 are even more densely represented than under the usual IEEE floats, while also providing greater dynamic range (but at reduced precision).

I made a little diagram showing a visual comparison of every represented value in a toy 8-bit version vs. IEEE-style floats: https://groups.google.com/group/unum-computing/attach/e80274...


Out of curiosity, I started watching this talk, but right there in the introduction, the very first example is a dot product where supposedly IEEE 754 double precision gets the wrong answer; I stopped to check the result, and I got the correct answer with double precision (even without binary sum collapse). Then, he says the x87's results are nondeterministic due to being affected by cache, which is incorrect. Then, he trots out a series of half-truths about IEEE 754, which are somewhat true but don't mean what he's presenting them to mean. He even cites a Cray 1 quirk as an example of IEEE 754 weirdness, when in reality, the Cray 1 (1975) predated IEEE 754 by about 10 years.

Designing a good arithmetic system takes a lot of attention to detail, and he doesn't make a good impression by being that loose with details in his introduction.


Doubles on my Intel give the correct answer too, but change the exponent used everywhere to 8 and you'll see the problem. Gustafson may have used a different implementation of IEEE 754 that gave him a different result.

There are problems with the repeatability of float operations, such as loss of precision when moving values from registers to memory and data alignment issues, but it seems that these can be avoided with proper compiler options, at the expense of speed.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323

https://software.intel.com/en-us/articles/run-to-run-reprodu...


There is no "different implementation of IEEE 754" that could give a different result in double precision for that example.

There are reasonable criticisms of IEEE 754; that's not the issue.


> There is no "different implementation of IEEE 754" that could give a different result in double precision for that example.

How do you know that?


IEEE 754 is a standard, and it requires a specific, bitwise-reproducible, answer for the computation in question. I'm the author of most of the floating-point tests in WebAssembly's conformance testsuite, which tests such things in practice across several hardware platforms.

One of the half-truths in the presentation (in the intro) is that IEEE 754 is a mixture of requirements and recommendations. IEEE 754 does have both requirements and recommendations; however, what the presentation doesn't say is that, within a given format like double precision (aka binary64), the basic operations (add, subtract, multiply, divide, squareRoot, etc.) have exactly one possible result for any given input (except that NaNs may have some implementation-defined bits, though this is usually unimportant).


The Cray 1 was done before anyone realized the importance of denormals. And denormals are gruesome to implement on vector machines anyway.


Only a subset of scientific codes benefit from denormals, and it does not appear that implementing them is that big of a deal, given that 4-stage pipelines can do it. The pipeline depth to memory is a lot bigger than 4.


> Only a subset of scientific codes benefit from denormals, and it does not appear that implementing them is that big of a deal, given that 4-stage pipelines can do it.

Maybe I'm misunderstanding your terminology, but it seems like you are saying that operations involving denormals have the same latency as normal floating point multiplications and additions. At least for multiplication on Intel chips through Haswell, I think the current case is that subnormals between 0 and FLT_MIN still have abysmal performance --- 100+ cycles of penalty.

Here's Bruce Dawson from a few years ago:

Performance implications on SSE

Intel handles NaNs and infinities much better on their SSE FPUs than on their x87 FPUs. NaNs and infinities have long been handled at full speed on this floating-point unit. However denormals are still a problem.

On Core 2 processors the worst-case I have measured is a 175 times slowdown, on SSE addition and multiplication.

On SandyBridge Intel has fixed this for addition – I was unable to produce any slowdown on ‘addps’ instructions. However SSE multiplication (‘mulps’) on Sandybridge has about a 140 cycle penalty if one of the inputs or results is a denormal.

https://randomascii.wordpress.com/2012/05/20/thats-not-norma...

And here's the overview from a slightly more recent paper:

C. Subnormal Performance Variability

Due to the complex nature of the floating point numbers, processors struggle to handle certain inputs efficiently. In particular, it is well understood that operating on subnormal values can cause extreme performance issues, including slowdowns of up to 100× [19]. As an example, on a Core i7 processor using SSE instructions, performing standard multiply between two normal numbers takes 4 clock cycles, whereas the same multiply given a subnormal input takes over 200 clock cycles.

https://cseweb.ucsd.edu/~hovav/dist/subnormal.pdf

Are you saying this has been fixed in recent (or non-Intel) chips? Or maybe you were considering only NaN and Inf when you said denormals?


Sorry, you're correct, I was thinking about the speed of NaN and Inf.


If you consider every iterative algorithm that solves systems of nonlinear equations a subset you can ignore.... SPICE, linpack, fishpack.... sorry, denormals are essential to today's algorithms.

As to "no big deal".... show me your code.

Don't worry, I'll be able to understand it. After 20 years as a CPU designer I've learned how to understand bit-bashing.


The iterative algorithms for all of the scientific fields that I've worked on don't fall within the subset that you're describing, so no, I'm not going to agree that you're speaking about the majority of today's algorithms. Weather, climate, finite differencing, FFT... nope. Implicit systems, sure. Radar cross-sections, sure.

And as a CPU designer, are you disagreeing that most FP units today take 4 cycles (fully pipelined) (add and multiply, of course)? And that main memory is a lot farther away than that?

Given 20 years of experience, you missed out on the Cray PVP machines that were the start of this sub-thread. But the cycle counts I'm giving are modern ones. Cray did eventually implement IEEE in these machines without any significant problems, but that was more than 20 years ago.


Every number system in which all numbers occupy the same amount of space must have this property (namely, that there are roughly as many positive numbers above 1 as below 1, and therefore that there is greater precision near 0) or the accuracy of the function x->1/x will suffer.


Correct me if I'm wrong, but my understanding is that a consequence of this is that it's better to store, for example, angles as radians from -pi to pi rather than as degrees between 0.0 and 360.0, in order to take advantage of all that precision between 0 and 1.


The most efficient/uniform representation is to store angle measures as fixed-point numbers in [–1, 1) representing [-π, π) radians, using two’s complement for negative numbers (so you can also think of this as [0, 2) with unsigned fixed-point numbers). This representation is called “binary angles”. You can think of it as an extended version of magnetic compass directions, or the corners of a regular polygon with 2^n sides. https://en.wikipedia.org/wiki/Binary_scaling#Binary_angles

Using IEEE floating point to represent angle measures is horribly wasteful of bits, regardless of how you do it.

As an alternative, consider not storing angle measure at all, but instead using a stereographic projection (“half-angle tangent”) of the circle onto the whole line, and then using floating point numbers in the range (–∞, ∞] for your representation. https://en.wikipedia.org/wiki/Stereographic_projection

What you really want as a working format for rotations, ray directions, or points on a circle is to use a 2-dimensional square-grid (Cartesian coordinate) format like unit-magnitude “complex numbers”. The angle measure or stereographic projection is primarily useful as a compression format when you need to save bits transferring or storing data. Angle measures work okay for this, but can be annoying for requiring transcendental function evaluations to convert them to/from Cartesian coordinates, whereas to take the stereographic projection or its inverse only requires a single multiplicative inverse (i.e. division operation) plus some addition and multiplication per point.
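
A tiny sketch of the binary-angle idea in C (16 bits per turn here; the cast to int16_t assumes the usual two's complement wraparound):

  #include <stdint.h>
  #include <stdio.h>

  /* 16-bit binary angle: the full circle is 2^16 units, so 0x4000 is 90 degrees
     and going past 180 degrees is just ordinary unsigned overflow. */
  static double bang_to_degrees(uint16_t a) {
      return (int16_t)a * (360.0 / 65536.0);   /* maps to [-180, 180) */
  }

  int main(void) {
      uint16_t a = 0x4000, b = 0x6000;       /* 90 and 135 degrees */
      uint16_t sum = (uint16_t)(a + b);      /* 225 degrees, wrapped */
      printf("%g\n", bang_to_degrees(sum));  /* prints -135 */
      return 0;
  }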


I don't see why that would help. Of course there is more precision in terms of decimal places around 1 than around 360. But if you use 360 each 1 represents a much smaller piece of the circle.

To put it another way... it's about significant digits. A float has about 7 and a double about 15. It does not matter whether you use 0 to 360 or 0 to 3.6 million... the number of digits that are meaningful remains the same.


For binning data in spreadsheets for histograms etc, I typically use text() to set significant digits, and then value(). It's also good sometimes to make sure that 0 is really 0.


It's not really relevant there, that's just scaling the error factor. The potential downside to using degrees is that the values in the 0-1 range are going to have slightly more accuracy than those in the 358-359 range. The difference will be very minor if these aren't used in compounding calculations.

It's more of an issue when you're dealing with large sums composed of very tiny ones. If you add them together incorrectly the number becomes so large the tiny values stop mattering even if in aggregate they're important.

Like adding 1e-6 to itself a billion times gives you a result different than 1e-6 * 1e9. In the first case I get 999.999998191639 when the multiplied version is 1000.

These errors can accumulate to a dangerous degree if you don't do your operations in the right order.
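
To make the effect easy to reproduce (a sketch with C doubles; the exact digits of the summed value may vary a bit with compiler settings):

  #include <stdio.h>

  int main(void) {
      double sum = 0.0;
      for (long i = 0; i < 1000000000L; i++)
          sum += 1e-6;                  /* a billion tiny additions; error accumulates */
      printf("%.12f\n", sum);           /* drifts noticeably away from 1000 */
      printf("%.12f\n", 1e-6 * 1e9);    /* the one-step product, essentially 1000 */
      return 0;
  }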


Useful answer: Use a double.

Really, 52 bits of mantissa is more than enough for representing anything. You have to be wary of error building up on calculations, not of original representation errors (unless you are living on the edge; and I'd tell you: don't).

Useless answer: No, the precision is fixed for any magnitude. When you multiply your values, you will also multiply the interval, and there are as many useful numbers inside that new, larger interval as there were inside the old, smaller one.


Even 'better' just store them in fixed point representation.


If all you're dealing with is angles, 64-bit quantization of the 360° space will give you far more accuracy than you'll ever need, and it will be absolutely uniform throughout.


True enough... my mind is in the embedded space, where a good-enough representation in, say, 16 bits is desirable.


That's equivalent to fixed point, right?


Your worst-case relative precision is going to be the same no matter what your end point is. You get a little more best-case precision, but most of that is crammed into a tiny fraction of the circle so it doesn't really help you. You get an extra bit of precision by using negative numbers too, but -180 to 180 would also do that.


Depends on how big the angles you want to store are. Maybe you're using a variable to measure how much a wheel has rotated in total while driving (i.e. millions of degrees)? Or the angular diameter of stars in the night sky?


What's hilarious is that this is what is meant by calling them "floating point". The counterpoint "fixed point", in comparison, does not necessarily imply integers.


Ummm... maybe I am mistaken, but as the fp numbers are not uniform in [0,1], chances are I don't want to pick one at random. Rather, I want to pick an fp representation of a random real between 0 and 1. In that case the reader's suggested strategy seems fine.

Am I wrong?


The author says in a comment

> If hitting exactly zero is much less probable than hitting exactly 0.5, then I would argue that you do not have a uniform distribution…

But due to the non-uniformity of floating point numbers, I think the author is wrong here.

We can think of each floating point number as an interval. The probability that it is picked should ideally be equal to the size of the interval.


Yes. The effect of having a coarser partition around a mesh point is countered by the effect of a higher frequency of hitting that partition. This actually is the essence of importance sampling. For the purposes of numerical integration, as long as the integrand is relatively smooth with respect to the partition size, there shouldn't be any serious problem. On the contrary, if the integrand does vary wildly between two adjacent floating point numbers, then the real issue is that the precision (single, double, quadruple, etc.) used is not fine enough, rather than the deviation from a uniform distribution.


Exactly.


I had a similar question a while ago: how many decimal digits does it take to represent the smallest possible 80 bit float. The answer is ~16000:

  #include <stdio.h>

  int main(int argc, char **argv)
  {
        /* x86 extended-precision layout (little-endian): the 64-bit mantissa
           comes first in memory, then the 15-bit exponent, then the sign bit. */
        typedef struct _bitfield80 {
                unsigned long long int mant:64;
                unsigned int exp:15;
                unsigned int sign:1;
        } bitfield80;

        union {
                long double a;
                bitfield80 b;
        } union80;

        /* smallest positive (denormal) 80-bit float: exponent field 0, mantissa 1 */
        union80.b.sign = 0;
        union80.b.mant = 1;
        union80.b.exp = 0;

        printf ("%0.17000Lf\n", union80.a);
        return 0;
  }


Depends what you mean by “represent”.

The smallest positive “denormal” 80-bit float is (I think) 2^(2 − 2^(14) − 63). I just represented it using 7 decimal digits and two minus signs. :-)


The Berry paradox takes this a bit further.

It may appear that "the least integer not nameable in fewer than nineteen syllables" is 111777. But "the least integer not nameable in fewer than nineteen syllables" is itself a name consisting of only eighteen syllables, and therefore the least integer not nameable in fewer than nineteen syllables can be named using only eighteen syllables (!).

There are alternative expressions, for example wikipedia mentions "the smallest positive integer not definable in fewer than twelve words" and "the smallest positive integer not definable in under sixty letters".


for many applications, consider randomizing in [1,2) and then subtracting one. Although many of the fp values in [0,1) will be inaccessible, you are guaranteed uniformity, and also the lattice that you're drawing from will be fixed.


I'm interested in this as I maintain a little floating point RNG project, but I suspect you are somehow misled on this.

Simple RNGs can discard significant bits during math operations to help get random-like data, e.g. the "middle-square method", or the lower bits of Math.sin(seed++) can be used as practically random. The little rng I developed and tested discards 0 to 3 MSBs erratically in its state as part of its generation scheme. But getting a library random function to return [1,2) and then subtracting 1 can barely affect overall uniformity of distribution. At best it's just discarding 1 bit of potential entropy for a more regular and, I suspect, louder noise floor.


How do randomize in [1,2)? You might as well just generate numbers from [0, 2^53) or [0, 2^24) and divide.

If you didn't do that on the way to generating in [1,2), you're going to have fine-grained non-uniformity anyway.


>How do randomize in [1,2)?

0x3f800000 + 23 random bits
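
In C that might look something like this (a sketch, assuming IEEE 754 binary32; the xorshift32 generator is just a stand-in for whatever unbiased integer RNG you actually use):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* stand-in 32-bit generator (Marsaglia xorshift32); swap in your own source of bits */
  static uint32_t rng_state = 2463534242u;
  static uint32_t rand32(void) {
      rng_state ^= rng_state << 13;
      rng_state ^= rng_state >> 17;
      rng_state ^= rng_state << 5;
      return rng_state;
  }

  /* 0x3f800000 is the bit pattern of 1.0f; OR in 23 random mantissa bits to get a
     float uniform on [1, 2), then subtract 1 to land on a fixed lattice in [0, 1). */
  static float uniform01(void) {
      uint32_t bits = 0x3f800000u | (rand32() >> 9);
      float f;
      memcpy(&f, &bits, sizeof f);
      return f - 1.0f;
  }

  int main(void) {
      for (int i = 0; i < 5; i++)
          printf("%f\n", uniform01());
      return 0;
  }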


Well. That'll do it. Probably no slower than the integer to float conversion too. It gets you multiples of 2^-52 instead of 2^-53 though.


I see you're talking 64-bit floats there, which is my habit too. Having only a 24-bit mantissa, float32s seem insufficient for producing practically uniform variates - the missing 2^24th can be spotted averaging just a billion or so of them. I don't know if it has been improved recently, but last year Chrome's Math.random was only putting 32 bits into its float64 - the missing four-billionth can be suggested from the average of a few hundred billion variates.


Yeah, I really think a better way is to imagine generating a uniform real and rounding -- generating [0,1) instead of [0,1] is just sick. If you do rand() * k + n you can get [n, n+k] anyway, so it's not a good primitive to get a half-open interval with. You can generate an integer n in [1, 2^25], then take (n >> 1) * (1 / (float)(1 << 24)) to get a pretty good [0,1].


For a 64-bit floating point number, the difference between [0,1) and [0,1] is for all practical purposes nonexistent. The chances of getting 1.0 exactly are so low that you can't build a test to know if it's excluded or not.

For a 32-bit floating point number it's more of a problem, but your application would have to be extremely demanding to detect the difference.


A rough calculation: a float64 rng which might reasonably supply billions of rnds an hour for simulations would have a fair risk of hitting '1.0' every million hours of use, which means it's important for production quality code to not run that risk.

A prng called Alea works with floating point arithmetic to generate rnds securely confined to [0,1] using a 'subtract with carry' scheme. It involves this kind of arrangement:

  state=state*something+carry
  carry=floor(state)
  return state-carry //this is 0<1
It's very similar to a linear congruential generator where the top bits are masked away; in this case they are 'floored' off and fed back into the state. Alea seems to generate high quality rands and works extremely quickly in javascript engines by avoiding integer math.

My rnd utility module uses a similar algorithm. There may be a note of caution whether floating point math is standardised enough to use for a seedable prng, but so far my machines and phone have agreed in tests.

http://strainer.github.io/Fdrandom.js/


My argument was for the other direction, using [0,1) when you actually want [0,1]. I agree that if 1.0 would cause a problem in your code you should use a rng that avoids it. All of the floating point random number generators I'm familiar with deliver [0,1).


Interesting, I assumed Java and JavaScript's behaviour was common. I don't recall seeing this notation [0,1) before either, maybe I will now %)


I usually use C++ or Python, so I had to look up the others.

From Java: "Returns a double value with a positive sign, greater than or equal to 0.0 and less than 1.0."

From Javascript: "Return a random number between 0 (inclusive) and 1 (exclusive)"

So yes, all 4 are consistent in not including 1.0 in the range of random real numbers. I'd call that common.

For more on the notation see https://en.wikipedia.org/wiki/Interval_(mathematics)#Includi...


Apologies I wasn't following properly, and thanks for the interval math pointer.


I ended up just doing a one-op conversion: multiply by a manually eked-out factor which is the maximum value the maximum source value can be multiplied by and not make 1. That just minimises the gap at the end of the unit interval which is caused by the limited resolution. I judged it's not possible to get too close to 1, so I just found the biggest float factor which doesn't get to 1. If that last 0.9999... value has a larger chance of occurring than any of the lesser ones, that seems good to me for numerical purposes, as it hides the shortfall of not being able to reach 1. There will be irregularities in the meaning of the least significant bits anyway due to rounding 'noise'.


f() = (rand() + 1.0) - 1.0

and pray that your compiler isn't clever! Or you can actually randomize the fraction bits.


Generate integral values, then move them into the [0,1) interval.


I don't like the divisions, though, and all the analysis that comes with it.

A way that is intuitively more robust would be to generate 23 or 52 random bits from an integer RNG for the mantissa (given you trust your integer RNG is unbiased). For example, take a random uint64_t, clear the sign bit, rewrite the exponent with a predefined value, and leave the mantissa untouched. That will give you a random floating point number in the range 0.5-1.0, 1.0-2.0, or 2.0-4.0 etc., depending on the exponent you choose.

You can trust that the value is entirely random to the precision offered by the mantissa's size. Then you can control the accumulation of error by what you choose to do with the random number.

You still can't get to numbers that aren't representable by floating point, of course, but everything will start from the most random and unbiased value you can think of. You might have to introduce some error by subtracting the lower bound and multiplying appropriately to get to your desired range. Or, if you have the luxury to choose your desired range appropriately, you might be able to use the random value directly.


Hitting each representable floating point number with an equal probability does not generate an approximately uniform distribution of [0,1). For a sufficiently small positive number, eps, you have to have equal probabilities of numbers falling in [0,eps) and [0.5,0.5+eps) for a uniform distribution, and using random bits would fail this requirement.


Article says 1,056,964,609 / 1,056,964,610

Quick Java program says 1,065,353,216 / 1,065,353,217

Which one is right? Or are Java floats just "different"?

  static {
    float a = 0;
    long count = 0;
    while (a < 1) {
      count++;
      a = Math.nextAfter(a, 2);
    }
    System.out.println("count = " + count);
  }


nextAfter is probably also including the denormals (an additional 2^23 values near 0).


Yep:

  1,065,353,216 - 1,056,964,609 = 8388607 = 2^23 - 1
because the denormal that is all-zero is already present.


And the author explicitly stated he was excluding the denormals.


Offtopic, but:

Is this not just shorthand, but an actual legitimate Java program these days?


I was curious as well, so I took it for a spin[0]. In short, no. A static initializer still needs to be wrapped in a class that contains a main method.

    public class Main {
    
      static {
        System.out.println("Hello world");
      }
    
      public static void main(String[] args) {}
    }
[0]: https://repl.it/GDZG/1


A bit of both. Code still belongs in a class, not pictured here. `static` blocks are executed when the class is loaded and are usually used to fill `static final` members. In a toy example like this it's shorter than writing the `main()` incantation.


It might depend on the precise representation. This has become increasingly standardized lately which is why those numbers are really close and not wildly off.


Slightly OT but regarding floats - I was wondering if we could get by just with an exponent and ditch the mantissa. Each number would be represented as an integer exponent of 2^-n for n corresponding to the precision.


The more I read about the gnarly details of floating point specs and implementations, the more they look like a messy hack. I think it'd be nice to just use rational numbers in my programs.


What about NaN? The test '!(x < 0 || x > 1)' will say yes, while 'x >= 0 && x <= 1' will say no.

Or -0 for that matter.


Common sense approach: like the name says, NaN is not a number. The question was about floating point numbers, not floating point values. So I suppose we shouldn't count NaN. Further, -0 (an arbitrarily small number less than 0) doesn't logically fall in the range [0, 1], so it is excluded.


There's no need to exclude -0.0, since mathematically and by all tests available to the floating point unit it's equal to +0.0.


I can't believe I'm the only person to ask "what bit depth?"

There is a pretty big assumption baked into the title.

Edit: hi, I see that I made a mistake, can someone correct my misunderstanding?


IEEE 754 floats are 32-bit. Doubles are 64-bit.


What is the practical application of this?


Currently working on monetization strategy.

Doing A/B tests with ads.

Looking for first round investors.


): I will buy your IPO.


Asking how many "are" in the interval implies that these numbers exist in some sense in that interval.

A clearer way to phrase the question might be: "How many floating-point numbers between 0 and 1 can be generated without duplicates?"


(I invoke my pedant-pass.) As formulated in the title, "How many floating-point numbers are in the interval [0,1]?" you could argue that this is the cardinality of the Real Numbers. What the article says it is really talking about are single-precision IEEE 754 floating-point numbers. However, I could define any number of my own floating-point representations at various sizes. The cardinality of all possible floating point representations would be the same as that of the real numbers. This is true even for irrational numbers, as one could formulate floating point representations that specially represent those numbers.

(Also, this is arguably an exception to Betteridge's Law, and arguably not an exception to Betteridge's Law.)


What makes you think any arbitrary real number qualifies as a “floating point number”?

I have never seen the term used in anything like that context.

To me, the term “floating point number” is about number representation, not number identity. It is an inherently finite quantity in every instance I’ve ever seen.

Examples of floating point numbers: The sexagesimal cuneiform 42 25 35 written on the Babylonian tablet YBC 7289; 3.1415 × 10⁰; 1.010101010101₂ >> 2.

Not examples of floating point numbers (in my opinion): 1/√2; π, 1/3.


The GP is claiming that all possible floating point systems, taken together, would contain the same number of values as there are points on the real line.

Consider a floating point system based on https://en.wikipedia.org/wiki/Golden_ratio_base


Thanks! To be just a bit more precise, I am claiming that all possible floating point systems, taken together, would contain the same number of values in [0,1] as there are points on the real line.


Hmmm, I missed that wrinkle. My grasp of this slightly unusual topic is not 100% reliable, and so I'm now wondering whether the cardinality of the set you describe is different from the cardinality of the set of all possible floating point numbers.


Nobody uses "floating point number" to mean "a number that could theoretically exist in a floating point system I will invent for you after you tell me the number". That's not being pedantic, that's insisting on a wrong definition.


Nobody uses "floating point number" to mean "a number that could theoretically exist in a floating point system I will invent

Put a full stop there -- Oh yes, people do mean that! Most of the time, people are referring to a bit representation in a specific system like IEEE's.

> a floating point system I will invent for you after you tell me the number"

The above part of the sentence isn't a definition of "floating point number." It's how I use the common notion of "floating point number" in my argument. Your objection only looks like it works because you conflate the two concepts. If you don't conflate the two concepts, then you are arguing that most low level programming that deals with 32 bit floats doesn't actually deal with "floating point numbers." That is a reasonable assertion for a reasonable set of definitions. However, it's not the one I'm using, which also fits the reality of the mental models most people are actually using.

In terms of program correctness, you can find many examples where using an abstract concept of "floating point" instead of the concrete representation will produce errors. So then why is it "the correct one" as you say?


I can't parse what you're saying. What two concepts are you accusing me of conflating?

There are many different floating point systems, but they all share certain attributes. None of them can exactly represent sqrt([insert large prime]). If you want to invent a bunch of such systems post-facto, you are abusing the term. The union of all systems that can reasonably be called "floating point" is not the set of real numbers. It's countable, as a very loose upper bound.


You can make an exception for a particular bit pattern for an irrational, like Pi, just like you can use a particular bit pattern to represent -0 and NaN. Since the title doesn't specify a floating point representation, I thereby invoke all possible floating point representations. I construct a set of such representations with the cardinality of Reals like so: For any given real number X, just formulate a representation like single-precision IEEE 754 floating-point but which substitutes X for NaN.

The article doesn't fall into this trap, but the title does.

Now, if you think your objection sinks my definition of "floating point number," note that it also sinks IEEE 754 floating-point.

EDIT: So you formulate a particular definition of "floating point number" for which you are right. To which I say: So What?


This is not true, and I can think of at least two arguments.

First, the space of all computable reals is countable[1] (i.e. numbers which can be approximated by a sequence of rational numbers produced by a computable function). Note that this subsumes the irrational numbers as well.

Second, all floating point representations would individually have at most the cardinality of the natural numbers, even if the representation was arbitrarily sized (if they were finitely sized, they would of course each have finite cardinality). So you would need the space of possible representations to be uncountable. You would have to come up with a mighty strange notion of "floating point representation" that allowed you to have more distinct representations than computable functions, which again have the cardinality of the natural numbers. Even ignoring the fact that there will be overlap between the representations, we have an upper bound for the total cardinality as that of the set product of the natural numbers with itself, which is again the cardinality of the natural numbers.

[1]: https://en.wikipedia.org/wiki/Computable_number


> Second, all floating point representations would individually have at most the cardinality of the natural numbers

All of that is irrelevant. I'm using the cardinality of all possible floating point representations. Give me a real in [0,1] that you say doesn't have a representation, and I will give you a floating point representation that will include it. (Using an operation that's just cut and paste on the IEEE one.)


Chaitin's constant for Turing Machines. Also note that I mentioned the cardinality of all representations in the next sentence.


Sorry, you missed it. You could formulate the proof with only 32 bit FP numbers. I've created a function that maps any real number, computable or not, to a notional FP representation. The function doesn't actually have to be computable to show a bijection. I just have to show there is a way to do the mapping.

> You would have to come up with a mighty strange notion of "floating point representation" that allowed you to have more distinct representations than computable functions

That is what I think I did. Let's say that you have a real number X in [0,1] that you say isn't in the set of possible "floating point representations." I could just posit a reformulation of the IEEE 32 bit FP where NaN is used to represent that number instead. There you have your mapping. Now reset the universe to the state it was 5 minutes ago. In that alternate reality, you say you have a different real number. We also call that real number X and in that other reality I posit a reformulation of the IEEE 32 bit where NaN is used to represent that number X instead.

Just because such a set isn't computable, that doesn't mean it isn't conceivable in a way that can show a bijection.


Even when you move the goalposts (a floating point representation where you can't perform any operations with the numbers you "represent"?) there is still at least one big hole in your trick.

The set of describable real numbers in any particular theory (e.g ZFC) is also countable (because there are a countable number of formulas of any kind), and the set of all formal theories (even including the ones which cannot model the real numbers) is still countable. That's already giving you a lot of leeway with the term "real number". So now you'd have to call your floating point representation a representation of a real number when you can't even specify the X, no matter how abstract or lengthy the description.

In the future if you're going to use your "pedant-card" like that, do it someplace that isn't HN where you'll run headlong into a lot of people who actually know a few basic things about cardinality.


>Give me a real in [0,1] that you say doesn't have a representation, and I will give you a floating point representation that will include it.

pi


Oh.

pi - 3


Yes, for every number, one can come up with a floating point format that can represent it exactly (proof that stretches the meaning of 'float format' a bit: given x, the one-bit float format with bit value 1 meaning 'x', and bit value 0 meaning 'not x', is a representation that represents x exactly), but the cardinality of the set of arbitrary-length bit sequences is smaller than that of the reals in [0,1], so there isn't a single floating point representation that can represent all of the numbers in [0,1].

So, a choice has to be made. IEEE 754 made some choices, balancing utility with ease of implementation. Writing numbers as s × c × 2^q for integer c and q makes addition, multiplication and comparisons fairly easy. Fixing the bit lengths of c and q makes it even easier because you don't need hardware to find the parts, at the price of that zig-zag pattern of accuracy. I wasn't around when this was discussed, but I expect that played an important role in choosing this format.

For example, here is a conceptually simpler format that, if it was considered, I think would have been ruled out because it is hard to implement:

Pick a number f close to but not equal to 1 and a bit size, and have a signed integer s represent sign × f^abs(s). That doesn't have the zig-zag problem, and multiplication would be very simple (essentially just adding the exponents), but addition would be difficult, almost certainly requiring large lookup tables.

That format also cannot represent zero, but that's easily corrected by picking a special bit pattern for it.


> so there isn't a single floating point representation that can represent all of the numbers in [0,1]

I agree! What I've done is to propose (an admittedly non-computable) set of floating point representations for which you can show a bijection to the real numbers. So if you accept my stretched meaning, which you do so above, then we can answer the title's question with "the cardinality of the real numbers."


Assuming "floating point" refers to things fitting the IEEE754 spec at some precision, I'm pretty sure there are still only countably many of them.

After all, for any specific size of mantissa and exponent, there will be a finite number of floats of that size, and there are a countable number of options for mantissa and exponent sizes (corresponding to N²), thus the number of floating point values is countable.

Alternatively, each of them corresponds to an arbitrary length bitstring, in addition to a pair of numbers defining the mantissa and exponent, which would put them in bijection with N³ and thus also be countable.

EDIT: to add to that, I believe that one cannot in any meaningful way encode uncomputable values (in the sense that even if one introduces distinguished bitstrings intended to "encode" a specific¹ uncomputable value, one can't do anything other than treat it as a distinguished value, and especially one can't perform arithmetic on it or print its digits or similar), so you'll still be limited to a countable set of floating point numbers even with other tricks.

¹ If that term even has meaning when dealing with uncomputable numbers...


Assuming "floating point" refers to things fitting the IEEE754 spec at some precision

Not my assumption!





