How Spread Out are Floating Point Numbers?

Since there are an infinite number of numbers and, as Georg Cantor proved, an infinite number of numbers in between any two numbers, a finite state machine such as a computer must settle for an approximation of the set of all numbers. In the case of integers, that's relatively easy to resolve - given n bits, 2^n integers can be represented. If you only include the non-negative numbers starting from 0, you can count up to 2^n - 1; if you include negative numbers, you can count to 2^n/2 = 2^(n-1) (give or take one, depending on whether you want one extra negative number or one extra positive number). When it comes to representing rational numbers, things are a bit trickier. You have to give up precision in between integers; that is, you can represent integers within a given range exactly, but you can only specify certain values between two given integers.

The standard floating-point representation of numbers in computers is IEEE 754: you probably know, for example, that 0.1 + 0.2 != 0.3 in IEEE 754, since none of 0.1, 0.2 or 0.3 can be represented exactly and the rounding errors don't happen to cancel out.
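
You can see this for yourself with a couple of lines of C (doubles here, since that's what C floating-point literals are, but the same thing happens with 32-bit floats):

#include <stdio.h>

int main()  {
  double sum = 0.1 + 0.2;
  /* neither 0.1, 0.2 nor 0.3 is exactly representable, and the
     rounding errors don't cancel, so the comparison fails */
  printf("%d\n", sum == 0.3);   /* prints 0 */
  printf("%.17f\n", sum);       /* prints 0.30000000000000004 */
  return 0;
}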

To be fair to binary floating-point numbers, there are numbers that can't be represented accurately in decimal form, either: 1/3, for example, is actually 0.33333..., so 1/3 + 1/3 + 1/3 actually ends up being 0.99999... when converted to decimal form as well.

So, given a fixed number of bits — say, 32 — you can represent 2^32 different floating point numbers; it's up to you to decide how to apportion them out. One simple representation would be fixed-point; you could decide that you want to represent up to 5 decimal places, storing each value multiplied by 100,000, which lets you count from 0 up to 2^32/10^5, or about 42,949. This is actually a good choice for things like currency computations that are inherently fixed point. For scientific calculations, though, this is sort of a waste of the available bit-space: you can't represent either very large or very small numbers. Floating point offers a trade-off — you can represent very small (but not arbitrarily small) and very large (but, again, not arbitrarily large) numbers, but as the numbers get bigger, the gaps between the available values get bigger, too.
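
As a sketch of that fixed-point idea (the scale factor and names here are just made up for illustration, not any particular standard), you could store every value multiplied by 100,000 and only convert when displaying:

#include <stdio.h>

/* fixed point with five decimal places: every value is stored
   as an integer count of hundred-thousandths */
#define SCALE 100000

int main()  {
  unsigned int price = 1999950;   /* represents 19.99950 */
  unsigned int tax   =  165000;   /* represents  1.65000 */
  unsigned int total = price + tax;
  printf("total = %u.%05u\n", total / SCALE, total % SCALE);  /* prints total = 21.64950 */
  return 0;
}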

Of course, as humans, most of us think in decimal, but computers represent numbers in binary. The '.' in the decimal number '10.3' is called a decimal point, but that phrase doesn't make sense in binary. You could call it the binary point (and some people do), but I'll just abbreviate it to point. You probably know how to convert decimal integers to binary: start with the largest power of two that's less than or equal to the target number and record a '1'. Subtract that power of two from the target number, then keep halving the power of two: each time it's still greater than what's left of the target, record a '0'; each time it isn't, record a '1' and subtract it. Repeat until the power of two reaches 1 (a sketch of this in C follows the table). So, to convert 5,632 to binary:

Remaining    Power of two    Digit
5,632        4,096           1
1,536        2,048           0
1,536        1,024           1
512          512             1
0            256             0
0            128             0
0            64              0
0            32              0
0            16              0
0            8               0
0            4               0
0            2               0
0            1               0
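
Here's a quick sketch of that procedure in C (the function name and details are mine; it's just the walk through the table above, mechanized):

#include <stdio.h>

/* print the binary digits of a positive integer by walking down from the
   largest power of two that fits, just as in the table above */
void print_binary(unsigned int target)  {
  unsigned int power = 1;
  while (power <= target / 2)  {
    power *= 2;   /* find the largest power of two that's <= target */
  }
  while (power > 0)  {
    if (target >= power)  {
      putchar('1');
      target -= power;
    }  else  {
      putchar('0');
    }
    power /= 2;
  }
  putchar('\n');
}

int main()  {
  print_binary(5632);   /* prints 1011000000000 */
  return 0;
}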

Converting the fractional part of a decimal number to binary is essentially the same; if the fraction is greater than or equal to 1/2, record a '1' and subtract 1/2 from it. If the remainder is greater than or equal to 1/4, record a '1'; if less, record a '0'. Keep halving the comparison value and repeat (there's a sketch of this in C after the next two tables). So, to convert 0.875 to binary, you'd compute:

Remaining    Power of two    Digit
0.875        0.5             1
0.375        0.25            1
0.125        0.125           1

To convert 0.15625, you'd compute:

Remaining    Power of two    Digit
0.15625      0.5             0
0.15625      0.25            0
0.15625      0.125           1
0.03125      0.0625          0
0.03125      0.03125         1
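
Here's the fractional half of the procedure as a C sketch (again, the names are mine; note the small irony that the double passed in is itself already an approximation of the decimal value):

#include <stdio.h>

/* print the first 'digits' binary places of a fraction 0 <= frac < 1 by
   repeatedly halving the comparison value, just as in the tables above */
void print_fraction_binary(double frac, int digits)  {
  double power = 0.5;
  int i;
  printf("0.");
  for (i = 0; i < digits && frac > 0.0; i++)  {
    if (frac >= power)  {
      putchar('1');
      frac -= power;
    }  else  {
      putchar('0');
    }
    power /= 2.0;
  }
  putchar('\n');
}

int main()  {
  print_fraction_binary(0.875, 24);    /* prints 0.111 */
  print_fraction_binary(0.15625, 24);  /* prints 0.00101 */
  print_fraction_binary(0.1, 24);      /* prints 0.000110011001100110011001 */
  return 0;
}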

Now, in both the above examples I got "lucky" and the remainder reduced to zero. However, consider the decimal number 0.1:

Remaining      Power of two    Digit
0.1            0.5             0
0.1            0.25            0
0.1            0.125           0
0.1            0.0625          1
0.0375         0.03125         1
0.00625        0.015625        0
0.00625        0.0078125       0
0.00625        0.00390625      1
0.00234375     0.001953125     1
...

This doesn't seem to be converging and, in fact, never will; just as 1/7 is 0.142857142857142857... in decimal, 1/10 is 0.0001100110011001100... in binary (the only fractions you can represent exactly with a finite number of digits in base n are those whose denominators, in lowest terms, have no prime factors other than the prime factors of n). Above, I computed 0.1 out to 9 binary positions (not "decimal" positions!), but if you convert that back to decimal, you actually get 0.099609375: a terrible approximation of 0.1! Consider the error that would be introduced when compounding, say, 10% interest per year on a $1000 investment using that 9-digit approximation of 0.1. After one year, you should have $1100, but you'd actually have $1099.61. After two, you'd have about $1209.14 - almost a dollar short. After ten years, you'd have lost over $9! You need to compute all the way out to 16 places before the error drops below 0.00001.
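
If you'd like to watch that error compound, a few lines of C will do it (the 1.099609375 constant is just 1 plus the 9-binary-digit approximation of 0.1 worked out above):

#include <stdio.h>

int main()  {
  double exact = 1000.0, approx = 1000.0;
  int year;
  for (year = 1; year <= 10; year++)  {
    exact *= 1.1;             /* the interest rate you meant */
    approx *= 1.099609375;    /* 1 plus the 9-binary-digit approximation of 0.1 */
    printf("year %2d: %8.2f vs %8.2f (short $%.2f)\n",
        year, exact, approx, exact - approx);
  }
  return 0;  /* by year 10, the approximation is more than $9 short */
}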

So, to be useful for computation, it's clear that floating point numbers have to keep a lot of binary places. If you have 32 bits to work with total and you dedicate 16 of them to the fractional portion in a fixed-point representation, your available counting range is a paltry 32,768 if you include negative numbers. What IEEE 754 does, instead, is it considers everything a fractional number and keeps track of what power of two that number needs to be multiplied by to "recover" the original number. You've probably seen this in decimal form as "scientific notation": 1,563.95 is written as 1.56395 * 10^3. This notation has the added benefit that multiplication and division can be performed "in place" by adding or subtracting the exponents: to multiply 1.56395 * 10^3 (1,563.95) by 7.62 * 10^1 (76.2), you can multiply 1.56395 by 7.62 to get 11.917299 and then add 3 + 1 to get 11.917299 * 10^4, which does work out to the correct value of 119,172.99 (although you should then re-normalize that to 1.1917299 * 10^5). This makes for more efficient hardware; you can use integer multiplication hardware to perform this operation as long as you follow it with a hardware "normalization" step.
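
This is exactly the split that the standard C functions frexp and ldexp (from math.h) expose: frexp pulls a number apart into a fraction and a power-of-two exponent, and ldexp puts one back together. So you can mimic the multiply-and-add-exponents trick directly (in binary rather than decimal, but the idea is the same; the values are the ones from the example above):

#include <stdio.h>
#include <math.h>

int main()  {
  int ea, eb;
  double fa = frexp(1563.95, &ea);  /* fa is about 0.7636, ea = 11: 1563.95 = fa * 2^11 */
  double fb = frexp(76.2, &eb);     /* fb is about 0.5953, eb = 7 */
  /* multiply the fractions and add the exponents; ldexp glues the result
     back together (the hardware takes care of re-normalizing) */
  printf("%f\n", ldexp(fa * fb, ea + eb));  /* prints 119172.990000 */
  return 0;
}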

IEEE 754, then (in 32 bit), dedicates 23 bits to the fraction (plus one implicit bit, as you'll see shortly), 8 bits to the exponent, and one bit for the sign (there's no equivalent of integer two's-complement representation in floating point; negative numbers have their sign bits set to 1, positive numbers to 0). What this means is that there are 2^23 = 8,388,608 fractions available in between each power of two — including negative powers. IEEE 754 devotes as much of its range to very small numbers as it does to very large ones, so the 8-bit exponent has to represent both negative and positive powers of two; it's stored with a bias, as described below.

You can peek in on the bit representation of floating point numbers with C's obscure (but standard) union construct. A union is like a struct, but instead of the members occupying contiguous space, they each occupy the same space. So declaring:

typedef union u_fint  {
  float f;  /* write a float here... */
  int i;    /* ...and read the same 32 bits back as an integer */
} fint;

Allows you to set a floating point value in f and see the individual bits set via the i member (you can't apply bit-wise operators like left/right shift or and/or to floating point values directly). The IEEE 754 standard for 32-bit floating point numbers has a 1-bit sign, an 8-bit exponent and a 23-bit "mantissa" (the actual digits). Among the simpler numbers to represent is decimal 0.5 which just becomes binary 0.1₂ (1 * 2^-1). If you look at the result of:

fint u;

u.f = 0.5F;
printf("sign = %d\n", (u.i & (1 << 31)) >> 31);         /* bit 31 */
printf("exponent = %d\n", (u.i & (0xFF << 23)) >> 23);  /* bits 23-30 */
printf("mantissa = 0x%x\n", (u.i & 0x007FFFFF));        /* bits 0-22 */

The output is maybe a little bit surprising:

sign = 0
exponent = 126
mantissa = 0x0

The sign bit is 0 indicating a positive number, as expected. However, the mantissa is 0 and the exponent is 126, suggesting that 0.5 is represented as 0 * 2^126... which is obviously not correct. Recall that floating point numbers are always normalized so that the first digit is a 1. Since the first digit is always 1, there's no point in actually storing it, so the standard doesn't. This mantissa should therefore be interpreted as 1.0. But what about the exponent? 1.0 * 2^126 is actually somewhere in the neighborhood of 10^38.

IEEE 754 specifies that exponents aren't stored in two's-complement form like integers are; instead, they're stored with a bias, with zero in the middle of the bit range: the stored value 01111111₂ = 127 represents an exponent of 0, and the smallest stored value used for normalized numbers, 00000001₂, represents -126. So, to find the actual exponent, you subtract 127 from the stored value. That means that 0.5 is represented correctly as 1.0 * 2^(126 - 127) = 1.0 * 2^-1 = 0.5.

Why are exponents represented this way instead of the more natural two's-complement way? By making the smallest exponent 0 and the largest 0xFF, sorting floating points as if they were integers puts the positive floating point numbers in the correct order. Negative numbers, unfortunately, are backwards, but they're exactly backwards, so you can just sort them descending if you need to.
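
You can check the positive-number ordering for yourself with the fint union defined above (the two values here are just arbitrary positives picked for illustration):

fint a, b;

a.f = 1.5F;
b.f = 2.25F;
/* for positive floats, the larger value always has the larger integer bit pattern */
printf("%d\n", a.i < b.i);   /* prints 1 */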

0.1 is, as expected, more complicated:

sign = 0
exponent = 123
mantissa = 0x4ccccd

This works out to 1.10011001100110011001101₂ * 2^-4 (remember that the leading 1 is implicit) or 0.000110011001100110011001101₂, which converts back to decimal 0.10000000149011612... pretty close to 0.1, but never quite exact.
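
To convince yourself that these three fields really do reassemble into the stored value, you can put them back together by hand. This is just a sketch, reusing the u variable from above; ldexpf (from math.h) multiplies a float by a power of two:

u.f = 0.1F;
int mantissa = (u.i & 0x007FFFFF) | 0x00800000;     /* restore the implicit leading 1 */
int exponent = ((u.i & (0xFF << 23)) >> 23) - 127;  /* remove the bias */
/* the 24-bit mantissa is a whole number here, so shift it down 23 more places */
printf("%.20f\n", ldexpf((float) mantissa, exponent - 23));  /* prints 0.10000000149011611938 */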

But what about the floating-point representation of the number zero? There's no number n such that 1.0 * 2^n = 0. If you look at the internal representation of 0.0F, you'll see it has a sign, exponent, and mantissa of 0, 0, and 0. By the rules stated above, that would actually work out to 1.0 * 2^-127: a very small number, but not actually 0. Instead, 0 is a special value to IEEE 754: all 0's means actual 0, ignore all the rest of the rules.

So... how "spread out" are these numbers? It depends on the exponent. The smaller the exponent, the closer the "next" mantissa is to the prior one. So 1.0F is 0x3f800000; 0x3f800001 — 1 bit greater — is 1.00000011920928955078125. Each extra bit increases the value by 0.00000011920928955078125 until the mantissa of 0x7fffff (23 one bits) is reached; after that point, the exponent increases by 1 and single-bit increments increase the value by .0000002384185791015625: exactly 0.00000011920928955078125 * 2. Each incremental exponent doubles the gap between each subsequent mantissa. As it turns out, the difference between 0x3f800000 and 0x3f800001 is exactly 1 * 2-23, since there are 23 bits available for the mantissa. In general, given an exponent n, the difference between one floating-point value and the next value (viewed as an integer) is 1 * 2-24 + n. This means that, after the exponent is greater than 24, the gaps between the representable floating point values are no longer fractional: the smallest such value is 0x4b000000 which is 8,388,608 (223). Values greater than this (whose exponent is 151) can only be represented in floating point as whole numbers, and as the numbers get greater, you can't even represent all available whole numbers. The next exponent is 16,777,216, and after this, not every value is representable - 16,777,217 (which is exactly representable as a 32-bit integer) will be rounded down. Once you reach the top of the floating point range, exponent 127, you start at an astronomical 170141183460469231731687303715884105728, but each single-bit increment adds a whopping 20282409603651670423947251286016 (approximately 1031)!

But wait, you might protest: exponent 127 is actually stored as 0xFE — what about 0xFF? Well, actually the 32-bit floating point value 0x7f800000 is reserved for the value infinity; every bit pattern greater than this (up to 0x7fffffff) is NaN (Not A Number). Of course, once you roll past 0x7fffffff into 0x80000000, the sign bit changes, and you start counting negative numbers, backwards. This means that there's a range of 2^23 values that are "wasted", but considering how large the previous step was, you probably won't miss them (and if you do, you can always use a double-precision 64-bit floating point).

What about numbers between 0 and 1? As you can see, IEEE 754 reserves a lot of them: about as many as it reserves for numbers greater than 1, 2^30 = 1,073,741,824. The largest representable number less than 1.0 would be 0x3f800000 - 1 = 0x3f7fffff or 0.999999940395. So, you might expect the smallest positive non-zero number to be 1.00000000000000000000001₂ * 2^-127, or about 5.88 * 10^-39, since I said there's always an implied leading 1. Well, actually I lied about that implied leading 1 always being present. When the stored exponent is 0, the implied leading 1 is dropped so that even smaller numbers can be represented. This form is called sub-normal (recall that a floating point number is said to be normalized if it has exactly one non-zero digit before the point). Hence the very smallest representable positive non-zero floating point number is 2^-149, or about 1.4013 * 10^-45.

These subnormal numbers are important for precise computation (well, as precise as you can manage considering that you have a finite number of bits, anyway). Consider what would happen if you subtracted two very small, close-together (but still not equal) floating point numbers from each other: say, the values whose bit patterns are 0x00800002 and 0x00800001. That difference works out to 1.4013 * 10^-45: very small, but not zero. However, if the smallest representable number were 5.88 * 10^-39, this would necessarily be rounded down to 0. (This is called underflow). In fact, as you can see, there's a large range of numbers that would be rounded down to 0 this way. By allowing sub-normal floating-point representation, any two normalized numbers can be subtracted with no fear of underflowing to zero. Arithmetic on sub-normal numbers themselves, of course, still carries a precision risk, since they have fewer significant bits to work with.
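
Here's a quick way to see both effects (reusing the fint union from above; nextafterf comes from math.h):

fint a, b, diff;

a.i = 0x00800002;
b.i = 0x00800001;
diff.f = a.f - b.f;
printf("%e\n", diff.f);                  /* prints 1.401298e-45: tiny, but not zero */
printf("%e\n", nextafterf(0.0F, 1.0F));  /* prints 1.401298e-45: the smallest positive subnormal */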

So, as you can see, floating point numbers range from being clustered very tightly together to being incredibly spread out: if you're doing complex scientific computation, pay careful attention to precision!
