Representing Floating-Point Numbers in Computers

Floating-point arithmetic sometimes produces surprising results. For example, in JavaScript, 0.1 + 0.2 evaluates to 0.30000000000000004. This is not a bug; it follows from how floating-point numbers are represented in computers. Storing a floating-point number requires encoding and decoding, and the scheme differs from the one used for integers. IEEE 754 is the most widely used standard, and it encodes and decodes floating-point numbers with a formula.

Sign * Base^Exponent * Fraction

We can apply this formula to write the decimal number -3.14 as:

(-1) * 10^(-2) * 314

Now we can fill these three parts into a 32-bit binary container (the base 10 is fixed by convention, so it does not need to be stored).

[1bit][8bits][23bits]
[-1]  [-2]   [314]
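Python's standard decimal module happens to store numbers in this same sign, digits, exponent shape, so it can serve as a quick check of the decomposition. This is only a sketch of the decimal idea, not of IEEE 754 itself:

```python
from decimal import Decimal

# as_tuple() exposes the three parts: sign (1 means negative), digits, exponent.
parts = Decimal("-3.14").as_tuple()
sign = -1 if parts.sign else 1
digits = int("".join(str(d) for d in parts.digits))

print(sign, parts.exponent, digits)  # -1 -2 314

# Rebuild the value from the three parts: sign * 10^exponent * digits.
rebuilt = Decimal(sign) * Decimal(10) ** parts.exponent * digits
print(rebuilt == Decimal("-3.14"))  # True
```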

IEEE 754 follows the same idea with one difference: it is based on powers of 2, so the fraction part must be converted to binary. The number is normalized so that its significand always has the form 1.xxxx, and this significand is multiplied by a power of 2 to get the value. The formula becomes:

Sign * 2^n * (1 + Fraction)

First, we divide 3.14 by 2 until the result lies between 1 and 2. The number of divisions is the exponent:

3.14 = (3.14 / 2) * 2^1 = 1.57 * 2^1

Next, we can apply the formula, using the sign -1 to encode -3.14:

(-1) * 2^1 * (1 + 0.57)
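The normalization step can be checked with Python's math.frexp. This is a minimal sketch; note that frexp returns a significand in [0.5, 1) rather than [1, 2), so one power of 2 has to be moved across by hand:

```python
import math

# math.frexp returns (m, e) with 0.5 <= m < 1 and x == m * 2**e.
m, e = math.frexp(3.14)

# IEEE 754 wants the significand in [1, 2), so shift one power of 2 across.
significand, exponent = m * 2, e - 1

print(significand, exponent)  # 1.57 1
```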

By convention, neither the base 2 nor the leading integer 1 of the significand needs to be stored. What remains is 0.57, which is still a decimal fraction and must be converted to binary. The calculation is simple: multiply the fraction by 2. If the result is greater than or equal to 1, record a 1 bit and subtract 1; otherwise, record a 0 bit. Then repeat with the remaining fraction. We can convert 0.57 to binary as follows:

0.57 * 2 = 1.14  | 1
0.14 * 2 = 0.28  | 0
0.28 * 2 = 0.56  | 0
0.56 * 2 = 1.12  | 1
0.12 * 2 = 0.24  | 0
0.24 * 2 = 0.48  | 0
0.48 * 2 = 0.96  | 0
0.96 * 2 = 1.92  | 1
.....

For 0.57 this process never terminates, so the stored value is only an approximation. The fraction part holds 23 bits, and rounding to the nearest 23-bit value gives 10010001111010111000011.
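The multiply-by-2 procedure can be sketched in a few lines of Python. The function name fraction_to_bits is made up for this example, and Fraction is used so that 0.57 is handled exactly instead of being a float itself. Plain truncation after 23 bits ends in ...010; the stored pattern ends in ...011 because the value is rounded to the nearest 23-bit fraction:

```python
from fractions import Fraction

def fraction_to_bits(value, count):
    """Return the first `count` binary digits of a fraction in [0, 1)."""
    bits = []
    for _ in range(count):
        value *= 2
        if value >= 1:          # the integer part becomes the next bit
            bits.append("1")
            value -= 1
        else:
            bits.append("0")
    return "".join(bits)

exact = Fraction(57, 100)

first8 = fraction_to_bits(exact, 8)
truncated = fraction_to_bits(exact, 23)
# Round to the nearest multiple of 2^(-23) instead of cutting off.
rounded = format(round(exact * 2**23), "023b")

print(first8)     # 10010001
print(truncated)  # 10010001111010111000010
print(rounded)    # 10010001111010111000011
```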

The next step is to fill in the exponent part. IEEE 754 adds a bias of 127 so that exponents from -126 to 127 can be stored in an 8-bit unsigned field (the fields 00000000 and 11111111 are reserved for special values). To store 2^1, we add the bias to the exponent: 1 + 127 = 128, which is 10000000 in binary.
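As a small sketch of the bias, the hypothetical helper encode_exponent below adds 127 and prints the result as 8 bits. It only covers the normal range and ignores the reserved fields:

```python
BIAS = 127  # bias for the 8-bit exponent of a 32-bit float

def encode_exponent(exponent):
    """Return the 8-bit field that stores a normal exponent (-126..127)."""
    if not -126 <= exponent <= 127:
        raise ValueError("outside the normal exponent range")
    return format(exponent + BIAS, "08b")

print(encode_exponent(1))     # 10000000
print(encode_exponent(-1))    # 01111110
print(encode_exponent(-126))  # 00000001
print(encode_exponent(127))   # 11111110
```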

The third part is the sign bit. As with the sign bit of an integer, 1 means negative and 0 means positive. Finally, we can store -3.14 as a 32-bit float in three parts:

1     10000000   10010001111010111000011
Sign  Exponent   Fraction
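The three parts can be checked against a real encoder. The sketch below uses Python's struct module to pack -3.14 as a 32-bit float and slice the bits; float32_fields is a name invented for this example:

```python
import struct

def float32_fields(value):
    """Split the 32-bit encoding of value into sign, exponent, fraction bits."""
    # Pack as a big-endian float32, then reread the same 4 bytes as an integer.
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(as_int, "032b")
    return bits[0], bits[1:9], bits[9:]

print(float32_fields(-3.14))
# ('1', '10000000', '10010001111010111000011')
```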

Converting IEEE 754 to Decimal

We can apply the same formula, Sign * 2^Exponent * (1 + Fraction), to convert the binary result above back to decimal. The first step is to split it into three parts.

1     10000000   10010001111010111000011
Sign  Exponent   Fraction

The sign and exponent map back directly, but each bit of the fraction must be multiplied by 2^(-n), where n is the position of the bit. So the fraction part converts to:

10010001111010111000011 = 1*2^(-1) + 0*2^(-2) + 0*2^(-3) + 1*2^(-4) + ... + 1*2^(-23) ≈ 0.57

Finally, we can apply the formula to calculate the decimal:

(-1) * 2^(128-127) * (1 + 0.57) = -3.14
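The decoding steps can be sketched as a short function. This is a minimal version that assumes a normal number; it ignores zero, subnormals, infinity, and NaN, and the name decode_float32 is made up for the example:

```python
def decode_float32(sign_bit, exponent_bits, fraction_bits):
    """Apply Sign * 2^Exponent * (1 + Fraction) to the three bit strings."""
    sign = -1 if sign_bit == "1" else 1
    exponent = int(exponent_bits, 2) - 127  # remove the bias
    # Bit n (counting from 1) contributes 2^(-n) when it is set.
    fraction = sum(
        2.0 ** -n
        for n, bit in enumerate(fraction_bits, start=1)
        if bit == "1"
    )
    return sign * 2.0 ** exponent * (1 + fraction)

value = decode_float32("1", "10000000", "10010001111010111000011")
print(value)  # close to -3.14, but not exactly -3.14
```

The decoded value differs from -3.14 by roughly one part in ten million, which is the approximation introduced when the fraction was cut to 23 bits.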

The Puzzles

There are two puzzles in IEEE 754:

  1. Why does IEEE 754 use a bias number?
  2. Why does IEEE 754 use powers of 2 instead of powers of 10?

To represent a negative number, integers use the first bit to indicate the sign, and IEEE 754 does the same for the sign of the whole number. The exponent could use a sign bit too, but IEEE 754 stores it with a bias instead. If exponents were stored as signed integers, comparing two floats would require decoding the exponents first. With a bias, a larger exponent always has a larger unsigned bit pattern, so the stored bits can be compared directly. For example, we want to compare -3.14 (1.57 * 2^1) and -0.785 (1.57 * 2^(-1)). Their binary representations are:

11000000010010001111010111000011
10111111010010001111010111000011

The exponent fields are 10000000 and 01111110. Comparing them as plain unsigned numbers already shows that the first value has the larger magnitude, so we don’t need to decode them first.
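A quick way to see the effect of the bias is to sort some floats by their raw 32-bit patterns. The sketch below assumes positive values, where the integer order of the bits matches the numeric order; float32_bits is a helper invented for this example:

```python
import struct

def float32_bits(value):
    """Return the 32-bit encoding of value as an unsigned integer."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    return as_int

positives = [3.14, 0.785, 100.0, 0.1]
by_value = sorted(positives)
by_bits = sorted(positives, key=float32_bits)
print(by_value == by_bits)  # True

# -3.14 is -1.57 * 2^1 and -0.785 is -1.57 * 2^(-1). The sign bits match,
# so the larger bit pattern belongs to the larger magnitude.
larger_magnitude_first = float32_bits(-3.14) > float32_bits(-0.785)
print(larger_magnitude_first)  # True
```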

IEEE 754 uses powers of 2 to trade precision for time and space. Scaling by a power of 2 is cheaper in binary hardware than scaling by a power of 10, and storing the fraction in binary takes less space. The cost is that many decimal fractions, such as 0.1, cannot be represented exactly. As an example of the space saving, to store 0.5 as a decimal fraction we would store the digit 5, which occupies 3 bits in binary. As a binary fraction, 0.5 is 0.1, which occupies a single bit:

101  //decimal fraction
1    //binary fraction
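The precision side of this trade-off is easy to observe. The sketch below uses Python floats, which are 64-bit IEEE 754 doubles like JavaScript numbers, so the bit widths differ from the 32-bit examples above but the behavior is the same:

```python
from fractions import Fraction

# Fraction(x) gives the exact value held by the float x.
half_is_exact = Fraction(0.5) == Fraction(1, 2)
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)
print(half_is_exact)   # True: 0.5 is exactly 2^(-1)
print(tenth_is_exact)  # False: 0.1 is only approximated

# float.hex() shows the stored significand and power of 2.
print((0.5).hex())  # 0x1.0000000000000p-1
print((0.1).hex())  # 0x1.999999999999ap-4

total = 0.1 + 0.2
print(total)  # 0.30000000000000004
```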

Note: This post was originally published on liyafu.com (the personal blog of one of our makers).