Preamble

This interactive guide to select parts of the IEEE754 specification for floating point representation was made to reinforce concepts in my own hands-on learning style. This page was made so that I can look back in the future and regain my understanding of this concept if I need to. I also hope that this may be useful to some who come across this page. If there is any error, let me know.

On this page, we will only be covering single precision (binary32) floating point numbers in detail, equivalent to a float type in some strongly-typed languages (e.g. C and C++). We will not be covering exception handling, and we will only cover the binary (not decimal) representations of floating point numbers in the specification.

Fundamentals - Binary I (Basic introduction)

We are used to counting numbers by using the digits 0 to 9. That is, there are 10 digits from 0 to 9. We can call this convention a decimal number system with a radix (how many digits there are available to use) of 10. After 19, we have 20, which uses two digits in the range 0 to 9, with the digit 2 in the leftmost position and the digit 0 in the rightmost position. Notice how, going from some number n to n + 1, a digit in a position is incremented by 1 if that digit is not 9. If that digit is 9, the digit is reset to 0 and the next digit is incremented by 1. The following interactive cell will give a better illustration of this concept.

Find out the next number

n      0000
n + 1  0000

As you can see, for say number n = 19, the blue digit (which is 9) will 'reset' to 0 (next row). The red digit to the left of the blue digit will be incremented from 1 to 2. If there are multiple 'resets', say n = 199, the changes propagate from right to left. That is, the first digit resets to 0, the next digit is also reset to 0 and the change is 'carried' over until the next non-9 digit. Try this for yourself above!

Let us say that instead of having 10 digits in a convention (digits 0 to 9), we have only digits 0 and 1 available to use. That is, we have 2 digits in this convention (a radix of 2). This is exactly the binary number system. In terms of counting in binary, we can use the same blue/red logic that we have devised above. That is, if we reach the last digit in our new convention, that is 1, we will reset the digit in that position to 0 and increment the next digit.

0000
0001
0010

For the following two illustrations, notice how when the next position also holds the last digit in our convention (1), the change is propagated right to left in the same way as in the decimal number system.

0011
0100
0111
1000

Fundamentals - Binary II (Conversion from and to decimal)

Hopefully you will now have a general understanding of how simple counting in radix 10 (decimal) and radix 2 (binary) can be performed separately. It is important to be sure we can convert from decimal to binary for this document. Conversion from binary to decimal will be briefly discussed as well.

There are two parts to every number in any number system. There is an integer part and a fractional part. The integer part is simply the number before a radix point usually represented by a dot '.'. The fractional part is after this radix point. You will have heard about the decimal point, which is exactly the radix point for the decimal number system (radix 10). In binary, the radix point is called the binary point.

Integer and fractional components

INTEGER FRACTIONAL
0 . 0

For the methods we will be using, conversion from decimal to binary is different for integer and fractional components.

Integer

The simplest method of conversion is to count up in binary and decimal side-by-side and see which binary number corresponds to each decimal number. But as you can imagine, converting the decimal 1,000,000 to binary would need a very large table counting from 1 to 1,000,000 with the corresponding binary representations, which would be infeasible for large numbers. Hence, there is a method involving repeated divisions by 2 to find the binary number for a decimal number. The number '2' is the radix of binary, and you will see a strong relationship between such repeated divisions by 2 and the process of converting binary into decimal later on.

  1. Divide a decimal number repeatedly by two, keeping the remainder of the division.
  2. If the divided number has a fractional component, remove that component (do not round!).
  3. Repeat steps 1 and 2 until the decimal number is zero itself.
  4. Read the recorded remainders from the last remainder found (bottom) to the first remainder (top). This is the binary number.

In the following interactive section, the grey boxes are our divisor (2). The yellow boxes hold the remainder of our division.

Convert decimal to binary (integer component)
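If you would rather poke at this method in code, below is a minimal sketch in Python (my own illustration; the interactive cells on this page run on JavaScript instead). It assumes a non-negative whole number as input.

```python
# Repeated division by 2, keeping the remainders, then reading them
# from last (bottom) to first (top).
def int_to_binary(n: int) -> str:
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        remainders.append(n % 2)  # step 1: keep the remainder
        n //= 2                   # step 2: integer division drops the fraction
    return "".join(str(r) for r in reversed(remainders))  # step 4

print(int_to_binary(19))  # '10011'
```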

Fractional

In the fractional component, instead of dividing, multiplication is done. In addition, the results are read from top to bottom instead of bottom to top like in determining the integer component. This is similar to the process of making the fractional component of a number into an integer component. For a decimal number 0.12345, to make this 12345.0 we just multiply 0.12345 by 10 (since the decimal number system has a radix of 10, i.e. has 10 digits from 0 to 9) until the fractional component is 0. This idea is applied to the following process, except that we are multiplying by 2 since the binary number system has a radix of 2.

  1. Multiply a decimal number repeatedly by two, keeping the integer component of the multiplication (the whole number).
  2. If the multiplied number is greater than or equal to 1, subtract 1 from the result. This is equivalent to setting the integer component to 0 at every step.
  3. Repeat steps 1 and 2 until the fractional part is zero itself.
  4. Read the recorded integer components from the first (top) to the last (bottom). This is the fractional part of the binary number.

Convert decimal to binary (fractional component)
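Again, for those who prefer code, here is a minimal Python sketch of this method (names are my own). The digit cap is an assumption to stop fractions that never terminate in binary.

```python
# Repeated multiplication by 2, keeping each integer component, then
# reading the components from top to bottom.
def frac_to_binary(frac: float, max_digits: int = 23) -> str:
    digits = []
    while frac != 0 and len(digits) < max_digits:
        frac *= 2
        if frac >= 1:
            digits.append("1")
            frac -= 1          # set the integer component back to 0
        else:
            digits.append("0")
    return "".join(digits)

print(frac_to_binary(0.625))  # '101', i.e. 0.625 is 0.101 in binary
```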

If you noticed, some numbers have an ellipsis ('...') when calculated. This is because the number is recurring (such as 1/3 = 0.33333333...) or is too long to represent accurately. The technology running this page (JavaScript) actually uses an IEEE754 floating point format called binary64 (double precision); even though this is more precise than the binary32 we will discuss shortly, there is still a limit to its precision.

Combining integer and fractional parts together

Now that we can convert both integer and fractional parts to binary separately, representing a decimal number like 123.456 in binary can be performed by simply separating the integer and fractional components and then joining these up, separated by the binary (radix) point, to produce the final binary number. Try to work out a decimal number by hand, then use the interactive cell below to check! If you are not sure how to convert a particular part (integer or fractional), you can re-visit previous cells for assistance.

Convert decimal to binary (both components)

INTEGER FRACTIONAL
0 . 0

Conversion from binary to decimal

We can convert a number from binary to decimal by adding up powers of 2. Let the rightmost digit in the integer part (before the binary point) be the 'zeroth position'. The positions of digits to the right of this position are -1, -2, ... and the positions of digits left of the zeroth position are 1, 2, ... . Now consider each position and its numeric identifier, say for binary number 101.101. We can multiply the value of each digit (either 0 or 1) by 2^(digit position) and add these together to find the decimal equivalent. So for 101.101, the decimal value is:

1x(2^2) + 0x(2^1) + 1x(2^0) + 1x(2^(-1)) + 0x(2^(-2)) + 1x(2^(-3)) =

1x(4) + 0x(2) + 1x(1) + 1x(0.5) + 0x(0.25) + 1x(0.125) =

4 + 0 + 1 + 0.5 + 0 + 0.125 = 5.625 (check this value using the interactive cells!)

As noted earlier, there is a strong relationship between the repeated divisions used to convert the integer part of a decimal number to binary and these powers of 2; the same can be observed with the repeated multiplications for the fractional part. In essence, binary numbers (radix 2) can be represented in decimal by powers of 2. In fact, decimal numbers (radix 10) themselves can be expressed as powers of 10. So 15.625 can be represented as 1x(10^1) + 5x(10^0) + 6x(10^(-1)) + 2x(10^(-2)) + 5x(10^(-3)). For any other number system (such as octal (radix 8), if you want to read further), numbers can be represented in the decimal number system by considering powers of the radix. By considering this fact, we can convert numbers from decimal to any other radix by either doing repeated divisions by the radix number (for the integer part) or repeated multiplications by the radix number (for the fractional part).
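As a code-flavoured summary of this section, here is a minimal Python sketch (my own illustration) that converts a binary string to decimal by summing powers of 2:

```python
# Sum digit x 2^(position) over every digit, with positions 0, 1, 2, ...
# leftwards from the binary point and -1, -2, ... rightwards.
def binary_to_decimal(s: str) -> float:
    integer, _, fraction = s.partition(".")
    total = 0.0
    for position, digit in enumerate(reversed(integer)):
        total += int(digit) * 2**position
    for position, digit in enumerate(fraction, start=1):
        total += int(digit) * 2**(-position)
    return total

print(binary_to_decimal("101.101"))  # 5.625
```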

This general introduction to binary arithmetic is more than sufficient to understand the basics of the binary32 specification.

Motivation for floating point representation

The term 'floating point' may be ambiguous at this stage, but this section will give an understanding of it. Notice how in the previous interactive cells involving integer and fractional parts, the binary point (radix point) is fixed. Digits in the integer part expand leftwards. We limited the fractional component to a few digits due to technological limitations, but in a perfect world where computers have infinite amounts of memory, we can expect digits in the fractional component to expand rightwards.

Think of the radix point as a strange sprinkler that only squirts in a single dimension (north and south, for example), where water falls at two distinct points only. If this sprinkler has infinite power, the water will not fall onto any surface. But otherwise (the sprinkler can only squirt with a certain power), water will eventually fall onto some surface on both sides. The distance from the sprinkler to where the water falls is where we can mark out on the ground the digits of the integer component on one side and the fractional component on the other. To mark out more digits, we must make the sprinkler more powerful. However, realistically we cannot make the power too great, or the sprinkler may explode or something (sorry everyone, my degree is not in agriculture!).

A large issue with fixed-point representation is mirrored by our strange sprinkler analogy. The slight difference is that the 'distances' (maximum number of digits stored by any part, integer or fractional) may not be equal as may be implied by the analogy. Modern general-purpose computers have 64-bit (bit = binary digit) central processing units (CPUs), which generally means that 64 digits of binary can be processed quickly. We could represent numbers in a system with 32 binary digits for the integer part and the other 32 bits for the fractional part. To represent a number that is very large, we could allocate 56 bits to the integer part and the rest to the fractional part, and vice versa. However, we will have limited precision for numbers.

Precision

Say that we allocated 63 bits for the integer part and 1 bit for the fractional part. The binary point does not need to be represented as its own digit and is implicit in this fixed-point representation (implicitness will be a key concept in binary32 representation later). To represent 101.101, we cannot represent the last 2 digits of the fractional part at all. The simplest way to resolve this problem is to truncate (chop off) the last 01, so we will have 101.1. Instead of 5.625 now, we have 5.5 (check with the above interactive cells!). Because our precision is limited to one fractional digit, where the binary point cannot move, we cannot represent the number 101.101 accurately. However, we can represent very large numbers (that have a decimal fractional component of either .0 or .5 only) very accurately. In computers we could allocate more space for digits, such that we can represent 63 integer-part digits and 63 fractional-part digits, but this would use up more memory and slow down calculation times, so you can expect people to complain to Intel or AMD in addition to their Google Chrome woes. What if we could move the binary point around without changing the actual word length (e.g. 64 binary digits as for 64-bit processors)?

A floating point representation is one where the radix point can 'float' (move) to any position (maybe given some rules, as per binary32). Consider our 'strange sprinkler' example. If the laws of physics said that we could move the sprinkler around while keeping the two points on the ground where water lands the same without a change of power, this would be an analog of floating point. Less confusingly, a floating point is a radix point that, within some total digit limit, is able to move by some mechanism to represent integer and fractional components with varying precisions. In the interactive cell below, illustrated is a simple floating point representation with a digit limit of 32. Note that 64-bit processors have a maximum 'digit limit' of 64, but can potentially handle digit limits lower than this, usually 32, 16 and 8. How they exactly handle this will not be part of this document.

Simple floating point for a total digit limit of 32

Maximum value in binary
Integer part
Fractional part
Maximum number
Precision

As you can see, by changing the position of the binary point, values can be represented with varying precision. If we need to represent a very small number comprising of mostly numbers in the fractional component, we can 'float' the binary point rightwards. If we need to represent a very large number the point can 'float' leftwards. Note that computers represent numbers often in binary, but such floating point concepts can be applied to the decimal number system or any number system (of any radix).

Issues with the naive floating point representation

In the previous section, we were presented with a very simple (naive) floating point representation where the precision of numbers can change depending on where the radix point is. However, there are a few problems with this representation as you might have noted:

  1. How can we represent the position of the actual binary point?
  2. How do we represent positive and negative numbers?
  3. How do we know how to do basic mathematical operations (add, subtract, multiply, divide) on these numbers?

In the following sections, we will seek to extend our basic floating point representation to optimally solve these issues.

binary32 representation - Solving issue 1

We seek to solve the first issue of our naive floating point representation in this section.

Representing point movements

We should store in some way how many positions left or right the point has moved. For this we will need a default placement of the binary point, but this will be discussed later. For now, we just need a general idea of how to represent such movements from any given default placement.

Let us have a decimal number, 321.123. I want to move the decimal point 2 places to the right so we have 32112.3. Such a move may seem trivial, but will aid us in generalising the idea to solve our precision problem (issue 1). On paper, we could write something like 'move 2 places right'. If I wanted to move the point 2 places to the left to get 3.21123, we could write 'move two places left'. However, such a representation would be difficult to work with in computer hardware since computers only understand numbers (in binary). Let us try to represent the move with a single number. How would we do that? We can easily take any right movements as positive movements and any left movements as negative movements. Hence 321.123 to 32112.3 would be a move of +2 and 321.123 to 3.21123 would be a move of -2.

In mathematics, this is the concept of exponential representation (also called scientific notation). For our number 321.123, we can represent a move two places right with the following format: 321.123 x 10^2. If you calculate this on a calculator, you would in fact get 32112.3, which is exactly what we have discussed! A move two places left would be 321.123 x 10^(-2) = 3.21123. This concept is very similar to our representation of a move with a single number, but we introduce this more formal notation because some textbooks stick more closely to the notation of the IEEE754 specification.

You can try some examples of exponential notation in the interactive cell below. Also do try representing a shift of a binary number, and notice the differences in the multiplier (the second number being multiplied)!

Represent decimal or binary moves in exponential (scientific) notation




Direction I want to move the point:


x =

If you have noticed through experimenting with the interactive cell above, the multiplier for the binary number system is 2^k where k, which is formally called the exponent or power of the number, is effectively our single number shift representation (taking right as positive). So if we want to shift the binary point of 101.101 by 2 places right to make 10110.1, k = 2 and the multiplier is 2^2 (try this!). Notice how the multiplier is 10^k for decimal numbers but 2^k for binary numbers. The number 'at the bottom' is formally called the base of the number that is raised to the exponent k. This base is determined by the radix of the number, and if you have forgotten, the radix is how many digits there are in the number system (0-9 for decimal, so the radix of decimal is 10, for example). We will not be using all of these terms in this document, but it is essential to get a general understanding of such concepts as we proceed further into the specification, especially the movement of the point.
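As a quick sanity check of the shift-as-exponent idea (a Python sketch of my own; the decimal lines carry a little binary64 noise, as discussed earlier):

```python
print(321.123 * 10**2)   # ~32112.3: a move of 2 places right, base 10
print(321.123 * 10**-2)  # ~3.21123: a move of 2 places left
print(5.625 * 2**2)      # 22.5: binary 101.101 shifted to 10110.1 (k = 2)
```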

Exponent of binary32

Actually, the single number system that we devised before introducing the base and exponent of a number is all that is needed to keep track of where the point has 'floated'. However, we will be working in binary representation from now on, so the pattern of the base following the radix is essential to understand, especially when referencing other sources. Recall the space we can store our binary number in: a digit limit of 32, as in an above interactive cell. We can set aside some digits just to represent this single number system, more formally the exponent, of the number. If you want to know, we do not need to represent the base of the multiplier since that base is always 2 (recall: 2^k in the above interactive cell for binary number shifts). You might think that keeping this information elsewhere and keeping all 32 digits for the actual number is an alternative, though this would likely result in worse performance and/or higher costs because additional hardware may be needed to store and retrieve such information (I am not a computer engineer, so I do not know the specifics). How many digits can we set aside?

IEEE754 set aside 8 digits in binary32 to represent shifts, leaving the rest for the actual number (not all of it for the actual number, because we still have issue 2 to solve later). This means that we have a maximum of 256 ways to move the point. In fact, this number is a little less, because IEEE754 also specifies some special patterns of digits that indicate specific things (which will be discussed later). We end up with 254 ways to represent a shift because particular binary numbers are reserved for such special patterns.

Now the question is how we can specifically represent shifts in 254 ways. We could say that all the shifts are to the right, but that would only allow for very large numbers. If all shifts are to the left, this would only allow for very small numbers. However, we can split half of the moves as left and half as right.

If we divide 254 by 2 we get 127. So we can represent a move of the binary point from some default point (very important ambiguity, but will be discussed later) by 126 shifts left, one 0 shift (this will be useful!), or 127 shifts right. Recall the computer-friendly way we represented a move with a single number, which agrees with the mathematical concept of exponential (scientific) notation that we explored in interactive cells. We represent left shifts with negative numbers (i.e., -100 means that the binary point moves 100 places to the left) and right shifts with positive numbers. This is how the exponent field in the binary32 specification works.

You might have heard of the exponent bias that accompanies the exponent field. This is because, in a very general sense, computers will 'assume' that all 254 moves are to the right because of the way the exponent is stored. Hence we need to tell the computer in some way that half of the moves are actually to the left. We can subtract a fixed number, in this case 127, from what the computer 'assumes' the shift is. For example, a left shift of 123 (-123 in our single number representation) will actually be stored as 4. 4 - 127 = -123, which is the way to 'tell' the computer that we are not doing a right move of the binary point by 4 places but in fact a move of 123 places to the left. Technically the exponent can be said to be in excess-127 representation, because the stored exponent is biased to be 127 more than the actual value.

binary32 exponent bias visualisation

My request
What the computer 'assumes'
After subtracting a bias of 127

Another way to think about this is to consider 8 digits in binary. There is no way to represent a negative number with all 8 digits without some correction. There are ways around this, such as what more advanced readers might know as 2's complement, which will not be discussed here, but IEEE decided not to implement such a representation. Wikipedia currently states, without citations, that comparisons would be made 'harder'. This is likely because additional hardware would be required to detect a negative number represented in 2's complement, which would probably reduce the performance of a computer.

Mantissa (significand, fractional part) of binary32

Earlier we referred our floating point moves to some default point. We need some 'default location' of the point (where we represent the number as-is, i.e., with 0 shifts of the point) so that all computers that use our floating point representation can be on the same page. In binary32, the 'default point' is actually after the very first digit that is a 1. So, for example, the binary numbers 1.011001, 1.00001, 1.111111 are some examples of the binary point being at the 'default point'. For other numbers, say 101.101, we must 'convert' the number so that the point is after the very first 1 digit, while keeping the information about such a move in our 32 digit limit.

Such a 'conversion', which is formally called normalisation (sorry Americans), is very easy when you think about representing a point moved by a single number, right as positive in the interactive cells we have experimented with. For the number 101.101, all we need to do is the following:

  1. Normalise 101.101 to 1.01101 by moving the binary point 2 places to the left.
  2. We need to move the point 2 places to the right to get back our original number.
  3. Transform '2 places to the right' into our single number representation, taking right as positive, which is 2.
  4. Add a bias of 127, so 2 + 127 = 129. You might remember our 'computer assumes' logic from earlier. In simple terms, the computer's hardware will automatically 'correct' the 'assumption', so we must add the 'assumption' back. More simply but in a more advanced way, because our 8 exponent binary digits cannot represent a negative number, and we are not using 2's complement, we must bias 2 so that the number fits into the range 1 to 254 inclusive (0 and 255 are used for special purposes).
  5. Convert 129 into binary, which is 10000001.
  6. Store 10000001 into the exponent section of the number.

We are not done yet! The computer now knows how to 'float' the binary point from a default position. The question arises as to how we are going to store the actual number. We could just store 1.01101 in the remaining space of 24 digits as 101101 (without the binary point), where the binary point is known to be just after the first 1. However, we can store this same number in 23 digits. How is that possible? Because the computer knows that the binary point is always just after the first 1 after normalisation, we do not need to represent this. Otherwise, the first binary digit in the group of 24 digits would always be 1, so we can put that first digit to some better use (to solve issue 2, in fact). So we erase the first digit, and make the computer assume as such. This first digit is formally called the implicit leading bit, where bit is a shorthand for 'binary digit'. So now only the fractional part is stored in 23 digits, which is 01101 followed by zeros (after normalisation the integer part is always 1). This group of 23 digits is called the mantissa among other names, such as the significand and fractional part. It might be easier to refer to this field as the fractional part.
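Here is the whole normalisation dance in a minimal Python sketch (my own illustration, assuming a positive, non-zero number that needs no rounding):

```python
value = 5.625   # binary 101.101
shift = 0
while value >= 2:   # float the point left until the form is 1.???...
    value /= 2
    shift += 1
while value < 1:    # or float it right, for numbers below 1
    value *= 2
    shift -= 1

exponent_field = shift + 127           # add the bias of 127
print(format(exponent_field, "08b"))   # '10000001' (129)

# Drop the implicit leading 1 and keep the 23 fractional digits.
mantissa_field = round((value - 1) * 2**23)
print(format(mantissa_field, "023b"))  # '01101000000000000000000'
```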

There is one more query you might have, which is how you are supposed to represent zero. Zero has no digits of 1 in binary. In fact, there is a special representation of zero that you will see later.

binary32 representation - Solving issue 2

The second issue is how to represent positive and negative numbers. Recall that in our 32 digit limit, we have used 8 digits for the exponent and 23 digits for the fractional part. We have one more digit left, and this can actually be used to represent positive and negative numbers as follows:

This single digit is called the sign because it indicates to the computer the sign of the number, which is either positive or negative. So for example, -101.101 will be represented with a sign of 1 and 101.101 will be represented with a sign of 0.

Putting it all together - binary32 representation

We are almost ready to represent in IEEE754 single-precision floating point (binary32) any number. The final issue is how we are going to lay out our groups of digits, namely the exponent, mantissa and sign groups. In binary32, the groups are represented in the following order:

SIGN | EXPONENT | MANTISSA

Try to represent a number on pen and paper, and then use the interactive cell below to check your answer.

Importantly, please note the following deviations from the IEEE754 standard:

Convert a decimal number to binary32 representation


    Sign  Exponent  Mantissa
    0     00000000  00000000000000000000000

    The most confusing part about this conversion is that we must consider the number of places the binary point moves to get back to our actual number, not the number of places moved to get from the actual number to the normalised form.

    Congratulations! You now know how to represent (most) decimal numbers in binary32!

    We will not go into detail how to transform a binary32 number into decimal, but this is essentially the reverse of the steps to convert a decimal number into binary32 of which we covered.
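If you want to check your pen-and-paper answers outside the cell, Python's struct module can produce the bits (a minimal sketch under that assumption; struct.pack performs the decimal-to-binary32 conversion, including rounding, for us):

```python
import struct

def to_binary32_string(x: float) -> str:
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # the raw 32 bits
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"  # sign | exponent | mantissa

print(to_binary32_string(5.625))   # 0 10000001 01101000000000000000000
print(to_binary32_string(-5.625))  # 1 10000001 01101000000000000000000
```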

    Special representations in binary32

    As mentioned earlier, and as you might have noticed if you tried to convert 0 to binary32, there are some special representations of specific numbers (and non-numbers) in binary32. These are as follows ('?' means either 1 or 0):

    Sign  Exponent  Mantissa                 Special item
    0     00000000  00000000000000000000000  +0
    1     00000000  00000000000000000000000  -0
    0     11111111  00000000000000000000000  Positive infinity (+inf)
    1     11111111  00000000000000000000000  Negative infinity (-inf)
    ?     11111111  0??????????????????????  Not a number, signalling (sNaN); mantissa must not be all zeros
    ?     11111111  1??????????????????????  Not a number, quiet (qNaN)
    0     00000000  ???????????????????????  Positive subnormal number; mantissa must not be all zeros
    1     00000000  ???????????????????????  Negative subnormal number; mantissa must not be all zeros

    The reason for a +0 and a -0 is that in some scientific or mathematical fields, very small numbers that are very close to 0 but not 0 can be represented simply with +0 or -0. +0 and -0 are the equivalent of saying 'a very small positive number' and 'a very small negative number' respectively. You might have seen 'not a number' (NaN) in bugged programs before, and this means that what is represented by the floating point representation (which may not be binary32) is actually not a number ('the number is not a number', as some may confusingly say). There are many types of NaNs, but generally speaking, a signalling NaN means that this particular non-number is in most cases caused by an error serious enough to signal some part of the computer about the existence of this representation. A quiet NaN means the opposite (no serious errors raised). You might want to go into the exception handling part of the IEEE754 specification to find out more.

    Emphasised is the subnormal number, which is a number that in essence does not follow the rule of having the binary point moved to the position such that the number starts with a 1 followed by the binary point, as we have discussed. In other words, the number is not normalised, and no movement of the point is encoded. In such a case our implied leading bit is zero. So, for example, 0.00000000000000000000001 (before the fixed shift described below) may be an example of a subnormal number with sign = 0, exponent = 00000000 and mantissa = 00000000000000000000001. Do note that we cannot simply have a relatively large number represented in this way, for example 0.101101. Such relativity is determined by the minimum possible value if represented in the 'normal' way (with the implied leading bit as 1, as per our interactive cell), formally known as the smallest normal number. This number is 1.00 x 2^(-126) in binary (recall that our range of shifts is 126 left and 127 right, including one zero shift). Numbers below 1.00 x 2^(-126) will not be normalised in the usual way and will be subnormal numbers. How such subnormal numbers are actually 'normalised' is as follows:

    1. Convert the number to binary.
    2. Move the binary point 126 places to the right.
    3. The 23 digits after the binary point of the resulting number form the mantissa.

    To convert back from this subnormal representation, the point is moved 126 places to the left. You can think of the seemingly arbitrary number 126 as the maximum number of left shifts we can do. A subnormal number can then be thought of as a floating point representation in which, to get the actual number, we add a 0 and then the binary point before whatever is in the mantissa, and then we always move the binary point 126 places to the left.

    Finally and most importantly, notice that special items always have exponent values of eight 1's or eight 0's. Eight 1's in binary, 11111111 = 255 and eight 0's in binary is 0. If we want to apply the bias to get back our actual single number shifts taking right as positive, 255 - 127 = 128 and 0 - 127 = -127. This is why we cannot represent a binary point movement of 127 places to the left nor a movement of 128 places to the right using a bias of 127, since these numbers that could have represented shifts are actually used as one of the mechanisms to identify special items.
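The table of special patterns translates quite directly into code. A minimal Python sketch of my own (bits is the raw 32-bit pattern):

```python
def classify(bits: int) -> str:
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 exponent digits
    mantissa = bits & 0x7FFFFF      # 23 mantissa digits
    if exponent == 0b11111111:
        if mantissa == 0:
            return "-inf" if sign else "+inf"
        return "NaN (quiet or signalling, by the leading mantissa digit)"
    if exponent == 0b00000000:
        if mantissa == 0:
            return "-0" if sign else "+0"
        return "subnormal"
    return "normal"

print(classify(0b0_11111111_00000000000000000000000))  # +inf
print(classify(0b1_00000000_00000000000000000000000))  # -0
print(classify(0b0_00000000_00000000000000000000001))  # subnormal
```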

    Maximum and minimum values in binary32

    Maximum and minimum values are described below. The format presented in the value column follows the interactive cell 'Represent decimal or binary moves in exponential (scientific) notation':

    Item: Minimum positive (normal) value
    Value: 1 x 2^(-126)
    Notes: We can only represent a maximum of 126 moves of the binary point to the left from our default position of 1.???... when converting from binary32 back into decimal.

    Item: Maximum positive (normal) value
    Value: (2 - 2^(-23)) x 2^127
    Notes: We can think about this value in the following steps:
    1. Make out a binary pattern such that there are 23 ones after 1 and a binary point (i.e., 1.[23 ones]). This is because 23 ones is the largest mantissa value and we have an implicit leading bit of 1 for normal numbers. 2^(-23) is basically 0.[22 zeroes]1 and 2 is 10.0. Consider the very first topic about 'resetting' digits and 'propagating' changes for additions. We are doing subtraction now, but we can think about what number came before 10.0 with an increment of 0.[22 zeroes]1 instead of the 1 we had in the first topic. It turns out that the previous number is 1.[23 ones]. This is because, if you add 0.[22 zeroes]1 to 1.[23 ones], the rightmost digit is reset to zero, and the next digit is incremented (1 carried over). But since the next digit is also 1, that is reset to 0 and the 1 is carried over again. The 1 keeps getting carried over until we have 10.[23 zeroes] = 10.0.
    2. We now have our binary pattern of 1.[23 ones], which is equivalent to 10.0 - 0.[22 zeroes]1. This converts to a decimal form of 2 - 1 x 2^(-23), or just 2 - 2^(-23).
    3. Now think about 2 - 2^(-23) as a number in itself and not something like '10 subtract 1 with the binary point 23 places to the left' (which is correct, but is obviously confusing). We need to move the binary point in this number in a way that gives the largest number. Recall that for our single number shift representation, with right as positive, we can move a maximum of 127 places to the right. Moving the point right makes the number larger. So for our current number, which is 1.[23 ones] and is actually already normalised, we just need to instruct the computer to move the binary point 127 places to the right. This can be done with an exponent section of 127 + 127 = 254, but how can we write this down on paper? We can use a previous interactive cell to find out! The answer is (our number) x 2^127. What is in the value column is basically (our number) written in full.

    Item: Minimum positive subnormal value
    Value: 1 x 2^(-126 - 23)
    Notes: Recall that the binary point when converting back to decimal always moves 126 places to the left for subnormal numbers. If we set the mantissa to only the very rightmost digit as 1 and all other 22 digits as 0, then the number represented is 0.[22 + 126 zeroes]1 after conversion from binary32 to decimal, which is in fact the smallest number.

    Item: Maximum positive subnormal value
    Value: (1 - 2^(-23)) x 2^(-126)
    Notes: If you have read the notes for the maximum positive normal value, the steps are very similar. The only difference is that for the first step, a value of 1.0 is used instead of 10.0. This is because we want a binary pattern of 0.[23 ones] instead of 1.[23 ones], because the leading binary digit of a subnormal number is always zero instead of one.
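A quick check of these values (a Python sketch of my own; binary64 arithmetic can represent every binary32 value exactly, so the prints are exact):

```python
print(2.0**-126)                     # minimum positive normal value
print((2.0 - 2.0**-23) * 2.0**127)   # maximum positive normal value, ~3.4028235e+38
print(2.0**(-126 - 23))              # minimum positive subnormal value
print((1.0 - 2.0**-23) * 2.0**-126)  # maximum positive subnormal value
```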

    Rounding modes in binary32

    We will discuss the different rounding modes in binary32; to complete the discussion of our third issue (arithmetic), rounding will be important. For this section, we will consider a modified version of the (expanded) mantissa section of the binary32 representation. Specifically, we will consider a section of 48 bits ((23 + 1) x 2) as follows:

    1.(other 47 bits)

    The aim is to find some way to remove the last 24 bits of the section so that we can easily remove the beginning 1. to get our actual mantissa back. To do this, we can make use of rounding.

    Rounding to nearest: ties to away

    You would likely have used rounding for decimal numbers in some capacity to make a number more easy to read. For example, in the common rounding method, 0.125 rounds to 0.13 if we want to read with 2 decimal places. Similarly, 0.124 rounds to 0.12 and 0.126 rounds to 0.13. We can say that the usual rounding method follows the following rules:

    1. If the decimal digit just right of the place we want to round to is greater than 5, or is 5 with at least one non-zero digit to its right, then we increment the digit at the place we want to round if the number is positive, otherwise we decrement that digit.
    2. If the digit right of the place we want to round is less than 5, we simply 'chop off' (truncate) the rest.
    3. The last case is that the digit right of the place we want to round is exactly 5, and all digits to the right of this 5 are zero (for example 0.500000...), in which case we follow step 1 (the magnitude always rounds up, away from zero). This situation, which you might not have heard of, is called a tie.

    For binary numbers, the usual rounding method is very similar:

    1. If the binary digit right of the place we want to round is 1, and at least one of the remaining digits is not zero, then we increment the digit at the place we want to round.
    2. If that binary digit is zero, then truncate.
    3. In a tie situation (that digit is 1 and all remaining digits are zero), follow step 1: the magnitude always rounds up, away from zero.

    For example, let us round the numbers 101.101, 101.111 and 101.110 to one binary place. 101.101 rounds to 101.1 because the digit next to the place we want to round, 101.1[0]1, is a zero. For 101.1[1]1, we increment 101.[1]11. However, if you remember binary arithmetic, this 'carry' is 'propagated', so we will have 110.0. 101.1[1]0 is a tie situation, which will also result in the value becoming 110.0 since the magnitude rounds away from zero. This can be called a method of rounding to the nearest value, breaking ties away from zero (or ties to away). This method is actually only required for the decimal formats (decimal32, decimal64, decimal128) of IEEE754.

    Direct rounding: Truncation (round to zero)

    Truncation is basically 'chopping off' all digits that we want to omit. Think about this like taking an eraser and erasing any digits that you do not need. Hence, if we want to round 1.999 to 1 decimal place, this would just become 1.9 because we erased the remaining '99'. For binary, this concept is the same: 1.01101 to 1 binary place becomes 1.0.

    Direct rounding: Rounding up

    In rounding up, we always round a value up (towards positive infinity). Hence 1.999 in our previous example becomes 2.0, but 1.911 also becomes 2.0. For 1.001, this becomes 1.1. For binary, 1.01101 becomes 1.1. For negative numbers, we also round towards positive infinity, so -1.999 becomes -1.9.

    Direct rounding: Rounding down

    Rounding down is the opposite of rounding up (we round towards negative infinity), hence 1.999 in our previous example becomes 1.9, and 1.911 becomes 1.9 (since the 1 to the right of 9 does not 'propagate'). For binary, 1.01101 becomes 1.0. For negative numbers, -1.999 will round down to an even more negative number, -2.0.

    Rounding to nearest: ties to even

    This is the default mode of rounding for IEEE754 floating point numbers, and is also called the 'Banker's method'. Consider our usual 'ties away' method. If we have a list of values that are all ties, where the digit just right of the place we want to round to is exactly half with nothing after it, such as 1.2[3]5 for a 2 decimal place or 1.0[1]1 for a 2 binary place rounding, all these numbers will be rounded up. This might be intuitive, but in some fields this would add too much weight to rounding up. Hence we have to have some method of distributing equally the number of times a tie gets rounded down and up. Consider now the concept of odd and even numbers. We can split the digits in some radix into odd and even and decide on how to round by considering these digits in some way. So for decimal, 1, 3, 5, 7 and 9 will be rounded differently from 0, 2, 4, 6 and 8, which will result in an equal distribution of rounding in a very general sense (assuming we have a lot of data with different numbers).

    Our improved method, for decimal, is as follows:

    1. If the decimal digit just right of the place we want to round to is greater than 5, or is 5 with at least one non-zero digit to its right, we round up if the number is positive and round down if the number is negative.
    2. If the digit right of the place we want to round is less than 5, we truncate.
    3. In a tie, we round either up for positive numbers or down for negative numbers in such a way that the decimal digit at the place we want to round (the least significant digit) becomes an even digit.

    Hence, rounding to 2 decimal places,

    1. 13.4[3]500... will be rounded to 13.44: we have a tie and 3 is odd, so we round up to the even digit 4.
    2. 13.4[2]500... will be rounded to 13.42: we have a tie and 2 is already an even digit.
    3. 13.439 will be rounded to 13.44, the same as step 1 in 'ties away'.
    4. 13.431 will be rounded to 13.43, the same as step 2 in 'ties away'.
    5. -13.4[3]500... will be rounded to -13.44: we have a tie and 3 is odd, so we round down (away from zero, since the number is negative) to the even digit 4.

    This is very similar for binary. For 100.[1]100..., this is rounded to 101.0 since we have a tie and 1 is an odd digit, so we round up to the next even digit (the carry propagates, leaving the least significant digit at 0). For 100.[0]1000..., this is rounded to 100.0 since we have a tie and 0 is already an even digit.

    How do we calculate this on a computer? Generally there are a few concepts:

    1. We need to take note of the least significant digit (the rightmost digit we want to round to).
    2. We need to take note of the first digit that is to be rounded off, or 'erased' (but unlike truncation, we cannot simply erase this digit without considering the effects of other digits).
    3. We need to check if all other digits are zero or not.

    We can express these digits as the following groupings:

    1. The digit to the right of our least significant digit as the guard.
    2. The digit right of (1) as the round.
    3. The remaining digits to the right of (2) as sticky.

    Using these groupings, we can determine if there is a tie or not in a way that is friendly to a computer. Do note that all actions say to 'truncate' the value, because rounding up or down is done in a similar way to first truncating a value then operating on the truncated value.

    Round a binary number to a whole number using the 'ties to even' method

    Guard | Round     | Is any digit in sticky 1? | Action
    0     | Any value | Doesn't matter            | Truncate
    1     | 1         | Doesn't matter            | Round up or down: truncate, then add 1 to the least significant digit
    1     | 0         | Yes                       | Round up or down: truncate, then add 1 to the least significant digit
    1     | 0         | No                        | Tie to even (round up or down): truncate, then add 1 to the least significant digit ONLY IF that digit is 1

    You might be confused by the actions that state 'round up or down'. Is the process of adding 1 to the least significant digit correct for rounding up and down for positive and negative numbers? The process of rounding up or down in the computer is equivalent to the following:

    1. Ignore whether the number is positive or negative.
    2. Add 1 to the least significant digit if required.
    3. If the number is negative, simply put a minus sign in front of the number.

    The third step is equivalent to modifying the sign digit in binary32 representation as required.

    For our 48 digit representation, recall that we will need to keep the first 24 digits only. This means that we are effectively rounding our modified mantissa to the first 23 binary places (24 binary digits) (recall also that the 1 left of the binary point will be removed when storing in the binary32 representation). Since the sign of the number is stored separately, we can pretend that the number is positive and we will always round up, i.e. add 1 to the least significant digit, if required. Then if the number is negative we can just add a minus sign in front of the number. If you look at hardware diagrams for the floating point arithmetic logic unit (ALU), you will see that the sign digit is in fact dealt with separately from the mantissa.

    1 . [23 kept digits after the binary point] [guard: 24th digit] [round: 25th digit] [sticky: the remaining 22 digits]
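Here is the decision table as a minimal Python sketch (my own; it rounds a binary string to a whole number like the cell above, and folds the round digit into sticky, which makes the same decisions as treating them separately):

```python
def round_to_whole_ties_even(s: str) -> int:
    integer, _, fraction = s.partition(".")
    value = int(integer, 2)             # the digits we keep
    guard = fraction[:1] == "1"         # first digit to be rounded off
    rest_nonzero = "1" in fraction[1:]  # round and sticky, combined
    if guard and (rest_nonzero or value % 2 == 1):
        value += 1  # round up when over half, or on a tie with an odd digit
    return value

print(bin(round_to_whole_ties_even("101.1000")))  # 0b110: tie, 101 ends in 1 (odd)
print(bin(round_to_whole_ties_even("100.1000")))  # 0b100: tie, 100 ends in 0 (even)
print(bin(round_to_whole_ties_even("100.1100")))  # 0b101: over half, round up
```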

    binary32 arithmetic - Solving issue 3

    If you have noticed, we have one more issue to solve with our floating point representation. Currently we do not have a standardised way to perform basic operations on our numbers in binary32 from what we have seen so far. We can only look at the numbers and convert them from decimal to binary32 representation and back. In this section, we will explore how we can operate on binary32 numbers and represent the results properly.

    Importantly, we will not be discussing how to perform arithmetic operations on values where at least one of the operands is subnormal. How such arithmetic is performed may be implementation-specific.

    Multiplication

    We start with multiplication since this is the easiest to do. Recall how we usually perform multiplication on two decimal numbers, say 500 x 2 = 1000. All that is needed is the elementary school multiplication method of lining up, right-aligned, the numbers in a column and multiplying digit-wise the multiplicand (500) and the multiplier (2). We can extend this easily to binary notation, remembering that when adding the partial products (intermediate steps) together you can only use digits 0 and 1.

    How about the fact that we may have shifted the radix point a few places? We can just add up the shifts together! Think about it: if we have a number 1.01 x 10^(-2) and another 2.05 x 10^5, we can calculate this in two parts:

    1. Just the numbers without the shifts, 1.01 x 2.05 = 2.0705
    2. The shifts themselves, which is 2 left and then 5 right, so you land in a place three to the right. This is equivalent to 10^(-2) x 10^5 = 10^(-2 + 5) = 10^3.
    3. Combine the parts together: 2.0705 x 10^3.

    Check for yourself that this answer is correct, since 2.0705 x 10^3 = 2070.5.

    For binary multiplication, remember that the base of the exponent is 2 instead of 10 since the radix of binary is 2. Otherwise, our steps remain the same. Just remember that we only work with digits 0 and 1 in binary.

    The steps presented above can actually be translated to a more binary32-friendly set of steps:

    1. If only one but not both of the numbers is negative (checking the sign digits), mark the result's sign digit as 1. Otherwise the result's sign digit is 0. This can be done with an exclusive-or (XOR) operation (1 if the signs differ, 0 if both signs are the same).
    2. Add the implicit leading bit of 1 to both the operands (we will not be operating on subnormals). Now both values will read 1.[23 digits of the mantissa].
    3. Perform multiplication between the two numbers formed in step 2. We will have a maximum of 48 digits resulting from this multiplication. We keep all 48 digits temporarily.
    4. Add the exponents together, then subtract 127. This is because both exponents are biased by 127. Hence, for exponents (a + 127) and (b + 127), adding them together will result in (a + b + 127 + 127). There are two copies of the bias, so we subtract one of them: a + b + 127. Note that a and/or b can be either positive or negative.
    5. Observe the 48 digits we have. If we need to re-normalise, say if we have 101.100101..., we will have to shift the binary point (left by 2 in our example to get 1.01100101...). Then we add to the a + b + 127 we calculated in step 4 the shift needed to get back to 101.100101..., which we will generalise as c: a + b + c + 127. If we do not need to renormalise, c = 0.
    6. Now a + b + c + 127 may be more than 254. In this case, we have to perform overflow handling (not covered, but generally we set the number to infinity).
    7. Round, breaking ties by rounding to even, the mantissa to 23 binary places.
    8. Combine the sign, exponent and mantissa together to get the answer.

    How a computer does multiplication will not be discussed, but you can read about Booth's algorithm for multiplication if you are interested as one method of a computer-friendly way to perform multiplication.
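To make the steps concrete, here is a minimal Python sketch of my own (normal operands only, and truncation standing in for the ties-to-even rounding of step 7):

```python
import struct

def fields(x: float):
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def mul32(x: float, y: float) -> float:
    sx, ex, mx = fields(x)
    sy, ey, my = fields(y)
    sign = sx ^ sy                    # step 1: XOR of the signs
    px = (1 << 23) | mx               # step 2: restore the leading 1s
    py = (1 << 23) | my
    product = px * py                 # step 3: up to 48 digits
    exponent = ex + ey - 127          # step 4: remove one copy of the bias
    if product >= 1 << 47:            # step 5: renormalise (c = 1)
        product >>= 1
        exponent += 1
    mantissa = (product >> 23) & 0x7FFFFF  # step 7, simplified: truncate
    bits = (sign << 31) | (exponent << 23) | mantissa  # step 8
    (result,) = struct.unpack(">f", struct.pack(">I", bits))
    return result

print(mul32(2.5, 3.0))  # 7.5
```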

    Division

    Division is similar to multiplication:

    1. Perform the same sign determination (using XOR).
    2. Add the implicit leading bit of 1 to both the operands (we will not be operating on subnormals).
    3. Perform division between the two numbers formed in step 2. We will have a maximum of 48 digits resulting from this division. We keep all 48 digits temporarily.
    4. Subtract the second exponent from the first, then add 127: (a + 127) - (b + 127) = a - b, and we must re-add a bias since the two biases cancel out, giving a - b + 127.
    5. Renormalise the number so that we have a - b + c + 127. If no renormalisation is needed, c = 0.
    6. Now a - b + c + 127 may be negative. In this case, we may have to handle this (not covered).
    7. Round, breaking ties by rounding to even, the mantissa to 23 binary places.
    8. Combine the sign, exponent and mantissa together to get the answer.

    Unfortunately, since I am actually running out of time on this project, we won't have interactive cells for now. If you are interested, tell me, and I might add the cell in the future.
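In the meantime, here is a minimal Python sketch of my own for division (normal operands only; the dividend is widened by 23 extra digits so integer division keeps enough precision, and truncation again stands in for ties-to-even):

```python
import struct

def fields(x: float):
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def div32(x: float, y: float) -> float:
    sx, ex, mx = fields(x)
    sy, ey, my = fields(y)
    sign = sx ^ sy                 # step 1
    px = (1 << 23) | mx            # step 2
    py = (1 << 23) | my
    quotient = (px << 23) // py    # step 3
    exponent = ex - ey + 127       # step 4: re-add the cancelled bias
    if quotient < 1 << 23:         # step 5: renormalise (c = -1)
        quotient <<= 1
        exponent -= 1
    mantissa = quotient & 0x7FFFFF
    bits = (sign << 31) | (exponent << 23) | mantissa  # step 8
    (result,) = struct.unpack(">f", struct.pack(">I", bits))
    return result

print(div32(7.5, 2.5))  # 3.0
```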

    Addition and subtraction

    We have combined addition and subtraction together because, considering different signs, adding a positive number a to a negative number -b is equivalent to a - b, i.e. subtracting the second number from the first. In fact, the hardware that performs binary32 addition and subtraction uses the same circuitry, so these operations have a close connection.

    Addition and subtraction are slightly more complicated than multiplication and division. This is because we have to find some way to align the numbers before adding them together, since the rules of exponentiation do not allow us to simply add the separate parts together. For some binary number 1.01 x 2^(-2) and another 1.10 x 2^3, we cannot add the parts together to get 1.11 x 2^1 (this is incorrect). Instead, we must make sure that the exponent k in 2^k is the same before we perform addition or subtraction. An idea is to make k some constant, say 0. So the first binary number will be represented as 0.0101 x 2^0 and the second as 1100 x 2^0. However, there is a way to speed up the calculation by only shifting one binary point. For this, we will shift only the number with the smaller value of k (the exponent) while keeping the other with the larger value of k. Hence we will have 0.0000101 x 2^3 and 1.10 x 2^3, such that we can do addition as follows:

    0.0000101
    1.1000000+
    1.1000101

    The steps for either addition or subtraction are as follows:

    1. Add the leading 1 to both values.
    2. Determine which value has the smaller exponent (smaller shift).
    3. For the value with the smaller exponent, move the binary point of this number by k(larger) - k(smaller) places to the left, where k is the exponent. As this movement is at maximum 23 for our purposes, we must store 48 binary digits in the result field. The result has an exponent of k(larger).
    4. Perform addition or subtraction as necessary, taking note of the signs of both numbers. We can somehow consider both the 'expanded' mantissa and the signs together. On pen and paper, we can simply append a minus to a negative number. However, some implementations may use what is called 2's complement, which basically makes a subtraction pretend to be an addition, but we will not cover this concept here. The result's sign is also recorded for later.
    5. If the number is too small or too big to be represented, handle this accordingly (not covered).
    6. Renormalise the number so that we have k(larger) + c where c is the shift after the renormalisation. If no renormalisation is required, c = 0.
    7. Round, breaking ties by rounding to even, the mantissa to 23 binary places.
    8. Combine the sign, exponent and mantissa together to get the answer.
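And a minimal Python sketch of my own for addition (two positive normal operands; aligning by a plain right shift quietly truncates the digits that fall off the end):

```python
import struct

def fields(x: float):
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def add32(x: float, y: float) -> float:
    sx, ex, mx = fields(x)
    sy, ey, my = fields(y)
    assert sx == sy == 0, "sketch handles positive operands only"
    px, py = (1 << 23) | mx, (1 << 23) | my      # step 1: leading 1s
    if ex < ey:                                  # step 2: find the smaller
        (ex, px), (ey, py) = (ey, py), (ex, px)
    total = px + (py >> (ex - ey))               # steps 3 and 4: align, add
    exponent = ex
    while total >= 1 << 24:                      # step 6: renormalise
        total >>= 1
        exponent += 1
    mantissa = total & 0x7FFFFF                  # step 7, simplified
    bits = (exponent << 23) | mantissa           # step 8 (sign is 0)
    (result,) = struct.unpack(">f", struct.pack(">I", bits))
    return result

print(add32(2.5, 0.625))  # 3.125
```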

    A more in-depth list of steps can be experimented with below. Do note the following behaviours for the interactive cell:

    Addition or subtraction of normal binary32 numbers.


    What I want to do:


      Sign  Exponent  Mantissa
      0     00000000  00000000000000000000000

      Congratulations! You now know how to do arithmetic on binary32!

      We are done with this page. Thanks for reading! I did this myself, so there will obviously be problems. Do notify me if there are problems that I didn't spot, thanks.