Floating Point Fun and Frolics

Floating Point Fun and Frolics

By Frances Buontempo

Overload, 17(91):, June 2009

Representing numbers in computers is a non-trivial problem. Frances Buontempo finds using them is hard, too.

It is unworthy of excellent persons to lose hours like slaves in the labour of calculation. Gottfried Wilhelm von Leibniz

This article is a summary of a recent discussion on accu-general concerning floating point numbers. First, we look at how numbers are represented on computers, starting with whole numbers and moving on to real numbers. Next, we consider comparing numbers stored in floating point representation, noting pitfalls and common mistakes along the way. The third section investigates performing calculations with floating point numbers, noting the significance the order of calculations makes and gives warning of clever tricks that might not actually be all that clever. Mathematical proofs of the accuracy of the algorithms and tricks presented are not given: the purpose of this article is simply to give a taster of what is possible and what should be avoided. Working with floating point numbers is hard, frequently mistake ridden, but with care and attention can be done sensibly.


Let us begin with non-negative integers or whole numbers. These can be easily represented in binary. Using 32 bits, we can represent numbers between 0 and 2 32 -1. For signed numbers, we could use a bit to represent the sign, and the remaining bits to represent a magnitude. Alternatively integers can be encoded using the more common two's complement. A non-negative number is stored as the binary representation, while a negative number -y is stored as 2 32 -y. This is equivalent to switching the bits and adding one. For example, let us see how -1 is represented. Flipping the bits on 1 gives


Adding 1 gives the representation of -1 as


As a sanity check, we should find if we add 1 we get zero.

       00000000  (ignoring the overflow carry bit) 

This representation allows subtraction to be performed using just the hardware for addition. Overflow can occur for either approach, since only a fixed number of bits are available. This is dealt with in different ways by different programming languages. Only a finite set of integers can be represented: work is required to represent arbitrarily large or small numbers.

Rational numbers, or fractions, can be represented symbolically by storing the numerator and denominator, though arbitrarily large integers will still need to be represented to cover all the rational numbers. This symbolic representation requires several extra calculations for basic arithmetic. It does not allow exact representation of irrational or transcendental numbers, such as ÷ 2 or p . The obvious alternatives are fixed or floating point encodings, allowing the representation of real numbers with a degree of precision.

Fixed point representation uses one bit for the sign, a field of a fixed number of bits before the decimal point, and a fixed length field for the fractional part after the point. If 32 bits are used, with one bit for the sign, 8 bits before and 23 bits after the decimal point, we can only represent numbers between 2 -23 and 2 8 - 1. In addition, numbers can now underflow (anything less than 2 -23 ), as well as overflow (anything greater than or equal to 2 8 ), and the numbers we can represent are not dense (in lay person's terms they have gaps in-between, unlike rational or real numbers: whichever two you think of, no matter how close, you can always find another one in between) and so we will have rounding errors to consider.

An alternative is floating point, based on scientific notation. Any real number, x , is written as a sign followed by a mantissa or significand, S, and a magnitude or exponent, E, of a base, usually 2 or 10. Nowadays, computers commonly use base 2. For example:

x = ± S . 10 E

The decimal point 'floats' as the exponent changes, hence the name. Under-and overflow still occur if a fixed number of bits are used for the exponent and mantissa, however a greater range of numbers can now be represented than with fixed point.

This general sketch of floating point representation gives much scope for discussion. How many digits should the significand and the exponent be? How do you represent the sign of the exponent? ' Anarchy, among floating-point arithmetics implemented in software on microprocessors, impelled Dr. Robert Stewart to convene meetings in an attempt to reach a consensus under the aegis of the IEEE. Thus was IEEE p754 born. ' [ IEEE98 ]

IEEE defines several representations of floating point data, along with negative zero, denormal or subnormal numbers, infinities and not-a-numbers (NaNs) [ IEEE87 ] . It also covers rounding algorithms and exception flags. As with any floating point representation, a finite number consists of three parts: a sign, a significand (or mantissa), and an exponent. This triple is encoded as a bit string, whose length varies depending on which representation is being used. The IEEE standard gives ' three basic binary floating-point formats in lengths of 32, 64, and 128 bits, and two basic decimal floating-point formats in lengths of 64 and 128 bits' . [ IEEE87 ] A conformant programming environment provides one or more of these formats. The base may be either 2 or 10, and each limits the possible ranges for S, giving the precision, and E, giving minimum and maximum possible exponents, emin and emax.

Of course, the floating point representation of a specific number will not be unique without a convention: + 1.0 x 2 0 == + 0.1 x 2 1 . In order to give a unique encoding for floating point number, the significand S is chosen so that either E = emin, or S>=1. If S<1, the number is denormal or subnormal, having reduced precision but providing gradual underflow. They cause special issues for rounding: the error bounds on underflow differ from those on normal numbers. For normal numbers using base 2 the rounding error will be at most one unit in the last place, ulp, 2 -( p -1) , where p is the precision. This does not hold for denormal numbers. Without these numbers, the gap between zero and the smallest representable number is larger than the gap between two successive small floating point numbers. More details on their controversial introduction are given in [ IEEE98 ] .

The signed bit is 0 for positive and 1 for negative for all representations. For a 32-bit format, 8 bits are used for the exponent, ranging between -126 and 127. This is stored as a biased exponent adding 127 to get a value in the range 1 to 254. For double precision the bias is 1023. For single precision, exponents of zero and 255 have special meanings, leaving 23 bits. If we are representing normalised numbers in binary, they all start 1. x 1 x 2 x 3 ..., so there is no point in storing the 1. If the exponent is all zero, we are dealing with denormalised numbers, 0. x 1 x 2 ... .

Let us consider the bit pattern of some single precision binary numbers.

      0 1......8 9.................... 31  

Note first, we have two zeros:

      0 00000000 00000000000000000000000 = 0  
      1 00000000 00000000000000000000000 = -0  

As mentioned, all the ones for the exponent have special meanings:

      0 11111111 00000000000000000000000 = Infinity  
      1 11111111 00000000000000000000000 = -Infinity  
      0 11111111 00000100000000000000000 = NaN  
      1 11111111 00100010001001010101010 = NaN  

Otherwise, for non-zero exponents we have

        0 10000001 10100000000000000000000  
      = +1 * 2129-127 * 1.101 = 6.5  
        1 10000001 10100000000000000000000   
      = -1 * 2129-127 * 1.101 = -6.5  

And finally we have denormal numbers, such as

        0 00000000 00000000000000000000001  
      = +1 * 2-126 * 0.00000000000000000000001  
      = 2-149

For w exponent bits, the biased exponent allows every integer between 1 and 2 w -2 inclusive to encode normal numbers, the reserved value 0 is used to encode ±0 and subnormal numbers, and the reserved value 2 w -1 to encode ±Inf and NaNs (quiet and signalling). A quiet NaN is the result of an invalid operation, such as 0/0 or a operation with a NaN for one operand, and silently propagates through a series of calculations. Whenever a NaN is created from non-NaN operands, an INVALID flag is raised. A signalling NaN must be explicitly created and raises a floating point exception when accessed which could be picked up by a trap handler. They can be used to detect uninitialised variables. As we have seen, all the bits in the biased exponent field of a binary NaN are 1. In addition, a NaN will contain diagnostic information in its 'payload', giving a family of NaNs. Note that NaNs ' compare unordered with everything, including itself '. [ IEEE87 ] . Infs are constructed as the limits of arithmetic with real numbers, allowing operations with them to be exact rather than signalling an exception event. As an aside, note that IEEE allows decimal floating point numbers to have more than one encoding: numerically equal 'cohorts' are identified. Different cohorts can then be chosen for different operands. The specifics of Infs and NaNs differ, but they are still easily represented.

Mapping a real number to a floating point representation may incur rounding, which signals an inexact event. There are five IEEE rounding options. It can be round to the nearest number: for 'ties', such as 4.5, rounding is either to the nearest even number, here 4.0, which is the default, or away from zero: for 4.5 this will be 5.0. The former, also known as bankers' rounding, makes sense, as it will sometimes round down, and sometimes round up, so in theory all the small differences will come out in the wash. Consider 0.5 + 1.5. This should be 2.0. If we round each number away from zero to the nearest whole number, we calculate 1.0 + 2.0 = 3.0. If bankers' rounding is employed, the sum is 0.0 + 2.0 = 2.0, giving the correct answer. Other forms of round can be set: round towards zero, or truncation, and round towards plus or minus infinity.

Exceptions are error events, rather than language specific exceptions. There are five possible events: underflow, overflow, division by zero, inexact and invalid operations, such as square root of a negative number. By default, a flag will be set to indicate if such an event has occurred, though this behaviour can be overridden for example by installing a trap handler. The specifics are platform dependent. Once a flag has been set, it is 'sticky', that is it remains set until explicitly cleared.

Many gotchas can occur with floating points. Rounding errors occur before any calculation for some numbers when more bits are required than are available in the representation. Similarly underflow or overflow can cause problems. This includes comparison and calculations. Unintuitive things occur, since floating point operations are neither commutative nor associative. Furthermore, some compiler optimisers assume algebraic properties hold and make mistakes. What do you think x1 and x2 will be in your favourite C++ compiler?

      double v = 1E308;  
      double x1 = (v * v) / v;  
      double temp = (v * v);  
      double x2 = temp/v;  

See [ Monniaux08 ] for further fun and frolics your compiler, platform and optimiser can cause.


Let's start with a simple example of what can go wrong when floating point numbers are compared. Consider:

      bool test = 0.10000000000000001   
         == 0.10000000000000000;  

What is the value of test? What would a 10 year old tell you? Your computer will tell you the value on the left hand side is in fact equal to the one on the right, whether you try this in C, C++, Python or Haskell. This happens because of the underlying IEEE representation of the number. Simply put, ' Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation '. [ IEEE87 ] So, how should floating point numbers be compared?

Numbers could be compared to a pre-specified number of decimal places [ ACCU09 ]

      bool equal(double x, double y, double places)  
        return floor(x * pow(10.0, places) + 0.5) ==  
           floor(y * pow(10.0,places) + 0.5);  

Multiplication by 10 places moves the required number of digits to the left of the decimal place. Addition of 0.5 ensures that numbers ending with the digit 5 are rounded up. This is a common trick. This comparison will be unsatisfactory in general. How many decimal places are required? What happens for really big or really small numbers? Comparison to a specific number of decimal places is probably not sensible. Instead, what is required is a function that detects whether two numbers are close. The decimal place comparison above can be adopted to check if two numbers are equal to a given number of significant figures. Alternatively, the relative difference can be employed.

Consider the function [ ACCU09 ]

      bool equal(double x, double y, double precision)   
        return fabs(x-y) < precision;  

This certainly checks if two numbers are within a given range of one another. However, it will be hard to use, as the precision is an absolute difference, and must vary with the two numbers being compared. What value is suitable for

      equal(1000000000001,1000000000002, ???)? 

What about

      equal(0.000000000001,0.000000000002, ???)?

Epsilon may spring to mind, for example in C++, std::numeric_limits<double>::epsilon() .

This is the smallest number that can be added to 1.0 to obtain a number that doesn't compare equal to 1.0 The C++ standard defines ' Machine epsilon: the difference between 1 and the least value greater than 1 that is representable ' [ C++03 ] . I have seen code out there in the wild that just checks if two numbers are within epsilon of one another. This will (almost) never be what is required. For bigger or smaller numbers, we require a multiple of epsilon, or our chosen precision, giving the relative difference. [ ACCU09 ]

      bool equal(double x, double y, double precision)  
        return fabs(x-y) < fabs(x)*precision;  

This relative error approach still raises questions. How do we decide whether to multiply by x or y ?

I suspect rounding errors, particularly ones in production systems initiating support calls at 3am, cause programmers to first start thinking about comparing floating point numbers. Numbers are required to be close enough, allowing for computational rounding. We can check if two numbers are within the worst-case difference that can be caused by rounding errors. For one rounding error, resulting from one calculation, the absolute difference will be less than fabs(x)*epsilon , as used in the relative error above. For n IEEE floating point operations, a seemingly sensible comparison is therefore [ ACCU09 ]

      fabs(x-y) <= double(n)*fabs(x)*epsilon;  

However, what happens if x is zero? We then check the difference is less than or equal to zero, which is not the intention. It is safer to use [ ACCU09 ]

      fabs(x-y) / sqrt(x*x + y*y + epsilon*epsilon)   
        < n * epsilon;  


Comparison of floating point numbers is just the beginning of the matter: the order of calculations can also make a difference. Consider three ways of summing some floating point numbers. See Figure 1.

  double sum1 = 1000000000.0 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
                0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
  double sum2 = 1000000000.0 + (0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
                0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
  double sum3 = 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
                0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 + 0.00000001 +
Figure 1

This gives sum1 as 1000000000.0000000, and sum2 and sum3 as 1000000000.0000001. Addition is neither associative nor commutative in floating point arithmetic, giving a very strange number system. Sorting the numbers to be summed so that the smallest numbers are aggregated first will be more accurate, however this will be time consuming. A better approach is the Kahan Summation Formula [ Goldberg91 ] . This takes into account the lower order bits, avoiding some rounding errors as it runs. For the sum of an array of n floating point numbers, x, Kahan's summation is calculated as follows:

      double sum = x[0];  
      double small_bits = 0.0;  
      for( int j = 1; j < n; ++j )  
        double y = x[j] + small_bits;  
        double temp = sum + y;  
        small_bits = y - (temp - sum);  
        sum = temp;  

This illustrates one approach to calculations - compensating for lower order bits that would otherwise get lost. On the first step round the loop, for sum1 , sum = 1000000000.0 , so the smaller 0.00000001 has got lost. However, it is remembered in the small_bits , and will be taken into account in the final result.

We have seen how addition can cause problems: so can subtraction. Catastrophic cancellation occurs if two numbers are close and are the result of previous calculations, so have inaccuracies in the lower digits. Most of the accurate digits may cancel as a result of subtraction, leaving behind cruft in the lower order digits. On the other hand, if the operands are exactly known rather than calculated their difference is guaranteed to have a very small relative error, which is referred to as benign cancellation. Consider, the mathematically identical formulae x 2 - y 2 and ( x + y )( x - y ). Though they are identical algebraically, they may not be computed to be equal. [ Goldberg91 ] demonstrates why the right hand side is usually a more appropriate computational approach. Intuitively, when x and y are close, x 2 - y 2 will be very small, giving a potentially large relative error. However, although the expression ( x - y )( x + y ) does not cause a catastrophic cancellation, it is slightly less accurate than x 2 - y 2 if x is much smaller than y or vice versa . In this case, ( x - y )( x + y ) has three rounding errors, but x 2 - y 2 has only two since the rounding error committed when computing the smaller of x 2 and y 2 does not affect the final subtraction. [ Goldberg91 ]

Sometimes it is tempting to rearrange a mathematical formula for greater efficiency, but mathematically equivalent formulae have different 'stability properties' [ Highham96 ] . Consider calculating the variance of an array of numbers.

      variance = 1/(n-1) sum( x-mean(x) )  


      mean = 1/n(sum(x))  

This involves two passes through the numbers, once to calculate the mean and a second time to calculate the difference from the mean. Mathematically this is identical to the sum of squares method [ Cook ]

      variance = 1/n(n-1) (n.sum(xi^2) - (sum(xi)^2))  

This second formulation can lead to difficulties. For example, the variance can become negative if the numbers are large with comparatively small differences. Try it with 1.0e09 + 4, 1.0e09 + 8, 1.0e09 + 13. This gives about -156.77. This will cause trouble if we wish to calculate the standard deviation, that is the square root of the variance.

A better approach is Welford's method [ Welford62 ] , which is an 'on-line algorithm'-- that is, it provides the new result in one step for a new value of x , rather than having to recalculate everything from the start, so will give us the efficiency we require but without the potential instability. This is achieved via a recurrence relationship or iteration, which calculates the new value from the previous value. If we have the current mean for k - 1 numbers, for a new value x

      mean = previous_mean + (x - previous_mean)/k  

The variance can then be calculated from

      sum_variance = previous_sum_variance +  
                     (x - previous_mean)*(x-mean)  

as variance = sum_variance / k;

This version is both a one-pass algorithm and does not suffer from the same potential rounding problems as the sum of squares method. Any formula can be expressed in more than one way. It is worth considering the rounding errors that can occur and looking for ways to rearrange the order of the calculation to avoid them. As a quick finger in the air test of a formula, compare the output for the same sequence of numbers presented in different orders: if they differ you probably have a rounding error. Then try different rearrangements try for your formula, with the inputs presented in different orders. This can be a quick way to detect rounding errors. Another approach is to try one formula with different rounding modes set. Furthermore, your optimiser can also wreak havoc with your carefully laid plans. Changing your optimisation settings, re-running your program and examining the results can reveal some fun and frolics.

Other representations

Each of the rounding errors considered so far stem from the way real numbers are represented on a machine. Many real numbers cannot be represented precisely in binary using a fixed number of bytes. There are many ways to represent the real numbers on a computer. These include binary-coded decimal (BCD) [ Wikipedia ] , p-adic numbers [ Horspool78 ] and IEEE floating point standard [ IEEE87 ] .

BCD typically uses four bits to represent each digit. For example, 127 would be encoded as

   0001 0010 0111  

It takes more space than floating point representation, but decimals like 1/10 which recur in binary (0.0001100110011...) can be represented exactly. Using base ten, any rational number which cannot be written with a denominator that is an exact power of 10 recurs, for example 1/3 is 0.333..., while 1/2 = 5/10 = 0.5. Similarly using base two, any rational number whose denominator is not an exact power of two will require a recurring representation. If a floating point representation with a fixed precision is used, such numbers will be rounded. In BCD, as many decimal places as required can be used. Other benefits include quick conversion of BCD numbers to ASCII for display since each digit is in a byte. More circuitry is required for calculations in BCD, though the calculations are still straightforward.

Consider adding 9 and 8 in binary coded decimal. Adding the digits as one would usually gives

         + 1000
      0001 0001  

The first 0001 is a carry digit. If the number were in binary this would be sixteen, but since it only represents ten the extra six, 0110 , must be added whenever we have a carry digit:

         + 1000
      0001 0001  
         + 0110
      0001 0111  

This represents a one followed by a seven, giving 17 as required. For any carry digits or invalid BCD numbers ( 1010 to 1111 ), six must be added to give a valid BCD.

Other less well-known representations of the rational numbers are possible. One I have come across recently is the p-adic numbers [ Horspool78 ] . They can represent any rational number exactly. It uses a compact variable length format. Addition, subtraction and multiplication are familiar and straightforward. Division is slightly more difficult.

The numbers are represented as a series ∑ d i b i , where d i are the digits and b is the base. The p-adic numbers do not require a decimal point and many of them involve an infinite, but repeating, sum. The natural numbers are represented as they would usually be in the chosen base. For example, using base two, 9 is 1001 . If a prime number is chosen as the base each rational number can be represented uniquely. For negative numbers using base two, note that

      +  00001

This means -1 can be represented by ...11111 . Though this sequence repeats indefinitely, it repeats and so can be represented finitely. Some irrationals and complex numbers can be represented. Consider rational numbers. For example 1/3 is ...0101011, since three of them sum to unity:

       + 0101011  
       + 0101011

[ Horspool78 ] gives further details on this interesting representation of real numbers.

Further issues

This article has looked at representations of numbers on computers, various ways to compare floating point numbers, touched on issues concerning calculations, including the order of operations and tricks to compensate for rounding errors. Many other issues have not been considered including platform specifics, such as such as how to control IEEE rounding modes, and how to find out if an exception event has signalled. Other more general points have not been raised, including troubles that occur when floating point numbers are streamed into a string.

What does this output?

      #include <iomanip>  
      #include <iostream>   
      int main()  
        std::cout << std::fixed << std::setprecision(2)  
           << 5000.525 << "\n";  

VS2005 rounds down to 5000.52, when you might expect to get 5000.53. What does your compiler do with this?

      double x = 0.00000000000001234567890123456789;   

Various myths concerning floating point calculations abound. Hopefully it is now clear why the following are myths:

  • Since C's float (or double) type is mapped to IEEE single (or double) precision arithmetic, arithmetic operations have a uniquely defined meaning across platforms.
  • Arithmetic operations are deterministic; that is, if I do z=x+y in two places in the same program and my program never touches x and y in the meantime, then the results should be the same.
  • A variant: 'If x < 1 tests true at one point, then x < 1 stays true later if I never modify x'.
  • The same program, strictly compliant with the C standard with no 'un-defined behaviours', should yield identical results if compiled on the same IEEE-compliant platform by different compliant compilers. [Monniaux08]

I am still left with the feeling I have more to learn and consider regarding floating point numbers. I suspect there is still much I do not fully understand or appreciate, however that is enough for now. As Wittgenstein said, ' Whereof we cannot speak, we must remain silent '.


A big thanks to everyone who helped review this - you know who you are.


[ACCU09] accu-general discussion, March 2009

[C++03] C++ standard 2003.

[Cook] http://www.johndcook.com/standard_deviation.html

[Goldberg91] David Goldberg 'What every computer scientist should know about floating-point arithmetic' ACM Computing Surveys, Vol 23(1), pp 5-48, March 1991

[Highham96] 'Accuracy and stability of numerical algorithms' Nicholas J. Higham, 1996

[Horspool78] R.N.S.Horspool, E.C.R.Hehner: 'Exact Arithmetic Using a Variable-Length P-adic Representation', Fourth IEEE Symposium on Computer Arithmetic, p.10-14, Oct 1978

[IEEE87] IEEE Standard 754-1985 for Binary Floating-point Arithmetic, IEEE 1987

[IEEE98] IEEE 754: 'An Interview with William Kahan' Charles Severance, IEEE Computer, vol. 31(3), p. 114-115, March 1998

[Knuth] Knuth. The Art of Computer Programming.

[Monniaux08] 'The pitfalls of verifying floating-point computations' David Monniaux May 2008. hal-00128124 http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-pointarticle.pdf

[Welford62] B. P. Welford 'Note on a method for calculating corrected sums of squares and products'. Technometrics Vol 4(3), pp 419-420, 1962 (cited in [Goldberg91])

[Wikipedia] http://en.wikipedia.org/wiki/Binary-coded_decimal

Your Privacy

By clicking "Accept Non-Essential Cookies" you agree ACCU can store non-essential cookies on your device and disclose information in accordance with our Privacy Policy and Cookie Policy.

Current Setting: Non-Essential Cookies REJECTED

By clicking "Include Third Party Content" you agree ACCU can forward your IP address to third-party sites (such as YouTube) to enhance the information presented on this site, and that third-party sites may store cookies on your device.

Current Setting: Third Party Content EXCLUDED

Settings can be changed at any time from the Cookie Policy page.