Floating-Point Comparison

By Paul Floyd

Overload, 31(173):4-6, February 2023


Comparing floating-point values can be difficult. Paul Floyd shows how you should perform floating-point comparisons (and how not to do it).

It is often said that you shouldn’t compare two floating-point numbers for equality. To quote one erstwhile guru:

Any programmer claiming competence that uses equality between floating point values clearly requires re-education on this point.

I agree that in general you need to take care with floating-point comparisons, but there are times when using == and != is the right thing to do. There are two major problems that I see:

  1. You don’t want to use a tolerance on every single floating-point comparison. Not knowing what you are doing and just randomly sprinkling tolerances into your comparisons won’t make the problems go away.
  2. Often the way that the tolerance is used is just plain wrong.

When you use a tolerance to make a floating-point comparison you are in effect saying, “This value has some inaccuracy, so I’ll do an inaccurate comparison to compensate for it”. The two questions that need to be answered are a) is that really the case and b) how inaccurate should the comparison be?

A lot depends on your domain of application. First, let’s consider mathematical libraries. Here you want to strive to get as much accuracy as possible (and indeed the precision may be mandated by [IEEE754]). I’ll take the C standard library ‘pow’ function as an example. One implementation can be found on GitHub [GNU].

Consider this check

  if (y == 0)
    return 1.0;

I think that it would be an extremely bad idea to use a tolerance here. I expect x**0 to be 1. I do not expect x**1e-5 to also be 1 (unless x is 1 or close to it). There are similar tests for x**1, x**2 and x**-1. There is not a single use of a tolerance in this function, yet there are several floating-point comparisons.

Another situation where you shouldn’t use a tolerance is if you are doing 3-way testing like

  if (a < b) {
    // handle less than
  } else if (a == b) {
    // handle equality
  } else {
    // handle greater than
  }

This is fine as it is and using a tolerance instead of a == b just adds gratuitous inaccuracy.

If your numbers come from physical observations, then your precision is probably way below the precision of double and even float. Double has a precision of about 1 part in 10^16 (1 in ten quadrillion). That’s about a thousand times more precise than the most accurate measurements that we can make [StackExchange].

Even if your data doesn’t come from some inexact source, if you do any sort of operation on the data then there will be some rounding error. One common case that often surprises the uninitiated is the conversion from base 10 strings to binary floating-point (via functions like atof, strtod or std::stod). This doesn’t give an exact result in the sense of being identical to the original base 10 value. Furthermore, there will be rounding errors on most floating-point operations. The analysis of floating-point errors is too big a subject to cover even superficially here.
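
For example, the nearest double to 0.1 is not exactly 0.1, and simple sums inherit the error:

  #include <cstdio>
  #include <string>

  int main()
  {
    // std::stod returns the representable double nearest to "0.1";
    // the decimal value itself has no finite binary representation
    double tenth = std::stod("0.1");
    std::printf("%.20f\n", tenth);         // 0.10000000000000000555...
    // rounding error also shows up in simple arithmetic
    std::printf("%d\n", 0.1 + 0.2 == 0.3); // prints 0
  }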

If you’d like to delve deeper into the subject, I recommend Accuracy and Stability of Numerical Algorithms [Higham02]. For an overview of techniques to mitigate rounding errors, see the series of articles published by the late Richard Harris [Harris11]. Thirdly, there is ‘What every computer scientist should know about floating-point arithmetic’ [Goldberg91] which gives a thorough explanation of floating-point. All I will say is that if your numerical code is not too badly written then rounding errors will generally accumulate slowly in the manner of a drunkard’s walk.

Potential errors

What could happen if you don’t use a tolerance in a floating-point comparison? The most immediate thing is that the wrong branch of your code could execute.

  if (a == b) {
    action1();
  } else {
    action2();
  }

In the above code, there is the risk that a and b are very close but not equal and action2() gets performed rather than action1().

One possibly worse situation that can occur is that your code hangs in a loop.

  while (estimation != answer) {
    estimation = refinement(estimation);
  }

I say possibly worse since at least if this does get stuck in an infinite loop it will be easy to debug and fix. Algorithms like this that use progressive refinement are common in numerical analysis.
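
To make this concrete, here is a small sketch of such an algorithm (the function newtonSqrt is my illustrative example, not taken from any library): Newton’s method for a square root, terminated with a relative tolerance rather than exact equality.

  #include <cmath>
  #include <cstdio>

  // A sketch: Newton's method for sqrt(x), assuming x >= 0.
  double newtonSqrt(double x)
  {
    if (x == 0.0) {
      return 0.0;  // a legitimate use of == on floating point
    }
    double est = x > 1.0 ? x : 1.0;  // crude initial guess
    for (;;) {
      double next = 0.5 * (est + x / est);
      // Stopping only when next == est risks oscillating forever
      // between two adjacent representable values; a small relative
      // step is a safer termination test.
      if (std::fabs(next - est) < 1e-15 * est) {
        return next;
      }
      est = next;
    }
  }

  int main()
  {
    std::printf("%.17g\n", newtonSqrt(2.0)); // ~1.4142135623730951
  }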

How not to perform floating-point comparisons

For pedagogical purposes, I’ll start with some bad code that shows how not to perform floating-point comparison. This will build up to something reasonably functional. All my examples use double but apply equally to float and long double.

My first example is the worst code. Don’t do this.

  bool cmpEq(double a, double b)
  {
    return std::fabs(a - b) < 
      std::numeric_limits<double>::min();
  }

I’ve seen this in production and it is almost totally wrong. This applies equally to the macro DBL_MIN. The problem is that the ‘min’ value here is the smallest possible non-denormalized value of a double, typically something like 2e-308. For the expression to be true, either a and b have to be equal or their difference has to be denormalized, which in practice requires a and b themselves to be tiny. (For those who are not familiar with denormalized numbers, they are a special subset of the floating-point numbers that extend the smallest representable magnitude down to approximately 5e-324, at the expense of reduced precision and slower CPU execution.)

To all intents and purposes, this code only gives a false sense of security and behaves like operator==.

The intention here was probably to use std::numeric_limits<double>::epsilon(). Epsilon is the difference between 1.0 and the next representable double, so it represents the limit of numerical precision around the value 1.0. The precision is determined by the bits of the hardware: 53 binary digits for 8-byte doubles, which is roughly 16 decimal digits.

This is not the same thing as the accuracy, which is the measure of error of a sequence of computations.
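
A short demonstration of what epsilon means in practice:

  #include <cmath>
  #include <cstdio>
  #include <limits>

  int main()
  {
    const double eps = std::numeric_limits<double>::epsilon();
    // epsilon is the gap between 1.0 and the next double up
    std::printf("%d\n", std::nextafter(1.0, 2.0) - 1.0 == eps); // 1
    // half of epsilon rounds straight back to 1.0
    std::printf("%d\n", 1.0 + eps/2.0 == 1.0);                  // 1
    // the gap between adjacent doubles scales with magnitude:
    // around 1e10 it is roughly eps*1e10, about 1.9e-6
    std::printf("%g\n", std::nextafter(1e10, 2e10) - 1e10);
  }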

Since I’ve mentioned epsilon, how about using that? Quite often I see code like

  bool cmpEq(double a, double b)
  {
    return std::fabs(a - b) < 
      std::numeric_limits<double>::epsilon()*10.0;
  }

(DBL_EPSILON could be used instead of std::numeric_limits<double>::epsilon()).

I’ve somewhat arbitrarily multiplied it by a factor of 10 to give a bit of margin. But what if a and b are far away from 1.0? If both a and b are much smaller than 1.0 then the above comparison will always be true. Let’s say a is 1e-18 and b is 1e-24. a is a million times bigger than b yet the above function will say that they are equal. That may not be what you want. Now what if a and b are much greater than 1.0? Let’s say a is 1e10 and b is 1.0000000000001e10. That’s a difference that could arise from rounding error, but the difference is still 1e-3 which is much greater than the 2e-15 tolerance used above. In short, the above function will only work well for values that are not too far from 1.0. That might be a big restriction.
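
Both failure modes are easy to demonstrate (the function above is repeated to keep the demo self-contained):

  #include <cmath>
  #include <cstdio>
  #include <limits>

  bool cmpEq(double a, double b)
  {
    return std::fabs(a - b) <
      std::numeric_limits<double>::epsilon()*10.0;
  }

  int main()
  {
    // a is a million times b, but both are far below 1.0
    std::printf("%d\n", cmpEq(1e-18, 1e-24));             // prints 1
    // a plausible rounding error on values far above 1.0
    std::printf("%d\n", cmpEq(1e10, 1.0000000000001e10)); // prints 0
  }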

For my third bad comparison function, I’ll introduce the notion of relative tolerance (reltol). The idea behind a reltol is to have a tolerance that scales with the values being compared which avoids the problems with big and small values that the previous version suffered from. Unfortunately, there are still weaknesses and several ways to get this wrong.

  bool cmpEq(double a, double b)
  {
    double reltol = std::fabs(a)*1e-7;
    return std::fabs(a - b) < reltol;
  }

There is a potential problem with this function: it is not commutative when the tolerance factor is large enough. More specifically, if the factor used for the tolerance is larger than sqrt(DBL_EPSILON) (which is about 1.5e-8) and a and b differ by a factor between (1.0 + EPSILON) and (1.0 + EPSILON + EPSILON^2), where EPSILON is the tolerance factor, then cmpEq(a, b) and cmpEq(b, a) give different results.

This can be illustrated by Listing 1, which outputs only “then these should also equal”. That can be the source of nasty bugs that are hard to track down (for instance, if this function were used as a predicate that must provide a strict weak ordering, as required by std::sort or the ordered containers std::map and std::set).

#include <iostream>
#include <limits>
#include <cmath>

const double EPSILON = 1e-7;

bool cmpEq(double a, double b)
{
  double reltol = std::fabs(a) * EPSILON;
  return std::fabs(a - b) < reltol;
}

int main()
{
  double testValue = 42.0;
  double otherValue = testValue * 
    (1.0 + EPSILON + EPSILON * EPSILON / 2.0);
  if (cmpEq(testValue, otherValue)) {
    std::cout << "if these are equal then \n";
  }
  if (cmpEq(otherValue, testValue)) {
    std::cout << 
    "then these should also equal\n";
  }
}
Listing 1

The second problem is that it doesn’t handle infinity nicely. Infinity is sticky, which means that the reltol will also be infinite. So, if both a and b are infinite then std::fabs(a - b) is NaN (infinity minus infinity is NaN) and the comparison std::fabs(a - b) < reltol is false, but Inf == Inf is true.

There are a few things that we can do to fix these problems. Firstly, we can apply the reltol factor to a combination of a and b. But what combination? I’ve seen their sum, average, min and max used. The sum and average have the disadvantage of requiring checks that there is no floating-point underflow or overflow, so I don’t recommend using them. That leaves min and max. The only difference is that the choice will slightly narrow or widen the tolerance, respectively. Since the reltol factor itself is somewhat arbitrary, that’s not a big difference. I’ve seen the min version referred to as “essentially equal to” [Knuth97], and the Boost documentation refers to the max version as “close enough with tolerance” and the min version as “very close with tolerance”.
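
To make the two variants concrete, here is a sketch of both (the names cmpEqMin and cmpEqMax are mine):

  #include <algorithm>
  #include <cmath>

  // the min version, 'essentially equal to' [Knuth97]:
  // tolerance scaled by the smaller magnitude, the tighter test
  bool cmpEqMin(double a, double b, double epsilon)
  {
    return std::fabs(a - b) <=
      std::min(std::fabs(a), std::fabs(b)) * epsilon;
  }

  // the max version, 'close enough with tolerance':
  // tolerance scaled by the larger magnitude, the looser test
  bool cmpEqMax(double a, double b, double epsilon)
  {
    return std::fabs(a - b) <=
      std::max(std::fabs(a), std::fabs(b)) * epsilon;
  }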

The problem with infinity can be fixed by adding a quick check for equality of the arguments.

So now we have Listing 2.

bool cmpEq(double a, double b)
{
  const double EPSILON = 1e-7;
  if (a == b) {
    return true;
  }
  double reltol = std::max(std::fabs(a),
    std::fabs(b)) * EPSILON;
  return std::fabs(a - b) < reltol;
}
Listing 2

This works with infinity (and NaN, which should never compare equal) as well as my corner case with values separated by a factor of about 1.0 + EPSILON.

Are we all done? Sadly not. Using just a reltol is not good for very small values that are close to zero. cmpEq(1e-85, 2e-85) will return false. In many situations, values this small are just numerical noise, possibly arising from cancellation, and should be considered equal to zero. To eliminate these, we must add back a test along the lines of our second version, using an absolute tolerance (abstol).

One final version is in Listing 3.

bool cmpEq(double a, double b, 
  double epsilon = 1e-7, double abstol = 1e-12)
{
  if (a == b) {
    return true;
  }
  double diff = std::fabs(a - b);
  double reltol = std::max(std::fabs(a),
    std::fabs(b)) * epsilon;
  return diff < reltol || diff < abstol;
}
Listing 3

This is still not completely safe. If a and b are very large and of opposite signs, then a - b could overflow. If a and b are both very small, then multiplying by EPSILON could underflow; moving the abstol comparison earlier would fix that, as sketched below.
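
A sketch of that reordering, with the abstol test hoisted above the multiplication that could underflow:

  bool cmpEq(double a, double b,
    double epsilon = 1e-7, double abstol = 1e-12)
  {
    if (a == b) {
      return true;
    }
    double diff = std::fabs(a - b);
    if (diff < abstol) {
      // checked before the reltol multiplication can underflow
      return true;
    }
    double reltol = std::max(std::fabs(a),
      std::fabs(b)) * epsilon;
    return diff < reltol;
  }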

That leaves one final question: how should you choose the relatively arbitrary constants used for abstol and the reltol EPSILON? For that you need to apply your domain knowledge, as unfortunately there is no one-size-fits-all solution. It also depends on your accuracy requirements. I work in the domain of analogue microelectronic circuit simulation, where voltages are typically 1V and currents are typically 1µA. A good rule of thumb is to have an EPSILON for reltol of something like 1e-4 and an abstol of something like 1e-6 times your typical domain values. So, for circuit simulation that would be a voltage abstol of 1e-6 and a current abstol of 1e-12 [Kundert95].
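
Concretely, that rule of thumb might translate into thin wrappers over Listing 3’s cmpEq (the names here are illustrative):

  // circuit simulation: typical voltages ~1V, typical currents ~1µA;
  // abstol is 1e-6 times the typical value, reltol EPSILON is 1e-4
  bool cmpEqVoltage(double a, double b)
  {
    return cmpEq(a, b, 1e-4, 1e-6);   // abstol = 1e-6 * 1V
  }

  bool cmpEqCurrent(double a, double b)
  {
    return cmpEq(a, b, 1e-4, 1e-12);  // abstol = 1e-6 * 1µA
  }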

Existing implementations

Most unit test libraries have functions for performing floating-point comparisons, for instance the WithinAbsMatcher of [Catch2]. Boost has floating_point_comparison.hpp [Boost]. One thing that I don’t like about the Boost implementation is that it normalizes the difference by dividing by the values: floating-point division is slow, and I’d rather avoid it when possible. The other problem that I see is that it suffers from Boost template bloat: the header when pre-processed is about 54k lines, of which about 42k lines are code. That’s a lot for what is essentially a 5-line function. A functor close_at_tolerance is provided which tests with a relative tolerance. Despite my reservations, I would still prefer to see Boost being used rather than a wrong home-rolled comparator.
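
For example, a test using Catch2’s WithinAbs and WithinRel matchers might look like this (a sketch; check the documentation of your Catch2 version for the exact headers):

  #include <catch2/catch_test_macros.hpp>
  #include <catch2/matchers/catch_matchers_floating_point.hpp>

  TEST_CASE("floating-point comparison with matchers")
  {
    double x = 0.1 + 0.2;
    // absolute tolerance: |x - 0.3| <= 1e-12
    REQUIRE_THAT(x, Catch::Matchers::WithinAbs(0.3, 1e-12));
    // relative tolerance: within a factor of 1e-12 of 0.3
    REQUIRE_THAT(x, Catch::Matchers::WithinRel(0.3, 1e-12));
  }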

Summary

  1. Don’t blindly use tolerances in all floating-point comparisons.
  2. Don’t use std::numeric_limits<double>::min() or DBL_MIN for tolerances.
  3. Consider whether you need an absolute tolerance or not.
  4. Choose abstol and reltol EPSILON with care depending on the domain of application.
  5. Consider using existing and well-tested libraries.

References

[Boost] floating_point_comparison.hpp: https://www.boost.org/doc/libs/1_81_0/boost/test/tools/floating_point_comparison.hpp

[Catch2] catch_matchers_floating_point.hpp: https://github.com/catchorg/Catch2/blob/223d8d638297454638459f7f6ef7db60b1adae99/src/catch2/matchers/catch_matchers_floating_point.hpp

[GNU] e_pow.c: https://github.com/lattera/glibc/blob/master/sysdeps/ieee754/dbl-64/e_pow.c

[Goldberg91] David Goldberg ‘What every computer scientist should know about floating-point arithmetic’, published in ACM Computing Surveys, Volume 23, Issue 1 March 1991 and available at: https://dl.acm.org/doi/10.1145/103162.103163

[Harris11] Richard Harris wrote a series of four articles in Overload from 2010 to 2011.

[Higham02] Nicholas J. Higham (2002) Accuracy and Stability of Numerical Algorithms, 2nd Ed., SIAM, ISBN 978-0-898715-21-7

[IEEE754] 754-2019 – IEEE Standard for Floating-Point Arithmetic, available at: https://ieeexplore.ieee.org/document/8766229 (Non-free download.)

[Knuth97] Donald E. Knuth (1997) The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 3rd Ed, Addison-Wesley, ISBN 0-201-89684-2

[Kundert95] Ken Kundert (1995) Appendix A ‘Simulator Options’ from The Designer’s Guide to SPICE and Spectre, available at: https://designers-guide.org/analysis/dg-spice/chA.pdf. Details of the book are available at: https://designers-guide.org/analysis/dg-spice/index.html

[StackExchange] Physics: What is the most precise physical measurement ever performed? Question and answers available at: https://physics.stackexchange.com/questions/497087/what-is-the-most-precise-physical-measurement-ever-performed

Paul Floyd has been writing software, mostly in C++ and C, for about 30 years. He lives near Grenoble, on the edge of the French Alps and works for Siemens EDA developing tools for analogue electronic circuit simulation. In his spare time, he maintains Valgrind.





