Mutation Testing

By Filip van Laenen

Overload, 20(108):16-22, April 2012


We all know that testing improves our code, guarding against errors. Filip van Laenen asks how we know that the tests are comprehensive?

Quis custodiet ipsos custodes? (Who is guarding the guards?)

Juvenal (Satire VI, lines 347-8)

Unit tests guard the source code, verifying that the behaviour of the system is correct. But how do we know that the unit tests cover all the lines of the source code, all the branches, let alone all the paths through the source code? And how do we know that the unit tests test the important aspects of the source code? Indeed, it could very well be that the unit tests cover all lines, branches and even paths of the source code, but never test anything that really matters. Mutation testing can help provide an answer to all these questions.

What is mutation testing?

Mutation testing starts with source code and unit tests. At some place in the source code, it makes a small change, which we’ll call the ‘mutation’. Typical mutations include replacing an addition with a subtraction, negating or short-cutting conditions, changing the values of constants and literals, commenting out a line, and many more. The result of the mutation is a new system, which we’ll call the ‘mutant’. Mutation testing then runs the original unit tests on the mutant and checks whether any of them fail. Depending on the mutation and the quality of the unit tests, different things can happen. Sometimes only one or two unit tests fail. Other times, almost all the unit tests fail. And in some cases, none of the unit tests fail.

Let’s go through the different scenarios that can occur. The most straightforward case is the one in which only one, or perhaps two or three unit tests fail. This means that the mutation happened in a place that’s tested by the now failing unit tests, and that the change in behaviour due to the mutation is detected by the unit tests. In general, this should indicate that the unit tests test an important and relevant aspect of the system’s behaviour, and that the part of the source code where the mutation occurred matters for the system’s behaviour. We can therefore conclude that everything is in order with this mutant, and generate the next mutant to be tested by the unit tests.
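
To make the ‘killed mutant’ scenario concrete, here is a minimal Java/JUnit 4 sketch (the class and test are made up for illustration): if a tool mutates the addition below into a subtraction, this single test fails and thereby kills the mutant.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

class Adder {
  // Original code; a mutation tool might turn the + into a -.
  static int add(int a, int b) {
    return a + b;
  }
}

public class AdderTest {
  @Test
  public void addsTwoNumbers() {
    // Passes against the original, but fails against the 'minus' mutant:
    // 2 - 3 is -1, not 5, so this test kills that mutant.
    assertEquals(5, Adder.add(2, 3));
  }
}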

Alternatively, many, or in some cases almost all, of the unit tests could start to fail as a result of the mutation. This often happens when the mutation occurs in one of the fundamental and heavily reused parts of the system, such as the core engine or a project-specific framework. Again, many failing unit tests are a good sign, because it means that the part of the source code where the mutation occurred matters for the behaviour of the system.

It’s also possible that all the unit tests pass when they’re run against the mutant. This would obviously be the case when the mutation occurs in a part of the system that isn’t covered at all by the unit tests. But it could also be that the line where the mutation occurred is covered by unit tests, but that the behaviour of that particular part of the code isn’t tested by any of the unit tests. Imagine, for example, that the mutation happened in a log message. Although correct logging is important, it’s often only informative and seldom critical to a project. It’s therefore not unusual that a system’s unit tests don’t test the content of the log messages, and any mutations to the content in log messages will therefore remain undetected.

There may, however, be something else going on. There are cases where a change to the source code doesn’t change the behaviour of the system. Obviously, we’re not talking about changes like renaming variables or extracting small functions from larger functions. This sort of refactoring can hardly be called a mutation, and certainly doesn’t belong to the same category as replacing a multiplication with a division or switching the order of parameters in a function call. But what about replacing a less-than sign ( < ) with a less-than-or-equal-to sign (≤)?

Let’s consider two simple cases. Listing 1 illustrates the first case, where a less-than-or-equal-to sign in a function’s loop is replaced with a less-than sign. Clearly, this changes the behaviour of the system, and a simple unit test verifying that sum(1) = 1 would already detect the mutation. Listing 2 however illustrates another case, where the same type of mutation in a condition doesn’t change the behaviour of the system at all. (We assume that we’re dealing with immutable objects such as simple integers, and not objects where there may be a difference between equality and identity.) Notice that for certain input values, the execution path through the system may be different between the original system and the mutant, but the end result is still the same.

def sum(n) {
  s ← 0;
  for (i ← 1; i ≤ n; i++) {
    s ← s + i;
  }
  return s;
}

def sum_mutant(n) {
  s ← 0;
  for (i ← 1; i < n; i++) {
    s ← s + i;
  }
  return s;
}
Listing 1
def max(a, b) {
  return (a ≤ b) ? b : a;
}

def max_mutant(a, b) {
  return (a < b) ? b : a;
}
Listing 2

In general, a case like the one described in Listing 2 is called an equivalent mutation, since it results in a mutant that’s equivalent to the original system. For certain types of mutations, it’s not always easy to establish whether a specific mutation to a system is an equivalent mutation or not. Therefore, most mutation testing tools simply don’t include these types of mutation, but some do allow it if the user wants it.

Does mutation testing work?

It may sound like a strange concept to make changes to the source code of a system, and then run the original unit tests on the resulting mutant. It seems so obvious that some of the unit tests should start to fail when we do that, yet we already mentioned a few reasons why it doesn’t always happen. So does it improve the quality of our software, and if so, how?

As Walsh noted in his Ph.D. thesis in 1985, ‘mutation testing is more powerful than statement or branch coverage’ [Walsh85]. Many years later, Frankl, Weiss and Hu stated that ‘mutation testing is superior to data flow coverage criteria’ [Frankl97]. My personal experience with mutation testing is in line with these two statements. In fact, if I’m able to use mutation testing in a project, I usually don’t care that much any more about what my test coverage tool may be reporting. If you have good mutation test coverage of your source code, chances are you’re close to 100% test coverage anyway. Actually, mutation testing can often give you an impression of a ‘more than 100% test coverage’, as a small example that we’ll dig into shortly will show.

But why does it work? Andrews, Briand and Labiche give us a first clue about why this is so: ‘Generated mutants are similar to real faults’ [Andrews05]. Geist et al. provide us with a longer explanation, saying that ‘in practice, if the software contains a fault, there will usually be a set of mutants that can only be killed by a test case that also detects that fault’ [Geist92]. Could this be because in some cases, one of the mutants would actually be the correct system, but we simply haven’t written the correct unit tests yet to differentiate between the incorrect original system and the correct mutant? Finally, Wah adds to this that ‘complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will detect most complex faults’ [Wah95]. Replace ‘test data set’ with ‘unit test’, and it becomes clear that if you apply many simple mutations to your original source code, and your unit tests can detect all the mutants, the unit tests will be able to detect complex mutations too. Or in other words, if you’ve done the little things right, you have a good chance that you got the big things right too.

A practical example

Let’s walk through a simple, practical example, to show how mutation testing works in practice. Actually, the example is not just an academic exercise I made up for this article, but distilled from an eye-opening experience I had a couple of years ago while working on a similar piece of code.

Assume we want to implement a function that returns the maximum of a vector of integers. (In reality you should of course use the functions that are already available in your language or the standard libraries.) For simplicity, we ignore the case where the input vector could be null or an empty vector, and concentrate on vectors that have actual content. Listing 3 shows the first unit test, represented as an assertion, and the corresponding source code, a trivial implementation.

Assertion: max([0]) = 0

def max(a) {
  return 0;
}
			
Listing 3

Listing 4 adds a second unit test, and a simple implementation that would make both unit tests pass. Things get more interesting when we add a third unit test, as shown in Listing 5. In order to make all three unit tests pass, we add what could be the final implementation for our function. But the question is: are we really done yet? And did we really follow the TDD principle that we should always aim for the simplest possible implementation that could work?

Assertion: max([0]) = 0
Assertion: max([1]) = 1

def max(a) {
  return a.first;
}
			
Listing 4
Assertion: max([0]) = 0
Assertion: max([1]) = 1
Assertion: max([1, 2]) = 2

def max(a) {
  m ← a.first;
  foreach (e ∈ a)
    if (e > m)
      m ← e;
  return m;
}
Listing 5

Let’s have a closer look at the code in Listing 5. First of all, we have three unit tests. Second, test coverage is 100%, both in number of lines and branches, even including the implicit else clause of the if statement. One could wonder whether there’s really more to be wished for. But when we run mutation testing against the source code, we find out that there’s still something wrong. In fact, it complains that the mutant shown in Listing 6 survived all unit tests. This mutant is the result of short-cutting the condition of the if statement, or in other words, removing the condition and keeping only the positive branch of the if statement. Practically, this is done by replacing the content of the if statement’s condition with a simple true. But can it really be true that the mutant of Listing 6 survived all unit tests? After all, the unit tests did have 100% test coverage, both for lines and branches!

Assertion: max([0]) = 0
Assertion: max([1]) = 1
Assertion: max([1, 2]) = 2

def max(a) {
  m ← a.first;
  foreach (e ∈ a)
    if (true)
      m ← e;
  return m;
}
Listing 6

If we run the unit tests by hand on Listing 5, we quickly find that all the unit tests enter the loop of the function. Since we assigned the value of the first element in the vector to the local variable m, we pass into the implicit else clause during the first round. Only during the second iteration in the third unit test do we enter the positive branch of the if statement, since 2 is greater than 1. However, if we do the same exercise for the source code in Listing 6, we find that the three unit tests still pass. The only difference is that we assign the value of the first element of the vector twice to m, once at the beginning of the loop, and once inside the if statement’s positive branch. It turns out that with these three unit tests, there’s no real need to ever enter the else clause of the if statement.

Actually, if you think it through a bit, the source code of Listing 6 is equivalent to the source code of Listing 7. In fact, if we really wanted to follow the TDD principle that we should always aim for the simplest possible implementation that could work, we should have gone from Listing 4 to Listing 7, not Listing 5. We were probably still too eager to start programming, even though we tried to do our best to follow the principles of TDD, and jumped into implementing the full function one unit test too early.

Assertion: max([0]) = 0
Assertion: max([1]) = 1
Assertion: max([1, 2]) = 2

def max(a) {
  return a.last;
}
			
Listing 7

So where do we go from Listing 7? We add a new unit test, just like TDD says we should do. Listing 8 shows the fourth unit test – we just pick a vector where the last element isn’t the largest one – and we can finally finish the implementation of this function.

Assertion: max([0]) = 0
Assertion: max([1]) = 1
Assertion: max([1, 2]) = 2
Assertion: max([2, 1]) = 2

def max(a) {
  m ← a.first;
  foreach (e ∈ a)
    if (e > m)
      m ← e;
  return m;
}
Listing 8

So where did things really go wrong? There is, in fact, a subtlety in Listing 5 (and Listing 8) that tricked our test coverage tool into believing that we really had 100% branch coverage. Take a look at Listing 9, which is basically the same as Listing 5, but with a small yet important difference in the initialisation of the local variable m. Instead of using the value of the first element of the vector as the initial value, it uses -∞. This may look like a detail, but it makes a huge difference. The most important difference is that with the three unit tests from Listing 5 only, the implicit else clause of the if statement can’t be reached. Indeed, the condition of the if statement will evaluate to true in every iteration of every unit test. As a result, if we had chosen to implement the function as shown in Listing 9, and then run and checked the test coverage, we would never have said that we had 100% branch coverage, and we would have known that at least one test case was still missing.

Assertion: max([0]) = 0
Assertion: max([1]) = 1
Assertion: max([1, 2]) = 2

def max(a) {
  m ← -∞;
  foreach (e ∈ a)
    if (e > m)
      m ← e;
  return m;
}
Listing 9

This simple example taught us that even when our test coverage tool reports that we have 100% test coverage both in terms of lines and branches, mutation testing may still find problems. This is the reason why I wrote earlier that mutation testing sometimes gives the impression that it leads to a ‘more than 100% test coverage’, as if 100% test coverage isn’t enough. Fundamentally, the mutant of Listing 6 warned us that there was something wrong with the unit tests – in this case that there was at least one unit test still missing.

As a side-effect, this example also showed that it’s all too easy to break TDD’s principle that you should write only the simplest possible source code to make the unit tests pass. When I do presentations about mutation testing, the audience usually understands that something must be wrong in Listing 5, but only because otherwise it would just make a silly example. Of course, the presentation ‘tricks’ the audience into the wrong track, just like this article did with you (or at least tried to), but I’m convinced that most developers would never bother to write more than the first three unit tests if they had to implement a function like this. In fact, it’s not easy to explain to people that they really need a fourth unit test for this type of function when they’re not familiar with the concept of mutation testing. I consider it one of the main benefits of mutation testing that it sometimes ‘disturbs’ me when it throws a mutant like the one from Listing 6 at me. By now I’ve learned that when a mutant refuses to disappear, mutation testing is trying to tell me something, and I’m probably going to learn something new.

Relation to other testing techniques

Before we dive into the techniques used by mutation testing and the tools that exist, how does mutation testing relate to other testing techniques? Obviously, mutation testing depends on unit testing, because it needs the unit tests to run against the mutants. But in order to create good unit tests, I think Test-Driven Development (TDD) is absolutely necessary too.

We already discussed the relation with test coverage, i.e. that mutation testing is superior to test coverage and can even give an impression of a ‘more than 100% test coverage’. But that doesn’t mean mutation testing eliminates the need to run test coverage as part of your build cycle. As we’ll see later, mutation testing can consume a lot of resources, and it’s therefore usually a good idea to run test coverage before mutation testing as a sort of smoke test. If test coverage is low, mutation testing will almost certainly find lots of mutants that pass all the unit tests.

As far as I know, there’s no relation or overlap between mutation testing and static code analysis. One could imagine, though, that some of the mutants produced by mutation testing would be caught by static code analysis, e.g. when the mutation removes a statement closing a resource. It would therefore make sense to run some sort of static code analysis on the mutants, but I’m not aware of any mutation testing tools that actively do so. It should be noted, though, that many compilers include some static code analysis, and therefore run it implicitly as part of the compilation.

A lot of people confuse mutation testing with fuzz testing, but the two testing techniques are quite different. Fuzz testing produces random sets of input data, feeds them to the system, and then detects whether the system crashes as a consequence of the input data. Traditionally, it has been used a lot as part of the security testing of programs, but it has other applications too. The main differences between the two techniques are that fuzz testing operates on the input data, whereas mutation testing mutates the source code, and that fuzz testing is often a random process while mutation testing is a deterministic process.

That last aspect – the deterministic nature of mutation testing – often surprises people too. Most people connect the term ‘mutation’ with cosmic rays, radioactivity, Chernobyl, chickens with three legs, and therefore with random processes. But what really happens during mutation testing is that a tool goes through all the code in the system, tries to apply a fixed set of mutations to every piece of code, and then runs the unit tests on the resulting mutants. There’s no randomness in where the mutations are applied, and there’s no randomness in what sort of mutations are applied.
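
To underline that determinism, here is a minimal Java sketch of such a round; all the types are hypothetical, purely to show that both loops run over fixed, ordered collections:

import java.util.List;

// Hypothetical types, for illustration only.
interface CodeLocation {}
interface Mutant {}
interface MutationOperator {
  boolean canApplyAt(CodeLocation location);
  Mutant applyAt(CodeLocation location);
}
interface TestSuite {
  boolean anyTestFails(Mutant mutant);
}

public class MutationRound {
  // Walks all locations in a fixed order and applies a fixed operator
  // set at each of them; the same input always yields the same mutants.
  public static int countSurvivors(List<CodeLocation> locations,
                                   List<MutationOperator> operators,
                                   TestSuite suite) {
    int survivors = 0;
    for (CodeLocation location : locations) {
      for (MutationOperator operator : operators) {
        if (operator.canApplyAt(location)) {
          Mutant mutant = operator.applyAt(location);
          if (!suite.anyTestFails(mutant)) {
            survivors++;  // a mutant that passes all unit tests
          }
        }
      }
    }
    return survivors;
  }
}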

Is mutation testing new?

The answer to the question is a simple no. The technique was already described in the seventies [Lipton71, Lipton78], but didn’t gain much traction at that time. The problem was that many obstacles had to be overcome for it to become practical to run in real-life projects. First of all, unit testing didn’t break through until the turn of the century, and even then we had to wait a few years before TDD started to become mainstream. But as we’ll see, mutation testing can quickly become very time-consuming if we don’t have a good strategy to select the right unit tests, and therefore it remained mostly an academic tool until a few years ago. As a consequence, nobody ever tried to integrate mutation testing into an IDE, and even today that’s a part that’s still missing. This means that even though mutation testing was described more than forty years ago, it hasn’t been until very recently that it has become possible to actually use it in real projects.

Mutation testing techniques

If you want to build a mutation testing tool, there are three aspects that you have to consider. First of all, you need a way to inject mutations into the original source code in order to create mutants. You also have to consider which types of mutations you want to use. Not all types are useful, and especially the ones that can create equivalent mutations can be problematic. Finally, you need to find a strategy to select the unit tests to be run against the mutants. As we’ll see, without a good unit test selection strategy mutation testing quickly becomes infeasible for real systems. The key properties for any mutation testing tool are its efficiency and performance: how good is it at creating mutants, and how fast is it at processing all these mutants?

Starting with the injection of the mutations, there are basically two alternatives available: source code and binary code mutation. Source code mutation makes the changes in the source code, and then compiles the mutated code. The big advantage of source code mutation is that it can be done using text manipulation, but the big disadvantage is that compilation takes time, even if it can be done as an incremental compilation on top of the original compiled code. Source code mutation also has the advantage that it’s easy to output the generated mutant, e.g. when it survives all the unit tests. On the other hand, with source code mutation the mutation testing tool has to be able to handle the compilation round, including any compilation errors that may occur.
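
As an illustration of how simple the text manipulation can be, this naive Java sketch generates one ‘plus to minus’ mutant per occurrence of + in a source string; a real tool would have to skip string literals and comments, and compile each mutant afterwards.

import java.util.ArrayList;
import java.util.List;

public class SourceMutator {
  // Produce one mutant per occurrence of '+', each with that single
  // occurrence replaced by '-'. Naive: ignores literals and comments.
  public static List<String> plusToMinusMutants(String source) {
    List<String> mutants = new ArrayList<String>();
    for (int i = 0; i < source.length(); i++) {
      if (source.charAt(i) == '+') {
        mutants.add(source.substring(0, i) + '-' + source.substring(i + 1));
      }
    }
    return mutants;
  }
}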

Binary code mutation is faster, because it operates directly on the already available compiled code, but it requires that you know how to manipulate the binary code correctly. And even then, it’s possible that a mutation leads to invalid binary code, which has to be handled by the mutation testing tool. Outputting the source code for the mutant isn’t a straightforward job either, but requires a decompilation round. This decompilation can be as simple as knowing how to map a binary code mutation back to a source code mutation, though.
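
For JVM byte code, this kind of manipulation can be done with a byte code library. The sketch below assumes the ASM library (its standard visitor API, here at the ASM 4 level) and rewrites every integer addition into a subtraction; a real tool would generate one mutant per occurrence rather than rewriting them all.

import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// A sketch, assuming the ASM byte code library: a class visitor that
// rewrites every integer addition (IADD) into a subtraction (ISUB).
public class AddToSubMutator extends ClassVisitor {
  public AddToSubMutator(ClassVisitor next) {
    super(Opcodes.ASM4, next);
  }

  @Override
  public MethodVisitor visitMethod(int access, String name, String desc,
                                   String signature, String[] exceptions) {
    final MethodVisitor inner =
        super.visitMethod(access, name, desc, signature, exceptions);
    return new MethodVisitor(Opcodes.ASM4, inner) {
      @Override
      public void visitInsn(int opcode) {
        super.visitInsn(opcode == Opcodes.IADD ? Opcodes.ISUB : opcode);
      }
    };
  }
}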

But how about those compilation errors, or the invalid binary code? A problem like this can occur when a mutation in the source code leads to lines that can’t be reached, and the compiler detects such situations. Also, binary code can become invalid, e.g. due to a mutation such that a literal falls outside its legal range. Both cases are good cases though, because just like a failing unit test, they indicate that the mutated source code mattered for the system’s behaviour. In fact, since compilation errors or the invalidity of binary code will be detected long before the first unit test is run against the mutant, they’re probably the best thing you can hope for when you’re generating a mutant.

It should also be mentioned that there is a middle road between source code and binary code mutation, which is particularly relevant for interpreted languages. For these languages there is no real binary code, but it’s often possible to access the parsed code as an abstract syntax tree (AST). This can be very useful in a language like Ruby, whose syntax allows many ways to express what’s fundamentally the same thing in the parsed code. Mutating the AST makes it easier to write the mutation testing tool, while at the same time speeding up the generation of mutants.

The next thing we have to consider when we want to build a mutation testing tool is what types of mutations we want to inject. Basically, mutations fall into three categories. Some mutations never change a system’s behaviour, such as changing the value of an internal constant or flag that’s only used for flow control. We don’t really care whether we used 0, 1 or any other integer value, as long as it’s unique within its context. Your unit tests shouldn’t care about this either.

Other mutations can change the system’s behaviour, but do not always do so. Switching between the less-than (<) and less-than-or-equal-to (≤) signs is an example of that, as we showed at the beginning of this article. The problem with these types of mutations is that it’s hard to find out when they do change the system’s behaviour and when they don’t. In general, one has to do it manually, and therefore this type of mutation isn’t very useful when we want to automate mutation testing.

Finally, there are mutations that always change the system’s behaviour – or should we say: should always change the system’s behaviour. An example is negating a comparison, such as replacing the less-than sign (<) with the greater-than-or-equal-to sign (≥). Here’s a short list of mutation types that are guaranteed to change a system’s behaviour, that is, if the code can be reached and actually matters (a sketch of such an operator set in code follows the list):

  • Negation of comparisons, like switching between = and ≠, < and ≥, and > and ≤.
  • Negation of boolean conditions ( ¬ )
  • Short-cutting boolean conditions by replacing them with True or False
  • Switching between addition and subtraction, and multiplication and division
  • Switching between the increment and decrement operator
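
As a rough illustration, such a fixed operator set could be represented in code as a simple table; this Java sketch is hypothetical, purely to make the list above concrete:

import java.util.HashMap;
import java.util.Map;

public class MutationOperators {
  // Each entry maps an operator or comparison to its mutation. A real
  // tool would work on the syntax tree or byte code, not on strings,
  // and would also short-cut whole conditions to 'true' or 'false'.
  static final Map<String, String> OPERATORS = new HashMap<String, String>();
  static {
    OPERATORS.put("==", "!=");   // negate comparisons
    OPERATORS.put("<", ">=");
    OPERATORS.put(">", "<=");
    OPERATORS.put("+", "-");     // switch addition and subtraction
    OPERATORS.put("*", "/");     // switch multiplication and division
    OPERATORS.put("++", "--");   // switch increment and decrement
  }
}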

Simply removing a line should work too, but not every mutation testing tool tries to do that. Mind, though, that you have to watch out that you’re not removing a variable declaration and thereby simply generating a compilation error. On the other hand, removing a whole line of code is often such a big change to the source code that you’re almost guaranteed to find a unit test that will fail. It’s therefore usually better to save the effort and concentrate instead on mutations that have a better chance of producing mutants that pass all unit tests.

Finally, we have to find a good strategy to select the unit tests that we’re going to run against the mutant. For every generated mutant, the strategy should find an order in which to run the unit tests such that those that fail are run as early as possible. This is crucial for the mutation testing tool to be useful, because a brute-force approach, i.e. just running all unit tests against all mutants in a random order, will quickly become very time- and resource-consuming.

Let’s illustrate the complexity of the unit test selection problem with a simple example. Assume our system consists of 50 classes, with 20 unit tests per class, and each unit test taking 1 ms on average. It’s easy to verify that each full round of unit tests will take 1 s to complete. Now imagine that for every class, 10 mutants can be generated. If we use the brute-force strategy, i.e. we run all unit tests against all mutants, we’ll need up to 8 m 20 s (500 mutants × 1 s per full round) to complete the mutation testing round.

A smarter strategy to run the mutation testing round would be to run only those unit tests that are relevant for the class that was mutated in order to generate the mutant. In that case, we run only 20 unit tests of 1 ms for the 50 classes with 10 mutants, and that will take only 10 s in total.

Now let’s scale the system up from 50 to 500 classes to make things a bit more realistic. When we do the same exercise as above, we find out that the brute-force strategy will need up to 14 hours to complete one round of mutation testing, whereas the second strategy will use less than 2 minutes. Clearly, just running all unit tests on all mutants isn’t a good strategy. Even for a relatively small system with relatively fast unit tests, running a round of mutation testing becomes quickly infeasible.

We can estimate the complexity of mutation testing in the following manner. Assume the system has c classes, and that there are f function points per class. Clearly, f must be equal to or larger than 1. Assume further that we have t unit tests per function point, and can generate µ mutants per function point. Again, both t and µ will be equal to or larger than 1. In fact, in contrast to f, which can be arbitrarily large, t and µ will in general be rather small numbers. There are normally not that many unit tests to be written for a single function point, and in the same way there won’t be that many mutants that can be generated per function point.

Using the symbols from above, it is easy to see that there are c·f·t unit tests in the system, and c·f·µ mutants. This means that for the brute-force strategy, there are potentially c·f·t × c·f·µ = c²·f²·t·µ unit tests to be run in a single mutation testing round. Notice that this result is quadratic both in the number of classes and in the number of function points per class, i.e. the total number of function points. This explains the explosion in the time required to run mutation testing in the example above: multiplying the system’s functionality by a factor of 10 multiplies the resources needed to run mutation testing by a factor of 100.

Now consider the smarter strategy from above. It runs only those unit tests that are related to the class that has been mutated, and therefore the number of unit tests is reduced to f·t × c·f·µ = c·f²·t·µ. This number is still quadratic in the number of function points per class, but only linear in the number of classes. The ideal strategy would of course be to run only those unit tests that are related to the function point that has been mutated, and in that case the number is reduced even further to t × c·f·µ = c·f·t·µ. The lesson to be learned from these two results is that it’s important to narrow down the number of unit tests that have to be run per mutation as much as possible if we want to keep mutation testing feasible even for realistically large systems.
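
As a sanity check of these formulas, the little Java sketch below plugs in the numbers from the earlier example (c = 50, with in effect f·t = 20 unit tests and f·µ = 10 mutants per class, at 1 ms per test) and reproduces the 8 m 20 s and 10 s figures:

public class MutationTestingCost {
  public static void main(String[] args) {
    int classes = 50;             // c
    int testsPerClass = 20;       // f·t
    int mutantsPerClass = 10;     // f·µ
    int msPerTest = 1;

    long bruteForce = (long) classes * testsPerClass        // all tests...
        * classes * mutantsPerClass * msPerTest;            // ...per mutant
    long perClass = (long) testsPerClass                    // only the class's
        * classes * mutantsPerClass * msPerTest;            // own tests
    System.out.println("Brute force: " + bruteForce + " ms"); // 500000 ms = 8 m 20 s
    System.out.println("Per class:   " + perClass + " ms");   // 10000 ms = 10 s
  }
}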

As a side note, these results also suggest that it’s a good idea to keep f, the number of function points per class, as small as possible in the per-class strategy. This is basically the same thing as saying that your classes should be small, which is a recommendation that shouldn’t be too unfamiliar to the reader. It’s nice to see that what’s considered good practice anyway can help improve the performance of mutation testing too.

Luckily, there are a number of strategies that can be used to find the unit tests for a mutant that have a reasonable chance of failing, and some of these strategies work quite well. First of all, one can use hints, e.g. in the form of annotations provided by the developer, as to which unit tests relate to which classes, parts of classes, or even modules. This can also be automated, e.g. by package and class name matching. If there’s a convention that all unit tests reside in the same package as the source code that they’re supposed to test, this can limit the number of unit tests per mutant drastically in an automated manner. If there’s a convention that the unit tests for a class named Foo reside in a class named FooUnitTest, or perhaps in classes with names following the pattern Foo*Test, this will limit the number of unit tests per mutant even further.

Another way to reduce the number of unit tests per mutant is to use a code coverage tool, and record which lines of source code are covered by which unit test. This gives exact information about which unit tests are relevant for mutations in a given source code line, and in a completely automated manner. On the other hand, this requires interaction with a code coverage tool, or if that’s not possible, the implementation of a code coverage tool as part of the mutation testing tool. Usually, the effort required to do this is a good investment, because it will improve the performance of the mutation testing tool dramatically.
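
A sketch of how such a record could look (the types and names are made up for illustration): while the suite runs under the coverage tool, each executed line is mapped to the tests that execute it, and a mutation on a line then only triggers that set of tests.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Maps each covered source line to the unit tests that execute it, so
// that a mutation on a line only triggers the tests that can kill it.
public class CoverageIndex {
  private final Map<String, Set<String>> testsByLine =
      new HashMap<String, Set<String>>();

  // Called while replaying the suite under a coverage tool.
  public void recordCoverage(String testName, String file, int line) {
    String key = file + ":" + line;
    if (!testsByLine.containsKey(key)) {
      testsByLine.put(key, new HashSet<String>());
    }
    testsByLine.get(key).add(testName);
  }

  // The candidate tests for a mutant created at this line.
  public Set<String> testsFor(String file, int line) {
    Set<String> tests = testsByLine.get(file + ":" + line);
    return tests == null ? new HashSet<String>() : tests;
  }
}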

One can also use automated learning, and record which unit test failed for which mutant. Every time a mutant is generated, the tool looks up which unit test failed in the previous round, and tries to run that unit test first. Of course, this requires some bookkeeping, but if it works, the number of unit tests to be run in a single mutation round can be reduced to the absolute minimum of c·f·µ, i.e. simply the number of mutants.
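
The bookkeeping for this learning strategy can be as small as one map, as in this sketch (identifiers hypothetical): remember which unit test killed each mutant last time, and move it to the front of the run order.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KillerMemory {
  // mutant id -> name of the unit test that killed it in the last round
  private final Map<String, String> lastKiller = new HashMap<String, String>();

  public void remember(String mutantId, String testName) {
    lastKiller.put(mutantId, testName);
  }

  // Move the remembered killer, if any, to the front of the run order.
  public void prioritise(String mutantId, List<String> testRunOrder) {
    String killer = lastKiller.get(mutantId);
    if (killer != null && testRunOrder.remove(killer)) {
      testRunOrder.add(0, killer);
    }
  }
}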

So far we’ve dealt with how we can inject mutations, what kind of mutations we can inject and how we select the unit tests that we want to run against the mutant. In order to do this, we’ve had to deal with compilation errors, invalid binary code, equivalent mutations and the quadratically growing complexity of the problem. Could there be even more problems that we have to handle in order to build a good mutation testing tool?

Actually, there’s still one problem left. Have a look at an implementation of the factorial function in Listing 10, together with one of its more interesting mutants. The mutant is generated by short-cutting the condition of the while-loop.

def factorial(n) {
  i ← 0;
  f ← 1;
  while (i < n) {
    i++;
    f ← f × i;
  }
  return f;
}

def factorial_mutant(n) {
  i ← 0;
  f ← 1;
  while (true) {
    i++;
    f ← f × i;
  }
  return f;
}
Listing 10

At first glance, this mutant has an infinite loop. However, when you think about it, there are a number of things that can happen, and not only does it depend on the programming language, but sometimes also on the compiler or the interpreter. If we’re really lucky, the compiler spots the infinite loop, throws a compilation error, and we don’t even have to run a unit test at all. Otherwise, we might get an exception when the multiplication hits the upper limit of the range of integers. Or it could be that the programming language allows arbitrarily long integers, and that the system continues to run until the multiplication generates such a large number that it doesn’t fit into memory any more. Alternatively, the multiplication overflows silently, and the function continues to run forever. This means that there is indeed a potential for a real infinite loop in this mutant, and even if it’s not completely infinite in the programming language, compiler or interpreter of your choice, it may take a long time before the final exception is thrown.

This means that we have to be able to detect infinite (or pseudo-infinite) loops. Luckily, that’s not so difficult: we just have to measure the time a unit test normally takes, and decide how much longer it can take when we run it against a mutant before we interrupt it, and assume that the mutant contained a (pseudo-)infinite loop. A time factor of three may be reasonable, but we should also take care that we measure actual CPU time, and not just time elapsed. We don’t want to interrupt a unit test that took a long time just because the operating system decided to allocate the CPU to another process with a higher priority.
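
Here is a minimal sketch of such a guard in Java, assuming a wall-clock budget for simplicity (measuring actual CPU time, as recommended above, would require e.g. the JDK’s ThreadMXBean and is left out here); a timeout is treated as a killed mutant:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuard {
  // Runs a unit test against a mutant, allowing it three times the
  // duration it normally needs; a timeout counts as a killed mutant.
  public static boolean killsWithinBudget(Callable<Boolean> test,
                                          long normalMillis) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Boolean> result = executor.submit(test);
    try {
      // 'false' from the test means a failure, i.e. the mutant is killed.
      return !result.get(3 * normalMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      result.cancel(true);  // assume a (pseudo-)infinite loop
      return true;
    } finally {
      executor.shutdownNow();
    }
  }
}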

Once we’ve detected a (pseudo-)infinite loop, we should be happy though. Sure, it takes some extra time compared to a unit test that passes, but again we discovered that the mutation at hand changed something in the system that mattered, and we can move on to the next mutant.

Mutation testing tools

Unfortunately, there aren’t that many good tools available to run mutation testing, but the situation has been getting better and better in recent years. If you’re a Ruby programmer, you should check out Heckle [Ruby]. It integrates with Test::Unit and RSpec, and is usually run from the command line to run a set of unit tests on a method or a class. Mutations include booleans, numbers, strings, symbols, ranges, regular expressions and branches. It has good, to-the-point reporting, writes out the first mutant it can find that passes all the unit tests in a clear manner, and performs well. There is, however, virtually no documentation available. I’ve been able to write a manually configurable Rake task for Heckle, but only by letting Ruby call out to the command line and catching and parsing whatever Heckle outputs. It’s possible there’s a better way to do this, but getting it right can be a difficult job without any real documentation.

Boo_hiss [ BooHiss ] is another mutation testing tool for Ruby, but since it hasn’t had a single check-in for the last three years now, the project doesn’t look like it’s very much alive.

I’ve recently started to experiment with PIT [Pitest], a Java tool that integrates with both JUnit and TestNG. It can be run from the command line, but it also has a plug-in for Maven. It’s highly configurable, has some good defaults, and a website with clear documentation. PIT operates on the byte code, and runs pretty fast thanks to a pre-round with a test coverage tool that maps out which unit tests are relevant for mutations in which parts of the source code. Mutations include conditional boundaries, negation of conditions, mathematical operators, increment operators, constants, return values, void and non-void method calls, and constructor calls. If you’re a Java programmer and want to have a try at mutation testing, this is the tool I recommend for you. And since the project started not so long ago, there’s still a lot of activity going on and many chances to get involved in the development of the tool.
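
To give an impression of how little configuration this takes, here is a minimal Maven plug-in section for PIT; the coordinates and the mutationCoverage goal are as documented on the PIT website, but check pitest.org for the current version and options, and note that com.example.* stands in for your own packages:

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version><!-- pick the latest from pitest.org --></version>
  <configuration>
    <targetClasses>
      <param>com.example.*</param>
    </targetClasses>
    <targetTests>
      <param>com.example.*</param>
    </targetTests>
  </configuration>
</plugin>

A mutation testing round is then started with mvn org.pitest:pitest-maven:mutationCoverage.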

Other mutation testing tools for Java include Jester [Jester], Jumble [Jumble], Javalanche [Javalanche] and µJava [Mujava]. None of these projects seems to be alive any more, and I wouldn’t call them an alternative to PIT. Jester integrates only with JUnit, and operates on the source code, which means that it’s rather slow. Jumble is faster than Jester because it manipulates the Java byte code, but since it has no plug-in for Maven yet, it has to be run from the command line. Javalanche and µJava have always seemed cumbersome to me, and appear to be mainly academic exercises.

Jester has two ports to other languages: Nester [Nester] for C# and Pester for Python. A colleague of mine tried out Nester, and quickly found out that it works only with an older version of .NET. It’s therefore probably not very relevant for any ongoing project, but it could be nice to play with, and a good starting point if you want to create a new mutation testing tool for C#. Mutagenesis [Mutagenesis] is a mutation testing tool for PHP, but since I’m not familiar with PHP, I don’t know whether it’s any good.

I’m not a C++ programmer either, so I can only point the readers to CREAM [Cream] and PlexTest [Plextest] and encourage them to try the two tools out. If anybody does, or finds another tool for C++, I’m pretty sure the editor of Overload would be happy to receive an article with an experience report.

Improvements

As the reader may have guessed, I’m already glad there are mutation testing tools available that are usable at all. Indeed, the evolution over the last few years has been huge. Not so long ago, even the best mutation testing tools could not be considered much more than proofs of concept, whereas today it is fully possible to use mutation testing tools in real-life projects. But there is still lots of room for improvement. Integration with common development environments is high on the list, together with better reporting on the mutants that don’t make any of the unit tests fail. Unit test selection strategies seem to be getting better and better. Parallelisation is certainly an issue that will be addressed in upcoming versions and future tools, since mutation testing is such a CPU-intensive process.

Personal experiences and recommendations

So how can you get started with mutation testing? First of all, select a good tool. If you’re a Java programmer, try PIT, and if you’re a Ruby programmer, try Heckle. It’s important that the tool is easily configurable and flexible, and that it can output mutants that survive the unit tests in a clear and meaningful manner. Start on a small code base, and ideally, start using mutation testing from day 1 in the project. If you start using it on an on-going project, you’ll have the same problem as when you run static code analysis for the first time in the middle of the project: the tool will find an overwhelmingly long list of problems.

Once you’ve started using mutation testing, believe the tool when it reports that a mutant passed all the unit tests. And if you don’t believe it, try to prove that it’s wrong before you dismiss the mutant. It has happened to me a couple of times that the tool presented me with a mutant that I simply couldn’t believe passed all the unit tests, but these have also been the occasions where I learned something new about programming and code quality.

As with other code quality techniques, when the tool finds a problem, fix the problem. And embrace the ‘more than 100% test coverage’. Mutation testing leads to path coverage, more than simple line or branch coverage. When you use it, you’ll notice that you’ll write less, but more condensed code, with more and better unit tests. The reason for this is simple: if you write an extra line of code, there will be a couple of extra mutants, and some of them may pass the unit tests. Mutation testing will also influence your coding style, such that even when you work on a project where you can’t use mutation testing, you’ll write better code. I’ve used mutation testing for a couple of years now, mostly Heckle in Ruby, but recently also PIT in Java, and I feel it has helped me become a much better programmer than otherwise would have been the case.

References

[Andrews05] Andrews, Briand, Labiche, ICSE 2005

[BooHiss] See https://github.com/halorgium/boo_hiss

[Cream] See http://galera.ii.pw.edu.pl/~adr/CREAM/index.php

[Frankl97] Frankl, Weiss, Hu, Journal of Systems and Software, 1997

[Geist92] Geist et al., ‘Estimation and Enhancement of Real-time Software Reliability through Mutation Analysis’, 1992

[Javalanche] See http://www.st.cs.uni-saarland.de/~schuler/javalanche/

[Jester] See http://jester.sourceforge.net/ and http://sourceforge.net/projects/grester/

[Jumble] See http://jumble.sourceforge.net/index.html

[Lipton71] R. Lipton, Fault Diagnosis of Computer Programs , 1971

[Lipton78] R. Lipton et al., Hints on Test Data Selection: Help for the Practicing Programmer, 1978

[Mujava] See http://cs.gmu.edu/~offutt/mujava/

[Mutagenesis] See https://github.com/padraic/mutagenesis

[Nester] See http://nester.sourceforge.net/

[Pitest] See http://pitest.org/

[Plextest] See http://www.itregister.com.au/products/plextest_detail.htm

[Ruby] See http://rubyforge.org/projects/seattlerb/ and http://docs.seattlerb.org/heckle/

[Wah95] K. Wah, ‘Fault Coupling in Finite Bijective Functions’, 1995

[Walsh85] Walsh, Ph.D. Thesis, State University of New York at Binghamton, 1985





