Deterministic Components for Distributed Systems

Non-deterministic data leads to unstable tests. Sergey Ignatchenko considers when this happens and how to avoid it.

Over the last 10 years, significant improvements have been made in program testing. I don’t want to go into a lengthy argument whether TDD is good or bad here – this is not the point of this article. However, one thing I want to note is a rather obvious (so obvious that it is often taken as granted) point that

For test results to have any meaning, the test needs to be reproducible

In theory, we can speak of ‘statistical testing’ when the program is run for 1000 times (and if it fails not more than x times out of these 1000 runs, it is considered fine), but for most of the programs out there it would be a Really Bad Way to test them.

So far so obvious, but what does this really mean in practice? Let’s consider the relationship between two subtly different properties: the ‘test being reproducible’ and the ‘program being deterministic’ (defined as ‘the program has all its outputs completely defined by its inputs’). For the time being, we’ll refrain from answering the question ‘what qualifies as an input’; we’ll come to discussing it a bit later.

First, let’s note that for a 100% deterministic program, all the tests are always reproducible. In other words, a program being deterministic is sufficient to make all the tests against this program reproducible. On the other hand, if all the possible tests for our program are always reproducible, it means that our program is deterministic.

However, there are tests out there which are reproducible even when the program is not entirely deterministic. For example, if our program X prints the current time when we specify the -t option, and it emits concatenation of the input string with ‘beyond any repair’ otherwise, we can say (assuming that we don’t consider current time as one of the program inputs) that:

The program is not entirely deterministic
the test with the -t option is not reproducible
all the other tests are reproducible

Let’s see what we have observed up to now:

for deterministic programs, all the tests are reproducible
for non-deterministic programs, there can be both reproducible tests and non-reproducible ones

The second statement can be seen as an indication that all is not lost – we can have meaningful tests for non-deterministic programs. And this does stand; on the other hand, a much more unpleasant statement stands too:

For a real-world non-deterministic program, we are bound to leave some functionality untested

Let’s see this on an example of our program X. We certainly can have a set of reproducible tests over our program – the ones which don’t use the -t option. However, this will leave us with a significant part of our functionality untested.

To avoid this, we can try to test our program with the -t option to emit the right format, but wait – without meddling with the system time we won’t even be able to test it on Feb 29th etc. Ok, we can add manipulating system time to our test, but even then we’ll be testing only the time format (and not correctness of the time which was printed). We may try to measure current time before calling our program, and to compare this value with that printed by our program X; it might seem a good solution, but apparently, even if the program prints time with minute precision and runs for only 0.1 second, from time to time the test will fail. More specifically, such a test will fail whenever this 0.1 second (between the moment when we’ve measured current time and the moment our program has made its own measurement) happens to occur exactly across the boundary between two different minutes. Ok, we could add tolerance (which BTW already makes our test imprecise) and consider the program valid if the printed output printed_time satisfies an equation measured_time < printed_time < measured_time + 0.1s , but even this is not sufficient, as we haven’t accounted for potential system time adjustments (including automated ones).

As we can see, even for an absolutely trivial non-deterministic program, accurate testing becomes a very non-trivial task. In practice, for a non-trivial program writing a set of tests which cover a significant part of the functionality quickly becomes a daunting task (or, more often, leads to significant functionality left untested). And if our program is traditionally multithreaded (defined as ‘multithreaded with explicit thread sync using mutexes or equivalent’) testing very quickly becomes pretty much pointless exactly because of the lack of determinism: the program which works perfectly on your test rig can easily fail on some other computer (or fail on the same computer from time to time) – just because whoever runs it was unlucky. Bummer.

Distributed programs and unit-tests: false sense of security

Let’s come back from our so-far-theoretical clouds to the earth of programming. Let me tell you one real-world story in this regard (let’s call it ‘Unit Test Horror Story’ for future reference).

Once upon a time, there was a company which successfully ran a multi-million-dollar online business. Moreover, they enjoyed very little unplanned downtime by their industry standards (1 hour per year or so). Then, on a nice sunny day (or a dark cloudy one – it won’t change anything), a new developer came to the company. It just so happened that he was a strong proponent of TDD, so he started writing a unit-test for a new feature, and the test failed; he implemented the feature, and the test succeeded. And then the new developer read a lecture to all those non-TDD developers, saying ‘Now you see how easy it is to write error-free programs!’ Next day, his changes went to production and caused one of those once-per-year downtimes.

The moral of this story is certainly not along the lines of ‘see, TDD is useless’ (IMHO, TDD, when taken in moderation, is quite useful, though it is far from being a silver bullet). The point here is to understand what has caused the downtime; and apparently, it was an unusual sequence of otherwise perfectly valid messages (sometimes such unusual sequences are referred to as ‘races’).

Moreover, from my experience, these

unusual sequences of otherwise perfectly valid messages are by far The Most Difficult To Find Bugs in pretty much any real-world distributed system.

As a consequence (and combined with an observation that such unusual sequences tend to be among the most unexpected for the developer), it means that

Unit tests, while important, are insufficient to ensure the validity of a distributed system

This happens because even if every component of your program is deterministic, when you make a distributed system out of these components, your distributed system (taken as a whole) is very unlikely to be deterministic. However, as we’ll see below, even per-component determinism helps a lot for debugging such distributed systems. This includes things such as automated replay-based regression testing, low-latency fault-tolerance and relocation for components, and the holy grail of production post-mortem analysis.

Benefits of deterministic components

Let’s discuss the major benefits of deterministic components in a distributed system.

Replay-based regression testing

As we’ve seen in the ‘Unit Test Horror Story’ above, unit tests are not sufficient to test components of the distributed system. On the other hand, if our component is entirely deterministic, we can do the following:

while a previous version of the component is running in production, we can write down all the inputs for this specific component as an input log . Ideally, we should be able to do this for all the components in the system, but it can be done on per-component basis too.
then, we can run the same input log against the new version of the component
if our new version of the component doesn’t have any changes to the logic (and has only extensions which are not activated until new inputs are used) – then the output should be exactly the same as for the previous version of the component
while there is no strict guarantee that this kind of testing would help with the ‘Unit Test Horror Story’, the chances are that it would. Moreover, for heavily loaded systems, experience shows that the vast majority of those unusual sequences of otherwise valid inputs do happen every few hours (YMMV, batteries not included). In any case, this kind of testing, while not being a guarantee against bugs (nothing is), is a very valuable addition to the arsenal of regression tests which can and should be run.

Of course, you’ve already noticed the weak point of this kind of testing: as noted above, it will work only as long as you have made no changes to existing functionality . This indeed is one of the reasons why this kind of testing is not a silver bullet. However, two things help in this regard:

most of the time, for a real-world system, functionality is added rather than modified
provided that you make rather small and functionally oriented git-style independent commits (as opposed to one commit for everything which happened this week), it is often still possible to separate those changes which are not supposed to change any existing behavior, to re-apply these changes on top of the previous version, and to run replay-based regression tests against these changes. While it is admittedly an imperfect way of testing, it still does help quite a bit in practice.

Deterministic production post-mortem

However thoroughly we test our systems, from time to time they do fail . In such cases, it is of utmost importance to identify the problem ASAP, and to fix it, so it doesn’t happen again.

In this regard, deterministic components can provide a Holy Grail of production post-mortem: an ability to reproduce the bug exactly as it has happened in production. If our component is deterministic, then:

we’re able to write all the inputs of the component into an input log
after the program fails in production, we can get the input log and run it in the comfort of a developer’s machine, under a debugger, as many times as we want, and get exactly the same variables at exactly the same points as happened in production. Note that, strictly speaking, to get exactly the same variable values, we’ll need each line of our component to be deterministic, which is a stronger property than the component being deterministic as a whole. However, in practice the latter is normally achieved via the former, so we can ignore the difference between the two for most practical purposes.

In reality, of course, it is not that simple. If our system runs for months, storing and replaying all the inputs which have happened during those months is rarely feasible. However, we can avoid this problem by running our input log in a circular manner.

As mentioned above, running input log for the entire history of our component rarely qualifies as a viable option. On the other hand, if our component is deterministic, and we can get current state of our component in some kind of serialized form, then we can build our input log as follows:

input log is written in a circular manner, so that it stores only the last t seconds of the component inputs
on each wrap-around of our circular storage, we store the current state of our component in the same input log
if we need to perform a post-mortem, we use a deserialized current state to initialize our deterministic component and then reapply the remaining inputs from the input log to get a complete picture of the last t seconds of the component’s life before the crash

Such circular input logs are highly practical at least in some deployment scenarios, and provide an ability to see the last t seconds of life of a component before it failed. While it might be that the problem occurred before these last t seconds, and was only manifested later, this technique still happens to be quite useful in practice.

Low-latency fault tolerance and component relocation

Yet another major benefit of having your components both deterministic and serializable is that it is possible to gain some fault tolerance from them (without any help from application level code). In particular, the following schema will work:

circular input log runs on a physical server X, and our deterministic component runs on a physical server Y
all the outputs from our deterministic component are sequentially numbered, and go through physical server X, where the last number of the output is kept as a variable LastOutputNumber

Then you can recover from any single server failure in a perfectly transparent manner:

if server X has failed, then server Y has all the information we need, and replacement server X' can start writing input log from scratch (starting from storing serialized current state )
if server Y has failed, then we can follow the following procedure:
- start replacement server Y' with a replacement instance of our deterministic component (initialized with serialized current state from input log )
- variable LastSkipNumber is set to LastOutputNumber
- all the records in input log after serialized current state are replayed
- during replay, all the outputs of our deterministic component are skipped until we reach LastSkipNumber . This is possible because all the outputs are deterministic, and they exactly repeat those outputs which have already been sent back to the other components/clients/whoever-else

This kind of determinism-based fault tolerance provides fault-tolerance with very little additional latency. In fact, it is ideologically similar to the virtual lockstep way of achieving fault tolerance (and provides significantly better latency than so-called fast checkpoints ).

In a somewhat similar manner, it is possible to achieve low-latency relocation of the components from one physical server to another one. The idea revolves around running two instances of the component for some time to reduce latency during serialization/deserialization/state-transfer. The second instance of the component would have its outputs skipped until it has caught up with the first one, and then the first one can be dropped.

Making components deterministic

Ok, I hope that by now I’ve managed to convince you that determinism is a Good Thing™. Now, let’s discuss how to write those deterministic components.

Obvious stuff

First of all, let’s note that there are obvious things to keep in mind when aiming for determinism. In particular:

don’t use any uninitialized memory in your program
more generally, don’t use anything which causes the dreaded ‘undefined behavior’ in C/C++
don’t use pointers for any operations except for dereferencing.
- In particular, using pointer values (as opposed to data pointed to by pointers) as a key for sorting is especially dreadful (as usually allocators are not required to be deterministic, especially in a multithreaded environment, so such sorting can easily lead to non-deterministic results)

NB: for the purposes of this article, we won’t aim for cross-platform determinism; while cross-platform determinism is an interesting beast and has its additional benefits, for now we will set it aside to avoid complications. This will save us from quite a few problems, such as iterations over unordered collections and even more nasty different-order-of-floating-point-operations stuff.

Relation to Reactors

Now, let’s note that deterministic components tend to go very well alongside with the Reactor event-driven pattern (as described in [ Wikipedia ]). This more or less includes such things as Erlang message passing, Node.js, and to certain extent – Akka Actors.

In summary, the Reactor pattern takes incoming inputs (a.k.a. ‘service requests’ or ‘events’) and processes them synchronously, one by one. It also serves as a very convenient point to write all these incoming events into our input log .

So far so good, and if all our Reactor does is calculate a pure function, it stays deterministic just by its very nature. However, as soon as you do as little as call get_current_time() within your component, it ceases to be deterministic . Therefore, we need to discuss ways how to deal with non-determinism.

While the determinism-assuring techniques described below are, strictly speaking, not limited to Reactor and Reactor-like patterns, it is easier to describe some of them in terms of Reactors (and they’re likely to be used in the context of Reactors too).

Considering call output as a program input

As we’ve seen above, even a very simple program which prints current time is not deterministic. However, there is a trick which allows to make it deterministic. More specifically,

If we make current time an input of our program, related non-determinism will go away

In other words, all we need to do to make our program deterministic, is to replace a call to a system-level function get_current_time() with a call to our own function get_our_own_current_time() , where get_our_own_current_time() goes along the following lines:

time_t get_our_own_current_time() {
  switch( deterministic_mode ) {
    case DETERMINISTIC_REPLAY:
      return read_time_t_from_input_log();
    case DETERMINISTIC_RECORDING:
      time_t ret = get_current_time();
        //NON-DETERMINISTIC!
      write_time_t_to_input_log(ret);
        //wrote to input-log
        //to ensure determinism
      return ret;
    case DETERMINISTIC_OFF:
      return get_current_time();
      //NON-DETERMINISTIC!
  }
}

Bingo! We can have our determinism and eat it too!

This trick of declaring something non-deterministic as an input and writing it into input log is pretty much universal in the sense that in theory it can be applied to pretty much any non-deterministic function call including, but not limited to, reading of real random data from /dev/urandom. Even mutexes can be handled this way (though in this case all the data protected by mutex needs to be written to the input log , ouch). On the other hand, in practice there are two significant caveats:

when large data chunks are read, it is often infeasible to handle them this way (if you read 1Gbyte from your file/DB, throwing it into input log is rarely a good idea); however, there are many practical cases when it can be avoided:
- if the data is expected to be constant (like ‘read from a constant file’) – then there is no need to record it to the input log (as it can be reproduced); in extreme cases, if you suspect that data potentially can be corrupted, you can write some kind of hash into input log instead of the data itself
- if the data is some kind of cache (which can be re-obtained from the authoritative source instead) – it doesn’t need to be logged either.
- If we can say that data on disk is a part of our current state , then there is no need to log such accesses either. Note though that this option would usually make serializing your current state significantly more expensive
For frequently called functions such as obtaining current time , using this trick makes replay more fragile than it is necessary. For example, if you use this trick and then add another call to get_current_time() somewhere within your implementation, it makes your input logs incompatible (and usually this is not intended)

This trick doesn’t go well with fault-tolerant implementations (which need to write all the inputs to a separate physical box)

Guessing game

A different technique to ensure determinism can be described as follows. If we know in advance what will be necessary for the processing of our input event, then we can supply it to our component (logging it to input log in advance, so that it is no longer a problem for fault-tolerant implementations). In particular, as lots of input events need current time , we can say that we’ll provide it to all of them. In this case:

Code outside of the Reactor will call the system level get_current_time() before processing each input event.
Code outside of the Reactor will store the time read (say, to TLS variable event_time )

Note that global/per-process singleton won’t be able to ensure determinism if you have more than one thread running your deterministic objects within your process.

It will call Reactor’s process_event() or equivalent

Reactor app-level code still needs to call: get_our_own_current_time() instead of: get_current_time() but implementation of get_our_own_current_time() becomes much simpler in this case:

  thread_local time_t event_time;
    //event_time is pre-populated by caller
  time_t get_our_own_current_time() {
    return event_time;
  }

Note that with this implementation, it is not possible to measure execution times within the event handler (as all the calls to get_current_time() for the same event will return the same value). On the other hand, as any kind of execution times measurements would make our program inherently non-deterministic (at the very least when we’re running it on a modern CPU), it is not that big deal. And if you need to make some performance analysis, you still can use something along the lines of Node.js-style console.time()/console.timeEnd(); as these functions do not return the measured time interval value to the program, but rather print it to a log file – then, as long as we do not consider log file as one of program outputs (and the program itself doesn’t read these values from the log file), we’re fine from determinism point of view.

Unfortunately, not all the required inputs can be pre-guessed successfully. However, in quite a few cases, the following technique does help:

We’re starting to process_event() with or without the data that may be needed
If the data is needed but is not provided, we throw a special exception requesting the data from the caller
- This exception must be thrown before any changes to Reactor’s state are made. This fits very well into a typical Validate-Calculate-Modify Reactor pattern described in [ ITHare16 ].
- It is also important to discard all the outputs coming from the processing of this process_event() (or to avoid emitting them in the first place)
If a caller receives this ThisKindOfDataRequested exception, it simply provides the requested data and repeats the same call

This exception-as-a-request-for-non-deterministic-data does help in quite a few scenarios. However, as with anything else, it is not a silver bullet , and chances are that you will need to look for your own ways to ensure determinism.

Conclusions

We’ve discussed the impact of per-component determinism on programming. In particular, we’ve found several major benefits of making your components deterministic (from replay-based regression testing to determinism-based fault tolerance). Also, we’ve discussed some ways of achieving this holy grail of determinism; while techniques discussed here won’t ensure determinism for all programs, they have been seen to allow determinism for quite a few real-world systems (ranging from stock exchanges to online games).

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.

References

[ITHare16] ‘No Bugs’ Hare, ‘Asynchronous Processing for Finite State Machines/Actors: from plain event processing to Futures (with OO and Lambda Call Pyramids in between)’, http://ithare.com/asynchronous-processing-for-finite-state-machines-actors-from-plain-events-to-futures-with-oo-and-lambda-call-pyramids-in-between/

[Wikipedia] Reactor pattern https://en.wikipedia.org/wiki/Reactor_pattern