Using locks will slow down threaded code. Len Holgate demonstrates how an Activatable Object can reduce the time spent blocked.
An Activatable Object is a variation on the Active Object Pattern [ Wikipedia-1 ] in which the object does not run on a thread of its own but instead borrows one of the calling threads to process operations. Activatable Objects can be used to replace objects that can be accessed by multiple threads and that have state that can only be manipulated by one thread at a time. They play nicely with threads in thread pools and allow for scaling to vast numbers of objects without burdening the process with an excessive number of threads.
I work on high performance networking systems on Windows platforms and these deal with many thousands of concurrent connections, each of which is represented by a connection object with complex state. Things such as SSL and asynchronous flow control requirements [ ServerFramework ] mean that when I/O events occur on a connection it’s often necessary to ensure that only one thread is manipulating the connection object’s state at a given time. Since multiple I/O events can occur at the same time for a given connection, and since ‘user code’ could also be calling into the connection object to initiate new I/O requests, it is often necessary to protect these objects with locks. Unfortunately, using locks in this way means that threads are often blocked by other threads whilst operations are processed on the connection. See Figure 1.
In Figure 1, we have a thread-safe connection object whose internal state is protected by a lock. Given the threads 1, 2, 3 & 4 with the four work items A, B, C & D, the threads are serialised and block until each can process its own operation on the object. This is bad for performance, bad for contention and bad for locality of reference as the thread-safe object can be bounced back and forth between different CPU caches as different threads enter the object to perform work. It also means that a lock is held whilst the threads do their work on the object. Holding a lock like this makes it more complicated to call back into ‘user code’ as it’s easy to cause lock inversions if you hold locks whilst calling into code that you don’t own [ DrDobbs ].
Active Objects, where the object itself contains a thread and where operations are passed to that thread via a work queue, can’t help here as we do not wish to use one thread per connection; that does not scale [ Kegel ]. The preferred method of high performance I/O on Windows is to use I/O Completion Ports, and this method can scale to tens of thousands of concurrent connections with only a small number of threads. In this kind of system a small number of threads reside in a thread pool and perform work on the connections as I/O operations complete.
The Activatable Object that I describe in the rest of this article enables efficient command queuing from multiple threads where commands are only ever processed on one thread at a time, minimal thread blocking occurs, no locks are held whilst processing and an object tends to stay on the same thread whilst there’s work to be done with it.
Reducing thread blocking
The amount of time that a thread needs to block on an object can be reduced by adding a per-object queue of operations to be performed. If we require that all operations are placed into this queue before being processed, and that only one thread can ever own the right to remove and process operations, then we have a design whereby a single thread will process an operation and any additional operations that occur whilst that thread is processing will be queued for processing by that same thread.
Obviously the per-object operation queue needs to be thread-safe and so we need to lock before accessing it. We also need a way to indicate that a thread has ownership of the queue for processing. The important thing is that the processing thread does not hold the queue lock whilst it is processing and so threads with operations for this object are only blocked by other threads with operations for this object whilst one of them is holding the lock to add its operation to the queue rather than for the duration of all operations that are ahead of it in the queue.
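A minimal sketch of this design in C++, using hypothetical names (`ActivatableObject`, `Enqueue`, `Drain`) rather than my actual implementation: the lock protects only the queue and the ‘processing’ flag, and is never held whilst an operation runs.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class ActivatableObject
{
public:
   void Enqueue(std::function<void()> op)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_queue.push_back(std::move(op));
         if (m_processing)
            return;              // another thread owns processing; it will run this op
         m_processing = true;    // this thread becomes the processing thread
      }
      Drain();
   }

private:
   void Drain()
   {
      for (;;)
      {
         std::function<void()> op;
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_queue.empty())
            {
               m_processing = false;   // give up processing mode
               return;
            }
            op = std::move(m_queue.front());
            m_queue.pop_front();
         }
         op();                         // no lock held while processing
      }
   }

   std::mutex m_mutex;
   std::deque<std::function<void()>> m_queue;
   bool m_processing = false;
};
```

Note that a thread only ever blocks for as long as another thread holds the lock to push or pop an operation, never for the duration of an operation itself.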
In general, it should be cheaper to marshal the command and command data into the work queue and on to the processing thread than it is to move the object to the thread that is issuing the command, both in terms of the enqueue/dequeue process that is required and in terms of locality of reference [ Wikipedia-2 ]. This is generally true for my use cases as the connection objects are usually pretty heavyweight and the commands are generally marshalling pointers to memory buffers. Any marshalling costs are also mitigated by the fact that the alternative thread-safe object would block the calling thread until all previous operations have been completed on the object, whereas the Activatable Object only blocks the thread if the work queue is locked by another thread.
Three operations for an Activatable Object occur simultaneously: the first thread, thread 1, enters the work queue lock, adds its operation to the queue and sees if any other threads are processing. During this time the other two threads must wait. See Figure 2.
Note that it’s possible to optimise the processing of an operation if the work queue is empty and there is currently no thread processing by avoiding the enqueue/dequeue steps and allowing thread 1 to become the processing thread and process the operation directly.
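This optimisation might be sketched as follows, again with hypothetical names; the fast path runs the operation directly when the queue is idle and nobody is processing, avoiding the enqueue/dequeue entirely:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class ActivatableObject
{
public:
   void Enqueue(std::function<void()> op)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         if (m_processing)
         {
            m_queue.push_back(std::move(op));  // the processing thread will run it
            return;
         }
         // Invariant: if nobody is processing, the queue is empty,
         // so this operation can run directly.
         m_processing = true;
      }
      op();      // fast path: no enqueue/dequeue needed
      Drain();   // then pick up anything queued while we were busy
   }

private:
   void Drain()
   {
      for (;;)
      {
         std::function<void()> op;
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_queue.empty())
            {
               m_processing = false;
               return;
            }
            op = std::move(m_queue.front());
            m_queue.pop_front();
         }
         op();
      }
   }

   std::mutex m_mutex;
   std::deque<std::function<void()>> m_queue;
   bool m_processing = false;
};
```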
Once thread 1 is processing it releases the lock on the queue and enters the object to process the operation. Thread 2 can then obtain the lock on the queue and enter to add its operation to the queue and see if any other thread is processing. See Figure 3.
Since thread 1 is processing, thread 2 simply enqueues its operation and returns to do other work. Thread 3 can then enter the lock that protects the queue. See Figure 4.
As thread 1 is still processing, thread 3 also simply enqueues its operation and returns. See Figure 5.
When thread 1 has completed processing its operation it needs to give up the right to be the ‘processing thread’. To ensure that operations are processed efficiently it must first check the queue for more operations to process and process those before leaving the object to do other work. If another operation occurs whilst thread 1 is checking the queue then the thread with the operation, in this case thread 4, will block for a time. See Figure 6.
Thread 4 can then enter the queue, check to see if it can process and if it cannot then enqueue its operation for processing by the processing thread. See Figure 7.
Eventually the processing thread will reach a point where no further operations are queued for the object and it can surrender ‘processing mode’ and return to do other work. See Figure 8.
The key features of the design used above are as follows:
- Operations can be enqueued quickly and efficiently so that the lock around the object’s operation queue is held for the least amount of time possible.
- Only one thread can be ‘processing’ at any given time.
- The processing thread removes all queued operations when it checks the queue.
- Locks are not held whilst processing occurs.
When I was integrating this code into my networking system several other requirements emerged:
- The processing thread must be able to enqueue a new operation.
- Some operations must be processed synchronously, before the thread that enqueues them is released to do other work, as that other work may require that the operation on the object has already been processed.
The final requirement for synchronous operations introduces some complication as these operations should block each other but not affect threads with asynchronous operations. This requires a second lock and an event for the blocking threads to wait on. If a processing thread discovers other threads waiting to process synchronous operations it relinquishes processing and allows one of the waiting threads in to process. That thread processes its operation and then continues to process any queued operations in the usual way, unless, of course, there’s another thread waiting to process.
A synchronous operation works exactly the same as all operations would on the alternative thread-safe object but doesn’t affect any of the asynchronous operations. Obviously a thread executing a synchronous operation on an Activatable Object will block until the asynchronous operations that are currently being processed are complete, but this is the same situation as would be encountered with a thread-safe object.
Whilst my current implementation of an Activatable Object includes support for synchronous operations it’s possibly a better design to preclude these and rely solely on commands that can be completed asynchronously.
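One possible shape for the synchronous path, a sketch rather than my actual implementation: a second lock serialises synchronous callers and a condition variable stands in for the event that waiting threads block on. The names (`Enqueue`, `Execute`) are illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

class ActivatableObject
{
public:
   // Asynchronous: fire and forget.
   void Enqueue(std::function<void()> op)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_queue.push_back(std::move(op));
         if (m_processing) return;
         m_processing = true;
      }
      Drain();
   }

   // Synchronous: the operation has been processed by the time this returns.
   void Execute(std::function<void()> op)
   {
      // Synchronous operations block each other, but not asynchronous ones.
      std::lock_guard<std::mutex> serialise(m_syncMutex);
      {
         std::unique_lock<std::mutex> lock(m_mutex);
         ++m_waiters;
         m_condition.wait(lock, [this]{ return !m_processing; });
         --m_waiters;
         m_processing = true;   // take over processing mode
      }
      op();       // processed before the caller continues
      Drain();    // then process anything that queued up meanwhile
   }

private:
   void Drain()
   {
      for (;;)
      {
         std::function<void()> op;
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_queue.empty() || m_waiters != 0)
            {
               // Relinquish processing mode; a waiting synchronous
               // operation (if any) takes over and drains the queue.
               m_processing = false;
               if (m_waiters != 0) m_condition.notify_one();
               return;
            }
            op = std::move(m_queue.front());
            m_queue.pop_front();
         }
         op();
      }
   }

   std::mutex m_mutex;
   std::mutex m_syncMutex;
   std::condition_variable m_condition;
   std::deque<std::function<void()>> m_queue;
   unsigned m_waiters = 0;
   bool m_processing = false;
};
```

Note that if the processing thread sees a synchronous waiter it relinquishes processing even with work still queued; the waiter runs its own operation first and then drains the remaining queued operations in the usual way.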
One thing that you may have noticed is that none of the operations that are performed on the Activatable Object can return a result. All are fully asynchronous and are initiated in a ‘fire and forget’ style. If results need to be delivered somewhere then this must be done asynchronously, usually via some form of callback interface or by adding the result to another Activatable Object’s input queue.
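For example, a hypothetical `Accumulator` (not from my implementation; the names are illustrative) might deliver its running total through a completion callback rather than a return value:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

class Accumulator
{
public:
   // Returns nothing; the running total is delivered to 'onResult'
   // from whichever thread happens to be processing.
   void Add(int value, std::function<void(int)> onResult)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_queue.push_back({value, std::move(onResult)});
         if (m_processing) return;
         m_processing = true;
      }
      Drain();
   }

private:
   struct Op { int value; std::function<void(int)> onResult; };

   void Drain()
   {
      for (;;)
      {
         Op op;
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_queue.empty()) { m_processing = false; return; }
            op = std::move(m_queue.front());
            m_queue.pop_front();
         }
         m_total += op.value;    // only ever touched by the processing thread
         op.onResult(m_total);   // no lock held during the callback
      }
   }

   std::mutex m_mutex;
   std::deque<Op> m_queue;
   bool m_processing = false;
   int m_total = 0;
};
```

Because no lock is held when the callback runs, the callback is free to enqueue further operations, including on other Activatable Objects.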
One of the key pieces of the C++ design that I ended up with was the simplicity of the command queue. In my implementation this is just a simple memory buffer which can be expanded when it fills up. Each command is represented by a one byte value and a variable number of parameter bytes. The user of the queue is responsible for adding their commands and deblocking the commands once a thread is processing. To enable a swift switch from inserting data into the queue to processing the queued commands I have two queues: the active queue and the processing queue. When a thread makes the transition to processing mode it swaps the active queue for the processing queue and processes all of the commands in the processing queue. New commands are always queued to the active queue. There’s a lock around the active queue, which must be acquired before the queue can be manipulated, and a separate lock to enter processing mode.
When attempting to leave processing mode a thread must first check the active queue for commands. If the active queue contains commands then it must swap the active queue for the, now empty, processing queue and continue processing. A thread can only leave processing mode when both the active and processing queues are empty. This in turn means that when there are a lot of operations to be performed on a given object in a short period of time it’s usually a single thread that will end up doing all of the work.
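The queue swap can be sketched like this; a `std::vector` of `std::function` objects stands in for the memory-buffer command queue of my implementation, and the names are again hypothetical:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class ActivatableObject
{
public:
   void Enqueue(std::function<void()> op)
   {
      bool process = false;
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_activeQueue.push_back(std::move(op));
         if (!m_processing)
         {
            m_processing = true;
            process = true;
         }
      }
      if (process) Process();
   }

private:
   void Process()
   {
      for (;;)
      {
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_activeQueue.empty())
            {
               m_processing = false;   // both queues empty: leave processing mode
               return;
            }
            // O(1) switch; the lock is released before any command runs.
            m_activeQueue.swap(m_processingQueue);
         }
         for (auto &op : m_processingQueue) { op(); }
         m_processingQueue.clear();
      }
   }

   std::mutex m_mutex;
   std::vector<std::function<void()>> m_activeQueue;
   std::vector<std::function<void()>> m_processingQueue;
   bool m_processing = false;
};
```

The processing queue is only ever touched by the processing thread, so an entire batch of commands is run without taking the lock at all.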
It’s possible that an ‘over-active’ Activatable Object could never relinquish control of a thread. If a constant stream of work items comes in at a rate that means there’s always work to do then the processing thread will just continue processing. In many designs this isn’t an issue, but it could be a problem if you’re using a fixed-size pool of threads and expecting to share the threads between a larger number of objects with a lot of work to do. In this situation I’ve found that the Activatable Object needs to know a little more about how work is fed to the threads that service it. It’s then often possible to have the object decide that it has done enough for now and have the processing thread stop processing even if there are items left in the object’s work queue. Care must be taken to ensure that the object will eventually have these outstanding work items processed.
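One way to bound the work done per activation, sketched here with a hypothetical `reschedule` callback standing in for whatever mechanism feeds work back to the pool threads; it also shows the processing thread safely enqueuing new operations, one of the requirements mentioned earlier:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

class ActivatableObject
{
public:
   ActivatableObject(std::size_t maxOpsPerActivation,
                     std::function<void(std::function<void()>)> reschedule)
      : m_maxOps(maxOpsPerActivation), m_reschedule(std::move(reschedule))
   {
   }

   void Enqueue(std::function<void()> op)
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         m_queue.push_back(std::move(op));
         if (m_processing) return;  // also works when called from the processing thread
         m_processing = true;
      }
      Drain();
   }

private:
   void Drain()
   {
      std::size_t processed = 0;
      for (;;)
      {
         std::function<void()> op;
         {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_queue.empty())
            {
               m_processing = false;
               return;
            }
            if (processed == m_maxOps) break;   // done enough; still 'processing'
            op = std::move(m_queue.front());
            m_queue.pop_front();
         }
         op();
         ++processed;
      }
      // Lock released; hand the remaining work back to the dispatcher so
      // that other objects sharing the thread pool get a turn.
      m_reschedule([this]{ Drain(); });
   }

   const std::size_t m_maxOps;
   const std::function<void(std::function<void()>)> m_reschedule;
   std::mutex m_mutex;
   std::deque<std::function<void()>> m_queue;
   bool m_processing = false;
};
```

Because the object stays in processing mode whilst rescheduled, newly enqueued operations simply join the queue and are guaranteed to be processed when the continuation runs.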
An important consequence of the Activatable Object design is that the processing thread need not hold any lock whilst processing. Locks are only held when manipulating the command queue or entering and leaving processing mode. This simplifies command processing code as the processing thread can safely call into user-defined callbacks without risk of creating lock inversions.
Thanks to Chris Oldwood for encouraging me to adjust my original blog posting and write it up for Overload. He also helpfully pointed out the Active Object similarity, which was obvious once he’d mentioned it but didn’t occur to me initially. Thanks also to Fran, the Overload editor, for making the whole process of submitting an article so painless.
[Kegel] ‘Dan Kegel’s Web Hostel’, http://www.kegel.com/c10k.html#threaded
This article was previously published at: http://www.lenholgate.com/blog/2014/07/efficient-multi-threading.html