Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
Now, we can get to a discussion of (Re)Actors (and more generally, Message Passing) being one of the ways to implement our "Multi-Coring" and "Non-Blocking" requirements.
The concept behind Message Passing has been known for at least 30 years (at least since the days of occam and Erlang), but I think that the most succinct expression of it was given relatively recently, and reads as
Do not communicate by sharing memory; instead, share memory by communicating.
In other words, under the message-passing paradigm, if we have to share something between two different CPU cores (which can be represented by threads, by processes, or even by different boxes) – we don’t use mutexes or atomics to share state between them; instead – we send a message with all the necessary information.
This contrasts the Message-Passing approach with Shared-Memory multi-threading approaches (which almost inevitably require synchronization such as mutexes or atomics).
Now, we can proceed to my favorite incarnation of Message-Passing – (Re)Actors. (Re)Actors have been known for at least 40 years, under half a dozen different names, in particular Actors, Reactors, ad-hoc Finite State Machines, and Event-Driven Programs.
As we’ll be speaking about (Re)Actors quite a bit – let’s establish some basic terminology which we’ll use.
Let’s name Generic Reactor a base class for all our (Re)Actors; the only thing it has is a virtual function react().
Let’s name Infrastructure Code a piece of code which calls Generic Reactor’s react(). Quite often, this call will sit within a so-called "event loop", as shown on the slide.
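A minimal sketch of what such a Generic Reactor and its event loop might look like (Event and get_event() are placeholders for whatever your Infrastructure Code actually provides):

  // Minimal sketch of a Generic Reactor and the event loop around it.
  // Event and get_event() stand for whatever Infrastructure Code provides.
  struct Event { /* opaque to the Generic Reactor */ };

  Event get_event();  // provided by Infrastructure Code
                      // (may sit on top of select(), libuv, etc.)

  class GenericReactor {
  public:
      virtual void react(const Event& ev) = 0;  // the only thing it has
      virtual ~GenericReactor() = default;
  };

  void event_loop(GenericReactor& r) {  // Infrastructure Code
      for (;;) {
          Event ev = get_event();
          r.react(ev);  // single thread => no thread sync needed inside react()
      }
  }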
As we can see – there is one single thread, so there is absolutely no need for thread synchronization within react(); this is very important for several reasons (including making our coding straightforward and less error-prone).
Let’s also note that the get_event() function can obtain events from wherever-we-want-to – from select() (which is quite typical for Servers) to libraries such as libuv (which are common for Clients).
What is REALLY important for us now – is that when writing our (Re)Actor, we don’t really care which Infrastructure Code will call it. As the interface between Infrastructure Code and (Re)Actor is very narrow (and is defined by the list of the events processed by the (Re)Actor) – it provides very good decoupling. In the real world, I’ve seen (Re)Actors which were successfully run by FIVE very different implementations of Infrastructure Code.
Let’s also note that while ALL the code within (Re)Actor::react() HAS to be multithreading-agnostic, Infrastructure Code MAY use threading (including mutexes, thread pools, etc.) within. After all, the whole point here is that it is not THAT big a deal to rewrite Infrastructure Code entirely, and a well-behaved (Re)Actor should run within the new Infrastructure seamlessly.
And finally, let’s name any specific derivative from Generic Reactor (the one which actually implements our react() function) – a Specific Reactor.
I need to note that the virtual-function-based implementation above is not the only one possible – for example, the same thing can be done using templates instead of virtual functions – but for the purposes of our examples, we’ll stick to this one (and adapting them to alternative implementations should be pretty straightforward).
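Just to illustrate the shape of things, a Specific Reactor building on the sketch above might look along the following lines (the event decoding and the data member are entirely made up):

  // A made-up Specific Reactor; how ev is decoded is up to Infrastructure Code.
  class MyGameReactor : public GenericReactor {
      int players_online = 0;  // (Re)Actor state = ordinary data members
  public:
      void react(const Event& ev) override {
          // decode ev, update players_online, post messages, etc.;
          // no mutexes needed: react() never runs on two threads at once
      }
  };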
As I already mentioned, (Re)Actor is one of the possible incarnations of the Message Passing. As such, it means that:
- first, (Re)Actor state (which is represented by data members of SpecificReactor) is exclusive to the (Re)Actor, and moreover – only one member function of (Re)Actor can run at the same time. This, in turn, means that we don’t need to bother with those extremely-error-prone inter-thread synchronization mechanisms such as mutexes <phew />.
- second, ALL the communication with other (Re)Actors goes via messages. To enable it, Infrastructure Code has to provide a function such as postMessage(). Here, ReactorAddress can be pretty much anything which is relevant to your (Re)Actor (for distributed systems, personally I usually argue for having it as a meaningful string, so that translation into IP:port format can be done separately and transparently from the app level).
- For the receiving (Re)Actor, this message is translated into a special Event type (EventMessageFromOtherReactor or something); see the sketch right after this list.
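A sketch of how such an API might look (the exact names and signatures are up to your Infrastructure Code; everything below is merely an illustration of the idea):

  #include <cstdint>
  #include <string>
  #include <vector>

  // Hypothetical message-passing API provided by Infrastructure Code.
  using ReactorAddress = std::string;  // e.g. a meaningful name, translated
                                       // into IP:port outside of app level

  struct Message { std::vector<uint8_t> payload; };

  // sending side: called from within react() of the sending (Re)Actor
  void postMessage(const ReactorAddress& dst, Message msg);

  // receiving side: Infrastructure Code wraps the message into an Event
  // and delivers it via the usual react() call of the destination (Re)Actor
  struct EventMessageFromOtherReactor {
      ReactorAddress from;
      Message msg;
  };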
In addition, as we have observed, (Re)Actors have a VERY clean and a VERY simple separation between Infrastructure Code and (Re)Actor code. And as a result of this VERY clean separation, it becomes VERY easy to deploy exactly the same (Re)Actor code into very different Infrastructures. In particular, I’ve seen (Re)Actors deployed in the following configurations:
- Thread for each (Re)Actor. This is the most obvious – but certainly not the only possible – deployment model. Sometimes it is further restricted to one such thread per process (which is useful to add additional protection against inter-(Re)Actor memory corruption or heap fragmentation).
- Also, a (Re)Actor can be used as a building block of another (Re)Actor. (Re)Actors are composable (actually, they’re more flexible than any other building block I know about), so a (Re)Actor can be used as a part of another (Re)Actor quite easily.
- Multiple (Re)Actors per thread. This is possible due to the non-blocking nature of (Re)Actors, and with not-so-loaded (Re)Actors it can be useful to save on thread maintenance costs, which in turn improves overall performance. In real-world production environments, I’ve seen up to 32 (Re)Actors per thread. Oh, and BTW – while usually (Re)Actors within the same thread are of the same type – it is not a strict requirement, and I’ve seen systems with different (Re)Actor types running within the same thread.
- And to demonstrate a more complex deployment scenario: it is possible to build a system which stores the states of all (Re)Actors in a centralized manner in a cache such as memcached, Redis, or even a database. Then, when a request to process something in one of the (Re)Actors arrives, the (Re)Actor’s state can be retrieved from Redis, deserialized (effectively constructing the (Re)Actor object from the serialized state), then the (Re)Actor::react() function can be called, and then the (potentially modified) state can be serialized and written back to Redis. To avoid races, we’ll need some locking mechanism (preferably an optimistic one), but the whole thing is perfectly viable, AND for certain classes of (Re)Actors it can provide an easy way to ensure Fault Tolerance (a sketch follows right after this list).
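As a very rough sketch of this last deployment option (all the helper functions here – redis_get(), redis_put_if_version_unchanged(), serialize(), deserialize() – are hypothetical and would belong to Infrastructure Code):

  // "state lives in Redis" deployment, per-request handling (sketch only)
  void handle_request(const ReactorAddress& addr, const Event& ev) {
      for (;;) {  // optimistic-locking retry loop
          auto [blob, version] = redis_get(addr);   // fetch serialized state
          SpecificReactor r = deserialize(blob);    // reconstruct the (Re)Actor
          r.react(ev);                              // the very same react()
          auto new_blob = serialize(r);             // (potentially modified) state
          if (redis_put_if_version_unchanged(addr, new_blob, version))
              return;                               // nobody raced us: done
          // somebody else updated the state meanwhile: retry with fresh state
      }
  }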
We don’t have time to go into further variations of possible deployment options, but what is most important for us now, is that all these deployments can be made with our (Re)Actor code (more specifically – (Re)Actor::react() function) BEING EXACTLY THE SAME. It means that the choice of the surrounding Infrastructure becomes a DEPLOYMENT-TIME option, and can be changed "on the fly" with EXACTLY ZERO work at app-level.
This has been observed to provide significant benefits for real-world deployments, as such flexibility allows designing optimal configurations – AND without rewriting app-level code (which is almost universally a non-starter).
When writing (Re)Actor code, there are two not-so-obvious features to be kept in mind. The first one is what I prefer to call (mostly-)non-blocking processing.
Non-blocking code has long been criticized for being unnecessarily convoluted, but I think that this perception stems from two misunderstandings. One misunderstanding assumes that to write non-blocking code, you have to resort to stuff such as lambda pyramids or (Dijkstra forbid) "callback hell". While it was indeed the case 10 years ago, these days several major programming languages (AT LEAST C++, C#, and JavaScript) have introduced the concept of "await". We don’t have time to discuss how it works internally, BUT – we’ll use it in our examples to demonstrate how it can be used at the app-level.
The second misunderstanding is that the non-blocking approach is all-or-nothing, so there is a MISperception that if we have started to go non-blocking, we cannot have any blocking calls whatsoever. While this assumption MIGHT stand for systems handling life-or-death situations, for your usual business system nothing can be further from the truth (I’ve seen HUGE real-world systems which mixed blocking processing with non-blocking one – and with great success too).
Very briefly – for ALL the I/O processing, there are two DISTINCT scenarios. In the first scenario, we DO NOT want to process ANY potentially happening events while we’re waiting for the result of the outstanding call. In this case, IT IS PERFECTLY FINE TO USE A BLOCKING CALL. Of course, there is a caveat: IF the operation just MIGHT take too long, even once in a long while, we DO want to process intervening events for it.
This might be the reason for the misunderstanding, but in reality, things tend to be quite simple: THERE ARE things out there which, if they happen to take too long, mean that THE SYSTEM IS ALREADY BROKEN BEYOND ANY REPAIR. For example, if we’re trying to read 1 kilobyte from a local disk and it takes any observable-for-user time – the PC is most likely already dead; and on the Server-Side, IF our database (which normally responds within 10ms) starts to exhibit response times an order of magnitude longer – the system won’t work as-we-expect REGARDLESS of us using a blocking or non-blocking call here. Real-world examples of such operations-which-we-MAY-handle-as-blocking often include accessing the local disk and/or database, and sometimes even operations in the Server-Side intra-datacenter LAN.
If, on the other hand, we MIGHT want to process something-which-has-happened-while-we’re-waiting – we SHOULD process it in a non-blocking manner; however, in this case, any complications are not because of the CODE being non-blocking, but because of the EXTERNAL REQUIREMENT to process those intervening events (and believe me, handling intervening events via mutex on a state is MUCH worse). One example of such operations which ALWAYS need to be handled as non-blocking – is ANY operation which goes over the Internet.
Fortunately, with the "await" operator in mind, the non-blocking code looks REALLY simple. Not just that, but as we can see, it looks REALLY simple EVEN IN C++.
On the other hand, I would be cheating if I didn’t mention one very significant potential issue arising here. At the point of co_await, there can be a context switch to process a different event (hey, this is EXACTLY what we wanted when we added co_await in the first place). And this processing can CHANGE the state of our (Re)Actor, sometimes in an unexpected manner.
Let’s consider the following blocking code...
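A minimal sketch of such blocking code (readFile() and the data member m_x come from the discussion below; the file name and everything else are purely illustrative):

  // Blocking version (illustrative sketch)
  void SpecificReactor::react(const Event& ev) {
      int x = m_x;                              // m_x is a data member of our (Re)Actor
      std::string s = readFile("config.txt");   // BLOCKING call: we simply wait here
      // ... use s ...
      assert(x == m_x);
  }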
With this code, the assertion at the end always stands (well, with some not-so-unreasonable assumptions about the readFile() function not having too drastic side effects).
Now, when moving to the non-blocking paradigm, the same assertion MAY FAIL. This can happen because while we’re waiting for the result of readFile(), another event could come in, and its processing can change m_x while we’re waiting on co_await operator.
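A sketch of the non-blocking counterpart (ReactorFuture<> and readFileAsync() stand for whatever awaitable machinery your Infrastructure Code provides; the exact spellings are assumptions, and this coroutine is assumed to be invoked from within react()):

  // Non-blocking version (illustrative sketch)
  ReactorFuture<void> SpecificReactor::handleReadRequest(const Event& ev) {
      int x = m_x;
      std::string s = co_await readFileAsync("config.txt");
          // other events MAY be processed while we're suspended here...
      // ... use s ...
      assert(x == m_x);  // ...so this assertion MAY now fail: an intervening
                         // event could have changed m_x
  }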
OTOH, we have to note that:
- first, as discussed before, we DO NOT need to use non-blocking handling AS LONG AS we don’t need to process-events-while-waiting.
- second, this problem is INHERENT to ANY kind of handling-of-intervening-events-while-waiting-for-operation-to-complete. In other words, WHATEVER WE DO, there is no way to avoid this kind of interaction (and very often, it is exactly this interaction which we need when dealing with intervening events).
- third, there is still NO need for thread sync in the code on the slide (while context switches are possible, they can happen ONLY at those co_await points, so by the time when we’re running again, we’re GUARANTEED to have exclusive access to members of our (Re)Actor <phew />)
- last but certainly not least, this code is inherently MUCH SIMPLER and MUCH less error-prone than Shared-Memory approaches. Not only is all the thread sync gone (and it is a MAJOR source of non-debuggable problems), but also, while we DO have a potential context switch here, its position is well-defined, while in a mutex-based program such a context switch can happen AT EACH AND EVERY POINT IN OUR PROGRAM (which makes reasoning about potential effects of such switches orders of magnitude more difficult).
As a nice side effect, the same approach can be used to ensure parallelization even while staying within the same (Re)Actor.
The idea here is very simple: we’re merely saying "hey, execute for me this-function-with-these-params (or ‘this lambda function using these captures’) and resume execution when you’re done." While lengthy calculations are performed – we can still process incoming events, and context switches and their implications here are exactly the same as for the non-blocking code discussed above, which in turn simplifies coding further.
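As a sketch of this idea (runCalculation() is a hypothetical Infrastructure-provided helper which runs the lambda elsewhere, say on a worker thread, and resumes us when the result is ready; the other names are made up too):

  // Offloading a lengthy calculation while staying within the same (Re)Actor
  ReactorFuture<void> SpecificReactor::handleHeavyRequest(const Event& ev) {
      double result = co_await runCalculation([data = extractData(ev)] {
          return expensiveNumberCrunching(data);  // runs outside our thread
      });
      // back in our (Re)Actor: exclusive access to members is guaranteed again,
      // but (as above) members MAY have changed while we were suspended
      m_lastResult = result;
  }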
The second not-so-obvious-feature of (Re)Actors (and the one which I happen to love a LOT <smile />) is determinism.
Strictly speaking, making (Re)Actors deterministic is not required to run a (Re)Actor-fest architecture, so we MAY have our (Re)Actors as non-deterministic; however, if we spend additional effort on making them deterministic, we’ll get several very-useful-in-real-life properties, so I consider it a VERY-nice-to-have.
For our purposes, the following definition (which corresponds to "same-executable determinism" from [Nobugs17]) will do:
The program is deterministic if
we can write down all its inputs into inputs-log,
and then replay this inputs-log using the same executable,
obtaining EXACTLY THE SAME outputs.
We don’t have time to discuss HOW deterministic (Re)Actors can be implemented (last year, I gave a 90-minute talk just about determinism at ACCU 2017), but I’ll still outline a few very basic points about it.
In general, there are three major sources of non-deterministic behavior:
- multithreading (which doesn’t apply to (Re)Actors, <phew />)
- Compiler/library/platform (which doesn’t apply to same-executable determinism, another <phew />)
- All kinds of system calls (starting from very-innocently-looking GetTickCount() and gettimeofday()).
To become deterministic (and in addition to recording all the input Events), we have to record the return values of these System Calls to the same inputs-log; this is known as "Call Wrapping". Then, when we’re replaying our inputs-log, as a result of deterministic behavior up to the point of the System Call, the call-during-replay will happen EXACTLY at that point where we have a record of the return value within the inputs-log, so it is trivial to return the recorded value – and to ensure determinism at this point. Then, determinism of the whole replay can be proven by induction.
"Call Wrapping" allows to handle ANY system call; however, alternatively, quite a few popular System Calls (such as time-related stuff) can be handled in a faster and/or more flexible manner using other techniques (which were discussed in detail in my talk a year ago).
An example implementation of Call Wrapping MAY look as follows.
In app-level code, we’ll have to replace all the system calls with their deterministic wrappers (or provide seamless replacements). After we do it, the implementation of the wrapper becomes trivial: in "Recording" mode we’re calling the system call AND writing its return value to the inputs-log, and in "Replay" mode we’re not calling anything from system level, merely reading the corresponding frame from the inputs-log and returning the result back to the calling app.
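A minimal sketch of such a wrapper for a single time-related call (the mode flag, InputsLog, and all the helper names here are illustrative assumptions, not a definitive implementation):

  enum class Mode { RECORDING, REPLAY };

  class DeterministicReactor : public GenericReactor {
      Mode      mode;
      InputsLog log;   // hypothetical inputs-log, also storing incoming Events

  protected:
      // app-level code calls this instead of gettimeofday()/GetTickCount()
      uint64_t deterministic_now_ms() {
          if (mode == Mode::RECORDING) {
              uint64_t t = system_now_ms();       // the real system call
              log.write_frame(FRAME_TIME_MS, t);  // record its return value
              return t;
          } else {                                // REPLAY:
              return log.read_frame(FRAME_TIME_MS);  // no system call at all,
                                                     // just the recorded value
          }
      }
  };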
Some further practical considerations when we’re implementing determinism:
- in production, storing inputs-log forever-and-ever is not practical (at least most of the time)
- to deal with it, we may use circular logging – but this in turn requires state serialization.
- Serialization can be done in different ways, but one of the most interesting ones (and certainly the fastest one) is the one which assigns an allocator to the (Re)Actor, and serializes the whole allocator at CPU-page level; while there are some caveats along the way, it seems to work pretty well in practice (unfortunately, this implementation is very new and wasn’t tested in massive deployments yet).
As soon as we make our (Re)Actors deterministic, we obtain the following all-important benefits:
- more meaningful testing. As non-reproducible testing is not exactly meaningful, higher-level test cases often have to ignore those-parts-of-the-output-which-depend-on-the-specific-run (such as time-based outputs). With deterministic (Re)Actors, it is not a problem.
- production post-mortem analysis. If we’re writing all the events (and returns of system calls) to an in-memory log, then, if our app crashes or asserts, we can have not only the-state-which-we-had-at-the-moment-of-the-crash but also the last-N-minutes-of-the-program’s-life. As a result, most of the time we can see not only WHAT went wrong, but also WHY it went wrong. This has been seen to allow fixing 90+% of the bugs after the very first time they’re reported IN PRODUCTION – and this is a feature whose importance is VERY difficult to overestimate.
- Low-latency Fault Tolerance and/or (Re)Actor Migration. We don’t have time to discuss it in detail, but we can say that at least the former is a very close cousin of Fault Tolerance via so-called "Virtual Lockstep", used in the past by VMware.
- Replay-Based Regression Testing. This can be a real life-saver for major refactorings – but I have to note that it requires a bit stronger guarantees than same-executable determinism; these guarantees are usually still achievable in practice.
Now let’s see how (Re)Actor-based architectures (whether deterministic or not) satisfy our two Big Business Requirements.
When speaking about multi-coring and scalability – everything looks very, very good for (Re)Actors (and more generally – for message-passing architectures). As we don’t share ANY state between different cores (threads, processes, whatever-else) – it means that we’re working under the so-called Shared-Nothing model, which is EXTREMELY good for Scalability. Not only do Shared-Nothing architectures allow using multiple cores – moreover (and unlike Shared-Memory-based approaches), they scale to a pretty much INFINITE number of cores in a near-linear manner. Sure, there are some corner cases where pure (Re)Actors don’t work too well (in particular, in case of one huge monolithic state), but these cases are rather few and far between (and as we’ll see in Part III, there are not-too-ugly workarounds for them too).
When speaking of guarantees against blocking, under (Re)Actors this is achieved via the (Mostly-)Non-Blocking processing we just described. Indeed, if our current thread can process other inputs while that-I/O-call-which-can-take-5-minutes is in progress – we won’t have any problems with our app being responsive enough for the usual end-user.
In practice, I’ve seen (Re)Actors exhibiting latencies as low as 50us, though significantly reducing this number would be complicated, if possible at all.
Now we can try comparing the two competing programming approaches: on the one hand, the Shared-Memory one, and on the other hand, (Re)Actors (or more generally, message passing).
First, let’s compare code complexity under these two approaches. As I already mentioned, IF intervening events do NOT need to be handled – both approaches are exactly the same. As for the handling of intervening events – with mutexes and stateful systems it is a surefire recipe for disaster, but for (Re)Actors it is perfectly manageable.
Now, let’s compare our adversaries from the point of view of "how easy it is to write error-free programs under these paradigms" (which is the opposite of being error-prone). On this front, (Re)Actors win hands down. If there are no mutexes in our programs – there is no chance to forget to lock them when it is necessary, and there is no chance to lock them in the wrong order (opening the door for deadlocks). Sure, there are still ways to mess our program up – but they won’t be specific to (Re)Actors. In the real world, (Re)Actor-based systems have been seen to exhibit 5x less unplanned downtime than the industry average (I’ll discuss it in a bit more detail in Part III of the talk).
Our next criterion is testability. In general, Shared-Memory programs are inherently untestable, which in turn lets LOTS of bugs into production. OTOH, for (Re)Actors testability is generally very good even without taking any special measures such as determinism; and with determinism implemented, they become perfectly testable.
The next line is closely related to the previous one, but it is soo important in practice that I have to mention it separately: it is "how we deal with bugs in production". For Shared-Memory systems, there are LOTS of cases when we have no idea why it crashed – and worse, we don’t even know how to find it out, causing LOTS of delays in fixing those all-important production bugs. OTOH, for (Re)Actors, simple text-based logging tends to help A LOT, allowing us to fix 80+% of the bugs after their first manifestation; and if we have our (Re)Actors deterministic, the number can go up to 90-95%.
The fifth line in our comparison table is Scalability. By definition, Shared-Memory systems DO NOT scale beyond one single box. OTOH, with (Re)Actors we’re speaking about POTENTIALLY-UNLIMITED scalability.
Then, there is performance (which is often a VERY different beast from Scalability). Shared-Memory approaches MAY provide the FULL spectrum of performance, ranging from Poor to Excellent (more on it in a jiffy); OTOH, (Re)Actors tend to exhibit Good to Excellent performance. In the almighty real world, I have seen examples when (Re)Actor-based systems were performing about 30 TIMES more useful work per server box (again, we’ll briefly discuss it in Part III).
And on the last line, I WILL mention that-only-thing which does NOT allow us to eliminate Shared-Memory approaches entirely: Latencies. NON-blocking Shared-Memory allows reaching latencies which are measured in single-digit microseconds. As for (Re)Actors, they’re limited to some tens of microseconds. There are applications out there (notably HFT) which DO need this kind of latency, but for 99.99% of all the business apps out there, double-digit microseconds are good enough.
Speaking in a bit more detail about performance and latencies, I have to note that generic Shared-Memory approach has two flavors which tend to be DRASTICALLY different from the performance point of view.
The first one is BLOCKING Shared-Memory, the one which uses mutexes (or any other blocking mechanism) at app-level to synchronize between threads. This is by far the most popular option for Shared-Memory systems, and it tends to fail BADLY in terms of performance (as was already noted several times during this conference, one of the authors of POSIX threads has said that they should have named the "mutex" a "bottleneck" to reduce the chances of it being misused).
This happens due to context switch costs, contention, and starvation (and BTW, by its very definition mutex is not really a way to make programs parallel, but a way to make them serialized).
Latency-wise, blocking Shared-Memory tends to be very uneven (and usually very unfair too) – and requires specifying acceptable latencies in terms of PERCENTILES.
Performance problems with blocking Shared-Memory have long been recognized, and over time several ways were invented to make Shared-Memory perform better (using stuff such as memory fences, atomics, and RCU). This indeed is known to perform really well (at least within one NUMA node).
Back to (Re)Actors and, more generally, to Message Passing. Strictly speaking, Message-Passing also has two flavors: the first one is "classical" message passing (which includes (Re)Actors, Erlang-style message passing, MPI, and so on). It tends to have very good performance (well, MPI has been successfully used for HPC for decades). Latency-wise, it can provide latencies which are as low as double-digit microseconds. While this performance (and especially latency) is usually not AS GOOD as that of NON-BLOCKING Shared-Memory – it is perfectly sufficient for 99.9% of real-world apps.
The second flavor of Message-Passing is so-called Message-Driven (a.k.a. Data-Driven) programming, which is used in particular in the HPX library. It tends to improve calculation performance even further – making it a direct rival to the much more complicated non-blocking Shared-Memory stuff performance-wise. In terms of latencies, while I don’t have any hard data on message-driven stuff, I don’t expect it to be AS GOOD as that of non-blocking shared-memory systems.
To make this table complete, we also have to mention scenarios where we can PRETEND that we don’t care about the synchronization at all (it happens when contention is so low that we can assume it is negligible, so ANY synchronization will work with about the same performance). In this case, as the synchronization is virtually non-existent – we can get near-perfect performance.
To briefly summarise this table – I am sure the ONLY technology which we MIGHT have to rule out due to performance issues, is mutex-based Shared-Memory; performance of ALL the other architectures is more-or-less comparable, and each of them happens to have their own virtues (and applicability fields).
To summarise Part II of my talk today, (Re)Actors:
- (a) enable reasonable writing of app-level code, which is either the SAME as, or SIMPLER than, equivalent shared-memory code (and especially so with await in mind)
- (b) enable a very wide range of deployment options
- (c) are less error-prone, are testable, and allow handling production bugs very efficiently
- (d) scale and perform very well
- and (e) can be made deterministic, which enables lots of its own goodies