Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
[[
This is Chapter 16(d) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing“. All the content published during Beta Testing is subject to change before the book is published.
To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]
We’ve already discussed things related to sockets; now let’s discuss the stuff which is often needed (in particular, it is of Utmost Importance when implementing Reactors), but which is not common enough to be universally available as part of the operating system.
I’m speaking about queues in the context of inter-thread communications (where “threads” are usual preemptive threads able to run on different cores, and not cooperative ones a.k.a. fibers). And not just about “some” implementation of a queue, but about queues which have certain properties desirable for our Reactors a.k.a. ad-hoc Finite State Machines a.k.a. Event-Driven Programs.
Simple MWSR Queue
What we usually need from our Queue, is an ability to push asynchronous messages/events there (usually from different threads), and to get them back (usually from one single thread) – in FIFO order, of course. Such Multiple-Writer-Single-Reader queues are known as MWSR queues. In our case, reading from an empty queue MUST block until something appears there; this is necessary to avoid polling. On the other hand, writing MAY block if the queue is full, though in practice this should happen Really Rarely.
Let’s consider the following simple implementation (with no blocking, as our queue cannot become "full"):
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <utility>

template <class Collection>
class MWSRQueue {
private:
  std::mutex mx;
  std::condition_variable waitrd;
  Collection coll;
  bool killflag = false;

public:
  using T = typename Collection::value_type;

  MWSRQueue() {
  }

  void push_back(T&& it) {
    //as a rule of thumb, DO prefer move semantics for queues
    //  it reduces the number of potential allocations
    //  which happen under the lock(!), as such extra
    //  unnecessary allocations due to unnecessary copying
    //  can have Bad Impact on performance
    //  because of significantly increased mutex contention
    {//creating scope for lock
      std::unique_lock<std::mutex> lock(mx);
      coll.push_back(std::move(it));
    }//unlocking mx

    waitrd.notify_one();
    //Yep, notifying outside of lock is usually BETTER.
    //  Otherwise the other thread would be released
    //  but will immediately run into
    //  our own lock above, causing unnecessary
    //  (and Very Expensive) context switch
  }

  std::pair<bool,T> pop_front() {
    //returns pair<true,popped_value>,
    //  or - if the queue is being killed - pair<false,T()>
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }

    if(killflag)
      return std::pair<bool,T>(false,T());
      //creates an unnecessary copy of T(),
      //  but usually we won’t care much at this point

    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
      std::unique_lock<std::mutex> lock(mx);
      killflag = true;
    }//unlocking mx

    waitrd.notify_all();
  }
};
[[TODO!:test!]]
This is a rather naïve implementation of MWSR queues, but – it will work for quite a while, and it uses only very standard C++11, so it will work pretty much everywhere these days. More importantly, it does implement exactly the API which you need: you can push the items back from other threads, you can read your items from a single thread, and you can request that the wait (if any) is aborted, so your thread can terminate (for example, if you want to terminate your app gracefully). Moreover, our queue provides the whole API which you’ll ever need from your queue; this IS important as it means that you can re-implement your queue later if necessary in a better-performing manner, and nobody will notice the difference.
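Just to illustrate the API above (this is a toy example of mine, not code from the book’s infrastructure): two producer threads push ints into the queue, a single consumer thread pops them, and kill() is used to let the consumer terminate gracefully.

#include <cstdio>
#include <deque>
#include <thread>
//plus the MWSRQueue class from above

int main() {
  MWSRQueue<std::deque<int>> q;

  std::thread producer1([&q]{
    for(int i = 0; i < 1000; ++i)
      q.push_back(std::move(i));  //int moves are trivial; for real
                                  //  messages, move semantics matter
  });
  std::thread producer2([&q]{
    for(int i = 1000; i < 2000; ++i)
      q.push_back(std::move(i));
  });

  std::thread consumer([&q]{
    int count = 0;
    while(true) {
      auto res = q.pop_front();  //blocks while the queue is empty
      if(!res.first)
        break;                   //queue was kill()-ed
      ++count;
    }
    std::printf("consumed %d events\n", count);
  });

  producer1.join();
  producer2.join();
  q.kill();  //wakes the consumer so its thread can terminate
  consumer.join();
  return 0;
}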
A nice (though side) implementation detail is that our template class MWSRQueue can use any collection which implements the usual-for-std-containers functions push_back(), pop_front(), and front(). It means that you can use std::list<> or std::deque<> directly, or make your own class which satisfies this API (for example, you can make your own prioritized queue1). Oh BTW, and (by pure accident) it seems to be exception-safe too (even in a strong sense2).
OTOH, this naïve implementation has several significant drawbacks, which MAY come into play as soon as we become concerned about performance and reliability. Let’s see these drawbacks one by one.
1 Note that std::priority_queue<> as such does NOT guarantee the order in case of elements with equal priority, so to make a FIFO-queue-with-priority out of it, you’ll need to make another adapter which assigns number-of-item-since-very-beginning as one of the parameters (and then sorts by the tuple (priority, number_of_item_since_very_beginning) – and DON’T forget about potential wraparounds too! – that is, unless you’re using uint64_t as your number_of_item_since_very_beginning, in which case in most practical scenarios you can demonstrate that wraparound will never happen).
2 assuming that your type T has a move constructor with a no-throw guarantee, which it usually does
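To illustrate the adapter idea from footnote 1, here is a sketch of mine (the names PrioritizedItem and FifoPriorityCollection are placeholders, not from the book): equal-priority items keep FIFO order because the comparison falls back to a uint64_t item number. Wiring it into the full push_back()/pop_front()/front() interface expected by MWSRQueue<> is left out for brevity.

#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

template<class T>
struct PrioritizedItem {   //hypothetical helper, not from the book
  int priority;            //here, larger value = higher priority
  uint64_t seq;            //number-of-item-since-very-beginning
  T payload;

  bool operator<(const PrioritizedItem& other) const {
    //std::priority_queue<> pops the *largest* element first, so
    //  "largest" must mean "highest priority; among equal priorities,
    //  the earliest (i.e. smallest) sequence number"
    if(priority != other.priority)
      return priority < other.priority;
    return seq > other.seq;
  }
};

template<class T>
class FifoPriorityCollection {  //hypothetical name
  std::priority_queue<PrioritizedItem<T>> q;
  uint64_t nextseq = 0;  //with uint64_t, wraparound never happens in practice

public:
  void push(int priority, T&& payload) {
    q.push(PrioritizedItem<T>{priority, nextseq++, std::move(payload)});
  }
  const T& top() const { return q.top().payload; }
  void pop() { q.pop(); }
  bool empty() const { return q.empty(); }
};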
Fixed-Size Queues
As our class MWSRQueue above is organized, queue size may grow indefinitely. This might look like a Good Thing from a theoretical point of view (“hey, we don’t put any limits on our Queue”), but in the real world it often causes severe issues 🙁 . For example, if for some reason one of your servers/Reactors starts to delay processing (or even hangs), such infinite-sized queues can easily eat up all the available RAM, causing swap or denial of allocations, and potentially affecting MUCH more players than it should.
Flow Control
Let’s consider what will happen in the case of one of the Reactors hanging/slowing-down if we limit the size of ALL our queues within the system.
If we limit the sizes of ALL our queues, AND all our connections are TCP, then in case of severe overload the following scenario will unfold. First, one queue (the one close to the slow Reactor) will get full; in turn, the queue being full will cause the TCP thread which fills it to block.3 Then, the TCP thread on the other side of the TCP connection will find that it cannot push data into TCP, so it will block too. Then, the queue which feeds that TCP thread on the pushing side will get full. Then, the sending Reactor’s supposedly-non-blocking function sendMessage() will be unable to push the message into the queue-which-just-became-full, so our supposedly-non-blocking Reactor will block.
As we can see, when working with all flow-controlled transports (TCP is flow-controlled, and a fixed-size queue is flow-controlled too), severe delays tend to propagate from the target to the source. Whether it is good or not – depends on specifics of your system, though from what I’ve seen, in most cases such propagating delays are at least not worse than exhausting RAM, which happens in case of infinite queues.
Also, it gives us back the control over what-we-want-to-do in case of such a problem. For example, consider a single Reactor which processes messages from pretty much independent channels and feeds them to different target Reactors; to avoid it blocking ALL the channels whenever one of the target Reactors is slow or hung, we MAY be able to “teach” our Reactor to postpone just the messages from the affected channel, while working with the other channels as usual. Implementing it would require two things: (a) adding a trySendMessage() function, which tries to send a message and returns a “send wait handle” if the sending is unsuccessful, and (b) adding a list_of_wait_handles parameter to the pop_front() function, with the understanding that if some space becomes available in any of the “send wait handle”s, pop_front() stops the wait and returns that “send wait handle” to the caller (and then infrastructure code will need to send a message/call a callback or continuation from our Reactor).
3 in case when there is no TCP between Reactors so that Reactors are interacting directly, sending Reactor’s supposedly-non-blocking sendMessage() will block immediately, as described below
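To make points (a) and (b) above a bit more concrete, here is a purely illustrative interface sketch; all the names (SendWaitHandle, trySendMessage(), ReactorInfrastructure, PopResult) are placeholders of mine, not an API defined in the book:

#include <utility>
#include <vector>

struct SendWaitHandle {
  void* target_queue = nullptr;  //opaque id of the queue which was full
};

template<class Message>
class ReactorInfrastructure {  //hypothetical infrastructure-code facade
public:
  //(a): non-blocking send; .first==true means 'sent', otherwise
  //  .second is a handle which can be passed to pop_front() below
  std::pair<bool, SendWaitHandle> trySendMessage(Message&& msg);

  //(b): waits either for an incoming message, or for space to become
  //  available in one of the queues identified by wait_handles,
  //  whichever happens first
  struct PopResult {
    bool got_message = false;
    Message msg;                 //valid only if got_message
    bool got_free_space = false;
    SendWaitHandle freed_queue;  //valid only if got_free_space
  };
  PopResult pop_front(const std::vector<SendWaitHandle>& wait_handles);
};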
Dropping Packets
When dealing with messages coming over TCP or over internal communications, we’re usually relying on ALL the messages being delivered (and in order too); that’s why dropping messages is usually not an option on these queues.4
However, for UDP packets, there is always an option to drop them if the incoming queue is full;5 this is possible because any UDP packet can be dropped anyway, so our upper-level protocols need to handle dropped packets regardless of us dropping some packets at the application level. Moreover, we can implement a selective packet drop if we feel like it (for example, we can drop less important traffic in favor of the more important one).
4 strictly speaking, if you DO implement reliable inter-Server communications as described in Chapter III, you MAY be able to force-terminate TCP connection, AND to drop all the messages from that connection from the queue too. Not sure whether it is ever useful though.
5 Or even almost-full; see, for example, the [RED] family of congestion avoidance algorithms
Full Queues are Abnormal. Size. Tracking
Regardless of the whether-to-block-or-to-drop choice outlined above, full queues SHOULD NOT happen during normal operation; they’re more like a way to handle scenarios when something has Already Gone Wrong, and to recover from them while minimizing losses. That’s why it is Really Important to keep track of all the queue blocks (due to the queue being full), and to report them to your monitoring system; for this purpose, our queues should provide counters so that infrastructure code can read them and report to a system-wide monitor (see more on monitoring in Vol.3).
Now let’s discuss the question of the maximum size of our fixed-size queues. On the one hand, we obviously do NOT want to have any kind of swapping because of the memory allocated to our fixed-size queues. On the other hand, we cannot have our queues limited to a maximum size of 2 or 3. If our queue is too small, then we can easily run into scenarios of starvation, when our Reactor is effectively blocked by the flow control mechanisms from doing things (while there is work for it somewhere in the system, it cannot reach our Reactor). In extreme cases (and ultra-small sizes like 2 or 3), it is even possible to run into deadlocks (!).6
My recommendation when it comes to the maximum size of the queues goes as follows:
- DO test your system with all queue sizes set to 1:
  - see whether you run into any deadlocks
  - if yes – DO understand whether you really need those dependencies which are causing the deadlocks
  - if yes – DO establish such limits on minimum queue sizes as guarantee deadlock-free operation
- Start with a maximum size somewhere between 100 and 1000; most of the time, this is large enough to stay away from blocking, while still small enough to avoid allocating too much memory for the queues.
- DO monitor maximum queue sizes in production (especially “queue is full” conditions), and act accordingly
6 There is a strong argument that deadlocks SHOULD NOT happen even with all queue sizes == 1. I would not say that this qualifies as a firm rule; however, I do agree that if using flow-controlled queues, you SHOULD test your system with all queue sizes set to 1, see below
Implementing Fixed-Size Queue with Flow Control
Now, after we’ve specified what we want, we’re ready to define our own Fixed-Size Queues. Let’s start with a Fixed-Size Queue with Flow Control:
template <class FixedSizeCollection>
class MWSRFixedSizeQueueWithFlowControl {
private:
  std::mutex mx;
  std::condition_variable waitrd;
  std::condition_variable waitwr;
  FixedSizeCollection coll;
  bool killflag = false;

  //stats:
  int nfulls = 0;
  size_t hwmsize = 0;//high watermark on queue size

public:
  using T = typename FixedSizeCollection::value_type;

  MWSRFixedSizeQueueWithFlowControl() {
  }

  void push_back(T&& it) {
    //if the queue is full, BLOCKS until some space is freed
    {//creating scope for lock
      std::unique_lock<std::mutex> lock(mx);
      while(coll.is_full() && !killflag) {
        waitwr.wait(lock);
        ++nfulls;
        //this will also count spurious wakeups,
        //  but they’re supposedly rare
      }

      if(killflag)
        return;

      assert(!coll.is_full());
      coll.push_back(std::move(it));
      size_t sz = coll.size();
      hwmsize = std::max(hwmsize,sz);
    }//unlocking mx

    waitrd.notify_one();
  }

  std::pair<bool,T> pop_front() {
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }

    if(killflag)
      return std::pair<bool,T>(false,T());

    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    waitwr.notify_one();
    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
      std::unique_lock<std::mutex> lock(mx);
      killflag = true;
    }//unlocking mx

    waitrd.notify_all();
    waitwr.notify_all();
  }
};
Implementing Fixed-Size Queue with a Drop Policy
And here goes a Fixed-Size Queue with a Drop Policy:
template <class FixedSizeCollection, class DropPolicy>
//DropPolicy should have a function
//  pushAndDropOne(T&& t, FixedSizeCollection& coll)
//  it MAY either skip t,
//  OR drop something from coll while pushing t
class MWSRFixedSizeQueueWithDropPolicy {
private:
  DropPolicy drop;

  std::mutex mx;
  std::condition_variable waitrd;
  FixedSizeCollection coll;
  bool killflag = false;

  //stats:
  int ndrops = 0;
  size_t hwmsize = 0;//high watermark on queue size

public:
  using T = typename FixedSizeCollection::value_type;

  MWSRFixedSizeQueueWithDropPolicy(const DropPolicy& drop_)
  : drop(drop_) {
  }

  void push_back(T&& it) {
    //if the queue is full, calls drop.pushAndDropOne()
    {//creating a scope for lock
      std::unique_lock<std::mutex> lock(mx);

      if(coll.is_full()) {//you MAY want to use
                          //  unlikely() here
        ++ndrops;
        drop.pushAndDropOne(std::move(it), coll);
        return;
      }

      assert(!coll.is_full());
      coll.push_back(std::move(it));
      size_t sz = coll.size();
      hwmsize = std::max(hwmsize,sz);
    }//unlocking mx

    waitrd.notify_one();
  }

  std::pair<bool,T> pop_front() {
    std::unique_lock<std::mutex> lock(mx);
    while(coll.size() == 0 && !killflag) {
      waitrd.wait(lock);
    }

    if(killflag)
      return std::pair<bool,T>(false,T());

    assert(coll.size() > 0);
    T ret = std::move(coll.front());
    coll.pop_front();
    lock.unlock();
    return std::pair<bool,T>(true, std::move(ret));
  }

  void kill() {
    {//creating scope for lock
      std::unique_lock<std::mutex> lock(mx);
      killflag = true;
    }//unlocking mx

    waitrd.notify_all();
  }
};
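To illustrate the DropPolicy parameter (these are examples of mine, not from the book), here are two trivial policies: one simply drops the incoming item, the other drops the oldest queued item to make room for the new one.

#include <utility>

//drops the incoming item (i.e. "tail drop")
struct DropIncomingPolicy {  //hypothetical name
  template<class T, class FixedSizeCollection>
  void pushAndDropOne(T&& t, FixedSizeCollection& coll) {
    (void)t; (void)coll;  //simply skip t
  }
};

//drops the oldest queued item to make room for the incoming one
struct DropOldestPolicy {  //hypothetical name
  template<class T, class FixedSizeCollection>
  void pushAndDropOne(T&& t, FixedSizeCollection& coll) {
    coll.pop_front();              //discard the oldest item...
    coll.push_back(std::move(t));  //...and push the new one
  }
};

A selective drop (say, dropping the least important item currently in coll) would follow the same pattern, just with a smarter choice of the victim.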
Performance Issues
As we’re running our system, we MAY run into performance issues; sometimes, it is those queues which cause us trouble.
With queues-implemented-over-mutexes like the ones we’ve written above, the most annoying thing performance-wise is that there is a chance that the OS’s scheduler can force the preemptive context switch right when the thread-being-preempted-is-owning-our-mutex. This will cause quite a few context switches going back and forth. Such unnecessary context switches have a Big Fat impact on the performance 🙁 (as discussed in [TODO], a context switch can cost up to a million CPU clocks7).
7 Most of the time, such Bad Cases won’t apply to the kind of context switches we’re discussing here, but several context switches, each costing 10K CPU clocks, are already Pretty Bad
To deal with it, two approaches are possible. Approach #1 would be simply to
Reduce Time Under Lock
As we reduce the time spent under the mutex lock, the chance of that unfortunate context switch can be reduced to almost zero (if we’re doing a Really Good Job, time under the lock can be as little as a hundred CPU clocks, so the chance of being force-switched there becomes minimal). And with the lock uncontended, acquiring/releasing it usually costs just two atomic/LOCK/Interlocked operations (and you cannot really do better than that).
Removing Allocations from Under the Lock
A mathematician is asked "how to boil water?" His answer goes as follows:
Let’s consider two cases. In the first case, there is no water in the kettle.
Then, we need to light a fire, put some water into the kettle,
place the kettle over the fire, and wait for some time.
In the second case, there is water in the kettle.
Then we need to pour the water out, and the problem is reduced to the previous case.
— A mathematician who Prefers to stay Anonymous —
Now, let’s see what we can do to reduce time under the lock. If we take a closer look at our class MWSRQueue, we’ll realize that all the operations under the lock are very minimal, except for potential allocations (and/or O(N) operations to move things around).
The problem is that none of the existing std:: containers guarantees that there are no allocations/deallocations or O(N) operations within their respective push_back() and pop_front() operations:
- std::list<>::push_back()/pop_front(): allocation/deallocation; some implementations MAY use cache or pool allocations, but such optimizations are implementation-specific 🙁
- std::vector<>::erase(begin()) (as a replacement for pop_front()): O(N)
- std::deque<>::push_back()/pop_front(): allocation/deallocation; some implementations MAY use cache or pool allocations, but such optimizations are implementation-specific 🙁
I know of two ways to deal with this problem. First, it is possible to use some kind of pool allocation and feed the pool allocator to std::list<> or std::deque<> (effectively guaranteeing that all the items are always taken from the pool and nothing else). However, IMO this solution, while workable, looks too much like the way the mathematician gets the kettle boiled (see the epigraph to this subsection).
Instead, I suggest doing the following:
- If you need an infinite-size queue, you can use “intrusive lists” (allocating list elements outside the mutex lock, and reducing contention)
- If you need a fixed-size queue, then you can create your own Collection based on a circular buffer, along the following lines:
template<class T, size_t maxsz_bits>
class CircularBuffer {
  static constexpr size_t bufsz = 1 << maxsz_bits;
  static constexpr size_t maxsz = bufsz - 1;
  //-1 to make sure that head==tail always means 'empty'
  static constexpr size_t mask = maxsz;
  size_t head = 0;
  size_t tail = 0;
  alignas(T) uint8_t buffer[bufsz*sizeof(T)];
  //Having buffer as T[bufsz] is possible
  //  IF we'll replace placement move constructors with move assignments
  //  AND drop explicit destructor calls
  //However, it will require T to have a default constructor,
  //  so at the moment I prefer to deal with pure buffers
  //  and to have the only requirement that T is move-constructible

public:
  size_t size() {
    return head - tail +
      (((size_t)(head>=tail)-(size_t)1) & bufsz);
    //trickery to avoid pipeline stalls via arithmetic
    //supposedly equivalent to:
    //if(head >= tail)
    //  return head - tail;
    //else
    //  return head + bufsz - tail;
  }
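  //NB: the three members below are not part of the original sketch;
  //  they're added so that CircularBuffer can be plugged into the
  //  fixed-size queues above, which expect value_type, is_full(),
  //  and front() from their FixedSizeCollection
  using value_type = T;

  bool is_full() {
    return size() == maxsz;
  }

  T& front() {
    assert(size() > 0);
    return *tbuffer(tail);
  }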
  void push_back(T&& t) {
    assert(size() < maxsz);
    new(tbuffer(head)) T(std::move(t));
    head = ( head + 1 ) & mask;
  }

  T pop_front() {
    assert(size() > 0);
    T* ttail = tbuffer(tail);
    T ret = std::move(*ttail);
    ttail->~T();
    tail = ( tail + 1 ) & mask;
    return ret;
  }

private:
  T* tbuffer(size_t idx) {
    return reinterpret_cast<T*>(buffer + (idx*sizeof(T)));
  }
};
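Putting the pieces together, usage might look along the following lines (a sketch of mine; Message is a placeholder type, and it relies on the value_type/is_full()/front() additions shown above):

#include <string>
#include <utility>

//a placeholder message type, just for this example
struct Message {
  std::string payload;
};

//maxsz_bits=9 means 2^9-1 = 511 slots - within the 100-1000 range
//  recommended above
using EventQueue =
  MWSRFixedSizeQueueWithFlowControl<CircularBuffer<Message, 9>>;

//single-reader side (the Reactor thread)
void reactor_thread(EventQueue& q) {
  while(true) {
    auto res = q.pop_front();  //blocks while the queue is empty
    if(!res.first)
      break;                   //q.kill() was called - time to terminate
    //...process res.second here...
  }
}

//any of the writer threads
void writer_thread(EventQueue& q) {
  q.push_back(Message{"some event"});  //MAY block if the queue is full
}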
Removing Locks Completely
The second approach is MUCH more radical – it is to remove locks completely. And at first glance, it seems that it is easy to find an appropriate “lockless queue” library. However, there is a caveat:
We do NOT really need “completely lockless queue”. What we need, is a “queue which is lockless until it becomes empty or full”
In other words, our (almost-)lockless queue still needs to lock (otherwise we’d need to poll it, which puts us in a predicament between sacrificing latency and burning CPU cycles in a MUCH worse manner than any losses from the very-infrequent context switches on barely-loaded locks).
Unfortunately, I do NOT know of any readily-available library which supports such “blocking-only-when-necessary” queues 🙁 . Writing such a thing yourself is certainly possible, but keep in mind that it is going to be a Really Major Effort even if you’re proficient in writing synchro primitives 🙁 (and an Even More Major Effort to debug/test it and to prove its correctness8). Overall, if rating the complexity of writing such a “blocking-only-when-necessary” queue on the scale of exercises from Knuth’s “The Art of Computer Programming”, I would rate it at around 40 🙁 (with “50” being a “not-yet-proven theorem”).
One library which I didn’t try myself, but which MAY help in converting lockless algorithms into lock-when-necessary ones, is [EventCount] from Facebook’s folly library. Let me know whether it worked for you 🙂 .
8 yes, for non-trivial primitives such proof is necessary, even if it is done by an exhaustive analysis of all the context switches in all the substantially different points – of course, not forgetting about those nasty ABA problems
Waiting for Other Stuff
More often than not, in addition to waiting for incoming events, we MAY want to wait for “something else”. Examples of these “something else” things range from “something coming in from socket” to “user moving mouse”.
Of course, we could dedicate a thread to waiting for several sockets (user input, DNS response, whatever else) and pushing the results to one of our MWSR Queues, but it means extra context switches, and therefore is not always optimal.
In such cases, we MAY want to use some OS-specific mechanism which allows waiting for several such things simultaneously. Examples of such mechanisms include:
- (not exactly that OS-specific, but still different enough to be mentioned here): using select() (or poll()/epoll()) as a queue. If MOST of your IO is sockets, and everything-else (like “a message coming in from another thread”) happens very occasionally, then it often makes sense to use select() etc. to deal with sockets – and with anything else too (with absolutely no mutexes etc. in sight). To deal with those very-occasional other events (which cannot be handled via select()/poll()/epoll() because they’re not file handles, or because they’re regular files(!)), a separate anonymous pipe (or equivalent) can be created, which can be watched by the very same select()-like function. Bingo! Most of the things are handled with select()/poll()/epoll()/… without any unnecessary context switches, and the very-occasional stuff is occasional enough to ignore the associated (usually not-too-bad) overhead of sending it over the pipe.
- On Linux, you can (and IMHO SHOULD) use eventfd() instead of an anonymous pipe, to get an improvement in performance. For thread-to-thread communications, it makes select()-etc.-based queues rather efficient (see the sketch after this list).
- Note, however, that this approach does NOT work too well performance-wise when most of your events CANNOT be handled by a select()-like function directly (and need to be simulated over that pipe). While such a thing WILL work, the time spent on simulating events over pipes can become substantial :-(.
- kqueue(). On BSD, kqueue() allows waiting not only on file handles; it provides more flexibility than epoll(), and occasionally allows avoiding the extra-thread-with-an-anonymous-pipe which would be necessary otherwise.
- Win32 WaitForMultipleObjects(). WaitForMultipleObjects() can wait both for sockets and for “events”. This can be used to build a queue which can handle both sockets etc. and other stuff – all without those unnecessary context switches.[[TODO:MsgWaitForMultipleObjects()]]
- Win32 thread queues. Another Win32-specific mechanism is related to thread queues (and the GetMessage() function). These come in handy when you need to handle both Windows messages and something else (especially when you need to do it in a UI thread).
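To make the pipe/eventfd() trick from the first two bullets more concrete, here is a minimal Linux-only sketch of mine (the names wakeup_fd, wake_io_thread(), io_loop() are placeholders; error handling is omitted for brevity): the IO thread poll()s both a socket and an eventfd, and any other thread can wake it up by writing to the eventfd.

#include <cstdint>
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int wakeup_fd;  //created once, shared between threads

void init_wakeup() {
  wakeup_fd = eventfd(0, EFD_NONBLOCK);
}

//called from another thread to say "there is something in the
//  inter-thread queue - please take a look"
void wake_io_thread() {
  uint64_t one = 1;
  write(wakeup_fd, &one, sizeof(one));  //increments the eventfd counter
}

//the select()/poll()-style loop of the IO thread
void io_loop(int sock) {
  for(;;) {
    pollfd fds[2] = {
      { sock,      POLLIN, 0 },
      { wakeup_fd, POLLIN, 0 }
    };
    poll(fds, 2, -1);  //waits both for socket data and for wakeups

    if(fds[0].revents & POLLIN) {
      //...read from the socket and process...
    }
    if(fds[1].revents & POLLIN) {
      uint64_t cnt;
      read(wakeup_fd, &cnt, sizeof(cnt));  //resets the counter
      //...drain the inter-thread queue (e.g. via a non-blocking
      //   variant of pop_front()) and process those events...
    }
  }
}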
On libuv
In a sense, [libuv] is The King when we speak about 3rd-party event handling libraries. It can take pretty much anything and make it asynchronous. However, being that universal comes at a price: libuv’s performance, while “pretty good”, is not “the best one possible”. In particular, the trickery described above can often outperform libuv.
[[TODO: IPC/shared-memory]]
[[To Be Continued…
This concludes beta Chapter 16 from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”. Stay tuned for beta Chapter 19, where we’ll start discussing RNG.]]
Acknowledgement
Cartoons by Sergey Gordeev IRL from Gordeev Animation Graphics, Prague.