Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
It is rather well-known that TCP latency kinda suxx – and we have discussed in detail why this is the case, above (in particular, head-of-line blocking is a Big Enemy of latency).
However, it is really important to understand what exactly is meant by "TCP latency is bad". Actually, on closer examination we will see that at least two different things can be understood under this umbrella statement. The first one is that
We cannot use convenient TCP-like reliable-streaming API without incurring a latency penalty.
Let’s note, however, that this effect can be at least partially mitigated by using Reliable UDP (RUDP). For the time being, I won’t concentrate on RUDP (and on whatever-we-could-get-from-it) – and will just say that if we use reliable-and-ordered RUDP, it will still suffer from head-of-line blocking (that’s by definition) – so any gains from reliable-and-ordered RUDP will be rather mild compared to TCP1 (and if it is some different flavor of RUDP – we’re not really using a TCP-like reliable-streaming API anymore).
The second potential – and quite wrong at that – statement says that
<wrong>Whenever we don’t have UDP (for whatever reason) and are forced to use TCP – we’ll inevitably get a latency penalty.</wrong>
While the statement above may look obvious – and what I am about to say may sound like obvious heresy – I am about to demonstrate
How to carry UDP-style packets over TCP without incurring (much of) additional latencies.
1 Sure, exponential back-off can be turned off in RUDP, and more aggressive RTO strategy can be used too – but this is still peanuts compared to what can be gained by using, say, state sync algorithms.
Why would we want it?
Why would we want such a thing? It is simple. Let’s consider the following scenario:
- We write our app in UDP style (using anything we like – including state sync algorithm such as the one described in Vol. I’s chapter on Communications and in [Fiedler])
- Then we try to run it on real-world Clients – and we realize that:
- For some of the Clients – UDP just doesn’t work because of some weird firewall between Client and Server (hotel and airport connections are particularly guilty of having such firewalls; some mobile connections are reported to have this problem too). Overall, according to [QUIC], about 6 to 9% of all Google users cannot reach Google servers via UDP.
- For some of the Client-platforms-we-need – UDP is not supported at all.
- In particular, as of now, browsers have this problem. NB: I am not sure whether the approach I am describing will work over WebSockets – it at least depends on the question "whether different WebSockets use the same TCP connection or different ones" (and we need different TCP connections for the algorithm below to work). While reportedly ‘force new connection’:true should do the trick – I have no idea how universal this solution is (and no idea whether there are some peculiarities of WebSockets which prevent the approach below from working). [[BTW – if you find out whether the algorithm below can be made workable with WebSockets – please let me know.]]
Whether it is worth the trouble – depends on your needs, but IMO it is still nice to know that whenever you need it – such an option of tunneling UDP-like packets over TCP (and without incurring a latency penalty) does exist.
The Algorithm
Actually, the idea is dead simple:
- We have N TCP connections between Client and Server
- For each of these TCP connections – there is a "packet en route" flag. This flag is set to false after TCP connection is established.
- On the sending side, whenever a UDP packet comes in – we’re sending it over one of the TCP connections which does not have the packet-en-route flag set. Also, we set the packet-en-route flag for the TCP connection where we sent the packet.
- Of course, we still need to wrap the packet to send it over TCP (to denote the packet size and/or boundary)
- On the receiving side – whenever we get an incoming UDP-over-TCP packet – we send an app-level acknowledgement (as small as 1 byte) back to the sender, over the same TCP connection where we got the incoming message.
- On the sending side, when we get this app-level acknowledgement – we reset the packet-en-route flag
The description above is sufficient to understand how the algorithm works – though of course, to become practical, it will need other stuff (such as detecting a connection being “hung” – for example, using app-level keep-alives, or just having waited too long for the app-level ack; creating new connections to replace "hung" ones; using the same N TCP connections for both directions; auto-detecting the number N; and so on). Still, all this other stuff seems to be perfectly viable.
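To make the bookkeeping above a bit more tangible, here is a minimal sketch of the packet-en-route flags and of the length-prefix wrapping. All the names (TcpTunnel, wrap_packet, ACK_BYTE) are my own inventions for illustration, not from any real library; a production version would sit on top of N non-blocking sockets and handle the "hung connection" detection mentioned above.

```python
import struct

ACK_BYTE = b"\x01"                  # the 1-byte app-level acknowledgement
FRAME_HEADER = struct.Struct("!H")  # 2-byte big-endian length prefix

def wrap_packet(payload: bytes) -> bytes:
    """Wrap a UDP-style packet for sending over a TCP stream;
    the size prefix restores the packet boundary on the receiving side."""
    return FRAME_HEADER.pack(len(payload)) + payload

class TcpTunnel:
    """Tracks N TCP connections and their packet-en-route flags."""

    def __init__(self, n_connections: int):
        # flag is False right after each TCP connection is established
        self.en_route = [False] * n_connections

    def pick_connection(self):
        """Return the index of a connection with no packet en route,
        or None if all are busy (at which point a real implementation
        might simply drop the packet, UDP-style)."""
        for i, busy in enumerate(self.en_route):
            if not busy:
                self.en_route[i] = True  # mark "packet en route"
                return i
        return None

    def on_ack(self, i: int):
        """Called when the 1-byte app-level ack arrives on connection i."""
        self.en_route[i] = False
```

For example, with N=3, three back-to-back sends occupy connections 0, 1, and 2; a fourth send finds no free connection until one of the app-level acks comes back.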
Now, let’s see why this thing will work (almost) without incurring additional latencies over TCP. The answer is simple –
That’s because we always send a packet only over those connections which we’re sure have no outstanding packets.
It means that whenever we’re sending the packet, several properties are guaranteed to hold:
- there is nothing to block us (so head-of-line blocking doesn’t apply).
- there are no outstanding retransmits (i.e. everything which went to the other side has already been acknowledged). This means that exponential back-off doesn’t apply either.2
- there are no outstanding bytes. This has an important consequence:
- As “slow start” and most3 congestion-avoidance algorithms work in terms of a “congestion window” – which in turn is expressed in terms of outstanding bytes – it seems that we won’t run into problems with slow start/congestion avoidance either (at least as long as our send()s are small enough to fit into the MSS – and most of the time the MSS is at least 1200-1400 bytes, once again a direct analogy to typical UDP packet limits).
- At least in theory, even the Nagle algorithm shouldn’t hurt us under these circumstances (though I’d still probably disable it, just in case).
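Disabling Nagle "just in case" is a one-liner on pretty much every platform; here is what it looks like in Python (the helper name is mine, but TCP_NODELAY itself is the standard, real socket option):

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 turns Nagle's algorithm off, so small send()s
    # go out immediately instead of being coalesced into larger segments
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s
```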
Bingo! We’ve eaten our cake (got our UDP-over-TCP) – and have it (no additional latencies) too.
2 Even if there is "residual" RTO (i.e. if RTO has been previously doubled per RFC2988 section 5.5) – the first packet will be sent right away anyway, and we don’t care much about subsequent retransmits
3 if not “all”
Calculating N
However, before we can claim it as a win, there is still a very important consideration here. In particular: this algorithm is realistically usable only if the number N is small enough (in other words, if we’d need 100 TCP connections per user – it will probably start causing problems).
Let’s make some back-of-the-envelope calculations. For example, for a game with 20 "network ticks" per second (which is quite typical), and for a Client with 100ms RTT (more than enough for intra-continent play) – under ideal circumstances we’ll need only 0.1/(1/20)=2 TCP connections for the algorithm above to work. In practice – we’ll certainly need to add at least another 2 connections to account for packet loss (which will cause some connections to keep their packet-en-route flag longer than usual). Still, 4 or 5 TCP connections seem like a reasonably good starting point for 20 network ticks/second and 100ms RTT.
Other scenarios and likely-to-be-reasonable values of N:
- 20 network ticks/second, 200ms RTT (this is enough to have global RTT). Reasonable starting N for this case is probably around 6-7 (4 in ideal case, 2-3 reserve).
- 60 network ticks/second, 100ms RTT. Reasonable starting N: 9-10 (6 for ideal case, 3-4 reserve).
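The arithmetic behind these numbers can be sketched as follows: in the ideal case a connection stays busy for one RTT, and packets are produced every 1/tick_rate seconds, so the ideal N is RTT × tick_rate (rounded up). The default reserve below is my own rough guess in the spirit of the text, not a derived value.

```python
import math

def estimate_n(rtt_seconds: float, ticks_per_second: int, reserve: int = 3) -> int:
    """Back-of-the-envelope starting N for the UDP-over-TCP schema:
    ideal N (connections busy for one RTT each) plus a reserve
    to cover connections delayed by packet loss."""
    ideal = math.ceil(rtt_seconds * ticks_per_second)
    return ideal + reserve

# scenarios from the text:
#   20 ticks/s, 100ms RTT -> ideal 2
#   20 ticks/s, 200ms RTT -> ideal 4
#   60 ticks/s, 100ms RTT -> ideal 6
```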
As we can see, even in more stressful scenarios N is expected to be relatively small. Now, let’s see where having multiple connections will hurt us:
- more memory used on the Server-Side per player. This one can be partially mitigated by reducing the size of the TCP buffers. With the schema above – we’re not likely to need more than 4K per TCP connection, so to run 1’000 players per 2S-2U server (which is a kind of “industry standard” number for 2017), we’d need around 4K/TCP * 10 TCP/player * 1’000 players/server = 40MBytes of RAM, which is pretty much nothing by today’s standards. Even speaking about a “really communications-heavy” server with 10K players – it is still a mere 400MBytes.
- More TCP ports used. However – on the Server-Side we won’t see these as additional listening ports (they will all still be terminated by the same Server-Side TCP port), so I don’t expect it to be a real problem.
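The memory figure above is easy to sanity-check (using decimal "K"s, as the text does; the function name is mine, for illustration only):

```python
def tunnel_memory_mb(players: int, conns_per_player: int = 10,
                     buf_kb: int = 4) -> float:
    """Server-side RAM estimate for the UDP-over-TCP schema:
    buf_kb KB of TCP buffers per connection, conns_per_player
    connections per player, expressed in (decimal) megabytes."""
    return players * conns_per_player * buf_kb * 1000 / 1e6

# 1,000 players -> 40 MB; 10,000 players -> 400 MB
```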
As we can see, the analysis and the numbers above seem to indicate that UDP-over-TCP using the schema above is usable (no warranties of any kind, batteries not included). Moreover, as discussed above – we do NOT intend to use this thing for all the players – but just for those platforms (and/or firewalled users) which don’t have UDP at the moment. As a result – the overall impact of using additional resources for multiple connections will most likely be scaled down even further.
Disclaimer and Conclusion
Beware of bugs in the above code; I have only proved it correct, not tried it.
— Donald Knuth —
Now, a necessary disclaimer:
Beware of problems with the algorithm above. I have only kinda-demonstrated its correctness, not tried it.
However, assuming that the analysis above stands – it opens the door to the following approach for writing games:
- Write all your time-critical protocols using UDP; this includes both different flavours of RUDP and state sync algorithms
- However, for those players who cannot use UDP for whatever reason – your code can use UDP-over-TCP as an (almost-)no-added-latencies backup.
- An additional disclaimer: at least in case of firewalls, some of them will do weird things with your TCP stream (in particular, re-assembling it and forwarding the stream further only after re-assembling) – and the algorithm above will not be able to deal with them. Still, from my experience I feel that such firewalls are not that frequent (though as always, YMMV, and batteries are not included).
Phew, I think it is enough for today. And BTW, if you see any mistakes in the reasoning above (or have tried it in reality and got either positive or negative results) – please LMK.
[[To Be Continued…
This concludes beta Chapter 13(j) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.
Stay tuned for beta Chapter 22, where we’ll start discussing all the boring stuff about debugging, tracing, logging – and not-so-boring post-factum beta/production debugging]]
Acknowledgement
Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.