
Scenario: An application maintains a pool of connections to another service. The other service drops (not closes) all connections but still accepts new connections.

What is the best way to deal with this situation? The only way to know that a connection is bad is by trying to interact with it and timing out, which is hard to recover from seamlessly. You can do retries, but if you do that by getting a new connection from the pool, chances are you will get another bad connection, and by then, or by the time of the third try, you are no longer completing your task within a reasonable time frame. I've seen no meaningful explicit support for dealing with this scenario in the third-party pooling libraries I have investigated.
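
To illustrate the pattern I mean, here is a rough Python sketch; `pool.get()`, `pool.discard()` and `conn.request()` are placeholders for whatever a pooling library might expose, not a specific API:

```python
import socket

def call_with_retries(pool, payload, attempts=3, per_try_timeout=2.0):
    """Naive retry loop: every retry pulls another connection from the pool,
    but when the remote side has silently dropped all of them, each attempt
    burns a full timeout before failing."""
    last_error = None
    for _ in range(attempts):
        conn = pool.get()                    # may hand back a dead socket
        try:
            conn.settimeout(per_try_timeout)
            return conn.request(payload)     # only now do we learn it is dead
        except (socket.timeout, OSError) as e:
            last_error = e
            pool.discard(conn)               # drops just this one connection
    raise TimeoutError(f"all {attempts} attempts timed out") from last_error
```

With three dead connections in a row, the caller waits three full timeouts before seeing anything, which is what blows the "reasonable time frame".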

asked Jan 9 at 7:48
  • If you need real time interaction, then you need a reliable service and network (presumably internal/private). If you don't need real time interaction then just do timeouts and retries. Commented Jan 9 at 8:12

3 Answers


(Made comment into an answer since it actually lists some options)

There may be a lot of things you can do, provided you're willing to change stuff that you considered to be given until now.

  • You may change the connection pool such that it runs keep-alive pings on idle connections (see the sketch after this list).
  • You may fix the "other service" to not simply drop connections.
  • You may push back on "reasonable time frame" requirements so that they are compatible with reality.
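
A minimal sketch of the first option, assuming you can supply a `connect()` factory and an `is_alive()` liveness probe; the names are illustrative, not from any particular pooling library:

```python
import queue
import threading
import time

class ValidatingPool:
    """Pool that pings idle connections in the background and re-checks a
    connection on checkout. `connect` and `is_alive` are caller-supplied."""

    def __init__(self, connect, is_alive, size=10, ping_interval=5.0):
        self._connect = connect
        self._is_alive = is_alive
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())
        threading.Thread(target=self._ping_loop, args=(ping_interval,),
                         daemon=True).start()

    def _ping_loop(self, interval):
        # Periodically probe idle connections and replace dead ones.
        while True:
            time.sleep(interval)
            for _ in range(self._idle.qsize()):
                try:
                    conn = self._idle.get_nowait()
                except queue.Empty:
                    break
                self._idle.put(conn if self._is_alive(conn) else self._connect())

    def acquire(self):
        # Last-chance check on checkout; still racy, but narrows the window.
        # (A real implementation would also close the dead connection.)
        conn = self._idle.get()
        return conn if self._is_alive(conn) else self._connect()

    def release(self, conn):
        self._idle.put(conn)
```

Note that the probe interval has to be shorter than the window in which a stale connection matters, so this narrows the race rather than closing it.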

Which option is best may depend on the circumstances, which we don't know.

answered Jan 9 at 8:08
  • Thank you. If the pool uses, say, 10 connections, and you use it to call the other service 100 times per second, the keep-alive would have to be really frequent to catch the problem in time. Also, the timeout would need to be very short, so this is impractical (I've tried). When it comes to the "other service", a lot of cloud infrastructure does cloud things where a service, for instance, migrates to another node, and my experience is that this is done with zero respect for open sockets. Commented Jan 9 at 8:33
  • This is not a good place to discuss possible scenarios. You didn't mention service migration and cloud before - is that the actual situation or just a possibility you thought of? How often is the service migrated? What are the normal response times? If a migration happens 1-2 times per day and causes a delay of a second for one client because you try 10 dead connections with a timeout of 100ms, who cares? Commented Jan 9 at 9:10
  • @Buhb if you really need such reliability, then you need to move to infrastructure that doesn't have those problems. If big cloud providers don't suit you, then you need your own infrastructure. But honestly I don't think you've done your research properly. For example, there are lots of real-time MMO games and real-time video conferencing services running on cloud providers like AWS or GCP. Commented Jan 9 at 9:10
  • @freakish Thanks for the feedback. I'm sure you can cherry-pick solutions and vendors to work around this problem. I asked the question because I have seen a rise in this "hey, your application just needs to deal with this" trend, and I'm trying to, ideally, learn that there are some best practices that I've just missed or, at least, learn that the loss in availability is acceptable for most scenarios, and that's the reason I can't find much written about it. Commented Jan 9 at 9:41
  • <tongue-in-cheek>the first rule for problem solving is "have a problem" - if all you have is a broad idea of a scenario that may present a problem, you don't have a problem</tongue-in-cheek> Commented Jan 9 at 9:48

I think you'd find it hard to distinguish this scenario from "remote service simply decides to delay all your responses by some time" (e.g. due to load). If you're dependent on a remote service which isn't behaving properly, then your service fails.

If you know this specific scenario is happening, you could treat a failure on any connection in the connection pool as a trigger to first dump all the connections in the pool to that host. Or, whenever you have a connection timeout error, ask for a retry to the failover host instead.
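
As a sketch of the "dump the pool" idea; `connect` is a caller-supplied factory returning a socket-like object, and none of the names come from a particular pooling library:

```python
import socket

class EvictOnFailurePool:
    """On any timeout, assume every pooled connection to this host is stale:
    throw the whole pool away and retry once on a freshly opened connection."""

    def __init__(self, connect, size=10):
        self._connect = connect
        self._idle = [connect() for _ in range(size)]

    def call(self, do_request, timeout=1.0):
        conn = self._idle.pop() if self._idle else self._connect()
        try:
            conn.settimeout(timeout)
            result = do_request(conn)
        except (socket.timeout, OSError):
            # One timed-out connection suggests the rest are stale too:
            # dump everything and retry once on a brand-new socket.
            self._idle.clear()
            conn = self._connect()
            conn.settimeout(timeout)
            result = do_request(conn)
        self._idle.append(conn)
        return result
```

Refilling the pool can then happen lazily or in the background; the point is that the first timeout invalidates everything, so the retry never pulls another stale socket from the pool.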

answered Jan 9 at 10:51

The only way to know that a connection is bad is by trying to interact with it and timing out, which is hard to recover from seamlessly. You can do retries, but if you do that by getting a new connection from the pool, chances are you will get another bad connection, and by then, or by the time of the third try, you are no longer completing your task within a reasonable time frame.

It's not clear what the problem is. If your connections can drop, potentially an indefinite number of times in series, then there is no guarantee that your machine's task will be completed under its own steam in any timeframe at all, let alone a "reasonable" one.

Computers are awash with internal hardware connections, but these are rarely seen explicitly by software developers because they are engineered to have a high level of reliability provided the computer is kept under normal environmental conditions.

"Connections" are revealed to software developers invariably because they are fundamentally shaky and intermittent connections, and the developer needs to proceed to program on the assumption of coping with the indefinite loss of a connection.

Ultimately it is the job of the system supervisors to intervene against abnormal conditions that break connections indefinitely and stop the system working, and it is the job of the system designers to ensure (amongst other things) that the system is amenable to supervision and intervention, and that the risk, frequency, and urgency of supervisor intervention being required is sufficiently low as to be suitable for the application.

It's often crucial to control complexity and limit extravagant assumptions about reliability, rather than to assume there are crafty solutions which do not at all require complexity to be controlled or extravagant assumptions to be limited.

I see in the comments that the main concern centres around cloud providers. Unfortunately, the use of cloud providers and their hardware means ceding control and losing coordination over a variety of matters that are normally under more control when the hardware belongs to you and/or is geographically co-located at your centres of business.

answered Jan 9 at 23:30
