
Scenario: An application maintains a pool of connections to another service. The other service drops (not closes) all connections but still accepts new connections.

What is the best way to deal with this situation? The only way to know that a connection is bad is by trying to interact with it and timing out, which is hard to recover from seamlessly. You can do retries, but if you do that by getting a new connection from the pool, chances are you will get another bad connection, and by then, or by the time of the third try, you are no longer completing your task within a reasonable time frame. I've seen no meaningful explicit support for dealing with this scenario in the third-party pooling libraries I have investigated.
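
To illustrate the pattern I mean, here is a rough Python sketch; `pool.get()`, `pool.discard()` and `conn.request()` are placeholders for whatever a pooling library might expose, not a specific API:

```python
import socket

def call_with_retries(pool, payload, attempts=3, per_try_timeout=2.0):
    """Naive retry loop: every retry pulls another connection from the pool,
    but when the remote side has silently dropped all of them, each attempt
    burns a full timeout before failing."""
    last_error = None
    for _ in range(attempts):
        conn = pool.get()                    # may hand back a dead socket
        try:
            conn.settimeout(per_try_timeout)
            return conn.request(payload)     # only now do we learn it is dead
        except (socket.timeout, OSError) as e:
            last_error = e
            pool.discard(conn)               # drops just this one connection
    raise TimeoutError(f"all {attempts} attempts timed out") from last_error
```

With three dead connections in a row, the caller waits three full timeouts before seeing anything, which is what blows the "reasonable time frame".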

asked Jan 9 at 7:48
  • If you need real time interaction, then you need a reliable service and network (presumably internal/private). If you don't need real time interaction then just do timeouts and retries. Commented Jan 9 at 8:12

3 Answers


(Made comment into an answer since it actually lists some options)

There may be a lot of things you can do, provided you're willing to change stuff that you considered to be given until now.

  • You may change the connection pool such that it runs keep-alive pings on idle connections (see the sketch after this list).
  • You may fix the "other service" to not simply drop connections.
  • You may push back on "reasonable time frame" requirements so that they are compatible with reality.
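
A minimal sketch of the first option, assuming you can supply a `connect()` factory and an `is_alive()` liveness probe; the names are illustrative, not from any particular pooling library:

```python
import queue
import threading
import time

class ValidatingPool:
    """Pool that pings idle connections in the background and re-checks a
    connection on checkout. `connect` and `is_alive` are caller-supplied."""

    def __init__(self, connect, is_alive, size=10, ping_interval=5.0):
        self._connect = connect
        self._is_alive = is_alive
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())
        threading.Thread(target=self._ping_loop, args=(ping_interval,),
                         daemon=True).start()

    def _ping_loop(self, interval):
        # Periodically probe idle connections and replace dead ones.
        while True:
            time.sleep(interval)
            for _ in range(self._idle.qsize()):
                try:
                    conn = self._idle.get_nowait()
                except queue.Empty:
                    break
                self._idle.put(conn if self._is_alive(conn) else self._connect())

    def acquire(self):
        # Last-chance check on checkout; still racy, but narrows the window.
        # (A real implementation would also close the dead connection.)
        conn = self._idle.get()
        return conn if self._is_alive(conn) else self._connect()

    def release(self, conn):
        self._idle.put(conn)
```

Note that the probe interval has to be shorter than the window in which a stale connection matters, so this narrows the race rather than closing it.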

Which option is best may depend on the circumstances, which we don't know.

answered Jan 9 at 8:08
  • Thank you. If the pool uses, say, 10 connections, and you use it to call the other service 100 times per second, the keep-alive would have to be really frequent to catch the problem in time. Also, the timeout would need to be very short, so this is impractical (I've tried). When it comes to the "other service", a lot of cloud infrastructure does cloud things where a service, for instance, migrates to another node, and my experience is that this is done with zero respect for open sockets. Commented Jan 9 at 8:33
  • This is not a good place to discuss possible scenarios. You didn't mention service migration and cloud before - is that the actual situation or just a possibility you thought of? How often is the service migrated? What are the normal response times? If a migration happens 1-2 times per day and causes a delay of a second for one client because you try 10 dead connections with a timeout of 100ms, who cares? Commented Jan 9 at 9:10
  • @Buhb if you really need such reliability, then you need to move to infrastructure that doesn't have those problems. If big cloud providers don't suit you, then you need your own infrastructure. But honestly I don't think you've done your research properly. For example, there are lots of real-time MMO games and real-time video conferencing services running on cloud providers like AWS or GCP. Commented Jan 9 at 9:10
  • @freakish Thanks for the feedback. I'm sure you can cherry-pick solutions and vendors to work around this problem. I asked the question because I have seen a rise in this "hey, your application just needs to deal with this" trend, and I'm trying to, ideally, learn that there are some best practices that I've just missed or, at least, learn that the loss in availability is acceptable for most scenarios, and that's the reason I can't find much written about it. Commented Jan 9 at 9:41
  • <tongue-in-cheek>the first rule for problem solving is "have a problem" - if all you have is a broad idea of a scenario that may present a problem, you don't have a problem</tongue-in-cheek> Commented Jan 9 at 9:48

I think you'd find it hard to distinguish this scenario from "remote service simply decides to delay all your responses by some time" (e.g. due to load). If you're dependent on a remote service which isn't behaving properly, then your service fails.

If you know this specific scenario is happening, you could treat a failure on any connection in the connection pool as a trigger to first dump all the connections in the pool to that host. Or, whenever you have a connection timeout error, ask for a retry to the failover host instead.
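
As a sketch of the "dump the pool" idea; `connect` is a caller-supplied factory returning a socket-like object, and none of the names come from a particular pooling library:

```python
import socket

class EvictOnFailurePool:
    """On any timeout, assume every pooled connection to this host is stale:
    throw the whole pool away and retry once on a freshly opened connection."""

    def __init__(self, connect, size=10):
        self._connect = connect
        self._idle = [connect() for _ in range(size)]

    def call(self, do_request, timeout=1.0):
        conn = self._idle.pop() if self._idle else self._connect()
        try:
            conn.settimeout(timeout)
            result = do_request(conn)
        except (socket.timeout, OSError):
            # One timed-out connection suggests the rest are stale too:
            # dump everything and retry once on a brand-new socket.
            self._idle.clear()
            conn = self._connect()
            conn.settimeout(timeout)
            result = do_request(conn)
        self._idle.append(conn)
        return result
```

Refilling the pool can then happen lazily or in the background; the point is that the first timeout invalidates everything, so the retry never pulls another stale socket from the pool.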

answered Jan 9 at 10:51

The only way to know that a connection is bad is by trying to interact with it and timing out, which is hard to recover from seamlessly. You can do retries, but if you do that by getting a new connection from the pool, chances are you will get another bad connection, and by then, or by the time of the third try, you are no longer completing your task within a reasonable time frame.

It's not clear what the problem is. If your connections can drop, potentially an indefinite number of times in series, then there is no guarantee that your machine's task will be completed under its own steam in any timeframe at all, let alone a "reasonable" one.

Computers are awash with internal hardware connections, but these are rarely seen explicitly by software developers because they are engineered to have a high level of reliability provided the computer is kept under normal environmental conditions.

"Connections" are revealed to software developers invariably because they are fundamentally shaky and intermittent connections, and the developer needs to proceed to program on the assumption of coping with the indefinite loss of a connection.

Ultimately it is the job of the system supervisors to intervene against abnormal conditions that break connections indefinitely and stop the system working, and it is the job of the system designers to ensure (amongst other things) that the system is amenable to supervision and intervention, and that the risk, frequency, and urgency of supervisor intervention being required is sufficiently low as to be suitable for the application.

It's often crucial to control complexity and limit extravagant assumptions about reliability, rather than to assume there are crafty solutions which do not at all require complexity to be controlled or extravagant assumptions to be limited.

I see in the comments that the main concern centres around cloud providers. Unfortunately, the use of cloud providers and their hardware means ceding control and losing coordination over a variety of matters that are normally under more control when the hardware belongs to you and/or is geographically co-located at your centres of business.

answered Jan 9 at 23:30
