-
Notifications
You must be signed in to change notification settings - Fork 440
Fix - Add automatic reconnect support for STOMP producers #1099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix - Add automatic reconnect support for STOMP producers #1099
Conversation
Do we not want this?
Yes, indeed. You should not commit lock file in libraries in contrast to application.
Yes, indeed. You should not commit lock file in libraries in contrast to application.
Hmm, I always lock my dependencies regardless of project type, but no matter! 😄
I'm going through some additional testing right now and I'm noticing one other issue with stomp-php
where it claims to have successfully written, but actually hasn't. I end up having to pump quite a few messages through before it reports a failed write.
I'll spend some more time today testing and will report back.
@makasim - So after looking a little deeper into it, it looks like there's another issue with stomp-php
when sending messages where it occasionally fails to detect a dead connection. I am able to reproduce this consistently.
The sequence looks roughly like:
- Run a broker that connects to
queue-1
and listens. - Worker receives a message on
queue-1
and goes to handle it. - Worker requires a message to be written to
queue-2
- Worker connects to
queue-2
and write; the message is successful. This leaves the connection open with no heartbeat.note: Up until this point, I think things are fine. If
queue-1
works steadily for a while, messages sent from the worker toqueue-2
are safe. This steady work simulates a heartbeat and prevents any dead connection detection on the broker from triggering. - If after a while (connection timeout window), a new message arrives on
queue-1
and the worker goes to process it similar to before, it ends up unable to deliver any message toqueue-2
. The connection from the broker forqueue-2
is effectively corrupted, but internally tostomp-php
, it thinks it's still good. Even to the point of attempting to write bytes without triggering its exception.
Originally I thought stomp-php
was simply not heartbeating a connection and not detecting dead connections, but now I'm able to see it also try to use its connection resource without throwing the "Was not possible to write frame!"
exception. 😱
I've added another improvement to the PR which we can discuss further, it unfortunately has the impact of requiring one connection per message. Again though, I'm not sure how to workaround this and in fairness, I can see how this hasn't been observed until now because I could see the majority of projects using this library only dispatching from short-lived web requests.
To find out that connection is closed ( by the server) you have to read from the socket. You will get EOF or timeout.
But you cannot read before you write in client to server commutation.
There is no way to detect a closed connection by writing to it. Timeout is the only option.
Is there a way to set write timeout ? That could be used as a sign of closed connection and trigger reconnect.
Oh thanks! I'll tinker around with your suggestion and reply here with what I figure out. Definitely need to be able to come up with a way to gracefully handle this.
I'll see how long I have to go before something has to happen upstream 😆
After spending a fair bit of time on it, I've been unable to come up with a way to jog or test the connection. Can you think of any concrete solutions that might get around this issue?
From my vantage point, I see only a few options:
- Have enqueue send heartbeats for non-queue producer connections
- Somehow figure out a way to jog the connection prior to using it
- Allow enqueue to treat connections that don't have a heartbeat to be considered "transient" and destroy or reconnect them after each message sent.
IMO, I feel like the 3rd option is the most "predictable" philosophy and also aligns message publishing with how it's functioning right now during web requests.
Just a note in case you're at things early today, will be pushing up a new approach that I think I have working really nicely here...
pkg/stomp/StompContext.php
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connections start out as "transient" which means that they will use some new reconnect behaviour below. I have it here in the constructor as configurable if for whatever reason a connection needs to start out long-lived.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as the first consumer is created against a context, we convert to long-lived and disable all transient functionality so that queues with heartbeats don't lose their connections while heartbeating.
pkg/stomp/StompContext.php
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only happens when a connection is not being used for consumption.
Triggering a disconnect prior to a producer being used is enough, as the underlying stomp-php
library can safely rely on its false
state around connections. Connections recover without any additional involvement from enqueue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope you don't mind, I moved this out. 😄
@makasim - Updates made and I'm really happy with this, it's working great and has greatly improved the stability and flexibility of enqueue STOMP! 😃
To clarify:
You have to connections: one for consumers and another for publisher. The problem happens when you want to publish a message from a consumer ?
If so, It might be better to reconnect publisher connection only once before processing a message (assuming that message processing does not take long).
Hmm, let's see if I can do this justice 😆
I think part of the issue is around the current division of responsibilities. We need to do the correction at the Context
because StompConsumer
and StompProducer
don't know if they should start out transient as well as when or whether they should stop being transient at some point.
StompContext
only has the authority to manage connections, but never uses themStompProducer
andStompConsumer
only have the authority to use connections , but not manage them
For the current design:
StompContext
has responsibility for bothcreateProducer
andcreateConsumer
StompContext
is responsible for triggering the creation and holding references to connections viaStompContext::createStomp
, this makesStompContext
the first point of integration withstomp-php
- So long as a
StompContext
is only ever used for producers, then it's better to preserve the one message per connection semantics by bouncing the connection before any attempt to use it.- This is because if a
StompContext
only hascreateProducer
called, it almost definitely doesn't have a heartbeat configured createProducer
is called multiple times to send messages.StompProducers
are not reused and do not have sole authority over a connection, so it made sense to me to reset the connection at the level where it's managed.
- This is because if a
In theory I could add a $transient
parameter to StompProducer
instances, but it seems somewhat out of place as their current design is focused around using the connection and not managing it (as noted before).
Apologies, I'm hopefully explaining this correctly and in such a way that does the current design justice 😆
I am not that much into Laravel specifics. In general the idea is next
- you've got a message
- forced disconnect-connect of publisher connection, like
$conn->reconnect();
- consume the received message.
Once you get a new message do the same.
If I understand what you're saying correctly, that flow sounds like it's for receiving messages? Which isn't a problem in my scenario. The issue I'm observing is when I try to send to a queue (queue-2
) that is not being listened on, from a long-lived process.
I have no reason to open and hold a connection open to the destination queue (queue-2
) because I won't be publishing to it on every inbound message from say...queue-1
, which is long lived and is working fine with a heartbeat configured.
FWIW, this could be an issue without Laravel involved, there's nothing specific to it here. Moreover, I think what I'm adding more explicitly reflects how people use Enqueue during regular web request lifecycle. It just happens to ensure that semantic is now consistently offered for long running processes as well.
Sorry, one more edit, but I'm also more than happy to do a screen share to show the difference. Sometimes a picture/video is way better for understanding what's going on.
@makasim - Is there anything specific I can do to help get this merged? At this point my testing locally has proven the fix in this MR to be quite robust for long lived processes. 😄
Hey @makasim, just checking in again. I'd love to get this merged so that we'll be able to use STOMP from PHP in some upcoming work. Everything is still working very smoothly with my local version of this fix.
Honestly, I am not completely sold on this solution. Let's discuss the issue over a call. What about tomorrow?
Absolutely, I can definitely do that, I'm in the CDT timezone, happy to accommodate, what's your preference? 😄
It's worth noting that the transient flag is only in effect when the connection is going to be short-lived anyway. So there's very little risk of this breaking consumers. It's just causing the connection to recycle prior to use for non-consumer usage.
@makasim - If you're still available and unsure about this, I followed you on twitter if you wanna DM any details for a quick screen share or chat.
Again, I've been running this patch locally from two different queue workers that interact over the broker since figuring it out. Have made sure to come back and test dispatches. Consumers and heartbeats haven't been negatively impacted. My scenario also tests consuming and dispatching from the same queue, so the transient
flag is being managed correctly.
I do feel like it's a stable change and am happy to commit long term to assisting if for some reason this causes a regression. In many ways though, I think this could improve stability. A long lived web request publishing to a queue with a short TTL could in theory encounter an issue without this fix.
@atrauzzi Sent you a message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally I indent lines like 79
, but the code style suggested this, apologies!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is where we default to "off".
Good morning @makasim! 😆
As per our chat earlier today, I've wired this up to the configuration obtained in StompConnectionFactory
. Pretty straightforward I'd say and preserves all old behaviour by default.
@atrauzzi some tests are failing
@makasim - I think the two that are failing now are okay?
Wonderful, thank you @makasim. Will this be automatically released?
Semi-automatically...
I'll do it.
Thank you too!
Hey @makasim - did the release end up going through? I noticed that I'm still seeing version 0.10.4
as the latest. 😄
Uh oh!
There was an error while loading. Please reload this page.
Closes: #1097
To detail my scenario, I'm running a long-lived PHP process (laravel queue worker) that listens on one connection, but can occasionally publish to another connection. The connection used for listening doesn't suffer from any issues because enqueue successfully sets up heartbeats, anticipating the fact that consumers tend to be long lived. For producers however, there might be the assumption that they are transient (web requests) and as such, don't require connections to live any longer than it takes for the request thread to finish all its work.
@makasim - My apologies if I've missed something or if this seems like a shoehorn fix / neglects any other concerns. After doing some digging around however I am starting to wonder if this might actually be a desirable approach.
What I've noticed in the
stomp-php
library is that their implementation ofClient::isConnected
has the potential to report false positives for the state of the connection and that without getting into a more convoluted fix, can only be "tested" by attempting to use it.With the code fix in this PR, failures to write are jogged with one last-ditch attempt to connect. My long lived mixed consumer+producer processes are able to send messages and reestablish connections on-demand with no ceremony or boilerplate.
I'd like to add the caveat that this is not my preferred strategy and would favour something more explicit, but just based on the fact that
StompProducer
isn't set up for configuration and the aforementioned false-positives fromstomp-php
, this may do well for all scenarios.