Thoughts on using a heartbeat to extend the visibility_timeout?
I'm not 100% sure, but I believe there might be 2 issues for longer running jobs:

- Currently, if the job takes longer than the default `VisibilityTimeout: 301`, it will get retried, potentially indefinitely.
- If the job hits an error, the visibility timeout is used as the backoff function, but `visibility_timeout` isn't restored on the next run, so a retried job must complete faster than the `Lambdakiq::Backoff` delay.

A heartbeat would solve these two issues up until the 900 second limit.
I think Shoryuken implemented this at one point, but it swapped it out in favor of a more complex system that I don't fully understand.
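For reference, a heartbeat along these lines could be sketched as below. This is a hypothetical `Heartbeat` class, not part of Lambdakiq; in production the client would be an `Aws::SQS::Client`, but here it's anything responding to `change_message_visibility`:

```ruby
# Sketch of a visibility-timeout heartbeat (hypothetical, not Lambdakiq's API).
# A background thread periodically calls ChangeMessageVisibility so a long
# running job's message stays in flight instead of being redelivered.
class Heartbeat
  def initialize(client:, queue_url:, receipt_handle:, interval: 60, extension: 120)
    @client = client
    @queue_url = queue_url
    @receipt_handle = receipt_handle
    @interval = interval     # seconds between heartbeats
    @extension = extension   # new invisibility window on each beat
    @running = false
  end

  def start
    @running = true
    @thread = Thread.new do
      while @running
        sleep @interval
        break unless @running
        # Push the message's invisibility window out again.
        @client.change_message_visibility(
          queue_url: @queue_url,
          receipt_handle: @receipt_handle,
          visibility_timeout: @extension
        )
      end
    end
  end

  def stop
    @running = false
    @thread&.kill
  end
end
```

The job runner would `start` this before `perform` and `stop` it in an `ensure` block, so a crashed job stops extending and the message becomes visible again after at most `extension` seconds.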
Oh yea, I think I remember reading about the heartbeat when doing the initial release. My thinking is that if folks knew they had jobs that ran close to, or over, 5 minutes (300s) they should create a specific queue/function with distinct timeouts. As you pointed out, the max would be Lambda's 15 minutes (900s). What do you think if we did these changes?
- Change examples from 300 to 900 and document job duration limits on Lambda.
- Add a job timer based on the queue's visibility timeout. Whether it's 60, 300, or 900, raise a timeout error and let the job go back to the queue.
- Add documentation in the Metrics section to monitor durations.
Seems to me this would get us where we need without having to make heartbeat calls.
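That job timer could be a thin `Timeout` wrapper, something like the sketch below. `run_with_deadline` is a hypothetical helper name, and the 5-second safety buffer is an assumption, not anything Lambdakiq does today:

```ruby
require "timeout"

# Raised when a job outruns its queue's visibility timeout (hypothetical).
class JobTimedOutError < StandardError; end

# Run a job under a deadline derived from the queue's visibility timeout
# (60, 300, or 900 seconds). We subtract a small buffer so the job fails
# loudly *before* SQS makes the message visible and redelivers it while
# the original invocation is still running.
def run_with_deadline(visibility_timeout, &job)
  budget = visibility_timeout - 5
  Timeout.timeout(budget, JobTimedOutError, &job)
end
```

Pairing this with documented duration metrics would make it obvious when a queue's timeout needs to be raised rather than silently retrying.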
Interesting, what about simply always setting it to 900? In both the CloudFormation template and before running the job?
I'm trying to see the scenarios where this is bad, but the only one I can come up with is a total failure of the machine/job before it can set the visibility timeout back down from 900 to the `Lambdakiq::Backoff` timing.
If I'm understanding correctly, the 2 main scenarios:

Success:
- Job start (set timeout to 900)
- Job finish (`delete_message`)

Failure with retries:
- Job start (set timeout to 900)
- Job failure (set timeout to 30)
- Job start (set timeout to 900)
- Job failure (set timeout to 46)
- ...
Obviously, if there is a total machine failure, the job doesn't get picked up again for ~900s, which I think is a great tradeoff for how simple this is.
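The two scenarios above could look roughly like this. The helper names are hypothetical, the real client would be `Aws::SQS::Client`, and `backoff_for` is a stand-in, not `Lambdakiq::Backoff`'s actual formula:

```ruby
# Sketch: pin visibility to 900 while running, shrink it to the backoff
# delay on failure (hypothetical helpers, not Lambdakiq's implementation).
def perform_message(client, queue_url, message, retry_count)
  handle = message[:receipt_handle]
  # Job start: make the message invisible for Lambda's full 15-minute limit.
  client.change_message_visibility(
    queue_url: queue_url, receipt_handle: handle, visibility_timeout: 900
  )
  result = yield
  # Job finish: success removes the message from the queue entirely.
  client.delete_message(queue_url: queue_url, receipt_handle: handle)
  result
rescue
  # Job failure: shrink the timeout to the backoff delay (e.g. 30s, 46s, ...)
  # so the retry becomes visible far sooner than 900s.
  client.change_message_visibility(
    queue_url: queue_url, receipt_handle: handle,
    visibility_timeout: backoff_for(retry_count)
  )
  raise
end

def backoff_for(count)
  # Illustrative exponential backoff; not Lambdakiq::Backoff's real formula.
  (count**4) + 30
end
```

The one unguarded window is between receive and the first `change_message_visibility` call, which is where the total-failure tradeoff mentioned above lives.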
> put job back in queue

To be clear on the technical language, my understanding is that it's not really "back in queue"; the message stays "in flight" and waits out the `visibility_timeout` before becoming "available" again. Or am I not thinking about it correctly?