Using a heartbeat to extend visibility_timeout #9

nitsujri started this conversation in Ideas

Thoughts on using a heartbeat to extend the visibility_timeout?

I'm not 100% sure, but I believe there might be two issues for longer-running jobs:

  • Currently, if the job takes longer than the default VisibilityTimeout: 301, it will get retried, potentially indefinitely.
  • If the job hits an error, the visibility_timeout is used as the backoff function, but the visibility_timeout isn't restored on the next run, so a retried job must complete faster than the Lambdakiq::Backoff delay.

A heartbeat would solve both of these issues, up to the 900-second limit.
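For reference, here is roughly what I have in mind, as a minimal sketch. It assumes the job runner has access to the queue URL and the message's receipt handle; the method and parameter names are illustrative, not Lambdakiq internals:

```ruby
require "aws-sdk-sqs"

# Minimal heartbeat sketch: a background thread periodically extends the
# in-flight message's visibility timeout while the job is still running.
def with_heartbeat(queue_url:, receipt_handle:, extension: 300, interval: 60)
  sqs = Aws::SQS::Client.new
  heartbeat = Thread.new do
    loop do
      sleep interval
      # Push the visibility timeout out another `extension` seconds so SQS
      # does not redeliver the message while we are still working on it.
      sqs.change_message_visibility(
        queue_url: queue_url,
        receipt_handle: receipt_handle,
        visibility_timeout: extension
      )
    end
  end
  yield
ensure
  heartbeat&.kill
end
```

Usage would be something like `with_heartbeat(queue_url: url, receipt_handle: handle) { job.perform_now }`.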

I think Shoryuken implemented this at one point, but it swapped it out in favor of a more complex system that I don't fully understand.

Replies: 1 comment 1 reply

Oh yeah, I think I remember reading about the heartbeat when doing the initial release. My thinking was that if folks knew they had jobs that ran close to, or over, 5 minutes (300s), they should create a dedicated queue/function with distinct timeouts. As you pointed out, the max would be Lambda's 15 minutes (900s). What do you think about making these changes?

  1. Change the examples from 300 to 900 and document job duration limits on Lambda.
  2. Add a job timer based on the queue's visibility timeout. Whether it is 60, 300, or 900, raise a timeout error and let the job go back to the queue (see the sketch after this list).
  3. Add documentation in the Metrics section on monitoring durations.
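For point 2, a rough sketch of what I'm picturing. The method name is made up and the visibility timeout is passed in by the caller; this is not how Lambdakiq is wired up today:

```ruby
require "timeout"

# Wrap the job in a timer matching the queue's visibility timeout. If the job
# runs over, raise and let the message become visible again on its own.
def run_with_queue_timeout(job, visibility_timeout)
  Timeout.timeout(visibility_timeout) do
    job.perform_now
  end
rescue Timeout::Error
  # Re-raising instead of deleting the message leaves it in flight, so SQS
  # redelivers it once the visibility timeout elapses.
  raise
end
```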

Seems to me this would get us where we need to be without having to make heartbeat calls.

1 reply

Interesting, what about simply always setting it to 900, both in the CloudFormation template and before running the job?

I'm trying to think of scenarios where this is bad, but the only one I can come up with is a total failure of the machine/job before it can set the visibility timeout back down from 900 to the Backoff timing.

If I'm understanding correctly, the two main scenarios are:

- Job start (set timeout to 900)
  - Job finish (delete_message)
- Job start (set timeout to 900)
  - Job failure (set timeout to 30)
  - Job start (set timeout to 900)
  - Job failure (set timeout to 46)
  - ...

Obviously, if there is a total machine failure, the job doesn't get picked up again for ~900s, which I think is a great tradeoff for how simple this is.
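To spell out the flow I'm picturing, here's a sketch. The SQS client, queue URL, receipt handle, and `backoff_seconds` helper are placeholders, not Lambdakiq's actual internals:

```ruby
require "aws-sdk-sqs"

# Always bump the in-flight message to the 900s maximum before running, then
# either delete it on success or shrink the timeout to the backoff on failure.
def run_with_max_visibility(sqs:, queue_url:, receipt_handle:, job:, attempts:)
  sqs.change_message_visibility(queue_url: queue_url,
                                receipt_handle: receipt_handle,
                                visibility_timeout: 900)
  job.perform_now
  # Success: delete the message so it is never redelivered.
  sqs.delete_message(queue_url: queue_url, receipt_handle: receipt_handle)
rescue StandardError => e
  # Failure: shrink the visibility timeout to the backoff delay so the retry
  # becomes available sooner than the full 900 seconds.
  # `backoff_seconds` stands in for whatever Lambdakiq::Backoff computes.
  sqs.change_message_visibility(queue_url: queue_url,
                                receipt_handle: receipt_handle,
                                visibility_timeout: backoff_seconds(attempts))
  raise e
end
```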

Regarding "put job back in queue": to be clear on the technical language, my understanding is that it's not really "back in the queue"; it's "in flight," waiting out the visibility_timeout before becoming "available" again. Or am I not thinking about it correctly?
