Thoughts on using a heartbeat to extend the visibility_timeout?
I'm not 100% sure, but I believe there might be 2 issues for longer running jobs:

- Currently, if the job takes longer than the default `VisibilityTimeout: 301`, it will get retried, potentially indefinitely.
- If the job hits an error, the visibility timeout is used as the backoff function, but `visibility_timeout` isn't restored on the next run, so a retried job must complete faster than the `Lambdakiq::Backoff` delay.

A heartbeat would solve these two issues up until the 900 second limit.
I think Shoryuken implemented this at one point, but it swapped it out in favor of a more complex system that I don't fully understand.
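For reference, a heartbeat along these lines could be sketched as below. This is a hypothetical `Heartbeat` class, not part of Lambdakiq; in production the client would be an `Aws::SQS::Client`, but here it's anything responding to `change_message_visibility`:

```ruby
# Sketch of a visibility-timeout heartbeat (hypothetical, not Lambdakiq's API).
# A background thread periodically calls ChangeMessageVisibility so a long
# running job's message stays in flight instead of being redelivered.
class Heartbeat
  def initialize(client:, queue_url:, receipt_handle:, interval: 60, extension: 120)
    @client = client
    @queue_url = queue_url
    @receipt_handle = receipt_handle
    @interval = interval     # seconds between heartbeats
    @extension = extension   # new invisibility window on each beat
    @running = false
  end

  def start
    @running = true
    @thread = Thread.new do
      while @running
        sleep @interval
        break unless @running
        # Push the message's invisibility window out again.
        @client.change_message_visibility(
          queue_url: @queue_url,
          receipt_handle: @receipt_handle,
          visibility_timeout: @extension
        )
      end
    end
  end

  def stop
    @running = false
    @thread&.kill
  end
end
```

The job runner would `start` this before `perform` and `stop` it in an `ensure` block, so a crashed job stops extending and the message becomes visible again after at most `extension` seconds.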
Oh yea, I think I remember reading about the heartbeat when doing the initial release. My thinking is that if folks knew they had jobs that ran close to, or over, 5 minutes (300s) they should create a specific queue/function with distinct timeouts. As you pointed out, the max would be Lambda's 15 minutes (900s). What do you think if we did these changes?
- Change examples from 300 to 900 and document job duration limits on Lambda.
- Add a job timer based on the queue's visibility timeout. Whether it's 60, 300, or 900, raise a timeout error and let the job go back to the queue.
- Add documentation in the Metrics section to monitor durations.
Seems to me this would get us where we need without having to make heartbeat calls.
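That job timer could be a thin `Timeout` wrapper, something like the sketch below. `run_with_deadline` is a hypothetical helper name, and the 5-second safety buffer is an assumption, not anything Lambdakiq does today:

```ruby
require "timeout"

# Raised when a job outruns its queue's visibility timeout (hypothetical).
class JobTimedOutError < StandardError; end

# Run a job under a deadline derived from the queue's visibility timeout
# (60, 300, or 900 seconds). We subtract a small buffer so the job fails
# loudly *before* SQS makes the message visible and redelivers it while
# the original invocation is still running.
def run_with_deadline(visibility_timeout, &job)
  budget = visibility_timeout - 5
  Timeout.timeout(budget, JobTimedOutError, &job)
end
```

Pairing this with documented duration metrics would make it obvious when a queue's timeout needs to be raised rather than silently retrying.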
Interesting, what about simply always setting it to 900? In both the CloudFormation template and before running the job?
I'm trying to see the scenarios where this is bad, but the only one I can come up with is a total failure of the machine/job before it can set the visibility timeout back down from 900 to the `Lambdakiq::Backoff` timing.
If I'm understanding correctly, the 2 main scenarios:

Success:
- Job start (set timeout to 900)
- Job finish (`delete_message`)

Failure with retries:
- Job start (set timeout to 900)
- Job failure (set timeout to 30)
- Job start (set timeout to 900)
- Job failure (set timeout to 46)
- ...
Obviously, if there is a total machine failure, the job doesn't get picked up again for ~900s, which I think is a great tradeoff for how simple this is.
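The two scenarios above could look roughly like this. The helper names are hypothetical, the real client would be `Aws::SQS::Client`, and `backoff_for` is a stand-in, not `Lambdakiq::Backoff`'s actual formula:

```ruby
# Sketch: pin visibility to 900 while running, shrink it to the backoff
# delay on failure (hypothetical helpers, not Lambdakiq's implementation).
def perform_message(client, queue_url, message, retry_count)
  handle = message[:receipt_handle]
  # Job start: make the message invisible for Lambda's full 15-minute limit.
  client.change_message_visibility(
    queue_url: queue_url, receipt_handle: handle, visibility_timeout: 900
  )
  result = yield
  # Job finish: success removes the message from the queue entirely.
  client.delete_message(queue_url: queue_url, receipt_handle: handle)
  result
rescue
  # Job failure: shrink the timeout to the backoff delay (e.g. 30s, 46s, ...)
  # so the retry becomes visible far sooner than 900s.
  client.change_message_visibility(
    queue_url: queue_url, receipt_handle: handle,
    visibility_timeout: backoff_for(retry_count)
  )
  raise
end

def backoff_for(count)
  # Illustrative exponential backoff; not Lambdakiq::Backoff's real formula.
  (count**4) + 30
end
```

The one unguarded window is between receive and the first `change_message_visibility` call, which is where the total-failure tradeoff mentioned above lives.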
> put job back in queue

To be clear on the technical language, my understanding is that it's not really "back in queue"; the message stays "in flight" and waits out the `visibility_timeout` before becoming "available" again. Or am I not thinking about it correctly?