Increase ENDLESS_TASK_TIMEOUT to 72h? #2202

Closed
opened 2025-11-07 20:16:52 +01:00 by zyphlar · 6 comments

Hi there! We at CoMaps have our own runners for beefy long-running Actions, and we benefit greatly from having our CI/CD ecosystem all in one place here at Codeberg. However, our map generation job can run for days (currently 5 days, but ideally 3 or less once we optimize a problematic straggler). I've edited the runner settings so the jobs actually keep going behind the scenes; the only issue is that Codeberg gives up on the job and marks it as failed... but doesn't actually force-stop the job with the runner. Which is a good thing, I like my jobs finishing behind the scenes! But it's also unfortunate, since the job looks failed and stopped and causes a little panic.

Hopefully it's not too many resources to hold a job open at the coordinator level, right? And the primary resource issue is the runners themselves, which should actually be doing the bulk of any cancelling?
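(For context, the runner-side setting referred to above is presumably the job timeout in the Forgejo runner's config.yml, i.e. the standard `runner.timeout` key. The exact value CoMaps uses isn't stated in this thread, so the snippet below is only an illustrative sketch of where such a change lives.)

```yaml
# forgejo-runner config.yml -- illustrative sketch only; the actual value CoMaps
# runs is not stated in this thread.
runner:
  # Maximum time the runner allows a single job to run before giving up on it.
  # Raised so multi-day map generation jobs can finish instead of being killed by the runner.
  timeout: 120h
```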

The current value is 3 hours and I'm not exactly sure who benefits from it (the runner keeps the job running regardless). I don't feel strongly about changing this.

@zyphlar wrote in #2202 (comment):

> Hopefully it's not too many resources to hold a job open at the coordinator level, right? And the primary resource issue is the runners themselves, which should actually be doing the bulk of any cancelling?

You might be better positioned to check this: does the job keep producing output after it has been marked as failed? All that stopping the task does is update its status in the database.

CC @fnetX, do you have anything against changing this value?
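(For readers unfamiliar with these knobs: the values being discussed live in the Forgejo instance's app.ini, assuming the standard `[actions]` section and upstream defaults shown below; the settings Codeberg actually runs are kept in their infrastructure repo.)

```ini
; app.ini -- sketch of the relevant [actions] timeouts with their upstream defaults
[actions]
; Mark tasks as failed if they are still "running" but have sent no update for this long.
ZOMBIE_TASK_TIMEOUT = 10m
; Mark tasks as failed if they keep sending updates but still haven't finished after this long.
ENDLESS_TASK_TIMEOUT = 3h
```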

Owner

No, I don't have strong objections. I have already increased them in Codeberg-Infrastructure/build-deploy-forgejo@747d8b7c22. While I see the value in cancelling jobs that take too long, I think this is better done at the runner level to give project admins control. I don't see a good reason for enforcing this at Codeberg's level.

Owner

Oh, and re-reading my commit message: @zyphlar It is actually relevant if your jobs produce no output, because then ZOMBIE_TASK_TIMEOUT also applies. If they produce debug output, we should be fine.

@Gusted I'll have to look into the exact behaviour of Actions (or you can tell me if you know). I have questions about the behaviour of running jobs with no log output. Ideally, if the runner continuously reports them as still running, they should not be cancelled. However, I would want jobs where the runner simply disappears (e.g. because of server shutdowns, network issues, crashes ... and thus cannot mark the job as failed) to run into some sort of timeout, because that is then solely our responsibility.

So:
- Runner actively reports the job as running: we shouldn't cancel; it's not our responsibility.
- Runner disappears and no longer marks the job as running: we need to assume the runner has failed, and it becomes our responsibility to consider the job failed after some time.

Author

Sounds good, thanks! The symptom is that `codeberg.org/*/actions/*` shows the job as having failed mysteriously after 24h and some minutes, even with recent log output, but investigating the Forgejo runner (Docker on our server) shows the job actually continuing just fine. Then, after that job has finished, the runner moves on to the next job just fine (with successful codeberg.org output for that subsequent job).

While we're at it, if CoMaps is using some resource excessively, feel free to open an issue and @ me or @pastk -- thanks for everything you do! Making the map generation process public, reproducible, and automated has been a big goal of mine.

Owner

Hi! I have finally made the config changes in Codeberg-Infrastructure/build-deploy-forgejo@f16fc1b3d7 and they'll be included in the next deployment, likely in a few days. I'm closing this issue; I suppose you'll notice once the behaviour changes :)

Please let us know once you have optimized the builds for a shorter runtime, so we can ideally lower the threshold again (currently at 5d).

I'm very happy to support CoMaps, and I hope that the last anti-feature flag can be removed from F-Droid some day. If there is anything else that we can do for you (e.g. actions/meta#54), please let us know.
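(Based on the 'currently at 5d' figure above, the deployed change presumably amounts to something like the sketch below; the exact key and notation Codeberg uses are only visible in the linked build-deploy-forgejo commit, so treat this as a hedged illustration with 5 days written as 120h.)

```ini
; app.ini -- hedged illustration of the new threshold (5 days written as 120h);
; the value actually deployed is in the linked build-deploy-forgejo commit.
[actions]
ENDLESS_TASK_TIMEOUT = 120h
```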

Author

Thanks so much! I think we're at 36-ish hours currently.
