Increase ENDLESS_TASK_TIMEOUT to 72h? #2202

Closed
opened 2025-11-07 20:16:52 +01:00 by zyphlar · 6 comments

Hi there! We at CoMaps have our own runners for beefy long-running Actions, and we benefit greatly from having our CI/CD ecosystem all in one place here at Codeberg. However, our map generation job can run for days (currently 5 days, but ideally 3 or less once we optimize a problematic straggler). I've edited the runner settings so the jobs actually keep going behind the scenes; the only issue is that Codeberg gives up on the job and marks it as failed... but doesn't actually force-stop the job with the runner. Which is a good thing, I like my jobs finishing behind the scenes! But it's also unfortunate, since the job looks failed and stopped and causes a little panic.

Hopefully it's not too many resources to hold a job open at the coordinator level, right? And the primary resource issue is the runners themselves, which should actually be doing the bulk of any cancelling?
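(For context, the runner-side setting referred to above is presumably the job timeout in the Forgejo runner's config.yml, i.e. the standard `runner.timeout` key. The exact value CoMaps uses isn't stated in this thread, so the snippet below is only an illustrative sketch of where such a change lives.)

```yaml
# forgejo-runner config.yml -- illustrative sketch only; the actual value CoMaps
# runs is not stated in this thread.
runner:
  # Maximum time the runner allows a single job to run before giving up on it.
  # Raised so multi-day map generation jobs can finish instead of being killed by the runner.
  timeout: 120h
```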

The current value is 3 hours and I'm not exactly sure who benefits from it (the runner keeps the job running regardless). I don't feel strongly about changing this.

@zyphlar wrote in #2202 (comment):

> Hopefully it's not too many resources to hold a job open at the coordinator level, right? And the primary resource issue is the runners themselves, which should actually be doing the bulk of any cancelling?

You might be better positioned to check this: does the job keep producing output after it has been marked as failed? All that stopping the task does is update its status in the database.

CC @fnetX, do you have anything against changing this value?
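(For readers unfamiliar with these knobs: the values being discussed live in the Forgejo instance's app.ini, assuming the standard `[actions]` section and upstream defaults shown below; the settings Codeberg actually runs are kept in their infrastructure repo.)

```ini
; app.ini -- sketch of the relevant [actions] timeouts with their upstream defaults
[actions]
; Mark tasks as failed if they are still "running" but have sent no update for this long.
ZOMBIE_TASK_TIMEOUT = 10m
; Mark tasks as failed if they keep sending updates but still haven't finished after this long.
ENDLESS_TASK_TIMEOUT = 3h
```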

Owner

No, I don't have strong objections. I have already increased them in Codeberg-Infrastructure/build-deploy-forgejo@747d8b7c22. While I see the value in cancelling jobs that take too long, I think this is better done at the runner level to give project admins control. I don't see a good reason for enforcing this at Codeberg's level.

Owner

Oh, and re-reading my commit message: @zyphlar It is actually relevant if your jobs produce no output, because then ZOMBIE_TASK_TIMEOUT also applies. If they produce debug output, we should be fine.

@Gusted I'll have to look into the exact behaviour of Actions (or you can tell me if you know). I have questions about the behaviour of running jobs with no log output. Ideally, if the runner continuously reports them as still running, they should not be cancelled. However, I would want jobs where the runner simply disappears (e.g. because of server shutdowns, network issues, crashes ... and thus cannot mark the job as failed) to run into some sort of timeout, because that is then solely our responsibility.

So:
- Runner actively reports the job as running: we shouldn't cancel; it's not our responsibility.
- Runner disappears and no longer marks the job as running: we need to assume the runner has failed, and it becomes our responsibility to consider the job failed after some time.

Author

Sounds good, thanks! The symptom is that `codeberg.org/*/actions/*` shows the job as having failed mysteriously after 24h and some minutes, even with recent log output, but investigating the Forgejo runner (Docker on our server) shows the job actually continuing just fine. Then, after that job has finished, the runner moves on to the next job just fine (with successful codeberg.org output for that subsequent job).

While we're at it, if CoMaps is using some resource excessively, feel free to open an issue and @ me or @pastk -- thanks for everything you do! Making the map generation process public, reproducible, and automated has been a big goal of mine.

Owner

Hi! I have finally made the config changes in Codeberg-Infrastructure/build-deploy-forgejo@f16fc1b3d7 and they'll be included in the next deployment, likely in a few days. I'm closing this issue; I suppose you'll notice once the behaviour changes :)

Please let us know once you have optimized the builds for a shorter runtime, so we can ideally lower the threshold again (currently at 5d).

I'm very happy to support CoMaps, and I hope that the last anti-feature flag can be removed from F-Droid some day. If there is anything else that we can do for you (e.g. actions/meta#54), please let us know.
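(Based on the 'currently at 5d' figure above, the deployed change presumably amounts to something like the sketch below; the exact key and notation Codeberg uses are only visible in the linked build-deploy-forgejo commit, so treat this as a hedged illustration with 5 days written as 120h.)

```ini
; app.ini -- hedged illustration of the new threshold (5 days written as 120h);
; the value actually deployed is in the linked build-deploy-forgejo commit.
[actions]
ENDLESS_TASK_TIMEOUT = 120h
```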

Author

Thanks so much! I think we're at 36-ish hours currently.
