Codeberg/Community

500/504 server error creating issues/PR, commenting etc. #2596

Open
opened 2026-05-06 11:21:02 +02:00 by pastk · 75 comments

Comment

This has been happening for several hours a day over the previous 3 days.
E.g. when posting a comment, a "loading" (sending?) spinner is visible and spins for several seconds, then a red `! server error: 500` error bar appears, and upon page refresh it's clear the comment hasn't been added.

It also happens when e.g. rebasing a PR and force-pushing changes: the code gets updated, but the force-push message doesn't get recorded. Same with merging a PR: it gets merged, but its status is not updated to "merged", etc.

The first time it was resolved, I thought it was a transient problem that had been fixed.

But now it's happening again, for a third time.
I.e. I was not able to create this particular issue at ~9:20am UTC, so I reported it in the Matrix chat and waited until the problem went away again.
(CC @gusted)

A general comment on the deadlock issues: yesterday, during Codeberg's weekly community meeting, we looked into it and discovered that one of our Galera (MariaDB database cluster) nodes is slowing down the whole cluster. The underlying cause is not confirmed; we saw troubling numbers on one of the root drive SSDs, and the machine hosting that node has the weakest CPU. This machine also hosts most services (notable exceptions being Forgejo and the Woodpecker CI agent), which were moved to this server when our third machine suddenly died (https://social.anoxinon.de/@codebergstatus/115974907704158246), so moving them back to distribute the CPU load is likely planned as well.

For now we've increased the number of threads (Codeberg-Infrastructure/scripted-configuration@d80ce8c6c2), which is showing promising results, although I do still see codeberg.org returning 5xx errors when the operation (mostly) succeeded, so the issue is not completely gone.

There's also a possibility that this is due to the Forgejo v15 upgrade, as we did that quite recently, but so far nothing points to a possible suspect in that area. We do see "weird" `DELETE` queries from time to time that the database struggles to process, which might suggest Forgejo is sending overly complex queries to the database.

I'm also seeing the same issue - creating a PR, it just spins then gives me a big red 500. I've tried several times so far, but no success.

@jacklund I can see a few deadlocks around that time. Unfortunately, the last deadlock recorded by MariaDB (the one for which I can see detailed information) was unrelated to commenting/issues/PRs :(

Forgejo v15 gained the ability to retry transactions; I will add some retries around the areas where I can see deadlocks from around that time. -> Deployed as part of Codeberg-Infrastructure/forgejo@7f55180a7d

I was going to open an issue on behalf of the Gentoo PR assignment and CI with more or less the same title: Frequent 500 errors. From what I understand, this appears to happen particularly often around deleting comments.

The scripts in use that interact with Codeberg are found here:

- https://gitweb.gentoo.org/proj/repo-mirror-ci.git/tree/pull-request
- https://gitweb.gentoo.org/proj/assign-pull-requests.git/tree/assign-pull-requests-codeberg.py

In both repos, the API is managed in a `CodebergAPI` class defined in codebergapi.py. Both sets of scripts are executed in a cron job and post comments on PRs. They also clean up old comments, so there's somewhat frequent comment deletion.

For example: https://gitweb.gentoo.org/proj/assign-pull-requests.git/tree/assign-pull-requests-codeberg.py#n175

An example PR: https://codeberg.org/gentoo/gentoo/pulls/853 - includes both "Pull Request assignment" and "Pull request CI report".

Deleting things is quite involved, as it doesn't just delete the entry from the database but also deletes references to it, and then you're suddenly touching quite a few tables. And yeah, that makes it more prone to deadlocks. That said, this is yet another function that can benefit from `RetryTx` to make it (slightly) more robust; I will add it for the next deployment.
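
For readers unfamiliar with retrying a transaction after a deadlock, here is a minimal Python sketch of the general idea. It is not Forgejo's `RetryTx` (Forgejo is written in Go); `DeadlockError`, `retry_tx` and all parameters are hypothetical names chosen for illustration.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver's deadlock error (MariaDB reports error 1213)."""


def retry_tx(run_tx, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run run_tx(); if it raises DeadlockError, back off and run it again from the start.

    run_tx is assumed to open, execute and commit one whole transaction,
    rolling back before it raises, so repeating it is safe.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_tx()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff with jitter so competing transactions
            # are less likely to collide again at the same moment.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

A retry like this only helps with short-lived lock conflicts; it does not help when one long-running query holds a lock for minutes.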

I cannot add a user to the team of a repo. (again 500 errors)

@jakorten your 500 is not related to this one, please check the username that you're entering (specifically remove some whitespaces at the beginning of your input)

Interesting, but now it does work (I copy-paste usernames so I doubt it has to do with spaces)

@jakorten Your input into the username field is ` user-name`; the leading whitespace needs to be removed.

This might now manifest as 504 errors: https://social.anoxinon.de/@codebergstatus/116540639518944405

We've identified a regression in Forgejo. We were able to catch a problematic query that was locking the `comment` table for several minutes, causing all commenting-related operations to deadlock. The query itself was a simple `DELETE FROM comment WHERE dependent_issue_id IN (list of issue ids)`, and the `dependent_issue_id` column has an index. This should be a very fast operation, but when executed manually with some test data it was really slow, and the `ANALYZE` query provided a very clear reason: the index was not being used. Trial and error showed the limit was 200, a value which happens to be the default of [eq_range_index_dive_limit](https://mariadb.com/docs/server/server-management/variables-and-modes/server-system-variables#eq_range_index_dive_limit).

Here's where the regression comes in: the length of the issue ID list was previously 50 but is now 500 with forgejo/forgejo#11999, therefore going over that limit and hitting this optimizer behavior. Why the optimizer stops using the index is still unknown, and is possibly a bug. We were able to resolve the performance issue by bumping the threshold to 501, restoring the old behavior of not hitting this optimizer path. Codeberg-Infrastructure/scripted-configuration@4800a045b1

I will keep this thread open for a few more days, but I believe this eliminates most of the deadlock errors we've been seeing in the past few weeks.
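
To make the batch-size point concrete, here is a small Python sketch of splitting a long ID list into bounded `IN (...)` lists so that no single statement exceeds a chosen size. It uses SQLite purely so the example is runnable; SQLite has no `eq_range_index_dive_limit`, the table is a toy version of `comment`, and `chunked` and `delete_dependent_comments` are hypothetical helpers, not Forgejo code.

```python
import sqlite3


def chunked(ids, size):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_dependent_comments(conn, issue_ids, batch_size=50):
    """Delete comments referencing issue_ids, one bounded IN list per statement."""
    deleted = 0
    for batch in chunked(issue_ids, batch_size):
        # One "?" placeholder per ID in this batch.
        placeholders = ",".join("?" for _ in batch)
        cur = conn.execute(
            f"DELETE FROM comment WHERE dependent_issue_id IN ({placeholders})",
            batch,
        )
        deleted += cur.rowcount
    conn.commit()
    return deleted
```

With 500 IDs and `batch_size=50`, this issues ten statements of 50 IDs each, which stays below the 200-element threshold described above.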

I still get 504 on repo creation

@breinich wrote in https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14669340:

> I still get 504 on repo creation

I have the same problem. Both when migrating and when creating a new repo.

Getting a 504 while trying to delete a repo

I have experienced this error over the past few days (also today [1]). It is especially noticeable at 12:00-13:00; at other times of day, like now, it works better (especially at night).

I am thinking: what about a test/healthcheck that creates an issue each hour and measures the time taken (or a timeout)? I suspect the errors are due to heavy load; it might require extra resources at "prime time", but maybe there is more optimization that could address that heaviness.

example/inspiration: https://matrix.org/blog/2020/11/03/how-we-fixed-synapse-s-scalability/#performance

[1] this was creating an issue, trying different moments (I got rate limited pretty fast in a previous attempt, so this time, I was more relaxed on retrying for several minutes)

image
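
A rough Python sketch of what such a timed check could look like. Everything here is hypothetical (`timed_probe`, the `slow_after` threshold); `action` stands for whatever write operation would be probed, for example creating an issue in a throwaway repository, and is passed in as a function so the sketch makes no assumptions about the API.

```python
import time


def timed_probe(action, slow_after=5.0, clock=time.monotonic):
    """Run action() once and report how it went and how long it took."""
    start = clock()
    try:
        action()
        status = "ok"
    except Exception as exc:  # any failure (timeout, HTTP error, ...) counts as failed
        status = f"failed: {exc}"
    elapsed = clock() - start
    # A successful but very slow write is reported separately from a fast one.
    if status == "ok" and elapsed > slow_after:
        status = "slow"
    return status, elapsed
```

Run from a scheduler once per hour, the returned status and duration could be logged to show at which times of day writes slow down or fail.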

Currently cannot create a repository, I keep getting hammered with a 504.

I still get the same thing as above.

Yes, the 504 error is back. I can't star a repo and I can't create a new repo. Git operations seem to be working fine.

I also can't create repo's and subscribe to threads

I also cannot create a repository and am getting the 504 error.

This is fixed now. I just created a repo and starred another repo.

Seems to work for me too!

This has unfortunately manifested a few more times since we found the regression and applied a workaround for it. In the incidents where I was around and able to observe the 504, they still stemmed from the same query that we identified as problematic in https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14517906, and killing it (or restarting Forgejo) was the only way to unblock other queries. In the hope that it helps, Codeberg-Infrastructure/forgejo@6d221141a3 was deployed to restore the old behavior of smaller batch sizes.

We're still going to look into it more, but unfortunately the only time we can diagnose, test and troubleshoot the issue is when certain operations are not working (and you get the 504 gateway timeout). This does not give us a lot of time to deal with it properly, and we have to end those sessions early simply to restore functionality.

Still experiencing this when trying to create a PR (red error bar ! server error: 500).

Gusted changed title from "500 server error creating issues/PR, commenting etc." to "500/504 server error creating issues/PR, commenting etc." 2026-05-14 21:34:59 +02:00

Unable to fork ziglang/zig. Consistently seeing 504 errors.

Getting 504 when trying to create a repository too. Seems like this issue lasts since last week. Damn

Looks like I can't make new issues.

image

I'm consistently getting 504s when creating repositories, hope this gets resolved soon! <3

Edit: Starring and watching repositories also results in the same error.

Can confirm that I am unable to create new repositories

I'm also experiencing 504s. Clicking "watch" on this issue returns 504, and attempting to create a new issue is also returning 504

Failed to star or fork any repos, too, the error code is 504. Any updates on this issue?

Error 504 when calling /Codeberg/Community/issues/2596/watch: `<html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>` :(

Same here, can't create new repositories

Same issue when trying to create or delete a repository.
image

I've been trying to make a fork for a PR for the last 5 hours
image
I also can't add stars, watch projects, or follow issues (I discovered this while trying to follow this one).
image

I get the same issues

Same here, been trying to create a repo for I'd say the last couple of hours and cannot due to this.

Same issue, cannot delete or create a repo. Cannot star or unwatch a repo either. RSS button works fine though.

As many others I get 504 errors when trying to fork a repository or subscribe to some issue. How can we help from a user perspective? Can we collect and provide any data to help in troubleshooting?

I think the status.codeberg.org website should reflect these issues; I honestly thought it was a problem with my provider or me because everything looked fine on the status website.
image

It's been resolved for now.

@calsan This can't be integrated into the status page easily: it only affects certain endpoints, and as they perform write operations, the status page can't replicate them as a health check.

@Gusted I understand. I was just saying that because I recently migrated from GitHub, and I figure it's very frustrating for a new user to see that something they need isn't working when there's no information about it. I mean, obviously this issue exists, but if you check the status page first... it took me four hours before I found it. I was trying with VPNs or my third-world providers. Anyway, I know it's not your or the team's fault; in fact, thanks, it seems to be working now.

To also give a status update on the underlying problem. Whenever a big (many issues, comments, pull requests, action runs) repository is being deleted this is done in a transaction, and at some point it effectively has locked most tables and is blocking other INSERT/UPDATE queries to the point those are timing out. I'll be prioritizing to reduce the "harm" that this transaction can do, but it's no easy task and requires some non-trivial engineering. If you've ideas please reach out in https://matrix.to/#/#forgejo-development:matrix.org
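
As a very rough illustration of the trade-off involved (not a proposal for how Forgejo should do it), the Python sketch below deletes rows in small batches and commits after each one, so locks are released between batches instead of being held for the whole deletion. The price is that the deletion is no longer atomic. It uses SQLite and a toy `comment` table with a made-up `repo_id` column; `delete_in_batches` is a hypothetical helper.

```python
import sqlite3


def delete_in_batches(conn, repo_id, batch_size=1000):
    """Delete one repository's comments in small, separately committed batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM comment WHERE id IN ("
            "SELECT id FROM comment WHERE repo_id = ? LIMIT ?)",
            (repo_id, batch_size),
        )
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            return total  # nothing left to delete
        total += cur.rowcount
```

If the process stops halfway, some rows are already gone and others are not, which is part of why doing this properly across many related tables takes real engineering.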

I'm adding that I'm seeing 504s when trying to fork a repo.

[EDIT: Tried again just now, ~4 hours later. Worked fine.]

Same here: I have been trying to fork guix/guix for two days, without success.

I get a 504 response after a 30s delay.

Is this like an intermittent issue surfaced during high load, or should we expect for the time being that forking a large repository is going to trigger a timeout? Not trying to pile on, just looking for a little more clarity on the current assessment. ty.

@Gusted gave an explanation of the underlying problem [yesterday](https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15079674).

TL;DR: When you try to fork a repo while the database tables are locked, you run into a timeout.

I have same 504 error on simple attempt to add comment here
freesewing/freesewing#848

Whello. We too are still experiencing it. Some logs:

https://github.com/cinepro-org/core/actions/runs/26169397922/job/76983309056
image
https://github.com/cinepro-org/docs/actions/runs/26169524670/job/76982973737
image

I hope you can fix it soon.

A little bit funny that I got a 504 trying to open this issue.
When the issue tracker is affected, you just know it is bad ...

Clicking on one of the larger (< 1Mb) files in my repo results in 504 error every time.

works again....

still not work...

I'm also running into this when trying to change my username.

(edit: worked almost immediately after posting this comment...)

This whole situation is such a mess. Sometimes codeberg does work, other times it doesn't. I wish someone could give a clear explanation of why this happened and how and when it will be resolved.

@thomasboom wrote in https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15659306:

> This whole situation is such a mess. Sometimes codeberg does work, other times it doesn't. I wish someone could give a clear explanation of why this happened and how and when it will be resolved.

See the following two comments for an explanation and what the team behind Codeberg.org is currently doing about it:

- [comment 1](https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14517906)
- [comment 2](https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14768952)

> We're still going to look more into it, but unfortunately the time where we can diagnose, test and troubleshooting the issue is when certain operations are not working (and you get the 504 gateway timeout). This does not give us a lot of time to properly deal with it and have to end those sessions early to simply restore functionality.

There are a lot of transient 504s, but in plenty of cases the problem can be reproduced 100% of the time.

I've been trying to fork forgejo/forgejo for almost a week now, and every day I get a 504. I have yet to succeed.

I don't know if that is the same root cause as all the other 504s reported in this thread, but surely there is no problem with reproducing this particular issue?

@Tronde wrote in https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15660377:

> See the following two comments for an explanation and what the team behind Codeberg.org is currently doing about it:
>
> * [comment 1](https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14517906)
> * [comment 2](https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-14768952)

While I completely understand the volunteer nature of Codeberg, comment 2 does not give us a lot to work with:

> We're still going to look more into it, but unfortunately the time where we can diagnose, test and troubleshooting the issue is when certain operations are not working (and you get the 504 gateway timeout). This does not give us a lot of time to properly deal with it and have to end those sessions early to simply restore functionality.

Ok.. I get that complex ephemeral bugs are hard to diagnose. On the other hand.. it's literally making Codeberg unusable. Like, my team can't create ANY PRs. That's not a sometimes thing, it's not exotic behavior.. it's literally the main day-to-day task.

It's not intermittent, either -- it fails every single time with a `500` error.

(We've had to switch away from Codeberg until this is fixed.)

It also sounds like, for whatever reason, this issue is mixing together at least two distinct classes of errors -- the `504` timeout, and the plain `500` error on creating PRs.
It doesn't sound like the latter is related?

It would be great to read a statement (or follow-up) at some point on whether this issue is mainly due to the use of old/low-powered hardware (in relation to the load the instance now has), given that a general bug in Forgejo has been ruled out by now?

The issue described in https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15079674 was fixed yesterday morning. The results so far are good, and we believe this resolves the main source of 504 errors related to creating issues, comments and pull requests, starring and watching (repo-related actions that touch the database with an INSERT or UPDATE statement).

---

Forking is a different issue: for large repositories there's a slow process of migrating tags, which takes ages. We had already raised this timeout to 10 minutes, but I just noticed there was a small ordering mistake which meant it didn't get applied correctly. @untitaker

---

@pat-s, it's solely Forgejo being put under stress by the amount of traffic we're seeing, and when that's combined with Forgejo making some assumptions that fail at scale, that's not fixable by better hardware. Also, sorry for the mess of this thread, but I hope I didn't give the impression that Forgejo was ruled out; it was the prime suspect 😄

---

@codenamedmitri Sorry, it seems my comment was poorly phrased: it's not an ephemeral bug. Rather, once we're aware of it, we also have in the back of our heads that we should resolve it ASAP, because it's affecting people's workflow by the second, as you've noticed. If we then want to casually read documentation, work out which tables are good to use for diagnosis, etc., that becomes quite stressful; it is better left for the next attempt, and we resolve the incident temporarily by stopping the offending query.

---

@bagel-very your case deserves a new issue. That said, today we are deploying a general bump to 30s as a trade-off, so maybe that resolves it.


I have CI (GitHub workflows) that have been getting 504 intermittently in the last few hours.
Example workflow log:
https://github.com/owncloud/core/actions/runs/26386534375/job/77666167782?pr=41552

And again:
https://github.com/owncloud/core/actions/runs/26388508278/job/77672390895?pr=41552


Hi, if you saw any 504 or 500 in the last ~10 hours then it was related to someone trying to bring the instance down.


Having 504 issues in GitLab CI's of Fdroid repo: https://gitlab.com/albertodiazsaez/fdroiddata/-/jobs/14550532584


Don’t forget to check https://status.codeberg.eu/status/codeberg when there are such issues. It seems to be https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15836651 again.


@mahlzahn wrote in #2596 (comment):

Don’t forget to check https://status.codeberg.eu/status/codeberg when there are such issues. Seems again #2596 (comment).

Status showed All Systems Operational, and still the pipelines are failing with a 504, I guess it's the same issue as you mentioned, but the Status page won't show it sometimes. Website seems to work, I only detect problems when working with git commands.


I've been getting 504s while fetching my repo over HTTPS, and "connection closed" over SSH, since this morning. Both of my machines are affected, so I don't think it's a config issue on my end.


The latest 504s are being caused by a load issue, we're not really sure where it's coming from but it seems related to incoming SSH connections.


same orphaned *.lock under /mnt/ceph-cluster on push, 504 at finalize — details in #2707, also on niko64/vecgfx #2710.


Another occurrence, this time on git push (not issue/PR creation): repo AnarBib/anarbib. Objects transfer fully, then the push dies at the post-receive / ref-update step:

Writing objects: 100% (56/56), done.
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly

The server-side ref never updates (web UI still shows the ~24h-old HEAD). A 504 at this same finalize step earlier left a stale refs/heads/main.lock (since cleared — thanks). I'm holding off on retrying to avoid recreating a lock. Details in #2707. Happy to provide timestamps.
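
A minimal sketch of how one might check whether the server-side ref moved before retrying a push that ended in a 504. It assumes a remote named `origin` and the branch `refs/heads/main`; the helper names `parse_ls_remote`, `push_landed` and `check_current_push` are hypothetical. The check cannot see a stale lock file on the server, and `git ls-remote` may itself run into the same 504.

```python
import subprocess


def parse_ls_remote(output: str) -> dict[str, str]:
    """Map ref name -> commit SHA from `git ls-remote` output (one `<sha>\\t<ref>` per line)."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        refs[ref] = sha
    return refs


def push_landed(local_sha: str, ls_remote_output: str, ref: str = "refs/heads/main") -> bool:
    """True if the remote ref already points at the commit we tried to push."""
    return parse_ls_remote(ls_remote_output).get(ref) == local_sha


def check_current_push(remote: str = "origin", ref: str = "refs/heads/main") -> bool:
    """Compare local HEAD with the remote ref; remote and ref names are assumptions."""
    local_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], check=True, capture_output=True, text=True
    ).stdout.strip()
    listing = subprocess.run(
        ["git", "ls-remote", remote, ref], check=True, capture_output=True, text=True
    ).stdout
    return push_landed(local_sha, listing, ref)
```

If `check_current_push()` returns `True`, the ref update went through despite the 504 and no retry is needed.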

  • I appear to have horribly broken this: https://codeberg.org/forgejo/docs/pulls/1994 by renaming https://codeberg.org/jsoref/docs to https://codeberg.org/jsoref/forgejo-docs -- which failed
  • I can't fork https://codeberg.org/ziglang/zig/ (see https://codeberg.org/Codeberg/Community/issues/2596#issuecomment-15032505) -- is there a way for someone to fork it for me? I have a large changeset I'd like to propose...

From Gentoo's CI perspective: It's working well the majority of the time, we sometimes see a 504 but they're not that frequent. Mostly it's this error we see in the logs (fetching the list of open PRs):

requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://codeberg.org/api/v1/repos/gentoo/gentoo/pulls?limit=100&state=open
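
For CI jobs that only occasionally hit this, a retry with exponential backoff on 504 is one possible mitigation. The following is a sketch under assumptions, not what Gentoo's CI actually does: it uses the standard library instead of `requests`, the function name `get_json_with_retry` and its parameters are hypothetical, and the attempt count and delays are arbitrary.

```python
import json
import time
import urllib.error
import urllib.request

# URL taken from the log line above.
URL = "https://codeberg.org/api/v1/repos/gentoo/gentoo/pulls?limit=100&state=open"


def get_json_with_retry(url, attempts=4, base_delay=2.0,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url` and decode JSON, retrying only on HTTP 504 with exponential backoff."""
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            # Retry gateway timeouts only; re-raise other errors and the final failure.
            if err.code != 504 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    pulls = get_json_with_retry(URL)
    print(len(pulls), "open pull requests on the first page")
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without network access or real waiting.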

@Gusted wrote in #2596 (comment):

Forking is a different issue, for large repositories there's a slow process of migrating tags which takes ages. We already raised this timeout to 10minutes but I just noticed there was a small ordering mistake where this didn't get applied correctly. @untitaker

Sorry if off topic. But would we be able to convert this statement into a feature request? I have two options for this feature request in mind.

  • Allow shallow forks: when trying to contribute to a repository, I don't usually need all of the repository's history or tags.
  • Make it easier to create Pull Requests from non-forked repositories (this allows the "fork" to be done locally by setting up the remotes).

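@laumann there's a big performance problem with that endpoint that makes it unnecessarily slow. We already had to add an exception for it for the Forgejo project (https://codeberg.org/Codeberg-Infrastructure/scripted-configuration/src/commit/7c3f0285deb31b3481b750ceac278abab9d36c70/hosts/_reverseproxy/etc/haproxy/haproxy.cfg#L389):

http-request set-timeout server 60s if { path /api/v1/repos/forgejo/forgejo/pulls } { url_param(limit) eq 100 }

I will add one for gentoo as well.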
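
To make the scope of that HAProxy rule concrete: the two conditions after `if` are ANDed, so only requests to that exact path with `limit=100` get the 60s server timeout. The sketch below approximates that match in Python for illustration only; `gets_extended_timeout` and `EXEMPT_PATH` are hypothetical names, and HAProxy's own matching semantics may differ in edge cases.

```python
from urllib.parse import parse_qs, urlsplit

# Path copied from the rule quoted above.
EXEMPT_PATH = "/api/v1/repos/forgejo/forgejo/pulls"


def gets_extended_timeout(url: str) -> bool:
    """Approximate the rule: exact path match AND first `limit` parameter equal to 100."""
    parts = urlsplit(url)
    if parts.path != EXEMPT_PATH:
        return False
    values = parse_qs(parts.query).get("limit")
    if not values:
        return False
    try:
        return int(values[0]) == 100
    except ValueError:
        return False
```

Under this approximation, the gentoo URL from the log above does not match until a corresponding rule is added for its path.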

@kaeru feature requests are better discussed and proposed in Forgejo: https://codeberg.org/forgejo/forgejo/issues/new/choose


@Gusted wrote in #2596 (comment):

I will add one for gentoo as well.

Thanks, that's very nice of you :) Let me know if there are any changes we could make to alleviate performance issues.
