Random 503 / 504 Errors in Google Cloud Run (Java)

Question 1

We are currently running a Java 17 app on Cloud Run and have encountered an unusual issue. While the service usually operates smoothly, a small percentage of requests (both GET and POST) fail unexpectedly.

These failed requests return either a 503 or 504 status, often appearing in pairs (which I observed today). Additionally, the failed requests share the same instanceID, and oddly, some successful requests are also associated with this instance. Meanwhile, the liveness probe is functioning correctly without any issues, despite customer-facing requests failing. The liveness probe checks our database, Redis connections, and other integrations, such as file storage connections.

The 503s include the following text payload:

The request failed because either the HTTP response was malformed or connection to the instance had an error. Additional troubleshooting documentation can be found at: https://cloud.google.com/run/docs/troubleshooting#malformed-response-or-connection-error

Another Spring Boot app, trying to access the API via a FeignClient, is receiving a feign.FeignException$ServiceUnavailable. I'm wondering if this could be related to a load balancer issue. Perhaps the health checks are passing correctly because they bypass the load balancer, but the actual requests are being affected by it?

Our CPU and memory usage are within reasonable limits, so I don't believe the issue is due to our instances being under-provisioned. Many of the failing requests are "simple" requests that typically respond in under 100ms.

Question 2

Can you share more about your Cloud Run configuration and your code, and whether you are doing anything "special" (WebSockets, streaming, ...)?

Question 3

My code involves querying a database, where some operations are trivial (usually returning in 50-100ms), while others are more complex, such as accessing Google Cloud Storage and performing calculations, which can take 5-10 seconds to complete. Here's a high-level overview of my Cloud Run configuration: 8 CPU units, 8GB of RAM, and 8 minimum instances.

Question 4

I think this is more platform-related. Google Cloud Support should be able to help you with it.

Question 5

Unfortunately, the last time I reached out to them, they weren't very helpful. They simply referred me to their public documentation on 503 errors and, as far as I could tell, didn’t conduct any specific investigation.

J_Dubu · Accepted Answer · 2025-01-16 22:45:07Z

If you haven’t already, please check the troubleshooting guide for the recommended steps to rule out an application-side failure:

Check Cloud Logging
App-level timeouts
Downstream network bottleneck
Inbound request limit to a single container
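To check the last item (the inbound request limit to a single container), one option is to count how many requests each instance is handling at once and log the peak. The following is only a minimal sketch: `InFlightTracker` and its methods are hypothetical names, not part of Cloud Run or Spring, and you would still need to call `track` from your own filter or interceptor.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: counts requests currently being handled by this instance. */
final class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    /** Runs the handler while counting it as in flight; the count drops even if it throws. */
    <T> T track(Supplier<T> handler) {
        int now = inFlight.incrementAndGet();
        // Remember the highest concurrency seen so far.
        peak.accumulateAndGet(now, Math::max);
        try {
            return handler.get();
        } finally {
            inFlight.decrementAndGet();
        }
    }

    /** Requests being handled right now. */
    int current() {
        return inFlight.get();
    }

    /** Highest number of simultaneous requests observed since startup. */
    int peak() {
        return peak.get();
    }
}
```

If the logged peak on the failing instance gets close to the service's configured maximum concurrency, that points toward the container limit rather than a platform fault.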

Another thing to consider is investigating if there’s a mismatch in the location of your resources. This solution works here and could be useful to you (hopefully).

If the above options still don’t resolve it, this could be a Cloud Run-specific issue that is better addressed by the Google Cloud Support team. You can reach them via the channels below:

Premium support - paid support option
Cloud Run Public Issue Tracker - full list of open tickets

Thanks, I'll take a look. However, I don't see any OOM exceptions or application-level timeouts. My customer-facing traffic is well below 800 requests per second, but it’s possible that async tasks are generating additional traffic, which might be causing issues with the container. I will get back to you if I figure it out.
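A rough way to judge whether extra async traffic could matter is Little's law: average requests in flight equals arrival rate times average latency. The sketch below uses the figures from this thread (8 instances, roughly 100 ms for simple requests, up to 10 s for complex ones). The per-instance concurrency limit of 80 and the 100 requests per second of slow traffic are assumptions for illustration, so substitute your service's actual settings and measurements. `ConcurrencyEstimate` is a hypothetical name.

```java
/** Back-of-the-envelope concurrency estimate using Little's law. */
final class ConcurrencyEstimate {

    /** Average requests in flight = arrival rate (req/s) x average latency (s). */
    static double inFlight(double requestsPerSecond, double avgLatencySeconds) {
        return requestsPerSecond * avgLatencySeconds;
    }

    /** Concurrent requests the given instances can hold before queueing or scaling. */
    static int capacity(int instances, int maxConcurrencyPerInstance) {
        return instances * maxConcurrencyPerInstance;
    }

    public static void main(String[] args) {
        // 8 instances is from the question; 80 per instance is an assumed setting.
        int capacity = capacity(8, 80);

        // Fast requests: 800 req/s at about 0.1 s each.
        double fast = inFlight(800, 0.1);

        // Slow requests: an assumed 100 req/s at about 10 s each.
        double slow = inFlight(100, 10.0);

        System.out.printf("capacity=%d fast=%.0f slow=%.0f%n", capacity, fast, slow);
        System.out.println("slow traffic exceeds capacity: " + (slow > capacity));
    }
}
```

Under these assumptions, the fast traffic alone stays well under the capacity of the 8 minimum instances, while a much smaller volume of multi-second requests exceeds it. That is why long-running async work is worth measuring separately from customer-facing requests per second.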
