
We are currently running a Java 17 app on Cloud Run and have encountered an unusual issue. While the service usually operates smoothly, a small percentage of requests (both GET and POST) fail unexpectedly.

These failed requests return either a 503 or a 504 status and often appear in pairs (as I observed today). The failed requests also share the same instanceID, yet some successful requests are served by that same instance. Meanwhile, the liveness probe keeps passing even while customer-facing requests fail. The probe checks our database and Redis connections as well as other integrations, such as file storage.
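To check whether the failures really cluster on particular instances, the request logs can be tallied per instance. The following is a minimal sketch, assuming the log entries have already been exported and parsed; `RequestLog`, `InstanceStats`, and `InstanceFailureReport` are hypothetical names for illustration, not part of any Cloud Run API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical parsed log entry: only the two fields this check needs. */
record RequestLog(String instanceId, int status) {}

/** Per-instance tally of failed (503/504) and total requests. */
record InstanceStats(long failed, long total) {
    double failureRate() {
        return total == 0 ? 0.0 : (double) failed / total;
    }
}

final class InstanceFailureReport {
    /** Groups log entries by instance ID and counts 503/504 responses. */
    static Map<String, InstanceStats> byInstance(List<RequestLog> logs) {
        Map<String, InstanceStats> stats = new TreeMap<>();
        for (RequestLog log : logs) {
            boolean failed = log.status() == 503 || log.status() == 504;
            stats.merge(
                    log.instanceId(),
                    new InstanceStats(failed ? 1 : 0, 1),
                    (a, b) -> new InstanceStats(a.failed() + b.failed(),
                                                a.total() + b.total()));
        }
        return stats;
    }
}
```

An instance with a high failure rate next to otherwise healthy instances would point at instance-local state (for example, a stuck connection pool) rather than at the platform as a whole.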

The 503s include the following text payload:

The request failed because either the HTTP response was malformed or connection to the instance had an error. Additional troubleshooting documentation can be found at: https://cloud.google.com/run/docs/troubleshooting#malformed-response-or-connection-error

Another Spring Boot app, trying to access the API via a FeignClient, is receiving a feign.FeignException$ServiceUnavailable. I'm wondering if this could be related to a load balancer issue. Perhaps the health checks are passing correctly because they bypass the load balancer, but the actual requests are being affected by it?
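Until the root cause is found, the calling service could retry transient 503/504 responses. The following is a minimal sketch in plain Java, not Feign's own retry mechanism; `TransientRetry` and `callWithRetry` are hypothetical names, and the call is modeled as an `IntSupplier` that returns an HTTP status code. Retrying is only safe for idempotent requests, so POST calls need care, and this masks the failures rather than fixing them.

```java
import java.util.Set;
import java.util.function.IntSupplier;

final class TransientRetry {
    // Status codes treated as transient in this sketch (an assumption).
    private static final Set<Integer> RETRYABLE = Set.of(503, 504);

    /**
     * Invokes the call up to maxAttempts times, doubling the delay after
     * each retryable failure, and returns the last status code observed.
     */
    static int callWithRetry(IntSupplier call, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int status = call.getAsInt();
        for (int attempt = 1;
             attempt < maxAttempts && RETRYABLE.contains(status);
             attempt++) {
            try {
                // Exponential backoff: base, 2*base, 4*base, ...
                Thread.sleep(baseDelayMillis << (attempt - 1));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = call.getAsInt();
        }
        return status;
    }
}
```

For example, `TransientRetry.callWithRetry(() -> sendRequest(), 3, 100)` would try at most three times, where `sendRequest()` stands in for whatever issues the HTTP call and returns its status.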

Our CPU and memory usage are within reasonable limits, so I don't believe the issue is due to our instances being under-provisioned. Many of the failing requests are "simple" requests that typically respond in under 100ms.

asked Jan 15, 2025 at 15:16
  • Can you share more about your Cloud Run configuration and your code, and whether you are doing something "special" (websockets, streaming, ...)? Commented Jan 15, 2025 at 16:31
  • My code involves querying a database, where some operations are trivial (usually returning in 50-100ms), while others are more complex, such as accessing Google Cloud Storage and performing calculations, which can take 5-10 seconds to complete. Here's a high-level overview of my Cloud Run configuration: 8 CPU units, 8GB of RAM, and 8 minimum instances. Commented Jan 16, 2025 at 6:58
  • I think this is more platform-related. Google support should be able to help you with this. Commented Jan 16, 2025 at 19:36
  • Unfortunately, the last time I reached out to them, they weren't very helpful. They simply referred me to their public documentation on 503 errors and, as far as I could tell, didn’t conduct any specific investigation. Commented Jan 17, 2025 at 8:30

1 Answer


If you haven't already, check the troubleshooting guide for the recommended steps to rule out an application-side failure:

  • Check Cloud Logging

  • App-level timeouts

  • Downstream network bottleneck

  • Inbound request limit to a single container
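For the last item, one way to see how many requests a single container is really handling at once is to track in-flight work inside the application and compare the peak against the configured per-instance concurrency limit. The following is a minimal sketch; `InFlightTracker` is a hypothetical helper, and wiring `begin()`/`end()` into a request filter and into any async tasks is left as an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Thread-safe counter of in-flight requests and the highest value seen. */
final class InFlightTracker {
    private final AtomicInteger current = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    /** Call when a request (or async task) starts. */
    void begin() {
        int now = current.incrementAndGet();
        // Atomically raise the peak if this is a new maximum.
        peak.accumulateAndGet(now, Math::max);
    }

    /** Call in a finally block when the request (or async task) ends. */
    void end() {
        current.decrementAndGet();
    }

    int current() {
        return current.get();
    }

    int peak() {
        return peak.get();
    }
}
```

Logging `peak()` periodically alongside the instance ID would show whether the failing instance is close to its concurrency limit when the 503s occur.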

It is also worth checking whether there is a mismatch in the location of your resources. This solution worked in another case and could be useful to you.

If none of the above resolves it, this could be a Cloud Run-specific issue that is better addressed by the Google Cloud Support team. You may reach out to them via the channels below:

answered Jan 16, 2025 at 22:45

1 Comment

Thanks, I'll take a look. However, I don't see any OOM exceptions or application-level timeouts. My customer-facing traffic is well below 800 requests per second, but it’s possible that async tasks are generating additional traffic, which might be causing issues with the container. I will get back to you if I figure it out.
