Tutorial: Local troubleshooting of a Cloud Run service
This tutorial shows how a service developer can troubleshoot a broken
Cloud Run service using Google Cloud Observability tools for discovery and a local
development workflow for investigation.
This step-by-step "case study" companion to the
troubleshooting guide uses a sample project that
results in runtime errors when deployed, which you troubleshoot to find and fix
the problem.
Objectives
Write, build, and deploy a service to Cloud Run
Use Error Reporting and Cloud Logging to identify an error
Retrieve the container image from Container Registry for a root cause analysis
Fix the "production" service, then improve the service to mitigate future problems
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
New Google Cloud users might be eligible for a free trial.
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get 300ドル in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Roles required to select or create a project
Select a project: Selecting a project doesn't require a specific
IAM role—you can select any project that you've been
granted a role on.
Create a project: To create a project, you need the Project Creator role
(roles/resourcemanager.projectCreator), which contains the
resourcemanager.projects.create permission. Learn how to grant
roles.
To configure gcloud with defaults for your Cloud Run service:
Set your default project:
gcloud config set project PROJECT_ID
Replace PROJECT_ID with the name of the project you created for
this tutorial.
Configure gcloud for your chosen region:
gcloud config set run/region REGION
Replace REGION with the supported Cloud Run
region
of your choice.
Cloud Run locations
Cloud Run is regional, which means the infrastructure that
runs your Cloud Run services is located in a specific region and is
managed by Google to be redundantly available across
all the zones within that region.
Meeting your latency, availability, or durability requirements is a primary
factor in selecting the region where your Cloud Run services run.
You can generally select the region nearest to your users but you should consider
the location of the other Google Cloud
products that are used by your Cloud Run service.
Using Google Cloud products together across multiple locations can affect
your service's latency as well as cost.
europe-west6 (Zurich, Switzerland) Low CO2
me-central1 (Doha)
me-central2 (Dammam)
northamerica-northeast1 (Montreal) Low CO2
northamerica-northeast2 (Toronto) Low CO2
southamerica-east1 (Sao Paulo, Brazil) Low CO2
southamerica-west1 (Santiago, Chile) Low CO2
us-west2 (Los Angeles)
us-west3 (Salt Lake City)
us-west4 (Las Vegas)
If you already created a Cloud Run service, you can view the
region in the Cloud Run dashboard in the
Google Cloud console.
Assembling the code
Build a new Cloud Run greeter service step-by-step.
As a reminder, this service creates a runtime error on purpose for the
troubleshooting exercise.
Create a new project:
Node.js
Create a Node.js project by defining the service package, initial dependencies,
and some common operations.
Create a new hello-service directory:
mkdir hello-service
cd hello-service
Create a new Node.js project by generating a package.json file:
npm init --yes
npm install express@4
Open the new package.json file in your editor and configure a start
script to run node index.js. When you're done, the file will look like this:
{"name":"hello-broken","description":"Broken Cloud Run service for troubleshooting practice","version":"1.0.0","private":true,"main":"index.js","scripts":{"start":"node index.js","test":"echo \"Error: no test specified\" && exit 0","system-test":"NAME=Cloud c8 mocha -p -j 2 test/system.test.js --timeout=360000 --exit"},"engines":{"node":">=16.0.0"},"author":"Google LLC","license":"Apache-2.0","dependencies":{"express":"^4.17.1"},"devDependencies":{"c8":"^10.0.0","google-auth-library":"^9.0.0","got":"^11.5.0","mocha":"^10.0.0"}}
If you continue to evolve this service beyond the immediate tutorial, consider
filling in the description and author fields and evaluating the license. For
more details, read the package.json documentation.
Python
Create a new hello-service directory:
mkdir hello-service
cd hello-service
Create a requirements.txt file and copy your dependencies into it:
Flask==3.0.3
pytest==8.2.0; python_version > "3.0"
# pin pytest to 4.6.11 for Python 2.
pytest==4.6.11; python_version < "3.0"
gunicorn==23.0.0
Werkzeug==3.0.3
Go
Create a new hello-service directory:
mkdir hello-service
cd hello-service
Create a Go project by initializing a new go module:
go mod init example.com/hello-service
You can update the specific name as you wish: you should update the name if
the code is published to a web-reachable code repository.
Create an HTTP service to handle incoming requests:
Node.js
const express = require('express');
const app = express();

app.get('/', (req, res) => {
  console.log('hello: received request.');

  const {NAME} = process.env;
  if (!NAME) {
    // Plain error logs do not appear in Stackdriver Error Reporting.
    console.error('Environment validation failed.');
    console.error(new Error('Missing required server parameter'));
    return res.status(500).send('Internal Server Error');
  }
  res.send(`Hello ${NAME}!`);
});

const port = parseInt(process.env.PORT) || 8080;
app.listen(port, () => {
  console.log(`hello: listening on port ${port}`);
});
Python
import json
import os

from flask import Flask

app = Flask(__name__)


@app.route("/", methods=["GET"])
def index():
    """Example route for testing local troubleshooting.

    This route may raise an HTTP 5XX error due to missing environment variable.
    """
    print("hello: received request.")

    NAME = os.getenv("NAME")

    if not NAME:
        print("Environment validation failed.")
        raise Exception("Missing required service parameter.")

    return f"Hello {NAME}"


if __name__ == "__main__":
    PORT = int(os.getenv("PORT")) if os.getenv("PORT") else 8080

    # This is used when running locally. Gunicorn is used to run the
    # application on Cloud Run. See entrypoint in Dockerfile.
    app.run(host="127.0.0.1", port=PORT, debug=True)
Go
// Sample hello demonstrates a difficult to troubleshoot service.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	log.Print("hello: service started")

	http.HandleFunc("/", helloHandler)

	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
		log.Printf("Defaulting to port %s", port)
	}

	log.Printf("Listening on port %s", port)
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), nil))
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
	log.Print("hello: received request")

	name := os.Getenv("NAME")
	if name == "" {
		log.Printf("Missing required server parameter")
		// The panic stack trace appears in Cloud Error Reporting.
		panic("Missing required server parameter")
	}

	fmt.Fprintf(w, "Hello %s!\n", name)
}
Java
import static spark.Spark.get;
import static spark.Spark.port;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class App {

  private static final Logger logger = LoggerFactory.getLogger(App.class);

  public static void main(String[] args) {
    int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
    port(port);

    get(
        "/",
        (req, res) -> {
          logger.info("Hello: received request.");
          String name = System.getenv("NAME");
          if (name == null) {
            // Standard error logs do not appear in Stackdriver Error Reporting.
            System.err.println("Environment validation failed.");
            String msg = "Missing required server parameter";
            logger.error(msg, new Exception(msg));
            res.status(500);
            return "Internal Server Error";
          }
          res.status(200);
          return String.format("Hello %s!", name);
        });
  }
}
Create a Dockerfile to define the container image used to deploy the service:
# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.11

# Allow statements and log messages to immediately appear in the Cloud Run logs
ENV PYTHONUNBUFFERED True

# Copy application dependency manifests to the container image.
# Copying this separately prevents re-running pip install on every code change.
COPY requirements.txt ./

# Install production dependencies.
RUN pip install -r requirements.txt

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Run the web service on container startup.
# Use gunicorn webserver with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow
# Cloud Run to handle instance scaling.
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
This sample uses Jib to build
Docker images using common Java tools. Jib optimizes container builds without
the need for a Dockerfile or having Docker
installed. Learn more about building Java containers with Jib.
Shipping code consists of three steps: building a container image with
Cloud Build, uploading the container image to Container Registry, and
deploying the container image to Cloud Run.
To ship your code:
Build your container and publish on Container Registry:
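The build command itself does not appear above; a typical invocation for this layout, using the same image name as the rest of this tutorial, is:

```shell
# Build the container image with Cloud Build and publish it to
# Container Registry under your project.
gcloud builds submit --tag gcr.io/PROJECT_ID/hello-service
```

For the Java sample, which uses Jib rather than a Dockerfile, the equivalent step is `mvn compile jib:build` with the image name configured in the build.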
Where PROJECT_ID is your Google Cloud project ID. You can check your
current project ID with gcloud config get-value project.
Upon success, you should see a SUCCESS message containing the ID, creation
time, and image name. The image is stored in Container Registry and can be
re-used if desired.
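The deploy command is not shown above; a typical invocation, consistent with the image and service names used in this tutorial, is:

```shell
# Deploy the published container image to Cloud Run.
gcloud run deploy hello-service --image gcr.io/PROJECT_ID/hello-service
```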
Replace PROJECT_ID with your Google Cloud project ID. hello-service is
both the container image name and name of the Cloud Run service.
Notice that the container image is deployed to the service and
region that you configured previously under
Setting up gcloud
Respond y, "Yes", to the allow unauthenticated prompt. See
Managing Access for more details on
IAM-based authentication.
Wait until the deployment is complete: this can take about half a minute.
On success, the command line displays the service URL.
Trying it out
Try out the service to confirm you have successfully deployed it. Requests
should fail with an HTTP 500 or 503 error (members of the class
5xx Server errors).
The tutorial walks through troubleshooting this error response.
The service is auto-assigned a navigable URL.
Navigate to this URL with your web browser:
Open a web browser
Find the service URL output by the earlier deploy command.
If the deploy command did not provide a URL then something went wrong.
Review the error message and act accordingly: if no actionable guidance
is present, review the troubleshooting guide
and possibly retry the deployment command.
Navigate to this URL by copying it into your browser's address bar and
pressing ENTER.
View the HTTP 500 or HTTP 503 error.
If you receive an HTTP 403 error, you may have rejected
allow unauthenticated invocations at the deployment prompt.
Grant public access to the service to fix this:
gcloud run services add-iam-policy-binding hello-service \
--member="allUsers" \
--role="roles/run.invoker"
Treat the HTTP 5xx error encountered above in Trying it out
as a production runtime error. This tutorial walks through a
formal process for handling it. Although production error resolution processes
vary widely, this tutorial presents a particular sequence of steps to show the
application of useful tools and techniques.
To investigate this problem you will work through these phases:
Collect more details on the reported error to support further investigation and set a mitigation strategy.
Relieve user impact by deciding whether to push forward with a fix or roll back to a known-healthy version.
Reproduce the error to confirm the correct details have been gathered and that
the error is not a one-time glitch.
Perform a root cause analysis to find the code, configuration, or
process that created this error.
At the start of the investigation you have a URL, timestamp, and the message
"Internal Server Error".
Gathering further details
Gather more information about the problem to understand what happened and
determine next steps.
Use available Google Cloud Observability tools to collect more details:
Use the Error Reporting console, which provides a dashboard with
details and recurrence tracking for errors with a recognized
stack trace.
Screenshot of the error list including columns 'Resolution Status', 'Occurrences', 'Error', and 'Seen in'.
List of recorded errors. Errors are grouped by message across revisions, services, and platforms.
Click on the error to see the stack trace details, noting the function calls
made just prior to the error.
Screenshot of a single parsed stack trace, demonstrating a common profile of this error.
The "Stack trace sample" in the error details page shows a single instance
of the error. You can review each individual instances.
Use Cloud Logging to review the sequence of operations leading to the
problem, including error messages that are not included in the
Error Reporting console because of a lack of a recognized
error stack trace:
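One way to pull recent log entries for the service from the command line (the filter below is an assumption based on the standard Cloud Run resource labels; the Logs Explorer in the console shows the same data):

```shell
# Read recent log entries for the hello-service Cloud Run service.
gcloud logging read \
  "resource.type=cloud_run_revision AND resource.labels.service_name=hello-service" \
  --limit=20
```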
If this is an established service, known to work, there will be a previous
revision of the service on Cloud Run. This tutorial uses a new service
with no previous versions, so you cannot do a rollback.
However, if you have a service with previous versions you can roll back to,
follow Viewing revision details
to extract the container name and configuration details necessary to create a
new working deployment of your service.
Reproducing the error
Using the details you obtained previously, confirm the
problem consistently occurs under test conditions.
Send the same HTTP request by trying it out again, and see if
the same error and details are reported. It may take some time for error details
to show up.
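A quick way to re-send the request from a terminal, assuming SERVICE_URL holds the URL printed by the deploy command:

```shell
# Request the service and print the HTTP status code alongside the body.
curl -sS -w "\nHTTP status: %{http_code}\n" "$SERVICE_URL"
```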
Because the sample service in this tutorial is read-only and doesn't trigger any
complicating side effects, reproducing errors in production is safe. However,
for many real services, this won't be the case: you may need to reproduce errors
in a test environment or limit this step to local investigation.
Reproducing the error establishes the context for further work. For example,
if developers cannot reproduce the error, further investigation may require
additional instrumentation of the service.
Performing a root cause analysis
Root cause analysis is an important step in
effective troubleshooting
to ensure you fix the problem instead of a symptom.
Previously in this tutorial, you reproduced the problem on Cloud Run
which confirms the problem is active when the service is hosted on
Cloud Run. Now reproduce the problem locally to determine if the
problem is isolated to the code or if it only emerges in production hosting.
If you have not used Docker CLI locally with Container Registry, authenticate
it with gcloud:
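The authentication command itself is not shown above; the standard way to let the Docker CLI use your gcloud credentials for gcr.io registries is:

```shell
# Register gcloud as a Docker credential helper for gcr.io.
gcloud auth configure-docker
```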
If the most recently used container image name is not available, the service
description has the information of the most recently deployed container image:
gcloud run services describe hello-service
Find the container image name inside the spec object. A more targeted
command can directly retrieve it:
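The targeted command is not shown above; a plausible form, assuming the standard Knative-style spec layout that `gcloud run services describe` returns, is:

```shell
# Print only the container image of the service's current template.
gcloud run services describe hello-service \
  --format="value(spec.template.spec.containers[0].image)"
```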
This command reveals a container image name such as gcr.io/PROJECT_ID/hello-service.
Pull the container image from Container Registry to your environment. This
step might take several minutes as it downloads the container image:
docker pull gcr.io/PROJECT_ID/hello-service
Later updates to the container image that reuse this name can be retrieved
with the same command. If you skip this step, the docker run command below
pulls a container image if one is not present on the local machine.
Run locally to confirm the problem is not unique to Cloud Run:
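The run command itself is not shown above; a plausible form, matching the flag descriptions that follow (container listening on 8080, published locally on port 9000), is:

```shell
# Run the pulled image locally, mapping localhost:9000 to the container's 8080.
PORT=8080 && docker run --rm -e PORT=$PORT -p 9000:8080 \
  gcr.io/PROJECT_ID/hello-service
```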
The PORT environment variable is used by the service to determine the
port to listen on inside the container.
The run command starts the container, defaulting to the entrypoint
command defined in the Dockerfile or a parent container image.
The --rm flag deletes the container instance on exit.
The -e flag assigns a value to an environment variable. -e PORT=$PORT
is propagating the PORT variable from the local system into the container
with the same variable name.
The -p flag publishes the container as a service available on
localhost at port 9000. Requests to localhost:9000 will be routed to the
container on port 8080. This means output from the service about the port
number in use will not match how the service is accessed.
The final argument gcr.io/PROJECT_ID/hello-service
is a container image tag, a human-readable label for a container image's
sha256 hash identifier. If not available locally, docker attempts to
retrieve the image from a remote registry.
In your browser, open http://localhost:9000. Check the terminal output for
error messages that match those in Cloud Logging.
If the problem is not reproducible locally, it may be unique to the
Cloud Run environment. Review the
Cloud Run troubleshooting guide
for specific areas to investigate.
In this case the error is reproduced locally.
Now that the error is doubly-confirmed as persistent and caused by the service
code instead of the hosting platform, it's time to investigate the code more closely.
For purposes of this tutorial it is safe to assume the code inside the container
and the code in the local system is identical.
Revisit the error report's stack trace and cross-reference with the code to find
the specific lines at fault.
Node.js
Find the source of the error message in the file index.js around the line
number called out in the stack trace shown in the logs:
const {NAME} = process.env;
if (!NAME) {
  // Plain error logs do not appear in Stackdriver Error Reporting.
  console.error('Environment validation failed.');
  console.error(new Error('Missing required server parameter'));
  return res.status(500).send('Internal Server Error');
}
Python
Find the source of the error message in the file main.py around the line
number called out in the stack trace shown in the logs:
NAME=os.getenv("NAME")ifnotNAME:print("Environment validation failed.")raiseException("Missing required service parameter.")
Go
Find the source of the error message in the file main.go around the line
number called out in the stack trace shown in the logs:
name:=os.Getenv("NAME")ifname==""{log.Printf("Missing required server parameter")// The panic stack trace appears in Cloud Error Reporting.panic("Missing required server parameter")}
Java
Find the source of the error message in the file App.java around the line number called out in the stack trace shown in the logs:
String name = System.getenv("NAME");
if (name == null) {
  // Standard error logs do not appear in Stackdriver Error Reporting.
  System.err.println("Environment validation failed.");
  String msg = "Missing required server parameter";
  logger.error(msg, new Exception(msg));
  res.status(500);
  return "Internal Server Error";
}
Examining this code, the following actions are taken when the NAME environment
variable is not set:
An error is logged to Google Cloud Observability
An HTTP error response is sent
The problem is caused by a missing variable, but the root cause is more specific:
the code change adding the hard dependency on an environment variable did not
include related changes to deployment scripts and runtime requirements documentation.
Fixing the root cause
Now that we have collected the code and identified the potential root cause,
we can take steps to fix it.
Check whether the service works locally with the NAME environment available
in place:
Run the container locally with the environment variable added:
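A sketch of these steps, assuming the same image name as before and using the test value Override (which matches the expected "Hello Override!" output later in this tutorial):

```shell
# Local check: run the container with the NAME variable supplied.
docker run --rm -e NAME="Override" -e PORT=8080 -p 9000:8080 \
  gcr.io/PROJECT_ID/hello-service

# Then update the deployed service with the same variable; this creates
# a new revision based on the previous one.
gcloud run services update hello-service --update-env-vars NAME=Override
```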
Wait a few seconds while Cloud Run creates a new revision based on the
previous revision with the new environment variable added.
Confirm the service is now fixed:
Navigate your browser to the Cloud Run service URL.
See "Hello Override!" appear on the page.
Verify that no unexpected messages or errors appear in Cloud Logging or
Error Reporting.
Improving future troubleshooting speed
In this sample production problem, the error was related to operational
configuration. There are code changes that will minimize the impact of this
problem in the future.
Improve the error log to include more specific details.
Instead of returning an error, have the service fall back to a safe default.
If using a default represents a change to normal functionality, use a warning
message for monitoring purposes.
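The fallback pattern is language-agnostic; as a minimal shell sketch, it amounts to defaulting the variable before use and warning so the fallback stays visible in monitoring:

```shell
# Default NAME to "World" when it is unset or empty, and emit a warning
# so the fallback is visible in logs.
if [ -z "${NAME}" ]; then
  echo "warning: NAME not set, default to World" >&2
  NAME="World"
fi
echo "Hello ${NAME}!"
```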
Let's step through removing the NAME environment variable as a hard dependency.
Remove the existing NAME-handling code:
Node.js
const {NAME} = process.env;
if (!NAME) {
  // Plain error logs do not appear in Stackdriver Error Reporting.
  console.error('Environment validation failed.');
  console.error(new Error('Missing required server parameter'));
  return res.status(500).send('Internal Server Error');
}
Python
NAME = os.getenv("NAME")

if not NAME:
    print("Environment validation failed.")
    raise Exception("Missing required service parameter.")
Go
name := os.Getenv("NAME")
if name == "" {
	log.Printf("Missing required server parameter")
	// The panic stack trace appears in Cloud Error Reporting.
	panic("Missing required server parameter")
}
Java
String name = System.getenv("NAME");
if (name == null) {
  // Standard error logs do not appear in Stackdriver Error Reporting.
  System.err.println("Environment validation failed.");
  String msg = "Missing required server parameter";
  logger.error(msg, new Exception(msg));
  res.status(500);
  return "Internal Server Error";
}
Add new code that sets a fallback value:
Node.js
const NAME = process.env.NAME || 'World';
if (!process.env.NAME) {
  console.log(JSON.stringify({
    severity: 'WARNING',
    message: `NAME not set, default to '${NAME}'`,
  }));
}
Python
NAME = os.getenv("NAME")

if not NAME:
    NAME = "World"
    error_message = {
        "severity": "WARNING",
        "message": f"NAME not set, default to {NAME}",
    }
    print(json.dumps(error_message))
Go
name := os.Getenv("NAME")
if name == "" {
	name = "World"
	log.Printf("warning: NAME not set, default to %s", name)
}
Java
String name = System.getenv().getOrDefault("NAME", "World");
if (System.getenv("NAME") == null) {
  logger.warn(String.format("NAME not set, default to %s", name));
}
Test locally by re-building and running the container through the affected
configuration cases:
Node.js
docker build --tag gcr.io/PROJECT_ID/hello-service .
Python
docker build --tag gcr.io/PROJECT_ID/hello-service .
Go
docker build --tag gcr.io/PROJECT_ID/hello-service .
Java
mvn compile jib:build
Confirm the NAME environment variable still works:
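One way to exercise both configuration cases against the rebuilt local image (the value Robust is an arbitrary test string, not from this tutorial):

```shell
# Case 1: NAME set; a request to localhost:9000 should return "Hello Robust!".
docker run --rm -e NAME="Robust" -e PORT=8080 -p 9000:8080 \
  gcr.io/PROJECT_ID/hello-service

# Case 2: NAME unset; a request should return "Hello World!" and the
# container logs should show the WARNING entry instead of an error.
docker run --rm -e PORT=8080 -p 9000:8080 \
  gcr.io/PROJECT_ID/hello-service
```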
If the service does not return a result, confirm the removal of code in the
first step did not remove extra lines, such as those used to write the response.
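To ship the fix, rebuild the image and deploy a new revision. A typical sequence, reusing the same names as earlier in this tutorial, is:

```shell
# Rebuild with Cloud Build and redeploy the fixed service.
gcloud builds submit --tag gcr.io/PROJECT_ID/hello-service
gcloud run deploy hello-service --image gcr.io/PROJECT_ID/hello-service
```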
Each deployment to a service creates a new revision and automatically starts
serving traffic when ready.
To clear the environment variables set earlier:
gcloud run services update hello-service --clear-env-vars
Add the new functionality for the default value to automated test coverage for
the service.
Finding other issues in the logs
You may see other issues in the Log Viewer for this service. For example, an
unsupported system call will appear in the logs as a "Container Sandbox Limitation".
For example, Node.js services sometimes produce this log message:
Container Sandbox Limitation: Unsupported syscall statx(0xffffff9c,0x3e1ba8e86d88,0x0,0xfff,0x3e1ba8e86970,0x3e1ba8e86a90). Please, refer to https://gvisor.dev/c/linux/amd64/statx for more information.
In this case, the lack of support does not impact the hello-service sample service.
To avoid additional charges to your Google Cloud account, delete all the resources
you deployed with this tutorial.
Delete the project
If you created a new project for this tutorial, delete the project.
If you used an existing project and need to keep it without the changes you added
in this tutorial, delete resources that you created for the tutorial.
The easiest way to eliminate billing is to delete the project that you
created for the tutorial.
To delete the project:
In the Google Cloud console, go to the Manage resources page.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025年11月24日 UTC."],[],[]]