Troubleshooting PyTorch - TPU

This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.

Troubleshooting slow training performance

If your model trains slowly, generate and review a metrics report.
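
For example, a minimal sketch of printing the report at the end of a run, assuming the torch_xla debug metrics module is available in your environment (the device setup and step loop are placeholders):

```python
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()  # the TPU core this process drives

# ... run a few training steps on `device` ...

# Print the counters and timers PyTorch/XLA accumulated. Repeated
# CompileTime entries or aten::* counters usually point at graph
# recompilation or operators falling back to the CPU.
print(met.metrics_report())
```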

To have PyTorch/XLA automatically analyze the metrics report and print a summary, run your workload with the PT_XLA_DEBUG=1 environment variable set.
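
A sketch of enabling this from Python, assuming the variable needs to be visible before torch_xla initializes (exporting PT_XLA_DEBUG=1 in the launching shell is equivalent):

```python
import os

# Set the flag before torch_xla is imported so the debug analysis is active.
os.environ["PT_XLA_DEBUG"] = "1"

import torch_xla  # noqa: E402  (imported after the variable is set)
```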

For more information about issues that might cause your model to train slowly, see Known performance caveats.

Performance profiling

To profile your workload in depth and find bottlenecks, use the PyTorch/XLA performance profiling tools.
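
For example, a minimal sketch of capturing an on-demand profile with torch_xla's debug profiler; the port number, span names, and step contents are illustrative:

```python
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

# Start the profiler server inside the training process; a capture tool
# (for example, TensorBoard's profile plugin) can then connect to this
# port while the job is running and record a trace.
server = xp.start_server(9012)

def train_step(model, loss_fn, optimizer, data, target):
    # Named spans make the forward and backward phases easy to locate
    # in the captured trace.
    with xp.Trace("forward"):
        output = model(data)
        loss = loss_fn(output, target)
    with xp.Trace("backward"):
        loss.backward()
    xm.optimizer_step(optimizer)  # optimizer step plus mark_step on the XLA device
    return loss
```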

More debugging tools

You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
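
As a sketch, some commonly used variables can be set before torch_xla is imported; which variables are honored depends on your torch_xla version, and the file paths below are placeholders:

```python
import os

# These are read when torch_xla initializes, so set them before the
# import (or export them in the launching shell).
os.environ["XLA_IR_DEBUG"] = "1"             # record Python frame info in the lazy IR
os.environ["XLA_HLO_DEBUG"] = "1"            # carry that info into the HLO metadata
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"   # dump traced graphs
os.environ["XLA_METRICS_FILE"] = "/tmp/xla_metrics.txt"       # dump metrics on exit

import torch_xla  # noqa: E402
```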

If you encounter an unexpected bug and need help, file a GitHub issue.

Managing XLA tensors

XLA Tensor Quirks describes what you should and shouldn't do when working with XLA tensors and shared weights.
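
As an illustration of one such quirk, a minimal sketch of checkpointing a model that lives on an XLA device by routing its state dict through the CPU (the layer shape and file name are placeholders):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(128, 10).to(device)

# ... train the model ...

# Move XLA tensors to the CPU before saving them; saving device tensors
# directly ties the checkpoint to the device configuration it was saved
# from. (xm.save performs this CPU move for you.)
cpu_state_dict = {name: t.cpu() for name, t in model.state_dict().items()}
torch.save(cpu_state_dict, "checkpoint.pt")

# To restore, load on the CPU and then move the module to the XLA device.
restored = torch.nn.Linear(128, 10)
restored.load_state_dict(torch.load("checkpoint.pt"))
restored = restored.to(device)
```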
