My code was running fine with CUDA, but now that I run it with device="cpu" and the flag torch.autograd.set_detect_anomaly(True), the following runtime error is raised:
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
Looking closely at the call stack:
File "<ipython-input-468-be9e157834e4>", line 83, in forward
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 41, in wrapped
return f(*args, **kwargs)
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:111.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
This indicates that the error occurs when backpropagating through:
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
Firstly, I don't see why this issue appears on CPU but not on CUDA. Secondly, I don't understand why the backward pass would produce NaNs, since I checked the values in the forward pass:
print("grad_x: ",self.grad_x.isinf().any(), self.grad_x.isnan().any())
print("grad_y: ",self.grad_y.isinf().any(), self.grad_y.isnan().any())
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
print("grad mag ", self.grad_mag.isinf().any(), self.grad_mag.isnan().any())
which outputs:
grad_x: tensor(False) tensor(False)
grad_y: tensor(False) tensor(False)
grad mag tensor(False) tensor(False)
If it makes any difference, I'm optimizing with LBFGS.
2 Answers
Is the value inside the sqrt going negative (or hitting exactly zero)? You can fix that with:
eps = 1e-8
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + eps)
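To see why the forward values can look clean while the backward still blows up, here is a minimal sketch (a toy reproduction, not the asker's actual model). The derivative of sqrt is infinite at 0, and the chain rule through the squares then multiplies that infinity by zero, giving the NaN that PowBackward0 reports; the epsilon keeps the argument of the sqrt away from zero:

import torch

# Toy reproduction: wherever grad_x and grad_y are both exactly zero,
# sqrt has an infinite derivative, and backprop through the squares
# gives 0 * inf = NaN.
gx = torch.zeros(3, requires_grad=True)
gy = torch.zeros(3, requires_grad=True)
torch.sqrt(gx**2 + gy**2).sum().backward()
print(gx.grad)  # tensor([nan, nan, nan])

# With the epsilon, the argument of sqrt never reaches zero and the
# gradient stays finite:
gx2 = torch.zeros(3, requires_grad=True)
gy2 = torch.zeros(3, requires_grad=True)
torch.sqrt(gx2**2 + gy2**2 + 1e-8).sum().backward()
print(gx2.grad)  # tensor([0., 0., 0.])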
You could also switch optimizers to see if this fixes the issue:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Or, more completely:
import torch
# Example in the forward pass
eps = 1e-8 # Small epsilon to avoid negative values in sqrt due to precision
grad_magnitude = self.grad_x**2 + self.grad_y**2
# Check for any potential negative values (this shouldn't happen, but for debugging)
if (grad_magnitude < 0).any():
    print("Warning: Negative values detected in grad_magnitude")
# Clamp the value to avoid NaN errors
self.grad_mag = torch.sqrt(torch.clamp(grad_magnitude, min=eps))
The issue might be floating-point precision differences between CUDA and CPU, which are causing the NaNs in the backward pass.
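If you want to check whether the CUDA run is quietly producing bad gradients too, one option is to hook the intermediate tensors: they are not leaves, so .grad is not populated for them, but Tensor.register_hook lets you inspect the gradient flowing into them on either device. A small self-contained sketch with toy tensors (the names only mirror the question's attributes):

import torch

# Print whether the gradient arriving at a tensor during backward
# contains NaN or Inf values.
def report(name):
    def hook(grad):
        print(name, "grad nan:", grad.isnan().any().item(),
              "grad inf:", grad.isinf().any().item())
    return hook

x = torch.zeros(3, requires_grad=True)
grad_x = x * 1.0                        # stand-in for self.grad_x (an intermediate tensor)
grad_x.register_hook(report("grad_x"))
torch.sqrt(grad_x**2).sum().backward()  # prints: grad_x grad nan: True grad inf: False

In the real model you would call self.grad_x.register_hook(report("grad_x")) (and likewise for grad_y) inside forward, then run the same step on CPU and on CUDA and compare.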
3 Comments
With the sqrt function in your formula you will have a non-differentiability issue at 0, because the square root itself is unfortunately not differentiable at zero. I think oppressionslayer's solution will work, but I would recommend minimizing the squared norm instead of the norm, since they both converge toward the same argmin and the squared norm is fully differentiable:
self.grad_sqmag = self.grad_x**2 + self.grad_y**2
This is equivalent to minimizing f(x) = x**2 instead of f(x) = abs(x).
In general (and in ML in particular) there are few situations where you want to backprop through the norm rather than the squared norm.
As for why it "works" on GPU, I cannot say for sure but I guess this is just implementation shenanigans and it probably just fails on GPU too in a less obvious way, giving you wrong gradients anyway.
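To make the squared-norm suggestion concrete with the LBFGS optimizer mentioned in the question, here is a toy sketch; the parameters and loss are hypothetical stand-ins, not the asker's model:

import torch

# Hypothetical toy parameters standing in for whatever produces grad_x and grad_y.
params = torch.randn(2, requires_grad=True)
optimizer = torch.optim.LBFGS([params], lr=0.1)

def closure():
    optimizer.zero_grad()
    grad_x, grad_y = params[0], params[1]
    # Squared magnitude: same argmin as the magnitude, but differentiable at 0.
    loss = grad_x**2 + grad_y**2
    loss.backward()
    return loss

for _ in range(20):
    optimizer.step(closure)

print(params)  # driven toward zero with no NaN gradients along the way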
I used self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + 1e-8) as @oppressionslayer recommended. Do you know why this error would only pop up on CPU? I will try to provide a minimal example.