
My code was running fine with CUDA, but now that I run it with device="cpu" and torch.autograd.set_detect_anomaly(True) enabled, the following runtime error is raised:

RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

Looking closely at the call stack:

 File "<ipython-input-468-be9e157834e4>", line 83, in forward
 self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
 File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 41, in wrapped
 return f(*args, **kwargs)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:111.)
 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

This indicates that the error occurs when backpropagating through:

self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)

Firstly, I don't see why this issue appears on CPU but not on CUDA. Secondly, I don't understand why the backward pass would produce NaNs, given that the forward values are all finite:

print("grad_x: ",self.grad_x.isinf().any(), self.grad_x.isnan().any())
print("grad_y: ",self.grad_y.isinf().any(), self.grad_y.isnan().any())
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
print("grad mag ", self.grad_mag.isinf().any(), self.grad_mag.isnan().any())

which outputs:

grad_x: tensor(False) tensor(False)
grad_y: tensor(False) tensor(False)
grad mag tensor(False) tensor(False)

If it makes any difference, I'm optimizing with LBFGS.
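For reference, a minimal sketch that reproduces the error outside my model (the zero tensors below are hypothetical stand-ins for my real grad_x / grad_y, assuming they hit exactly zero somewhere):

import torch

torch.autograd.set_detect_anomaly(True)

# stand-ins for self.grad_x / self.grad_y; perfectly finite in the forward pass
grad_x = torch.zeros(4, requires_grad=True)
grad_y = torch.zeros(4, requires_grad=True)

grad_mag = torch.sqrt(grad_x**2 + grad_y**2)
print(grad_mag.isinf().any(), grad_mag.isnan().any())  # tensor(False) tensor(False)

# raises: RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
grad_mag.sum().backward()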

asked Oct 17, 2024 at 19:34
  • Can you provide a bit more context for the code? Ideally a code snippet that is as small as possible which could reproduce the error? Also check that grad_x and grad_y are nonzero, because the norm function that you are computing for grad_mag is non-differentiable at 0, so that may be the cause (see the sketch after these comments). Commented Oct 19, 2024 at 22:09
  • @trialNerror this is indeed true. I added an epsilon term and that "fixed" it: self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + 1e-8), as @oppressionslayer recommended. Do you know why this error would only pop up on CPU? I will try to provide a minimal example. Commented Oct 21, 2024 at 14:17
  • This could be another manifestation of this report, which has several linked issues. It basically comes down to CPU and GPU handling NaNs, and breaking, differently across the *torch and torch* libs. Commented Oct 22, 2024 at 9:51
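To make the non-differentiability point concrete, here is a sketch of the chain through a single zero element, which also shows why the anomaly names PowBackward0 rather than the sqrt node (the one-element tensor is a hypothetical stand-in for one pixel of grad_x):

import torch

x = torch.zeros(1, requires_grad=True)  # one pixel where grad_x == 0
u = x**2                                # PowBackward0
u.retain_grad()
m = torch.sqrt(u)                       # SqrtBackward0
m.backward()

print(u.grad)  # tensor([inf])  <- 1 / (2 * sqrt(0)) from the sqrt node
print(x.grad)  # tensor([nan])  <- 2 * x * inf = 0 * inf from the pow node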

2 Answers


Is the value under the square root hitting zero or going negative? You can fix that with:

eps = 1e-8 
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + eps)

You could also switch optimizers to see if this fixes the issue:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Or rework the forward pass completely:

import torch

# Example in the forward pass
eps = 1e-8  # small epsilon to avoid feeding sqrt a zero (or slightly negative) value due to precision
grad_magnitude = self.grad_x**2 + self.grad_y**2

# Check for any potential negative values (this shouldn't happen, but useful for debugging)
if (grad_magnitude < 0).any():
    print("Warning: negative values detected in grad_magnitude")

# Clamp the value to avoid NaN errors in the backward pass
self.grad_mag = torch.sqrt(torch.clamp(grad_magnitude, min=eps))

The issue might be floating-point precision differences between CUDA and CPU, which are causing the NaNs in the backward pass.

answered Oct 20, 2024 at 6:35

3 Comments

I agree that this will fix the issue, but I believe it to be the wrong solution. Except in some specific cases, you want to minimize the squared norm instead of the norm, because they have the same argmin and the squared norm is fully differentiable. Upvoted nonetheless, since this is a solution.
How about these changes: self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2) and remove eps? I have to get my nvidia dev machine out; I think I can fix this!
As long as there is a sqrt function in your formula you will have a non-differentiability issue at 0, because the square root itself is unfortunately not differentiable at zero.

I think oppressionslayer's solution will work, but I would recommend minimizing the squared norm instead of the norm, since they share the same argmin and the squared norm is fully differentiable:

self.grad_sqmag = self.grad_x**2 + self.grad_y**2

This is equivalent to minimizing f(x) = x**2 instead of f(x) = abs(x).

In general (and in ML in particular) there are few situations where you want to backprop through the norm rather than the squared norm.

As for why it "works" on GPU, I cannot say for sure, but I guess this is just implementation shenanigans; it probably fails on GPU too in a less obvious way, giving you wrong gradients anyway.
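A minimal sketch of that suggestion combined with LBFGS; img, the Sobel-style kernels, and the mean reduction below are hypothetical stand-ins, since we don't have the original model:

import torch
import torch.nn.functional as F

# hypothetical stand-in for whatever produces grad_x / grad_y in the real code
img = torch.rand(1, 1, 16, 16, requires_grad=True)
kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])  # Sobel-style x kernel
ky = kx.transpose(2, 3).contiguous()                                  # Sobel-style y kernel

optimizer = torch.optim.LBFGS([img], lr=0.1)

def closure():
    optimizer.zero_grad()
    grad_x = F.conv2d(img, kx, padding=1)
    grad_y = F.conv2d(img, ky, padding=1)
    loss = (grad_x**2 + grad_y**2).mean()  # squared magnitude: no sqrt, differentiable everywhere
    loss.backward()
    return loss

optimizer.step(closure)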

answered Oct 23, 2024 at 0:32

