
My code was running fine with CUDA, but now that I run it with device="cpu" and torch.autograd.set_detect_anomaly(True) enabled, the following runtime error is raised:

RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

Looking closely at the call stack:

 File "<ipython-input-468-be9e157834e4>", line 83, in forward
 self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
 File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 41, in wrapped
 return f(*args, **kwargs)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:111.)
 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

This indicates that the error occurs when backpropagating through:

self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)

Firstly, I don't see why this issue appears on CPU but not on CUDA. Secondly, I don't understand why the backward pass would produce NaNs, given that the forward values are all finite:

print("grad_x: ",self.grad_x.isinf().any(), self.grad_x.isnan().any())
print("grad_y: ",self.grad_y.isinf().any(), self.grad_y.isnan().any())
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2)
print("grad mag ", self.grad_mag.isinf().any(), self.grad_mag.isnan().any())

which outputs:

grad_x: tensor(False) tensor(False)
grad_y: tensor(False) tensor(False)
grad mag tensor(False) tensor(False)

If it makes any difference, I'm optimizing with LBFGS.
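For reference, a minimal sketch that reproduces the error outside my model (the zero tensors below are hypothetical stand-ins for my real grad_x / grad_y, assuming they hit exactly zero somewhere):

import torch

torch.autograd.set_detect_anomaly(True)

# stand-ins for self.grad_x / self.grad_y; perfectly finite in the forward pass
grad_x = torch.zeros(4, requires_grad=True)
grad_y = torch.zeros(4, requires_grad=True)

grad_mag = torch.sqrt(grad_x**2 + grad_y**2)
print(grad_mag.isinf().any(), grad_mag.isnan().any())  # tensor(False) tensor(False)

# raises: RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
grad_mag.sum().backward()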

asked Oct 17, 2024 at 19:34
  • Can you provide a bit more context for the code? Ideally a code snippet that is as small as possible which could reproduce the error? Also check that grad_x and grad_y are nonzero, because the norm function that you are computing for grad_mag is non-differentiable at 0, so that may be the cause (see the sketch after these comments). Commented Oct 19, 2024 at 22:09
  • @trialNerror this is indeed true. I added an epsilon term and that "fixed" it: self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + 1e-8), as @oppressionslayer recommended. Do you know why this error would only pop up on CPU? I will try to provide a minimal example. Commented Oct 21, 2024 at 14:17
  • This could be another manifestation of this report, which has several linked issues. It basically comes down to CPU and GPU handling NaNs, and breaking, differently across the *torch and torch* libs. Commented Oct 22, 2024 at 9:51
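To make the non-differentiability point concrete, here is a sketch of the chain through a single zero element, which also shows why the anomaly names PowBackward0 rather than the sqrt node (the one-element tensor is a hypothetical stand-in for one pixel of grad_x):

import torch

x = torch.zeros(1, requires_grad=True)  # one pixel where grad_x == 0
u = x**2                                # PowBackward0
u.retain_grad()
m = torch.sqrt(u)                       # SqrtBackward0
m.backward()

print(u.grad)  # tensor([inf])  <- 1 / (2 * sqrt(0)) from the sqrt node
print(x.grad)  # tensor([nan])  <- 2 * x * inf = 0 * inf from the pow node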

2 Answers


Is the value under the square root hitting zero or going negative? You can fix that with:

eps = 1e-8 
self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2 + eps)

You could also switch optimizers to see if this fixes the issue:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Or rework the forward pass completely:

import torch

# Example in the forward pass
eps = 1e-8  # small epsilon to avoid feeding sqrt a zero (or slightly negative) value due to precision
grad_magnitude = self.grad_x**2 + self.grad_y**2

# Check for any potential negative values (this shouldn't happen, but useful for debugging)
if (grad_magnitude < 0).any():
    print("Warning: negative values detected in grad_magnitude")

# Clamp the value to avoid NaN errors in the backward pass
self.grad_mag = torch.sqrt(torch.clamp(grad_magnitude, min=eps))

The issue might be floating-point precision differences between CUDA and CPU, which are causing the NaNs in the backward pass.

answered Oct 20, 2024 at 6:35

3 Comments

I agree that this will fix the issue, but I believe it to be the wrong solution. Except in some specific cases, you want to minimize the squared norm instead of the norm, because they have the same argmin and the squared norm is fully differentiable. Upvoted nonetheless, since this is a solution.
How about these changes: self.grad_mag = torch.sqrt(self.grad_x**2 + self.grad_y**2) and remove eps? I have to get my nvidia dev machine out; I think I can fix this!
As long as there is a sqrt function in your formula you will have a non-differentiability issue at 0, because the square root itself is unfortunately not differentiable at zero.

I think oppressionslayer's solution will work, but I would recommend minimizing the squared norm instead of the norm, since they share the same argmin and the squared norm is fully differentiable:

self.grad_sqmag = self.grad_x**2 + self.grad_y**2

This is equivalent to minimizing f(x) = x**2 instead of f(x) = abs(x).

In general (and in ML in particular) there are few situations where you want to backprop through the norm rather than the squared norm.

As for why it "works" on GPU, I cannot say for sure, but I guess this is just implementation shenanigans; it probably fails on GPU too in a less obvious way, giving you wrong gradients anyway.
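A minimal sketch of that suggestion combined with LBFGS; img, the Sobel-style kernels, and the mean reduction below are hypothetical stand-ins, since we don't have the original model:

import torch
import torch.nn.functional as F

# hypothetical stand-in for whatever produces grad_x / grad_y in the real code
img = torch.rand(1, 1, 16, 16, requires_grad=True)
kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])  # Sobel-style x kernel
ky = kx.transpose(2, 3).contiguous()                                  # Sobel-style y kernel

optimizer = torch.optim.LBFGS([img], lr=0.1)

def closure():
    optimizer.zero_grad()
    grad_x = F.conv2d(img, kx, padding=1)
    grad_y = F.conv2d(img, ky, padding=1)
    loss = (grad_x**2 + grad_y**2).mean()  # squared magnitude: no sqrt, differentiable everywhere
    loss.backward()
    return loss

optimizer.step(closure)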

answered Oct 23, 2024 at 0:32

