Questions tagged [gradient-descent]
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
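As a concrete illustration of that update rule, here is a minimal sketch of the iteration x ← x − step · ∇f(x); the constant step size, the iteration count, and the quadratic example are illustrative choices rather than part of the tag definition.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, n_steps=100):
    """Minimise a function by repeatedly stepping against its gradient.

    grad : callable returning the gradient at a point
    x0   : starting point
    step : constant step size (an illustrative choice; in practice it is tuned)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad(x)  # step proportional to the negative gradient
    return x

# Example: minimise f(x) = ||x||^2, whose gradient is 2x; the iterates approach the origin.
print(gradient_descent(lambda x: 2 * x, [3.0, -4.0]))
```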
997 questions
2 votes · 0 answers · 21 views
What causes the degradation problem - the higher training error in much deeper networks?
In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that
"When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0 votes · 0 answers · 41 views
Why does LightGBM use the factor (1-a)/b in GOSS?
LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked.
In the algorithm, ...
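For context, GOSS in the LightGBM paper keeps the top a·100% of instances by gradient magnitude, samples b·100% of the remaining instances, and multiplies the sampled ones by (1 − a)/b when computing gain. A rough simulation (gradient magnitudes drawn at random here, not taken from LightGBM) of why that factor keeps the small-gradient contribution approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = np.abs(rng.normal(size=10_000))      # simulated per-instance gradient magnitudes
a, b = 0.2, 0.1                              # top fraction kept, fraction sampled from the rest

order = np.argsort(-grads)
top = order[: int(a * len(grads))]           # all large-gradient instances are kept
rest = order[int(a * len(grads)):]
sampled = rng.choice(rest, size=int(b * len(grads)), replace=False)

# The sample covers b / (1 - a) of the small-gradient set, so scaling it by
# (1 - a) / b recovers (in expectation) the full small-gradient contribution.
approx = grads[top].sum() + (1 - a) / b * grads[sampled].sum()
print(grads.sum(), approx)                   # the two totals should be close
```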
2 votes · 1 answer · 59 views
Running SGD multiple times and picking the best result: keywords / name for this practice?
When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
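A minimal sketch of the practice being asked about, i.e. keeping the best of several independent stochastic runs; `train_once` is a hypothetical interface standing in for whatever fitting routine is restarted, not a reference to any library:

```python
import numpy as np

def best_of_restarts(train_once, n_restarts=5, seed=0):
    """Run a stochastic training routine several times and keep the run
    with the lowest training loss.

    train_once : callable taking a random generator and returning (model, training_loss)
    """
    rng = np.random.default_rng(seed)
    best_model, best_loss = None, np.inf
    for _ in range(n_restarts):
        model, loss = train_once(np.random.default_rng(rng.integers(2**32)))
        if loss < best_loss:
            best_model, best_loss = model, loss
    return best_model, best_loss
```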
4 votes · 1 answer · 89 views
Stochastic Gradient Descent for Multilayer Networks
I was going through the algorithm for stochastic gradient descent in multilayer networks from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight update rule. However, I don't ...
10 votes · 3 answers · 2k views
Is Backpropagation faulty?
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so ∂z/∂w is no longer correct, as z has changed to z* and z* $\...
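For reference on the premise: a plain gradient-descent step computes every layer's gradient at the current weights and only then applies all the updates, so no gradient is evaluated against already-modified activations. A small numpy sketch of that ordering (the two-layer tanh network and squared loss are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of inputs
y = rng.normal(size=(4, 1))      # targets
W1 = rng.normal(size=(3, 5))     # layer-1 weights
W2 = rng.normal(size=(5, 1))     # layer-2 weights
lr = 0.01

# Forward pass at the *current* weights.
a1 = np.tanh(x @ W1)             # layer-1 activation a^(1)
err = a1 @ W2 - y                # gradient of 0.5 * squared error w.r.t. the output

# Backward pass: both gradients are taken at the same (old) weights.
gW2 = a1.T @ err
gW1 = x.T @ ((err @ W2.T) * (1 - a1 ** 2))

# Only after all gradients are computed are the weights updated, together.
W1 -= lr * gW1
W2 -= lr * gW2
```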
1 vote · 0 answers · 85 views
cost function behaves erratically [closed]
Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
2 votes · 0 answers · 55 views
Estimator bias implies Gradient Bias
Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo
Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
5 votes · 1 answer · 116 views
For a linear problem $Ax=b,ドル is gradient descent a lot faster than least squares (any approach)?
Context
There are many methods to solve least squares, but most of them involve $k n^3$ flops.
Using gradient descent, one runs $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
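A minimal sketch of that iteration applied to the least-squares objective ½‖Ax − b‖², whose gradient is Aᵀ(Ax − b); the step size 1/‖A‖₂² and the iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)

x = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1/L, with L the largest eigenvalue of A^T A
for _ in range(500):
    g = A.T @ (A @ x - b)                    # gradient of 0.5 * ||Ax - b||^2
    x -= step * g

# The iterates approach the least-squares solution.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))
```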
0 votes · 0 answers · 53 views
Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization.
The ...
3 votes · 1 answer · 80 views
Do deep learning frameworks "look ahead" when calculating gradient in Nesterov optimization?
The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t,ドル but at $\theta_t + \beta m,ドル where $\beta$ is the momentum coefficient and $m$ ...
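For concreteness, one common way to write the lookahead form the question refers to is sketched below; the sign convention, hyperparameters, and the quadratic example are illustrative, and this is not a claim about what any particular framework implements:

```python
import numpy as np

def nag_step(theta, v, grad, lr=0.01, beta=0.9):
    """One Nesterov step in the 'lookahead' formulation.

    grad : callable returning the gradient at a point
    v    : velocity (momentum) buffer
    """
    g = grad(theta + beta * v)   # gradient evaluated at the lookahead point theta + beta * v
    v = beta * v - lr * g        # velocity update
    return theta + v, v          # parameter update

# Example: minimise f(x) = ||x||^2; the iterates approach the origin.
theta, v = np.array([3.0, -4.0]), np.zeros(2)
for _ in range(200):
    theta, v = nag_step(theta, v, lambda x: 2 * x)
print(theta)
```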
0 votes · 0 answers · 52 views
If the main benefit of BatchNorm is loss landscape smoothing, why do we use z-score normalisation instead of min-max?
According to recent papers, the main reason why BatchNorm works is because it smooths the loss landscape. So if the main benefit is loss landscape smoothing, why do we need mean subtraction at all? ...
1 vote · 0 answers · 62 views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
14 votes · 6 answers · 3k views
Why are so many problems linear and how would one solve nonlinear problems?
I am taking a deep learning in Python class this semester and we are doing linear algebra.
Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
1 vote · 0 answers · 45 views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
1 vote · 0 answers · 58 views
Why do machine learning courses on regression mostly focus on gradient descent although we have the closed form estimator $(X'X)^{-1}X'Y$? [duplicate]
In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example regressing $Y$ on features $X$), although we have the closed form ...
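A small sketch contrasting the two on simulated data; the data, step size, and iteration count are illustrative, and both routes should arrive at essentially the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Closed-form estimator (X'X)^{-1} X'Y, computed via a linear solve.
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on the squared-error loss converges to the same point.
beta = np.zeros(3)
step = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(1000):
    beta -= step * X.T @ (X @ beta - Y)

print(np.allclose(beta, beta_closed, atol=1e-4))   # True on this well-conditioned example
```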