Questions tagged [gradient-descent]
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
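As a concrete illustration of that update rule, here is a minimal sketch of the iteration x ← x − step · ∇f(x); the constant step size, the iteration count, and the quadratic example are illustrative choices rather than part of the tag definition.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, n_steps=100):
    """Minimise a function by repeatedly stepping against its gradient.

    grad : callable returning the gradient at a point
    x0   : starting point
    step : constant step size (an illustrative choice; in practice it is tuned)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad(x)  # step proportional to the negative gradient
    return x

# Example: minimise f(x) = ||x||^2, whose gradient is 2x; the iterates approach the origin.
print(gradient_descent(lambda x: 2 * x, [3.0, -4.0]))
```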
997 questions
2 votes · 0 answers · 21 views
What causes the degradation problem - the higher training error in much deeper networks?
In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that
"When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0 votes · 0 answers · 41 views
Why does LightGBM use the factor (1-a)/b in GOSS?
LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked.
In the algorithm, ...
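For context, GOSS in the LightGBM paper keeps the top a·100% of instances by gradient magnitude, samples b·100% of the remaining instances, and multiplies the sampled ones by (1 − a)/b when computing gain. A rough simulation (gradient magnitudes drawn at random here, not taken from LightGBM) of why that factor keeps the small-gradient contribution approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
grads = np.abs(rng.normal(size=10_000))      # simulated per-instance gradient magnitudes
a, b = 0.2, 0.1                              # top fraction kept, fraction sampled from the rest

order = np.argsort(-grads)
top = order[: int(a * len(grads))]           # all large-gradient instances are kept
rest = order[int(a * len(grads)):]
sampled = rng.choice(rest, size=int(b * len(grads)), replace=False)

# The sample covers b / (1 - a) of the small-gradient set, so scaling it by
# (1 - a) / b recovers (in expectation) the full small-gradient contribution.
approx = grads[top].sum() + (1 - a) / b * grads[sampled].sum()
print(grads.sum(), approx)                   # the two totals should be close
```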
2 votes · 1 answer · 59 views
Running SGD multiple times and picking the best result: keywords / name for this practice?
When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
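A minimal sketch of the practice being asked about, i.e. keeping the best of several independent stochastic runs; `train_once` is a hypothetical interface standing in for whatever fitting routine is restarted, not a reference to any library:

```python
import numpy as np

def best_of_restarts(train_once, n_restarts=5, seed=0):
    """Run a stochastic training routine several times and keep the run
    with the lowest training loss.

    train_once : callable taking a random generator and returning (model, training_loss)
    """
    rng = np.random.default_rng(seed)
    best_model, best_loss = None, np.inf
    for _ in range(n_restarts):
        model, loss = train_once(np.random.default_rng(rng.integers(2**32)))
        if loss < best_loss:
            best_model, best_loss = model, loss
    return best_model, best_loss
```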
4 votes · 1 answer · 89 views
Stochastic Gradient Descent for Multilayer Networks
I was going through the algorithm for stochastic gradient descent in multilayer networks from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight update rule. However, I don't ...
10 votes · 3 answers · 2k views
Is Backpropagation faulty?
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so ∂z/∂w is no longer correct, as z has changed to z* and z* $\...
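For reference on the premise: a plain gradient-descent step computes every layer's gradient at the current weights and only then applies all the updates, so no gradient is evaluated against already-modified activations. A small numpy sketch of that ordering (the two-layer tanh network and squared loss are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of inputs
y = rng.normal(size=(4, 1))      # targets
W1 = rng.normal(size=(3, 5))     # layer-1 weights
W2 = rng.normal(size=(5, 1))     # layer-2 weights
lr = 0.01

# Forward pass at the *current* weights.
a1 = np.tanh(x @ W1)             # layer-1 activation a^(1)
err = a1 @ W2 - y                # gradient of 0.5 * squared error w.r.t. the output

# Backward pass: both gradients are taken at the same (old) weights.
gW2 = a1.T @ err
gW1 = x.T @ ((err @ W2.T) * (1 - a1 ** 2))

# Only after all gradients are computed are the weights updated, together.
W1 -= lr * gW1
W2 -= lr * gW2
```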
1 vote · 0 answers · 85 views
cost function behaves erratically [closed]
Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
2 votes · 0 answers · 55 views
Estimator bias implies Gradient Bias
Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo
Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
5 votes · 1 answer · 116 views
For a linear problem $Ax=b,ドル is gradient descent a lot faster than least squares (any approach)?
Context
There are many methods to solve least squares, but most of them involve $k n^3$ flops.
Using gradient descent, one runs $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
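A minimal sketch of that iteration applied to the least-squares objective ½‖Ax − b‖², whose gradient is Aᵀ(Ax − b); the step size 1/‖A‖₂² and the iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)

x = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1/L, with L the largest eigenvalue of A^T A
for _ in range(500):
    g = A.T @ (A @ x - b)                    # gradient of 0.5 * ||Ax - b||^2
    x -= step * g

# The iterates approach the least-squares solution.
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))
```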
0 votes · 0 answers · 53 views
Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization.
The ...
3 votes · 1 answer · 80 views
Do deep learning frameworks "look ahead" when calculating gradient in Nesterov optimization?
The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t,ドル but at $\theta_t + \beta m,ドル where $\beta$ is the momentum coefficient and $m$ ...
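For concreteness, one common way to write the lookahead form the question refers to is sketched below; the sign convention, hyperparameters, and the quadratic example are illustrative, and this is not a claim about what any particular framework implements:

```python
import numpy as np

def nag_step(theta, v, grad, lr=0.01, beta=0.9):
    """One Nesterov step in the 'lookahead' formulation.

    grad : callable returning the gradient at a point
    v    : velocity (momentum) buffer
    """
    g = grad(theta + beta * v)   # gradient evaluated at the lookahead point theta + beta * v
    v = beta * v - lr * g        # velocity update
    return theta + v, v          # parameter update

# Example: minimise f(x) = ||x||^2; the iterates approach the origin.
theta, v = np.array([3.0, -4.0]), np.zeros(2)
for _ in range(200):
    theta, v = nag_step(theta, v, lambda x: 2 * x)
print(theta)
```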
0 votes · 0 answers · 52 views
If the main benefit of BatchNorm is loss landscape smoothing, why do we use z-score normalisation instead of min-max?
According to recent papers, the main reason why BatchNorm works is because it smooths the loss landscape. So if the main benefit is loss landscape smoothing, why do we need mean subtraction at all? ...
1 vote · 0 answers · 62 views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
14 votes · 6 answers · 3k views
Why are so many problems linear and how would one solve nonlinear problems?
I am taking a deep learning in Python class this semester and we are doing linear algebra.
Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
1 vote · 0 answers · 45 views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
1 vote · 0 answers · 58 views
Why do machine learning courses on regression mostly focus on gradient descent although we have the closed form estimator $(X'X)^{-1}X'Y$? [duplicate]
In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example regressing $Y$ on features $X$), although we have the closed form ...
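A small sketch contrasting the two on simulated data; the data, step size, and iteration count are illustrative, and both routes should arrive at essentially the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Closed-form estimator (X'X)^{-1} X'Y, computed via a linear solve.
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on the squared-error loss converges to the same point.
beta = np.zeros(3)
step = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(1000):
    beta -= step * X.T @ (X @ beta - Y)

print(np.allclose(beta, beta_closed, atol=1e-4))   # True on this well-conditioned example
```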