Cross Validated

Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
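The update described above — steps proportional to the negative of the gradient at the current point — can be sketched in a few lines (a minimal illustration; the objective, learning rate, and step count are invented for the example):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3);
# the minimizer is x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```

With a fixed step size the iterates contract toward the local minimum geometrically on this convex quadratic; for non-convex functions the same loop only guarantees a local minimum, as the description notes.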

2 votes · 0 answers · 21 views

In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that "When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0 votes · 0 answers · 41 views

LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked. In the algorithm, ...
2 votes · 1 answer · 59 views

When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
4 votes · 1 answer · 89 views

I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight update rule. However, I don't ...
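For reference, the stochastic weight-update rule in that setting, for a single sigmoid unit trained on squared error, takes the delta-rule form $w \leftarrow w + \eta\,(t-o)\,o\,(1-o)\,x$. A hedged sketch (the toy input, target, and learning rate are invented for illustration; Mitchell's full algorithm also backpropagates these terms through hidden layers):

```python
import numpy as np

def sgd_update(w, x, t, lr=0.5):
    """One stochastic step for a single sigmoid unit:
    w <- w + lr * (t - o) * o * (1 - o) * x."""
    o = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # sigmoid output
    return w + lr * (t - o) * o * (1 - o) * x

w = np.zeros(2)
x = np.array([1.0, 1.0])   # input with a bias term folded in
for _ in range(2000):
    w = sgd_update(w, x, t=1.0)
o = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # output after training
```

The factor $o(1-o)$ is the derivative of the sigmoid, which is why the update slows as the output saturates toward the target.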
10 votes · 3 answers · 2k views

Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z/\partial w$ is no longer correct, as $z$ has changed to $z^*$ and $z^*$ $\...
1 vote · 0 answers · 85 views

Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
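A minimal version of that setup — logistic regression fit by batch gradient descent on the mean negative log-likelihood — looks like this (the synthetic data, learning rate, and step count are assumptions for illustration; the asker's own code and plot are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
# Labels drawn from the logistic model itself.
y = (1 / (1 + np.exp(-X @ true_w)) > rng.uniform(size=200)).astype(float)

w = np.zeros(2)
lr = 0.1
losses = []
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)   # gradient of mean negative log-likelihood
    w -= lr * grad
    eps = 1e-12                     # guard against log(0)
    losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```

Because the negative log-likelihood is convex in $w,ドル a sufficiently small step size gives a monotonically decreasing error curve; oscillation in such a plot usually signals a step size that is too large.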
2 votes · 0 answers · 55 views

Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo. Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
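For background (a standard fact, independent of the question's specifics): the plug-in Monte Carlo estimator is biased downward because $\log$ is concave, by Jensen's inequality,

```latex
\mathbb{E}\left[\log \frac{1}{N}\sum_{i=1}^{N} f_\theta(x_i)\right]
\;\le\;
\log \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} f_\theta(x_i)\right]
\;=\;
\log \mathbb{E}[f_\theta(x)].
```

The same nonlinearity means the naive gradient estimator $\nabla_\theta \log \frac{1}{N}\sum_{i} f_\theta(x_i)$ is, in general, also biased for $\nabla_\theta \log \mathbb{E}[f_\theta(x)],ドル though the bias of the gradient does not follow automatically from the bias of the estimator and has to be checked case by case.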
5 votes · 1 answer · 116 views

Context: There are many methods to solve least squares, but most of them involve $k n^3$ flops. Using gradient descent, one runs $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
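The iteration described can be sketched as follows (a hedged illustration: each step costs only matrix-vector products, roughly $O(mn)$ flops, versus the cubic cost of a direct solve; the dimensions, step size, and iteration count are invented, and the step size must be small relative to the largest eigenvalue of $A^\top A$ for the iteration to converge):

```python
import numpy as np

def lstsq_gd(A, b, lr, steps=1000):
    """Gradient descent on 0.5 * ||Ax - b||^2."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = A.T @ (A @ x - b)   # gradient: A^T (Ax - b)
        x = x - lr * g
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
x_gd = lstsq_gd(A, b, lr=0.01)
x_direct = np.linalg.lstsq(A, b, rcond=None)[0]   # direct solve for comparison
```

On a well-conditioned problem both routes agree; the trade-off the question raises is that the number of iterations needed grows with the condition number of $A^\top A$.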
0 votes · 0 answers · 53 views

I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization. The ...
3 votes · 1 answer · 80 views

The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t,ドル but at $\theta_t + \beta m,ドル where $\beta$ is the momentum coefficient and $m$ ...
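The lookahead evaluation described above — taking the gradient at $\theta_t + \beta m$ rather than at $\theta_t$ — can be sketched as follows (a minimal illustration of one common formulation of NAG; the quadratic objective and hyperparameters are invented for the example):

```python
import numpy as np

def nag(grad, theta0, lr=0.1, beta=0.9, steps=200):
    """Nesterov accelerated gradient: the gradient is evaluated at the
    lookahead point theta + beta * m, not at theta itself."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta + beta * m)   # lookahead gradient
        m = beta * m - lr * g
        theta = theta + m
    return theta

# Quadratic bowl f(theta) = 0.5 * ||theta||^2, gradient = theta.
theta = nag(lambda t: t, theta0=[5.0, -3.0])
```

Classical momentum would instead evaluate `grad(theta)`; the lookahead version corrects the step using where the momentum is about to carry the parameters, which is what damps the overshoot.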
0 votes · 0 answers · 52 views

According to recent papers, the main reason why BatchNorm works is because it smooths the loss landscape. So if the main benefit is loss landscape smoothing, why do we need mean subtraction at all? ...
1 vote · 0 answers · 62 views

That is, because the error is computed at the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the network, does that mean that the ...
14 votes · 6 answers · 3k views

I am taking a deep learning in Python class this semester and we are doing linear algebra. Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
1 vote · 0 answers · 45 views

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
1 vote · 0 answers · 58 views

In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example, regressing $Y$ on features $X$), although we have the closed form ...
