Questions tagged [gradient-descent]
The gradient-descent tag has no summary.
47 questions
1 vote · 0 answers · 33 views
Conditions on LR in Gradient Descent
In Introductory Lectures on Convex Optimization by Yurii Nesterov, Section 1.2.3 shows that gradient descent is guaranteed to converge if the step size is chosen either as a fixed constant or ...
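For context, the usual form of a fixed-step guarantee for an $L$-smooth function (a sketch of the standard descent lemma, not necessarily Nesterov's exact statement):
$$
f\bigl(x - h\nabla f(x)\bigr) \le f(x) - h\Bigl(1 - \frac{hL}{2}\Bigr)\|\nabla f(x)\|^{2},
$$
so any fixed step size $h \in (0, 2/L)$ decreases the objective at every iteration, and $h = 1/L$ maximizes the guaranteed per-step decrease.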
0 votes · 2 answers · 382 views
Find minimum of a function only knowing the ordering of a set of input points
Suppose I have a function $f: \mathbb{R}^n\rightarrow\mathbb{R}$. All I know about the function is that I have a set of pairs of vectors ($\vec{v}_a$, $\vec{v}_b$) for which I know which one is greater (i....
1 vote · 1 answer · 174 views
What does RSGD stand for?
I'm reading a paper that involves an algorithm for RSGD. It's clearly a form of stochastic gradient descent, but I can't find what the R stands for. The authors provide their own implementation of it, ...
1 vote · 0 answers · 55 views
Understanding gradient flow of a linearized wide neural network
I've been trying to fully understand the paper "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (available here), but I'm stuck on the linearization part, ...
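A compact way to state the linearization the question refers to (a sketch; $\theta_0$ denotes the parameters at initialization):
$$
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_{\theta} f(x;\theta_0)^{\top}(\theta - \theta_0),
$$
i.e. a first-order Taylor expansion of the network output in its parameters; the paper argues that, in the infinite-width limit, gradient-descent training of the full network closely tracks training of this linear-in-$\theta$ model.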
0 votes · 0 answers · 179 views
Create a simple neural network of n layers in Python from scratch with NumPy to solve the XOR example problem using batch gradient descent
I'm a young programmer who became interested in machine learning. I watched videos and read articles about the theory behind simple neural networks. However, I can't manage to set one up correctly. I've ...
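For readers after a reference point, here is a minimal sketch of a two-layer network trained on XOR with full-batch gradient descent in NumPy; the hidden size, learning rate, and epoch count are illustrative choices, not taken from the question.

import numpy as np

# XOR data: 4 examples, 2 features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 1.0, (2, 4))  # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, (4, 1))  # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for epoch in range(5000):
    # forward pass on the whole batch
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # mean squared error gradient w.r.t. the predictions
    dp = 2.0 * (p - y) / len(X)
    # backward pass (chain rule through both sigmoid layers)
    dz2 = dp * p * (1 - p)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0, keepdims=True)
    # batch gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p, 2))  # should approach [[0], [1], [1], [0]]; another seed may need more epochs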
0 votes · 0 answers · 122 views
How to calculate the upper bound of the gradient of a multi-layer ReLU neural network
Layers: In the following we shall denote the layer number by the superscript $\ell$. We have $\ell=0$ for the input layer, $\ell=1$ for the first hidden layer, and $\ell=L$ for the output layer. The ...
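One common line of reasoning for such a bound (a sketch for the input gradient, in the question's layer notation, assuming weight matrices $W^{\ell}$ and the 1-Lipschitz ReLU activation):
$$
\|\nabla_{x} f(x)\| \le \prod_{\ell=1}^{L} \|W^{\ell}\|_{2},
$$
since the Jacobian of layer $\ell$ is $D^{\ell} W^{\ell}$ with $D^{\ell}$ a diagonal matrix of ReLU derivatives in $\{0,1\},ドル so $\|D^{\ell}\|_{2} \le 1$.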
2 votes · 0 answers · 47 views
Convergence rate of quasi-newton method for non-convex objective function
Consider a real-valued $L$-smooth and non-convex objective function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. There exists a bound on the number of iterations needed to find a (local) minimum using ordinary ...
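For reference, the bound the excerpt likely alludes to for ordinary gradient descent on an $L$-smooth (possibly non-convex) $f$ with step size $1/L$ is, in sketch form,
$$
\min_{0 \le k < K} \|\nabla f(x_k)\|^{2} \le \frac{2L\bigl(f(x_0) - f^{\star}\bigr)}{K},
$$
so driving the gradient norm below $\varepsilon$ takes $O(1/\varepsilon^{2})$ iterations; the question is whether quasi-Newton methods admit an analogous worst-case rate.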
1 vote · 1 answer · 213 views
Why, when a function is quadratic, is the approximation by Newton's method exact, so that the algorithm converges to the global minimum in a single step?
Suppose we want to find the value of $x$ that minimizes
$$
f(x)=\frac{1}{2}\|A x-b\|_{2}^{2} .
$$
Specialized linear algebra algorithms can solve this problem efficiently; however, we can also explore ...
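A short worked check of the single-step claim for this objective (assuming $A^{\top}A$ is invertible):
$$
\nabla f(x) = A^{\top}(Ax - b), \qquad \nabla^{2} f(x) = A^{\top}A,
$$
so from any starting point $x_{0}$ one Newton step gives
$$
x_{1} = x_{0} - (A^{\top}A)^{-1}A^{\top}(Ax_{0} - b) = (A^{\top}A)^{-1}A^{\top}b,
$$
the least-squares minimizer: a quadratic equals its own second-order Taylor expansion, so Newton's model of the function is exact.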
1 vote · 1 answer · 110 views
The preliminary of the Bandit Gradient Algorithm
In the papers introducing The Bandit Gradient Algorithm as Stochastic Gradient Ascent, the following relationship (not visible in this excerpt) is always treated as a preliminary and stated without proof. Does anyone know how ...
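The relationship itself is not visible in the excerpt; if the paper follows the usual gradient-bandit derivation (e.g. Sutton &amp; Barto), it is presumably of the form
$$
\frac{\partial\, \mathbb{E}[R_{t}]}{\partial H_{t}(a)} = \mathbb{E}\Bigl[\bigl(R_{t} - B_{t}\bigr)\bigl(\mathbf{1}\{A_{t} = a\} - \pi_{t}(a)\bigr)\Bigr],
$$
with action preferences $H_{t}(a),ドル softmax policy $\pi_{t},ドル and an arbitrary baseline $B_{t}$; this is an assumption about the omitted content, not a quotation from the paper.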
1 vote · 0 answers · 124 views
RMSProp Momentum and Decay
I'm making an application of MobileNetV2 and according to their article:
"We train our models using TensorFlow. We use the standard RMSPropOptimizer with both decay and momentum set to 0.9. We use ..."
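A minimal sketch of how that setting is often reproduced with the Keras RMSprop optimizer, assuming the article's "decay" refers to the squared-gradient discounting factor (the rho argument) rather than a learning-rate schedule, which is part of what the question is asking; the learning rate below is a placeholder, not a value from the article.

import tensorflow as tf

# RMSprop with the moving-average discounting factor ("decay") and momentum both 0.9,
# as in the MobileNetV2 excerpt; learning_rate is illustrative only.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.045, rho=0.9, momentum=0.9)

# Example usage with a stand-in Keras model:
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer=optimizer, loss="categorical_crossentropy")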
1 vote · 0 answers · 35 views
Reinforcement learning with 0 rewards and costs
Suppose we have a hallway environment, i.e., $N$ nodes from left to right, and we can either move left or right. Moving left at the leftmost node does nothing, and reaching the rightmost node gives you ...
1 vote · 0 answers · 51 views
Searching for the underlying affine transformation in a ridge function
Quoting from Wikipedia:
A ridge function is any function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ that can be written as the composition of a univariate function with an affine transformation, that is: $...
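The truncated definition presumably continues in the standard form (a sketch of the usual statement, following the excerpt's wording):
$$
f(\vec{x}) = g\bigl(\vec{a}\cdot\vec{x} + b\bigr), \qquad g:\mathbb{R}\rightarrow\mathbb{R},\ \vec{a}\in\mathbb{R}^{d},\ b\in\mathbb{R},
$$
and the question is then about recovering the underlying $\vec{a}$ (and $b$) given access to $f$.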
0 votes · 0 answers · 347 views
Coordinate descent for Lasso, Question about algorithm
I'm not sure why the algorithm computes $c_k$ with $\sum_{j \neq k} w_j x_{i, j}$. Why does one need to ignore the $k^{th}$ feature here? I'm not sure how this is derived. Is this the result of taking ...
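A sketch of where that term comes from, in the common notation (features $x_{i,j},ドル targets $y_{i},ドル coordinate $w_{k}$ updated with all other coordinates held fixed): differentiating only the quadratic part with respect to $w_{k}$ gives
$$
\frac{\partial}{\partial w_{k}} \frac{1}{2}\sum_{i}\Bigl(y_{i} - \sum_{j} w_{j} x_{i,j}\Bigr)^{2}
= -\sum_{i} x_{i,k}\Bigl(y_{i} - \sum_{j \neq k} w_{j} x_{i,j}\Bigr) + w_{k}\sum_{i} x_{i,k}^{2},
$$
so $c_{k} = \sum_{i} x_{i,k}\bigl(y_{i} - \sum_{j \neq k} w_{j} x_{i,j}\bigr)$ is the correlation of feature $k$ with the partial residual, i.e. with what the other features have not yet explained; the soft-thresholding step for $w_{k}$ is then applied to this $c_{k}$.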
1 vote · 1 answer · 330 views
How does Gradient Descent treat multiple features?
As far as I know, when you reach the step in a gradient descent algorithm where you calculate step_size, you calculate ...
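A minimal sketch of the usual vectorized treatment, using linear regression as a stand-in model (the data and learning rate below are illustrative): every feature gets its own partial derivative, and all weights are updated simultaneously with the same learning rate.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                          # one weight per feature
lr = 0.1                                 # single step size shared by all features
for _ in range(500):
    residual = X @ w - y                 # shape (100,)
    grad = X.T @ residual / len(y)       # shape (3,): one partial derivative per feature
    w -= lr * grad                       # all features updated in the same step

print(np.round(w, 2))                    # close to [ 2.  -1.   0.5]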
2 votes · 1 answer · 63 views
SGD statistical guarantee
I have a question regarding online learning with SGD. Is there a way to give a statistical guarantee that the value obtained after $n$ samples deviates by at most $\epsilon$ from the true value?
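One standard result in this direction (a sketch; it assumes a convex objective, stochastic gradients bounded by $G,ドル a feasible set of diameter $D,ドル a suitably chosen step size, and averaged iterates, which may or may not match the intended setting):
$$
\mathbb{E}\bigl[f(\bar{x}_{n})\bigr] - f(x^{\star}) \le \frac{D G}{\sqrt{n}},
$$
where $\bar{x}_{n}$ is the average of the first $n$ iterates; high-probability versions of the same $O(1/\sqrt{n})$ rate exist, so an accuracy of $\epsilon$ needs on the order of $1/\epsilon^{2}$ samples.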