Stochastic variance reduction

From Wikipedia, the free encyclopedia
Family of optimization algorithms

(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are unattainable by methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting.

Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines,[1] as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.

Finite sum objectives

A function $f$ is considered to have finite sum structure if it can be decomposed into a summation or average:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$

where the function value and derivative of each $f_i$ can be queried independently. Although variance reduction methods can be applied for any positive $n$ and any $f_i$ structure, their favorable theoretical and practical properties arise when $n$ is large compared to the condition number of each $f_i$, and when the $f_i$ have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants.

The finite sum structure should be contrasted with the stochastic approximation setting, which deals with functions of the form $f(\theta) = \operatorname{E}_{\xi}[F(\theta,\xi)]$, the expected value of a function depending on a random variable $\xi$. Any finite sum problem can be optimized using a stochastic approximation algorithm by taking $F(\cdot,\xi) = f_{\xi}$.
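To make the setting concrete, the following minimal Python sketch sets up a hypothetical least-squares finite sum with components $f_i(x) = \tfrac{1}{2}(v_i^T x - y_i)^2$; the data, names and component choice are illustrative assumptions, and the key property is only that each component's value and gradient can be queried independently.

```python
import numpy as np

# A hypothetical finite-sum objective: least squares, with components
# f_i(x) = 0.5 * (v_i^T x - y_i)^2 and f(x) = (1/n) * sum_i f_i(x).
# Each component's value and gradient can be queried independently,
# which is the structure variance reduction methods exploit.

rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))   # data points v_i as rows
y = rng.standard_normal(n)        # targets y_i

def f_i(x, i):
    """Value of the i-th component."""
    r = V[i] @ x - y[i]
    return 0.5 * r * r

def grad_f_i(x, i):
    """Gradient of the i-th component."""
    return (V[i] @ x - y[i]) * V[i]

def f(x):
    """Full objective: the average of the components."""
    r = V @ x - y
    return 0.5 * np.mean(r * r)
```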

Rapid convergence

Stochastic variance reduced methods without acceleration are able to find a minimum of $f$ to within accuracy $\epsilon > 0$, i.e. $f(x) - f(x_*) \leq \epsilon$, in a number of steps of the order:

$$O\left(\left(\frac{L}{\mu} + n\right)\log\left(\frac{1}{\epsilon}\right)\right).$$

The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps required, $O\bigl(L/(\mu\epsilon)\bigr)$, grows proportionally to the accuracy required. Stochastic variance reduction methods converge almost as fast as gradient descent's $O\bigl((L/\mu)\log(1/\epsilon)\bigr)$ rate, despite using only a stochastic gradient, at $1/n$ of the per-step cost of gradient descent.
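As a rough illustration with hypothetical values $n = 10^6$, $L/\mu = 10^4$ and $\epsilon = 10^{-6}$: a non-accelerated variance reduced method needs on the order of $(10^4 + 10^6)\ln(10^6) \approx 1.4 \times 10^7$ stochastic gradient steps, whereas the stochastic approximation rate $O(L/(\mu\epsilon))$ gives on the order of $10^4 / 10^{-6} = 10^{10}$ steps.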

Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only

$$O\left(\left(\sqrt{\frac{nL}{\mu}} + n\right)\log\left(\frac{1}{\epsilon}\right)\right)$$

steps to reach $\epsilon$ accuracy, potentially a factor of $\sqrt{n}$ faster than non-accelerated methods. Lower complexity bounds[2] for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.

Approaches

Variance reduction approaches fall into four main categories: table averaging methods, full-gradient snapshot methods, recursive estimator methods (e.g., SARAH), and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.

SAGA

In the SAGA method,[3] the prototypical table averaging approach, a table of size $n$ is maintained that contains the last gradient witnessed for each $f_i$ term, denoted $g_i$. At each step, an index $i$ is sampled, and a new gradient $\nabla f_i(x_k)$ is computed. The iterate $x_k$ is updated with:

$$x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - g_i + \frac{1}{n}\sum_{j=1}^{n} g_j\right],$$

and afterwards table entry $i$ is updated with $g_i = \nabla f_i(x_k)$.
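The following is a minimal Python sketch of the SAGA update on the hypothetical least-squares finite sum used above; the step size $\gamma$ and iteration count are illustrative choices, not tuned values from the original paper. Maintaining a running average of the table avoids recomputing the sum over all $n$ entries at each step.

```python
import numpy as np

# Minimal SAGA sketch on a hypothetical least-squares finite sum,
# f_i(x) = 0.5 * (v_i^T x - y_i)^2 (an illustrative assumption).
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))   # data points v_i as rows
y = rng.standard_normal(n)

def grad_f_i(x, i):
    return (V[i] @ x - y[i]) * V[i]

def saga(x0, gamma=1e-2, steps=20000):
    x = x0.copy()
    # table of last-seen gradients g_i, initialized at x0
    table = np.array([grad_f_i(x, i) for i in range(n)])
    avg = table.mean(axis=0)      # running average of the table
    for _ in range(steps):
        i = rng.integers(n)
        g_new = grad_f_i(x, i)
        # SAGA step: unbiased gradient estimator with reduced variance
        x = x - gamma * (g_new - table[i] + avg)
        # update the running average and the table entry for i
        avg += (g_new - table[i]) / n
        table[i] = g_new
    return x

x_hat = saga(np.zeros(d))
```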

SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method,[4] improving on its flexibility and performance.

SVRG

The stochastic variance reduced gradient method (SVRG),[5] the prototypical snapshot method, uses a similar update, except that instead of the average of a table it uses a full gradient reevaluated at a snapshot point $\tilde{x}$ at regular intervals of $m \geq n$ iterations. The update becomes:

$$x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})\right].$$

This approach requires two stochastic gradient evaluations per step, one to compute $\nabla f_i(x_k)$ and one to compute $\nabla f_i(\tilde{x})$, whereas table averaging approaches need only one.
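A minimal Python sketch of SVRG follows, again on a hypothetical least-squares finite sum; the snapshot interval $m$, step size and epoch count are illustrative choices rather than the settings analyzed in the original paper.

```python
import numpy as np

# Minimal SVRG sketch on a hypothetical least-squares finite sum,
# f_i(x) = 0.5 * (v_i^T x - y_i)^2 (an illustrative assumption).
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_f_i(x, i):
    return (V[i] @ x - y[i]) * V[i]

def full_grad(x):
    return V.T @ (V @ x - y) / n

def svrg(x0, gamma=1e-2, m=2000, epochs=10):
    x = x0.copy()
    for _ in range(epochs):
        x_snap = x.copy()          # snapshot point
        mu = full_grad(x_snap)     # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            # two stochastic gradient evaluations per step
            v = grad_f_i(x, i) - grad_f_i(x_snap, i) + mu
            x = x - gamma * v
    return x

x_hat = svrg(np.zeros(d))
```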

Despite the higher computational cost, SVRG is popular because its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, making it applicable in many settings where tabular methods cannot be used.

SARAH

The SARAH (stochastic recursive gradient) method[6] maintains a recursive estimator of the gradient rather than storing a table of past gradients (as in SAGA) or computing periodic full-gradient snapshots (as in SVRG). At the start of an inner loop, a full gradient is computed at a reference point $\tilde{x}$: $v_0 = \nabla f(\tilde{x})$. For inner iterations, with a sampled index $i_k$, the gradient estimator and iterate are updated by:

$$v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x_{k-1}) + v_{k-1}, \qquad x_{k+1} = x_k - \gamma v_k.$$

This recursion requires two component-gradient evaluations per step, $\nabla f_{i_k}(x_k)$ and $\nabla f_{i_k}(x_{k-1})$, but does not need to store per-sample gradients, resulting in lower memory cost than table-averaging methods. SARAH admits linear convergence for strongly convex functions and has been extended to more general nonconvex and composite problems.[7][8]
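A minimal Python sketch of the SARAH loop follows, on the same hypothetical least-squares finite sum; note that, unlike the SAGA and SVRG estimators, the recursive estimator $v_k$ is biased, which the convergence analysis accounts for.

```python
import numpy as np

# Minimal SARAH sketch on a hypothetical least-squares finite sum,
# f_i(x) = 0.5 * (v_i^T x - y_i)^2 (an illustrative assumption).
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_f_i(x, i):
    return (V[i] @ x - y[i]) * V[i]

def full_grad(x):
    return V.T @ (V @ x - y) / n

def sarah(x0, gamma=1e-2, m=1000, epochs=10):
    x = x0.copy()
    for _ in range(epochs):
        v = full_grad(x)           # v_0: full gradient at the reference point
        x_prev = x.copy()
        x = x - gamma * v
        for _ in range(m):
            i = rng.integers(n)
            # recursive estimator: no gradient table, no stored snapshot
            v = grad_f_i(x, i) - grad_f_i(x_prev, i) + v
            x_prev = x.copy()
            x = x - gamma * v
    return x

x_hat = sarah(np.zeros(d))
```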

SDCA

Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite sums where each term has a structure that makes computing the convex conjugate $f_i^*$, or its proximal operator, tractable. The standard SDCA method[9] considers finite sums that have additional structure compared to the generic finite sum setting:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x^T v_i) + \frac{\lambda}{2}\|x\|^2,$$

where each $f_i$ is one-dimensional and each $v_i$ is a data point associated with $f_i$. SDCA solves the dual problem:

$$\max_{\alpha \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i=1}^{n} f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i\right\|^2,$$

by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate $\alpha_i$, leaving all other coordinates unchanged. An approximate primal solution $x$ can be recovered from the $\alpha$ values:

$$x = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i v_i.$$

This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when $\lambda$ is large, but significantly slower than the other approaches when $\lambda$ is small.
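The coordinate update has a closed form for some losses. The sketch below assumes the squared loss $f_i(a) = \tfrac{1}{2}(a - y_i)^2$ (i.e. ridge regression), an illustrative choice under which maximizing the dual in a single coordinate $\alpha_i$ gives the explicit increment $\Delta\alpha_i = (y_i - v_i^T x - \alpha_i)/(1 + \|v_i\|^2/(\lambda n))$; note that no step size appears.

```python
import numpy as np

# Minimal SDCA sketch for ridge regression: f_i(a) = 0.5 * (a - y_i)^2,
# whose convex conjugate is available in closed form (an illustrative choice;
# other losses require their own conjugate or proximal computation).
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))   # data points v_i as rows
y = rng.standard_normal(n)
lam = 0.1                         # regularization parameter lambda

def sdca(steps=20000):
    alpha = np.zeros(n)           # dual variables
    x = np.zeros(d)               # primal iterate: x = (1/(lam*n)) * sum_i alpha_i v_i
    sq_norms = np.sum(V * V, axis=1)
    for _ in range(steps):
        i = rng.integers(n)
        # closed-form maximization of the dual over coordinate alpha_i
        delta = (y[i] - V[i] @ x - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        x += (delta / (lam * n)) * V[i]
    return x

x_hat = sdca()
```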

Accelerated approaches

Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators to accelerate convergence, either approximately or exactly. Direct acceleration approaches have also been developed.[10]

Catalyst acceleration

The catalyst framework[11] uses any of the standard methods above as an inner optimizer to approximately evaluate a proximal operator:

$$x_k \approx \operatorname{argmin}_x \left\{ f(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\}$$

after which it uses an extrapolation step to determine the next $y$:

$$y_k = x_k + \beta_k (x_k - x_{k-1})$$

The catalyst method's flexibility and simplicity make it a popular baseline approach. However, it does not achieve the optimal rate of convergence among accelerated methods; it is potentially slower by up to a log factor in the hyper-parameters.
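The sketch below illustrates the shape of the catalyst outer loop. For brevity the inner solver here is plain gradient descent with a fixed iteration budget and a constant extrapolation coefficient; in the actual framework the inner solver would be a variance reduced method and $\beta_k$ and the inner stopping criterion follow prescribed schedules, so this is only an illustrative approximation.

```python
import numpy as np

# Catalyst-style outer loop on a hypothetical least-squares objective.
# Inner solver and parameter schedules are simplified for illustration.
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def full_grad(x):
    return V.T @ (V @ x - y) / n

def inner_prox_solve(y_ref, kappa, inner_steps=50, lr=1e-2):
    """Approximately minimize f(x) + (kappa/2) * ||x - y_ref||^2."""
    x = y_ref.copy()
    for _ in range(inner_steps):
        x -= lr * (full_grad(x) + kappa * (x - y_ref))
    return x

def catalyst(x0, kappa=1.0, beta=0.9, outer_steps=100):
    x_prev = x0.copy()
    y_ref = x0.copy()
    for _ in range(outer_steps):
        x = inner_prox_solve(y_ref, kappa)   # approximate proximal point
        y_ref = x + beta * (x - x_prev)      # extrapolation step
        x_prev = x
    return x_prev

x_hat = catalyst(np.zeros(d))
```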

Point-SAGA

Proximal operations may also be applied directly to the $f_i$ terms to yield an accelerated method. The Point-SAGA method[12] replaces the gradient operations in SAGA with proximal operator evaluations, resulting in a simple, direct acceleration method:

$$x_{k+1} = \operatorname{prox}_j^{\gamma}\left(z_k \triangleq x_k + \gamma\left[g_j - \frac{1}{n}\sum_{i=1}^{n} g_i\right]\right),$$

with the table update $g_j = \frac{1}{\gamma}(z_k - x_{k+1})$ performed after each step. Here $\operatorname{prox}_j^{\gamma}$ is defined as the proximal operator for the $j$th term:

$$\operatorname{prox}_j^{\gamma}(y) = \operatorname{argmin}_x\left\{ f_j(x) + \frac{1}{2\gamma}\|x - y\|^2 \right\}.$$

Unlike other known accelerated methods, Point-SAGA requires only a single iterate sequence $x$ to be maintained between steps, and it has the advantage of only having a single tunable parameter $\gamma$. It obtains the optimal accelerated rate of convergence for strongly convex finite-sum minimization without additional log factors.
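The sketch below illustrates Point-SAGA on a hypothetical least-squares finite sum, chosen because the proximal operator of $f_j(x) = \tfrac{1}{2}(v_j^T x - y_j)^2$ has the closed form $\operatorname{prox}_j^{\gamma}(z) = z - \gamma v_j (v_j^T z - y_j)/(1 + \gamma\|v_j\|^2)$ via the Sherman–Morrison formula; the parameter values are illustrative.

```python
import numpy as np

# Minimal Point-SAGA sketch on a hypothetical least-squares finite sum,
# f_j(x) = 0.5 * (v_j^T x - y_j)^2 (an illustrative assumption with a
# closed-form proximal operator).
rng = np.random.default_rng(0)
n, d = 1000, 20
V = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def prox_j(z, j, gamma):
    """argmin_x { 0.5*(v_j^T x - y_j)^2 + (1/(2*gamma)) * ||x - z||^2 }."""
    v = V[j]
    return z - gamma * v * (v @ z - y[j]) / (1.0 + gamma * (v @ v))

def point_saga(x0, gamma=1.0, steps=20000):
    x = x0.copy()
    table = np.zeros((n, d))               # stored gradients g_i
    avg = table.mean(axis=0)
    for _ in range(steps):
        j = rng.integers(n)
        z = x + gamma * (table[j] - avg)   # z_k as defined above
        x_new = prox_j(z, j, gamma)
        g_new = (z - x_new) / gamma        # table update g_j = (z_k - x_{k+1}) / gamma
        avg += (g_new - table[j]) / n
        table[j] = g_new
        x = x_new
    return x

x_hat = point_saga(np.zeros(d))
```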

References

  1. ^ "sklearn.linear_model.LogisticRegression". scikit-learn. Retrieved Feb 26, 2022.
  2. ^ Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series A and B. 171 (1–2): 167–215. arXiv:1507.02000. doi:10.1007/s10107-017-1173-0.
  3. ^ Defazio, Aaron; Bach, Francis; Lacoste-Julien, Simon (2014). "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". Neural Information Processing Systems. arXiv:1407.0202.
  4. ^ Schmidt, Mark; Le Roux, Nicolas; Bach, Francis (2017). "Minimizing finite sums with the stochastic average gradient". Mathematical Programming. 162. arXiv:1309.2388.
  5. ^ Johnson, Rie; Zhang, Tong (2013). "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction". Neural Information Processing Systems.
  6. ^ Nguyen, Lam M.; Liu, Jie; Scheinberg, Katya; Takáč, Martin. "SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". International Conference on Machine Learning.
  7. ^ Fang, Cong; Li, Chris Junchi; Lin, Zhouchen; Zhang, Tong (2018). "SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator". Neural Information Processing Systems.
  8. ^ Pham, Nhan; Nguyen, Lam M.; Phan, Dzung T.; Tran-Dinh, Quoc (2020). "ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization". Journal of Machine Learning Research.
  9. ^ Shalev-Shwartz, Shai; Zhang, Tong (2013). "Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization". Journal of Machine Learning Research. 14.
  10. ^ Lan, Guanghui; Zhou, Yi (2018). "An optimal randomized incremental gradient method". Mathematical Programming: Series A and B. 171 (1–2): 167–215. arXiv:1507.02000. doi:10.1007/s10107-017-1173-0.
  11. ^ Lin, Hongzhou; Mairal, Julien; Harchaoui, Zaid (2016). "Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice". Journal of Machine Learning Research. 18. arXiv:1712.05654.
  12. ^ Defazio, Aaron (2016). "A Simple Practical Accelerated Method for Finite Sums". Neural Information Processing Systems. arXiv:1602.02442.
