
I am trying to solve $$ \min_W \|Y-XW\|_F^2 \quad \text{s.t.} \quad W_{ij}\geq 0 \;\; \forall i,j, $$ where $X$ is the input data and $Y$ is the output data we are trying to fit. This is a convex optimization problem that can be solved with quadratic programming.
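
For reference, a minimal sketch of one way to compute the baseline solution, assuming NumPy/SciPy; here `scipy.optimize.nnls` stands in for the quadratic programming solver, and since the constraint applies to every entry of $W$, each column of $Y$ gives an independent nonnegative least-squares problem:

```python
import numpy as np
from scipy.optimize import nnls

def nnls_reference(X, Y):
    """Baseline: solve min ||Y - X W||_F^2 s.t. W >= 0, one column of Y at a time."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        # nnls solves the vector problem min ||X w - y||_2 s.t. w >= 0
        W[:, j], _ = nnls(X, Y[:, j])
    return W
```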

As an exercise, I tried two different methods that use gradient descent.

  1. Perform gradient descent on $$ \|Y-XW \|_F^2 + \lambda \sum_{(i,j)\in S}\max(-W_{ij},0) $$ where $S$ is the set of indices of $W$ on which I want to impose the nonnegativity constraint. The penalty term (weighted by $\lambda$) is positive whenever some $W_{ij}$ with $(i,j)\in S$ is negative.

  2. Perform gradient descent on $$ \|Y-XW \|_F^2$$ but at each iteration, project $W$ onto the nonnegative orthant. In other words, set $W_{ij}\leftarrow \max(W_{ij},0)$ at the end of each iteration. (A minimal NumPy sketch of both methods appears after this list.)
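
A minimal sketch of the two methods, assuming NumPy and, for brevity, that $S$ contains every index of $W$; the value of $\lambda$, the learning rate, and the iteration count are only illustrative:

```python
import numpy as np

def penalized_gd(X, Y, lam=10.0, lr=2e-4, n_iter=10000, seed=0):
    """Method 1: gradient descent on ||Y - XW||_F^2 + lam * sum(max(-W, 0))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ W - Y)  # gradient of the quadratic term
        grad -= lam * (W < 0)           # subgradient of the penalty: -lam where W_ij < 0
        W -= lr * grad
    return W

def projected_gd(X, Y, lr=2e-4, n_iter=10000, seed=0):
    """Method 2: gradient descent on ||Y - XW||_F^2, projecting onto W >= 0 each iteration."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        W -= lr * 2.0 * X.T @ (X @ W - Y)
        W = np.maximum(W, 0.0)          # project onto the nonnegative orthant
    return W
```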

Interestingly, the $W$ found by these gradient descent methods did not converge to the quadratic programming solution. Moreover, the gradient descent methods were sensitive to the initial condition and converged to different $W$ with different cost values.

Why can't these gradient descent methods find the global optimum of the convex problem?

asked Sep 9, 2021 at 15:44
  • What was the norm of your gradient at your final iterate? It should be very small as an indication of true convergence. Additionally, at the final iterate, it is helpful to check if the Hessian is positive-definite to check if you are truly at a minimum or just a saddle point. Note that the $\max(x,0)$ function is flat for $x < 0$, so it is possible that your loss function contains a large flat valley where the gradient is too small to induce a big enough step, and so you would need to increase your step size. Commented Sep 9, 2021 at 16:09
  • @mhdadk Thanks. I checked the Hessian of the gradients, and they are positive-definite. The L2 norms of the gradients at the final iterations were 1e-3, and I used a learning rate of 2e-4. So I think I should be able to say I reached minima. Regarding your point on the flatness: although $\max(x,0)$ has a flat area, my loss function contains a quadratic error term, which would put non-zero gradients in that area. So doesn't that mean I need not worry about the flatness of $\max(x,0)$? Commented Sep 9, 2021 at 19:12
  • I'm not sure what you mean by "Hessian of the gradient". Could you clarify? I was referring to the Hessian of the cost function evaluated at your final iterate. As for the flatness, it would be a concern if your Lagrange multiplier is large, thereby weighting the regularization term more heavily. Otherwise, it isn't really a concern. One thing you could try is to generate surface/contour plots for a much simpler version of your cost function. That is, decrease the dimensionality of $Y$, $X$, and $W$ to 1 or 2 and create these plots to check if your cost function is indeed convex. Commented Sep 9, 2021 at 19:41
  • @mhdadk Sorry, "Hessian of the gradient" was a typo. I meant the Hessian of the cost function at the final iterate. Commented Sep 9, 2021 at 20:42
  • @mhdadk I see. Thanks for the insight! Commented Sep 9, 2021 at 20:46

1 Answer


Projected Gradient Descent works very well for this problem.

I defined it as:

$$ \arg \min_{\boldsymbol{X}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{X} - \boldsymbol{B} \right\|}_{F}^{2} \; \text{ subject to } \boldsymbol{X}_{i, j} \geq 0 \;\; \forall i, j $$

The step of the Projected Gradient Descent:

$$ \boldsymbol{X}^{\left( k + 1 \right)} = \max \left\{ 0, \boldsymbol{X}^{\left( k \right)} - \eta \left( \boldsymbol{A}^{T} \left( \boldsymbol{A} \boldsymbol{X}^{\left( k \right)} - \boldsymbol{B} \right) \right) \right\} $$

Where $\eta$ is the step size.
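
A minimal sketch of this iteration, assuming NumPy (the actual implementation is in the repository linked below); the step size uses $\eta = 1/\sigma_{\max}(\boldsymbol{A})^{2}$, the reciprocal of the Lipschitz constant of the gradient:

```python
import numpy as np

def projected_gradient_descent(A, B, n_iter=500):
    """PGD for min 0.5 * ||A X - B||_F^2 subject to X >= 0 (elementwise)."""
    # Step size 1 / L, where L = sigma_max(A)^2 is the Lipschitz
    # constant of the gradient A^T (A X - B).
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        X = np.maximum(X - eta * (A.T @ (A @ X - B)), 0.0)
    return X
```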

[Plot: objective value of Projected Gradient Descent converging to the reference convex-solver solution over the iterations]

You may use acceleration methods such as FISTA for even faster convergence.
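
For example, a minimal FISTA-style sketch under the same assumptions as above, where the projection onto the nonnegative orthant plays the role of the proximal step:

```python
import numpy as np

def fista_nnls(A, B, n_iter=500):
    """Accelerated (FISTA-style) projected gradient for min 0.5 * ||A X - B||_F^2, X >= 0."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((A.shape[1], B.shape[1]))
    Z = X.copy()   # extrapolated (momentum) point
    t = 1.0
    for _ in range(n_iter):
        X_new = np.maximum(Z - eta * (A.T @ (A @ Z - B)), 0.0)  # projected gradient step at Z
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0       # standard FISTA momentum schedule
        Z = X_new + ((t - 1.0) / t_new) * (X_new - X)
        X, t = X_new, t_new
    return X
```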


The code is available on my StackExchange Code GitHub Repository (look at the CrossValidated\Q544135 folder).

answered Aug 18 at 14:01
  • Nice answer! So you're using a step size of 1? Commented Aug 18 at 14:55
  • @NathanWycoff, I actually just forgot the step size. Thanks for the reminder. Commented Aug 18 at 15:54
  • How is it that the very same plot is showing up in your answers to very different questions?? Commented Aug 18 at 20:19
  • @whuber, It is not the same graph. I just implemented the answer I suggested and showed the convergence of the method to a reference convex solver. Without a scale, the convergence of a convex objective under first-order methods looks similar. Commented Aug 19 at 4:53
  • @whuber I suppose we are seeing the convergence theory of first-order methods in action indeed :) This is because these synthetic problems all have good conditioning, as they are generated from i.i.d. data. On real data, we would see more divergence in the behavior of the methods across datasets as the conditioning changes. Commented Aug 20 at 0:37
