
I am trying to solve $$ \min_W \|Y-XW\|_F^2 \quad \text{s.t.} \quad W_{ij}\geq 0 \;\; \forall i,j, $$ where $X$ is the input data and $Y$ is the output data we are trying to fit. This is a convex optimization problem that can be solved with quadratic programming.
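
For reference, a minimal sketch of one way to compute the baseline solution, assuming NumPy/SciPy; here `scipy.optimize.nnls` stands in for the quadratic programming solver, and since the constraint applies to every entry of $W$, each column of $Y$ gives an independent nonnegative least-squares problem:

```python
import numpy as np
from scipy.optimize import nnls

def nnls_reference(X, Y):
    """Baseline: solve min ||Y - X W||_F^2 s.t. W >= 0, one column of Y at a time."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        # nnls solves the vector problem min ||X w - y||_2 s.t. w >= 0
        W[:, j], _ = nnls(X, Y[:, j])
    return W
```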

As an exercise, I tried two different methods that use gradient descent.

  1. Perform gradient descent on $$ \|Y-XW \|_F^2 + \lambda \sum_{(i,j)\in S}\max(-W_{ij},0) $$ where $S$ is the set of indices of $W$ on which I want to impose the nonnegativity constraint. The penalty term (weighted by $\lambda$) is positive whenever some $W_{ij}$ with $(i,j)\in S$ is negative.

  2. Perform gradient descent on $$ \|Y-XW \|_F^2$$ but at each iteration, project $W$ onto the nonnegative orthant. In other words, set $W_{ij}\leftarrow \max(W_{ij},0)$ at the end of each iteration. (A minimal NumPy sketch of both methods appears after this list.)
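
A minimal sketch of the two methods, assuming NumPy and, for brevity, that $S$ contains every index of $W$; the value of $\lambda$, the learning rate, and the iteration count are only illustrative:

```python
import numpy as np

def penalized_gd(X, Y, lam=10.0, lr=2e-4, n_iter=10000, seed=0):
    """Method 1: gradient descent on ||Y - XW||_F^2 + lam * sum(max(-W, 0))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ W - Y)  # gradient of the quadratic term
        grad -= lam * (W < 0)           # subgradient of the penalty: -lam where W_ij < 0
        W -= lr * grad
    return W

def projected_gd(X, Y, lr=2e-4, n_iter=10000, seed=0):
    """Method 2: gradient descent on ||Y - XW||_F^2, projecting onto W >= 0 each iteration."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        W -= lr * 2.0 * X.T @ (X @ W - Y)
        W = np.maximum(W, 0.0)          # project onto the nonnegative orthant
    return W
```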

Interestingly, the $W$ found by these gradient descent methods did not converge to the quadratic programming solution. Moreover, the gradient descent methods were sensitive to the initial condition and converged to different $W$ with different cost values.

Why can't these gradient descent methods find the global optimum of the convex problem?

asked Sep 9, 2021 at 15:44
  • What was the norm of your gradient at your final iterate? It should be very small as an indication of true convergence. Additionally, at the final iterate, it is helpful to check if the Hessian is positive-definite to check if you are truly at a minimum or just a saddle point. Note that the $\max(x,0)$ function is flat for $x < 0$, so it is possible that your loss function contains a large flat valley where the gradient is too small to induce a big enough step, and so you would need to increase your step size. Commented Sep 9, 2021 at 16:09
  • @mhdadk Thanks. I checked the Hessian of the gradients, and they are positive-definite. The L2 norms of the gradients at the final iterations were 1e-3, and I used a learning rate of 2e-4. So I think I should be able to say I reached minima. Regarding your point on the flatness: although $\max(x,0)$ has a flat area, my loss function contains a quadratic error term, which would put non-zero gradients in that area. So doesn't that mean I need not worry about the flatness of $\max(x,0)$? Commented Sep 9, 2021 at 19:12
  • I'm not sure what you mean by "Hessian of the gradient". Could you clarify? I was referring to the Hessian of the cost function evaluated at your final iterate. As for the flatness, it would be a concern if your Lagrange multiplier is large, thereby weighting the regularization term more heavily. Otherwise, it isn't really a concern. One thing you could try is to generate surface/contour plots for a much simpler version of your cost function. That is, decrease the dimensionality of $Y$, $X$, and $W$ to 1 or 2 and create these plots to check if your cost function is indeed convex. Commented Sep 9, 2021 at 19:41
  • @mhdadk Sorry, "Hessian of the gradient" was a typo. I meant the Hessian of the cost function at the final iterate. Commented Sep 9, 2021 at 20:42
  • @mhdadk I see. Thanks for the insight! Commented Sep 9, 2021 at 20:46

1 Answer


Projected Gradient Descent works very well for this problem.

I defined it as:

$$ \arg \min_{\boldsymbol{X}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{X} - \boldsymbol{B} \right\|}_{F}^{2} \; \text{ subject to } \boldsymbol{X}_{i, j} \geq 0 \;\; \forall i, j $$

The step of the Projected Gradient Descent:

$$ \boldsymbol{X}^{\left( k + 1 \right)} = \max \left\{ 0, \boldsymbol{X}^{\left( k \right)} - \eta \left( \boldsymbol{A}^{T} \left( \boldsymbol{A} \boldsymbol{X}^{\left( k \right)} - \boldsymbol{B} \right) \right) \right\} $$

Where $\eta$ is the step size.
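
A minimal sketch of this iteration, assuming NumPy (the actual implementation is in the repository linked below); the step size uses $\eta = 1/\sigma_{\max}(\boldsymbol{A})^{2}$, the reciprocal of the Lipschitz constant of the gradient:

```python
import numpy as np

def projected_gradient_descent(A, B, n_iter=500):
    """PGD for min 0.5 * ||A X - B||_F^2 subject to X >= 0 (elementwise)."""
    # Step size 1 / L, where L = sigma_max(A)^2 is the Lipschitz
    # constant of the gradient A^T (A X - B).
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        X = np.maximum(X - eta * (A.T @ (A @ X - B)), 0.0)
    return X
```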

[Plot: objective value of Projected Gradient Descent converging to the reference convex-solver solution over the iterations]

You may use acceleration methods such as FISTA for even faster convergence.
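
For example, a minimal FISTA-style sketch under the same assumptions as above, where the projection onto the nonnegative orthant plays the role of the proximal step:

```python
import numpy as np

def fista_nnls(A, B, n_iter=500):
    """Accelerated (FISTA-style) projected gradient for min 0.5 * ||A X - B||_F^2, X >= 0."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((A.shape[1], B.shape[1]))
    Z = X.copy()   # extrapolated (momentum) point
    t = 1.0
    for _ in range(n_iter):
        X_new = np.maximum(Z - eta * (A.T @ (A @ Z - B)), 0.0)  # projected gradient step at Z
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0       # standard FISTA momentum schedule
        Z = X_new + ((t - 1.0) / t_new) * (X_new - X)
        X, t = X_new, t_new
    return X
```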


The code is available on my StackExchange Code GitHub Repository (look at the CrossValidated\Q544135 folder).

answered Aug 18 at 14:01
  • Nice answer! So you're using a step size of 1? Commented Aug 18 at 14:55
  • @NathanWycoff, I actually just forgot the step size. Thanks for the reminder. Commented Aug 18 at 15:54
  • How is it that the very same plot is showing up in your answers to very different questions?? Commented Aug 18 at 20:19
  • @whuber, It is not the same graph. I just implemented the answer I suggested and showed the convergence of the method to a reference convex solver. Without a scale, the convergence of a convex objective under first-order methods looks similar. Commented Aug 19 at 4:53
  • @whuber I suppose we are seeing the convergence theory of first-order methods in action indeed :) This is because these synthetic problems all have good conditioning, as they are generated from i.i.d. data. On real data, we would see more divergence in the behavior of the methods across datasets as the conditioning changes. Commented Aug 20 at 0:37
