
Ridge regression's objective function: $$ L(w) = \underbrace{\|y - Xw\|^2}_\text{data term} + \underbrace{\lambda\|w\|^2}_\text{smoothness term} $$

I am trying to understand the regularization term, $\lambda\|w\|^2$. My questions are:

  1. What does smoothness mean here?

    I checked the definition of smooth on Wolfram MathWorld, but it does not seem to fit here:

    A smooth function is a function that has continuous derivatives up to some desired order over some domain.

  2. I read a document explaining the smoothness term (page 12 of the PDF):

    A very common assumption is that the underlying function is likely to be smooth, for example, having small derivatives. Smoothness distinguishes the examples in Figure 2. There is also a practical reason to prefer smoothness, in that assuming smoothness reduces model complexity:

    I have difficulty understanding two claims above:

    • why a smooth underlying function will have small derivatives;

    • why smoothness reduces model complexity.

My counterexample is: $$ f(x) = w_0 + w_1x + w_2x^2 + w_3x^3 $$

with $w = [0.5, 0.7, 0.3, 0.4]$ or $w = [5, 7, 3, 4]$: both are $C^\infty$ functions.

I know I must be making a mistake somewhere. Please help me understand this correctly. Thank you.

Michael R. Chernick
asked May 24, 2017 at 2:24
  • I don't see why the authors say that smoothness requires derivatives to be small. It sort of depends on how you define small, where you require it to be small, and what order of derivative you refer to. – Commented May 24, 2017 at 3:17
  • In the context of polynomial fitting, I have this loose, artistic sense that if $\|\mathbf{w}\| < \|\mathbf{u}\|$ then $f(\mathbf{x}; \mathbf{w}) = \sum_j w_j x^j$ tends to be a less squiggly-looking polynomial than $\sum_j u_j x^j$. I'd have to think about whether there's a way to put that in more rigorous terms. – Commented May 24, 2017 at 5:28
  • E.g., check out this polynomial curve fitting example. – Commented May 24, 2017 at 5:42

1 Answer


As @Michael Chernick said, smoothness is a bad term here. I can see it making sense if you are fitting a scatterplot smoother and want to limit the second derivatives, but in ridge regression it is really a shrinkage parameter ($\lambda,ドル that is).

It penalizes large coefficients. However, it does so smoothly, in the sense that it does not "zero out" any of your variables. This is different from the LASSO regularizer, $\lambda \|w\|_1,ドル which can zero out variables.
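A quick numerical sketch of that contrast (assuming scikit-learn is available; the data, alphas, and feature count here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
# Only the first three features actually matter.
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_w + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks every coefficient
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set coefficients to exactly 0

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))
```

With data like this, the ridge fit keeps all ten coefficients nonzero (just shrunk toward zero), while the LASSO fit drops the irrelevant features entirely — which is exactly the "zero out" behavior described above.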

Matthew Gunn
answered May 24, 2017 at 4:48