Foundations of Supervised Learning

• Supervised learning requires handling approximation, estimation and optimization errors.
• Learning in high dimensions requires strong notions of regularity.
• Standard function classes based on local and global smoothness do not scale efficiently to complex real data.

Machine learning, in essence, is concerned with predicting future outcomes from past experience. In this chapter, we will formalize this goal, focusing on supervised learning, arguably the simplest and most widespread machine learning modality. Although it may be viewed as 'merely' doing curve-fitting, it turns out that such fitting ability signals a profound interplay between high-dimensional statistics, optimization and mathematical analysis.

Indeed, as the input of a machine learning system becomes more and more complex, as in image, video or chemical data, guaranteeing that the model will preserve its predictive power on unseen data becomes increasingly hard. This corresponds to the colloquial curse of dimensionality, a term originally coined by Bellman in the 1950s that captures an exponential dependency between the input dimension and the underlying cost of running a certain algorithmic or statistical procedure to a desired performance level.

In order to beat the curse of dimensionality, it is therefore necessary to make modeling assumptions that encode certain regularities about the data. In this chapter we will review function classes based on classic notions of regularity 'borrowed' from mathematical analysis, which provide a clean mathematical picture of high-dimensional learning and set the backdrop for the geometric function classes of Chapter 4.

2.1 Main Ingredients of Supervised Learning

The starting point of supervised machine learning (and also of our GDL journey) is a set of $N$ observations $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ are the input observations and $y_i \in \mathcal{Y}$ are the labels to be predicted. The defining feature in this setup is that $\mathcal{X}$ is a high-dimensional space: one canonical example assumes $\mathcal{X} = \mathbb{R}^d$ to be a Euclidean space of large dimension $d \gg 1$. The label space $\mathcal{Y}$ might be either low- or high-dimensional. The former case encompasses regression tasks with $\mathcal{Y} = \mathbb{R}$, such as predicting the free energy of a chemical compound, as well as classification tasks with $\mathcal{Y} = \{1, \dots, K\}$, such as image classification. In the latter case, $\mathcal{Y} \approx \mathcal{X}$ captures so-called structured prediction problems, where the output shares some characteristics with the input; for example, in medical imaging applications the output might indicate suspicious regions on an MRI, while in dynamical systems $\mathcal{Y} = \mathcal{X}$ and the goal is to predict a high-dimensional state $X_{t+1} \in \mathcal{X}$ from the current state $X_t \in \mathcal{X}$.

Figure 2.1: Representative supervised learning tasks: (i) regression to a low-dimensional space, (ii) classification, (iii) structured prediction between high-dimensional spaces.

Data Distribution. In order to formalize the predictive power of a model, it is necessary to model how data (both past and future) is generated. The standard assumption in supervised learning is that $(x_i, y_i)$ are drawn i.i.d. from an underlying data probability distribution $P$ defined over $\mathcal{X} \times \mathcal{Y}$, which will also be used to generate future data. It is important to emphasize that this is a simplifying assumption that is made to establish insightful theoretical guarantees of future performance (or generalisation, as we will see next), but many ML algorithms can operate and produce useful results in the absence of this i.i.d. property. Another important remark is that this data distribution is not known to the Learner, so it cannot be leveraged during the training process; instead, it has intrinsic theoretical value to analyse the prediction performance of different learning algorithms.


Figure 2.2: Cartoon illustration of the joint distribution defining a supervised learning problem. The target function in this example is the conditional expectation $f(x) = \mathbb{E}[y \mid x]$.

Loss Function. Besides the data distribution, another essential component to assess an ML model is the loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, a non-negative function defined over the label space satisfying $l(y, y) = 0$ for any $y \in \mathcal{Y}$. For classification tasks, we may use $l(y, y') = \mathbf{1}_{y \neq y'}$, while for regression the squared loss $l(y, y') = (y - y')^2$ is the ubiquitous choice, owing to its fundamental role in statistics and signal processing. Now, given any function $f : \mathcal{X} \to \mathcal{Y}$, this loss can then be used to assess the point-wise error $l(f(x), y)$ at a particular datapoint $(x, y)$, as well as the population risk (or error)

$$R(f) := \mathbb{E}_P[l(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} l(f(x), y)\, dP(x, y). \tag{2.1}$$

This expectation may be rewritten as $R(f) = \int_{\mathcal{X}} R_x(f(x))\, dP_{\mathcal{X}}(x)$, where $P_{\mathcal{X}}$ is the marginal distribution with respect to $x$, i.e. $P_{\mathcal{X}}(A) := \int_{A \times \mathcal{Y}} dP(x, y)$ for any measurable set $A \subseteq \mathcal{X}$, and $R_x(z) := \int_{\mathcal{Y}} l(z, y)\, dP_{y|x}(y)$ is the expected loss under the conditional distribution of $y$ given $x$. Therefore, the function $f^*(x) := \arg\min_{y \in \mathcal{Y}} R_x(y)$ is by definition the one achieving the smallest error, and is referred to as the Bayes optimum predictor. Note that for general data distributions $P$, the associated Bayes error will be strictly positive, due to the underlying uncertainty in the conditional distribution of $y$ given $x$.
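As a worked instance of the Bayes predictor, consider the squared loss $l(y, y') = (y - y')^2$ and write $R_x(z)$ for the expected loss under the conditional distribution of $y$ given $x$. Assuming $y$ has finite conditional variance (an assumption added here for the computation to make sense), the usual bias-variance identity gives the conditional mean as the minimiser, and the Bayes error as the average conditional variance:

```latex
R_x(z) = \mathbb{E}\big[(z - y)^2 \mid x\big]
       = \big(z - \mathbb{E}[y \mid x]\big)^2 + \operatorname{Var}(y \mid x),
\qquad\text{so}\qquad
f^*(x) = \mathbb{E}[y \mid x],
\qquad
R(f^*) = \mathbb{E}_x\big[\operatorname{Var}(y \mid x)\big].
```

The Bayes error therefore vanishes only when $y$ is a deterministic function of $x$, which matches the remark that it is strictly positive for general $P$.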


Model Class. While the Bayes predictor describes the optimal choice given any supervised learning task, it is an infeasible solution for several reasons. First, it is defined via the population loss, which is unknown to the Learner. Next, it defines a function $f^*$ that may be arbitrarily complex to describe, or even to approximate. Towards bridging these gaps, we first introduce our hypothesis class $\mathcal{F}$: a family of functions $f : \mathcal{X} \to \mathcal{Y}$ typically parametrised as $\mathcal{F} = \{f_\theta ;\ \theta \in \Theta\}$. Most if not all of the examples in this book will consider $\mathcal{F}$ consisting of neural network parametrisations, where $\theta$ encodes the network weights. Learning thus happens in two stages: first we decide on the function class $\mathcal{F}$, for instance by selecting a certain neural network architecture, and next we determine the best hypothesis within $\mathcal{F}$. The excess risk of a hypothesis $f \in \mathcal{F}$ is defined as the deviation from the optimal Bayes risk, $R(f) - R(f^*)$. The infimum excess risk $\inf_{f \in \mathcal{F}} R(f) - R(f^*)$ captures the underlying ability of the function class to represent the solution of interest. That said, optimizing the excess risk over a class $\mathcal{F}$ still requires oracle access to the true data distribution. How can one overcome this limitation?

Empirical Risk. Averaging the loss over the training set defines the empirical risk (or error)

$$\hat{R}(f) := \frac{1}{N} \sum_{i=1}^N l(f(x_i), y_i). \tag{2.2}$$

Since the training set $\{(x_i, y_i)\}_{i=1}^N$ is a random sample, the empirical risk is a random function, whose expectation is precisely $R(f)$. For any fixed $f$, $\hat{R}(f)$ is an average of $N$ i.i.d. random variables $l(f(x_i), y_i)$. As such, from the law of large numbers we know that $\hat{R}(f)$ converges almost surely to its expectation $R(f)$ as $N \to \infty$; moreover, under mild moment assumptions, we also have the central-limit-theorem convergence $\frac{\sqrt{N}}{\sigma}\big(\hat{R}(f) - R(f)\big) \to \mathcal{N}(0, 1)$ in distribution, where $\sigma^2 = \mathrm{Var}(l(f(x), y))$, capturing the familiar scale $O(1/\sqrt{N})$ of the statistical fluctuations. This classic asymptotic behavior can be further quantified with non-asymptotic guarantees using tools from measure concentration. For instance, assuming $|l(f(x), y)| \le C$ is a bounded random variable, Hoeffding's inequality asserts that

$$\mathbb{P}\Big( \big|\hat{R}(f) - R(f)\big| \ge t \Big) \le 2 \exp\Big( -\frac{N t^2}{2 C^2} \Big),$$

indicating that fluctuations $t \gg C/\sqrt{N}$ are extremely unlikely.
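The concentration statement above can be checked numerically for a single fixed hypothesis. The following is a minimal sketch under illustrative assumptions that are not in the text: the point-wise losses are taken to be Bernoulli with mean 0.3 (for example, the 0-1 loss of a fixed classifier with population risk 0.3), so that $C = 1$, and the bound is used in the form $2\exp(-N t^2 / (2C^2))$.

```python
import numpy as np

def hoeffding_bound(n_samples, t, c):
    """Two-sided Hoeffding bound for the mean of n i.i.d. variables with |value| <= c."""
    return 2.0 * np.exp(-n_samples * t**2 / (2.0 * c**2))

def deviation_frequency(n_samples, t, risk=0.3, n_trials=20000, seed=0):
    """Fraction of simulated training sets with |empirical risk - population risk| >= t.

    Illustrative assumption: each loss value is Bernoulli(risk), so the sum of
    n_samples losses is Binomial(n_samples, risk).
    """
    rng = np.random.default_rng(seed)
    empirical_risk = rng.binomial(n_samples, risk, size=n_trials) / n_samples
    return float(np.mean(np.abs(empirical_risk - risk) >= t))

t = 0.1
sample_sizes = [50, 200, 800]
frequencies = [deviation_frequency(n, t) for n in sample_sizes]
bounds = [float(hoeffding_bound(n, t, c=1.0)) for n in sample_sizes]

for n, freq, bound in zip(sample_sizes, frequencies, bounds):
    print(f"N={n:4d}  observed frequency={freq:.4f}  Hoeffding bound={bound:.4f}")
```

The observed frequencies sit below the bound, and both shrink as $N$ grows. Note that this only concerns one hypothesis chosen before seeing the data.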

A seemingly naïve strategy to obtain good predictors in $\mathcal{F}$ is thus to minimise the empirical risk, rather than the population risk. However, before claiming victory by means of the previous simple argument, it is important to realize that we need a more powerful control of statistical fluctuations. Indeed, we have so far focused on measuring the fluctuations at a fixed hypothesis $f$, but the training process is precisely about finding the right one!

The relevant notion is not to compare how individual random variables $\hat{R}(f)$ fluctuate around their expectations $R(f)$, but how the random function $\hat{R}$ fluctuates around its expectation $R$. We thus need to control fluctuations uniformly over the hypothesis class $\mathcal{F}$, that is, $\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|$. As it turns out, such uniform control can be achieved under fairly general conditions, by appropriately constraining the hypothesis class, as we shall see next. The resulting Empirical Risk Minimization (ERM) is in fact a very powerful algorithm, and the main statistical paradigm to perform supervised learning.

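Empirical Risk Minimisation. Given a hypothesis class $\mathcal{F} = \{f_\theta ;\ \theta \in \Theta\}$ parametrised by $\theta$, the ERM defines an estimator of the form $f_{\hat{\theta}}$, where

$$\hat{\theta} \in \arg\min_{\theta \in \Theta} \hat{R}(f_\theta) = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N l(f_\theta(x_i), y_i). \tag{2.3}$$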

In words, we are searching for a parameter in the domain $\Theta$ that best explains the observed data, in the sense of the chosen loss $l$. While this certainly seems a necessary condition to get a good predictor, to what extent is it also sufficient? By expressing the population risk $R(f)$ as $R(f) = \hat{R}(f) + (R(f) - \hat{R}(f))$, we can think of a good hypothesis as a function $f$ such that (i) the empirical risk $\hat{R}(f)$ is small, and (ii) the fluctuations $R(f) - \hat{R}(f)$ are small. While the first property is what the ERM is explicitly designed to achieve, and can be readily assessed by measuring the empirical risk on the available data, the second relies on structural properties of the data distribution, the hypothesis class and the optimization algorithm, and cannot be empirically verified from the training data without these assumptions. The key observation that we develop next is that these two sources of error are typically in tension, and one needs to trade them off: this is the familiar bias-variance tradeoff.

Having small (or even zero) training error amounts to solving (approximately or exactly) in $\theta$ a system of $N$ equations: $f_\theta(x_i) = y_i$ for all $i = 1, \dots, N$. The ability to solve systems of equations, even when they are non-linear, depends fundamentally on the number of degrees of freedom at our disposal, i.e. on the intrinsic dimensionality of the model class, here roughly captured by the number of parameters encoded in $\theta$. In other words, when the number of parameters is large relative to $N$, the ERM operates in the so-called overparametrised regime, and one should expect many parameter choices that fit the data equally well. However, recall that we are using the empirical risk $\hat{R}$ as a proxy for the population risk $R$, the true functional we are interested in minimising. Thus, how can we make an informed choice amongst all parameters in $\Theta$ that explain (or interpolate) the data equally well?

Regularization. The folklore answer, Occam's razor, suggests picking the 'simplest' solutions amongst those that interpolate the data. More formally, this is achieved in statistical learning via regularisation, or capacity control. Given a non-negative penalty $\gamma : \Theta \to \mathbb{R}$ encoding the 'complexity' of the associated hypothesis $f_\theta$, we consider the regularised form of equation 2.3:

$$\hat{\theta} \in \arg\min_{\theta \in \Theta;\ \gamma(\theta) \le \delta} \hat{R}(f_\theta). \tag{2.4}$$

In words, we introduce a hyper-parameter $\delta$ controlling the maximum allowed complexity of our estimator. As the reader might anticipate, this parameter controls the fundamental tradeoff between the ability to explain the data (bias) and the danger of overfitting to it (variance). It is important to emphasize that regularization and overparametrisation can (and often do) coexist. In statistical terms, the capacity of the model class is now encoded in the single scalar $\delta$. We will see in Section 2.3 some examples of regularisation in neural networks, with their associated norms $\gamma(\theta)$. Before that, let us formalize the tradeoff achieved by regularisation.

Example: polynomial regression in univariate data. Figure 2.3 displays the familiar univariate example of fitting a non-polynomial target function observed on $n$ points with polynomials of increasing degree $r$. The bias-variance tradeoff is illustrated by the transition from the underfitting regime where $r < n$ to the overfitting regime where $r > n$. The transition $r = n$ captures a singular situation where unregularised ERM becomes ill-conditioned, and the subsequent population risk blows up; see Figure 2.5.
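The univariate example can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code behind Figure 2.3: the target `sin(2*pi*x)`, the noise level, the choice of ten equispaced points and the degrees 1, 3 and 9 are all assumptions made here for illustration. Degree 9 is the case where the number of polynomial coefficients equals the number of data points, so unregularised least-squares ERM interpolates the noisy labels.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical non-polynomial target function, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

n = 10
x_train = np.linspace(0.0, 1.0, n)
y_train = target(x_train) + 0.1 * rng.standard_normal(n)   # noisy labels
x_test = np.linspace(0.0, 1.0, 200)                         # dense grid for the population proxy

train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    # Least-squares fit, i.e. unregularised ERM with the squared loss.
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_errors[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    test_errors[degree] = float(np.mean((poly(x_test) - target(x_test)) ** 2))
    print(f"degree={degree}  train MSE={train_errors[degree]:.2e}  "
          f"test MSE={test_errors[degree]:.2e}")
```

The training error can only decrease as the degree grows, because the polynomial classes are nested; the error on the dense grid is what reveals whether the extra capacity helps or hurts.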

2.2 Decomposition of Risk

Recall that the ultimate goal of supervised learning is to perform well on unseen data, or, more precisely, to construct an estimator $\hat{f}$ from the available data that has small excess risk $R(\hat{f}) - R(f^*)$. Let us consider an estimator $\hat{f} = f_{\hat{\theta}}$ that approximately solves the regularised ERM from equation 2.4. Let us now quantify the excess risk of $\hat{f}$.

We start by adding and subtracting the best model (in terms of population risk) achievable under the constraint $\gamma(\theta) \le \delta$:

$$R(\hat{f}) - R(f^*) = \Big( R(\hat{f}) - \inf_{\gamma(\theta) \le \delta} R(f_\theta) \Big) + \underbrace{\Big( \inf_{\gamma(\theta) \le \delta} R(f_\theta) - R(f^*) \Big)}_{\text{approximation error}}. \tag{2.5}$$


Figure 2.3: Polynomial regression on univariate functions. Regressions with polynomials of small degree underfit, and error is dominated by bias, while polynomials of high degree overfit, and error is dominated by variance.

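We have written the excess error (which is non-negative by definition) as the sum of two non-negative terms. The second one, termed approximation error, measures our ability to design accurate enough hypothesis classes with the right 'inductive biases', and penalizes us for reducing the capacity of our model class via the restriction $\gamma(\theta) \le \delta$. Further, by exploiting the fact that $\hat{f}$ approximately solves equation 2.4, we decompose the first term as

$$\begin{aligned}
R(\hat{f}) - \inf_{\gamma(\theta) \le \delta} R(f_\theta)
&= \Big( R(\hat{f}) - \hat{R}(\hat{f}) \Big) + \Big( \inf_{\gamma(\theta) \le \delta} \hat{R}(f_\theta) - \inf_{\gamma(\theta) \le \delta} R(f_\theta) \Big) + \Big( \hat{R}(\hat{f}) - \inf_{\gamma(\theta) \le \delta} \hat{R}(f_\theta) \Big) \\
&\le \underbrace{2 \sup_{\gamma(\theta) \le \delta} \big| R(f_\theta) - \hat{R}(f_\theta) \big|}_{\text{statistical error}} + \underbrace{\hat{R}(\hat{f}) - \inf_{\gamma(\theta) \le \delta} \hat{R}(f_\theta)}_{\text{optimization error}},
\end{aligned}$$

since $\inf_{\gamma(\theta) \le \delta} \hat{R}(f_\theta) - \inf_{\gamma(\theta) \le \delta} R(f_\theta) \le \hat{R}(f_{\theta^*}) - R(f_{\theta^*})$, where $\theta^* \in \arg\min_{\gamma(\theta) \le \delta} R(f_\theta)$.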

Besides the previous approximation error, we now identify two other fundamental sources of error: the statistical error penalizes us for optimizing the 'wrong' risk functional, that is, the empirical ($\hat{R}$) rather than the population ($R$) objective, and is captured here through the uniform deviation between these two functions within the ball $\gamma(\theta) \le \delta$. The remaining term is the optimization error, since in general equation 2.4 does not admit closed-form solutions. Moreover, in general the empirical risk $\hat{R}$ is a non-convex function of $\theta$, which creates potential roadblocks in the associated minimisation problem. Figure 2.4 illustrates the resulting decomposition of risk.

Figure 2.4: Decomposition of risk into approximation, estimation and optimization. The approximation error arises when constraining/regularising the risk over a ball $\mathcal{F}_\delta$. Estimation error corresponds to the fact that the empirical risk $\hat{R}$ deviates from the population risk $R$. Finally, optimization error captures the inability of optimization methods to solve generic non-convex problems.

The important takeaway from this error decomposition is that in any supervised learning task, the complexity parameter $\delta$ controls a fundamental tradeoff between the different sources of error, even in overparametrised regimes where the number of parameters (or neurons) is $\gg N$. Generally speaking, a small complexity parameter $\delta$ makes the statistical error smaller (since we need to control fluctuations between two functions over a smaller domain), while making the approximation error larger. We may thus interpret this as the bias-variance tradeoff arising in classic model selection in statistics (Bottou and Bousquet 2007).


Figure 2.5: Illustration of the double-descent phenomenon, whereby the population risk of unregularised ERM undergoes a phase transition as the number of parameters crosses the number of datapoints, with a second 'descent' phase.

2.3 Over-Parametrised Regime

In the context of neural networks, or more generally models having many more parameters than data points, one may wonder how this tradeoff plays out.

The so-called double-descent phenomenon (Belkin et al. 2019), whereby highly over-parametrised neural networks exhibit excellent generalisation performance, seemingly goes against this bias-variance tradeoff; see Figure 2.5. In fact, the double-descent phenomenon is consistent with the bias-variance tradeoff described above, but it serves as a cautionary tale that the 'effective' complexity of a hypothesis class is generally not well captured by simply counting the number of parameters.

A successful learning scheme thus needs to encode the appropriate notion of regularity or inductive bias for $f$, imposed through the construction of the function class $\mathcal{F}$ and the use of regularisation. We briefly introduce this concept in the following section.

Modern machine learning operates with large, high-quality datasets, which, together with appropriate computational resources, motivate the design of rich function classes $\mathcal{F}$ with the capacity to interpolate such large data. This mindset plays well with neural networks, since even the simplest choices of architecture yield a dense class of functions.

A set $A \subset B$ is said to be dense in $B$ if its closure satisfies $A \cup \{\lim_{i \to \infty} a_i : a_i \in A\} = B$. This implies that any point in $B$ is arbitrarily close to a point in $A$. A typical Universal Approximation result shows that the class of functions represented e.g. by a two-layer perceptron, $f(x) = c^\top \mathrm{sign}(Ax + b)$, is dense in the space of continuous functions on $\mathbb{R}^d$.

The capacity to approximate almost arbitrary functions is the subject of various Universal Approximation Theorems; several such results were proved and popularised in the late 1980s and 1990s by applied mathematicians and computer scientists (see e.g. Cybenko 1989; Hornik 1991; Barron 1993; Leshno et al. 1993; Maiorov 1999; Pinkus 1999).

Figure 2.6: Multilayer Perceptrons (Rosenblatt 1958), the simplest feed-forward neural networks, are universal approximators: with just one hidden layer, they can represent combinations of step functions, allowing them to approximate any continuous function with arbitrary precision.

Universal Approximation, however, does not imply an absence of inductive bias. Given a hypothesis space $\mathcal{F}$ with universal approximation, we can define a complexity measure $\gamma : \mathcal{F} \to \mathbb{R}_+$ (the same notion that we introduced at the end of Section 2.1), and redefine our interpolation problem as

$$\hat{f} \in \arg\min_{g \in \mathcal{F}} \gamma(g) \quad \text{s.t.} \quad g(x_i) = f(x_i) \ \text{ for } i = 1, \dots, N,$$

i.e., we are looking for the most regular functions within our hypothesis class. For standard function spaces, this complexity measure can be defined as a norm. In this sense, and in light of the risk decomposition that was outlined in Section 2.2, we can revisit the Universal Approximation Theorems with a more skeptical eye: they 'just' convey the idea that the approximation error $\inf_{\gamma(f) \le \delta} R(f) - R(f^*)$ converges to zero as $\delta \to \infty$. Stated as such, they lack quantitative power to understand the resulting tradeoff between approximation and statistical errors.

In order to overcome this limitation, it is thus necessary to quantify approximation rates. In low dimensions, splines are a workhorse for function approximation. They can be formulated as above, with a norm capturing the classical notion of smoothness, such as the squared norm of second derivatives $\int_{-\infty}^{+\infty} |f''(x)|^2\, dx$ for cubic splines. This representation allows us to quantify approximation errors by controlling the norm of the spline approximation.

In the case of neural networks, the complexity measure $\gamma$ can be expressed in terms of the network weights, i.e. $\gamma(f_\theta) = \gamma(\theta)$. The $L_2$-norm of the network weights, known as weight decay, or the so-called path-norm (Neyshabur, Tomioka, and Srebro 2015) are popular choices in the deep learning literature. From a Bayesian perspective, such complexity measures can also be interpreted as the negative log of the prior for the function of interest. The regularisation term $\gamma(\theta)$ may take different forms across different ML systems. For instance, for linear models of the form $f_\theta(x) = \theta^\top \Phi(x)$ with $\theta \in \mathbb{R}^m$, we may choose an $L_p$ norm $\gamma(\theta) = \|\theta\|_p^p = \sum_{j=1}^m |\theta_j|^p$, with $p = 1, 2$ capturing the familiar settings of sparse and ridge regression, respectively.
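For the ridge case ($p = 2$), the effect of the complexity penalty can be made concrete with the closed-form solution of the penalised (Lagrangian) problem, in which $\lambda \|\theta\|_2^2$ is added to the squared-loss empirical risk. This is a minimal sketch on synthetic data: the feature matrix, labels and values of $\lambda$ below are hypothetical and only serve to show that a stronger penalty yields a solution of smaller norm, i.e. a smaller effective $\delta$.

```python
import numpy as np

def ridge_solution(features, y, lam):
    """Minimiser of ||features @ theta - y||^2 + lam * ||theta||_2^2 (closed form)."""
    m = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(m), features.T @ y)

rng = np.random.default_rng(0)
features = rng.standard_normal((30, 8))   # hypothetical feature matrix with rows Phi(x_i)
y = rng.standard_normal(30)               # hypothetical labels

lambdas = [0.01, 1.0, 100.0]
norms = [float(np.linalg.norm(ridge_solution(features, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:7.2f}  ||theta||_2={norm:.4f}")
```

Sweeping $\lambda$ from large to small traces out the same family of estimators as sweeping the constraint radius $\delta$ from small to large.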

It is important to emphasize that the regularization framework encompasses not only methods that explicitly consider regularisation in either constrained (as above) or penalized form (by adding the associated Lagrange multipliers, leading to $\min_{\theta \in \Theta} \hat{R}(f_\theta) + \lambda \gamma(\theta)$), but also methods where the regularization comes implicitly through the choice of optimization scheme. For example, it is well known that gradient descent on an under-determined least-squares objective, when initialised at zero, will choose interpolating solutions with minimal $L_2$ norm. The extension of such implicit regularisation results to modern neural networks is the subject of current studies (see e.g. Gunasekar et al. 2017; Blanc et al. 2020; Shamir and Vardi 2020; Razin and Cohen 2020).
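The least-squares instance of implicit regularisation can be verified directly. The sketch below assumes a small synthetic under-determined system (5 equations, 20 unknowns), plain gradient descent with a constant step size, and, crucially, initialisation at zero; all of these are choices made here for illustration. Under these assumptions the iterates stay in the row space of the feature matrix and converge to the pseudo-inverse solution, which is the interpolant of minimal $L_2$ norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20                                   # fewer equations than unknowns
features = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# Step size 1 / (largest singular value)^2 guarantees convergence for
# the objective 0.5 * ||features @ theta - y||^2.
step = 1.0 / np.linalg.norm(features, ord=2) ** 2
theta = np.zeros(m)                            # zero initialisation matters here
for _ in range(5000):
    theta -= step * features.T @ (features @ theta - y)

theta_min_norm = np.linalg.pinv(features) @ y  # minimum-L2-norm interpolant

print("max |theta_gd - theta_min_norm| =", np.max(np.abs(theta - theta_min_norm)))
print("training residual               =", np.linalg.norm(features @ theta - y))
```

Starting from a non-zero point would add the component of the initialisation that is orthogonal to the rows of the feature matrix, so the limit would still interpolate but would no longer have minimal norm.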

The important takeaway is that regularization and overparametrisation are compatible in modern DL via the implicit bias of gradient-based learning, and this bias, or 'prior', is often captured as a certain norm in the space of functions. All in all, a natural question arises: how to define effective priors that capture the expected regularities and complexities of real-world prediction tasks, and that enable a quantification of the approximation error?

2.4 The Curse of Dimensionality In High-Dimensional Learning

While interpolation in low dimensions (with $d = 1$, $2$ or $3$) is a classic signal processing task with very precise mathematical control of estimation errors using increasingly sophisticated regularity classes (such as spline interpolants, wavelets, curvelets, or ridgelets), the situation for high-dimensional problems is entirely different. In this generic high-dimensional regime, one is often confronted with an impossible tradeoff between 'cursed' approximation rates and 'cursed' estimation rates.

In order to convey the essence of the idea, let us consider a classical notion of regularity that can be easily extended to high dimensions: 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$, i.e. functions satisfying $|f(x) - f(x')| \le \|x - x'\|$ for all $x, x' \in \mathcal{X}$. This hypothesis says that if we perturb the input $x$ slightly (as measured by the norm $\|x - x'\|$), the output $f(x)$ is not allowed to change much. This is a weaker form of regularity than Sobolev smoothness, which additionally asks for $f$ to have $s$ derivatives that are integrable. If our only knowledge of the target function $f$ is that it is 1-Lipschitz, how many observations do we expect to require to ensure that our estimate $\hat{f}$ will be close to $f$? Figure 2.7 reveals that the general answer is necessarily exponential in the dimension $d$, signaling that the Lipschitz class grows 'too quickly' as the input dimension increases: in many applications with even modest dimension $d$, the number of samples would be bigger than the number of atoms in the universe. The situation is not better if one replaces the Lipschitz class by a global smoothness hypothesis, such as the Sobolev class $H^s(\Omega_d)$. Indeed, classic results (Tsybakov 2008) establish a minimax rate of approximation and learning for the Sobolev class of the order $\varepsilon^{-d/s}$, showing that the extra smoothness assumptions on $f$ only improve the statistical picture when $s \propto d$, an unrealistic assumption in practice. Thus, if 'classic' notions of regularity are not adapted to perform high-dimensional learning, how should one replace them with effective ones?
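To get a feeling for the numbers involved, one can tabulate the sample size $N \simeq \varepsilon^{-d}$ associated with the Lipschitz class. The helper below is a hypothetical illustration, not part of the text; it works in base-10 logarithms to avoid overflow, and the value $10^{80}$ used for comparison is only the commonly quoted rough estimate of the number of atoms in the observable universe.

```python
import math

def log10_samples_for_accuracy(eps, d):
    """Return log10(N) for N = eps^(-d), the sample size scale for the Lipschitz class."""
    return d * math.log10(1.0 / eps)

LOG10_ATOMS_IN_UNIVERSE = 80  # rough, commonly quoted order of magnitude (assumption)

eps = 0.1
for d in (1, 2, 10, 100):
    log_n = log10_samples_for_accuracy(eps, d)
    exceeds = log_n > LOG10_ATOMS_IN_UNIVERSE
    print(f"d={d:3d}  N ~ 10^{log_n:.0f}  exceeds atom count: {exceeds}")
```

Even at the coarse accuracy $\varepsilon = 0.1$, a dimension of a hundred, which is tiny compared with a typical image, already exceeds that count.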

2.5 Linear Approximation With Positive-Definite Kernels

A powerful framework to define flexible approximation spaces in high dimension is via positive semidefinite kernels: a psd kernel $k(x, x')$ is a symmetric mapping on $\mathcal{X} \times \mathcal{X}$ satisfying $\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0$ for any $\alpha \in \mathbb{R}^n$ and any collection of points $(x_1, \dots, x_n)$. A kernel captures an a priori notion of similarity in the input space, e.g. a kernel of the form $k(x, x') = x^\top x'$ in $\mathcal{X} = \mathbb{R}^d$ measures linear correlation, while a Gaussian kernel $k(x, x') = \exp\big(-\frac{1}{2\tau^2} \|x - x'\|^2\big)$ measures whether $x, x'$ are close at a certain scale, in the sense that $k(x, x') \approx 1$ whenever $\|x - x'\| \lesssim \tau$ and $k(x, x') \approx 0$ otherwise.
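The positive semidefinite condition is easy to check numerically on a finite collection of points: it says that the Gram matrix with entries $k(x_i, x_j)$ has no negative eigenvalues. Below is a minimal sketch for the Gaussian kernel $\exp(-\|x - x'\|^2 / (2\tau^2))$; the random points, the dimension and the choice $\tau = 1$ are arbitrary illustrative assumptions.

```python
import numpy as np

def gaussian_gram(points, tau):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 tau^2)) for the rows of `points`."""
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)       # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * tau**2))

rng = np.random.default_rng(0)
points = rng.standard_normal((50, 3))          # hypothetical collection (x_1, ..., x_n)
gram = gaussian_gram(points, tau=1.0)

min_eigenvalue = float(np.linalg.eigvalsh(gram).min())
alpha = rng.standard_normal(50)                # an arbitrary coefficient vector
quadratic_form = float(alpha @ gram @ alpha)   # sum_ij alpha_i alpha_j k(x_i, x_j)

print("smallest eigenvalue:", min_eigenvalue)
print("quadratic form     :", quadratic_form)
```

Up to floating-point round-off, both quantities are non-negative, and the diagonal entries equal one since $k(x, x) = 1$.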

To each positive semidefinite kernel on $\mathcal{X}$ one can associate a Hilbert space of functions defined over $\mathcal{X}$ with additional structure, the so-called reproducing property. In essence, there exists a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$ and a 'feature map' $\varphi : \mathcal{X} \to \mathcal{H}$ such that the kernel can be viewed as an inner product: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$. The reproducing property refers to the fact that in this Hilbert space, function evaluations can be seen as inner products: for each $f \in \mathcal{H}$ and $x \in \mathcal{X}$, one has $f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}}$. Given that in supervised learning one acquires information about the target function $f^*$ via (noisy) evaluations in the training set, i.e. $y_i = f^*(x_i) + \xi_i$, the reproducing property allows us to view these measurements as bona fide projections of the target onto certain directions, spanned precisely by the data features $\varphi(x_i) \in \mathcal{H}$.

Figure 2.7: We consider a Lipschitz function $f(x) = \sum_{j=1}^{2^d} z_j \phi(x - x_j)$ where $z_j = \pm 1$, $x_j \in \mathbb{R}^d$ is placed in each quadrant, and $\phi$ is a locally supported Lipschitz 'bump'. Unless we observe the function in most of the $2^d$ quadrants, we will incur a constant error in predicting it. This simple geometric argument can be formalised through the notion of Maximum Discrepancy (Luxburg and Bousquet 2004), defined for the Lipschitz class as $\kappa(d) = \mathbb{E}_{x, x'} \sup_{f \in \mathrm{Lip}(1)} \big| \frac{1}{N} \sum_l f(x_l) - \frac{1}{N} \sum_l f(x'_l) \big| \simeq N^{-1/d}$, which measures the largest expected discrepancy between two independent $N$-sample expectations. Ensuring that $\kappa(d) \simeq \varepsilon$ requires $N = \Theta(\varepsilon^{-d})$; the corresponding sample $\{x_l\}_l$ defines an $\varepsilon$-net of the domain. For a $d$-dimensional Euclidean domain of diameter 1, its size grows exponentially as $\varepsilon^{-d}$.

The resulting Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ thus consists of functions $f_\theta : \mathcal{X} \to \mathbb{R}$ that are linear in $\theta$ when expressed using the representation $f_\theta(x) = \langle \theta, \varphi(x) \rangle_{\mathcal{H}}$, and comes with a Hilbert regularization norm $\gamma(f_\theta) = \|\theta\|_{\mathcal{H}}$. The Hilbertian structure enables a precise control of all sources of error, including approximation, generalisation and optimization, even in the high-dimensional regime. The initial choice of kernel thus captures the 'inductive bias' associated with the corresponding RKHS. For instance, in the previous Gaussian kernel example, the associated RKHS norm $\|f\|_{\mathcal{H}}$ corresponds to a weighted Fourier norm of the form $\|f\|_{\mathcal{H}}^2 \propto \int |\hat{f}(\omega)|^2 \exp\big(\tfrac{\tau^2}{2} \|\omega\|^2\big)\, d\omega$, which indicates that $\mathcal{H}$ contains only infinitely differentiable smooth functions with exponential Fourier decay.

In the context of neural networks, some popular kernels are obtained by building feature maps $\varphi(x)$ using infinitely wide neural networks, and by approximating them using Random Features. For instance, by considering $k(x, x') = \mathbb{E}_{w \sim \mu}[\sigma(w^\top x)\, \sigma(w^\top x')]$, where $\mu$ is a fixed probability distribution in $\mathbb{R}^d$ and $\sigma$ an arbitrary activation function, the associated Random Feature approximation is obtained by a Monte-Carlo estimate of $k$:

$$\hat{k}(x, x') = \frac{1}{M} \sum_{m=1}^M \sigma(w_m^\top x)\, \sigma(w_m^\top x'), \quad \text{with } w_m \overset{\text{iid}}{\sim} \mu.$$

The associated (random) RKHS is thus generated by the random features $\{\sigma(w_m^\top \cdot)\}_{m \le M}$, which can be interpreted in this case as a width-$M$ shallow neural network with random weights $w_m$. The resulting approximation space is then obtained by linear combinations of these random features; in other words, a shallow neural network where only the last output layer is trained. Another popular kernel method associated with neural networks is the so-called Neural Tangent Kernel (NTK), which also considers the gradient features $\{\nabla_{w_m} \sigma(w_m^\top \cdot)\}_{m \le M}$ and captures the training dynamics in certain overparametrised scalings (the so-called lazy regime).
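The Monte-Carlo estimate of a random-feature kernel can be compared against a known closed form in one special case. The sketch below assumes, purely for illustration, that $\mu$ is the standard Gaussian on $\mathbb{R}^d$ and that $\sigma$ is the ReLU; for this pair the expectation $\mathbb{E}_w[\sigma(w^\top x)\sigma(w^\top x')]$ equals $\frac{1}{2\pi}\|x\|\|x'\|(\sin\vartheta + (\pi - \vartheta)\cos\vartheta)$, with $\vartheta$ the angle between $x$ and $x'$ (the degree-one arc-cosine kernel). The dimension, inputs and widths are hypothetical choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def random_feature_kernel(x, x_prime, weights):
    """Monte-Carlo estimate (1/M) sum_m relu(w_m . x) relu(w_m . x') over the rows of `weights`."""
    return float(np.mean(relu(weights @ x) * relu(weights @ x_prime)))

def exact_relu_gaussian_kernel(x, x_prime):
    """Closed form of E_w[relu(w . x) relu(w . x')] for w ~ N(0, I)."""
    norm_x, norm_xp = np.linalg.norm(x), np.linalg.norm(x_prime)
    cos_angle = np.clip(x @ x_prime / (norm_x * norm_xp), -1.0, 1.0)
    angle = np.arccos(cos_angle)
    return float(norm_x * norm_xp * (np.sin(angle) + (np.pi - angle) * cos_angle) / (2.0 * np.pi))

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)
x_prime = rng.standard_normal(d)
x, x_prime = x / np.linalg.norm(x), x_prime / np.linalg.norm(x_prime)  # unit-norm inputs

exact = exact_relu_gaussian_kernel(x, x_prime)
estimates = {}
for width in (10, 1000, 100000):
    weights = rng.standard_normal((width, d))    # w_m drawn i.i.d. from mu = N(0, I)
    estimates[width] = random_feature_kernel(x, x_prime, weights)
    print(f"M={width:6d}  estimate={estimates[width]:.4f}  exact={exact:.4f}")
```

The error of the estimate decays at the usual Monte-Carlo rate $O(1/\sqrt{M})$, so the width $M$ plays the role of a sample size for the kernel itself.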

While the RKHS framework offers wide flexibility through the choice of the kernel, it is inherently a linear approximation framework where there is no representation (or 'feature') learning, since the choice of the kernel (and therefore the associated feature maps) is made before seeing the training data. In other words, kernel methods are unable to extract meaningful information in high dimensions unless one already has a preconceived idea of 'where to look'. To illustrate the limitations of kernels in high dimensions, it is useful to consider isotropic kernels, i.e. $k(x, x') = k(Ux, Ux')$ for any rotation $U \in O_d$. In words, these kernels have no preconceived notion of privileged directions in the input space. Under these conditions, the kernel ridge regression estimator associated with a dataset $\{(x_i, y_i = f(x_i))\}_{i \le n}$ is essentially equivalent to performing polynomial regression on $f$: if $n \simeq d^k$, then $\hat{f}$ is the best degree-$k$ polynomial approximation of $f$ (Misiakiewicz and Montanari 2023). The model is thus unable to adapt to any form of low-dimensional structure present in the target function $f$, and is only effective to learn targets with global smoothness.

In conclusion, such a limitation is incompatible with many aspects of modern DL, such as the pre-training paradigm, in which a large training set is used to learn certain features, which are then transferred to a downstream task via a fine-tuning stage.

2.6 From Linear To Non-Linear Approximation

The alternative Feature Learning framework corresponds instead to non-linear approximation. While the previous kernel approximation framework considers a fixed family of feature maps $\{\varphi_j(x)\}_{j \le \dim(\mathcal{H})}$ and approximates a target $f^*$ as $f^* \approx \sum_j \theta_j \varphi_j$ by adjusting the weights $\theta_j$, in non-linear approximation one also adjusts the parameters $w_j$ defining the features. The additional degrees of freedom brought by adjusting $w_j$ besides $\theta_j$ bring additional adaptivity, which can become dramatic in the high-dimensional regime. In other words, the non-linear approximation aspect of feature learning improves the approximation properties of the model, at the expense of a more challenging optimization. Indeed, the joint optimization of $w_j$ and $\theta_j$ defines a non-convex problem which is generally challenging to analyse, yet often solvable in practice using gradient-descent techniques.

In summary, some of the limitations of linear approximation and associated kernel methods may be overcome by considering feature learning, which has the ability to adapt to specific structures of the input function. Yet, the question remains: how to define these trainable features in such a way that prior information on the target function is efficiently encoded?

2.7 Approximation With Shallow Networks And Beyond

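To illustrate the difficulty of this question, it is useful to consider the simplest instance of feature learning in NNs, given by shallow neural networks. This architecture defines a class of functions of the form

$$\mathcal{F} = \Big\{ f_\theta(x) = \sum_{m \le M} \alpha_m\, \sigma(w_m^\top x + b_m),\ \ \theta = \{\alpha_m, w_m, b_m\}_{m \le M} \Big\}, \tag{2.6}$$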

where $\sigma$ is an activation and $\theta$ collects the weights of the network. The model is thus building an approximation as a linear combination of ridge functions, which are non-linear functions of one-dimensional projections of the input. While this class enjoys universal approximation in the overparametrised limit $M \to \infty$, this approximation becomes quantitative in the high-dimensional regime (i.e., with an approximation error $\|f^* - f^{(M)}\| \simeq M^{-\eta}$ with $\eta > 0$ independent of the input dimension) only under strong assumptions on the target. For instance, if $f^*$ is such that it depends on a collection of low-dimensional projections of the input (see Figure 2.8), then the shallow class, when operating in the non-linear regime, is able to break this curse of dimensionality (Bach 2017a). However, in most real-world applications (such as computer vision, speech analysis, physics, or chemistry), functions of interest tend to exhibit complex long-range correlations that cannot be expressed with low-dimensional projections (Figure 2.8), making this hypothesis unrealistic. It is thus necessary to define an alternative source of regularity, by exploiting the spatial structure of the physical domain and the geometric priors of $f$, as we describe in the next chapters.
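The shallow class of equation 2.6 is short enough to write out explicitly. The following is a minimal sketch of its forward pass; the ReLU activation and the small hand-picked parameter values are assumptions made for illustration, and the function name `shallow_network` is hypothetical. The last two evaluations illustrate the ridge-function structure: a single unit only depends on the input through the projection $w_m^\top x$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_network(x, alpha, weights, biases, activation=relu):
    """Evaluate f_theta(x) = sum_m alpha_m * activation(w_m^T x + b_m)."""
    return float(alpha @ activation(weights @ x + biases))

# Hypothetical parameters theta = {alpha_m, w_m, b_m} with M = 2 units and d = 2 inputs.
alpha = np.array([2.0, 3.0])
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
biases = np.array([0.0, -1.0])

# Hidden units: relu(1 + 0) = 1 and relu(2 - 1) = 1, so the output is 2*1 + 3*1 = 5.
output = shallow_network(np.array([1.0, 2.0]), alpha, weights, biases)

# A single unit with w = (1, 0) ignores the second coordinate of the input.
single_w = np.array([[1.0, 0.0]])
out_a = shallow_network(np.array([0.5, 3.0]), np.array([1.0]), single_w, np.zeros(1))
out_b = shallow_network(np.array([0.5, -7.0]), np.array([1.0]), single_w, np.zeros(1))

print(output, out_a, out_b)
```

In the random-feature setting only `alpha` would be trained, whereas feature learning also adjusts `weights` and `biases`.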


Figure 2.8: If the unknown function $f$ is presumed to be well approximated as $f(x) \approx g(Ax)$ for some unknown $A \in \mathbb{R}^{k \times d}$ with $k \ll d$, then shallow neural networks can capture this inductive bias, see e.g. Bach 2017a. In typical applications, such dependency on low-dimensional projections is unrealistic, as illustrated in this example: a low-pass filter projects the input images to a low-dimensional subspace; while it conveys most of the semantics, substantial information is lost.