2
Foundations of Supervised Learning
Supervisedlearningrequireshandlingbothapproximation,estimation
and optimization errors.
Learning in high-dimensions requires strong notions of regularity.
Standardfunctionclassesbasedonlocalandglobalsmoothnessdonot
scale efficiently to complex real data.
Machine learning, in essence, is concerned with predicting future outcomes
frompastexperience.Inthischapter,wewillformalizethisgoal,focusing
onsupervisedlearning,arguablythesimplestandmostwidespreadmachine
learning modality. Although it maybe viewed as ‘merely’doing curve-fitting,
it turnsout that suchfitting abilitysignals a profoundinterplay between high-
dimensional statistics, optimization and mathematical analysis.
Indeed,as theinput ofamachine learningsystembecomes moreandmore
complex,as inimage,video orchemicaldata, guaranteeingthatthe modelwill
preserve itspredictive poweron unseendatabecomes increasinglyhard.This
corresponds to the colloquial curseof dimensionality, a term originally coined
byBellmaninthe50sthatcapturesanexponentialdependencybetweenthe
inputdimensionandtheunderlyingcostofrunningacertainalgorithmicor
statistic procedure to a desired performance level.
Inorder tobeatthe curseofdimensionality, itistherefore necessarytomake
modelingassumptionsthatencodecertainregularitiesaboutthedata.Inthis
chapterwewill reviewfunctionclasses basedonclassicnotionsofregularity
‘borrowed’frommathematicalanalysis,whichprovideacleanmathematical
picture of high-dimensional learning—and set the backdrop for thegeometric
function classes of Chapter 4.
2.1Main Ingredients of Supervised Learning
ThestartingpointofSupervisedmachinelearning(andalsoofourGDL
journey)isasetofNobservations{(x
i
,y
i
)}
N
i=1
,wherex
i
∈Xaretheinput
40Chapter 2
Figure 2.1
RepresentativeSupervisedLearningtasks:(i)regressiontoalow-dimensionalspace,
(ii) Classification, (iii) structured prediction between high-dimensional spaces.
observationsandy
i
∈Yarethelabelstobepredicted.Thedefiningfeature
inthissetupisthatXisahigh-dimensionalspace:onecanonicalexample
assumesX=R
d
tobeaEuclideanspace oflargedimensiond1.Thelabel
spaceYmightbeeitherlow-orhigh-dimensional;intheformercase,Y=R
encompasses regression tasks,such aspredicting thefree energy ofa chemical
compound,aswellasclassificationtasksY={1,K},suchasimageclassifi-
cation. Inthe latter, Y≈Xcaptures so-calledstructured prediction problems,
wheretheoutputsharessomecharacteristicswiththeinput;forexamplein
medical imaging applications, the outputmightindicate suspiciousregions on
anIRM,orindynamicalsystems,whereY=Xandthegoalistopredicta
high-dimensional state X
t+1
∈Xfrom the current state X
t
∈X.
DataDistributionInordertoformalizethepredictivepowerofamodel,
itisnecessarytomodelhowdata(bothpastandfuture)isgenerated.The
standard assumption in supervisedlearning is that (x
i
,y
i
) are drawn i.i.d. from
anunderlyingdataprobabilitydistributionPdefinedoverX×ばつY,which
willalsobeusedtogeneratefuturedata.Itisimportanttoemphasizethat
this is a simplifying assumptionthatis madetoestablishinsightfultheoretical
guaranteesof futureperformance(or generalisation,aswewillseenext),but
many ML algorithmscan operate and produce useful results in absenceof this
i.i.d.property
1
.Anotherimportantremarkisthatthisdatadistributionisnot
knowntotheLearner,soitcannotbeleveragedduringthetrainingprocess;
instead, it has intrinsic theoretical value to analyse the prediction performance
of different learning algorithms.
Foundations of Supervised Learning41
Figure 2.2
CartoonillustrationofthejointdistributiondefiningaSupervisedLearningproblem.
The target function in this example is the conditional expectation f(x)=E
P
[y|x].
Loss FunctionBesides the data distribution, anotheressential component to
assessa MLmodelis thelossfunctionl:Y×ばつY→R,a non-negativefunction
definedover thelabel spacesatisfyingl(y,y)=0foranyy∈Y.For classifica-
tiontasks,wemayusel(y,y
)=1
y=y
,whileforregressionthesquaredloss
l(y,y
)=(yy
)
2
istheubiquitouschoice,owingtoitsfundamentalrolein
statisticsandsignalprocessing.Now,givenanyfunctionf:X→Y,thisloss
canthenbeusedtoassessthepoint-wiseerrorl(f(x),y)ataparticulardatapoint
(x,y), as well as the population risk (or error)
R(f):=E[l(f(x),y)]=
Z
X×ばつY
l(f(x),y)dP(x,y) .(2.1)
Thisexpectationmaybere-writtenasR(f)=
R
X
R
x
(f(x))dP
x
(x),whereP
x
isthemarginaldistributionwithrespecttox,i.e.P
x
(A):=
R
A×ばつY
dP(x,y)for
anymeasurablesetA⊆X,andR
x
(z):=
R
Y
l(z,y)dP
y|x
(y)istheexpectedloss
undertheconditionaldistributionof ygivenx
2
.Therefore,thefunctionf
(x):=
argmin
y∈Y
R
x
(y)is bydefinitiontheoneachievingsmallesterror,andreferred
astheBayesoptimumpredictor.NotethatforgeneraldatadistributionsP,
theassociatedBayeserrorwillbestrictlypositive,duetotheunderlying
uncertainty in the conditional distribution of y given x.
42Chapter 2
ModelClassWhiletheBayespredictordescribestheoptimalchoicegiven
anysupervisedlearningtask,it isanunfeasiblesolutiondueto severalreasons.
First,itisdefinedviathepopulationloss,whichisunknowntotheLearner.
Next,itdefinesafunctionf
thatmaybearbitrarilycomplextodescribe,
oreventoapproximate.Towardsbridgingthesegaps,wefirstintroduceour
hypothesisclassF:afamilyoffunctionsf:X→Ytypicallyparametrised
asF={f
θ
;θΘ}.MostifnotalltheexamplesinthisbookwillconsiderF
consistingofneuralnetworkparametrisations,whereθencodesthenetwork
weights. Learning thushappens in two stages: Firstwe decide on thefunction
classF,forinstancebyselectingacertainneuralnetworkarchitecture,and
next wedetermine the best hypothesis within F. Theexcess risk of a hypothe-
sis f∈Fis defined asthe deviation from theoptimal Bayes risk R(f)–R(f
).
Theinfimumexcessriskinf
f∈F
R(f)–R(f
)capturestheunderlyingability
ofthefunctionclasstorepresentthesolutionofinterest.Thatsaid,optimiz-
ingtheexcessriskoveraclassFstillrequiresoracleaccesstothetruedata
distribution. How can one overcome this limitation?
Empirical RiskAveraging theloss over thetraining set definesthe empiri-
cal risk (or error)
ˆ
R(f):=
1
N
N
X
i=1
l(f(x
i
),y
i
) .(2.2)
Since the training set{(x
i
,y
i
)}
i
is a random sample,the empirical risk is a ran-
domfunction,whoseexpectationispreciselyR(f).Foranyfixedf,
ˆ
R(f)is
anaverageofni.i.d.randomvariablesl(f(x
i
),y
i
).Assuch,fromthelawof
largenumbersweknowthat
ˆ
R(f)convergesalmostsurelytoitsexpectation
R(f)asN →∞;moreover,undermildmomentassumptions,wealsohave
thecentral-limit-theoremconvergence
q
N
σ
2
f
(
ˆ
R(f)–R(f))→N(0,1),where
σ
2
f
=Var(l(f(x),y)), capturingthefamiliarscaleO(1/
N) ofthestatisticalfluc-
tuations.Thisclassicasymptoticbehaviorcanbefurtherquantifiedwithnon-
asymptoticguaranteesusingtoolsfrommeasureconcentration.Forinstance,
assuming|l(f(x),y)|C
f
isaboundedrandomvariable,Hoeffding’sinequality
asserts that
P
ˆ
R(f)–R(f)
t
2exp
1
2
C
–2
f
t
2
N
,
indicating that fluctuations t
1
N
are extremely unlikely.
AseeminglynaïvestrategytoobtaingoodpredictorsinFisthustomin-
imisetheempiricalrisk,ratherthanthepopulationrisk.However,before
claiming victoryby meansof theprevious simple argument,itis importantto
realize thatwe needa more powerfulcontrol ofstatistical fluctuations. Indeed,
Foundations of Supervised Learning43
wehavesofarfocusedonmeasuringthefluctuationsatafixedhypothesisf,
but the training process is precisely about finding the right one!
Therelevantnotion isnotto comparehowindividualrandomvariables
ˆ
R(f)
fluctuate around their expectations R(f), but how the random function
ˆ
Rfluc-
tuatesaround itsexpectationR.Wethus needtocontrolfluctuations uniformly
inthesupportF,thatis,sup
f∈F
|
ˆ
R(f)–R(f)|.Asitturnsout,suchuniform
control can beachieved under fairly general conditions,byappropriately con-
strainingthehyptothesisclass,asweshallseenext.TheresultingEmpirical
RiskMinimization(ERM)isinfactaverypowerfulalgorithm,andthemain
statistical paradigm to perform Supervised Learning.
EmpiricalRiskMinimisationGivenahypothesisclassF={f
θ
;θΘ}
parametrised by θ, the ERM defines an estimator of the form f
ˆ
θ
, where
ˆ
θargmin
θΘ
ˆ
R(f
θ
)=
1
N
N
X
i=1
l(f
θ
(x
i
),y
i
) .(2.3)
In words, we aresearching for a parameterinthe domain Θ that best explains
theobserveddata,inthesenseofthechosenlossl.Whilethiscertainlyseemsa
necessary condition to geta good predictor,to what extentis it also sufficient?
ByexpressingthepopulationriskR(f)asR(f)=
ˆ
R(f)+(R(f)–
ˆ
R(f)),we
canthinkofagoodhypothesisasafunctionfsuchthat(i)theempiricalrisk
ˆ
R(f)issmall,and(ii)thefluctuationsR(f)–
ˆ
R(f)aresmall.Whilethefirst
propertyiswhattheERMexplicitlyisdesignedtodo,andcanbereadily
assessedbymeasuringtheempiricalriskontheavailabledata,thesecond
relies on structural properties ofthedatadistribution, the hypothesis class and
the optimization algorithm, and can notbeempirically verified from the train-
ingwithouttheseassumptions.Thekeyobservationthatwedevelopnextis
that thesetwo sourcesoferror aretypically intension, andone needsto trade
them off — the familiar bias-variance tradeoff.
Havingsmall(orevenzero)trainingerroramountstosolving(approxi-
mately orexactly)in θasystem ofNequations: f
θ
(x
i
)=y
i
for alli=1...N. The
abilitytosolvesystemsof equations,even whenthey arenon-linear, depends
fundamentally onthe number of degrees offreedom at ourdisposal, i.e. onthe
intrinsic dimensionality of themodel class, here roughly capturedinthe num-
berofparametersencodedinθ.Inotherwords,whenthenumberofparameters
islargerelativetoN,theERMoperatesintheso-calledoverparametrised
regime,andone shouldexpectmanyparameterchoices thatfitthe dataequally
well. However, recall that we are using the empirical risk
ˆ
Ras a proxy for the
populationriskR,thetruefunctionalweareinterestedinminimising.Thus,
44Chapter 2
howtomakeaninformedchoiceamongstallparametersinΘthatexplain
equally well (or interpolate) the data?
RegularizationThefolkloreanswer,Occam’srazor,suggeststopickthe
‘simplest’solutionsamongstthosethatinterpolatethedata.Moreformally,this
is achievedin statistical learning via regularisation,or capacity control. Given
a non-negative penaltyγ:ΘRencoding the‘complexity’ ofthe associated
hypothesis f
θ
, we consider the regularised form of equation 2.3:
ˆ
θ
δ
argmin
θΘ;γ(θ)δ
ˆ
R(f
θ
) .(2.4)
In words, weintroduce a hyper-parameter δcontrolling the maximum allowed
complexityofourestimator.Asthereadermightanticipate,thisparameter
controls the fundamentaltradeoffbetween the ability toexplain the data(bias)
and thedangerofoverfitting toit(variance). Itis importanttoemphasizethat
regularizationandoverparametrisationcan(and oftendo) coexist.In statistical
terms,thecapacity ofthe modelclassisnowencodedinthesinglescalarδ.We
willseeinSection2.3someexamplesofregularisationinNeuralNetworks,
withtheirassociatednormsγ(θ).Beforethat,letusformalizethetradeoff
achieved by regularisation.
Example:polynomialregressioninunivariatedataFigure2.3displays
thefamiliarunivariateexampleoffittinganon-polynomialtargetfunction
observedonnpointswithpolynomialsofincreasingdegreer.Thebias-
variancetradeoffisillustratedbythetransitionfromtheunderfittingregime
wherer<nto theoverfittingregimewherer>n. Thetransition r=ncaptures a
singluar situationwhereunregularisedERMbecomes ill-conditioned,andthe
subsequent population risk blows up; See Figure 2.5
2.2Decomposition of Risk
Recallthattheultimategoalofsupervisedlearningistoperformwellon
unseendata,or, moreprecisely,toconstruct anestimator
ˆ
ffromtheavailable
datathat hassmall excessrisk R(
ˆ
f)–R(f
).Let usconsider anestimator
ˆ
f=f
ˆ
θ
that approximatelysolves theregularisedERM from fromequation 2.4.Let us
now quantify the excess risk of
ˆ
f.
Westartbyaddingandsubtractingthebestmodel(intermsofpopulation
risk) achievable under the constraint γ(θ)δ:
R(
ˆ
f)–R(f
)=R(
ˆ
f)–inf
γ(θ)δ
R(f
θ
)+inf
γ(θ)δ
R(f
θ
)–R(f
)
|{z}
approximation error
.(2.5)
Foundations of Supervised Learning45
Figure 2.3
Polynomial Regressionon univariate functions.Regressions with polynomialsof small
degreeunderfit,anderrorisdominated bybias, while polynomialsofhighdegreeover-
fit, and error is dominated by variance.
Wehavewrittentheexcesserror(whichisnon-negativebydefinition)asthe
sum of two non-negative terms. Thesecondone,termed approximation error,
measuresourabilitytodesignaccurateenoughhypothesisclasseswiththe
right‘inductivebiases’,andpenalizesusforreducingthecapacityofour
modelclassviatherestrictionγ(θ)δ.Further,byexploitingthefactthat
ˆ
fapproximately solves equation 2.4, we decompose the first term as
R(
ˆ
f)–inf
γ(θ)δ
R(f
θ
) =R(
ˆ
f)–
ˆ
R(
ˆ
f)+inf
γ(θ)δ
ˆ
R(f
θ
)–inf
γ(θ)δ
R(f
θ
)+
ˆ
R(
ˆ
f)–inf
γ(θ)δ
ˆ
R(f
θ
)
2sup
γ(θ)δ
|R(f
θ
)–
ˆ
R(f
θ
)|
|{z}
statistical error
+
ˆ
R(
ˆ
f)–inf
γ(θ)δ
ˆ
R(f
θ
)
|{z}
optimization error
,
sinceinf
γ(θ)δ
ˆ
R(f
θ
)–inf
γ(θ)δ
R(f
θ
)
ˆ
R(f
θ
)–R(f
θ
),whereθ
argmin
γ(θ)δ
R(f
θ
).
Besidesthepreviousapproximationerror,wenowidentifytwootherfunda-
mentalsourcesoferror:thestatisticalerrorpenalizesusforoptimizingthe
46Chapter 2
Figure 2.4
DecompositionofRisk intoApproximation,Estimation andOptimization.The approx-
imationerrorariseswhenconstraining/regularisingtheriskover aballF
δ
.Estimation
error corresponds tothefactthat theempirical risk
ˆ
Rdeviatesfrom thepopulation risk
R.Finally,optimizationerrorcapturestheinabilityofoptimizationmethodstosolve
generic non-convex problems.
‘wrong’riskfunctional,thatis,theempirical(
ˆ
R)ratherthanthepopulation
(R)objective,andiscapturedherethroughtheuniformdeviationbetween
thesetwofunctionswithintheballγ(θ)δ.Theremainingtermistheopti-
mizationerror,sinceingeneralequation2.4doesnotadmitclosed-form
solutions.Moreover,ingeneraltheempiricalrisk
ˆ
Risanon-convexfunc-
tionofθ,whichcreatespotentialroadblocksintheassociatedminimisation
problem. Figure 2.4 illustrates the resulting decomposition of risk.
The importanttakeaway from thiserror decomposition isthat inanysuper-
visedlearningtask,thecomplexityparameterδplaysafundamentaltradeoff
betweenthedifferentsources oferror,even inoverparametrisedregimeswhere
thenumberofparameters(orneurons)isN.Generallyspeaking,asmall
complexityparameterδmakesthestatisticalerrorsmaller(sinceweneedto
controlfluctuationsbetween two functionsover asmallerdomain),whilemak-
ingtheapproximationerrorlarger.Wemaythusinterpretthisisabias-variance
tradeoffarisinginclassicmodelselectioninstatistics(BottouandBousquet
2007).
Foundations of Supervised Learning47
Figure 2.5
IllustrationoftheDouble-DescentPhenomena,wherebythepopulationriskofunreg-
ularisedERMundergoesaphasetransitionasthenumberofparameterscrossesthe
number of datapoints, with a second ‘descent’ phase.
2.3Over-Parametrised Regime
Inthecontextofneuralnetworks,or moregenerally modelshaving manymore
parameters than data-points, one may wonder how this tradeoff plays out.
Theso-calleddouble-descentphenomena(Belkinetal.2019)seemingly
goesagainstthisbias-variancetradeoff,wherebyhighlyover-parametrised
neuralnetworksexhibit excellentgeneralisationperformance;seeFigure2.5.
Infact,thedouble-descentphenomenaisconsistentwiththevariance-bias
tradeoffdescribedabove,butitservesasacautionarytalethatthe‘effec-
tive’ complexity ofa hypothesis classis generally not wellcaptured by simply
counting the number of parameters.
Asuccessfullearning schemethusneedstoencodethe appropriatenotionof
regularityorinductive biasfor f,imposedthroughtheconstructionofthe func-
tion class Fand theuse of regularisation. We briefly introduce thisconcept in
the following section.
Modern machine learning operates with large, high-quality datasets, which,
togetherwithappropriatecomputationalresources,motivatethedesignof
richfunctionclassesFwiththecapacitytointerpolatesuchlargedata.This
mindsetplayswellwithneuralnetworks,sinceeventhesimplestchoicesof
architecture yields a dense class of functions.
A setA⊂Bis said tobe densein Bifits closure satisfiesA∪{ lim
i→∞
a
i
:a
i
A}=B.ThisimpliesthatanypointinBisarbitrarilyclosetoapointinA
3
.AtypicalUniversalApproximationresultshows thattheclassoffunctions
48Chapter 2
representede.g.bya two-layerperceptron, f(x)=c
sign(Ax+b)isdensein
the space of continuous functions on R
d
.
Thecapacitytoapproximatealmostarbitraryfunctionsisthesubjectof
various UniversalApproximationTheorems; severalsuchresultswereproved
andpopularisedinthe1990sbyappliedmathematiciansandcomputerscien-
tists(seee.g.Cybenko1989;Hornik1991;Barron1993;Leshnoetal.1993;
Maiorov 1999; Pinkus 1999).
Figure 2.6
MultilayerPerceptrons(Rosenblatt1958),thesimplestfeed-forwardneuralnetworks,
areuniversalapproximators:withjustonehiddenlayer,theycanrepresentcombina-
tions of step functions, allowing to approximate anycontinuous function witharbitrary
precision.
Universal Approximation, however, doesnot implyan absence ofinductive
bias. Given a hypothesis space Fwith universal approximation, we can define
acomplexitymeasureγ:F→R
+
(thesamenotionthatweintroducedatthe
end of Section 2.1), and redefine our interpolation problem as
̃
fargmin
g∈F
γ(g)s.t.g(x
i
)=f(x
i
)fori=1,...,N,
i.e., we are looking for themostregular functions within our hypothesis class.
Forstandardfunctionspaces,thiscomplexitymeasurecanbedefinedasa
norm
4
.Inthissense,andinlightoftheRiskdecompositionthatwas outlined
inSection2.2,wecanrevisittheUniversalApproximationTheoremswitha
moreskepticaleye:they‘just’conveytheideathattheapproximationerror
Foundations of Supervised Learning49
inf
γ(f)δ
R(f)–R(f
)convergestozeroasδ→∞.Statedassuch,theylack
quantitative power to understandthe resultingtradeoffbetween approximation
and statistical errors.
Inordertoovercomethislimitation,itisthusnecessarytoquantify
approximationrates.Inlow dimensions,splinesareaworkhorseforfunction
approximation.Theycanbeformulatedasabove,withanormcapturingthe
classicalnotionofsmoothness, suchasthesquared-normofsecond-derivatives
R
+
|f
′′
(x)|
2
dxforcubicsplines.Thisrepresentationallowsustoquantify
approximation errors by controlling the norm of the spline approximation.
Inthecaseofneuralnetworks,the complexitymeasureγcanbeexpressed
intermsofthenetworkweights,i.e.γ(f
θ
)=γ(θ).TheL
2
-normofthenet-
workweights, known asweightdecay, orthe so-calledpath-norm(Neyshabur,
Tomioka,andSrebro2015)arepopularchoicesindeeplearningliterature.
FromaBayesianperspective,suchcomplexitymeasurescanalsobeinter-
pretedasthenegativelogofthepriorforthefunctionofinterest.The
regularisationtermγ(θ)maytakedifferentformsacross differentML systems.
Forinstance,forlinearmodelsoftheformf
θ
(x)=θ
Φ(x)withθR
m
,wemay
choose an L
p
norm γ(θ)=θ
p
p
=
P
m
j=1
|θ
j
|
p
, with p=1,2 capturing the familiar
settings of sparse and ridge regression, respectively.
It is important to emphasizethat the regularization framework encompasses
notonlymethodsthatexplicitlyconsiderregularisationineitherconstrained
(asabove)orpenalizedform(byaddingtheassociatedLagrangemultipliers,
leading to min
θΘ
ˆ
gR(f
θ
)+λγ(θ) ); but also methodswherethe regularization
comes implicitlythrough thechoice of optimizationscheme. For example, it is
well-knownthat gradient-descenton an under-determined least-squares objec-
tivewillchooseinterpolatingsolutionswithminimalL
2
norm.Theextension
of such implicitregularisation resultsto modern neuralnetworks is thesubject
ofcurrentstudies(seee.g.Gunasekaretal.2017;Blancetal.2020;Shamir
and Vardi 2020; Razin and Cohen 2020).
Theimportanttakeawayisthatregularization andoverparametrisationare
compatible in modern DL via the implicit bias of gradient-based learning, and
thisbias,or‘prior’,isoftencapturedasacertainnorminthespaceoffunctions.
Allinall,anaturalquestionarises:howtodefineeffective priorsthat capture
theexpectedregularities andcomplexitiesofreal-worldpredictiontasks,and
that enable a quantification of the approximation error?
2.4The Curse of Dimensionality In High-Dimensional Learning
Whileinterpolationinlow-dimensions(withd=1,2or3)isaclassicsignal
processingtaskwithveryprecisemathematicalcontrolofestimationerrors
using increasingly sophisticatedregularity classes(such as spline interpolants,
50Chapter 2
wavelets,curvelets, orridgelets), thesituationfor high-dimensionalproblems
is entirely different. In thisgeneric high-dimensional regime,one is often con-
frontedwithanimpossibletradeoffbetween‘cursed’approximationratesor
‘cursed’ estimation rates
5
.
In orderto convey theessenceof theidea,let usconsidera classicalnotion
ofregularitythatcanbeeasilyextendedtohighdimensions:1-Lipschitz- func-
tionsf:X→R,i.e.functionssatisfying|f(x)–f(x
)|≤∥xx
forallx,x
∈X.
This hypothesissays that ifwe perturb theinput x slightly (asmeasured by the
norm xx
), theoutput f(x) isnot allowed to changemuch. Thisis aweaker
formofregularitythanSobolevsmoothness,whichadditionallyasksforfto
have s derivatives that are integrable.If our only knowledge of the target func-
tion fisthat it is1-Lipschitz, how many observations dowe expect to require
to ensure that ourestimate
̃
fwill be close to f? Figure 2.7reveals that the gen-
eralanswerisnecessarilyexponentialinthedimensiond,signalingthatthe
Lipschitz classgrows‘too quickly’as theinputdimension increases:in many
applicationswitheven modestdimensiond,thenumberofsampleswouldbe
biggerthanthenumberofatomsintheuniverse.Thesituationisnotbetter
ifonereplacestheLipschitzclassbyaglobalsmoothnesshypothesis,such
astheSobolevClassH
s
(Ω
d
)
6
.Indeed,classicresults(Tsybakov 2008)estab-
lish a minimax rate of approximation and learning for the Sobolev class of the
order ε
d/s
, showing thatthe extrasmoothness assumptionson fonly improve
thestatisticalpicturewhensd,anunrealisticassumptioninpractice. Thus,
if‘classic’notionsofregularityarenotadaptedtoperformhigh-dimensional
learning, how should one replace them with effective ones?
2.5Linear Approximation With Positive-Definite Kernels
Apowerfulframeworktodefineflexibleapproximationspacesinhigh-
dimensionisviapositivesemidefinitekernels:apsdkernelk(x,x
)isasym-
metricmappingonX×ばつXsatisfying
P
n
i,j=1
α
i
α
j
k(x
i
,x
j
)0foranyαR
n
and any collection ofpoints (x
1
,...,x
n
). Akernel capturesan a-priorinotion of
similarity intheinputspace,e.ga kernelof theform k(x,x
)=x
x
in X=R
d
measureslinearcorrelation,whileaGaussiankernelk(x,x
)=exp(–
1
2τ
2
x
x
2
)measureswhetherx,x
arecloseatacertainscale,inthesensethat
k(x,x
)1 whenever xx
τand k(x,x
)0 otherwise.
To each positive semidefinite kernel on Xone can associate a Hilbert space
offunctions definedover Xwith additionalstructure,theso-calledreproduc-
ingproperty.Inessence,thereexists a(possiblyinfinite-dimensional)Hilbert
space Hand a ‘featuremap’φ:X→Hsuch that thekernelcanbe viewed as
aninnerproduct:k(x,x
)=φ(x),φ(x
)
H
.Thereproducingpropertyrefersto
Foundations of Supervised Learning51
Figure 2.7
We consideraLipschitzfunction f(x)=
P
2
d
j=1
z
j
φ(xx
j
) wherez
j
=±1, x
j
R
d
is placed
ineachquadrant,andφalocallysupportedLipschitz‘bump’.Unlessweobservethe
functioninmostofthe2
d
quadrants,wewillincurinaconstanterrorinpredicting
it. Thissimplegeometricargument canbeformalised throughthenotionofMaximum
Discrepancy(LuxburgandBousquet2004),definedfortheLipschitzclassasκ(d)=
E
x,x
sup
fLip(1)
1
N
P
l
f(x
l
)–
1
N
P
l
f(x
l
)
N
–1/d
,whichmeasuresthelargestexpected
discrepancybetweentwoindependentN-sampleexpectations.Ensuringthatκ(d)ε
requires N=Θ(ε
d
); thecorresponding sample{x
l
}
l
defines anε-net ofthe domain. For
a d-dimensional Euclidean domain of diameter 1, its size grows exponentially as ε
d
.
thefactthatinthisHilbertspace,functionevaluationscanbeseenasinner-
products:foreachf∈Handx∈X,onehasf(x)=f,φ(x)
H
.Giventhatin
supervisedlearningoneacquiresinformationaboutthetargetfunctionf
via
(noisy)evaluationsinthe trainingset,ie y
i
=f
(x
i
)+ξ
i
,thereproducingprop-
ertyallowsustoviewthesemeasurementsasbona-fideprojectionsofthetarget
onto certain directions, spanned precisely by the data features φ(x
i
)∈H.
Theresulting ReproducingKernel HilbertSpace (RKHS)Hthus consistsof
functionsf
θ
:X→Rthatarelinearinθwhenexpressedusingtherepresen-
tation f
θ
(x)=θ,φ(x)
H
, and comeswithaHilbert regularization norm γ(f
θ
)=
θ
H
.The Hilbertianstructure enablesaprecise controlofall sourcesoferror,
includingapproximation,generalisationandoptimization,eveninthehigh-
dimensionalregime.Theinitialchoiceofkernelthuscapturesthe‘inductive
bias’associatedwiththecorrespondingRKHS.Forinstance,intheprevious
Gaussiankernelexample, theassociated RKHSnormf
θ
H
correspondstoa
weightedFouriernormoftheformf
θ
2
H
R
|
ˆ
f(ω)|
2
exp(
τ
2
2
ω
2
)dω,which
indicatesthatHcontainsonlyinfinitelydifferentiablesmoothfunctionswith
exponential Fourier decay.
InthecontextofNeuralNetworks,somepopularkernelsareobtainedby
buildingfeaturemapsφ(x)usinginfinitelywideneuralnetworks,andby
52Chapter 2
approximatingthemusingRandomFeatures.Forinstance,byconsidering
k(x,x
)=E
wμ
[σ(w
x)σ(w
x
)],whereμisafixedprobabilitydistribution
inR
d
andσanarbitraryactivationfunction,theassociatedRandomFeature
approximation is obtained by a Monte-Carlo estimate of k:
ˆ
k(x,x
)=
1
M
M
X
m=1
σ(w
m
x)σ(w
m
x
) ,with w
m
iid
μ.
Theassociated(random)RKHSisthusgeneratedbytherandomfeatures
{σ(w
m
·)}
mM
,whichcanbeinterpretedinthiscaseasawidth-Mshal-
lowneuralnetworkwithrandomweightsw
m
.Theresultingapproximation
spaceisthenobtainedbylinearcombinationsoftheserandomfeatures;in
otherwords,ashallowneuralnetworkwhereonlythelastoutputlayeris
trained.Anotherpopularkernelmethodassociatedwithneuralnetworksis
theso-calledNeuralTangentKernel(NTK),whichalsoconsidersthegra-
dientfeatures{
w
σ(w
m
·)}
m
andcapturesthetrainingdynamicsincertain
overparametrised scalings (the so-called lazy regime).
WhiletheRKHSframeworkofferswideflexibilitythroughthechoiceof
thekernel,itisinherentlyalinearapproximationframeworkwherethereis
norepresentation(or‘feature’)learning,sincethechoiceofthekernel(and
therefore the associatedfeature maps) ismadebefore seeingthetraining data.
Inotherwords,kernelmethodsareunabletoextractmeaningfulinformation
inhigh-dimensionsunlessonealreadyhasapreconceivedideaof‘whereto
look’. To illustrate the limitations of kernels in high-dimensions, it is useful to
considerisotropickernels,iek(x,x
)=k(Ux,Ux
)foranyrotationU ∈O
d
.In
words, thesekernels have nopreconceivednotionofpriviledged directionsin
the inputspace.Under theseconditions,the kernelridgeregressionestimator
ˆ
f
associatedwithadataset{(x
i
,y
i
=f(x
i
)}
in
isessentiallyequivalenttoperform-
ingpolynomialregressiononf:Ifnd
k
,then
ˆ
fisthebestdegree-kpolynomial
approximationoff(MisiakiewiczandMontanari2023).Themodelisthus
unable toadapt toany formoflow-dimensionalstructure presentin thetarget
function f, and is only effective to learn targets with global smoothness.
In conclusion, suchlimitation is incompatible with many aspects of modern
DL, suchasthepre-training paradigm,inwhicha largetrainingset isusedto
learncertainfeatures,whicharethentransferredtoadownstreamtaskviaa
fine-tuning stage.
2.6From Linear To Non-Linear Approximation
Thealternative FeatureLearningframework correspondsinsteadtonon-linear
approximation. Whilethepreviouskernelapproximation framework considers
afixedfamilyoffeaturemaps{φ
w
j
(x)}
jdim(H)
andapproximatesatargetf
Foundations of Supervised Learning53
as f
P
j
θ
j
φ
w
j
by adjustingthe weightsθ
j
, innon-linear approximationone
also adjuststhe parametersw
j
defining thefeatures. Theadditionaldegrees of
freedombroughtbyadjustingw
j
besidesθ
j
bringadditionaladaptivity,that
can becomedramatic inthe high-dimensional regime. In otherwords, thenon-
linearapproximationaspectoffeaturelearningimprovestheapproximation
propertiesofthemodel,attheexpenseofamorechallengingoptimization.
Indeed,thejointoptimizationofw
j
andθ
j
definesanon-convexproblem
which isgenerally challengingto analyse,yet oftensolvablein practiceusing
gradient-descent techniques.
In summary, some of the limitationsof linear approximation andassociated
kernelmethodsmaybeovercomebyconsideringfeaturelearning,whichhas
theabilitytoadapttospecificstructuresoftheinputfunction.Yet,theques-
tionremains:howtodefinethesetrainablefeaturesinsuchawaythatprior
information on the target function is efficiently encoded?
2.7Approximation With Shallow Networks And Beyond
To illustratethe difficulty ofthisquestion, itisuseful toconsider thesimplest
instanceoffeaturelearninginNNs,givenbyshallowneuralnetworks.This
architecture defines a class of functions of the form
F=
f
θ
(x)=
X
mM
α
m
σ(w
m
x+b
m
) ,θ={α
m
,w
m
,b
m
}
m
,(2.6)
where σisanactivation and θcollectsthe weights ofthenetwork. The model
isthusbuildinganapproximationasalinearcombinationofridgefunctions,
whicharenon-linearfunctionsofone-dimensionalprojectionsoftheinput.
Whilethisclassenjoys universalapproximationintheoverparametrisedlimit
M→∞,thisapproximationbecomesquantitativeinthehigh-dimensional
regime (ie, with an approximationerror f
f
(M)
θ
∥≃M
η
with η>0 indepen-
dent ofthe inputdimension) onlyunder strongassumptions onthe target. For
instance,iff
issuchthatitdependsonacollectionoflow-dimensionalprojec-
tionsoftheinput(seeFigure2.8),thentheshallowclass,whenoperatinginthe
non-linear regime,isable tobreak thiscurse ofdimensionality (Bach2017a).
However,inmostreal-worldapplications(suchascomputervision,speech
analysis,physics,orchemistry),functionsofinteresttendtoexhibitcomplex
long-range correlationsthatcannot beexpressedwith low-dimensional projec-
tions(Figure2.8),makingthishypothesisunrealistic.Itisthusnecessaryto
define analternative sourceofregularity, byexploiting thespatial structure of
thephysicaldomainandthegeometricpriorsoff,aswedescribeinthenext
Chapters.
54Chapter 2
Figure 2.8
Iftheunknownfunctionfispresumedtobewellapproximatedasf(x)g(Ax)for
someunknownAR
k×ばつd
withkd,thenshallowneuralnetworkscancapturethis
inductivebias,seee.g.Bach2017a.Intypicalapplications,suchdependencyonlow-
dimensionalprojectionsisunrealistic,asillustratedinthisexample:alow-passfilter
projects theinputimagestoa low-dimensional subspace;whileitconveys mostofthe
semantics, substantial information is lost.
Suggested Further Reading
For acomprehensive introduction to learning theory, we refer the readerto the
recent ‘LearningTheory fromFirstPrinciples’ byFrancis Bach.Inparticular,
thetopicsoutlinedinSection2.1,Section2.2andSection2.5arepresented
fromtheground-upandwithgreatlevelofdetail.FurtheraspectsofRKHS
theoryandtheirapplicationstolearningtheoryaredetailedin‘Learningwith
Kernels,byScholkopfandSmola.Forfurthertheoreticalaspectsofstatisti-
callearningtheory,andacomprehensivereviewofmeasureconcentration,
wereferthereaderto‘High-DimensionalStatistics’,byMartinWainwright.
For topicsspecifically relatedto NeuralNetwork approximation,described in
Section2.4andSection2.7,werecommendthereader toexplore‘DeepLearn-
ingTheory’byMatusTelgarsky.Finally,thenotionsoflinearandnon-linear
Foundations of Supervised Learning55
approximation outlinedin Section2.6 areexplored indepth in‘A Wavelettour
in Signal Processing’, by Stephane Mallat (Chapter 9).
ApproximationTheoryforNeuralNetworks:Theapproximationproperties
ofgenericfullyconnectedneuralnetworkswereheavilystudiedinthe80s
and90s,culminatinginafairlycompletecharacterisation.(BaumandHaus-
sler1988)consideredtheminimalsizeofneuralnetworkswiththeabilityto
fitacertain(finite)dataset.Shortlyafter,severalworksstudiedthedenisty
ofshallowneuralnetworksinthespaceofcontinuousfunctions,e.g(Hecht-
Nielsen1987;CarrollandDickinson1989;Cybenko1989;Funahashi1989;
Hornik,Stinchcombe, andWhite 1989),undercertain assumptionsonthe acti-
vationfunction,culminatinginthe‘modern’formulationofUAT(Leshno
etal.1993)whichonlyaskstheactivationtonotbeapolynomial.Beyond
qualitativeapproximation,the quantitative study ofapproximationrates using
neuralnetworkswaspioneeredbyBarron(Barron1993)andMaiorovand
Meir(Maiorov1999;MeirandMaiorov2000).Barronalsoidentifiedthe
fundamentalseparationbetweenlinearandnon-linearapproximationusing
shallowneuralnetworks,subsequentlystudiedin(Bach2017a;Ma,Wu,et
al. 2022).
KernelsandNeuralNetworks:TheinterplaybetweenNeuralNetworks
andKernelmethodsdatesatleasttoNeal(Neal1996);seealso(Williams
1996),whereaconnectionwasmadebetweeninfinitely-wideneuralnet-
worksandcertainGaussianProcesses.(Weston,Ratle,andCollobert2008)
furtherexploredtheinterplaybetweenkernelsandNNs,withthemotivation
toleveragethetractableformalismofkernelmethodstoeasetheoptimiza-
tionchallengesofdeeparchitectures.Shortlyafter,ChoandSaul(Choand
Saul2009)computedthedot-productkernelsassociatedwithdeeparchitec-
tureswithrandomlyinitializedisotropicweights.Inparallel,RahimiandRecht
(RahimiandRecht2007)introducedrandomfeatureapproximationsforcer-
tainkernelclasses,whichinretrospectprovidedtheframeworkfortoday’s
quantitativenon-asymptotictheory(MisiakiewiczandMontanari2023;Bach
2017a, 2017b).In essence,this framework providesan explicit statistical char-
acterizationofneuralnetworkswhereonlythelastlayerisallowedtotrain.The
generalsettingofnon-linearNNapproximationisoutofthescopeofkernel
methods, withthenotable exception ofthe Neural TangentKernel from(Jacot,
Gabriel,andHongler2018).TheNTKconsidersalinearisationof theNeural
Network at its initialization, and shows that, for certain choice of initialisation
scaling, the training dynamics of the actual network and its linearisation agree
inthe overparametrisedlimit. Thisregimeis however unabletoaccountfor the
feature learning abilities of NNs (Chizat, Oyallon, and Bach 2019).
56Chapter 2
Exercises
1.Decomposition of Risk: Quadratic Function and Least Squares
ConsideralinearhypothesisclassoftheformF={f
θ
(x)=θ
φ(x)},where
φ:R
d
R
m
isafixed,finite-dimensionalfeaturemap,andθR
m
parametrisesthe
hypothesis. Consider the least-square loss, hence l(f
θ
(x),y)=(θ
φ(x)–y)
2
.
(i)By developing the square, show that for any θwe have
ˆ
R(f
θ
)–R(f
θ
)=θ
(
ˆ
ΣΣ)θ–2θ
(
ˆ
mm)+(ˆσ
2
y
σ
2
y
) ,
where
ˆ
Σ=
1
n
P
i
φ(x
i
)φ(x
i
)
, Σ=E[
ˆ
Σ],
ˆ
m=
1
n
P
i
y
i
φ(x
i
), m=E[
ˆ
m],ˆσ
2
y
=
1
n
P
i
y
2
i
, σ
2
y
=
E[ˆσ
2
y
].
(ii)AssumenowthatcovariantesareisotropicΣ=Idandtherealizablesettingy|x=
θ
φ(x). Show that the excess risk is then
ˆ
R(f
θ
)–R(f
θ
)=(θθ
)
(
ˆ
ΣΣ)(θθ
) .
(iii)Conclude that the uniform bound in the ball F
δ
={f
θ
∈F;θ∥≤δ} is
sup
f∈F
δ
|
ˆ
R(f)–R(f)|=δ
2
ˆ
ΣΣ
op
.
Inthisexample,theuniformcontroloftheexcessriskcanthusbedirectlycaptured
withtheoperatornormoftheempiricalcovariancefluctuations(noticethattheoper-
atornormA
op
=sup
θ∥≤1
Aθstillcontainsasupremumoverthedomain).The
additionalstructureoftheoperatornormofalinearoperatorleadstoatightcontrol
of theexcess riskin linearmodels usingtoolsfrom RandomMatrix Theory;seeMisi-
akiewiczandMontanari2023foran in-depthanalysisoftheexcessriskinlinearmodels
(including infinite-dimensional linear models given by RKHS).
2.Generalization Bounds with Rademacher Complexity
Reference:(Bach2024,Section4.5)Denotez=(x,y)∈Z:=X×ばつYandH=
{(x,y)7→l(y,f(x));f∈F} theclassof functions fromZto Rassociatedto the hypoth-
esisclassF.Inthisexercise,wewillobtaingeneralisationboundssup
F
{R(f)–
ˆ
R(f)}.
(i)Show that sup
F
{R(f)–
ˆ
R(f)}=sup
H
{E[h(z)]–
1
n
P
n
i=1
h(z
i
)}.
LetD={z
1
,...,z
n
}bethedataset.TheRademacherComplexityoftheclassHwith
respect to the data distribution is defined as
R
n
(H):=E
ε,D
sup
h∈H
1
n
ε,h(D)
,
whereh(D)=(h(z
1
),...,h(z
n
))andε=(ε
1
,...,ε
n
)isavectorofniidRademacher
variables, which take values ε
i
=±1 with probability 1/2.
Foundations of Supervised Learning57
(ii)ShowthatR
n
isalinearfunctional,ieR
n
(αH+α
H
)=|α|R
n
(H)+|α
|R
n
(H
)for
α,α
R,thatitistranslation-invariant,ieR
n
(H+{h
0
})=R
n
(H),andthatR
n
(H)=
R
n
(CH(H)), where CH is the convex hull.
(iii)Massart’s Lemmaforfinite Hypothesisclass: AssumeH={h
1
,...,h
m
} andthatHis
a.s. bounded, ie
1
n
P
n
i=1
h(z
i
)
2
R
2
for all h∈H. Show that R
n
(H)
q
2 logm
n
R.
(iv)Symmetrisation Show that
E
"
sup
H
{E[h(z)]–
1
n
n
X
i=1
h(z
i
)}
#
2R
n
(H) ,
andthesameboundappliestoE
sup
H
{
1
n
P
n
i=1
h(z
i
)–E[h(z)]}
.(Introduceaninde-
pendentcopyD
={z
1
,...,z
n
}ofthedatasetandusetheidentityE[h(z)]–h(z
i
)=
E[h(z
i
)–h(z
i
)|D].
Thepreviousboundcontrolstheexpectedgeneralisationerror.Toobtainhigh-
probabilitybounds,wenowconsiderMcDiarmid’sinequality,whichexploitsthea.s.
boundedness of the random variables.
Lemma 1 (McDiarmid’s inequality)Let Z
1
,...,Z
n
be independent random variables
defined over Z,and f:Z
n
Rsuch that,for all i{1,n} andallz
1
,...,z
n
,z
i
, satisfies
|f(z
1
,...,z
i–1
,z
i
,z
i+1
,...z
n
)–f(z
1
,...,z
i–1
,z
i
,z
i+1
,...z
n
)|c .
Then
P
f(Z
1
,...,Z
n
)–E[f(Z
1
,...,Z
n
)]
t
2exp(–2t
2
/(nc
2
))
(v)ApplyMcDiarmid’sinequalitybothtoRandtoR
n
toobtainthefollowinghigh-
probabilty bound: if h(z)lfor all z and all h, then withprobability greater than 1–δ,
for any h∈H,
E[h(z)]
1
n
n
X
i=1
h(z
i
)+2
ˆ
R
n
(H)+3l
r
log(2δ
–1
)
2n
,
wherewedefinedtheempiricalRademachercomplexityas
ˆ
R
n
(H):=E
ε
sup
h∈H
1
n
ε,h(D)
.
TheRademachercomplexityprovidesapowerfultooltocontrolthegeneralisation
error;see(Section4.5)forfurtherpropertiesandspecificexampleswherehypothesis
classes are given by balls in generic normed spaces.
3.Learning Lipschitz Functions in d-dimensions
Let {(x
i
,y
i
)}
i=1,...,n
with x
i
Unif([–1,1]
d
) and y
i
=f(x
i
) where fis 1-Lipschitz:
|f(x)–f(y)|≤∥xy.Showthatinordertolearnfuptouniformaccuracyε;ie
E[sup
x[–1,1]
d
|f(x)–
ˆ
f(x)|]ε,where theexpectation iswithrespect totherandomness
of the training data, it is necessary and sufficient that n=Θ(ε
d
).
4.Linear vs Non-Linear Approximation in High-Dimension
58Chapter 2
In thisexercise,we willcompare linear approximation schemes(associated with
Kernelmethods)withnon-linearapproximation(associatedwithNeuralNetworksin
the feature-learning regime).
LetF=L
2
(R
d
,ν)bethespaceofsquared-integrablefunctionswithrespectto
adatadistributionν,andconsiderB={φ
m
}
mΛ
anorthonormalbasisofF.Foreach
subsetSΛ,wedefineU
S
:=span{φ
m
;mS}.Thelinearappromationoff∈Fis
given by
ε
S
(f):= min
gU
S
fg,g
S
(f):=argmin
gU
S
fg.
Finally,givena target family F
⊂F,the minimax N-term linear approximation error
is defined as
ε
N
(F
):=inf
|S|N
sup
f∈F
ε
S
(f) .
(i)Show that g
S
(f) can be explicitly computed as g
S
(f)=
P
mS
f,φ
m
φ
m
.
Consider now an order in Λ, and S=S
N
the nested sets S
N
={1,N}.
(ii)Show that for f∈F, we have lim
N→∞
ε
N
(f)=0.
(iii)Give an example of a family F
such that ε
N
(F
)=O(N
–1
).
(iv)Showthatiff∈F
issuchthat
P
m
m
2s
|f,φ
m
|
2
<fors>1/2,thenε
N
(F
)=
O(N
s
).Thisquantifiestheprevious approximationguaranteeintheso-calledSobolev
class.
We now define the non-linear approximation error as
ε
N
(F
):=sup
f∈F
inf
|S|=N
ε
S
(f) .
(v)Show that for any Nand any F
, ε
N
(F
)ε
N
(F
).
Non-linearapproximationschemesarethusmorepowerfulthantheirlinearcoun-
terparts.Weconcludethisexercisebyillustratinghowthegapbetweenlinearand
non-linear approximation can be dramatic in the high-dimensional regime.
Forthatpurpose,considerarotationally-invariantdatadistributionν,ie,such
that forany U∈O
d
and any measureable setAR
d
we have ν(U(A))=ν(A). Consider
anorthonormalbasisofL
2
(R
d
,ν)oftheformB={φ
j,k
}
jN, kN
d,j
,wheretheorthog-
onalsubspacesV
j
=span(φ
j,k
;k=1...N
d,j
)areclosedunderrotations:iffV
j
,then
f
U
V
j
forallU.N
d,j
isthusthedimensionofeachinvariantsubspace,anditcanbe
assumed that N
d,j
d
j
for sufficiently larged. Theseinvariant spaces are often referred
as harmonics. Finally, consider a function class F
of the form
F
={f;
X
j
j
2s
P
V
j
f
2
<;f(x)=h(x·θ) for some θS
d–1
} ,
whereP
V
denotestheorthogonalprojection.F
thusconsistsofhigh-dimensional
functions whichonly depend onone-dimensionalprojectionsof the input,theso-called
single-index models, andwhose linkfunction hhaspolynomial decayof orders inthe
harmonic decomposition.
Foundations of Supervised Learning59
(vi)Showthat,foranybasis Basabove,theminimaxlinearapproximationerrorε
N
(F
)is
minimisedatS={(j,k);jj
(N)},where j
(N)is thelargestjsuchthat
P
j
j
N(d,j
)<
N.
(vii)Deduce that the minimaxlinear approximation error ε
N
(F
) has a staircase behavior:
it remains constant as long as j
(N) does not increase.
(viii)Using theasymtpotics N
d,j
d
j
, deducethat toreachapproximation errorε,oneneeds
N d
ε
–1/s
terms.
(ix)Incontrast,showthatthenon-linearapproximationerror,whenoptimizedover
harmonic bases as above
ε
N
(F
):=sup
f∈F
inf
B,|S|=N
ε
S
(f) ,
requires only N ε
–1/s
terms.
5.Universal Approximation with Fourier Analysis
[FromM.TelgarskyDLTSection2.3,andBarron1993]Inthisexercise,we
will derive the Universal Approximation Theorem forShallow Neural Networks using
Fourier Representations.
(i)VerifythatforfL
1
(R
d
)andC
1
,itsFouriertransform
ˆ
f(ξ)=
R
f(x)e
–2πix·ξ
dxisalso
in L
1
(R
d
), and moreover
Z
ξ|
ˆ
f(ξ)|dξ<.
(ii)ShowtheinverseFourierrepresentationf(x)=
R
ˆ
f(ξ)e
2πiξ·x
dξ,wheretheequality
holds point-wise.
(iii)Let
ˆ
f(ξ)=|
ˆ
f(ξ)|e
iφ
f
(ξ)
Cbe the polar representation. Show that
f(x)=
Z
cos(2π(ξ·x+φ
f
(ξ)))|
ˆ
f(ξ)|dξ.
(iv)Using the Funtamental Theorem of Calculus, show that
cos(2π(ξ·x+φ
f
(ξ)))=cos(2πφ
f
(ξ))+
Z
ξ
0
sin(2π(–b+φ
f
(–ξ)))–sin(2π(b+φ
f
(ξ)))
1(ξ·xb)db.
(v)Concludethatthereexistameasureμ(b,ξ)satisfying
R
|μ(b,ξ)|dbdξ4π
R
ξ|
ˆ
f(ξ)|dξ<
such that
f(x)=
Z
μ(b,ξ)1(x·ξb)dbdξ;
inotherwords,wecanexpressfasaninfinitely-wideshallowneuralnetworkwith
threshold activation function x7→1(x·ξb).
60Chapter 2
Notes
1
ThestudyofMLbeyondthei.i.d.assumptionisthetopicofTransfer,
Continual, or Adversarial Learning; see notes at the end of the chapter.
2
IfPadmitsadensity(whichwedenotebyPbyabuseoflanguage)with
respecttoabasemeasureμinX×ばつY,soPμ,thenP
y|x
admitsadensitywith
respecttothemarginalμ
y
(y)=
R
X
dP(x,y),withdP
y|x
(y)=Z
–1
P(x,y)dμ
y
(y),
withZ=
R
Y
P(x,y)dμ
y
(y).Wewillglossoverthemeasure-theoreticconsid-
erationswhendealingwithconditionaldistributions,andassumethroughout
thatalljointdistributionsareabsolutelycontinuouswithrespecttothebase
measure of the domain.
3
Notethatthedensenessofasetisapropertythatdependsonthechoice
ofatopology.ConsiderBthesetofsquared-integrable, boundedscalarfunc-
tions,and A⊂Btheset ofbounded,compactly-supportedfunctions. ThenA
isdenseinBunderthetopologyinducedbyL
2
(R),whereasitisnotdense
under the topology induced by L
(R).
4
Informally, anorm xcan be regardedasa"length"of vector x. ABanach
space is a complete vector space equipped with a norm.
5
‘Cursed’herereferstoratesofapproximation/estimationthathaveexpo-
nentialdependenceindimension,e.gεn
c/d
wherenreferstothestatisti-
cal/computational resources, egnumber of neurons to approximate a target, or
number of datapoints to estimate it.
6
A functionfis in theSobolevclass H
s
(Ω
d
) iffL
2
(Ω
d
) andthe generalised
s-thorderderivativeissquare-integrable:
R
|ω|
2s+1
|
ˆ
f(ω)|
2
dω<,where
ˆ
fis the
Fourier transform of f; see Chapter 6.

AltStyle によって変換されたページ (->オリジナル) /