approximation outlined in Section 2.6 are explored in depth in ‘A Wavelet Tour
of Signal Processing’, by Stéphane Mallat (Chapter 9).
Approximation Theory for Neural Networks: The approximation properties
of generic fully connected neural networks were heavily studied in the 1980s
and 1990s, culminating in a fairly complete characterisation. (Baum and
Haussler 1988) considered the minimal size of neural networks able to fit a
given (finite) dataset. Shortly after, several works studied the density of
shallow neural networks in the space of continuous functions, e.g.
(Hecht-Nielsen 1987; Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989;
Hornik, Stinchcombe, and White 1989), under certain assumptions on the
activation function, leading to the ‘modern’ formulation of the UAT (Leshno
et al. 1993), which only requires that the activation not be a polynomial.
Beyond qualitative approximation, the quantitative study of approximation
rates using neural networks was pioneered by Barron (Barron 1993) and Maiorov
and Meir (Maiorov 1999; Meir and Maiorov 2000). Barron also identified the
fundamental separation between linear and non-linear approximation using
shallow neural networks, subsequently studied in (Bach 2017a; Ma, Wu, et
al. 2022).
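For reference, the ‘modern’ formulation mentioned above can be written out as follows. This is a simplified statement that assumes a continuous activation (the original result of Leshno et al. covers a somewhat broader class), and the symbols $d$, $w$, $b$, $K$ and $\mathcal{F}_\sigma$ are notation introduced here for illustration, not necessarily the notation used elsewhere in the text.

```latex
% Simplified statement, assuming a continuous activation \sigma.
% The notation (d, w, b, K, \mathcal{F}_\sigma) is illustrative.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous, and consider the class
of shallow networks
\[
  \mathcal{F}_\sigma
  = \mathrm{span}\bigl\{\, x \mapsto \sigma(w^\top x + b) \;:\;
      w \in \mathbb{R}^d,\ b \in \mathbb{R} \,\bigr\}.
\]
Then $\mathcal{F}_\sigma$ is dense in $C(K)$, with respect to the uniform norm,
for every compact set $K \subset \mathbb{R}^d$ if and only if $\sigma$ is not
a polynomial.
```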
Kernels and Neural Networks: The interplay between Neural Networks
and Kernel methods dates at least to Neal (Neal 1996); see also (Williams
1996), where a connection was made between infinitely-wide neural networks
and certain Gaussian Processes. (Weston, Ratle, and Collobert 2008)
further explored the interplay between kernels and NNs, with the motivation
to leverage the tractable formalism of kernel methods to ease the optimization
challenges of deep architectures. Shortly after, Cho and Saul (Cho and
Saul 2009) computed the dot-product kernels associated with deep architectures
with randomly initialized isotropic weights. In parallel, Rahimi and Recht
(Rahimi and Recht 2007) introduced random feature approximations for certain
kernel classes, which in retrospect provided the framework for today’s
quantitative non-asymptotic theory (Misiakiewicz and Montanari 2023; Bach
2017a, 2017b). In essence, this framework provides an explicit statistical
characterization of neural networks where only the last layer is allowed to train. The
general setting of non-linear NN approximation is out of the scope of kernel
methods, with the notable exception of the Neural Tangent Kernel (NTK) from (Jacot,
Gabriel, and Hongler 2018). The NTK considers a linearisation of the Neural
Network at its initialization, and shows that, for a certain choice of initialisation
scaling, the training dynamics of the actual network and its linearisation agree
in the overparametrised limit. This regime is however unable to account for the
feature learning abilities of NNs (Chizat, Oyallon, and Bach 2019).
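To make the random feature idea mentioned above concrete, the following is a minimal sketch, not taken from the cited works' code, of random Fourier features for a Gaussian kernel. The choice of kernel, the function names (`gaussian_kernel`, `random_fourier_features`, `demo`) and all parameter values are assumptions made for illustration. The feature map is a one-hidden-layer network with frozen random first-layer weights and a cosine activation, so fitting a linear model on top of it corresponds to training only the last layer.

```python
import numpy as np


def gaussian_kernel(X, Y, gamma):
    """Exact Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)


def random_fourier_features(X, W, b):
    """Feature map phi(x) = sqrt(2 / m) * cos(W x + b), with W and b frozen."""
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)


def demo(n=50, d=3, m=20000, gamma=0.5, seed=0):
    """Return the largest entrywise error between exact and approximate kernels."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    # For exp(-gamma * ||x - y||^2), frequencies are drawn from N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Phi = random_fourier_features(X, W, b)
    K_exact = gaussian_kernel(X, X, gamma)
    K_approx = Phi @ Phi.T  # inner products of random features
    return float(np.max(np.abs(K_exact - K_approx)))


if __name__ == "__main__":
    print("max abs kernel error:", demo())
```

The approximation error is expected to shrink roughly like one over the square root of the number of features `m`, which is the kind of non-asymptotic statement the theory cited above makes precise.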