1
Introduction
Thelastdecadehaswitnessedanexperimental revolution indatascienceand
machinelearning,epitomisedbydeeplearningmethods.Indeed,manyhigh-
dimensionallearningtaskspreviouslythoughttobebeyondreach–suchas
computervision,playingGo,orproteinfolding–areinfactfeasiblewith
appropriatedataandcomputationalscale.Remarkably,theessenceofdeep
learningisbuiltfromtwosimplealgorithmicprinciples:first,thenotion
ofrepresentationorfeaturelearning,wherebyadapted,oftenhierarchical,
featurescapturetheappropriatenotionofregularityforeachtask,andsec-
ond, learning by gradient descent-type optimisation,typically implemented as
backpropagation.
Whilelearninggeneric functionsinhighdimensionsisa cursedestimation
problem, mosttasks ofinterest arenot generic,and comewith essentialprede-
fined regularities arisingfromthe underlyinglow-dimensionality andstructure
ofthephysicalworld.Thisbookisconcernedwithexposingtheseregulari-
tiesthroughunifiedgeometric principlesthatcanbeappliedthroughoutawide
spectrum of applications.
Exploitingtheknown symmetriesof alargesystemisa powerfulandclas-
sicalremedyagainstthecurseofdimensionality,andformsthebasisofmost
physical theories. Deep learningsystems are no exception, and since theearly
days researchers have adapted neural networks to exploit the low-dimensional
geometry arisingfrom physicalmeasurements,e.g. gridsin images,sequences
intime-series,orpositionandmomentuminmolecules,andtheirassociated
symmetries,suchastranslationor rotation.Throughoutourexposition,we will
describe thesemodels,as wellas many others, asnaturalinstances ofthe same
underlying principle of geometric regularity.
GeometricDeepLearningisa‘geometricunification’endeavourinthespirit
ofKlein’sErlangenProgrammethatservesadualpurpose.Ononehand,it
provides a commonmathematical framework tostudy theclassical successful
4Chapter 1
neuralnetworkarchitectures,suchasCNNs,RNNs,GNNs,andTransform-
ers.Wewillshowthatalloftheabovecanbeobtainedbythechoiceofa
geometricdomain,itssymmetrygroup,andappropriatelyconstructedinvari-
ant andequivariant neuralnetworklayers—what werefer to asthe ‘Geometric
Deep Learning blueprint.’Wewill exemplify instances of this blueprinton the
‘5G ofGeometric DeepLearning’: Graphs,Grids, Groups, GeometricGraphs,
andGauges.Ontheotherhand,GeometricDeepLearninggivesaconstruc-
tiveproceduretoincorporatepriorphysicalknowledgeintoneuralnetworks
andprovideaprincipledwaytobuildnewarchitectures.Withthispremise,
we shallnow explore how the ideascentral toGeometric DeepLearning have
crystallised through time.
1.1On the Shoulders of Giants
"Symmetry,aswideorasnarrowasyoumaydefineitsmeaning,isoneidea
bywhichmanthroughtheageshastriedtocomprehendandcreateorder,
beauty, and perfection." This somewhat poetic definition of symmetry is given
intheeponymousbookofthegreatmathematicianHermannWeyl(1952),
his Schwanengesang ontheeve ofretirement fromtheInstitute forAdvanced
StudyinPrinceton.Weyltracesthespecialplacesymmetryhasoccupiedin
scienceandarttotheancienttimes,fromSumeriansymmetricdesignstothe
Pythagoreanswhobelievedthecircletobeperfectduetoitsrotationalsym-
metry.Platoconsideredthefiveregularpolyhedrabearinghisnametodayso
fundamentalthattheymustbethebasicbuildingblocksshapingthematerial
world.
Yet,thoughPlatoiscreditedwithcoiningthetermσνμμετρ ́ια,whichlit-
erallytranslatesas‘samemeasure’,heuseditonlyvaguelytoconveythe
beautyofproportioninartandharmonyinmusic.Itwastheastronomerand
mathematician JohannesKepler (1611)to attemptthe firstrigorous analysisof
thesymmetricshapeofwatercrystals.Inhistreatise‘OntheSix-Cornered
Snowflake’
1
heattributedthesix-folddihedralstructureofsnowflakesto
hexagonalpackingofparticles–anideathatthoughprecededtheclear
understandingofhowmatterisformed,stillholdstodayasthebasisof
crystallography (Ball 2011).
Inmodernmathematics,symmetryisalmostunivocallyexpressedinthe
languageofgrouptheory.The originsofthistheoryareattributedtoÉvariste
Galois, who coined the term andused it to study the solvability of polynomial
equationsinthe1830s.
2
Twoothernamesassociatedwithgrouptheoryare
thoseofSophusLieandFelixKlein,whometandworkedfruitfullytogether
foraperiodoftime(Tobies2019).Theformerwoulddevelopthetheoryof
Introduction5
Plato
(ca. 370 BC)
Johannes Kepler
(1571–1630)
Figure 1.1
Platobelievedthatsymmetricpolyhedra("Platonicsolids")werethefundamental
buildingblocksofnature.JohannesKeplerattributedforthefirsttimethesix-fold
symmetryofwatercrystalstothehexagonalpackingofparticles,antedatingmodern
crystallography.
continuoussymmetries thattodaybearshisname(Liegroups); thelatterpro-
claimedgrouptheorytobetheorganisingprincipleofgeometryinhisErlangen
Programme,whichwealreadymentionedinthePreface.GiventhatKlein’s
Programmeis theinspiration forourbook,it isworthwhile tospend moretime
on its historical context and revolutionary impact.
1.1.1A Strange New Universe out of Nothing
Thefoundationsofmodern geometrywereformalisedinancient Greecenearly
2300yearsagobyEuclidinatreatisenamedtheElements(Στoιχε`ια).
Euclideangeometry(whichisstilltaughtatschoolas‘thegeometry’)was
asetofresultsbuiltuponfiveintuitiveaxiomsorpostulates.TheFifthPos-
tulate–statingthatitis possibleto passonly oneline parallelto agivenline
throughapointoutsideit–appearedlessobviousandanillustriousrowof
mathematicians broke their teeth trying to prove it since antiquity, to no avail.
Anearlyapproachtotheproblemoftheparallelsappearsintheeleventh
centuryPersian treatiseA commentaryon thedifficultiesconcerning thepostu-
lates of Euclid’s Elements by Omar Khayyam.
3
The eighteenth-century Italian
Jesuit priestGiovanniSaccheri(1733) waslikely aware of thisprevious work
judgingby thetitleof hisownworkEuclidesab omninævovindicatus (‘Euclid
clearedofeverystain’,seeFigure1.2).LikeKhayyam,heconsideredthe
summitangles ofa quadrilateralwithsidesperpendicularto thebase. Thecon-
clusion thatacute angles leadto infinitelymanynon-intersecting lines thatcan
bepassedthroughapointnotonastraightlineseemedsocounter-intuitive
6Chapter 1
thatherejected itas‘repugnatis naturælinærectæ’(‘repugnanttothenature
of straight lines’).
4
Euclid
(ca. 300 BC)
Omar Khayyam
(1048–1131)
Figure 1.2
The"FatherofGeometry",Euclid,laidthefoundationsofmoderngeometryinthe
Elements. Omar Khayyam made early attempts toprove Euclid’s fifth postulate, which
were also referenced in Saccheri’s work (Euclides ab omni nævo vindicatus).
The nineteenth centuryhas brought the realisationthat the FifthPostulate is
notessentialandonecanconstructalternativegeometriesbasedondifferent
notions ofparallelism. One suchearlyexampleis projective geometry, arising,
asthenamesuggests,inperspectivedrawingandarchitecture.Inthisgeom-
etry,pointsandlinesareinterchangeable,andtherearenoparallellinesin
the usualsense:any linesmeet ina ‘pointatinfinity.’While resultsin projec-
tivegeometry areknown since antiquity,it wasfirst systematicallystudied by
Jean-Victor Poncelet (1822)
5
; see Figure 1.3.
Gérard Desargues
(1591–1661)
Jean-Victor
Poncelet
(1788–1867)
Figure 1.3
Based on theprior work ofGérard Desargues,Jean-Victor Poncelet revived theinterest
inprojectivegeometry,oneoftheearliestexamplesofgeometriesnotrequiringthe
parallel postulate.
Introduction7
Thecreditforthefirstconstructionofatruenon-Euclideangeometryis
disputed.TheprincepsmathematicorumCarlFriedrichGaussworkedonit
around 1813 but never published anyresults.
6
The first publicationon thesub-
jectofnon-Euclideangeometrywas‘OntheOriginsofGeometry’,bythe
RussianmathematicianNikolaiLobachevsky(1829)—asdepictedinFigure
1.4. Inthiswork, heconsideredthe FifthPostulatean arbitrarylimitationand
proposedanalternativeone, thatmorethanone linecanpassthrough apoint
that isparallel to agiven one.Such a constructionrequires aspace with nega-
tive curvature —whatwenowcalla hyperbolicspace —anotionthatwasstill
not fully mastered at that time.
7
Lobachevsky’s idea appeared heretical and he
was openly deridedby colleagues.
8
A similarconstruction was independently
discovered by theHungarian JánosBolyai, who publishedit in 1832under the
name‘absolutegeometry.’ Inan earlierletter tohis fatherdated 1823,he wrote
enthusiastically about this new development: ‘Ihave discovered such wonder-
fulthingsthatIwasamazed...outofnothingIhavecreatedastrangenew
universe.’
Carl Friedrich
Gauss
(1777–1855)
János
Bolyai
(1802–1860)
Nikolai
Lobachevsky
(1792–1856)
Bernhard
Riemann
(1826–1866)
Figure 1.4
Severalmathematiciansplayedpivotalrolesinproposingearlyvariantsofnon-
Euclideangeometriesintheeighteenthcentury.Gaussreportedlyworked onthetopic
withoutpublishinganymaterialonit.Thefirstpublication—focussedonhyperbolic
geometry—came fromLobachevsky (On theOrigins ofGeometry, depictedhere), with
Bolyaihavingworkedonitconcurrently. Riemannintroducedanysuchgeometries later
on, one of which was elliptic geometry.
In themeantime,new geometriescontinued tocomeoutlike fromacornu-
copia.AugustMöbius(1827),oftheeponymoussurfacefame,studiedaffine
geometry.Gauss’studentBernhardtRiemannintroducedaverybroadclass
ofgeometries—calledtodayRiemannianishishonour—inhishabili-
tationlecture,subsequentlypublishedunderthetitleÜberdieHypothesen,
8Chapter 1
welcheder GeometriezuGrunde liegen (‘Onthe Hypotheses onwhich Geom-
etryisBased,’1854).AspecialcaseofRiemanniangeometryisthe‘elliptic’
geometryofthesphere,anotherconstructionviolatingEuclid’sFifthPostu-
late, asthere isno pointon thesphere throughwhich aline canbedrawn that
neverintersectsagivenline.Towardsthesecondhalfofthenineteenthcen-
tury,Euclid’smonopolyovergeometrywascompletelyshuttered.Newtypes
ofgeometry(Euclidean,affine,projective,hyperbolic,spherical)emerged
andbecameindependentfieldsofstudy.However,therelationshipsofthese
geometries and their hierarchy were not understood.
It wasin thisexciting but messysituationthatFelixKleincameforth,with
a geniusinsight to usegrouptheory asanalgebraic abstractionofsymmetry to
organisethe ‘geometric zoo.’
9
It appearedthatEuclidean geometryis aspecial
caseofaffinegeometry,whichisinturnaspecialcaseofprojectivegeom-
etry(or,intermsofgrouptheory,theEuclideangroupisasubgroupofthe
projective group). Klein, and independently the Italiangeometer Eugenio Bel-
trami,furthershowedthatconstant-curvaturenon-Euclideangeometries(i.e.,
thehyperbolicgeometryofLobachevsky andBolyaiandthesphericalgeom-
etry orRiemann)could beobtained asspecial casesofprojectivegeometry
10
;
alsoseeFigure1.5.MoregeneralRiemanniangeometrywithnon-constant
curvaturewasnotincludedinKlein’sunifiedgeometricpicture,andittook
anotherfiftyyearsbeforeitwasintegrated,largely thankstothework ofÉlie
Cartan in the 1920s.
Felix Klein
(1849–1925)
Eugenio Beltrami
(1835–1900)
Figure 1.5
FelixKlein’sErlangenProgrammeofferedawaytoorganiseandcategoriseexisting
geometriesaccordingtotheirsymmetries.Additionally,KleinandBeltramiindepen-
dently proved thatnon-Euclidean geometrieswithconstant curvature arespecial cases
of projective geometry.
Klein’sErlangenProgrammehashadaprofoundmethodologicalandcul-
turalimpactongeometryandmathematicsingeneral.Itwasinasensethe
Introduction9
‘second algebraisation’ ofgeometry(thefirstonebeingtheanalytic geometry
ofRenéDescartesandthemethodofcoordinatesbearinghislatinisedname
Cartesius)thatallowedtoproduceresultsimpossiblebypreviousmethods.
Category Theory,abstractingtherelationsbetweenobjectsandnow pervasive
in pure mathematics, canbe "regardedas a continuation of theKlein Erlangen
Programme,in thesense thatageometricalspacewith itsgroup oftransforma-
tionsisgeneralizedtoacategorywithitsalgebraofmappings,"inthewords
ofits creatorsSamuel Eilenberg andSaundersMac Lane—see(Marquis 2009)
and Figure 1.6.
Élie
Cartan
(1869–1951)
Samuel
Eilenberg
(1913–1998)
Saunders
Mac Lane
(1909–2005)
Figure 1.6
Whilethe ErlangenProgrammeoutlinedapath towards unifyingallgeometriesunder
acommonlens,itwas notuntiltheworkofÉlieCartanseveral decadeslaterthatthis
unification was entirelyformalised. Additionally, the ErlangenProgramme inspiredthe
creationofentirenovelfieldsofmathematics,asevidencedby‘GeneralTheoryof
NaturalEquivalences’,theoriginaltextonCategoryTheorybyEilenbergandMac
Lane.
1.1.2The Theory of Everything
Consideringprojectivegeometrythemostgeneralofall,Kleincomplained
‘howpersistentlythemathematicalphysicistdisregardstheadvantages
affordedhiminmanycasesbyonlyamoderatecultivationoftheprojective
view’(Klein1872).Hisadvocatingfortheexploitationofgeometryandthe
principles ofsymmetryin physicshasforetoldthefollowing centurythatwas
truly revolutionary for the field.
InGöttingen,
11
Klein’scolleagueEmmy Noether(1918)
12
provedthatevery
differentiablesymmetry ofthe actionof aphysical systemhas acorresponding
conservationlaw.It was byallmeans astunning result:beforehand,meticulous
experimentalobservationwasrequiredtodiscoverfundamentallawssuchas
10Chapter 1
the conservation ofenergy,andeventhen, itwas anempiricalresultnot com-
ingfromanywhere.Noether’sTheorem—"aguidingstarto20thand21st
century physics," inthe words of theNobel laureateFrank Wilczek — allowed
for example to show that the conservation of energy emerges from the transla-
tionalsymmetry oftime,aratherintuitiveideathatthe resultsofanexperiment
shouldnotdependon whetherit isconducted todayor tomorrow.Thefirstpage
of this landmark result is depicted in Figure 1.7.
Anothersymmetryassociatedwithchargeconservation,theglobalgauge
invarianceoftheelectromagneticfield,firstappearedinMaxwell’sformu-
lationofelectrodynamics(Maxwell1865);however,itsimportanceinitially
remainedunnoticed.Thesame HermannWeyl whowrotesodithyrambically
aboutsymmetryistheonewhofirstintroducedtheconceptofgaugeinvari-
ance inphysics intheearly 20thcentury,
13
emphasizing itsrole asa principle
fromwhichelectromagnetismcanbederived.Ittookseveraldecadesuntil
thisfundamentalprinciple–initsgeneralisedformdevelopedbyYangand
Mills (1954) – proved successful in providing a unified framework to describe
thequantum-mechanicalbehaviourofelectromagnetismandtheweakand
strongforces,finallyculminatingintheStandardModelthatcapturesallthe
fundamental forcesof naturebut gravity. Assuccinctlyput byanother Nobel-
winningphysicist,PhilipAnderson(1972),"itisonlyslightlyoverstatingthe
case to say that physics is the study of symmetry."
Emmy
Noether
(1882–1935)
Hermann
Weyl
(1885–1955)
Chen-Ning
Yang
(b. 1922)
Robert
Mills
(1927–1999)
Figure 1.7
TheErlangenProgrammehadsignificantspillovereffectsinphysics.Thelandmark
resultwasNoether’stheorem,whichdirectlyspecifieshowconservationlawsarise
directlyfromsymmetryconstraints.Lateron,gaugesymmetries–firstintroducedby
Weyl, then developed byYang and Mills –provedan effective abstraction for discover-
ing the presently best-known model of the physical world: the Standard Model.
Introduction11
1.2Towards Geometric Deep Learning
Animpatientreadermightwonderatthispoint,whatdoesallthisexcursion
intothehistoryofgeometryandphysics,howeverexcitingitmightbe,have
todowithdeeplearning?Aswewillsee,thegeometricnotionsofsymmetry
andinvariancehavebeenrecognisedascrucialeveninearlyattemptstodo
‘patternrecognition,’anditisfairtosaythatgeometryhasaccompaniedthe
nascent fieldofartificialintelligence fromitsverybeginning. Whileitishard
toagreeonaspecificpointintimewhen‘artificialintelligence’wasbornas
a scientificfield(atthe end,humanshave beenobsessedwithcomprehending
intelligenceandlearningfromthedawnofcivilisation),andeventhehistory
ofdeeplearningisdisputed,wewilltryalessriskytaskoflookingatthe
precursorsofgeometricdeeplearning—themaintopicofourbook.This
history can be packed into less than a century.
1.2.1Early Neural Networks and the AI Winter
Bythe1930s,ithasbecomeclearthatthemindresidesinthebrain,and
researcheffortsturnedtoexplainingbrainfunctionssuchasmemory,per-
ception,andreasoning,intermsofbrainnetworkstructures.McCullochand
Pitts(1943)arecreditedwiththefirstmathematicalabstractionoftheneu-
ron,showingitscapabilitytocomputelogicalfunctions.Justayearafterthe
legendaryDartmouthCollegeworkshopthatcoinedtheveryterm‘artificial
intelligence,’
14
anAmericanpsychologistFrankRosenblatt(1957)fromCor-
nellAeronauticallaboratoryproposedaclassofneuralnetworkshecalled
‘perceptrons’
15
,cf.Figure1.8.Perceptrons,firstimplementedonadigital
machineandthenindedicatedhardware,managedtosolvesimplepattern
recognition problems such as classification of geometric shapes. However, the
quick rise of ‘connectionism’ (how AI researchers working on artificial neural
networkslabeledthemselves)receivedabucketofcoldwaterintheformof
thenow infamousbookPerceptronsbyMinskyandPapert(1969);theneural
network model they used is depicted in Figure 1.9.
Inthedeeplearningcommunity,itiscommontoretrospectivelyblame
MinskyandPapertfortheonsetofthefirst‘AIWinter,’whichmadeneural
networksfalloutoffashionforoveradecade.Atypicalnarrativementions
the ‘XOR Affair,’ a proofthat perceptrons wereunable to learnevenvery sim-
ple logicalfunctions asevidence of theirpoor expressive power. Somesources
evenaddapinchofdramarecallingthatRosenblattandMinskywenttothe
sameschoolandeven allegingthatRosenblatt’sprematuredeathinaboating
accident in 1971 was a suicide intheaftermath ofthe criticismof hiswork by
colleagues.
12Chapter 1
SENSORY
UNITS
(S-UNITS)
RETINAL
UNITS
CIRCUITS
R
1
3
R
2
R
4
R
5
R
6
R
7
R
8
R
RETNA
ASSOCIATION
UNITS
(A-UNITS)
RESPONSE
UNITS
(R-UNITS)
Frank Rosenblatt
(1928–1971)
Figure 1.8
ThePerceptronproposedbyFrankRosenblatt(1957)wasoneofthesimplestneural
network architectures.
The realityisprobablymoremundane andmorenuancedatthe sametime.
First,afarmoreplausiblereasonforthe‘AIWinter’intheUSAisthe1969
Mansfield Amendment, which requiredthemilitarytofund "mission-oriented
directresearch,ratherthanbasicundirectedresearch."Sincemanyeffortsin
artificial intelligence at the time, includingRosenblatt’s research, were funded
bymilitaryagenciesanddidnotshowimmediateutility,thecutinfunding
hashaddramaticeffects.Second,neuralnetworksandartificialintelligence
ingeneralwereover-hyped:itisenoughtorecalla1958NewYorkerarticle
callingperceptronsa"firstseriousrival tothehumanbraineverdevised"and
"remarkablemachines"thatwere"capableofwhatamountstothought,"
16
or
theoverconfidentMITSummerVisionProjectexpectinga"constructionofa
significantpartofavisualsystem"andachievingtheabilitytoperform"pattern
recognition"
17
during one summer term of1966.A realisationby theresearch
community that initialhopes to ‘solve intelligence’ hadbeen overly optimistic
was just a matter of time.
Ifone,however,looksintothesubstanceofthedispute,itisapparentthat
whatRosenblattcalleda‘perceptron’isratherdifferentfromwhatMinsky
and Papert understoodunder thisterm. MinskyandPapertfocusedtheir analy-
sis andcriticismon anarrow classof single-layerneuralnetworks they called
‘simpleperceptrons’ (andwhatistypicallyassociated withthistermin modern
times,see Figure1.8)that computeaweighted linearcombinationof theinputs
followedbyanonlinearfunction.
18
Ontheotherhand, Rosenblattconsidered a
broader classof architectures thatantedated many ideas ofwhat wouldnowbe
considered‘modern’deeplearning,includingmulti-layerednetworkswithran-
domandlocalconnectivity.
19
Rosenblattcouldhaveprobablyrebuttedsome
Introduction13
Marvin Minsky
(1927–2016)
Seymour Papert
(1928–2016)
Figure 1.9
TheinfluentialbookPerceptronsbyMinskyandPapert(1969)consideredsimple
single-layerneuralnetworksdepictedhere.Itwasperhapstheearliestgeometric
approach to learning, including the introduction of group invariance.
of the criticism concerning the expressive power of perceptrons had he known
aboutthe proofoftheThirteenthHilbertProblem
20
byVladimir Arnold(1956)
andAndreyKolmogorov(1957)establishingthatacontinuousmultivariate
functioncanbewrittenasasuperpositionofcontinuousfunctionsofasingle
variable.TheArnold–Kolmogorovtheoremwasaprecursorofasubsequent
classofresultsknownas‘universalapproximationtheorems’formulti-layer
(or ‘deep’) neural networks that put these concerns to rest.
While mostremember thebookofMinsky and Papertforthe roleitplayed
in cutting the wingsof theearly dayconnectionistsand lamentthelost oppor-
tunities, animportant overlooked aspect isthatitforthefirst timepresented a
geometricanalysis oflearning problems.This factisreflectedintheveryname
ofthe book,subtitled Anintroduction toComputational Geometry.At thetime
it was aradically new idea, and Block(1970) in his critical review ofthe book
(which essentially stood to the defense of Rosenblatt),wondered whether "the
newsubjectof‘ComputationalGeometry"’would"growintoanactivefield
of mathematics, or will itpeter outina miscellany of dead ends?"(theformer
happened: computational geometry is now a well-established field
21
).
Furthermore,MinskyandPapertprobablydeservethecreditforthefirst
introductionofgrouptheoryintotherealmofmachinelearning:theirGroup
Invariancetheoremstated thatifaneural networkisinvarianttosomegroup,
thenitsoutputcouldbe expressedasfunctionsofthe orbitsofthegroup(we
will define thesetermsinsubsequent Chapters). Whilethey usedthisresultto
provelimitationsofwhataperceptroncouldlearn,similarapproacheswere
subsequentlyusedbyAmari(1978)fortheconstructionofinvariantfeatures
inpatternrecognitionproblems.Anevolutionoftheseideasintheworksof
14Chapter 1
Sejnowski,Kienker,andHinton(1986)andShawe-Taylor(1989,1993),unfor-
tunately rarely cited today, provided the foundations of the geometric learning
blueprint described in this book.
1.2.2Universal Approximation and the Curse of Dimensionality
Theaforementionednotionofuniversalapproximationdeservesfurtherdiscus-
sion. The term refers tothe ability to approximate any continuous multivariate
functiontoanydesiredaccuracy;inthemachinelearning literature,thistype
ofresultsisusuallycreditedtoCybenko(1989)andHornik(1991).Figure
1.10outlinesthepioneeringresearchersworkingonuniversalapproximation
results.
David
Hilbert
(1862–1943)
Andrey
Kolmogorov
(1903–1987)
Vladimir
Arnold
(1937–2010)
George
Cybenko
Kurt
Hornik
Figure 1.10
DavidHilbert’sThirteenthProblem,provenbyAndreyKolmogorovandVladimir
Arnold,wasoneofthefirstresultsshowingthatamultivariatecontinuousfunction
couldbeexpressedasacompositionandsumofsimpleone-dimensionalfunctions.
George Cybenkoand KurtHornikprovedresultsspecifictoneural networks,showing
thataperceptron withone hiddenlayercan approximateanycontinuousfunction toany
desired accuracy.
Unlike‘simple’(single-layer)perceptronscriticisedbyMinskyandPapert
(1969),multilayerneuralnetworksareuniversalapproximatorsandthusare
anappealingchoiceofarchitectureforlearningproblems.Wecanthinkof
supervisedmachinelearningasafunctionapproximationproblem:giventhe
outputsofsomeunknown function(e.g.,animageclassifier)onatraining set
(e.g.,imagesofcatsanddogs),wetrytofindafunctionfromsomehypoth-
esisclassthatfitswellthetrainingdataandallowstopredicttheoutputson
previously unseen inputs (‘generalisation’).
Universalapproximationguaranteesthatwecanexpressfunctionsfroma
verybroadregularityclass(continuousfunctions)bymeansofamulti-layer
neuralnetwork.Inotherwords,thereexistsaneuralnetworkwithacertain
Introduction15
numberofneuronsandcertainweightsthatapproximatesagivenfunction
mappingfromtheinputtotheoutputspace(e.g.,fromthespaceofimages
to the space of labels). However, universal approximation theorems do not tell
ushowtofindsuchweights.Infact,learning(i.e.,findingweights)inneu-
ralnetworkshasbeenabigchallengeintheearlydays.Rosenblattshowed
alearningalgorithmonlyforasingle-layerperceptron;totrainmulti-layer
neural networks, IvakhnenkoandLapa (1966)used alayer-wiselearning algo-
rithmcalled‘groupmethodofdatahandling.’ ThisallowedIvakhnenko(1971)
to go as deep as eight layers — a remarkable feat for the early 1970s!
Abreakthroughcamewiththeinventionofbackpropagation,analgorithm
usingthechainruletocomputethegradientoftheweightswithrespectto
the loss function and allowed to use gradient descent-based optimisation tech-
niquestotrainneuralnetworks.Asoftoday,thisisthestandardapproachin
deeplearning.Whiletheoriginsofbackpropagationdatebacktoatleastthe
1960s,
22
the firstconvincingdemonstration ofthis methodinneuralnetworks
wasinthewidelycitedNaturepaperofRumelhart,Hinton,andWilliams
(1986). The introduction of this simpleand efficientlearning method has been
a keycontributingfactorto thereturn ofneuralnetworks tothe AIscene in the
1980sand1990s.ThekeycontributorsinthisspacearediscussedinFigure
1.11.
Alexey
Ivakhnenko
(1913–2007)
Seppo
Linnainmaa
(b. 1945)
Paul
Werbos
(b. 1947)
David
Rumelhart
(1942–2011)
Figure 1.11
Four ofthemostimportantfigures inthedevelopmentof modernmethodsoflearning
viagradientdescent.Ivakhnenkoderivedalayer-wiselearningalgorithmwhichallowed
for training an eight-layerneural network in 1971. Thebackpropagation algorithm was
initially describedby Linnainmaa,andfirst deployed indeepneural networksby Wer-
bos.Rumelhart’s workofferedthefirst convincing demonstrationof themethod indeep
neural networks, which propelled the use of backpropagation in future works.
Lookingatneuralnetworksthroughthelensofapproximationtheoryhas
led some cynics to call deep learninga "glorified curve fitting." We will letthe
16Chapter 1
reader judgehow truethismaximisbytryingtoanswerthefollowing impor-
tant question: how many samples (trainingexamples) are neededto accurately
approximateafunction?Approximationtheoristswillimmediatelyretortthat
theclassofcontinuousfunctionsthatmultilayerperceptronscanrepresent
isobviouslywaytoolarge:onecanpassinfinitelymanydifferentcontinu-
ousfunctionsthroughafinitecollectionofpoints.
23
Itisnecessarytoimpose
additional regularity assumptions suchas Lipschitz continuity,
24
in whichcase
onecanprovideaboundontherequirednumberofsamples.Unfortunately,
theseboundsscaleexponentiallywithdimension—aphenomenoncolloqui-
allyknownasthe‘curseofdimensionality’
25
—whichisunacceptablein
machinelearningproblems:evensmall-scalepatternrecognitionproblems,
suchasimageclassification,dealwithinputspacesofthousandsofdimen-
sions.Ifonehadtorelyonlyonclassicalresultsfromapproximationtheory,
machinelearningwouldbeimpossible.Inourillustration,thenumberofexam-
ples ofcat anddog images thatwouldbe requiredin theoryin order tolearn to
tell themapart wouldbe way larger thanthe numberof atoms intheuniverse
26
— there are simply not enough cats and dogs around to do it.
Thestruggleofmachinelearningmethodstoscaletohighdimensionswas
brought up by the British mathematician Sir James Lighthill(1973) in a paper
that AIhistorianscallthe‘LighthillReport,’inwhichhe usedtheterm‘com-
binatorialexplosion’andclaimedthatexistingAImethodscouldonlywork
ontoyproblemsandwouldbecomeintractableinreal-worldapplications.
Lighthillfurthercomplainedthat"mostworkers inAIresearchandinrelated
fieldsconfesstoapronouncedfeelingofdisappointmentinwhathasbeen
achievedinthepasttwenty-fiveyears"and"innopartofthefieldhavethe
discoveriesmadesofarproducedthemajorimpactthatwasthenpromised."
Thesewerenotmerefrustrationsofagrumpyacademicstuckwithahard
problem:the Reportwascommissionedby theBritishScienceResearch Coun-
ciltoevaluateacademicresearchinthefieldofartificialintelligence,andits
pessimisticconclusionsresultedinfundingcutsacrossthepond.Together
withsimilardecisionsbytheAmericanfundingagencies,thisamountedto
a wrecking ball for AI research in the 1970s.
Forus,therealisationthatclassicalfunctionalanalysiscannotprovidean
adequateframeworktodealwithlearningproblemswillbethemotivation
toseekstronger,geometricformsofregularity,thatcanbeimplementedin
aparticularwiringoftheneuralnetwork—suchasthelocalconnectivityof
convolutionalneuralnetworks.It isfairtosaythatthetriumphantreemergence
of deep learning a decade ago owes, at least in part, to these insights. It is also
probablytruethattheroleofsymmetryandinvarianceasabroadorganising
Introduction17
anddesignprincipleinneural networkshas notbeengiven enoughcredit atthe
time, and us highlighting these principles here are a hindsight.
1.2.3Secrets of the Visual Cortex and the Neocognitron
Theinspirationforthefirstneuralnetworkarchitecturesofthenew‘geo-
metric’typecamefromneuroscience.Inaseriesofexperimentsthatwould
becomeclassicalandbringthemaNobelPrizeinmedicine,theduoof
HarvardneurophysiologistsDavidHubelandTorstenWiesel(1959;1962)
unveiledthestructureandfunctionofapartofthebrainresponsibleforpat-
ternrecognition—thevisualcortex.Bypresentingchanginglightpatternsto
acatandmeasuringtheresponseofitsbraincells(neurons)–asdepictedin
Figure1.12–theyshowedthattheneurons inthe visualcortexhaveamulti-
layerstructurewithlocalspatialconnectivity:acellwouldproduceresponse
onlyifcellsinitsproximity(‘receptive field’
27
)wereactivated.Furthermore,
theorganisationappearedtobehierarchical,wheretheresponsesof‘simple
cells’reactingtolocalprimitiveorientedstep-likestimuliwereaggregated
by‘complexcells,’whichproducedresponsestomorecomplexpatterns.It
was hypothesized that cells in deeper layersofvisual cortex would respond to
increasinglycomplexpatternscomposedofsimplerones,withasemi-joking
suggestionoftheexistenceofa‘grandmothercell’
28
thatreactsonlywhen
shown the face of one’s grandmother.
Electrical signal
from brain
Stimulus
Visual area
of brain
Recording electrode
David Hubel
(1926–2013)
Torsten Wiesel
(b. 1924)
Figure 1.12
The classicalexperiment ofHubel andWiesel (1962) revealed thestructure of thebrain
visual cortex and inspired a new generation of neural network architectures mimicking
its local connectivity.
Theunderstandingofthestructureofthevisualcortexhashadprofound
impactonearlyworksincomputervisionandpatternrecognition,withmul-
tipleattemptstoimitateitsmainingredients.KunihikoFukushima(1980),
atthattimearesearcherattheJapanBroadcastingCorporation,developeda
18Chapter 1
new neuralnetworkarchitecture"similartothehierarchymodelofthevisual
nervoussystemproposedbyHubelandWiesel,"whichwasgiventhename
neocognitron.
29
Theneocognitronconsistedofinterleaved S-andC-layersof
neurons (a naming convention reflecting its inspiration in the biological visual
cortex);theneuronsineachlayerwerearrangedin2Darraysfollowingthe
structureoftheinputimage(‘retinotopic’),withmultiple‘cell-planes’(fea-
ture maps in modern terminology) per layer. The S-layers were designed to be
translationallysymmetric:theyaggregatedinputsfromalocalreceptivefield
using shared learnable weights, resultingincellsinasingle cell-planehaving
receptivefieldsofthesamefunction,butatdifferentpositions.Therationale
was topickuppatterns thatcould appearanywhere in theinput. TheC-layers
were fixed andperformed local pooling(a weightedaverage), affordinginsen-
sitivityto the specific location of the pattern: a C-neuron would be activated if
any of the neurons in its input are activated.
Sincethemainapplicationoftheneocognitronwascharacterrecognition,
translationinvariance
30
wascrucial.Thispropertywasafundamentaldiffer-
encefromearlierneuralnetworkssuchasRosenblatt’sperceptron:inorder
touseaperceptronreliably,onehadtofirstnormalisethepositionofthe
inputpattern,whereasinneocognitrontheinsensitivitytothepatternposi-
tion was baked intothe architecture.Neocognitronachieved it byinterleaving
translationally-equivariantlocalfeatureextractionlayerswithpooling,cre-
atingamultiscalerepresentation—wewillrefertothisprincipleasscale
separationandstudyinsubsequentChapterswhyitcanalsohelpdealwith
a broaderclassof geometrictransformations inaddition totranslations. Com-
putationalexperimentsshowedthatFukushima’sarchitecturewasableto
successfullyrecognisecomplexpatternssuchaslettersordigits,eveninthe
presence of noise and geometric distortions.
Lookingfromthevantagepointoffourdecadesofprogressinthefield,
onefindsthattheneocognitronalreadyhadstrikinglymanycharacteristicsof
modern deep learningarchitectures: depth (Fukishimasimulated a seven-layer
networkinhispaper),localreceptivefields,sharedweights,andpooling.It
evenusedhalf-rectifier(ReLU)activationfunction,whichisoftenbelieved
tobeintroducedinrecentdeep learningarchitectures.
31
Themaindistinction
frommodernsystemswasinthewaythenetworkwastrained:neocogni-
tronwasa‘self-organised’architecturetrainedinanunsupervisedmanner,
sincebackpropagationhadstillnotbeenwidelyusedintheneuralnetwork
community.
Introduction19
1.2.4Convolutional Neural Networks
Fukushima’sdesignwasfurtherdevelopedbyYannLeCun,afreshgraduate
fromtheUniversityofParis
32
withaPhDthesisontheuseofbackpropa-
gationfortrainingneuralnetworks.Inhisfirstpost-doctoralpositionatthe
AT&TBellLaboratories,LeCunandcolleaguesbuiltasystemtorecognise
hand-writtendigitsonenvelopesinordertoallowtheUSPostalServiceto
automaticallyroutemail.Inapaperthatisnow classical,LeCunetal.(1989)
describedthefirstthree-layerconvolutionalneuralnetwork(CNN).
33
Simi-
larlytotheneocognitron,LeCun’sCNNalsousedlocalconnectivitywith
sharedweightsandpooling(Figure1.13).However,itforwentFukushima’s
morecomplexnonlinearfiltering(inhibitoryconnections)infavourofsim-
plelinearfiltersthatcouldbeefficientlyimplementedasconvolutionsusing
multiply-accumulateoperationsonadigitalsignalprocessor(DSP).
34
This
designchoice,departingfromtheneuroscienceinspirationandterminology
andmovingintotherealmofsignalprocessing,wouldplayacrucialrolein
theensuing successof deeplearning.Another key novelty ofCNN wasthe use
of backpropagation for training.
Kunihiko
Fukushima
10 output units
H2.12
H1.12
H1.1
9
layer H3
30 hidden units
layer H2
12 x 16 =192
hidden units
hidden units
12 x 64 =768
layer H1
256 input units
fully connected
300 links
fully connected
6000 links
40,000 links
from 12 kernels
5 x 5 x 8
20,000 links
from 12 kernels
5 x 5
H2.1
0
Yann LeCun
Figure 1.13
Theoriginalvariantsofconvolutionalneuralnetworkarchitectureshavebeenintro-
ducedbyFukushima andLeCun(though thename"convolutional" wouldappearlater).
Implementedon adigitalsignalprocessor,LeCun’sCNN allowedreal-time handwrit-
ten digit recognition for the first time.
LeCun’sworksshowedconvincingly thepowerofgradient-basedmethods
forcomplexpatternrecognitiontasksandwasoneofthefirstpracticaldeep
learning-based systems for computer vision.Anevolution of this architecture,
a CNN with five layers named LeNet-5 as a pun on theauthor’s name(LeCun
etal.1998),wasusedbyUSbankstoreadhandwrittencheques.Thecomputer
visionresearchcommunity,however,initsvastmajoritysteeredawayfrom
neuralnetworksandtookadifferentpath.Thetypicalarchitectureofvisual
20Chapter 1
recognitionsystemsofthefirstdecadeofthenewmillenniumwasacare-
fully hand-craftedfeature extractor(typicallydetecting interestingpoints in an
image andproviding theirlocaldescriptionina waythat isrobust toperspec-
tivetransformationsandcontrastchanges
35
)followedbyasimpleclassifier
(mostoftenasupportvectormachine(SVM)andmorerarely,asmallneural
network).
36
1.2.5Recurrent Neural Networks, Vanishing gradients, and LSTMs
WhileCNNsweremainlyappliedtomodellingdatawithspatialsymmetries
suchasimages,anotherdevelopmentwasbrewing:onethatrecognisesthat
data is often non-static, but can evolve in a temporal manner aswell. The sim-
plestformoftemporally-evolvingdataisatimeseriesconsisting ofa sequence
of steps with a data point provided at each step.
Amodelhandlingtime-seriesdatahenceneedstobecapableofmeaning-
fullyadaptingtosequencesofarbitrarylength–afeatthatCNNswerenot
triviallycapableof.ThisledtodevelopmentofRecurrentNeuralNetworks
(RNNs)inthelate1980sandearly1990s
37
,theearliestexamplesofwhich
simplyappliesasharedupdateruledistributedthroughtime(Jordan1986;
Elman1990).Ateverystep,theRNNstateisupdatedasafunctionofthe
previous state and the current input.
OneissuethatplaguedRNNsfromthebeginning wasthevanishinggradi-
entproblem.Vanishinggradientsariseinverydeep neuralnetworks—ofwhich
RNNsareoftenarepresentativeinstance,giventhattheireffectivedepthis
equal totheinput sequencelength.As sigmoidalactivations areunrolledover
manysteps,gradientsquicklyapproachzerowhenbackpropagationisper-
formed.Thiseffectstrongly limitstheinfluenceofearlier stepsinthesequence
to the predictions made in the latter steps.
AsolutiontothisproblemwasfirstelaboratedinSeppHochreiter’sDiploma
thesis(1991),carriedoutinMunichunderthesupervisionofJürgenSchmid-
huber,intheformofanarchitecturethatwasdubbedLongShort-Term
Memory (LSTM; Figure 1.14).
38
LSTMs combat the vanishing gradient prob-
lembyhavingamemorycell,withexplicitgatingmechanismsdecidinghow
much of that cell’s state to overwriteat every step — allowing one to learn the
degreeofforgettingfromdata(oralternatively,rememberingforlongtime).
Incontrast,simpleRNNsofJordan(1986)andElman(1990)performafull
overwrite at every step.
This ability to remember context for long time turned crucial in natural lan-
guageprocessingandspeechanalysisapplications,whererecurrentmodels
were shown successful applications starting from theearly 2000s (Gersand E.
Schmidhuber 2001;Graveset al.2004). However, likeit happenedwithCNNs
Introduction21
Sepp
Hochreiter
Jürgen
Schmidhuber
Figure 1.14
Thecreatorsofthelongshort-termmemorycell—thefirstrecurrentneuralnetwork
modulecapable ofdealingwith long-rangedependencies—HochreiterandSchmidhu-
ber, pictured alongside one of the earliest schematic depictions of their creation.
in computervision, thebreakthrough would needto wait foranother decadeto
come.
1.2.6Gating Mechanisms and Time Warping
Inourcontext,itmaynotbeinitiallyobviouswhetherrecurrentmodels
embodyanykindofsymmetryorinvarianceprinciplesimilartothetrans-
lationalinvariancewehaveseeninCNNs.Nearlythreedecadesafterthe
developmentofRNNs,CorentinTallecandYannOllivier(2018)showed
thatthereisatypeofsymmetryunderlyingrecurrentneuralnetworks:time
warping.
Time serieshave averyimportantbutsubtle nuance:it israrely thecasethat
the individual steps of atime series are evenly spaced.Indeed, inmany natural
phenomena,wemaydeliberatelychooseto takemany measurementsatsome
timepointsandveryfew(orno)measurementsduringothers.Inaway,the
notion of ‘time’presented to theRNN workingon such data undergoes a form
of warping operation. Can we somehow guarantee that our RNN is "resistant"
to time warping, in the sensethat we will always beable to find a set of useful
parameters for it, regardless of the warping operation applied?
Inordertohandletimewarpingwithadynamicrate,thenetworkneeds
tobeabletodynamicallyestimatehowquicklytimeisbeingwarped(the
so-called‘warpingderivative’)ateverystep.Thisestimateisthenusedto
selectively overwrite theRNN’s state.Intuitively, forlargervalues ofthe warp-
ing derivative, a more strict state overwrite should be performed, as more time
has elapsed since.
Theideaofdynamicallyoverwritingdatainaneuralnetworkisimple-
mentedthroughthegatingmechanism.TallecandOlliviereffectivelyshow
that,inordertosatisfytimewarpinginvariance,anRNNneedstohavea
22Chapter 1
gatingmechanism. Thisprovides atheoretical justificationfor gate RNNmod-
els,suchastheaforementionedLSTMsofHochreiterandSchmidhuberor
theGatedRecurrentUnit(GRU)(Choetal.2014).Fulloverwritingusedin
simpleRNNscorrespondstoanimplicitassumptionofaconstanttimewarp-
ing derivative equal to one–a situationunlikely tohappenin mostreal-world
scenarios – which also explains the success of LSTMs.
1.2.7The Triumph of Deep Learning
Asalready mentioned,the initialreceptionof whatwould laterbe called"deep
learning"hadinitiallybeenratherlukewarm.Thecomputervisioncommu-
nity,wherethefirstdecadeofthenew centurywasdominatedbyhandcrafted
featuredescriptors,appearedtobeparticularlyhostiletotheuseofneural
networks.However,thebalanceofpowerwaschangedbytherapidgrowth
incomputingpowerandtheamountsofavailableannotatedvisualdata.It
becamepossibletoimplementandtrainincreasinglybiggerandmorecom-
plexCNNsthatallowedaddressingincreasinglychallengingvisualpattern
recognitiontasks,
39
culminatinginaHolyGrailofcomputervisionatthat
time:theImageNetLargeScaleVisualRecognitionChallenge(Figure1.15).
Established by the American-Chinese researcherFei-Fei Li,ImageNet was an
annual challenge consisting in the classification of millions of human-labelled
images into 1000 different categories.
0%
25%
26%
16.4%
11.7%
6.7%
5%
3.6%
3.1%
2011
XREC
2015
ResNet
5%
10%
15%
20%
30%
Top-5 error
28%
2010
NEC-UIUC
2012
AlexNet
2013
ZFNet
2014
GoogLeNet
VGGNet
Human
2016
GoogLeNet
-v4
Fei Fei Li
Figure 1.15
ImageNet, a benchmarkdevelopedby Fei FeiLi at StanfordUniversity was one ofthe
‘HolyGrail’challengesincomputervisionintheearly2010s.Thedramaticperfor-
mance improvement provided bythe convolutionalneural networkAlexNet in 2012is
considered the turning point leadingto the widespread adoption of deep learning inthe
field.
A CNNarchitecturedeveloped at theUniversity of Torontoby Krizhevsky,
Sutskever,andHinton(2012)managedtobeatbyalargemargin
40
allthe
competingapproachessuchassmartlyengineeredfeaturedetectorsbasedon
decadesofresearchinthefield.AlexNet(asthearchitecturewascalledin
Introduction23
honourofitsdeveloper,AlexKrizhevsky;seeFigure1.16)wassignificantly
biggerintermsofthenumberofparametersandlayerscomparedto itsolder
siblingLeNet-5,
41
but conceptuallythesame.Thekey differencewastheuse
ofagraphicsprocessor(GPU)fortraining
42
,nowthemainstreamhardware
platform for deep learning.
43
11
11
11
11
5
5
3
3
3
3
3
3
3
5
5
3
3
3
3
3
3
3
3
224
224
55
48
55
128
27
27
13
13
13
13
13
13
192192128
128
192192
128
20482048
20482048
1000
48
Stride
of 4
Max
pooling
Max
pooling
Max
pooling
dense
densedense
CONV1
POOL1POOL2
CONV2CONV3CONV4CONV5
POOL3
FC1FC2FC3
Alex Krizhevsky
Figure 1.16
AlexKrizhevsky,picturednexttoaschematicofAlexNet—thefirstdeepneural
network-basedsolution towin theImageNet contest,by asignificantmargin.AlexNet’s
result iscommonly seenas apivotal moment inthe development ofmodern deeplearn-
ing,usheringinsignificantdevelopmentsinsubsequentyears—eventuallyleadingto
surpassing human performance on ImageNet only a few years later.
The successofCNNs onImageNet becamethe turning pointfordeep learn-
ingandheraldeditsbroadacceptanceinthefollowingdecade.Asimilar
transformationhappenedinnaturallanguageprocessingandspeechrecog-
nition,whichmovedentirelytoneuralnetwork-basedapproachesduringthe
2010s; indicatively, Google and Facebook switched their machine translations
systems toLSTM-basedarchitecturesaround2016–2017.Multi-billion dollar
industriesemergedas aresult ofthis breakthrough,with deeplearning success-
fullyusedincommercialsystemsrangingfromspeechrecognitioninApple
iPhonetoTeslaself-drivingcars.Morethanfortyyearsafterthescathing
review of Rosenblatt’s work, the connectionists were finally vindicated.
1.2.8Graph Neural Networks and Their Chemical Precursors
Ifthehistoryofsymmetryistightlyintertwinedwithphysics,thehistoryof
graphneuralnetworks—oneofthecentraltopicsofourbook—hasrootsin
anotherbranchofnaturalscience:chemistry.Chemistryhashistoricallybeen
– and stillis– one ofthemost data-intensive academic disciplines.Theemer-
genceofmodernchemistryintheeighteenthcenturyresultedintherapid
24Chapter 1
growthofknownchemicalcompoundsandanearlyneedfortheirorgani-
sation.ThisrolewasinitiallyplayedbyperiodicalssuchastheChemisches
Zentralblatt
44
and"chemicaldictionaries"liketheGmelinsHandbuchder
anorganischenChemie(anearlycompendiumofinorganiccompoundsfirst
publishedin1817
45
)andBeilsteinsHandbuchder organischenChemie (asim-
ilareffortfor organic chemistry)–allinitially publishedinGerman, whichwas
the dominant language of science until the early 20th century.
IntheEnglish-speakingworld,theChemicalAbstracts Service(CAS) was
createdin 1907andhasgraduallybecomethecentralrepositoryforthe world’s
publishedchemicalinformation.
46
However,thesheeramountofdata(the
Beilstein alone hasgrown to over 500 volumes and nearly half amillion pages
overitslifetime)hasquicklymadeitimpracticaltoprintandusesuchchemical
databases.
Sincethemid-nineteenthcentury,chemistshaveestablishedauniversally
understoodwaytorefertochemicalcompoundsthroughstructural formulae,
indicatingacompound’satoms,thebondsbetweenthem,andeventheir3D
geometry.Butsuchstructuresdidnotlendthemselvestoeasyretrieval.In
thefirsthalfofthe20thcentury,withtherapidgrowthofnewlydiscovered
compoundsandtheircommercialuse,theproblemoforganising,searching,
and comparingmoleculesbecameof crucialimportance:forexample, whena
pharmaceutical company sought to patent a new drug, the Patent Office had to
verify whether a similar compound had been previously deposited.
Toaddressthischallenge,severalsystemsforindexingmoleculeswere
introducedinthe1940s,formingfoundationsfor anewdisciplinethatwould
laterbecalledchemoinformatics.One suchsystem,namedthe‘GKDchemical
cipher’ after theauthors Gordon,Kendall,and Davison (1948), was developed
attheEnglishtirefirmDunloptobeusedwithearlypunchcard-basedcom-
puters.
47
In essence,the GKDcipher was analgorithm forparsing amolecular
structureintoastringthatcouldbemoreeasilylookedupbyahumanora
computer.
However,theGKDcipherandotherrelatedmethods
48
werefarfromsat-
isfactory.Inchemicalcompounds,similarstructuresoftenresultinsimilar
properties.Chemistsaretrainedtodevelopintuitiontospotsuchanalogies,
andlookforthemwhencomparingcompounds.
49
Ontheotherhand,when
amoleculeisrepresentedasastring(suchasintheGKDcipher),thecon-
stituents of a singlechemical structure may be mappedinto different positions
of thecipher. As aresult, two molecules containinga similarsubstructure (and
thus possibly similarproperties)mightbeencodedinvery different ways (see
an example in Figure 1.17).
Introduction25
George Vl
̆
adu ̧t
Figure 1.17
A figure fromVl
̆
adu ̧t et al.(1959) showing a chemical molecule (top left)and its frag-
ment(topright)andthecorrespondingGKD-ciphers(bottom).Notethatthiscoding
systembreaksthespatiallocalityofconnectedatomsinthemolecule,suchthatthe
fragmentciphercannotbefoundbysimplesubstringmatchinginthatofthethefull
molecule.This drawbackofearlychemicalrepresentation methodswasoneofthemoti-
vations for the search of structural representations of molecules as graphs.
Thisrealisationhasencouragedthedevelopmentof"topologicalciphers,"
tryingtocapturethestructureofthemolecule.Firstworksofthiskindwere
doneattheDowChemicalscompanybyOplerandNorton(1956)andthe
USPatentOfficebyRayandKirsch(1957)–bothheavyusersofchemical
databases.Oneofthemostfamoussuchdescriptors,knownasthe‘Morgan
fingerprint,’ wasdevelopedby HarryMorgan(1965) atthe ChemicalAbstracts
Service and used until today.
50
Afigurethathasplayedakeyroleindevelopingearly"structural"
approachesforsearchingchemicaldatabasesistheRomanian-bornSoviet
researcherGeorgeVl
̆
adu ̧t(Figure1.17).Achemistbytraining(withaPhD
inorganicchemistrydefendedattheMoscowMendeleevInstitutein1952),
he experienced a traumaticencounterwith the gargantuan Beilstein handbook
inhisfreshmanyears.
51
Thissteeredhisresearchintereststowardschemoin-
formatics(Vl
̆
adu ̧tetal.1959),afieldinwhichheworkedfortherestofhis
life.
Vl
̆
adu ̧t iscredited as oneof thepioneers of usinggraph theoryfor modeling
thestructuresandreactions ofchemical compounds.In asense, thisshouldnot
comeasasurprise:graphtheoryhasbeenhistoricallytiedtochemistry,and
eventheterm ‘graph’(referring toaset ofnodesand edges,rather thanaplot
of a function) was introduced by themathematician James Sylvester (1878) as
a mathematical abstraction of chemical molecules; cf. Figure 1.18.
In particular, Vl
̆
adu ̧t advocated the formulation of molecular structure com-
parisonasthegraphisomorphismproblem;hismostfamousworkwason
26Chapter 1
August Kekulé
(1829–1896)
James Joseph
Sylvester
(1814–1897)
Figure 1.18
Thestructuralformulaofbenzene(C
6
H
6
)proposedbythe19th-centuryGerman
chemistAugustKekulé.Theterm"graph"(inthesenseusedingraphtheory)was
firstintroducedasamodelofmoleculesbyJamesSylvesterinan1878Naturenote,
explicitly relating molecules to graphs expressing their "Kekuléan diagrams".
classifying chemical reactionsas the partial isomorphism(maximum common
subgraph) of the reactant and product molecules (Vleduts 1963).
Vl
̆
adu ̧t’sworkinspired
52
apairofyoungresearchers,BorisWeisfeiler(an
algebraicgeometer)andAndreyLehman
53
(self-describedasa"program-
mer"
54
).Inaclassicaljointpaper(WeisfeilerandLeman1968),theduo
introducedaniterativealgorithmfortestingwhetherapairofgraphsareiso-
morphic(i.e.,thegraphshavethesamestructureuptoreorderingofnodes),
which becameknown astheWeisfeiler-Lehman(WL)test
55
; seeFigure1.19.
Thoughthetwohadknowneachotherfromschoolyears,theirwaysparted
shortly after their publication and each became accomplished inhis respective
field
56
.
WeisfeilerandLehman’sinitialconjecturethattheiralgorithmsolvedthe
graphisomorphismproblem(anddoesitinpolynomialtime)wasincorrect:
whileLeman(1970)demonstrateditcomputationallyforgraphswithatmost
ninenodes,alargercounterexamplewasfoundayearlater(Adelson-Velskii
et al. 1969) (and in fact, a strongly regular graph failingthe WL test called the
‘Shrinkhande graph’ had been known even earlier, (Shrikhande 1959)).
ThepaperofWeisfeilerandLehmanhasbecomefoundationalinunder-
standinggraphisomorphism. Toput theirworkof inhistoricalperspective, one
should rememberthatin the1960s, complexity theory was still embryonicand
algorithmicgraph theorywasonly takingitsfirstbaby steps.AsLehmanrecol-
lected inthe late1990s: "in the60s, onecould inmatter ofdays re-discoverall
the facts,ideas,and techniquesin graphisomorphism theory.Idoubt, thatthe
word ‘theory’isapplicable;everythingwasatsuchabasiclevel."Inthecon-
textofgraphneuralnetworks,WeisfeilerandLehmanhaverecentlybecome
Introduction27
Boris
Weisfeiler
Andrey
Lehman
Figure 1.19
Thecreators oftheeponymousWeisfeiler-Lehmangraphisomorphismtest,thebedrock
ofexpressivityanalysisforbothgraphisomorphismandgraphneuralnetworks,are
pictured alongside the front page of their paper.
householdnameswiththeproofof theequivalence oftheir graphisomorphism
test to message passing (K. Xu et al. 2018; Morris et al. 2019).
1.2.9Back to the Origins
Thoughchemistshavebeen usingGNN-likealgorithms fordecades, itislikely
that their works onmolecularrepresentation remained practicallyunknown in
the machine learning community.
57
We find it hard to pinpoint precisely when
theconceptofgraphneuralnetworkshasbeguntoemerge:partlyduetothe
factthatmostoftheearlyworkdidnotplacegraphsasafirst-classcitizen,
partlysincegraphneuralnetworksbecamepracticalonlyinthelate2010s,
andpartlybecausethisfieldemerged fromtheconfluenceofseveral adjacent
researchareas;nonetheless,herewewilldiscussseveralpioneeringworks,
many of which have been designed by researchers in Figure 1.20.
Earlyformsofgraphneuralnetworkscanbetracedbackatleasttothe
1990s,withexamplesincluding"LabelingRAAM"byAlessandroSperduti
(1994),the"backpropagationthrough structure"of Gollerand Kuchler(1996),
andadaptiveprocessingofdatastructures(SperdutiandStarita1997;Frasconi,
Gori,andSperduti1998).Whiletheseworkswereprimarilyconcernedwith
operatingover"structures"(oftentreesordirectedacyclicgraphs),manyof
theinvariancespreservedintheirarchitecturesarereminiscentoftheGNNs
more commonly in use today.
The first propertreatment of the processingof generic graphstructures (and
the coiningof theterm ‘graphneural network’)happenedafter theturn ofthe
twenty-first century. AUniversityof Sienateam ledby Marco Gori(2005) and
28Chapter 1
FrancoScarselli(2008)proposedthefirst"GNN."Theyreliedonrecurrent
mechanisms,requiredtheneuralnetworkparameterstospecifycontraction
mappings,andthuscomputingnoderepresentationsbysearchingforafixed
point—thisinitselfnecessitatedaspecialformofbackpropagationanddid
notdependonnodefeaturesatall.Alloftheaboveissueswererectifiedby
the Gated GNN (GGNN)model of Yujia Li et al. (2015), whichbrought many
benefitsofmodernRNNs,suchasgatingmechanisms(Choetal.2014)and
backpropagationthroughtime.Theneuralnetworkforgraphs(NN4G)pro-
posedbyAlessioMicheli(2009)aroundthesametimeusedafeedforward
rather than recurrent architecture, in fact resembling more the modern GNNs.
Alessandro
Sperduti
Christoph
Goller
Andreas
Küchler
Marco
Gori
Franco
Scarselli
Alessio
Micheli
Figure 1.20
Sixofthepioneeringauthorsofgraphneuralnetworks(GNNs)withinthemachine
learningcommunity.AlessandroSperdutidevelopedtheearliestknownGNN-like
architecturetobepublishedatNeurIPS,in1994.GollerandKüchleralsopublished
an early form ofbackprop through structures in the ’90s.The term "GNN" was coined
by Goriand Scarselli’s workin 2005, firmlyestablishing theterm that’s very popularto
thisday—although,theconcurrentNN4G workofAlessioMicheliusesamechanism
that more closely resembles modern GNN implementations.
Anotherimportantclassofgraphneuralnetworks,oftenreferredtoas"spec-
tral",reliedonthenotionoftheGraphFourier transform(Brunaetal.2013).
Therootsofthisconstructionareinthesignalprocessingandcomputational
harmonicanalysis communities,wheredealingwith non-Euclideansignalshas
becomeprominentinthelate2000sandearly2010s.
58
Influentialpapersby
Shumanetal.(2013)andSandryhailaandMoura(2013)popularisedthenotion
of"GraphSignalProcessing"(GSP)andthegeneralisationofFouriertrans-
formsbasedontheeigenvectorsofgraphadjacencyandLaplacianmatrices.
ThegraphconvolutionalneuralnetworkrelyingonspectralfiltersbyDeffer-
rard,Bresson,andVandergheynst(2016)andtheGCNmodelfromKipfand
Welling(2016)—withitspresentationinspiredbyspectralfilters—areamong
the most cited in the field.
Introduction29
Itis worthnoting that,while theconceptofGNNs experienced severalinde-
pendentre-derivationsinthe2010sarisingfromseveral perspectives(besides
the alreadymentioned connectionsto computationalchemistry and signalpro-
cessing,weshouldhighlightprobabilisticgraphicalmodels(Dai,Dai,and
Song2016)andnaturallanguageprocessing(Vaswanietal.2017)),thefact
thatallofthesemodelsarrivedatdifferentinstancesofacommonblueprint
iscertainlytelling.Infact,ithasrecentlybeenposited(Veli
ˇ
ckovi
́
c2022)that
GNNs mayoffer auniversal framework forprocessingdiscretiseddata. Other
neuralarchitectureswe willdiscuss inthis book(such asCNNs orRNNs)may
be recoveredas specialcases ofGNNs, byinserting appropriatepriors intothe
message passing functionsorthe graphstructureover which the messagesare
computed.
Inasomewhatironictwistoffate,modernGNNsweretriumphantlyre-
introducedtochemistry(Figure1.21),afieldtheyoriginatedfrom,byDavid
Duvenaud etal.(2015) asareplacementforhandcraftedMorgan’smolecular
fingerprints, and byJustin Gilmer etal. (2017) inthe formof message-passing
neuralnetworksequivalenttotheWeisfeiler-Lehmantest.Afterfiftyyears,
thecirclefinallyclosed.Atthetimeofwriting,graphneuralnetworkshave
becomeastandardtoolinchemistryandalreadyusedindrugdiscoveryand
designpipelines.Anotable accoladewasclaimedwiththeGNN-based discov-
eryofnovelantibioticcompounds(Stokesetal.2020).DeepMind’sAlphaFold
2(Jumperetal.2021)usedaformofGNNsinordertoaddressahallmark
problem in structural biology – the problem of protein folding.
David DuvenaudJustin Gilmer
Figure 1.21
Thetwo worksspearheaded byDavid Duvenaudand JustinGilmer, respectively,have
greatly popularisedthe use ofGNNs forboth drugscreening and quantumchemistry—
applications that remain prominent to this day.
30Chapter 1
In1999,AndreyLehmanwrotetoamathematiciancolleaguethathehad
the"pleasuretolearnthat‘Weisfeiler-Leman’wasknownandstillcaused
interest."
59
HedidnotlivetoseetheriseofGNNsbasedonhisworkof
fiftyyearsearlier.NordidGeorgeVl
̆
adu ̧tseetherealisationofhisideasin
chemoinformatics, many of which remained on paper during his lifetime.
1.3The ‘Erlangen Programme’ of Deep Learning
Ourhistorical overviewof thegeometricfoundationsof deeplearning hasnow
naturallybroughtustotheblueprintthatunderpins thisbook.TakingConvo-
lutional andGraphNeural Networksastwoprototypicalexamples, at thefirst
glancecompletelyunrelated, wefindseveralcommontraits.First,both operate
ondata(imagesinthecaseofCNNsormoleculesinthecaseofGNNs)thathas
someunderlyinggeometricdomain(respectively,agridoragraph).Second,
in bothcases thetaskshave anaturalnotion ofinvariance (e.g.tothe position
ofanobjectinimageclassification,or thenumberingof atomsina molecule
inchemicalpropertyprediction)thatcanbeformulatedthroughappropriate
symmetry group (translation in the former example and permutation in the lat-
ter). Third, bothCNNs and GNNsincorporate the respective symmetries asan
inductivebiasbymakingtheirlayersinteractappropriatelywiththeactionof
thesymmetrygroupontheinput. InCNNs,itcomesinthe formofconvolu-
tionallayerswhoseoutputtransformsinthesamewayastheinput(wewill
callthispropertytranslation-equivariance,whichisthesameassayingthat
convolution commuteswith theshiftoperator).InGNNs,it assumestheform
of a symmetricmessagepassing function thataggregates theneighbournodes
irrespectiveoftheirorder,andtheoveralloutputofamessagepassinglayer
transformsinthesamewaywiththepermutationoftheinput(permutation-
equivariant).Finally,insomearchitecturalinstances,thedataareprocessed
in a multi-scalefashion;thiscorrespondstopoolinginCNNs(uniformly sub-
sampling the grid thatunderliestheimage) orgraphcoarseningin sometypes
of GNNs.
Overall,thisappearstobeaverygeneralprinciplethatcanbeappliedto
abroadrangeofproblems,typesofdata,andarchitectures.Wewillusethe
GeometricDeepLearningBlueprinttoderivefromfirstprinciplessomeof
themostcommonandpopularneuralnetworkarchitectures(CNNs,GNNs,
LSTMs,DeepSets,Transformers),whichasoftodayconstitutethemajority
ofthedeeplearningtoolkit.AswewillseeinChapters5–9,alloftheabove
canbeobtainedbyanappropriatechoiceofthedomainandtheassociated
symmetry group.
FurtherextensionsoftheBlueprint,suchasincorporatingthesymmetries
ofthedatainadditiontothoseofthedomain,allowtoobtainanewclassof
Introduction31
equivariantGNNarchitectures,thatwillbediscussedlateroninthisbook.
Sucharchitectureshaverecentlycometo limelightin molecularmodeling, and
accountfor thefactthat inmolecular graphs(where nodesrepresentatomsand
edgesarechemicalbonds),thepermutationofthenodesandtherotationofthe
node features (atomic coordinates) are different symmetries.
Suggested Further Reading
Onthetopicofsymmetry,Weyl’sclassic(1952)ishardtobeat.Straumann
(1996)providesafascinatingaccountonthedevelopment ofearlygaugethe-
oriesandWeyl’sroleinit.Anoverviewofthehistoricaldevelopmentof
non-Euclidean geometry is given by Faber (1983). Brannan, Esplen, and Gray
(2011) give a‘Kleinian’ introduction into geometry, and Sharpe (2000)shows
Cartan’sgeneralisationoftheErlangenProgramme.Forreadersinterestedin
theroleofsymmetryin physicswerecommendthebookPhysics from Symme-
try(Schwichtenberg2015),whichlivesuptothepromiseinitstitle,andthe
magnum opus of SirRoger Penrose (2005). For biographicnotes on the life of
work ofKleinandNoether, seeTobies(2019)andQuigg(2019);thebookof
Uspensky gives agoodoverviewoftheintellectualandpoliticalenvironment
inwhichWeisfeiler andLehman worked.Finally, weinvitethe readersto actu-
allyreadthecontroversialbookofMinskyandPapert(1969)anditsshorter
critical review by Block (1970).
Notes
1
Fully titled Strena, Seu De Nive Sexangula, (‘NewYear’s gift, or onthe Six-
CorneredSnowflake’,seeFigure1.1)was,assuggestedbythetitle,asmall
bookletsentbyKeplerin1611asaChristmasgifttohispatronandfriend
Johannes Matthäus Wackher von Wackenfels.
2
Galoisfamouslydescribedtheideasofgrouptheory(whichheconsidered
inthecontextoffindingsolutionstopolynomialequations)andcoinedthe
term ‘group’ (groupe inFrench) ina letterto a friendwritten onthe eve of his
fatalduel.Heaskedtocommunicatehisideastoprominentmathematicians
ofthetime,expressingthehopethattheywouldbeableto‘decipherallthis
mess’(‘déchiffrertoutcegâchis’).Galoisdiedtwodayslaterfromwounds
sufferedintheduelagedonly20,buthisworkhasbeentransformationalin
mathematics.
3
Omar Khayyamis nowadaysmainlyremembered asa poet andauthor ofthe
immortal line "a flask of wine, a book of verse, and thou beside me."
4
ThepublicationofEuclidesvindicatusrequiredtheapprovaloftheInqui-
sition,whichcamejustafewmonthsbeforetheauthor’sdeath.Rediscovered
32Chapter 1
bytheItaliandifferentialgeometerEugenioBeltramiinthenineteenthcen-
tury, Saccheri’s work isnow consideredan early, almost-successful, attemptto
construct hyperbolic geometry.
5
Ponceletwasamilitary engineerandparticipant ofNapoleon’sRussian cam-
paign,wherehewascapturedandheldasprisoneruntiltheendofthewar.It
was during this captivity periodthat he wrote the Traité des propriétés projec-
tivesdesfigures(‘Treatiseontheprojectivepropertiesoffigures’,1822)that
revivedtheinterestinprojectivegeometry.Earlierfoundationworkonthis
subject was done by his compatriot Gérard Desargues (1643).
6
Inthe1832lettertoFarkasBolyaifollowingthepublicationofhisson’s
results, Gauss famously wrote: "To praise it would amount to praising myself.
Fortheentirecontentoftheworkcoincidesalmostexactlywithmyown
meditationswhichhaveoccupiedmymindforthepastthirtyorthirty-five
years."Gausswasalsothefirsttousetheterm‘non-Euclideangeometry’(in
alettertoHeinrichChristianSchumacher),referringstrictusensutohisown
construction of hyperbolic geometry (Faber 1983).
7
Amodelforhyperbolicgeometryknownasthepseudosphere,asurface
withconstant negativecurvature,wasshownby Beltrami,whoalso provedthat
hyperbolicgeometrywaslogically consistent.The term‘hyperbolicgeometry’
wasintroducedbyKleininhis1873paperÜberdiesogenanntenicht-
EuklidischeGeometrie(‘Ontheso-callednon-Euclideangeometry’).The
prefix‘so-called’mightstrikewithasomewhatnegativeflavour,andindeed
itdoes:inhisErlangenProgramme,Kleinwrotethat"withthenamenon-
Euclideangeometryhavebeenassociatedamultitudeofnon-mathematical
ideas,whichhave beenaszealously cherishedby someas resolutelyrejected
by others." He however dropped the derogatory adjective in his latter works.
8
Forexample,an1834pamphletsignedonlywiththeinitials"S.S."
(believed by some to belong to Lobachevsky’s long-time opponent Ostrograd-
sky)claimedthatLobachevskymade"anobscureandheavytheory"outof
"thelightestandclearestchapterofmathematics,geometry,"wonderedwhy
one wouldprint such "ridiculousfantasies," and suggested thatthe book wasa
"joke or satire."
9
Accordingtoapopularbelief,repeatedinmanysourcesincluding
Wikipedia,theErlangenProgrammewasdeliveredinKlein’sinaugural
address inOctober 1872.Klein indeedgave such atalk (thoughon December
7, 1872), but it was for anon-mathematicalaudienceandconcernedprimarily
his ideas of mathematical education (Tobies 2019).
10
Klein’sprojective modelofhyperbolicgeometryisoftencalledthe‘Klein
disk’ or the ‘Cayley-Beltrami-Klein model.’
Introduction33
11
Atthetime,GöttingenwasGermany’sandtheworld’sleadingcentre
ofmathematics.ThoughErlangenisproudofitsassociationwithKlein,he
stayedthereforonlythreeyears,movingin1875totheTechnicalUniversity
ofMunich(thencalled TechnischeHochschule),followed byLeipzig(1880),
and finally settling down in Göttingen from 1886 until his retirement.
12
EmmyNoetherisrightfullyregardedasoneofthemostimportantwomenin
mathematicsandoneofthegreatestmathematiciansofthetwentiethcentury.
Shewasunluckytobebornandliveinanepochwhentheacademicworld
wasstillentrenchedinthemedievalbeliefsoftheunsuitabilityofwomen
forscience.Hercareerasoneofthefewwomeninmathematicshavingto
overcomeprejudiceandcontemptwasatrulytrailblazingone.Itshouldbe
saidtothecreditofhermalecolleaguesthatsomeofthemtriedtobreakthe
rules.WhenKleinandDavidHilbertfirstunsuccessfullyattemptedtosecure
ateachingpositionforNoetheratGöttingen,theymetfierceoppositionfrom
theacademichierarchs.Hilbertreportedlyretortedsarcasticallytoconcerns
broughtupinonesuchdiscussion:"Idonotseethatthesexofthecandidate
isanargumentagainstheradmissionasaPrivatdozent.Afterall,theSenate
isnotabathhouse."(Reid1976)Nevertheless,Noetherenjoyedgreatesteem
amongherclosecollaboratorsandstudents,andhermalepeersinGöttingen
affectionately referred to her asDer Noether, in the masculine (Quigg 2019).
13
Weylfirstconjectured(incorrectly)in1919thatinvarianceunderthe change
ofscaleor"gauge"wasalocalsymmetryofelectromagnetism.Theterm
gauge, or Eich inGerman,was chosen byanalogyto thevarious track gauges
of railroads. After the development of quantum mechanics, Weyl (1929) mod-
ifiedthegaugechoicebyreplacingthescalefactorwithachangeofwave
phase. See Straumann (1996).
14
TheDartmouthSummerResearchProjectonArtificialIntelligencewas
a1956summerworkshopatDartmouthCollegethatisconsideredtobethe
founding event of the field of artificial intelligence.
15
From ‘perception’ and the Greek suffix -τρoνdenoting an instrument.
16
Thatiswhytheauthorscanonlysmileatsimilarrecentclaimsaboutthe
‘consciousness’ of deep neural networks: nihil sub sole novum.
17
Sic in quotes, as pattern recognition had not yet become a common term.
18
Specifically,MinskyandPapert(1969)consideredbinaryclassification
problems ona 2D grid(‘retina’ in theirterminology)and asetof linearthresh-
oldfunctions.WhiletheinabilitytocomputetheXORfunctionisalways
broughtupasthemainpointofcriticisminthebook,muchoftheattention
was dedicatedtogeometricpredicatessuchasparityandconnectedness.This
34Chapter 1
problemisalludedtoonthe coverofthe book,whichis adornedby twopat-
terns:oneisconnected,oneisnot.Evenforahumanitisverydifficultto
determine which is which.
19
Kussuletal.(2001)showedthatRosenblatt’s3-layerperceptrontrained
with modernmethods and implementedon the21st centuryhardware wasable
toachieve anaccuracy of99.2%ontheMNISTdigitrecognitiontask,onpar
with modern models.
20
Hilbert’sThirteenthProblemisoneofthe23problemscompiledby
David Hilbertin1900andentailingtheproofofwhetherasolutionexists for
all7th-degreeequationsusingcontinuousfunctionsoftwoarguments.Kol-
mogorov and his student Arnold showed a solution of a generalised version of
thisproblem,whichis nowknownasthe Arnold–KolmogorovSuperposition
Theorem.
21
ComputationalgeometryhassubjectcodeI.3.5intheACMComputing
Classification System.
22
Backpropagationisbasedonthechainruleofdifferentiationthatitself
goesbacktotheco-inventorofdifferentialcalculusGottfriedWilhelmvon
Leibnizin1676.AprecursorofbackpropagationwasusedbyKelley1960
toperformoptimisationofcomplexnonlinearmulti-stagesystems.Efficient
backpropagation that isstillinuse today wasdescribedbySeppoLinnainmaa
(1970) inhismaster’sthesisinFinnish.Theearliestusein neuralnetworksis
due to Paul Werbos (1982), which isusually citedas theoriginof themethod.
See Schmidhuber (2015).
23
Thereareevenexamplesofcontinuous,nowheredifferentiablefunctions
such as the construction of Weierstrass (1872).
24
Roughly, Lipschitz-continuousfunctions donot arbitrarilyshrink orexpand
thedistancebetweenpointsonthedomain.Fordifferentiablefunctions,Lip-
schitzcontinuitycanbeexpressedasanupperboundonthenormofthe
gradient, implying that the function does not ‘jump’ too abruptly.
25
ThefirsttousethetermwasRichardBellman(1957)intheprefaceof
his bookDynamic Programming,where herefersto dimensionalityas‘a curse
whichhashungoverthe headof thephysicist andastronomer formanya year.’
26
The number of protons inthe observableuniverse, known as theEddington
number, is estimated at 10
80
.
27
Theterm‘receptivefield’predatesHubelandWieselandwasusedby
neurophysiologists from the early twentieth century, see Sherrington (1906).
28
Theterm‘grandmothercell’islikelytohavefirstappearedinJerryLettvin’s
course ‘BiologicalFoundations forPerception andKnowledge’ held atMIT in
1969. A similar concept of‘gnostic neurons’was introduced two years earlier
in a book by a Polish neuroscientist Jerzy Konorski (1967). See Gross (2002).
Introduction35
29
The name ‘neocognitron’ suggests it was an improved version of an earlier
architecture of Fukushima (1975), the cognitron.
30
Inthewordsoftheauthorhimself,havinganoutputthatis"dependent
only upon the shape ofthe stimulus pattern, and is notaffectedby the position
where the pattern is presented."
31
ReLU-typeactivationsdatebacktoatleastthe1960sandhavebeen
previously employed by Fukushima (1969, 1975).
32
Université Pierre-et-Marie-Curie, today part of the Sorbonne University.
33
InLeCun’s1989paper,thearchitecturewasnotnamed;theterm‘con-
volutionalneuralnetwork’or‘convnet’wouldappearinalaterpaperin
1998.
34
LeCun’sfirstCNNwastrainedonaCPU(aSUN-4/250machine).However,
theimagerecognitionsystemusingatrainedCNNwasrunonAT&TDSP-32C
(asecond-generationdigitalsignalprocessorwith256KBofmemorycapa-
bleofperforming125mfloatingpointmultiply-and-accumulate operationsper
second with 32-bit precision), achieving over 30 classifications per second.
35
One ofthemost popularfeature descriptorswasthescale-invariant feature
transform(SIFT),introducedby DavidLowe(1999).Itis oneof themost cited
computer vision papers.
36
Aprototypicalapproachwas"bag-of-words"representingimagesas
histograms of vector-quantised local descriptors (Sivic and Zisserman 2003).
37
Asit oftenhappens,it ishard topointto thefirstRNN design.Recurrencein
neuralnetworkswasalreadydescribedMcCulloch andPitts (1943),who noted
that"thenervoussystemcontainsmanycircularpaths"andreferredto"pre-
cisespecificationoftheseimplicationsbymeansofrecursivefunctionsand
determinationofthosethatcanbeembodiedintheactivityofnervousnets."
The bookofMinsky(1967)uses McCulloch-Pittsneuronsandcallsrecurrent
architectures "networkswithcycles." Thetechnicalreport ofRumelhart, Hin-
ton,andWilliams(1985)(anearlierversionofthefamousNature paper(1986)
by the same authors)contains generalisations for learning inRNNs, which are
named "recurrent nets," and credited to Minsky and Papert (1969).
38
The paper presenting LSTMs was initially rejected from NIPS in1995 and
appeared only two years later as Hochreiter and Schmidhuber (1997).
39
In particular, the groupofJürgen Schmidhuberdevelopeddeep large-scale
CNN modelsthatwon severalvision competitions,includingChinese charac-
terrecognition(Cire ̧sanetal.2010)andtrafficsignrecognition(Ciresanet
al. 2012).
40
AlexNet achieved an error over 10.8% smaller than the runner up.
41
AlexNethad eleven layersand wastrained on1.2M imagesfrom ImageNet
(forcomparison,LeNet-5hadfivelayersandwastrainedon60KMNIST
36Chapter 1
digits).AdditionalimportantchangescomparedtoLeNet-5weretheuseof
ReLUactivation(insteadof tanh),maximum pooling,dropoutregularisation,
and data augmentation.
42
It took nearly a week to train AlexNet on a pair of GPUs.
43
ThoughGPUs wereinitially designedfor graphicsapplications,theyturned
outtobeaconvenienthardwareplatformforgeneral-purposecomputations
("GPGPU").Firstsuchworksshowedlinearalgebraalgorithms(Krügerand
Westermann 2005).The firstuseofGPUs forneuralnetworks was by Ohand
Jung (2004), predating AlexNet by nearly a decade.
44
OriginallyPharmaceutischesCentral-Blatt,itwastheoldestGerman
chemical abstracts journal published between 1830 and 1969.
45
NamedafterLeopoldGmelinwhopublishedthefirstversionin1817,
theGmelinsHandbuch lastprintedition appearedinthe 1990s.The database
currentlycontains1.5millioncompoundsand1.3milliondifferentreactions
discovered between 1772 and 1995.
46
In1906,theAmericanChemicalSocietyauthorisedthepublicationof
ChemicalAbstracts,chargingitwiththemissionofabstractingtheworld’s
literature ofchemistryandassigninganinitialbudget offifteenthousanddol-
lars.Thefirstpublicationappearedin1907underthe stewardshipofWilliam
Noyes.Overacenturysinceitsestablishment,theCAScontainsnearly200
millionorganicandinorganicsubstancesdisclosedinpublicationssincethe
early 1800s.
47
A specialpurpose computer("Electronic StructuralCorrelator"), tobe used
in conjunctionwith a punchcard sorter, wasproposed bythe Gordon-Kendall-
Davisongroupinconnectionwiththeirsystemofchemicalciphering,butnever
built.
48
Therewereseveralcontemporaneoussystemsthatcompetedwitheach
other, see Wiswesser (1968).
49
For example, the association of the benzene ring with odouriferousproper-
tieswasthereasonforthenamingofthechemicalclassofaromaticcompounds
in the 19th century.
50
NotmuchbiographicalinformationisavailableaboutHarryMorgan.
Accordingtoanobituary,afterpublishinghisfamousmolecularfingerprints
paper,hemovedtoamanagerialpositionatIBM,wherehestayeduntil
retirement in 1993. He died in 2007.
51
According toVladimir Uspensky, Vl
̆
adu ̧ttold theanecdote of hisencounter
with Beilsteinin thefirst lectureof hisundergraduate courseon organic chem-
istryduringhisPatterson-CraneAwardacceptancespeechattheAmerican
Chemical Society.
Introduction37
52
WewereunabletofindsolidproofofwhetherorhowWeisfeilerand
LehmaninteractedwithVl
̆
adu ̧t,asmostofthepeoplewho hadknownboth are
nowdead. The strongest evidence is a comment in their classical paper (Weis-
feiler and Leman 1968) acknowledging Vl
̆
adu ̧t for "formulating the problem."
It isalso certain thatWeisfeiler andLehman were aware of themethods devel-
opedinthechemicalcommunity,inparticularthemethodofMorgan(1965),
whom they cited in their paper as a "similar procedure."
53
AndreyLehman’ssurnameisoftenalsospelledLeman,avariantthathe
preferredhimself,statinginanemailthattheformerspellingarosefroma
bookbytheGermanpublisherSpringerwhobelieved"thateveryLemanis
ahiddenLehman."SinceLehman’sfamilyhadTeutonicoriginsbyhisown
admission, we stick here to the German spelling.
54
Lehmanunsuccessfullyattemptedtodefendathesisbasedonhiswork on
graph isomorphism in 1971, whichwas rejected duetothe personalenmity of
theheadofthedissertationcommitteewithaverdict"itisnotmathematics."
Tothis,Lehmanbitterlyresponded:"Iamnotamathematician,Iamapro-
grammer."Heeventuallydefendedanotherdissertationin1973,ontopicsin
databases.
55
There are infact multipleversionsof the Weisfeiler-Lehman test. The orig-
inalpaperdescribedwhatisnowcalledthe"2-WLtest,"whichishowever
equivalent to 1-WL or node colour refinement algorithm.
56
SincelittlebiographicalinformationisavailableinEnglishonourheroes,
wewillusethisnotetooutlinetherestoftheircareers.Allthethreeended
up intheUnitedStates.GeorgeVl
̆
adu ̧t appliedforemigrationin1974,which
wasashocktohisbossesandresultedinhisdemotionfromtheheadoflab-
oratorypost(emigrationwasconsidereda"mortalsin"intheUSSR—to
peoplefromtheWest,itisnow hardtoimaginewhatanepiceffortitusedto
beforSovietcitizens).Vl
̆
adu ̧tlefthisfamilybehindandworkedattheInsti-
tuteforScientificInformationinPhiladelphiauntilhisdeathin1990.Being
of Jewish origin, Boris Weisfeiler decided toemigratein 1975dueto growing
official antesemitismintheUSSR–thelastdropwastherefusaltopublisha
monographonwhichhehadworkedextensivelyastoomanyauthorshad"non-
Russiansurnames."HebecameaprofessoratPennsylvaniaStateUniversity
workingonalgebraicgeometry afterashortperiod ofstayatthe Institutefor
AdvancedStudyinPrinceton.Anavidmountaineer,hedisappearedduringa
hikein Chilein1985.AndreyLehmanleftthe USSRin1990andsubsequently
worked as programmer in multiple American startups. He died in 2012.
57
Inthechemicalcommunity,multipleworksproposedGNN-likemodels
including Kireev (1995), Baskin, Palyulin, and Zefirov (1997), and Merkwirth
and Lengauer (2005).
38Chapter 1
58
It isworthnotingthatinthefield ofcomputergraphicsandgeometrypro-
cessing,non-EuclideanharmonicanalysispredatesGraphSignalProcessing
byatleastadecade.Wecantracespectralfiltersonmanifoldsandmeshesto
theworksofTaubin,Zhang,andGolub(1996),KarniandGotsman(2000),
and Lévy (2006).
59
From a private email sent by Lehman to Ilia Ponomarenko in 1999.

AltStyle によって変換されたページ (->オリジナル) /