Introduction

Thelastdecadehaswitnessedanexperimental revolution indatascienceand

machinelearning,epitomisedbydeeplearningmethods.Indeed,manyhigh-

dimensionallearningtaskspreviouslythoughttobebeyondreach–suchas

computervision,playingGo,orproteinfolding–areinfactfeasiblewith

appropriatedataandcomputationalscale.Remarkably,theessenceofdeep

learningisbuiltfromtwosimplealgorithmicprinciples:first,thenotion

ofrepresentationorfeaturelearning,wherebyadapted,oftenhierarchical,

featurescapturetheappropriatenotionofregularityforeachtask,andsec-

ond, learning by gradient descent-type optimisation,typically implemented as

backpropagation.

Whilelearninggeneric functionsinhighdimensionsisa cursedestimation

problem, mosttasks ofinterest arenot generic,and comewith essentialprede-

fined regularities arisingfromthe underlyinglow-dimensionality andstructure

ofthephysicalworld.Thisbookisconcernedwithexposingtheseregulari-

tiesthroughunifiedgeometric principlesthatcanbeappliedthroughoutawide

spectrum of applications.

Exploitingtheknown symmetriesof alargesystemisa powerfulandclas-

sicalremedyagainstthecurseofdimensionality,andformsthebasisofmost

physical theories. Deep learningsystems are no exception, and since theearly

days researchers have adapted neural networks to exploit the low-dimensional

geometry arisingfrom physicalmeasurements,e.g. gridsin images,sequences

intime-series,orpositionandmomentuminmolecules,andtheirassociated

symmetries,suchastranslationor rotation.Throughoutourexposition,we will

describe thesemodels,as wellas many others, asnaturalinstances ofthe same

underlying principle of geometric regularity.

GeometricDeepLearningisa‘geometricunification’endeavourinthespirit

ofKlein’sErlangenProgrammethatservesadualpurpose.Ononehand,it

provides a commonmathematical framework tostudy theclassical successful

4Chapter 1

neuralnetworkarchitectures,suchasCNNs,RNNs,GNNs,andTransform-

ers.Wewillshowthatalloftheabovecanbeobtainedbythechoiceofa

geometricdomain,itssymmetrygroup,andappropriatelyconstructedinvari-

ant andequivariant neuralnetworklayers—what werefer to asthe ‘Geometric

Deep Learning blueprint.’Wewill exemplify instances of this blueprinton the

‘5G ofGeometric DeepLearning’: Graphs,Grids, Groups, GeometricGraphs,

andGauges.Ontheotherhand,GeometricDeepLearninggivesaconstruc-

tiveproceduretoincorporatepriorphysicalknowledgeintoneuralnetworks

andprovideaprincipledwaytobuildnewarchitectures.Withthispremise,

we shallnow explore how the ideascentral toGeometric DeepLearning have

crystallised through time.

1.1On the Shoulders of Giants

"Symmetry,aswideorasnarrowasyoumaydefineitsmeaning,isoneidea

bywhichmanthroughtheageshastriedtocomprehendandcreateorder,

beauty, and perfection." This somewhat poetic definition of symmetry is given

intheeponymousbookofthegreatmathematicianHermannWeyl(1952),

his Schwanengesang ontheeve ofretirement fromtheInstitute forAdvanced

StudyinPrinceton.Weyltracesthespecialplacesymmetryhasoccupiedin

scienceandarttotheancienttimes,fromSumeriansymmetricdesignstothe

Pythagoreanswhobelievedthecircletobeperfectduetoitsrotationalsym-

metry.Platoconsideredthefiveregularpolyhedrabearinghisnametodayso

fundamentalthattheymustbethebasicbuildingblocksshapingthematerial

world.

Yet,thoughPlatoiscreditedwithcoiningthetermσνμμετρ ́ια,whichlit-

erallytranslatesas‘samemeasure’,heuseditonlyvaguelytoconveythe

beautyofproportioninartandharmonyinmusic.Itwastheastronomerand

mathematician JohannesKepler (1611)to attemptthe firstrigorous analysisof

thesymmetricshapeofwatercrystals.Inhistreatise‘OntheSix-Cornered

Snowflake’

heattributedthesix-folddihedralstructureofsnowflakesto

hexagonalpackingofparticles–anideathatthoughprecededtheclear

understandingofhowmatterisformed,stillholdstodayasthebasisof

crystallography (Ball 2011).

Inmodernmathematics,symmetryisalmostunivocallyexpressedinthe

languageofgrouptheory.The originsofthistheoryareattributedtoÉvariste

Galois, who coined the term andused it to study the solvability of polynomial

equationsinthe1830s.

Twoothernamesassociatedwithgrouptheoryare

thoseofSophusLieandFelixKlein,whometandworkedfruitfullytogether

foraperiodoftime(Tobies2019).Theformerwoulddevelopthetheoryof

Introduction5

Plato

(ca. 370 BC)

Johannes Kepler

(1571–1630)

Figure 1.1

Platobelievedthatsymmetricpolyhedra("Platonicsolids")werethefundamental

buildingblocksofnature.JohannesKeplerattributedforthefirsttimethesix-fold

symmetryofwatercrystalstothehexagonalpackingofparticles,antedatingmodern

crystallography.

continuoussymmetries thattodaybearshisname(Liegroups); thelatterpro-

claimedgrouptheorytobetheorganisingprincipleofgeometryinhisErlangen

Programme,whichwealreadymentionedinthePreface.GiventhatKlein’s

Programmeis theinspiration forourbook,it isworthwhile tospend moretime

on its historical context and revolutionary impact.

1.1.1A Strange New Universe out of Nothing

Thefoundationsofmodern geometrywereformalisedinancient Greecenearly

2300yearsagobyEuclidinatreatisenamedtheElements(Στoιχε`ια).

Euclideangeometry(whichisstilltaughtatschoolas‘thegeometry’)was

asetofresultsbuiltuponfiveintuitiveaxiomsorpostulates.TheFifthPos-

tulate–statingthatitis possibleto passonly oneline parallelto agivenline

throughapointoutsideit–appearedlessobviousandanillustriousrowof

mathematicians broke their teeth trying to prove it since antiquity, to no avail.

Anearlyapproachtotheproblemoftheparallelsappearsintheeleventh

centuryPersian treatiseA commentaryon thedifficultiesconcerning thepostu-

lates of Euclid’s Elements by Omar Khayyam.

The eighteenth-century Italian

Jesuit priestGiovanniSaccheri(1733) waslikely aware of thisprevious work

judgingby thetitleof hisownworkEuclidesab omninævovindicatus (‘Euclid

clearedofeverystain’,seeFigure1.2).LikeKhayyam,heconsideredthe

summitangles ofa quadrilateralwithsidesperpendicularto thebase. Thecon-

clusion thatacute angles leadto infinitelymanynon-intersecting lines thatcan

bepassedthroughapointnotonastraightlineseemedsocounter-intuitive

6Chapter 1

thatherejected itas‘repugnatis naturælinærectæ’(‘repugnanttothenature

of straight lines’).

Euclid

(ca. 300 BC)

Omar Khayyam

(1048–1131)

Figure 1.2

The"FatherofGeometry",Euclid,laidthefoundationsofmoderngeometryinthe

Elements. Omar Khayyam made early attempts toprove Euclid’s fifth postulate, which

were also referenced in Saccheri’s work (Euclides ab omni nævo vindicatus).

The nineteenth centuryhas brought the realisationthat the FifthPostulate is

notessentialandonecanconstructalternativegeometriesbasedondifferent

notions ofparallelism. One suchearlyexampleis projective geometry, arising,

asthenamesuggests,inperspectivedrawingandarchitecture.Inthisgeom-

etry,pointsandlinesareinterchangeable,andtherearenoparallellinesin

the usualsense:any linesmeet ina ‘pointatinfinity.’While resultsin projec-

tivegeometry areknown since antiquity,it wasfirst systematicallystudied by

Jean-Victor Poncelet (1822)

; see Figure 1.3.

Gérard Desargues

(1591–1661)

Jean-Victor

Poncelet

(1788–1867)

Figure 1.3

Based on theprior work ofGérard Desargues,Jean-Victor Poncelet revived theinterest

inprojectivegeometry,oneoftheearliestexamplesofgeometriesnotrequiringthe

parallel postulate.

Introduction7

Thecreditforthefirstconstructionofatruenon-Euclideangeometryis

disputed.TheprincepsmathematicorumCarlFriedrichGaussworkedonit

around 1813 but never published anyresults.

The first publicationon thesub-

jectofnon-Euclideangeometrywas‘OntheOriginsofGeometry’,bythe

RussianmathematicianNikolaiLobachevsky(1829)—asdepictedinFigure

1.4. Inthiswork, heconsideredthe FifthPostulatean arbitrarylimitationand

proposedanalternativeone, thatmorethanone linecanpassthrough apoint

that isparallel to agiven one.Such a constructionrequires aspace with nega-

tive curvature —whatwenowcalla hyperbolicspace —anotionthatwasstill

not fully mastered at that time.

Lobachevsky’s idea appeared heretical and he

was openly deridedby colleagues.

A similarconstruction was independently

discovered by theHungarian JánosBolyai, who publishedit in 1832under the

name‘absolutegeometry.’ Inan earlierletter tohis fatherdated 1823,he wrote

enthusiastically about this new development: ‘Ihave discovered such wonder-

fulthingsthatIwasamazed...outofnothingIhavecreatedastrangenew

universe.’

Carl Friedrich

Gauss

(1777–1855)

János

Bolyai

(1802–1860)

Nikolai

Lobachevsky

(1792–1856)

Bernhard

Riemann

(1826–1866)

Figure 1.4

Severalmathematiciansplayedpivotalrolesinproposingearlyvariantsofnon-

Euclideangeometriesintheeighteenthcentury.Gaussreportedlyworked onthetopic

withoutpublishinganymaterialonit.Thefirstpublication—focussedonhyperbolic

geometry—came fromLobachevsky (On theOrigins ofGeometry, depictedhere), with

Bolyaihavingworkedonitconcurrently. Riemannintroducedanysuchgeometries later

on, one of which was elliptic geometry.

In themeantime,new geometriescontinued tocomeoutlike fromacornu-

copia.AugustMöbius(1827),oftheeponymoussurfacefame,studiedaffine

geometry.Gauss’studentBernhardtRiemannintroducedaverybroadclass

ofgeometries—calledtodayRiemannianishishonour—inhishabili-

tationlecture,subsequentlypublishedunderthetitleÜberdieHypothesen,

8Chapter 1

welcheder GeometriezuGrunde liegen (‘Onthe Hypotheses onwhich Geom-

etryisBased,’1854).AspecialcaseofRiemanniangeometryisthe‘elliptic’

geometryofthesphere,anotherconstructionviolatingEuclid’sFifthPostu-

late, asthere isno pointon thesphere throughwhich aline canbedrawn that

neverintersectsagivenline.Towardsthesecondhalfofthenineteenthcen-

tury,Euclid’smonopolyovergeometrywascompletelyshuttered.Newtypes

ofgeometry(Euclidean,affine,projective,hyperbolic,spherical)emerged

andbecameindependentfieldsofstudy.However,therelationshipsofthese

geometries and their hierarchy were not understood.

It wasin thisexciting but messysituationthatFelixKleincameforth,with

a geniusinsight to usegrouptheory asanalgebraic abstractionofsymmetry to

organisethe ‘geometric zoo.’

It appearedthatEuclidean geometryis aspecial

caseofaffinegeometry,whichisinturnaspecialcaseofprojectivegeom-

etry(or,intermsofgrouptheory,theEuclideangroupisasubgroupofthe

projective group). Klein, and independently the Italiangeometer Eugenio Bel-

trami,furthershowedthatconstant-curvaturenon-Euclideangeometries(i.e.,

thehyperbolicgeometryofLobachevsky andBolyaiandthesphericalgeom-

etry orRiemann)could beobtained asspecial casesofprojectivegeometry

;

alsoseeFigure1.5.MoregeneralRiemanniangeometrywithnon-constant

curvaturewasnotincludedinKlein’sunifiedgeometricpicture,andittook

anotherfiftyyearsbeforeitwasintegrated,largely thankstothework ofÉlie

Cartan in the 1920s.

Felix Klein

(1849–1925)

Eugenio Beltrami

(1835–1900)

Figure 1.5

FelixKlein’sErlangenProgrammeofferedawaytoorganiseandcategoriseexisting

geometriesaccordingtotheirsymmetries.Additionally,KleinandBeltramiindepen-

dently proved thatnon-Euclidean geometrieswithconstant curvature arespecial cases

of projective geometry.

Klein’sErlangenProgrammehashadaprofoundmethodologicalandcul-

turalimpactongeometryandmathematicsingeneral.Itwasinasensethe

Introduction9

‘second algebraisation’ ofgeometry(thefirstonebeingtheanalytic geometry

ofRenéDescartesandthemethodofcoordinatesbearinghislatinisedname

Cartesius)thatallowedtoproduceresultsimpossiblebypreviousmethods.

Category Theory,abstractingtherelationsbetweenobjectsandnow pervasive

in pure mathematics, canbe "regardedas a continuation of theKlein Erlangen

Programme,in thesense thatageometricalspacewith itsgroup oftransforma-

tionsisgeneralizedtoacategorywithitsalgebraofmappings,"inthewords

ofits creatorsSamuel Eilenberg andSaundersMac Lane—see(Marquis 2009)

and Figure 1.6.

Élie

Cartan

(1869–1951)

Samuel

Eilenberg

(1913–1998)

Saunders

Mac Lane

(1909–2005)

Figure 1.6

Whilethe ErlangenProgrammeoutlinedapath towards unifyingallgeometriesunder

acommonlens,itwas notuntiltheworkofÉlieCartanseveral decadeslaterthatthis

unification was entirelyformalised. Additionally, the ErlangenProgramme inspiredthe

creationofentirenovelfieldsofmathematics,asevidencedby‘GeneralTheoryof

NaturalEquivalences’,theoriginaltextonCategoryTheorybyEilenbergandMac

Lane.

1.1.2The Theory of Everything

Consideringprojectivegeometrythemostgeneralofall,Kleincomplained

‘howpersistentlythemathematicalphysicistdisregardstheadvantages

affordedhiminmanycasesbyonlyamoderatecultivationoftheprojective

view’(Klein1872).Hisadvocatingfortheexploitationofgeometryandthe

principles ofsymmetryin physicshasforetoldthefollowing centurythatwas

truly revolutionary for the field.

InGöttingen,

Klein’scolleagueEmmy Noether(1918)

provedthatevery

differentiablesymmetry ofthe actionof aphysical systemhas acorresponding

conservationlaw.It was byallmeans astunning result:beforehand,meticulous

experimentalobservationwasrequiredtodiscoverfundamentallawssuchas

10Chapter 1

the conservation ofenergy,andeventhen, itwas anempiricalresultnot com-

ingfromanywhere.Noether’sTheorem—"aguidingstarto20thand21st

century physics," inthe words of theNobel laureateFrank Wilczek — allowed

for example to show that the conservation of energy emerges from the transla-

tionalsymmetry oftime,aratherintuitiveideathatthe resultsofanexperiment

shouldnotdependon whetherit isconducted todayor tomorrow.Thefirstpage

of this landmark result is depicted in Figure 1.7.

Anothersymmetryassociatedwithchargeconservation,theglobalgauge

invarianceoftheelectromagneticfield,firstappearedinMaxwell’sformu-

lationofelectrodynamics(Maxwell1865);however,itsimportanceinitially

remainedunnoticed.Thesame HermannWeyl whowrotesodithyrambically

aboutsymmetryistheonewhofirstintroducedtheconceptofgaugeinvari-

ance inphysics intheearly 20thcentury,

emphasizing itsrole asa principle

fromwhichelectromagnetismcanbederived.Ittookseveraldecadesuntil

thisfundamentalprinciple–initsgeneralisedformdevelopedbyYangand

Mills (1954) – proved successful in providing a unified framework to describe

thequantum-mechanicalbehaviourofelectromagnetismandtheweakand

strongforces,finallyculminatingintheStandardModelthatcapturesallthe

fundamental forcesof naturebut gravity. Assuccinctlyput byanother Nobel-

winningphysicist,PhilipAnderson(1972),"itisonlyslightlyoverstatingthe

case to say that physics is the study of symmetry."

Emmy

Noether

(1882–1935)

Hermann

Weyl

(1885–1955)

Chen-Ning

Yang

(b. 1922)

Robert

Mills

(1927–1999)

Figure 1.7

TheErlangenProgrammehadsignificantspillovereffectsinphysics.Thelandmark

resultwasNoether’stheorem,whichdirectlyspecifieshowconservationlawsarise

directlyfromsymmetryconstraints.Lateron,gaugesymmetries–firstintroducedby

Weyl, then developed byYang and Mills –provedan effective abstraction for discover-

ing the presently best-known model of the physical world: the Standard Model.

Introduction11

1.2Towards Geometric Deep Learning

Animpatientreadermightwonderatthispoint,whatdoesallthisexcursion

intothehistoryofgeometryandphysics,howeverexcitingitmightbe,have

todowithdeeplearning?Aswewillsee,thegeometricnotionsofsymmetry

andinvariancehavebeenrecognisedascrucialeveninearlyattemptstodo

‘patternrecognition,’anditisfairtosaythatgeometryhasaccompaniedthe

nascent fieldofartificialintelligence fromitsverybeginning. Whileitishard

toagreeonaspecificpointintimewhen‘artificialintelligence’wasbornas

a scientificfield(atthe end,humanshave beenobsessedwithcomprehending

intelligenceandlearningfromthedawnofcivilisation),andeventhehistory

ofdeeplearningisdisputed,wewilltryalessriskytaskoflookingatthe

precursorsofgeometricdeeplearning—themaintopicofourbook.This

history can be packed into less than a century.

1.2.1Early Neural Networks and the AI Winter

Bythe1930s,ithasbecomeclearthatthemindresidesinthebrain,and

researcheffortsturnedtoexplainingbrainfunctionssuchasmemory,per-

ception,andreasoning,intermsofbrainnetworkstructures.McCullochand

Pitts(1943)arecreditedwiththefirstmathematicalabstractionoftheneu-

ron,showingitscapabilitytocomputelogicalfunctions.Justayearafterthe

legendaryDartmouthCollegeworkshopthatcoinedtheveryterm‘artificial

intelligence,’

anAmericanpsychologistFrankRosenblatt(1957)fromCor-

nellAeronauticallaboratoryproposedaclassofneuralnetworkshecalled

‘perceptrons’

,cf.Figure1.8.Perceptrons,firstimplementedonadigital

machineandthenindedicatedhardware,managedtosolvesimplepattern

recognition problems such as classification of geometric shapes. However, the

quick rise of ‘connectionism’ (how AI researchers working on artificial neural

networkslabeledthemselves)receivedabucketofcoldwaterintheformof

thenow infamousbookPerceptronsbyMinskyandPapert(1969);theneural

network model they used is depicted in Figure 1.9.

Inthedeeplearningcommunity,itiscommontoretrospectivelyblame

MinskyandPapertfortheonsetofthefirst‘AIWinter,’whichmadeneural

networksfalloutoffashionforoveradecade.Atypicalnarrativementions

the ‘XOR Affair,’ a proofthat perceptrons wereunable to learnevenvery sim-

ple logicalfunctions asevidence of theirpoor expressive power. Somesources

evenaddapinchofdramarecallingthatRosenblattandMinskywenttothe

sameschoolandeven allegingthatRosenblatt’sprematuredeathinaboating

accident in 1971 was a suicide intheaftermath ofthe criticismof hiswork by

colleagues.

12Chapter 1

SENSORY

UNITS

(S-UNITS)

RETINAL

UNITS

CIRCUITS

RETNA

ASSOCIATION

UNITS

(A-UNITS)

RESPONSE

UNITS

(R-UNITS)

Frank Rosenblatt

(1928–1971)

Figure 1.8

ThePerceptronproposedbyFrankRosenblatt(1957)wasoneofthesimplestneural

network architectures.

The realityisprobablymoremundane andmorenuancedatthe sametime.

First,afarmoreplausiblereasonforthe‘AIWinter’intheUSAisthe1969

Mansfield Amendment, which requiredthemilitarytofund "mission-oriented

directresearch,ratherthanbasicundirectedresearch."Sincemanyeffortsin

artificial intelligence at the time, includingRosenblatt’s research, were funded

bymilitaryagenciesanddidnotshowimmediateutility,thecutinfunding

hashaddramaticeffects.Second,neuralnetworksandartificialintelligence

ingeneralwereover-hyped:itisenoughtorecalla1958NewYorkerarticle

callingperceptronsa"firstseriousrival tothehumanbraineverdevised"and

"remarkablemachines"thatwere"capableofwhatamountstothought,"

theoverconfidentMITSummerVisionProjectexpectinga"constructionofa

significantpartofavisualsystem"andachievingtheabilitytoperform"pattern

recognition"

during one summer term of1966.A realisationby theresearch

community that initialhopes to ‘solve intelligence’ hadbeen overly optimistic

was just a matter of time.

Ifone,however,looksintothesubstanceofthedispute,itisapparentthat

whatRosenblattcalleda‘perceptron’isratherdifferentfromwhatMinsky

and Papert understoodunder thisterm. MinskyandPapertfocusedtheir analy-

sis andcriticismon anarrow classof single-layerneuralnetworks they called

‘simpleperceptrons’ (andwhatistypicallyassociated withthistermin modern

times,see Figure1.8)that computeaweighted linearcombinationof theinputs

followedbyanonlinearfunction.

Ontheotherhand, Rosenblattconsidered a

broader classof architectures thatantedated many ideas ofwhat wouldnowbe

considered‘modern’deeplearning,includingmulti-layerednetworkswithran-

domandlocalconnectivity.

Rosenblattcouldhaveprobablyrebuttedsome

Introduction13

Marvin Minsky

(1927–2016)

Seymour Papert

(1928–2016)

Figure 1.9

TheinfluentialbookPerceptronsbyMinskyandPapert(1969)consideredsimple

single-layerneuralnetworksdepictedhere.Itwasperhapstheearliestgeometric

approach to learning, including the introduction of group invariance.

of the criticism concerning the expressive power of perceptrons had he known

aboutthe proofoftheThirteenthHilbertProblem

byVladimir Arnold(1956)

andAndreyKolmogorov(1957)establishingthatacontinuousmultivariate

functioncanbewrittenasasuperpositionofcontinuousfunctionsofasingle

variable.TheArnold–Kolmogorovtheoremwasaprecursorofasubsequent

classofresultsknownas‘universalapproximationtheorems’formulti-layer

(or ‘deep’) neural networks that put these concerns to rest.

While mostremember thebookofMinsky and Papertforthe roleitplayed

in cutting the wingsof theearly dayconnectionistsand lamentthelost oppor-

tunities, animportant overlooked aspect isthatitforthefirst timepresented a

geometricanalysis oflearning problems.This factisreflectedintheveryname

ofthe book,subtitled Anintroduction toComputational Geometry.At thetime

it was aradically new idea, and Block(1970) in his critical review ofthe book

(which essentially stood to the defense of Rosenblatt),wondered whether "the

newsubjectof‘ComputationalGeometry"’would"growintoanactivefield

of mathematics, or will itpeter outina miscellany of dead ends?"(theformer

happened: computational geometry is now a well-established field

Furthermore,MinskyandPapertprobablydeservethecreditforthefirst

introductionofgrouptheoryintotherealmofmachinelearning:theirGroup

Invariancetheoremstated thatifaneural networkisinvarianttosomegroup,

thenitsoutputcouldbe expressedasfunctionsofthe orbitsofthegroup(we

will define thesetermsinsubsequent Chapters). Whilethey usedthisresultto

provelimitationsofwhataperceptroncouldlearn,similarapproacheswere

subsequentlyusedbyAmari(1978)fortheconstructionofinvariantfeatures

inpatternrecognitionproblems.Anevolutionoftheseideasintheworksof

14Chapter 1

Sejnowski,Kienker,andHinton(1986)andShawe-Taylor(1989,1993),unfor-

tunately rarely cited today, provided the foundations of the geometric learning

blueprint described in this book.

1.2.2Universal Approximation and the Curse of Dimensionality

Theaforementionednotionofuniversalapproximationdeservesfurtherdiscus-

sion. The term refers tothe ability to approximate any continuous multivariate

functiontoanydesiredaccuracy;inthemachinelearning literature,thistype

ofresultsisusuallycreditedtoCybenko(1989)andHornik(1991).Figure

1.10outlinesthepioneeringresearchersworkingonuniversalapproximation

results.

David

Hilbert

(1862–1943)

Andrey

Kolmogorov

(1903–1987)

Vladimir

Arnold

(1937–2010)

George

Cybenko

Kurt

Hornik

Figure 1.10

DavidHilbert’sThirteenthProblem,provenbyAndreyKolmogorovandVladimir

Arnold,wasoneofthefirstresultsshowingthatamultivariatecontinuousfunction

couldbeexpressedasacompositionandsumofsimpleone-dimensionalfunctions.

George Cybenkoand KurtHornikprovedresultsspecifictoneural networks,showing

thataperceptron withone hiddenlayercan approximateanycontinuousfunction toany

desired accuracy.

Unlike‘simple’(single-layer)perceptronscriticisedbyMinskyandPapert

(1969),multilayerneuralnetworksareuniversalapproximatorsandthusare

anappealingchoiceofarchitectureforlearningproblems.Wecanthinkof

supervisedmachinelearningasafunctionapproximationproblem:giventhe

outputsofsomeunknown function(e.g.,animageclassifier)onatraining set

(e.g.,imagesofcatsanddogs),wetrytofindafunctionfromsomehypoth-

esisclassthatfitswellthetrainingdataandallowstopredicttheoutputson

previously unseen inputs (‘generalisation’).

Universalapproximationguaranteesthatwecanexpressfunctionsfroma

verybroadregularityclass(continuousfunctions)bymeansofamulti-layer

neuralnetwork.Inotherwords,thereexistsaneuralnetworkwithacertain

Introduction15

numberofneuronsandcertainweightsthatapproximatesagivenfunction

mappingfromtheinputtotheoutputspace(e.g.,fromthespaceofimages

to the space of labels). However, universal approximation theorems do not tell

ushowtofindsuchweights.Infact,learning(i.e.,findingweights)inneu-

ralnetworkshasbeenabigchallengeintheearlydays.Rosenblattshowed

alearningalgorithmonlyforasingle-layerperceptron;totrainmulti-layer

neural networks, IvakhnenkoandLapa (1966)used alayer-wiselearning algo-

rithmcalled‘groupmethodofdatahandling.’ ThisallowedIvakhnenko(1971)

to go as deep as eight layers — a remarkable feat for the early 1970s!

Abreakthroughcamewiththeinventionofbackpropagation,analgorithm

usingthechainruletocomputethegradientoftheweightswithrespectto

the loss function and allowed to use gradient descent-based optimisation tech-

niquestotrainneuralnetworks.Asoftoday,thisisthestandardapproachin

deeplearning.Whiletheoriginsofbackpropagationdatebacktoatleastthe

1960s,

the firstconvincingdemonstration ofthis methodinneuralnetworks

wasinthewidelycitedNaturepaperofRumelhart,Hinton,andWilliams

(1986). The introduction of this simpleand efficientlearning method has been

a keycontributingfactorto thereturn ofneuralnetworks tothe AIscene in the

1980sand1990s.ThekeycontributorsinthisspacearediscussedinFigure

1.11.

Alexey

Ivakhnenko

(1913–2007)

Seppo

Linnainmaa

(b. 1945)

Paul

Werbos

(b. 1947)

David

Rumelhart

(1942–2011)

Figure 1.11

Four ofthemostimportantfigures inthedevelopmentof modernmethodsoflearning

viagradientdescent.Ivakhnenkoderivedalayer-wiselearningalgorithmwhichallowed

for training an eight-layerneural network in 1971. Thebackpropagation algorithm was

initially describedby Linnainmaa,andfirst deployed indeepneural networksby Wer-

bos.Rumelhart’s workofferedthefirst convincing demonstrationof themethod indeep

neural networks, which propelled the use of backpropagation in future works.

Lookingatneuralnetworksthroughthelensofapproximationtheoryhas

led some cynics to call deep learninga "glorified curve fitting." We will letthe

16Chapter 1

reader judgehow truethismaximisbytryingtoanswerthefollowing impor-

tant question: how many samples (trainingexamples) are neededto accurately

approximateafunction?Approximationtheoristswillimmediatelyretortthat

theclassofcontinuousfunctionsthatmultilayerperceptronscanrepresent

isobviouslywaytoolarge:onecanpassinfinitelymanydifferentcontinu-

ousfunctionsthroughafinitecollectionofpoints.

Itisnecessarytoimpose

additional regularity assumptions suchas Lipschitz continuity,

in whichcase

onecanprovideaboundontherequirednumberofsamples.Unfortunately,

theseboundsscaleexponentiallywithdimension—aphenomenoncolloqui-

allyknownasthe‘curseofdimensionality’

—whichisunacceptablein

machinelearningproblems:evensmall-scalepatternrecognitionproblems,

suchasimageclassification,dealwithinputspacesofthousandsofdimen-

sions.Ifonehadtorelyonlyonclassicalresultsfromapproximationtheory,

machinelearningwouldbeimpossible.Inourillustration,thenumberofexam-

ples ofcat anddog images thatwouldbe requiredin theoryin order tolearn to

tell themapart wouldbe way larger thanthe numberof atoms intheuniverse

— there are simply not enough cats and dogs around to do it.

Thestruggleofmachinelearningmethodstoscaletohighdimensionswas

brought up by the British mathematician Sir James Lighthill(1973) in a paper

that AIhistorianscallthe‘LighthillReport,’inwhichhe usedtheterm‘com-

binatorialexplosion’andclaimedthatexistingAImethodscouldonlywork

ontoyproblemsandwouldbecomeintractableinreal-worldapplications.

Lighthillfurthercomplainedthat"mostworkers inAIresearchandinrelated

fieldsconfesstoapronouncedfeelingofdisappointmentinwhathasbeen

achievedinthepasttwenty-fiveyears"and"innopartofthefieldhavethe

discoveriesmadesofarproducedthemajorimpactthatwasthenpromised."

Thesewerenotmerefrustrationsofagrumpyacademicstuckwithahard

problem:the Reportwascommissionedby theBritishScienceResearch Coun-

ciltoevaluateacademicresearchinthefieldofartificialintelligence,andits

pessimisticconclusionsresultedinfundingcutsacrossthepond.Together

withsimilardecisionsbytheAmericanfundingagencies,thisamountedto

a wrecking ball for AI research in the 1970s.

Forus,therealisationthatclassicalfunctionalanalysiscannotprovidean

adequateframeworktodealwithlearningproblemswillbethemotivation

toseekstronger,geometricformsofregularity,thatcanbeimplementedin

aparticularwiringoftheneuralnetwork—suchasthelocalconnectivityof

convolutionalneuralnetworks.It isfairtosaythatthetriumphantreemergence

of deep learning a decade ago owes, at least in part, to these insights. It is also

probablytruethattheroleofsymmetryandinvarianceasabroadorganising

Introduction17

anddesignprincipleinneural networkshas notbeengiven enoughcredit atthe

time, and us highlighting these principles here are a hindsight.

1.2.3Secrets of the Visual Cortex and the Neocognitron

Theinspirationforthefirstneuralnetworkarchitecturesofthenew‘geo-

metric’typecamefromneuroscience.Inaseriesofexperimentsthatwould

becomeclassicalandbringthemaNobelPrizeinmedicine,theduoof

HarvardneurophysiologistsDavidHubelandTorstenWiesel(1959;1962)

unveiledthestructureandfunctionofapartofthebrainresponsibleforpat-

ternrecognition—thevisualcortex.Bypresentingchanginglightpatternsto

acatandmeasuringtheresponseofitsbraincells(neurons)–asdepictedin

Figure1.12–theyshowedthattheneurons inthe visualcortexhaveamulti-

layerstructurewithlocalspatialconnectivity:acellwouldproduceresponse

onlyifcellsinitsproximity(‘receptive field’

)wereactivated.Furthermore,

theorganisationappearedtobehierarchical,wheretheresponsesof‘simple

cells’reactingtolocalprimitiveorientedstep-likestimuliwereaggregated

by‘complexcells,’whichproducedresponsestomorecomplexpatterns.It

was hypothesized that cells in deeper layersofvisual cortex would respond to

increasinglycomplexpatternscomposedofsimplerones,withasemi-joking

suggestionoftheexistenceofa‘grandmothercell’

thatreactsonlywhen

shown the face of one’s grandmother.

Electrical signal

from brain

Stimulus

Visual area

of brain

Recording electrode

David Hubel

(1926–2013)

Torsten Wiesel

(b. 1924)

Figure 1.12

The classicalexperiment ofHubel andWiesel (1962) revealed thestructure of thebrain

visual cortex and inspired a new generation of neural network architectures mimicking

its local connectivity.

Theunderstandingofthestructureofthevisualcortexhashadprofound

impactonearlyworksincomputervisionandpatternrecognition,withmul-

tipleattemptstoimitateitsmainingredients.KunihikoFukushima(1980),

atthattimearesearcherattheJapanBroadcastingCorporation,developeda

18Chapter 1

new neuralnetworkarchitecture"similartothehierarchymodelofthevisual

nervoussystemproposedbyHubelandWiesel,"whichwasgiventhename

neocognitron.

Theneocognitronconsistedofinterleaved S-andC-layersof

neurons (a naming convention reflecting its inspiration in the biological visual

cortex);theneuronsineachlayerwerearrangedin2Darraysfollowingthe

structureoftheinputimage(‘retinotopic’),withmultiple‘cell-planes’(fea-

ture maps in modern terminology) per layer. The S-layers were designed to be

translationallysymmetric:theyaggregatedinputsfromalocalreceptivefield

using shared learnable weights, resultingincellsinasingle cell-planehaving

receptivefieldsofthesamefunction,butatdifferentpositions.Therationale

was topickuppatterns thatcould appearanywhere in theinput. TheC-layers

were fixed andperformed local pooling(a weightedaverage), affordinginsen-

sitivityto the specific location of the pattern: a C-neuron would be activated if

any of the neurons in its input are activated.

Sincethemainapplicationoftheneocognitronwascharacterrecognition,

translationinvariance

wascrucial.Thispropertywasafundamentaldiffer-

encefromearlierneuralnetworkssuchasRosenblatt’sperceptron:inorder

touseaperceptronreliably,onehadtofirstnormalisethepositionofthe

inputpattern,whereasinneocognitrontheinsensitivitytothepatternposi-

tion was baked intothe architecture.Neocognitronachieved it byinterleaving

translationally-equivariantlocalfeatureextractionlayerswithpooling,cre-

atingamultiscalerepresentation—wewillrefertothisprincipleasscale

separationandstudyinsubsequentChapterswhyitcanalsohelpdealwith

a broaderclassof geometrictransformations inaddition totranslations. Com-

putationalexperimentsshowedthatFukushima’sarchitecturewasableto

successfullyrecognisecomplexpatternssuchaslettersordigits,eveninthe

presence of noise and geometric distortions.

Lookingfromthevantagepointoffourdecadesofprogressinthefield,

onefindsthattheneocognitronalreadyhadstrikinglymanycharacteristicsof

modern deep learningarchitectures: depth (Fukishimasimulated a seven-layer

networkinhispaper),localreceptivefields,sharedweights,andpooling.It

evenusedhalf-rectifier(ReLU)activationfunction,whichisoftenbelieved

tobeintroducedinrecentdeep learningarchitectures.

Themaindistinction

frommodernsystemswasinthewaythenetworkwastrained:neocogni-

tronwasa‘self-organised’architecturetrainedinanunsupervisedmanner,

sincebackpropagationhadstillnotbeenwidelyusedintheneuralnetwork

community.

Introduction19

1.2.4Convolutional Neural Networks

Fukushima’sdesignwasfurtherdevelopedbyYannLeCun,afreshgraduate

fromtheUniversityofParis

withaPhDthesisontheuseofbackpropa-

gationfortrainingneuralnetworks.Inhisfirstpost-doctoralpositionatthe

AT&TBellLaboratories,LeCunandcolleaguesbuiltasystemtorecognise

hand-writtendigitsonenvelopesinordertoallowtheUSPostalServiceto

automaticallyroutemail.Inapaperthatisnow classical,LeCunetal.(1989)

describedthefirstthree-layerconvolutionalneuralnetwork(CNN).

Simi-

larlytotheneocognitron,LeCun’sCNNalsousedlocalconnectivitywith

sharedweightsandpooling(Figure1.13).However,itforwentFukushima’s

morecomplexnonlinearfiltering(inhibitoryconnections)infavourofsim-

plelinearfiltersthatcouldbeefficientlyimplementedasconvolutionsusing

multiply-accumulateoperationsonadigitalsignalprocessor(DSP).

This

designchoice,departingfromtheneuroscienceinspirationandterminology

andmovingintotherealmofsignalprocessing,wouldplayacrucialrolein

theensuing successof deeplearning.Another key novelty ofCNN wasthe use

of backpropagation for training.

Kunihiko

Fukushima

10 output units

H2.12

H1.12

H1.1

layer H3

30 hidden units

layer H2

12 x 16 =192

hidden units

12 x 64 =768

layer H1

256 input units

fully connected

300 links

fully connected

6000 links

40,000 links

from 12 kernels

5 x 5 x 8

20,000 links

from 12 kernels

5 x 5

H2.1

Yann LeCun

Figure 1.13

Theoriginalvariantsofconvolutionalneuralnetworkarchitectureshavebeenintro-

ducedbyFukushima andLeCun(though thename"convolutional" wouldappearlater).

Implementedon adigitalsignalprocessor,LeCun’sCNN allowedreal-time handwrit-

ten digit recognition for the first time.

LeCun’sworksshowedconvincingly thepowerofgradient-basedmethods

forcomplexpatternrecognitiontasksandwasoneofthefirstpracticaldeep

learning-based systems for computer vision.Anevolution of this architecture,

a CNN with five layers named LeNet-5 as a pun on theauthor’s name(LeCun

etal.1998),wasusedbyUSbankstoreadhandwrittencheques.Thecomputer

visionresearchcommunity,however,initsvastmajoritysteeredawayfrom

neuralnetworksandtookadifferentpath.Thetypicalarchitectureofvisual

20Chapter 1

recognitionsystemsofthefirstdecadeofthenewmillenniumwasacare-

fully hand-craftedfeature extractor(typicallydetecting interestingpoints in an

image andproviding theirlocaldescriptionina waythat isrobust toperspec-

tivetransformationsandcontrastchanges

)followedbyasimpleclassifier

(mostoftenasupportvectormachine(SVM)andmorerarely,asmallneural

network).

1.2.5Recurrent Neural Networks, Vanishing gradients, and LSTMs

WhileCNNsweremainlyappliedtomodellingdatawithspatialsymmetries

suchasimages,anotherdevelopmentwasbrewing:onethatrecognisesthat

data is often non-static, but can evolve in a temporal manner aswell. The sim-

plestformoftemporally-evolvingdataisatimeseriesconsisting ofa sequence

of steps with a data point provided at each step.

Amodelhandlingtime-seriesdatahenceneedstobecapableofmeaning-

fullyadaptingtosequencesofarbitrarylength–afeatthatCNNswerenot

triviallycapableof.ThisledtodevelopmentofRecurrentNeuralNetworks

(RNNs)inthelate1980sandearly1990s

,theearliestexamplesofwhich

simplyappliesasharedupdateruledistributedthroughtime(Jordan1986;

Elman1990).Ateverystep,theRNNstateisupdatedasafunctionofthe

previous state and the current input.

OneissuethatplaguedRNNsfromthebeginning wasthevanishinggradi-

entproblem.Vanishinggradientsariseinverydeep neuralnetworks—ofwhich

RNNsareoftenarepresentativeinstance,giventhattheireffectivedepthis

equal totheinput sequencelength.As sigmoidalactivations areunrolledover

manysteps,gradientsquicklyapproachzerowhenbackpropagationisper-

formed.Thiseffectstrongly limitstheinfluenceofearlier stepsinthesequence

to the predictions made in the latter steps.

AsolutiontothisproblemwasfirstelaboratedinSeppHochreiter’sDiploma

thesis(1991),carriedoutinMunichunderthesupervisionofJürgenSchmid-

huber,intheformofanarchitecturethatwasdubbedLongShort-Term

Memory (LSTM; Figure 1.14).

LSTMs combat the vanishing gradient prob-

lembyhavingamemorycell,withexplicitgatingmechanismsdecidinghow

much of that cell’s state to overwriteat every step — allowing one to learn the

degreeofforgettingfromdata(oralternatively,rememberingforlongtime).

Incontrast,simpleRNNsofJordan(1986)andElman(1990)performafull

overwrite at every step.

This ability to remember context for long time turned crucial in natural lan-

guageprocessingandspeechanalysisapplications,whererecurrentmodels

were shown successful applications starting from theearly 2000s (Gersand E.

Schmidhuber 2001;Graveset al.2004). However, likeit happenedwithCNNs

Introduction21

Sepp

Hochreiter

Jürgen

Schmidhuber

Figure 1.14

Thecreatorsofthelongshort-termmemorycell—thefirstrecurrentneuralnetwork

modulecapable ofdealingwith long-rangedependencies—HochreiterandSchmidhu-

ber, pictured alongside one of the earliest schematic depictions of their creation.

in computervision, thebreakthrough would needto wait foranother decadeto

come.

1.2.6Gating Mechanisms and Time Warping

Inourcontext,itmaynotbeinitiallyobviouswhetherrecurrentmodels

embodyanykindofsymmetryorinvarianceprinciplesimilartothetrans-

lationalinvariancewehaveseeninCNNs.Nearlythreedecadesafterthe

developmentofRNNs,CorentinTallecandYannOllivier(2018)showed

thatthereisatypeofsymmetryunderlyingrecurrentneuralnetworks:time

warping.

Time serieshave averyimportantbutsubtle nuance:it israrely thecasethat

the individual steps of atime series are evenly spaced.Indeed, inmany natural

phenomena,wemaydeliberatelychooseto takemany measurementsatsome

timepointsandveryfew(orno)measurementsduringothers.Inaway,the

notion of ‘time’presented to theRNN workingon such data undergoes a form

of warping operation. Can we somehow guarantee that our RNN is "resistant"

to time warping, in the sensethat we will always beable to find a set of useful

parameters for it, regardless of the warping operation applied?

Inordertohandletimewarpingwithadynamicrate,thenetworkneeds

tobeabletodynamicallyestimatehowquicklytimeisbeingwarped(the

so-called‘warpingderivative’)ateverystep.Thisestimateisthenusedto

selectively overwrite theRNN’s state.Intuitively, forlargervalues ofthe warp-

ing derivative, a more strict state overwrite should be performed, as more time

has elapsed since.

Theideaofdynamicallyoverwritingdatainaneuralnetworkisimple-

mentedthroughthegatingmechanism.TallecandOlliviereffectivelyshow

that,inordertosatisfytimewarpinginvariance,anRNNneedstohavea

22Chapter 1

gatingmechanism. Thisprovides atheoretical justificationfor gate RNNmod-

els,suchastheaforementionedLSTMsofHochreiterandSchmidhuberor

theGatedRecurrentUnit(GRU)(Choetal.2014).Fulloverwritingusedin

simpleRNNscorrespondstoanimplicitassumptionofaconstanttimewarp-

ing derivative equal to one–a situationunlikely tohappenin mostreal-world

scenarios – which also explains the success of LSTMs.

1.2.7The Triumph of Deep Learning

Asalready mentioned,the initialreceptionof whatwould laterbe called"deep

learning"hadinitiallybeenratherlukewarm.Thecomputervisioncommu-

nity,wherethefirstdecadeofthenew centurywasdominatedbyhandcrafted

featuredescriptors,appearedtobeparticularlyhostiletotheuseofneural

networks.However,thebalanceofpowerwaschangedbytherapidgrowth

incomputingpowerandtheamountsofavailableannotatedvisualdata.It

becamepossibletoimplementandtrainincreasinglybiggerandmorecom-

plexCNNsthatallowedaddressingincreasinglychallengingvisualpattern

recognitiontasks,

culminatinginaHolyGrailofcomputervisionatthat

time:theImageNetLargeScaleVisualRecognitionChallenge(Figure1.15).

Established by the American-Chinese researcherFei-Fei Li,ImageNet was an

annual challenge consisting in the classification of millions of human-labelled

images into 1000 different categories.

25%

26%

16.4%

11.7%

6.7%

3.6%

3.1%

2011

XREC

2015

ResNet

10%

15%

20%

30%

Top-5 error

28%

2010

NEC-UIUC

2012

AlexNet

2013

ZFNet

2014

GoogLeNet

VGGNet

Human

2016

GoogLeNet

-v4

Fei Fei Li

Figure 1.15

ImageNet, a benchmarkdevelopedby Fei FeiLi at StanfordUniversity was one ofthe

‘HolyGrail’challengesincomputervisionintheearly2010s.Thedramaticperfor-

mance improvement provided bythe convolutionalneural networkAlexNet in 2012is

considered the turning point leadingto the widespread adoption of deep learning inthe

field.

A CNNarchitecturedeveloped at theUniversity of Torontoby Krizhevsky,

Sutskever,andHinton(2012)managedtobeatbyalargemargin

allthe

competingapproachessuchassmartlyengineeredfeaturedetectorsbasedon

decadesofresearchinthefield.AlexNet(asthearchitecturewascalledin

Introduction23

honourofitsdeveloper,AlexKrizhevsky;seeFigure1.16)wassignificantly

biggerintermsofthenumberofparametersandlayerscomparedto itsolder

siblingLeNet-5,

but conceptuallythesame.Thekey differencewastheuse

ofagraphicsprocessor(GPU)fortraining

,nowthemainstreamhardware

platform for deep learning.

224

128

192192128

128

192192

128

20482048

1000

Stride

of 4

Max

pooling

Max

pooling

Max

pooling

dense

densedense

CONV1

POOL1POOL2

CONV2CONV3CONV4CONV5

POOL3

FC1FC2FC3

Alex Krizhevsky

Figure 1.16

AlexKrizhevsky,picturednexttoaschematicofAlexNet—thefirstdeepneural

network-basedsolution towin theImageNet contest,by asignificantmargin.AlexNet’s

result iscommonly seenas apivotal moment inthe development ofmodern deeplearn-

ing,usheringinsignificantdevelopmentsinsubsequentyears—eventuallyleadingto

surpassing human performance on ImageNet only a few years later.

The successofCNNs onImageNet becamethe turning pointfordeep learn-

ingandheraldeditsbroadacceptanceinthefollowingdecade.Asimilar

transformationhappenedinnaturallanguageprocessingandspeechrecog-

nition,whichmovedentirelytoneuralnetwork-basedapproachesduringthe

2010s; indicatively, Google and Facebook switched their machine translations

systems toLSTM-basedarchitecturesaround2016–2017.Multi-billion dollar

industriesemergedas aresult ofthis breakthrough,with deeplearning success-

fullyusedincommercialsystemsrangingfromspeechrecognitioninApple

iPhonetoTeslaself-drivingcars.Morethanfortyyearsafterthescathing

review of Rosenblatt’s work, the connectionists were finally vindicated.

1.2.8Graph Neural Networks and Their Chemical Precursors

Ifthehistoryofsymmetryistightlyintertwinedwithphysics,thehistoryof

graphneuralnetworks—oneofthecentraltopicsofourbook—hasrootsin

anotherbranchofnaturalscience:chemistry.Chemistryhashistoricallybeen

– and stillis– one ofthemost data-intensive academic disciplines.Theemer-

genceofmodernchemistryintheeighteenthcenturyresultedintherapid

24Chapter 1

growthofknownchemicalcompoundsandanearlyneedfortheirorgani-

sation.ThisrolewasinitiallyplayedbyperiodicalssuchastheChemisches

Zentralblatt

and"chemicaldictionaries"liketheGmelinsHandbuchder

anorganischenChemie(anearlycompendiumofinorganiccompoundsfirst

publishedin1817

)andBeilsteinsHandbuchder organischenChemie (asim-

ilareffortfor organic chemistry)–allinitially publishedinGerman, whichwas

the dominant language of science until the early 20th century.

IntheEnglish-speakingworld,theChemicalAbstracts Service(CAS) was

createdin 1907andhasgraduallybecomethecentralrepositoryforthe world’s

publishedchemicalinformation.

However,thesheeramountofdata(the

Beilstein alone hasgrown to over 500 volumes and nearly half amillion pages

overitslifetime)hasquicklymadeitimpracticaltoprintandusesuchchemical

databases.

Sincethemid-nineteenthcentury,chemistshaveestablishedauniversally

understoodwaytorefertochemicalcompoundsthroughstructural formulae,

indicatingacompound’satoms,thebondsbetweenthem,andeventheir3D

geometry.Butsuchstructuresdidnotlendthemselvestoeasyretrieval.In

thefirsthalfofthe20thcentury,withtherapidgrowthofnewlydiscovered

compoundsandtheircommercialuse,theproblemoforganising,searching,

and comparingmoleculesbecameof crucialimportance:forexample, whena

pharmaceutical company sought to patent a new drug, the Patent Office had to

verify whether a similar compound had been previously deposited.

Toaddressthischallenge,severalsystemsforindexingmoleculeswere

introducedinthe1940s,formingfoundationsfor anewdisciplinethatwould

laterbecalledchemoinformatics.One suchsystem,namedthe‘GKDchemical

cipher’ after theauthors Gordon,Kendall,and Davison (1948), was developed

attheEnglishtirefirmDunloptobeusedwithearlypunchcard-basedcom-

puters.

In essence,the GKDcipher was analgorithm forparsing amolecular

structureintoastringthatcouldbemoreeasilylookedupbyahumanora

computer.

However,theGKDcipherandotherrelatedmethods

werefarfromsat-

isfactory.Inchemicalcompounds,similarstructuresoftenresultinsimilar

properties.Chemistsaretrainedtodevelopintuitiontospotsuchanalogies,

andlookforthemwhencomparingcompounds.

Ontheotherhand,when

amoleculeisrepresentedasastring(suchasintheGKDcipher),thecon-

stituents of a singlechemical structure may be mappedinto different positions

of thecipher. As aresult, two molecules containinga similarsubstructure (and

thus possibly similarproperties)mightbeencodedinvery different ways (see

an example in Figure 1.17).

Introduction25

George Vl

adu ̧t

Figure 1.17

A figure fromVl

adu ̧t et al.(1959) showing a chemical molecule (top left)and its frag-

ment(topright)andthecorrespondingGKD-ciphers(bottom).Notethatthiscoding

systembreaksthespatiallocalityofconnectedatomsinthemolecule,suchthatthe

fragmentciphercannotbefoundbysimplesubstringmatchinginthatofthethefull

molecule.This drawbackofearlychemicalrepresentation methodswasoneofthemoti-

vations for the search of structural representations of molecules as graphs.

Thisrealisationhasencouragedthedevelopmentof"topologicalciphers,"

tryingtocapturethestructureofthemolecule.Firstworksofthiskindwere

doneattheDowChemicalscompanybyOplerandNorton(1956)andthe

USPatentOfficebyRayandKirsch(1957)–bothheavyusersofchemical

databases.Oneofthemostfamoussuchdescriptors,knownasthe‘Morgan

fingerprint,’ wasdevelopedby HarryMorgan(1965) atthe ChemicalAbstracts

Service and used until today.

Afigurethathasplayedakeyroleindevelopingearly"structural"

approachesforsearchingchemicaldatabasesistheRomanian-bornSoviet

researcherGeorgeVl

adu ̧t(Figure1.17).Achemistbytraining(withaPhD

inorganicchemistrydefendedattheMoscowMendeleevInstitutein1952),

he experienced a traumaticencounterwith the gargantuan Beilstein handbook

inhisfreshmanyears.

Thissteeredhisresearchintereststowardschemoin-

formatics(Vl

adu ̧tetal.1959),afieldinwhichheworkedfortherestofhis

life.

adu ̧t iscredited as oneof thepioneers of usinggraph theoryfor modeling

thestructuresandreactions ofchemical compounds.In asense, thisshouldnot

comeasasurprise:graphtheoryhasbeenhistoricallytiedtochemistry,and

eventheterm ‘graph’(referring toaset ofnodesand edges,rather thanaplot

of a function) was introduced by themathematician James Sylvester (1878) as

a mathematical abstraction of chemical molecules; cf. Figure 1.18.

In particular, Vl

adu ̧t advocated the formulation of molecular structure com-

parisonasthegraphisomorphismproblem;hismostfamousworkwason

26Chapter 1

August Kekulé

(1829–1896)

James Joseph

Sylvester

(1814–1897)

Figure 1.18

Thestructuralformulaofbenzene(C

)proposedbythe19th-centuryGerman

chemistAugustKekulé.Theterm"graph"(inthesenseusedingraphtheory)was

firstintroducedasamodelofmoleculesbyJamesSylvesterinan1878Naturenote,

explicitly relating molecules to graphs expressing their "Kekuléan diagrams".

classifying chemical reactionsas the partial isomorphism(maximum common

subgraph) of the reactant and product molecules (Vleduts 1963).

adu ̧t’sworkinspired

apairofyoungresearchers,BorisWeisfeiler(an

algebraicgeometer)andAndreyLehman

(self-describedasa"program-

mer"

).Inaclassicaljointpaper(WeisfeilerandLeman1968),theduo

introducedaniterativealgorithmfortestingwhetherapairofgraphsareiso-

morphic(i.e.,thegraphshavethesamestructureuptoreorderingofnodes),

which becameknown astheWeisfeiler-Lehman(WL)test

; seeFigure1.19.

Thoughthetwohadknowneachotherfromschoolyears,theirwaysparted

shortly after their publication and each became accomplished inhis respective

field

WeisfeilerandLehman’sinitialconjecturethattheiralgorithmsolvedthe

graphisomorphismproblem(anddoesitinpolynomialtime)wasincorrect:

whileLeman(1970)demonstrateditcomputationallyforgraphswithatmost

ninenodes,alargercounterexamplewasfoundayearlater(Adelson-Velskii

et al. 1969) (and in fact, a strongly regular graph failingthe WL test called the

‘Shrinkhande graph’ had been known even earlier, (Shrikhande 1959)).

ThepaperofWeisfeilerandLehmanhasbecomefoundationalinunder-

standinggraphisomorphism. Toput theirworkof inhistoricalperspective, one

should rememberthatin the1960s, complexity theory was still embryonicand

algorithmicgraph theorywasonly takingitsfirstbaby steps.AsLehmanrecol-

lected inthe late1990s: "in the60s, onecould inmatter ofdays re-discoverall

the facts,ideas,and techniquesin graphisomorphism theory.Idoubt, thatthe

word ‘theory’isapplicable;everythingwasatsuchabasiclevel."Inthecon-

textofgraphneuralnetworks,WeisfeilerandLehmanhaverecentlybecome

Introduction27

Boris

Weisfeiler

Andrey

Lehman

Figure 1.19

Thecreators oftheeponymousWeisfeiler-Lehmangraphisomorphismtest,thebedrock

ofexpressivityanalysisforbothgraphisomorphismandgraphneuralnetworks,are

pictured alongside the front page of their paper.

householdnameswiththeproofof theequivalence oftheir graphisomorphism

test to message passing (K. Xu et al. 2018; Morris et al. 2019).

1.2.9Back to the Origins

Thoughchemistshavebeen usingGNN-likealgorithms fordecades, itislikely

that their works onmolecularrepresentation remained practicallyunknown in

the machine learning community.

We find it hard to pinpoint precisely when

theconceptofgraphneuralnetworkshasbeguntoemerge:partlyduetothe

factthatmostoftheearlyworkdidnotplacegraphsasafirst-classcitizen,

partlysincegraphneuralnetworksbecamepracticalonlyinthelate2010s,

andpartlybecausethisfieldemerged fromtheconfluenceofseveral adjacent

researchareas;nonetheless,herewewilldiscussseveralpioneeringworks,

many of which have been designed by researchers in Figure 1.20.

Earlyformsofgraphneuralnetworkscanbetracedbackatleasttothe

1990s,withexamplesincluding"LabelingRAAM"byAlessandroSperduti

(1994),the"backpropagationthrough structure"of Gollerand Kuchler(1996),

andadaptiveprocessingofdatastructures(SperdutiandStarita1997;Frasconi,

Gori,andSperduti1998).Whiletheseworkswereprimarilyconcernedwith

operatingover"structures"(oftentreesordirectedacyclicgraphs),manyof

theinvariancespreservedintheirarchitecturesarereminiscentoftheGNNs

more commonly in use today.

The first propertreatment of the processingof generic graphstructures (and

the coiningof theterm ‘graphneural network’)happenedafter theturn ofthe

twenty-first century. AUniversityof Sienateam ledby Marco Gori(2005) and

28Chapter 1

FrancoScarselli(2008)proposedthefirst"GNN."Theyreliedonrecurrent

mechanisms,requiredtheneuralnetworkparameterstospecifycontraction

mappings,andthuscomputingnoderepresentationsbysearchingforafixed

point—thisinitselfnecessitatedaspecialformofbackpropagationanddid

notdependonnodefeaturesatall.Alloftheaboveissueswererectifiedby

the Gated GNN (GGNN)model of Yujia Li et al. (2015), whichbrought many

benefitsofmodernRNNs,suchasgatingmechanisms(Choetal.2014)and

backpropagationthroughtime.Theneuralnetworkforgraphs(NN4G)pro-

posedbyAlessioMicheli(2009)aroundthesametimeusedafeedforward

rather than recurrent architecture, in fact resembling more the modern GNNs.

Alessandro

Sperduti

Christoph

Goller

Andreas

Küchler

Marco

Gori

Franco

Scarselli

Alessio

Micheli

Figure 1.20

Sixofthepioneeringauthorsofgraphneuralnetworks(GNNs)withinthemachine

learningcommunity.AlessandroSperdutidevelopedtheearliestknownGNN-like

architecturetobepublishedatNeurIPS,in1994.GollerandKüchleralsopublished

an early form ofbackprop through structures in the ’90s.The term "GNN" was coined

by Goriand Scarselli’s workin 2005, firmlyestablishing theterm that’s very popularto

thisday—although,theconcurrentNN4G workofAlessioMicheliusesamechanism

that more closely resembles modern GNN implementations.

Anotherimportantclassofgraphneuralnetworks,oftenreferredtoas"spec-

tral",reliedonthenotionoftheGraphFourier transform(Brunaetal.2013).

Therootsofthisconstructionareinthesignalprocessingandcomputational

harmonicanalysis communities,wheredealingwith non-Euclideansignalshas

becomeprominentinthelate2000sandearly2010s.

Influentialpapersby

Shumanetal.(2013)andSandryhailaandMoura(2013)popularisedthenotion

of"GraphSignalProcessing"(GSP)andthegeneralisationofFouriertrans-

formsbasedontheeigenvectorsofgraphadjacencyandLaplacianmatrices.

ThegraphconvolutionalneuralnetworkrelyingonspectralfiltersbyDeffer-

rard,Bresson,andVandergheynst(2016)andtheGCNmodelfromKipfand

Welling(2016)—withitspresentationinspiredbyspectralfilters—areamong

the most cited in the field.

Introduction29

Itis worthnoting that,while theconceptofGNNs experienced severalinde-

pendentre-derivationsinthe2010sarisingfromseveral perspectives(besides

the alreadymentioned connectionsto computationalchemistry and signalpro-

cessing,weshouldhighlightprobabilisticgraphicalmodels(Dai,Dai,and

Song2016)andnaturallanguageprocessing(Vaswanietal.2017)),thefact

thatallofthesemodelsarrivedatdifferentinstancesofacommonblueprint

iscertainlytelling.Infact,ithasrecentlybeenposited(Veli

ckovi

c2022)that

GNNs mayoffer auniversal framework forprocessingdiscretiseddata. Other

neuralarchitectureswe willdiscuss inthis book(such asCNNs orRNNs)may

be recoveredas specialcases ofGNNs, byinserting appropriatepriors intothe

message passing functionsorthe graphstructureover which the messagesare

computed.

Inasomewhatironictwistoffate,modernGNNsweretriumphantlyre-

introducedtochemistry(Figure1.21),afieldtheyoriginatedfrom,byDavid

Duvenaud etal.(2015) asareplacementforhandcraftedMorgan’smolecular

fingerprints, and byJustin Gilmer etal. (2017) inthe formof message-passing

neuralnetworksequivalenttotheWeisfeiler-Lehmantest.Afterfiftyyears,

thecirclefinallyclosed.Atthetimeofwriting,graphneuralnetworkshave

becomeastandardtoolinchemistryandalreadyusedindrugdiscoveryand

designpipelines.Anotable accoladewasclaimedwiththeGNN-based discov-

eryofnovelantibioticcompounds(Stokesetal.2020).DeepMind’sAlphaFold

2(Jumperetal.2021)usedaformofGNNsinordertoaddressahallmark

problem in structural biology – the problem of protein folding.

David DuvenaudJustin Gilmer

Figure 1.21

Thetwo worksspearheaded byDavid Duvenaudand JustinGilmer, respectively,have

greatly popularisedthe use ofGNNs forboth drugscreening and quantumchemistry—

applications that remain prominent to this day.

30Chapter 1

In1999,AndreyLehmanwrotetoamathematiciancolleaguethathehad

the"pleasuretolearnthat‘Weisfeiler-Leman’wasknownandstillcaused

interest."

HedidnotlivetoseetheriseofGNNsbasedonhisworkof

fiftyyearsearlier.NordidGeorgeVl

adu ̧tseetherealisationofhisideasin

chemoinformatics, many of which remained on paper during his lifetime.

1.3The ‘Erlangen Programme’ of Deep Learning

Ourhistorical overviewof thegeometricfoundationsof deeplearning hasnow

naturallybroughtustotheblueprintthatunderpins thisbook.TakingConvo-

lutional andGraphNeural Networksastwoprototypicalexamples, at thefirst

glancecompletelyunrelated, wefindseveralcommontraits.First,both operate

ondata(imagesinthecaseofCNNsormoleculesinthecaseofGNNs)thathas

someunderlyinggeometricdomain(respectively,agridoragraph).Second,

in bothcases thetaskshave anaturalnotion ofinvariance (e.g.tothe position

ofanobjectinimageclassification,or thenumberingof atomsina molecule

inchemicalpropertyprediction)thatcanbeformulatedthroughappropriate

symmetry group (translation in the former example and permutation in the lat-

ter). Third, bothCNNs and GNNsincorporate the respective symmetries asan

inductivebiasbymakingtheirlayersinteractappropriatelywiththeactionof

thesymmetrygroupontheinput. InCNNs,itcomesinthe formofconvolu-

tionallayerswhoseoutputtransformsinthesamewayastheinput(wewill

callthispropertytranslation-equivariance,whichisthesameassayingthat

convolution commuteswith theshiftoperator).InGNNs,it assumestheform

of a symmetricmessagepassing function thataggregates theneighbournodes

irrespectiveoftheirorder,andtheoveralloutputofamessagepassinglayer

transformsinthesamewaywiththepermutationoftheinput(permutation-

equivariant).Finally,insomearchitecturalinstances,thedataareprocessed

in a multi-scalefashion;thiscorrespondstopoolinginCNNs(uniformly sub-

sampling the grid thatunderliestheimage) orgraphcoarseningin sometypes

of GNNs.

Overall,thisappearstobeaverygeneralprinciplethatcanbeappliedto

abroadrangeofproblems,typesofdata,andarchitectures.Wewillusethe

GeometricDeepLearningBlueprinttoderivefromfirstprinciplessomeof

themostcommonandpopularneuralnetworkarchitectures(CNNs,GNNs,

LSTMs,DeepSets,Transformers),whichasoftodayconstitutethemajority

ofthedeeplearningtoolkit.AswewillseeinChapters5–9,alloftheabove

canbeobtainedbyanappropriatechoiceofthedomainandtheassociated

symmetry group.

FurtherextensionsoftheBlueprint,suchasincorporatingthesymmetries

ofthedatainadditiontothoseofthedomain,allowtoobtainanewclassof

Introduction31

equivariantGNNarchitectures,thatwillbediscussedlateroninthisbook.

Sucharchitectureshaverecentlycometo limelightin molecularmodeling, and

accountfor thefactthat inmolecular graphs(where nodesrepresentatomsand

edgesarechemicalbonds),thepermutationofthenodesandtherotationofthe

node features (atomic coordinates) are different symmetries.