Graphs

•

Inmultiplebranchesofscience,fromsociologytoparticlephysics,

graphs are a fundamental model of systems of relations and interactions.

•

Graphsgiverisetoabasictypeofsymmetrymodelledbythegroupof

permutations. This is why we start our investigation by studying them.

•

Mostotherobjectsofinteresttous,suchasgridsandsets,canbe

obtainedasaparticularcaseofgraphs,andmanyarchitectureswewill

discussherewillbeexpressibleinthelanguageofgraphneuralnetworks.

•

Thisalsoincludesthemodernlargelanguagemodel (LLM)stack,which

we present as a case study in this Chapter.

5.1Domain

Inmanyways,graphsarethemainmodalityofdatawereceivefromnature.

Indeed,atvirtuallyalllevelsoforganisationofbiologicalandphysical

systems—frommoleculestotheconnectomicstructureoftheneuronsinthe

brain—graphsofferanaturalabstraction;namely,thatofaninterconnected

networkofnodes.Similarconclusionscanbedrawnabouthuman-designed

systems,suchastransportationnetworksorsocialnetworks.Infact,itmay

evenbearguedthatthedomainswheredeeplearninghastraditionallyseen

mostsuccess—suchasimages,textorspeech—canallbeintheveryleast

seen as special cases of graph structures, but also very often just grid-like pro-

jectionsofafarmoreirregularly

structuredphenomenon.Furthermore,the

veryorganisationofcognitionseemstobeirregularlystructured,withcon-

cepts freelyrelated andrecombined toform new knowledge. Allof thistaken

into account, it is likely not very controversial to claim that a truly artificially-

intelligentsystemwillnecessarilyneedtoknowhowtoprocessandreason

about graph-structureddata. Conveniently,this frameworkalso fitsverynicely

inourGDLblueprint,andcanbeusedasabasisforstudyingmanyother

geometric domains. Now, it is our turn to study graph representation learning.

130Chapter 5

Thisisagraph.G=(V,E)Thisisitstensorialrepresentation.

X∈R

|V×ばつk

A∈R

|V×ばつ|V|

100101

010001

001110

101101

001011

110111

Figure 5.1

AnundirectedgraphG=(V,E)containingthesetofnodesV={u,v,a,b,c,d}and

edgesE={(a,d),(a,v),(b,v),(c,d),(c,u),(d,v),(u,v)}(symmetricedgesomittedfor

brevity).Thisgraphisalsorepresentedusingtensors(nodefeaturesX,adjacencymatrix

A) for a node permutation g=[a,b,c,d,u,v].

Agraph G=(V,E)isacollectionofnodes

VandedgesE⊆V×ばつVbetween

pairsofnodes.Forthepurposeofthefollowingdiscussion,wewillfurther

assumethe nodesto beendowedwith s-dimensionalnode features,denotedby

for all u∈V; see Figure 5.1 for an illustration. Social networks are perhaps

amongthemostcommonlystudiedexamplesofgraphs,wherenodesrepre-

sentusers,edgescorrespondtofriendshiprelationsbetweenthem,andnode

features modeluserproperties suchasage,profilepicture,etc.Itisalsooften

possibletoendowtheedges,orentiregraphs,withfeatures;butasthisdoes

not alter our main findings, we will defer discussing it to Section 5.2.2.

ThekeystructuralpropertyofgraphsisthatthenodesinVareusuallynot

assumedtobeprovidedinanyparticularorder,andthusanyoperationsper-

formedongraphsshouldnotdependontheorderingofnodes.Thedesirable

propertythatfunctionsactingongraphsshouldsatisfyisthuspermutation

invariance,anditimpliesthatforanytwoisomorphicgraphs—thelikesof

whichwesawinFigure4.4—theoutcomesofthesefunctionsareidentical.

Wecansee thisasa particularsettingof ourblueprint,wherethe domainΩ=G

and thespaceX(G ,R

) isthatof d-dimensionalnode-wise signals.Thesym-

metryweconsiderisgivenbythepermutationgroupG=S

,whoseelements

are all the possible orderings of the set of node indices {1,...,n}.

5.1.1Sets

Letusfirstillustratetheconceptofpermutationinvarianceonsets,aspecial

caseofgraphswithout edges(i.e., E=∅). Bystacking thenodefeaturesasrows

ofthen×ばつd matrix X=(x

,...,x

)

⊤

,wedo effectivelyspecify anorderingof

Graphs131

thenodes.Theactionofthepermutationg∈S

onthesetofnodesamounts

tothereorderingoftherows ofX,which canberepresentedasann×ばつnper-

mutation matrix ρ(g)=P,

where each row andcolumn contains exactly one 1

and all the other entries are zeros.

Afunctionfoperatingonthissetisthensaidtobepermutationinvariant

if, for any such permutation matrix P, it holds that f(PX)=f(X). One simple

such function is

f(X)=φ

u∈V

(

)

,(5.46)

where thefunction ψis independently appliedto every node’s features, andφ

is appliedon its sum-aggregated outputs: assum isindependent ofthe order in

whichitsinputsareprovided,suchafunctionisinvariantwithrespecttothe

permutation ofthe node set,and is henceguaranteed to always returnthe same

output,no matterhowthe nodesare permuted.Suchalayer isverypowerful—

undersuitableconditionsonψ’soutputdimensionality,itcanbeshownthat

any permutation-invariant setfunction must be representableusing theform in

Equation 5.46 (Wagstaff et al. 2019).

Functionsliketheabove providea‘global’set-leveloutput,but veryoften,

we willbe interestedin functionsthat act‘locally’, ina node-wisemanner.For

example, we maywant to applysome functiontoupdate thefeaturesin every

node, obtainingthe set oflatent node features. Ifwe stack these latentfeatures

intoamatrixH,thefunction

F(X)=Hisnolongerpermutationinvariant:

theorderoftherowsofH shouldbetiedtotheorderoftherowsofX,so

that we know which output node feature corresponds to which input node. We

needinsteadamorefine-grainednotionofpermutationequivariance,stating

that,oncewe "commit"to apermutation ofinputs, itconsistently permutesthe

resulting objects. Formally, F(X) isa permutation equivariantfunction if, for

anypermutation matrix P,it holds thatF(PX)=PF(X). A shared node-wise

linear transform

(X)=XΘ(5.47)

specified by aweight matrix Θ∈R

d×ばつd

′

, is onepossible construction ofsuch a

permutationequivariant function,producinginourexamplelatentfeaturesof

the form h

=Θ

⊤

ThisconstructionarisesnaturallyfromourGeometricDeepLearning

blueprint.Namely,wecancharacteriselinearequivariants(functionsofthe

formFPX=PFX),forwhichitiseasytoverifythatanysuchmapcanbe

writtenasalinearcombinationoftwogenerators,theidentityF

X=X and

theaverageF

⊤

u=1

.InSection5.2.3,weshowhowthe

popular DeepSets (Zaheeret al.2017) architecturefollows preciselythis idea.

132Chapter 5

5.1.2From sets to graphs

We cannowgeneralise the notionsof permutationinvariance and equivariance

from sets tographs.In the generic settingE =∅, the graphconnectivity can be

represented by the n×ばつn adjacency matrix A,

defined as

(

1(u,v)∈E

0otherwise.

(5.48)

NotethatnowtheadjacencyandfeaturematricesAandXare"synchronised",

inthesensethata

specifiestheadjacencyinformationbetweenthenodes

described by the uth and vth rows of X– cf. Figure 5.1. Therefore, applying a

permutation matrixPto thenode features Xautomatically impliesapplying it

toA’s rows andcolumns,PAP

⊤

Wesaythat(agraph-wisefunction

)fis

permutation invariant if

f(PX,PAP

⊤

)=f(X,A)(5.49)

and (a node-wise function) Fis permutation equivariant if

F(PX,PAP

⊤

)=PF(X,A)(5.50)

for any permutation matrix P.

Hereagain,wecanfirstcharacteriselinearequivariantfunctions.

observedbyMaronetal.2018,anylinearFsatisfyingequation(5.50)can

beexpressedasalinearcombinationoffifteenlineargenerators;remarkably,

thisfamilyofgeneratorsisindependentofn.Amongstthesegenerators,our

blueprintspecificallyadvocatesforthosethatarealsolocal,i.e.,wherebythe

output onnodeudirectly depends onits neighbouring nodesin the graph.We

canformalisethisconstraintexplicitly inourmodelconstruction,bydefining

what it means for a node to be neighbouring another.

A(undirected)neighbourhoodofnodeu,sometimesalsocalled1-hop,is

defined as

={v:(u,v)∈Eor(v,u)∈E}(5.51)

and the neighbourhood features are defined as the multiset

={{x

:v∈N

}}.(5.52)

Operating on 1-hopneighbourhoods aligns well withthe locality aspectof our

blueprint:namely,definingourmetric overgraphsas theshortestpathdistance

between nodes using edges in E.

TheGDLblueprint thusyields ageneralrecipeforconstructingpermutation

equivariant functions on graphs, by specifying a local function φthat operates

over thefeatures ofa nodeand itsneighbourhood, φ(x

). Then,a permu-

tation equivariant function Fcan beconstructed byapplying φto everynode’s

Graphs133

φ(x

)

Figure 5.2

Anillustrationofconstructingpermutation-equivariantfunctionsovergraphs,byapply-

inga permutation-invariantfunction φtoeveryneighbourhood.Inthis case,φis applied

tothefeaturesx

ofnodebaswellasthemultisetofitsneighbourhoodfeatures,

={{x

}}. Applying φin this manner to every node’s neighbourhood

recovers the rows of the resulting matrix of latents features H=F(X,A).

neighbourhood in isolation (see Figure 5.2):

F(X,A)=







φ(x

)

φ(x

)

φ(x

)







(5.53)

AsF isconstructedbyapplyingasharedfunctionφtoeachnodelocally,its

permutationequivariancerestsonφ’soutputbeingindependentontheordering

ofthenodesinN

.Thus,ifφisbuilttobepermutationinvariant,thenthis

property is satisfied.

Itisalsoworthnoticingthatthedifferencebetweenfunctionsdefinedon

sets andmore generalgraphsin thisexample isthat inthe lattercasewe need

to explicitly accountfor thestructure ofthe domain. Asa consequence,graphs

stand apartinthesense thatthedomainbecomespartof theinputinmachine

learningproblems,whereaswhendealingwithsetsandgrids(bothparticu-

larcasesofgraphs)wecanspecifyonlythefeaturesandassumethedomain

tobefixed.Thisdistinctionwillbearecurringmotifinourdiscussion.As

aresult,thenotionofgeometricstability(invariancetodomaindeformation)

is crucial in mostproblemsof learningongraphs.Itstraightforwardly follows

fromour constructionthatpermutation invariantandequivariantfunctionspro-

duce identical outputs on isomorphic (topologically-equivalent) graphs. These

resultscanbegeneralisedtoapproximatelyisomorphicgraphs,andseveral

results on stability under graphperturbations exist (Levie et al. 2018).We will

134Chapter 5

returntothisimportantpointinourdiscussiononmanifolds,whichwewill

use as an vehicle to study such invariance in further detail.

Second,duetotheiradditionalstructure,graphsandgrids,unlikesets,

canbecoarsenedinanon-trivialway

,givingrisetoavarietyofpooling

operations.

5.2Model

GraphNeuralNetworks(GNNs)aretherealisationofourGeometricDeep

Learningblueprintongraphsleveragingthepropertiesofthepermutation

group. GNNsare amongthe mostgeneral classofdeep learningarchitectures

currently in existence, andas we will see in this text, most other deep learning

architecturescanbeunderstoodasaspecialcaseoftheGNNwithadditional

geometric structure.

AsperourdiscussioninSection5.1,weconsideragraphtobespecifiedwith

an adjacency matrix Aand nodefeatures X.We will studyGNN architectures

thatarepermutationequivariantfunctionsF(X,A)constructedbyapplying

sharedpermutationinvariantfunctionsφ(x

)overlocalneighbourhoods.

Undervariousguises,thislocalfunctionφcanbereferredtoas"diffusion",

"propagation",or"messagepassing",andtheoverallcomputationofsuchF

as a "GNN layer".

5.2.1The three "flavours" of GNNs

ThedesignandstudyofGNNlayersisoneofthemostactiveareasofdeep

learningatthetimeofwriting,makingitalandscapethatischallengingto

navigate.Fortunately,wefindthatthevastmajorityoftheliteraturemaybe

directlyderivedfromonlythreespatial

"flavours"ofGNNlayers(Figure

5.3),whichwewillpresenthere.Theseflavoursgoverntheextenttowhich

φtransformstheneighbourhoodfeatures,allowingforvaryingdegreesof

complexity when modelling interactions across the graph.

Inallthreeflavours,permutationinvarianceisensuredbyaggregatingfea-

turesfromX

(potentiallytransformed,bymeansofsomefunctionψ)with

somepermutation-invariantfunction

,andthenupdatingthefeaturesof

nodeu,bymeansofsomefunctionφ.Typically,

ψandφarelearnable,

whereas

isrealisedasanonparametricoperationsuchassum,mean,

ormaximum,thoughitcanalsobeconstructede.g.usingrecurrentneural

networks with appropriate constraints (R. L. Murphy et al. 2018).

Intheconvolutionalflavour(KipfandWelling2016;Defferrard,Bresson,

andVandergheynst2016;F. Wu etal.2019), thefeatures ofthe neighbourhood

Graphs135

Convolutional

Attentional

Message-passing

Figure 5.3

AvisualisationofthedataflowforthethreeflavoursofGNNlayers,g.Weusethe

neighbourhoodofnodebfromFigure5.2toillustratethis.Left-to-right:convolutional,

where sendernode featuresare multipliedwith aconstant, c

; attentional,where this

multiplierisimplicitlycomputedviaanattentionmechanismofthereceiveroverthe

sender:α

=a(x

);andmessage-passing,wherevector-basedmessagesarecom-

puted based on both the sender and receiver: m

=ψ(x

nodes are directly aggregated with fixed weights,

=φ

v∈N

ψ(x

)

.(5.54)

Here,c

specifiestheimportanceof nodevto nodeu’s representation.Itisa

constant that often directly depends on the entries in Arepresenting the struc-

tureofthegraph.Note thatwhentheaggregationoperator

ischosentobe

thesummation,it canbe consideredas alinear diffusionorposition-dependent

linear filtering, a generalisation of convolution

—givingthe flavour its name.

Intheattentionalflavour(Veli

ckovi

cetal.2018;Montietal.2017;J.Zhang

et al. 2018; Brody, Alon, and Yahav 2022), these interactions are implicit:

=φ

v∈N

a(x

)ψ(x

)

.(5.55)

Here,aisalearnableself-attentionmechanismthatcomputestheimportance

coefficientsα

=a(x

)implicitly.Theyareoftensoftmax-normalised

acrossallneighbours.When

isthesummation,theaggregationisstilla

linearcombinationoftheneighbourhoodnodefeatures,butnowtheweights

are feature-dependent.

Finally,themessage-passingflavour(Battagliaetal.2016;Gilmeret

al.2017;Battagliaetal.2018)amountstocomputingarbitraryvectors

("messages") across edges,

=φ

v∈N

ψ(x

)

.(5.56)

136Chapter 5

Here,ψisalearnablemessagefunction,computingv’svectorsenttou,and

the aggregation can be considered as a form of message passing on the graph.

Oneimportantthingtonoteisarepresentationalcontainmentbetween

theseapproaches:convolution⊆attention⊆message-passing. Indeed,atten-

tionalGNNscanrepresentconvolutionalGNNsbyanattentionmechanism

implementedasalook-uptablea(x

)=c

,andbothconvolutionaland

attentionalGNNsarespecialcasesofmessage-passingwherethemessages

areonlythesendernodes’features:ψ(x

)=c

ψ(x

)forconvolutional

GNNsandψ(x

)=a(x

)ψ(x

)forattentionalGNNs. Notethatthelat-

ter containment holdseven ifthe attentional coefficients arenormalised across

neighbours; we will explore this further in Section 5.4.1.

This does not imply that messagepassing GNNs are always the most useful

variant;astheyhaveto computevector-valuedmessages acrossedges, theyare

typically harder totrain and requireunwieldy amounts of memory. Further, on

awiderangeofnaturally-occurringgraphs,thegraph’sedgesencodefordown-

stream classsimilarity (i.e.anedge (u,v) impliesthat uand vare likely tohave

thesameoutput).Forsuchgraphs(oftencalledhomophilic),convolutional

aggregationacrossneighbourhoodsisoftenafarbetterchoice,bothinterms

ofregularisationandscalability.AttentionalGNNsoffera"middle-ground":

theyallowformodellingcomplexinteractionswithinneighbourhoodswhile

computingonlyscalar-valuedquantitiesacrosstheedges,makingthemmore

scalable than message-passing.

The"threeflavour"categorisationpresentedhereisprovidedwithbrevity

inmindandinevitablyneglectsawealthofnuances,insights,perspectives,

generalisations,andhistoricalcontextstoGNNmodels.However,basedon

argumentswewillpresentinthefollowingsubsections,itsexpressivitymay

be sufficient to represent anygraph function of interest.

5.2.2A maximally potent GNN: the Graph Network model

Westartedoutourdiscussionbydeliberatelyassumingthegraphonlyhad

featuresinthenodes;however,formanydomainsofscientificandindustrial

interest, we may be interested in featurising the graph at other granularities.

Oneveryimportantsuchcontextiscomputationalchemistry,andwhen

working with molecular data we may expect to see,simultaneously: nodefea-

tures,x

,suchasatomtypeandcharge;edge features,x

,suchasbondtypes,

andwhethertheyareinaring;andgraph features, x

,suchasthemolecular

weight orfingerprints.We maythenbe interested inusing a GNNto compute

representations h

, h

at all of these levels.

Toillustratehowalltheconcepts weintroducedbeforegeneralise tothisset-

ting, wepresent theGraphNetwork architecture (Battagliaet al.2018).It isa

Graphs137

v∈N

(u,v)∈E

u∈V

Figure 5.4

ThedataflowoftheGraphNetworkmodel(Battagliaetal.2018),integratingnode

features(inblue),edgefeatures(ingreen)andgraphfeatures(inred)tocomputeappro-

priate representations. Equivariant(blue) and invariant (red) layers featureprominently

across the Graph Network’s three functions ψ, φand ρ.

generalblueprintfor buildingpermutationequivariantfunctionsoverattributed

graphs, and all of theflavours we discussed can be obtained byanappropriate

restriction ofits equations.It proceedsin three phases,at eachstep usingall of

the relevant information (Figure 5.4).

First,edgerepresentations,h

,arecomputedusingthefeaturesoftherel-

evanttwonodes(uandv),thefeaturesoftheedge,andthefeaturesofthe

graph:

=ψ

(

)

,(5.57)

whereψwouldbeanalogoustothepreviouslydefinedmessagefunction.

Treatingh

asmessages,wecancomputenoderepresentations.TheGraph

Network takes into account the node’s features, the aggregated edge represen-

tations from all of the neighbours, and the graph features:

=φ

v∈N

.(5.58)

Finally,sincegraph-levelrepresentationsareaglobalpropertyofthegraph,

theyneedtobecomputedinvariantly.Embodyingthespiritofthegeometric

deep learning blueprint, we aggregate all the node andedgerepresentationsto

compute h

=ρ





u∈V

(u,v)∈E





.(5.59)

138Chapter 5

Lastly,thereadermaybecuriousaboutwhethersuchaframework trulyis

maximallypotent—for example,couldit alsosupportstructures suchashyper-

graphs, wheremorethantwo nodesmaybe involved inahyperedge.Wewill

ponder this question from several angles in the following sections.

5.2.3Deep Sets, Transformers, and Graph Rewiring

The"threeflavours"viewofGNNspresentedthusfarlargelyconcernsitself

withthedesignoftheirmessageandupdatefunctions(ψandφ).However,

lookingatthegeneralmessagepassingequation(Equation5.56),therearein

fact two other "moving parts": the choice of aggregator,

, and the choice of

neighbourhoods, N

. In the process of analysing thesechoices, it turns out we

canformtightconnectionsbetweenGNNsandmanyotherinfluentialneural

network architectures. We will begin by analysing the choice of N

Infact, thechoiceofN

isdirectlytiedtotheinputgraphstructureweare

assuming. Thusfar,wepresumed thatN

is given to us, andwe didnotques-

tionits structure.However, suchasituation isinreality quiterare:manygraphs

we observe intherealworld are merelyapproximationsof the "ground-truth"

interactionsbetweenthenodes.Forexample,inasocialnetwork,weonly

observe the friendship links users have explicitly formed, rather than the com-

plete collectionof friendshipinteractionsbetween thosepeople. Andwithina

molecule,thereexistphysical interactionsbetween allpairs ofatoms—not just

onesexplicitlyconnectedbyachemicalbond.Thisexamplebecomesespe-

ciallyprominentwhenweconsiderinteractionswithinmacromoleculessuch

asproteins,thatfoldin3Dspace,allowingatomsthatarequitedistantinthe

molecular graph to interact more directly.

Beyondtheneedtohandlesuchapproximationstoground-truthgraphs,it

is also well-understoodthat certainchoices of graphare simply pathologically

bad–andtheirusewillleadtoissuesregardlessoftheGNNarchitecturewe

use(AlonandYahav2020;DiGiovannietal.2023).Onetypicalinstanceof

thisariseswhenthegraphhasbottlenecks—smallcollectionsofedgeswhich

areresponsibleforcarrying representationsbetween largegroupsofnodes.For

example,inabarbellgraph,therededgeisundersig-

nificantrepresentationalpressuretotransportinformationbetweenthenodes

inthetwocommunities.Ifitisimportantforthosecommunitiestodirectly

exchangefeatures,itispossibletoprovethatinmanycases,GNNswillnot

even be able to fit their training data accurately under such a graph.

TostudyvariousstandardchoicesofN

withoutdiscussingpropertiesof

any given input graphs, we will now briefly return to the domain of unordered

sets.InthelanguageofSection5.1,weassumethatwearegiven amatrixof

Graphs139

Figure 5.5

ThreestrategiesforchoosinganedgesetforaGNN,left-to-right:empty,leadingto

DeepSets(Zaheeretal.2017);complete,leadingtoTransformers(Vaswanietal.2017);

and precomputed, leading to EGP (Deac,Lackenby, and Veli

ckovi

c 2022). Inall cases

we highlight one of the nodes’ neighbourhoods, N

, in red.

node features, X, but without any specified adjacency or ordering information

betweenthenodes.Thespecificarchitectureswillarisebydecidingtowhat

extent to model interactions between the nodes; see Figure 5.5.

5.2.3.1EmptyedgesetUnorderedsetsareprovidedwithoutanyadditional

structureorgeometrywhatsoever—hence,itcouldbearguedthatthemost

natural way toprocess themis to treateach setelement entirelyindependently.

Thistranslatestoapermutationequivariantfunctionoversuchinputs,which

wasalreadyintroducedin Section5.1:asharedtransformation appliedtoevery

nodeinisolation. Assumingthe samenotation aswhendescribingGNNs,such

models can be represented as

=ψ(x

),(5.60)

whereψisalearnabletransformation.Itmaybeobservedthatthisisaspe-

cialcase ofa convolutional GNNwith N

={u}—or, equivalently,A=I. Such

anarchitectureiscommonlyreferredtoasDeepSets,inrecognitionofthe

workofZaheeretal.2017thathavetheoreticallyprovedseveraluniversal-

approximationproperties ofsucharchitectures. Itshouldbe notedthat theneed

toprocessunorderedsetscommonlyarisesincomputervisionandgraphics

when dealingwithpoint clouds;therein,such modelsareknownasPointNets

(Qietal.2017).Philosophically, itmayalsobenotedthatthisarchitectureis

thefurthestwecangowith"pure"sets;anyadditionalinputstoψimplythat

there exist connections between the nodes, making our domain a graph.

5.2.3.2CompleteedgesetWhileassuminganemptyedgesetisavery

efficient constructforbuildingfunctionsover unorderedsets,oftenwewould

140Chapter 5

expectthatelementsofthesetexhibitsomeformofrelationalstructure—

i.e.,thatthereexistsalatentgraphbetweenthenodes.SettingA=Idiscards

anysuchstructure,andmayyieldsuboptimalperformance.Conversely,we

couldassume that,in absenceofanyother priorknowledge,we cannotupfront

excludeanypossiblelinksbetweennodes.Inthisapproachweassumethe

completegraph,A=11

⊤

;equivalently,N

=V.Aswedonotassumeaccess

to any coefficients of interaction, running convolutional-type GNNs over such

a graph would amount to:

=φ

v∈V

ψ(x

)

,(5.61)

where the secondinput,

v∈V

ψ(x

) is identicalfor all nodesu

, and assuch

makes the modelequivalently expressive to ignoringthatinput altogether;i.e.

the A=Icase mentioned above.

This motivates the use of a more expressive GNN flavour, the attentional,

=φ

v∈V

a(x

)ψ(x

)

(5.62)

whichyieldstheself-attentionoperator, thecoreoftheTransformerarchitec-

ture(Vaswanietal.2017).Assumingsomekindofnormalisationoverthe

attentionalcoefficients(e.g. softmax),wecan constrainallthe scalarsa(x

)

tobeintherange[0,1];assuch,wecanthinkofself-attentionasinfer-

ringasoftadjacencymatrix,a

=a(x

),asabyproductofgradient-based

optimisation for some downstream task.

TheaboveperspectivemeansthatwecanposeTransformersexactlyas

attentionalGNNsoveracompletegraph(Joshi2020).

However,thisisin

apparentconflictwithTransformersbeinginitiallyproposedformodelling

sequences—therepresentationsofh

shouldbemindfulofnodeu’sposition

in the sequence, which complete-graphaggregation would ignore. Transform-

ersaddressthisissuebyintroducingpositionalencodings:modulatingeither

thenodefeaturesx

orthecomputedattentioncoefficientsa

inawaythat

dependsontheabsoluteorrelativepositionsofuandvinthesequence.The

originalTransformer architecture,forexample, samplesanadditivefactor for

from a sine wave, whose frequency is dependent on u.

Ongraphs,wherenonaturalorderingofnodesexists,multiplealterna-

tiveswereproposedtosuchpositionalencodings.Oneparticularlypromising

directioninvolvesarealisationthatthepositionalencodingsusedin‘vanilla’

TransformerscanbedirectlyrelatedtothediscreteFouriertransform(DFT),

andhencetotheeigenvectorsofthegraphLaplacianofa"circulargrid".

Hence,suchpositionalencodingsareimplicitlyrepresentingourassumption

Graphs141

thatinputnodesareconnectedinagrid.Formoregeneralgraphstructures,

onemaysimplyusetheLaplacianeigenvectorsofthe(assumed)graph—an

observationexploited byDwivedi and Bresson(2020) within their empirically

powerfulGraphTransformermodel.DesigningTransformersforgraphsisa

veryactiveandimportantareaofresearch;wedirectthereadertoMülleret

al. (2023) for an excellent overview.

Finally,wenotethatwhileourdiscussionmighthavebeensufficientto

describemanyimportantaspectsofself-attention,additionalcarefuldetails

areneededtorelatethemodernlargelanguagemodel(LLM)stacktoour

framework. We detailedly cover this as a case study in Section 5.4.

5.2.3.3Inferred edgesetFinally,onecantrytolearnthelatentrelational

structure,leadingtosomegeneralA thatisneitherI nor11

⊤

.Theproblem

of inferringa latentadjacency matrixAforaGNN touse (oftencalledlatent

graphinference)isofhighinterestforgraphrepresentationlearning.Thisis

duetothefactthatassumingA=Imayberepresentationallyinferior,and

A=11

⊤

maybechallengingtoimplementduetomemoryrequirementsand

largeneighbourhoods toaggregate over.Additionally, it isclosest tothe "true"

problem:inferringanadjacencymatrixA impliesdetectingusefulstructure

betweentherowsofX,whichmaythenhelpformulatehypothesessuchas

causal relations between variables.

Unfortunately,suchaframingnecessarilyinducesastep-upinmodelling

complexity.Specifically,itrequiresproperlybalancingastructurelearning

objective(whichisdiscrete,andhencechallengingforgradient-basedopti-

misation)withanydownstreamtaskthegraphisusedfor.Thismakeslatent

graphinferenceahighly challengingand intricateproblem. Accordingly,many

of thecurrent popularmethodsinthis spaceside-steptheissue oflearnability

byusingfullynonparametricmethods(Wangetal.2019),orpre-computing

thegraphstructuretouse(Deac,Lackenby,andVeli

ckovi

c2022;Shirzad

etal.2023).Thesemethodsarealsopopularwhenmodifyingagiveninput

graph toe.g.relieve bottlenecks—an approachcommonly referred toas (non-

parametric)graph rewiring(Gasteiger,Weißenberger,andGünnemann2019;

Topping et al. 2021).

5.2.4The expressive power of GNNs, and the Weisfeiler-Lehman test

The remainingdesignchoiceofGNNstodiscussisthechoiceofaggregation

function,

.Whilestudyingthechoiceofneighbourhoodsallowedustorelate

manypopulararchitecturestoGNNs,studyingaggregatorswillallowusto

reason about the GNNs’ expressive power, as we will now discover.

142Chapter 5

φ(•,{{•,•}})

φ(•,{{•,•,•}})

φ(•,{{•,•}})

φ(•,{{•,•,•}})

φ(•,{{•,•}})

Figure 5.6

HowtheWeisfeiler-Lehmanalgorithmfeaturisesagivengraphoffivenodesandsix

edges, withcategoricalfeatures comingfrom∆

(i.e., effectivelya featurelessgraph).

At each step, the colours are refined using injective mappings of neighbourhood multi-

sets, until thedistribution ofcolours stops changing –in this case, yieldingthe multiset

{{•,•,•,•,•}}.Thishistogramof colourscanthenbecompared againstanothergraph’s.

ItwouldbeveryusefultounderstandwhethertheGNNflavourswedis-

cussedarecapableofrepresentinganypermutationequivariantfunctionwe

careaboutovergraphs,orifthereexistsaneedtostudyalternativeframe-

works.ToquantifythekindsoffunctionsaGNNcanexpress,wetypically

studythemfromtheperspectiveofdecidinggraphisomorphism:giventwo

non-isomorphicgraphs,G

andG

andaGNNembeddingthemintoalatent

spaceh,doesitholdthath

?Ifnot,anykindofdistinctclassification

of these two graphs is hopeless

Tosimplifyouranalysis,wewillassumethatthegraphshavecategor-

icalnodefeatures,x

∈∆

,thatis,thesefeaturesareone-hotvectorsof

kdimensions.Wecaninterpretthisasacolouringoftheirnodes.Recall

that,inourderivationsofar,weexpressGNNsusingtheabstractformula

=φ(x

),wherethefeaturesoftheneighbourhoodarerepresentedas

v∈N

ψ(x

). Inorderto maximisethecapacity of theGNN, notethath

willretainmaximalinformationaboutthegraph—andhence,havemaximal

expressivecapacity—whenφisinjectiveinitsinputs:thatis,twodifferent

neighbourhood multisets must be mapped to different vectors by φ.

Inthecategorical-featurescase,wecanobtaininjectivitybychoosinga

verysimplemessagefunction:ψ(x

)=x

,andthenpickinganaggrega-

tor

that preserves allmultiset statisticsofX

, especiallyaroundrepeated

elements.Suchanaggregatorissummation(

),asitwillpreserve

informationaboutallcardinalitiesofrepeateditemsinthemultiset.Weaker

Graphs143

aggregators,suchasaveragingormaximisation,mayreducethesecardinali-

ties to their relative ratios (

), or eliminate them altogether (max). K. Xu

etal.(2018)leveragethisinsighttoderivethegraphisomorphismnetwork

(GIN), a maximally-expressive GNN over categorical node features:

=φ

(

1+ε

)

v∈N

(5.63)

whereε∈Risalearnablescalardesignedtogiveinjectiverepresentationtothe

receivernode(u),andφisadeepmultilayerperceptron(whichcanrepresent

aninjectivefunction).TheexpressivepowerofGNNsisthenequivalentto

theWeisfeiler-Lehman graphisomorphismtest(WeisfeilerandLeman1968),

aclassical algorithmingraphtheory whichiterativelyrefines thecoloursofthe

nodes usingan (injective) hashof the neighbours’colours (Morriset al.2019).

An illustration of how this algorithm operates is provided in Figure 5.6.

5.2.5*More expressive GNNs

Whiletheequivalenceofmessagepassing(asdescribedinEquation5.56)

totheWeisfeiler-Lehmantestiscertainlyasatisfyingresult,itleavesmuch

tobedesired.Forexample,itimpliesthatmessagepassingneuralnetworks

provablycannotdistinguisha6-cyclefromtwotriangles;the

colourrefinementprocedurewillneverupdateanycolour,asallneighbour-

hoods "locallylook the same".In some applications, thisis amajor drawback:

forexampleinorganicchemistrywheremoleculesaremodeledasgraphs,

cyclic structures such as aromatic rings are abundant and important.

TherearemultiplewaystoextendtheexpressivepowerofGNNsthatwe

willdiscussnext;notethatnoneofthemwillimplythattheequivalenceof

GNNstotheWLtestdoesn’thold,astheywillmodifyassumptionsconcerning

theinput providedto theGNN. Infact, wemay oftenleveragespecific limiting

examples—like the 6-cycle/triangle one above—to derive meaningful ways in

which such limitations can be overcome in general.

5.2.5.1FeatureaugmentationWestartfromperhapsthemostobvious

reasonwhytheWLtestmayfail–toomanyneighbourhoodsbeinglocally

identical.Thisissuearisesbecausetherearenodistinguishingfeatures inthe

nodes; sometimes,the graphmayeven befeatureless, as wediscussed. Italso

immediatelyhintsatacheapwaytosurpassthislimitation:simplyaugment

the graph with more descriptive features.

Anapproachlikethis"pre-colours"different nodes(oredges)ofthegraph

byattachingsomeadditionalfeaturestothem,giving theWLalgorithm(and

144Chapter 5

hencealsoGNNs!)additionalinformationtoworkwithwhendistinguish-

ingnon-isomorphicgraphs.Thesepre-coloursdonothavetobeparticularly

meaningful— alreadyrandomlyfeaturisingthe nodesisprovably sufficient to

increaseexpressivepower(Sato,Yamada,andKashima2021).Randomfea-

turesallowmessagepassingtoidentifyanodeinitsownk-hopneighbourhood,

empowering GNNs to detect cycles of a certain cardinality.

Naturally,featureaugmentationalsoworksvery wellwhenperformedina

moreinformedmanner(R.Murphyetal.2019);itisevenpossibletocount

substructuresweexpecttocareabout(e.g.aromaticringsinmolecules)and

use these counts for additional featurisation (Bouritsas et al. 2022).

5.2.5.2MessagepassingmodulationWhileintroducingadditionalstruc-

turalfeaturesboostsexpressivepower,theGNNisnotcommittedtousing

suchfeatures.As such,another promisingclass ofmodels injectssuch features

directly into theequationofthemessagefunctionψ; in that way, the model is

forced to take this information into account.

There aremany representative instancesofsuch modulatedmessagemech-

anisms;wecallouttwoearlysuccessfulexamples.Firstly,directionalgraph

networks(Beainietal.2021),whereinLaplacianeigenvectorsareusedto

definecanonical"flows"onagraph,andtheyexplicitlyguidethescaling

factorsofvariousincomingmessages.Secondly,theLSPEmodel(Dwivedi

etal.2021)hasdedicatedmessagepassingoverthepositionalfeaturesit

leverages, once again coercing the model to process them nontrivially.

SuchideashavealsopermeatedintographTransformers,withperhaps

themostprominentexamplebeingtheGraphormer(C.Yingetal.2021).

Graphormers computeall-pairs shortestpaths betweennodes inthe graph,and

use them as an explicit bias for the self-attention operator.

5.2.5.3ComputationgraphdesignWhilepre-computinginteresting

graphpropertiesandthenleveragingthem(eitherasadditionalfeaturesor

throughmodifyingmessagesdirectly)isanattractiveapproach,italsoarguably

requiresadegreeoffeatureengineering.Itmightbemorefavourableto

empower themodeltoextractmeaningfulstructurebyitself—butitprovably

cannot do soover many input graphs,owingto theequivalence tothe WLtest.

This motivates a broad class of methods that modifythe input graph of a GNN

in a way that will allow message passing to better detect certain substructures.

We’vealreadycoveredsomeoftheseapproachesinSection5.2.3.3under

theumbrellaof"graphrewiring";onekeydifferenceisthatapproacheswe

willdiscuss hereoftenintroduceadditionalnodesintothegraph.Suchnodes

(often referredto asvirtual nodes)carry representationsthat promotestructure

Graphs145

Subgraph1

Subgraph2

Subgraph3

1 3

1 2

2 3

101

011

111

110

111

011

111

110

101

Figure 5.7

AsubgraphGNNoperatesoverseveralsampledsubgraphsofthegiveninputgraph

simultaneously.ESAN(Bevilacquaetal.2021)isasubgraphGNNthatoperatesina

way that respects boththe permutation symmetries of the nodes (S

) and the subgraphs

)—hence, it is equivariant to the product group S

×ばつS

discovery;whenappropriatelyplaced,theycanprovablyimprovethequality

of the GNN’s dataflow (Wilson, Bechler-Speicher, and Veli

ckovi

c 2024).

Perhapstheeasiestplacetostarttrackingtheseapproachesisbystudy-

ingextensionstotheWeisfeiler-Lehmantestitself.Itwasindeedextended

inthelate1970s

toahierarchyofincreasinglymorepowerfulhigher-

dimensional k-WLteststhat operateon k-tuples ofnodes. ChristopherMorris

etal.(2019)andHaggaiMaronetal.(2018;2019)showeddifferentdesigns

ofhigher-dimensionalGNNarchitecturesequivalenttok-WLtests;however,

suchmodelslosetheadvantageoflocalityandlinearcomplexityofthe

standardmessage-passingGNN

.Whilesuchapproachestakeintoaccount

featurisingallrelevant k-tuplesindiscriminately,itisalsopossibleto perform

moretargetedmessagepassingoverspecifictopologicalsubstructuresinthe

graph, such as simplicial andcellular complexesof the graph (Bodnar, Frasca,

Wang, et al. 2021; Bodnar, Frasca, Otter, et al. 2021).

While approaches inspired by theabove typically addnodes and edges into

thegraph,itisalsopossibletorecoverimprovedexpressivepowerbyselec-

tivelydeletingnodesoredges—effectively,bymessagepassingover selected

subgraphsoftheinputgraph(Figure5.7).Oneparticularlyinterestingrepre-

sentativesubgraphGNNfromtheperspectiveofequivariancesisESAN;the

equivariantsubgraphaggregationnetwork(Bevilacquaetal.2021).ESAN

operatesovermultisetsofsubgraphs,whosesymmetryarisesfromboththe

structureofeachconstituentsubgraphaswellasthemultisetasthewhole.

146Chapter 5

The principle ofequivariancerequiresthatchanging the order ofm subgraphs

in the multiset (permutation group S

) and the order of the n nodes inthe sub-

graphs(permutationgroupS

)yieldsanequivalent representation,orinother

words,thearchitectureisequivariantw.r.t.theproductgroup

G=S

×ばつS

This example serves asa relevantinstance ofcombining symmetries withinthe

samearchitecture;wewillexploremoreinvolvedinstancesofthiswhenwe

discuss geometric graphsin latter Chapters.

5.2.5.4Synchronisationand"clocks"Thusfar,wehavediscussedsev-

eral ways inwhich themessage passingequation may beedited: itsparametric

functions(ψ,φ),aggregationfunction(

),features(x

),andcomputation

graph(N

).Itmayseemasifwe’vecoveredalltheanglespossiblebydoing

so, but in fact, there remains onehidden but important aspect – thesynchroni-

sation betweenthedifferent stepsof themechanism. Inreality, Equation 5.56

stores embeddings over time:

(t+1)

=φ

(t)

v∈N



(t)



,(5.64)

wheret∈Nidentifiesthecurrenttime-stepoftheGNN,andx

(t)

correspond

totheGNNinputembeddingsavailableattimet—notethatwemaysimply

equate x

(t+1)

in most cases, though they may also differ.

Inalmostallcases,Equation5.64isexecutedsynchronously:allh

(t+1)

embeddings arecomputed simultaneously, in parallel– inessence,there isno

distinctionbetweenthenotionoftime-stepand thenotionofGNNlayer.But

whatifwerelaxedthisassumption,andallowedspecificsubsetsofmessages

to be computed or aggregated at one step, rather than all-at-once?

Itturnsoutthatthismodification,evenwhenonlyonemessageatatime

isprocessed(seeFigure5.8),canyieldimprovementstoexpressivepower.

Thiseffect wasfirststudiedandprovedbyFaberandWattenhofer(2024)for

theirGwACasynchronousarchitecture.Theargumentisasfollows:clearly,

anasynchronousGNNcanmatchtheexecutionofasynchronousone—we

justneedtouseanexecutionorderthatmatchestherelevantsynchronisation

points.Assuch,themodelisatleast1-WLexpressive.Further,forspecific

execution orders, it can be shown that it is possible to exactly trace the outline

ofparticularcyclestructuresinthegraph,oreven exactlyidentifyeachnode

(aswe didwithrandomfeaturespreviously),allowingtodiscernstructures that

1-WL cannot.

GwAC does nottake avery strongstance on theexecution order in itsasyn-

chronoustrace,butseveralotherarchitectureshavesinceemergedthat do.One

Graphs147

=ψ





=ψ





=φ





=ψ





=φ





=ψ





=φ





=φ





Figure 5.8

Anexecutiontraceofafully-asynchronousGNN:ateachtime-step(denotedby

advancingclock),exactlyonemessageiseithercomputedoraggregated.Notethat

the snapshot of the node features dynamically changes whenever φis invoked.

such architectureis theCo-GNN (Finkelshtein etal. 2024), whereineach node

can select whether to send or receive messages at a particular GNN layer.

While powerful, asynchronous GNNs do not neatly align with the currently

prevalentmassively-parallelapproachestomachinelearningmodels.There-

fore,amiddle-groundapproachthatcanexploitsuchparallelism,whilestill

havingcleartraitsofasynchronouscomputation,wouldbeveryattractive.

Dudziketal.(2024)haveshownthatthisisindeedpossibletobuildsuch

amodel,throughtheconceptofasynchronyinvariance:keepingthemodel

fully synchronous, but constrainingitto give identicalembeddingseven if we

madeitsexecutiontraceasynchronousincertainways.Thisapproachneatly

resonateswiththeideasofsymmetrythatweexploredthusfar;however,it

requiresworkingwithgeneralisationsofgroups(monoids),andassuchwe

will only study it significantly later in this book—namely, in its final Chapter.

5.2.5.5ContinuousnodefeaturesThereisnodoubtthatthescaffold

offeredbythek-WLhierarchyhashadoutsizedimpacttothedevelopment

ofgraphrepresentationlearning.Aswe’vejustseen,thisalgorithmfamily,

andtheobservationthatGNNsnaturallyaligntoit,hasinformedcountless

contributionsinfeatureengineering,architecturedesign,graphrewiring,and

148Chapter 5

evenasynchronousmodels.Itsflexibilitydoesnotcomewithoutlimitations,

however—andperhapsthemostglaringoftheseisitsassumptionondis-

cretefeaturesinthenodes.Formanyscientificandindustrialtasksarising

from real-world data, features we have available will often be real-valued. We

willonlyfocusonthisparticularextension,andhowitleadstotranscending

theWLhierarchy.Thatsaid,weremarkseveralotherapproachesthatarenot

well-aligned with k-WL emerged in recent times; e.g., Tönshoff et al. (2021).

While K.Xuetal.(2018) leftthestudyofaligningGNNswithreal-valued

features to decidinggraph isomorphism tofuture work, it wasalreadyevident

fromtheworkofZaheeretal.(2017)thatextendingsuchresultseveninthe

domain ofsets was goingto bedifficult. Theessence ofdifficultiesgoverning

thistransitionwerefinallyexposedandsettledbyCorsoetal.(2020).The

keyissue arises in thelargerflexibilityof theobjects that needto be injectively

aggregated:whileintheWLsettingwedealtwithaggregatingmultisetsofone-

hot vectors—and hence all we had to track for injectivity was the preservation

oftheircardinalities—withreal-valuedfeatures,non-isomorphicgraphscan

have clashing embeddings from unexpected directions.

Toillustratethis,letusfirstrecallthat,accordingto1-WL,aggregator

choicesfor

areordered byexpressivity asmax<

,with

being

maximally expressive. Now, assume thatour inbound messagesare in Rrather

than∆

.Itisrelativelystraightforward toplayaroundwithexamplesofreal-

valued, non-isomorphic multisetswhere notonly thishierarchy isbroken, but

some may not be injectively distinguished by any of these three aggregators:

•

{{4,1,1}} vs. {{3,3,0}}: distinguished by max but not by mean or sum.

•

{{4,2,2,0}} vs. {{4,4,0,0}}: cannot be distinguished by either max, mean or

sum—however, they can be distinguished by standard deviation.

Following these leads, with a littlehelp from standard facts inalgebraic topol-

ogy

,it ispossibletoshowthatnosingleaggregatorchoice for

issufficient

to injectively embedreal-valued neighbourhoods. Infact, itcanbe shown that

thenumberofaggregators,N,neededforinjectivityisequaltothemaximal

neighbourhood size, max

u∈V

|, in ourdataset. While it isnot yet clearwhat

property thesecollectionsof aggregators need tosatisfyto beinjective, Corso

etal.(2020)showedthatthefirstNmomentsoftheneighbourhoodmultiset

(whose elements are modelled usinga random variable X), defined as follows:

=E(X)M

E[(X–E(X))

](n>1),(5.65)

Graphs149

wouldserveasasufficientcollectionofaggregators.Theyleveragedthis

insighttoproposetheprincipalneighbourhoodaggregation(PNA)archi-

tecture,whichretainsonlythenumericallystablemoments(n∈{1,2,∞},

amounting to mean, standard deviation, min and max).

Concurrentlywiththesedevelopments,Wagstaffetal.(2019)successfully

settledtheextensiontoreal-valuedfeaturesforpureunorderedsets,success-

fullyextendingresultsfromZaheeretal.(2017).Remarkably,theyfinda

very compatible result—proving that it is possible to injectively embed sets of

real-valuedfeatureswithasingleaggregator,solongasthenumberoflatent

features islarger thanor equalto themaximalsetsizeinthedataset.Thefact

that bothCorso etal. (2020)and Wagstaffet al.(2019) relyon themaximal set

size hints at a deep connection between these results

PNAoffersimprovedexpressivityofGNNsthrough theuseofmultiplefixed

aggregators.Whatif,instead,wehadaparametricaggregator?Severalpro-

posals attemptto introducescalarparameters inan otherwiserigidaggregator

backbone(Pellegriniet al.2020). Itis alsopossible toreplace aggregators with

a scaffold relying on fully-generic neuralnetworks (suchas MLPs) serving as

the

operator;however,caremustbetakenthatsuchnetworkssupporta

commutativemonoidstructure(OngandVeli

ckovi

c2022).Asmonoidsare

structuresthat generalisegroups andhence outoffocus forthe GDLblueprint,

we will defer this particular example to the latter Chapters of the book.

5.2.6What lies beyond message passing?

Throughout theprevious five subsections,we made clearinroads towards sur-

passingthe1-WLlimitationonexpressivityforGNNsasspecifiedbyEquation

5.56. This raises clear questionsas towhethermessage passingissufficient to

describe all possible permutation-equivariant layers we might come across.

Forsomeapproacheswediscussed,itiscleartheequationstillapplies:

featureaugmentationdoesnotchangetheoperationsofthemodel,message

passingmodulationconditionsfunctionslikeψ(x

)toψ

G,(u,v)

)—

makingthem dependonthegraph structurearoundnodes uandv,and methods

like PNA amount to representing

using a combination of aggregators.

Forotherapproaches,thesituationislessclear,andthejuryisstillvery

much outconcerning theuniversality ofmessage passing; noteven theauthors

ofthisbookareinagreement

.Tostimulateausefuldiscussion,wewill

presentaclaimthatmessagepassingisuniversaluptoachangeofcomputation

graph;the argument was firstwritten up by Veli

ckovi

c (2022)and popularised

by Francesco Di Giovanni in his NeurIPS’22 Workshop talk.

Theclaimis asfollows(Figure 5.9):let GNN

(G,X) beanyneuralnetwork

withlearnable parametersθ, appliedover agraph Gwith(node/edge/graph...)

150Chapter 5

GNN

MPNN

Figure 5.9

SchematicoftheconjectureformalisedbyFrancescoDiGiovanniathisNeurIPS’22

GraphLearning FrontiersWorkshoptalk,illustrated onthe caseof asimplicial complex

GNN (Bodnar, Frasca, Wang, et al. 2021). Such models perform message passing over

substructures – spanningmore than twovertices, andhence not conforming toa binary

messagefunctionsuchasψ(x

).However, itispossibletoequivalentlyre-express

such computationby firstrewiring thegraph—in thiscase, bycreating new nodesthat

willrepresenteachsimplicialcomplex,appropriatelywiringthemtotheirconstituent

nodes—and then performing the usualmessage passing equation. It is conjectured that

every (graph) neural network may be expressed in this manner.

features X. Then, it is conjectured that, for some Rand θ

′

GNN

(G,X)=MPNN

′

(R(G),R(X))(5.66)

whereMPNN

′

)isamessagepassingGNNasspecifiedinEquation

5.56,withparametersspecifiedbyθ

′

,neighbourhoodsspecifiedbythegraph

′

andinputfeaturesspecifiedbyX

′

.Therewiringfunction,R:G×ばつR

→

G×ばつR

out

,modulates theinput graphin anappropriate way. Notethat weallow

Rto act onthe node features aswell, as e.g. theoperation may introduce new

nodes into the graph, which then need to be appropriately initialised.

If the conjecturein Equation 5.66holds, then themessage passing primitive

wouldbe universal—perhapsforall ofdeeplearningoverdiscretestructures

Forthespecificcomputationgraphmodulationswepresentedearlier,wecan

indeed propose how to construct appropriate rewiring operations:

•

Formessagepassingoversubstructures(suchask-tuples),wecaninvent

newnodeswhichrepresentthesesubstructures,andthenconnectthemtoall

Graphs151

oftheirconstituentnodes(inamannerthatdependsonwhetherpermutation

invariance holds across the substructure).

•

ForsubgraphGNNs,wecanduplicateeachnodeoncepersubgraph,and

connect the various copies of a node together using edges of a separate type.

•

ForasynchronousGNNs,wecanartificiallyslowdownmessagesby

introducing slow nodes in-between specific edges (Azabou et al. 2023).

Whiletheseexamplesofferreasonablemotivation,interestingedgecases

withunclearsemanticsarestillample;forexample,messagefunctionsmay

becomputedasaresultofoptimisinganembeddingforcertainproperties

(Bartunov,Fuchs,andLillicrap2022),oreventheentireoutputofcertain

continuous-layerGNNsmayberecoveredbyanalyticallysolvingforsome

constraints(Xhonneux,Qu,andTang2020).Further,therearemanypropos-

als formachinelearning models over domains exhibiting richerstructure than

graphs, such assheaves (Bodnar et al. 2022),bundles (Bamberger et al.2024)

andgeneraltopologicalspaces(Papamarkou etal.2024).Anyattemptatfor-

mally demonstratingwhethermessage passingcanbeauniversal operatorfor

deeplearningwillneedtotakesuchworksintoaccount,butwebelievesuch

efforts (in either direction) will be very much worth it!

5.3Case Study: Recommender Systems And Social Networks

Thefirstpopularised large-scale industrialapplicationsofgraphrepresentation

learning have occurred within social networks, primarily in thecontext of rec-

ommender systems. Recommenders are tasked with decidingwhichcontent to

servetousers,potentiallydependingontheirprevioushistoryofinteractions

ontheservice.ThisistypicallyrealisedbyfirstusingaGNNtoembedthe

propertiesofeachitemofcontent(andpotentiallyindividualusers),u,intoa

vector,h

,leveragingagraphcomprisingknownuser-content(e.g.aproduct

purchase)orcontent-content(e.g.productsco-purchased together)interactions

to perform message passing over.

Notethat,inaproductionenvironment,suchgraphsarefrequentlyhetero-

geneous: we might have multiple types of nodes (e.g. user, product...) as well

asmultipletypesofedges(e.g.purchase,impression,co-purchase...)—and

thisinformationshouldconditionthemessagepassingoperation.Thereexist

severalwaystodothisinGNNs:eithersimplypassingnode/edgetypeasa

categoricalfeature,ormaintainingseparatemessagefunctionparametersfor

each edge type, or anything in between (C. Zhang et al. 2019).

Oncecomputed,theseembeddingsareoptimisedthroughalinkprediction

objective (Figure5.10): supervisethe embeddingsofvarious nodes (piecesof

content) such thatthey are kept closetogetherif they are deemed related (e.g.

152Chapter 5

u∈V

(u,v)∈E

Link

Predictor

(e.g.h

⊤

)

∈B

Figure 5.10

Illustrationofatypicalrecommendersystemsettingwithgraphneuralnetworks.In

suchsettings,nodesu∈Vmaycorrespondtousers,contentorproducts,andedges

(u,v)∈Emay correspondto purchases,impressions, co-purchases, orsimilar. A GNN

first embeds all nodesinto a shared latentspace (h

for node u∈V),after which a link

predictionobjectiveisoptimised;forexample,wemightoptimisetheinnerproduct

⊤

topredictwhetheruserawouldpurchaseproductb(y

),givenobservedpur-

chase logs. This prediction can then be used to drive recommendations downstream.

commonlyviewedtogether).Thentheproximityoftwonodes’embeddings

(e.g. theirinnerproduct,h

⊤

) canbeinterpretedas theprobabilitythatthey

arelinkedbyanedgeinthecontentgraph.Then,foranycontentqueriedby

users,one approachcouldserveits nearestneighboursin theembeddingspace.

Importantly,atthescalesfacedbymanydeployedrecommendersystems,

naïvely deploying amessagepassingmodelisfrequentlyprohibitively expen-

sive.Thecostsmaycomeeitherfromavailablememory–keepingentire

graphsofmillionsorevenbillionsofentitiesinmemorymaybeimpossible

– or from required latency – the model’s forward pass may simply be tooslow

foraseamless userexperience.Akey earlyexampleof aGNN mindful ofsuch

issuesisGraphSAGE(Hamilton,Ying,andLeskovec2017),whichdoesnot

feedtheentiregraphintotheGNNatonce—instead,itprocessesabatchof

nodesatonce,anddynamicallysubsampleseachofthosenodes’neighbour-

hoods

toruntheGNNover(Figure 5.11).Beyondthe subsamplingapproach,

twootherpopularmechanismsforbuildingscalableGNNsincludepartition-

ing the graph across multiple training steps(Chiang et al. 2019) or simplifying

Graphs153

Figure 5.11

GraphSAGE(Hamilton,Ying,andLeskovec2017)processesabatchofthreenodes,

aggregatingtheirK-hopneighbourhoods(here,K =2).Foreachofthethreenodes,it

randomlysamples,withreplacement,acertainnumberofk-orderneighboursofeach

node(for1≤k≤K;here,threefork=1,andtwofork=2),toformthefinalsub-

sampledgraph.This graphisthenusedtosteertheGNN’s messagepassing.Onlythe

messages necessary tocompute the centralnode’s featurevectors, h

, are computed,in

ordertooptimisememoryusage.Notethat,whiledrawnhereasalldistinct,thesampled

nodes may be duplicated or overlapping, within as well as across the sampled patches.

themessagepassingequationbyleveragingconceptsfromdiffusion(F.Wu

et al. 2019; Rossi, Frasca, et al. 2020) or PageRank (Bojchevski et al. 2020).

Amongthepioneers ofthis methodologyis theAmericanimagesharingand

socialmediacompanyPinterest:besidespresentingoneofthefirstsuccess-

fuldeploymentsofGNNsinproduction, theirmethod, PinSage

,successfully

madegraph representationlearningscalableto graphsofmillionsof nodesand

billions of edges(R. Ying et al. 2018). Relatedapplications, particularly inthe

space of productrecommendations,soonfollowed. PopularGNN-backed rec-

ommendersthathadbeendeployedinproductionincludeAlibaba’sAligraph

(Zhuetal.2019),Amazon’sP-Companion(Haoetal.2020),Spotify’s2T-

HGNN (DeNadaiet al.2024),LinkedIn’s LiGNN(Borisyuketal.2024) and

Snapchat’sGiGL(Zhaoetal.2025).GNNshadalsobeendeployedindustrially

beyondthecontext ofrecommenders—forexample,theypowertheexpected

timeofarrival(ETA)predictionswithinGoogleMaps(Derrow-Pinionet

al. 2021).All of thesedeployments takentogether

, itis not far-fetchedto say

thatmillionsofpeoplearecomingintocontactwithgraphmachinelearning

systems on a daily level.

Within the context of content analysis on social networks, another notewor-

thyeffortisFabulaAI,whichisamongthefirstGNN-basedstartupstobe

acquired(in2019,byTwitter).Thestartup,foundedbyoneoftheauthorsof

thetextandhisteam,developednoveltechnologyfordetectingmisinformation

154Chapter 5

house

FFN

house

∈R

FFN(x)=W

σ(W

x+b

)+b

∈R

∥

w∈{A,house,of}

softmax((Qh

)

⊤

)Vh

cards

house

Figure 5.12

An illustrationoftrainingalarge languagemodel (LLM)onthenext-tokenprediction

task for the input text "Ahouseofcards". Themodern LLM stack interchanges

twotypesoflayer:afeedforward network(FFN)appliedtoeachtokenrepresentation

inisolation,andaself-attentionlayer,allowingtokenstoexchangetheirembeddings

usinganattentionalGNN. Notethatit isoften necessaryto restrictthealloweddataflow

so thatatoken mayonlyinteract withthetokens precedingthem;this protectsagainst

leaking privilegedinformation when alltoken representationsin thetext aresupervised

at once.Accordingly, modern LLMsaregraph neural networksoperating over graphs

of tokens. See also Figure 5.13 for a zoomed-in view of the individual operations.

on socialnetworks (Montiet al.2019). Fabula’ssolution consistsof modelling

the spreadof aparticularnewsitem bythenetworkof userswho sharedit.The

usersareconnectedifoneofthemre-sharedtheinformationfromtheother,

but also if they follow each other on the social network. This graph is then fed

intoa graphneural network,which classifiestheentire graphas either‘true’ or

‘fake’content –with labelsbased onagreementbetweenfact-checkingbodies.

Besides demonstrating strong predictivepowerwhich stabilises quickly(often

withinafew hoursofthenews spreading),analysingtheembeddingsofindi-

vidualuser nodesrevealedclear clusteringof userswho tendto shareincorrect

information, exemplifying the well-known ‘echo chamber’ effect.

5.4Case Study: Large Language Models

AtthetimeofwritingthisChapter—andsurelyforasignicantamountof

timeafter—themostprevalentartificialintelligencesystemisthelargelan-

guagemodel (LLM;Figure 5.12).These systemshave outstandingcapabilities

inlanguage,demonstratingcoherentandusefulanswerstovariousprompts

ofinteresttousers;thiscapabilityisreflectedbothintheirwidespreadusage

(throughsystemssuchasGemini,ClaudeandChatGPT)andthemattracting

significant scientific interest.Naturally, it would be highlyusefulto be able to

describe and reason about LLMs within the framework proposed in this book.

Graphs155

As itturns out,we alreadycharted allofthe necessarycomponents toderive

the geometricdeep learning propertiesof LLMs. Infact, weeven hintedat this

in Section 5.2.3.2: attentional GNNs over complete graphs, i.e.

=φ

v∈V

a(x

)ψ(x

)

,(5.67)

mayalreadybesufficienttodescribetheTransformerarchitecture(Vaswani

et al.2017), which formsthe keybackbone ofcontemporary languagemodels.

In this setting, the nodesofthegraphcorrespondtotokens (fundamental units

oftext—theymaybewordsbutalso,e.g.,wordfragments),andtheirinput

features, x

, are one-hot indicators specifying which token is at position u.

Therearetwokeymissingpieces:detailinghowdot-productself-attention

conformstothisequation,andappropriatelytakingcareofcausalmasking:

both of which we detail next.

5.4.1Dot-product self attention

Thedot-productself-attentionoperator

,asusedby Transformers, isdefined

as follows (for general neighbourhoodN

v∈N

(5.68)

where v

=Vx

(the value vector of v) and α

is the softmax-normalised dot-

product

between q

=Qx

and k

=Kx

⊤

expe

v∈N

expe

(5.69)

NotethatQ,K andVaretheonlyparametriccomponentsofthisoperator.

Fromthisanalysis, wecanaligntheequationto attentionalGNNs aspresented

inEquation5.67–set

,letψ(x

)=Vx

andleta(x

)=α

,where

the softmax-normalisationis implied. Transformerstypically interchange self-

attentionlayerswithfeedforwardlayers,whicharemultilayerperceptrons

applied, in parallel,to each token’s embeddingseparately—much likein Deep

Sets

. Such a multilayer perceptron takes the role of φ, e.g.:

φ(x

)=W

(

)+b

)

(5.70)

whereW

arethelearnableweightandbiasparametersofeachlayer,σ

isanactivationfunction,andtheinputtokenfeaturesareresiduallyaddedto

the attendedvector z

. ThisMLP might alsofeature normalisation operations,

such as layer normalisation (Ba, Kiros, and Hinton 2016).

While thisderivation(Figure 5.13)already aligns Transformers tothe atten-

tional GNN flavour, we find it an instructive exercise to demonstrate how they

156Chapter 5

−∞−∞

−∞

000

softmax

Norm

FFN

Norm

FFN

Norm

FFN

Figure 5.13

ZoominginontheschematicfromFigure5.12,thisFigurepresentstheindividual

operations that commonlyfeature in themodern Transformer block.The dataflow also

emphasiseshowsuchmodelscanbeseenasgraphneuralnetworks:theadjacency

matrix isapplied inthe formof anattention mask(settingnon-edge logitsto –∞); the

valuevectorsv

maybeseenasmessages,andtheattentionalcoefficientsα

deter-

minetheinteraction strengthswhenaggregatingthese messages.Thelatterpartofthe

block (normalisationand feedforwardnetwork)corresponds tothe GNN’s update func-

tion, φ. Note, the figure alsodepicts the frequently used rotary position embedding (Su

et al. 2024) as a rotation matrix R(u,v) applied to the dot product e

⊤

R(u,v)k

can be expressed as amessage passing GNN. Thisderivation will work for all

softmax-normalised attentionalGNNs, even ifnot usingdot-product attention.

To help us in this endeavour, let’s write Equation 5.68 as follows:

v∈N

(expe

v∈N

expe

(5.71)

The key insight is that both thenumerator and the denominator in thisfraction

canbecomputedwithsimplesum-aggregatedlocalmessages.Thisallows us

toexpressourmessage asacombinationoftwomessages,(n,d),wheren∈R

is thenumerator, andd∈Rthe denominator.Both nand d canbe computedas

functions of x

and x

only, and therefore conform to the required form.

Nowit onlyremains todefine ouraggregationfunction,

.Note thatin this

case, we need to define how our combined messages are aggregated

; i.e.,

(n,d)⊕(n

′



n+n

′

,d+d

′



(5.72)

Graphs157

Once aggregated, it remains to divide the numerator by the denominator—this

can beperformed withinthe definition ofφ. Hence,Transformers are indeed

(attentional andmessage-passing) graph neural networks

(Joshi 2020).

Whyattentional?It isuseful topause fora momentand ponderwhy wasthe

attentional GNNflavour soimpactfulin sucharich naturallanguage domain,

asopposedtomoregenericmessagepassingarchitectures.Whilewecannot

definitively settle this question here, we do offer a few conversation-starters.

Ononehand,wordsaremeaninglesswithoutcontext:asourinputnode

featuresaresimplyone-hotindicesintoatoken vocabulary,itisunlikelythat

complexmessagesarenecessary

tomakesenseofthem.Thesuccessofatten-

tioninthisspacereinforcestheideathatonecaneasilyreasonaboutword

representations by leveraging weighted sums of other words’ representations.

Onanotherhand,wemustacknowledgethemassivescaleatwhichTrans-

formersaredeployedtoday,withmanylargelanguagemodelssporting

hundredsofbillionsofparameters.Thisisanothersettingthatpresentsan

advantageforattentionalGNNs:especiallywhenusinglinearordot-product

attentionwithoutgivenedgefeatures,novectors(onlyscalars)needtomate-

rialise on theedgesofthegraph, saving significant memory

. In contrast,the

mostgeneralmessagepassingGNNsrequireexplicitlycomputingmessage

vectors and storing them in each edge.

Finally,thespecificcomputationsofTransformers—whichconsistof,e.g.,

many carefullyplacedfullmatrixproducts—alignvery wellwiththemodern

AI hardwarestack(e.g.GPUsandTPUs)andhence,Transformers are GNNs

that are currently winning the hardware lottery (Hooker 2021)!

5.4.2Causal masking

Itisprecisely theaforementioned scalabilityoflanguagemodeltrainingwhich

yieldsthesecondkeyconstrainttoconsider.TheparametersofanLLMare

typicallypre-trainedusinganext-tokenpredictionobjective:givenaspecific

sequence oftokens, predictthe next onein thesequence. This objectiveis then

deployed over tokenised texts scraped across the entire Internet.

Given theincrediblescales involved,itis critical thatthis training proceeds

inamanner whichis asparallelisable aspossible. Becauseof this,the modelis

typically trainedoverentire chunks ofcontiguous textat once,and supervision

is applied on all of the tokens simultaneously.

Thisdecisionenablesfurtherscalabilitybutlimitsthealloweddataflow

betweentokens;namely,itnecessitatesthatinformationflowsinafeedfor-

wardmannerthroughout themodel. Shouldany informationflowfromfuture

tokens towardspast oneswhile also supervisingon alltokens atonce, it would

158Chapter 5

(1)

(2)

(3)

(1)

(2)

(3)

Figure 5.14

Two typical harmful effects on information propagation in decoder-only Transformers:

overmixing(left) andoversquashing(right). Overmixingleadstoimportant information

(depicted in red) being diffusedtoo quickly acrossthe token embeddings. Oversquash-

ingleadstooverly relyingonearliertokens,astheyhavesubstantiallymorepathsto

the final tokenembedding (depicted inblue for anearly token vs. redfor a latetoken).

Both issues occur when attentional coefficients, α

, are insufficiently sharp.

allowthe model to"cheat" on thenext-tokenprediction objective(information

about the next token would leak into the token tasked with predicting it).

Enforcing thisconstraintin an LLMamounts to maskingthe allowed token

adjacencytoalwaysbealower-triangularmatrix—oftenalsoreferredtoas

acausalmask.Notethatthisisaformofnonparametricgraphrewiring:it

sets neighbourhoodsN

={v |v≤u}. In Figure5.13, this isexemplifiedby the

relevant rewiredadjacencymatrixbeingpassedastheattentionmask(setting

logits in the u-th row that are not in N

to –∞).

Taking allof these constraintstogether, we have now fully specifiedtheso-

called decoder-only Transformer

, which is the essence within the generative

pre-trained Transformer (GPT) (Radford et al. 2018) framework.

5.4.3Analysing LLM graph dataflow

Nowweknowthatdecoder-onlyTransformers(theprevalentLLMarchitecture

at the timeof writing) are actuallysecretly (?) graph machine learning models

underthehood.Anaturalquestionarises:canweleveragethisknowledgeto

make sense of how data propagates within LLMs – and potentiallyeven make

meaningful improvements by tinkering with their graph topology?

Wecananswerthisquestionintheaffirmative—infact, thespecificchoice

ofgraphstructureimpliedbyacausalmaskturnsouttobeharmfulfor

dataflow,fromtheperspectiveof(m)anymetricsofgraph"health"knownto

graph theory. Wewill providetwo examples ofthese issuesfor illustrativepur-

poses(seealsoFigure5.14);whileweremaincognisantoftherapidpaceof

research in this space, we hope they serve as useful inspiration.

Firstly,becauseofthefully-connectedstructureofthiscausalgraph,mes-

sages are always passedfullyacrossall allowed pairs of tokens. Thiscanlead

toachallengingphenomenonknownasovermixing(Barberoetal.2025),

Graphs159

whereinthemodellosesthefidelityofinformationaboutindividualtokens,

as they rapidly mix with other tokens’ embeddings across different layers.

Acomplementaryissuearisesduetothecausalityofthecomputational

graph,coupledwiththefactthatonlythelasttoken’sembeddingisusedto

predictthenextone.Thecausalgraphnecessarilyinducesaformoftoken

asymmetry,withearliertokenshavingsignificantlymorepathstothelast

token’sembeddingthanlaterones.Thisleadstotheover-squashingproblem

(Barberoetal.2024):asinformationisasymmetricallycompressedintothe

final token’s embedding (after L layers), h

(L)

, early tokens have a significantly

higher influence over it (in termsof the Jacobian, ∂h

(L)

/∂x

, for input token at

positionv).Thisasymmetryissopronouncedthatitispossibletoshowthat, in

thelimitwhere thenumber ofdecoder-onlyTransformerlayers goesto infinity

(L→∞), only the first token influences the final predictions.

Whilebothoftheaboveissuespresentimportantimpedimentstothe

model’sdataflow,theymayin principlebefixedbyappropriately chosenatten-

tional coefficients, α

. For example, thecoefficients maybe selected suchthat

theypreserve themost vulnerablepathsfor longer,or delaymixingby concen-

trating attentionin specificplaces.In bothof thesecases,a desirableproperty

is sharpness—beingable tochoose disproportionately highervaluesof α

for

asmallersubsetoftokens.Whilesuchsharpcircuitsmayindeedbediscov-

ered withintrained LLMs(Elhageet al.2021), theyonly tendto appearin data

distributions"familiar"tothemodelaftertraining.Itcan,infact,beproved

usingstandardresultsinrealanalysis(Veli

ckovi

cetal.2025)thatthecoeffi-

cientsα

mustalldispersebelowany chosenpositive εforsufficientlylarge

input positionuat inference time(when going beyond themaximal input size

encountered during training).

Manyoftheseissuesarewellfamiliartoresearchersingraphrepresenta-

tion learning; overmixing is frequently referred toas oversmoothing

(Rusch,

Bronstein,andMishra2023),whereasoversquashingisknownbythesame

name(AlonandYahav2020).Whilethiscertainlyinvitesaplethoraofinter-

estingmethodsfromthefield,itisalsoimportanttoacknowledgetheunique

challengesthatcomefromthefeedforwardstructureofthisparticulargraph,

which are seldomstudied. Initial stepstowards bridging this gap have recently

been made (Vitvitskyi et al. 2025), but a lot more work is needed.