5
Graphs
Inmultiplebranchesofscience,fromsociologytoparticlephysics,
graphs are a fundamental model of systems of relations and interactions.
Graphsgiverisetoabasictypeofsymmetrymodelledbythegroupof
permutations. This is why we start our investigation by studying them.
Mostotherobjectsofinteresttous,suchasgridsandsets,canbe
obtainedasaparticularcaseofgraphs,andmanyarchitectureswewill
discussherewillbeexpressibleinthelanguageofgraphneuralnetworks.
Thisalsoincludesthemodernlargelanguagemodel (LLM)stack,which
we present as a case study in this Chapter.
5.1Domain
Inmanyways,graphsarethemainmodalityofdatawereceivefromnature.
Indeed,atvirtuallyalllevelsoforganisationofbiologicalandphysical
systems—frommoleculestotheconnectomicstructureoftheneuronsinthe
brain—graphsofferanaturalabstraction;namely,thatofaninterconnected
networkofnodes.Similarconclusionscanbedrawnabouthuman-designed
systems,suchastransportationnetworksorsocialnetworks.Infact,itmay
evenbearguedthatthedomainswheredeeplearninghastraditionallyseen
mostsuccess—suchasimages,textorspeech—canallbeintheveryleast
seen as special cases of graph structures, but also very often just grid-like pro-
jectionsofafarmoreirregularly
1
structuredphenomenon.Furthermore,the
veryorganisationofcognitionseemstobeirregularlystructured,withcon-
cepts freelyrelated andrecombined toform new knowledge. Allof thistaken
into account, it is likely not very controversial to claim that a truly artificially-
intelligentsystemwillnecessarilyneedtoknowhowtoprocessandreason
about graph-structureddata. Conveniently,this frameworkalso fitsverynicely
inourGDLblueprint,andcanbeusedasabasisforstudyingmanyother
geometric domains. Now, it is our turn to study graph representation learning.
130Chapter 5
uc
d
v
a
b
Thisisagraph.G=(V,E)Thisisitstensorialrepresentation.
XR
|V×ばつk
AR
|V×ばつ|V|
x
a
x
b
x
c
x
d
x
u
x
v
X=
100101
010001
001110
101101
001011
110111
A=
Figure 5.1
AnundirectedgraphG=(V,E)containingthesetofnodesV={u,v,a,b,c,d}and
edgesE={(a,d),(a,v),(b,v),(c,d),(c,u),(d,v),(u,v)}(symmetricedgesomittedfor
brevity).Thisgraphisalsorepresentedusingtensors(nodefeaturesX,adjacencymatrix
A) for a node permutation g=[a,b,c,d,u,v].
Agraph G=(V,E)isacollectionofnodes
2
VandedgesE⊆V×ばつVbetween
pairsofnodes.Forthepurposeofthefollowingdiscussion,wewillfurther
assumethe nodesto beendowedwith s-dimensionalnode features,denotedby
x
u
for all u∈V; see Figure 5.1 for an illustration. Social networks are perhaps
amongthemostcommonlystudiedexamplesofgraphs,wherenodesrepre-
sentusers,edgescorrespondtofriendshiprelationsbetweenthem,andnode
features modeluserproperties suchasage,profilepicture,etc.Itisalsooften
possibletoendowtheedges,orentiregraphs,withfeatures;butasthisdoes
not alter our main findings, we will defer discussing it to Section 5.2.2.
ThekeystructuralpropertyofgraphsisthatthenodesinVareusuallynot
assumedtobeprovidedinanyparticularorder,andthusanyoperationsper-
formedongraphsshouldnotdependontheorderingofnodes.Thedesirable
propertythatfunctionsactingongraphsshouldsatisfyisthuspermutation
invariance,anditimpliesthatforanytwoisomorphicgraphs—thelikesof
whichwesawinFigure4.4—theoutcomesofthesefunctionsareidentical.
Wecansee thisasa particularsettingof ourblueprint,wherethe domainΩ=G
and thespaceX(G ,R
d
) isthatof d-dimensionalnode-wise signals.Thesym-
metryweconsiderisgivenbythepermutationgroupG=S
n
,whoseelements
are all the possible orderings of the set of node indices {1,...,n}.
5.1.1Sets
Letusfirstillustratetheconceptofpermutationinvarianceonsets,aspecial
caseofgraphswithout edges(i.e., E=). Bystacking thenodefeaturesasrows
ofthen×ばつd matrix X=(x
1
,...,x
n
)
,wedo effectivelyspecify anorderingof
Graphs131
thenodes.TheactionofthepermutationgS
n
onthesetofnodesamounts
tothereorderingoftherows ofX,which canberepresentedasann×ばつnper-
mutation matrix ρ(g)=P,
3
where each row andcolumn contains exactly one 1
and all the other entries are zeros.
Afunctionfoperatingonthissetisthensaidtobepermutationinvariant
if, for any such permutation matrix P, it holds that f(PX)=f(X). One simple
such function is
f(X)=φ
X
u∈V
ψ
(
x
u
)
!
,(5.46)
where thefunction ψis independently appliedto every node’s features, andφ
is appliedon its sum-aggregated outputs: assum isindependent ofthe order in
whichitsinputsareprovided,suchafunctionisinvariantwithrespecttothe
permutation ofthe node set,and is henceguaranteed to always returnthe same
output,no matterhowthe nodesare permuted.Suchalayer isverypowerful—
undersuitableconditionsonψ’soutputdimensionality,itcanbeshownthat
any permutation-invariant setfunction must be representableusing theform in
Equation 5.46 (Wagstaff et al. 2019).
Functionsliketheabove providea‘global’set-leveloutput,but veryoften,
we willbe interestedin functionsthat act‘locally’, ina node-wisemanner.For
example, we maywant to applysome functiontoupdate thefeaturesin every
node, obtainingthe set oflatent node features. Ifwe stack these latentfeatures
intoamatrixH,thefunction
4
F(X)=Hisnolongerpermutationinvariant:
theorderoftherowsofH shouldbetiedtotheorderoftherowsofX,so
that we know which output node feature corresponds to which input node. We
needinsteadamorefine-grainednotionofpermutationequivariance,stating
that,oncewe "commit"to apermutation ofinputs, itconsistently permutesthe
resulting objects. Formally, F(X) isa permutation equivariantfunction if, for
anypermutation matrix P,it holds thatF(PX)=PF(X). A shared node-wise
linear transform
F
Θ
(X)=(5.47)
specified by aweight matrix ΘR
d×ばつd
, is onepossible construction ofsuch a
permutationequivariant function,producinginourexamplelatentfeaturesof
the form h
u
=Θ
x
u
.
ThisconstructionarisesnaturallyfromourGeometricDeepLearning
blueprint.Namely,wecancharacteriselinearequivariants(functionsofthe
formFPX=PFX),forwhichitiseasytoverifythatanysuchmapcanbe
writtenasalinearcombinationoftwogenerators,theidentityF
1
X=X and
theaverageF
2
X=
1
n
11
X=
1
n
P
n
u=1
x
u
.InSection5.2.3,weshowhowthe
popular DeepSets (Zaheeret al.2017) architecturefollows preciselythis idea.
132Chapter 5
5.1.2From sets to graphs
We cannowgeneralise the notionsof permutationinvariance and equivariance
from sets tographs.In the generic settingE =, the graphconnectivity can be
represented by the n×ばつn adjacency matrix A,
5
defined as
a
uv
=
(
1(u,v)∈E
0otherwise.
(5.48)
NotethatnowtheadjacencyandfeaturematricesAandXare"synchronised",
inthesensethata
uv
specifiestheadjacencyinformationbetweenthenodes
described by the uth and vth rows of X– cf. Figure 5.1. Therefore, applying a
permutation matrixPto thenode features Xautomatically impliesapplying it
toA’s rows andcolumns,PAP
.
6
Wesaythat(agraph-wisefunction
7
)fis
permutation invariant if
f(PX,PAP
)=f(X,A)(5.49)
and (a node-wise function) Fis permutation equivariant if
F(PX,PAP
)=PF(X,A)(5.50)
for any permutation matrix P.
Hereagain,wecanfirstcharacteriselinearequivariantfunctions.
8
As
observedbyMaronetal.2018,anylinearFsatisfyingequation(5.50)can
beexpressedasalinearcombinationoffifteenlineargenerators;remarkably,
thisfamilyofgeneratorsisindependentofn.Amongstthesegenerators,our
blueprintspecificallyadvocatesforthosethatarealsolocal,i.e.,wherebythe
output onnodeudirectly depends onits neighbouring nodesin the graph.We
canformalisethisconstraintexplicitly inourmodelconstruction,bydefining
what it means for a node to be neighbouring another.
A(undirected)neighbourhoodofnodeu,sometimesalsocalled1-hop,is
defined as
9
N
u
={v:(u,v)∈Eor(v,u)∈E}(5.51)
and the neighbourhood features are defined as the multiset
10
X
N
u
={{x
v
:v∈N
u
}}.(5.52)
Operating on 1-hopneighbourhoods aligns well withthe locality aspectof our
blueprint:namely,definingourmetric overgraphsas theshortestpathdistance
between nodes using edges in E.
TheGDLblueprint thusyields ageneralrecipeforconstructingpermutation
equivariant functions on graphs, by specifying a local function φthat operates
over thefeatures ofa nodeand itsneighbourhood, φ(x
u
,X
N
u
). Then,a permu-
tation equivariant function Fcan beconstructed byapplying φto everynode’s
Graphs133
x
b
x
a
x
c
x
d
x
e
h
b
φ(x
b
,X
N
b
)
Figure 5.2
Anillustrationofconstructingpermutation-equivariantfunctionsovergraphs,byapply-
inga permutation-invariantfunction φtoeveryneighbourhood.Inthis case,φis applied
tothefeaturesx
b
ofnodebaswellasthemultisetofitsneighbourhoodfeatures,
X
N
b
={{x
a
,x
b
,x
c
,x
d
,x
e
}}. Applying φin this manner to every node’s neighbourhood
recovers the rows of the resulting matrix of latents features H=F(X,A).
neighbourhood in isolation (see Figure 5.2):
F(X,A)=
φ(x
1
,X
N
1
)
φ(x
2
,X
N
2
)
.
.
.
φ(x
n
,X
N
n
)
(5.53)
AsF isconstructedbyapplyingasharedfunctionφtoeachnodelocally,its
permutationequivariancerestsonφ’soutputbeingindependentontheordering
ofthenodesinN
u
.Thus,ifφisbuilttobepermutationinvariant,thenthis
property is satisfied.
Itisalsoworthnoticingthatthedifferencebetweenfunctionsdefinedon
sets andmore generalgraphsin thisexample isthat inthe lattercasewe need
to explicitly accountfor thestructure ofthe domain. Asa consequence,graphs
stand apartinthesense thatthedomainbecomespartof theinputinmachine
learningproblems,whereaswhendealingwithsetsandgrids(bothparticu-
larcasesofgraphs)wecanspecifyonlythefeaturesandassumethedomain
tobefixed.Thisdistinctionwillbearecurringmotifinourdiscussion.As
aresult,thenotionofgeometricstability(invariancetodomaindeformation)
is crucial in mostproblemsof learningongraphs.Itstraightforwardly follows
fromour constructionthatpermutation invariantandequivariantfunctionspro-
duce identical outputs on isomorphic (topologically-equivalent) graphs. These
resultscanbegeneralisedtoapproximatelyisomorphicgraphs,andseveral
results on stability under graphperturbations exist (Levie et al. 2018).We will
134Chapter 5
returntothisimportantpointinourdiscussiononmanifolds,whichwewill
use as an vehicle to study such invariance in further detail.
Second,duetotheiradditionalstructure,graphsandgrids,unlikesets,
canbecoarsenedinanon-trivialway
11
,givingrisetoavarietyofpooling
operations.
5.2Model
GraphNeuralNetworks(GNNs)aretherealisationofourGeometricDeep
Learningblueprintongraphsleveragingthepropertiesofthepermutation
group. GNNsare amongthe mostgeneral classofdeep learningarchitectures
currently in existence, andas we will see in this text, most other deep learning
architecturescanbeunderstoodasaspecialcaseoftheGNNwithadditional
geometric structure.
AsperourdiscussioninSection5.1,weconsideragraphtobespecifiedwith
an adjacency matrix Aand nodefeatures X.We will studyGNN architectures
thatarepermutationequivariantfunctionsF(X,A)constructedbyapplying
sharedpermutationinvariantfunctionsφ(x
u
,X
N
u
)overlocalneighbourhoods.
Undervariousguises,thislocalfunctionφcanbereferredtoas"diffusion",
"propagation",or"messagepassing",andtheoverallcomputationofsuchF
as a "GNN layer".
5.2.1The three "flavours" of GNNs
ThedesignandstudyofGNNlayersisoneofthemostactiveareasofdeep
learningatthetimeofwriting,makingitalandscapethatischallengingto
navigate.Fortunately,wefindthatthevastmajorityoftheliteraturemaybe
directlyderivedfromonlythreespatial
12
"flavours"ofGNNlayers(Figure
5.3),whichwewillpresenthere.Theseflavoursgoverntheextenttowhich
φtransformstheneighbourhoodfeatures,allowingforvaryingdegreesof
complexity when modelling interactions across the graph.
Inallthreeflavours,permutationinvarianceisensuredbyaggregatingfea-
turesfromX
N
u
(potentiallytransformed,bymeansofsomefunctionψ)with
somepermutation-invariantfunction
L
,andthenupdatingthefeaturesof
nodeu,bymeansofsomefunctionφ.Typically,
13
ψandφarelearnable,
whereas
L
isrealisedasanonparametricoperationsuchassum,mean,
ormaximum,thoughitcanalsobeconstructede.g.usingrecurrentneural
networks with appropriate constraints (R. L. Murphy et al. 2018).
Intheconvolutionalflavour(KipfandWelling2016;Defferrard,Bresson,
andVandergheynst2016;F. Wu etal.2019), thefeatures ofthe neighbourhood
Graphs135
c
ba
c
bc
c
bd
c
be
c
bb
Convolutional
x
b
x
a
x
c
x
d
x
e
α
ba
α
bc
α
bd
α
be
α
bb
Attentional
x
b
x
a
x
c
x
d
x
e
m
ba
m
bc
m
bd
m
be
m
bb
Message-passing
x
b
x
a
x
c
x
d
x
e
Figure 5.3
AvisualisationofthedataflowforthethreeflavoursofGNNlayers,g.Weusethe
neighbourhoodofnodebfromFigure5.2toillustratethis.Left-to-right:convolutional,
where sendernode featuresare multipliedwith aconstant, c
uv
; attentional,where this
multiplierisimplicitlycomputedviaanattentionmechanismofthereceiveroverthe
sender:α
uv
=a(x
u
,x
v
);andmessage-passing,wherevector-basedmessagesarecom-
puted based on both the sender and receiver: m
uv
=ψ(x
u
,x
v
).
nodes are directly aggregated with fixed weights,
h
u
=φ
x
u
,
M
v∈N
u
c
uv
ψ(x
v
)
!
.(5.54)
Here,c
uv
specifiestheimportanceof nodevto nodeu’s representation.Itisa
constant that often directly depends on the entries in Arepresenting the struc-
tureofthegraph.Note thatwhentheaggregationoperator
L
ischosentobe
thesummation,it canbe consideredas alinear diffusionorposition-dependent
linear filtering, a generalisation of convolution
14
—givingthe flavour its name.
Intheattentionalflavour(Veli
ˇ
ckovi
́
cetal.2018;Montietal.2017;J.Zhang
et al. 2018; Brody, Alon, and Yahav 2022), these interactions are implicit:
h
u
=φ
x
u
,
M
v∈N
u
a(x
u
,x
v
)ψ(x
v
)
!
.(5.55)
Here,aisalearnableself-attentionmechanismthatcomputestheimportance
coefficientsα
uv
=a(x
u
,x
v
)implicitly.Theyareoftensoftmax-normalised
acrossallneighbours.When
L
isthesummation,theaggregationisstilla
linearcombinationoftheneighbourhoodnodefeatures,butnowtheweights
are feature-dependent.
Finally,themessage-passingflavour(Battagliaetal.2016;Gilmeret
al.2017;Battagliaetal.2018)amountstocomputingarbitraryvectors
("messages") across edges,
h
u
=φ
x
u
,
M
v∈N
u
ψ(x
u
,x
v
)
!
.(5.56)
136Chapter 5
Here,ψisalearnablemessagefunction,computingv’svectorsenttou,and
the aggregation can be considered as a form of message passing on the graph.
Oneimportantthingtonoteisarepresentationalcontainmentbetween
theseapproaches:convolutionattentionmessage-passing. Indeed,atten-
tionalGNNscanrepresentconvolutionalGNNsbyanattentionmechanism
implementedasalook-uptablea(x
u
,x
v
)=c
uv
,andbothconvolutionaland
attentionalGNNsarespecialcasesofmessage-passingwherethemessages
areonlythesendernodes’features:ψ(x
u
,x
v
)=c
uv
ψ(x
v
)forconvolutional
GNNsandψ(x
u
,x
v
)=a(x
u
,x
v
)ψ(x
v
)forattentionalGNNs. Notethatthelat-
ter containment holdseven ifthe attentional coefficients arenormalised across
neighbours; we will explore this further in Section 5.4.1.
This does not imply that messagepassing GNNs are always the most useful
variant;astheyhaveto computevector-valuedmessages acrossedges, theyare
typically harder totrain and requireunwieldy amounts of memory. Further, on
awiderangeofnaturally-occurringgraphs,thegraph’sedgesencodefordown-
stream classsimilarity (i.e.anedge (u,v) impliesthat uand vare likely tohave
thesameoutput).Forsuchgraphs(oftencalledhomophilic),convolutional
aggregationacrossneighbourhoodsisoftenafarbetterchoice,bothinterms
ofregularisationandscalability.AttentionalGNNsoffera"middle-ground":
theyallowformodellingcomplexinteractionswithinneighbourhoodswhile
computingonlyscalar-valuedquantitiesacrosstheedges,makingthemmore
scalable than message-passing.
The"threeflavour"categorisationpresentedhereisprovidedwithbrevity
inmindandinevitablyneglectsawealthofnuances,insights,perspectives,
generalisations,andhistoricalcontextstoGNNmodels.However,basedon
argumentswewillpresentinthefollowingsubsections,itsexpressivitymay
be sufficient to represent anygraph function of interest.
5.2.2A maximally potent GNN: the Graph Network model
Westartedoutourdiscussionbydeliberatelyassumingthegraphonlyhad
featuresinthenodes;however,formanydomainsofscientificandindustrial
interest, we may be interested in featurising the graph at other granularities.
Oneveryimportantsuchcontextiscomputationalchemistry,andwhen
working with molecular data we may expect to see,simultaneously: nodefea-
tures,x
u
,suchasatomtypeandcharge;edge features,x
uv
,suchasbondtypes,
andwhethertheyareinaring;andgraph features, x
G
,suchasthemolecular
weight orfingerprints.We maythenbe interested inusing a GNNto compute
representations h
u
, h
uv
, h
G
at all of these levels.
Toillustratehowalltheconcepts weintroducedbeforegeneralise tothisset-
ting, wepresent theGraphNetwork architecture (Battagliaet al.2018).It isa
Graphs137
x
G
x
v
x
uv
x
u
ψ
h
uv
φ
h
u
ρ
h
G
L
v∈N
u
L
(u,v)∈E
L
u∈V
Figure 5.4
ThedataflowoftheGraphNetworkmodel(Battagliaetal.2018),integratingnode
features(inblue),edgefeatures(ingreen)andgraphfeatures(inred)tocomputeappro-
priate representations. Equivariant(blue) and invariant (red) layers featureprominently
across the Graph Network’s three functions ψ, φand ρ.
generalblueprintfor buildingpermutationequivariantfunctionsoverattributed
graphs, and all of theflavours we discussed can be obtained byanappropriate
restriction ofits equations.It proceedsin three phases,at eachstep usingall of
the relevant information (Figure 5.4).
First,edgerepresentations,h
uv
,arecomputedusingthefeaturesoftherel-
evanttwonodes(uandv),thefeaturesoftheedge,andthefeaturesofthe
graph:
h
uv
=ψ
(
x
u
,x
v
,x
uv
,x
G
)
,(5.57)
whereψwouldbeanalogoustothepreviouslydefinedmessagefunction.
Treatingh
uv
asmessages,wecancomputenoderepresentations.TheGraph
Network takes into account the node’s features, the aggregated edge represen-
tations from all of the neighbours, and the graph features:
h
u
=φ
x
u
,
M
v∈N
u
h
uv
,x
G
!
.(5.58)
Finally,sincegraph-levelrepresentationsareaglobalpropertyofthegraph,
theyneedtobecomputedinvariantly.Embodyingthespiritofthegeometric
deep learning blueprint, we aggregate all the node andedgerepresentationsto
compute h
G
:
h
G
=ρ
M
u∈V
h
u
,
M
(u,v)∈E
h
uv
,x
G
.(5.59)
138Chapter 5
Lastly,thereadermaybecuriousaboutwhethersuchaframework trulyis
maximallypotent—for example,couldit alsosupportstructures suchashyper-
graphs, wheremorethantwo nodesmaybe involved inahyperedge.Wewill
ponder this question from several angles in the following sections.
5.2.3Deep Sets, Transformers, and Graph Rewiring
The"threeflavours"viewofGNNspresentedthusfarlargelyconcernsitself
withthedesignoftheirmessageandupdatefunctions(ψandφ).However,
lookingatthegeneralmessagepassingequation(Equation5.56),therearein
fact two other "moving parts": the choice of aggregator,
L
, and the choice of
neighbourhoods, N
u
. In the process of analysing thesechoices, it turns out we
canformtightconnectionsbetweenGNNsandmanyotherinfluentialneural
network architectures. We will begin by analysing the choice of N
u
.
Infact, thechoiceofN
u
isdirectlytiedtotheinputgraphstructureweare
assuming. Thusfar,wepresumed thatN
u
is given to us, andwe didnotques-
tionits structure.However, suchasituation isinreality quiterare:manygraphs
we observe intherealworld are merelyapproximationsof the "ground-truth"
interactionsbetweenthenodes.Forexample,inasocialnetwork,weonly
observe the friendship links users have explicitly formed, rather than the com-
plete collectionof friendshipinteractionsbetween thosepeople. Andwithina
molecule,thereexistphysical interactionsbetween allpairs ofatoms—not just
onesexplicitlyconnectedbyachemicalbond.Thisexamplebecomesespe-
ciallyprominentwhenweconsiderinteractionswithinmacromoleculessuch
asproteins,thatfoldin3Dspace,allowingatomsthatarequitedistantinthe
molecular graph to interact more directly.
Beyondtheneedtohandlesuchapproximationstoground-truthgraphs,it
is also well-understoodthat certainchoices of graphare simply pathologically
bad–andtheirusewillleadtoissuesregardlessoftheGNNarchitecturewe
use(AlonandYahav2020;DiGiovannietal.2023).Onetypicalinstanceof
thisariseswhenthegraphhasbottlenecks—smallcollectionsofedgeswhich
areresponsibleforcarrying representationsbetween largegroupsofnodes.For
example,inabarbellgraph,therededgeisundersig-
nificantrepresentationalpressuretotransportinformationbetweenthenodes
inthetwocommunities.Ifitisimportantforthosecommunitiestodirectly
exchangefeatures,itispossibletoprovethatinmanycases,GNNswillnot
even be able to fit their training data accurately under such a graph.
TostudyvariousstandardchoicesofN
u
withoutdiscussingpropertiesof
any given input graphs, we will now briefly return to the domain of unordered
sets.InthelanguageofSection5.1,weassumethatwearegiven amatrixof
Graphs139
12
3
4
56
7
8
12
3
4
56
7
8
12
3
4
56
7
8
Figure 5.5
ThreestrategiesforchoosinganedgesetforaGNN,left-to-right:empty,leadingto
DeepSets(Zaheeretal.2017);complete,leadingtoTransformers(Vaswanietal.2017);
and precomputed, leading to EGP (Deac,Lackenby, and Veli
ˇ
ckovi
́
c 2022). Inall cases
we highlight one of the nodes’ neighbourhoods, N
1
, in red.
node features, X, but without any specified adjacency or ordering information
betweenthenodes.Thespecificarchitectureswillarisebydecidingtowhat
extent to model interactions between the nodes; see Figure 5.5.
5.2.3.1EmptyedgesetUnorderedsetsareprovidedwithoutanyadditional
structureorgeometrywhatsoever—hence,itcouldbearguedthatthemost
natural way toprocess themis to treateach setelement entirelyindependently.
Thistranslatestoapermutationequivariantfunctionoversuchinputs,which
wasalreadyintroducedin Section5.1:asharedtransformation appliedtoevery
nodeinisolation. Assumingthe samenotation aswhendescribingGNNs,such
models can be represented as
h
u
=ψ(x
u
),(5.60)
whereψisalearnabletransformation.Itmaybeobservedthatthisisaspe-
cialcase ofa convolutional GNNwith N
u
={u}—or, equivalently,A=I. Such
anarchitectureiscommonlyreferredtoasDeepSets,inrecognitionofthe
workofZaheeretal.2017thathavetheoreticallyprovedseveraluniversal-
approximationproperties ofsucharchitectures. Itshouldbe notedthat theneed
toprocessunorderedsetscommonlyarisesincomputervisionandgraphics
when dealingwithpoint clouds;therein,such modelsareknownasPointNets
(Qietal.2017).Philosophically, itmayalsobenotedthatthisarchitectureis
thefurthestwecangowith"pure"sets;anyadditionalinputstoψimplythat
there exist connections between the nodes, making our domain a graph.
5.2.3.2CompleteedgesetWhileassuminganemptyedgesetisavery
efficient constructforbuildingfunctionsover unorderedsets,oftenwewould
140Chapter 5
expectthatelementsofthesetexhibitsomeformofrelationalstructure—
i.e.,thatthereexistsalatentgraphbetweenthenodes.SettingA=Idiscards
anysuchstructure,andmayyieldsuboptimalperformance.Conversely,we
couldassume that,in absenceofanyother priorknowledge,we cannotupfront
excludeanypossiblelinksbetweennodes.Inthisapproachweassumethe
completegraph,A=11
;equivalently,N
u
=V.Aswedonotassumeaccess
to any coefficients of interaction, running convolutional-type GNNs over such
a graph would amount to:
h
u
=φ
x
u
,
M
v∈V
ψ(x
v
)
!
,(5.61)
where the secondinput,
L
v∈V
ψ(x
v
) is identicalfor all nodesu
15
, and assuch
makes the modelequivalently expressive to ignoringthatinput altogether;i.e.
the A=Icase mentioned above.
This motivates the use of a more expressive GNN flavour, the attentional,
h
u
=φ
x
u
,
M
v∈V
a(x
u
,x
v
)ψ(x
v
)
!
(5.62)
whichyieldstheself-attentionoperator, thecoreoftheTransformerarchitec-
ture(Vaswanietal.2017).Assumingsomekindofnormalisationoverthe
attentionalcoefficients(e.g. softmax),wecan constrainallthe scalarsa(x
u
,x
v
)
tobeintherange[0,1];assuch,wecanthinkofself-attentionasinfer-
ringasoftadjacencymatrix,a
uv
=a(x
u
,x
v
),asabyproductofgradient-based
optimisation for some downstream task.
TheaboveperspectivemeansthatwecanposeTransformersexactlyas
attentionalGNNsoveracompletegraph(Joshi2020).
16
However,thisisin
apparentconflictwithTransformersbeinginitiallyproposedformodelling
sequences—therepresentationsofh
u
shouldbemindfulofnodeu’sposition
in the sequence, which complete-graphaggregation would ignore. Transform-
ersaddressthisissuebyintroducingpositionalencodings:modulatingeither
thenodefeaturesx
u
orthecomputedattentioncoefficientsa
uv
inawaythat
dependsontheabsoluteorrelativepositionsofuandvinthesequence.The
originalTransformer architecture,forexample, samplesanadditivefactor for
x
u
from a sine wave, whose frequency is dependent on u.
Ongraphs,wherenonaturalorderingofnodesexists,multiplealterna-
tiveswereproposedtosuchpositionalencodings.Oneparticularlypromising
directioninvolvesarealisationthatthepositionalencodingsusedin‘vanilla’
TransformerscanbedirectlyrelatedtothediscreteFouriertransform(DFT),
andhencetotheeigenvectorsofthegraphLaplacianofa"circulargrid".
Hence,suchpositionalencodingsareimplicitlyrepresentingourassumption
Graphs141
thatinputnodesareconnectedinagrid.Formoregeneralgraphstructures,
onemaysimplyusetheLaplacianeigenvectorsofthe(assumed)graph—an
observationexploited byDwivedi and Bresson(2020) within their empirically
powerfulGraphTransformermodel.DesigningTransformersforgraphsisa
veryactiveandimportantareaofresearch;wedirectthereadertoMülleret
al. (2023) for an excellent overview.
Finally,wenotethatwhileourdiscussionmighthavebeensufficientto
describemanyimportantaspectsofself-attention,additionalcarefuldetails
areneededtorelatethemodernlargelanguagemodel(LLM)stacktoour
framework. We detailedly cover this as a case study in Section 5.4.
5.2.3.3Inferred edgesetFinally,onecantrytolearnthelatentrelational
structure,leadingtosomegeneralA thatisneitherI nor11
.Theproblem
of inferringa latentadjacency matrixAforaGNN touse (oftencalledlatent
graphinference)isofhighinterestforgraphrepresentationlearning.Thisis
duetothefactthatassumingA=Imayberepresentationallyinferior,and
A=11
maybechallengingtoimplementduetomemoryrequirementsand
largeneighbourhoods toaggregate over.Additionally, it isclosest tothe "true"
problem:inferringanadjacencymatrixA impliesdetectingusefulstructure
betweentherowsofX,whichmaythenhelpformulatehypothesessuchas
causal relations between variables.
Unfortunately,suchaframingnecessarilyinducesastep-upinmodelling
complexity.Specifically,itrequiresproperlybalancingastructurelearning
objective(whichisdiscrete,andhencechallengingforgradient-basedopti-
misation)withanydownstreamtaskthegraphisusedfor.Thismakeslatent
graphinferenceahighly challengingand intricateproblem. Accordingly,many
of thecurrent popularmethodsinthis spaceside-steptheissue oflearnability
byusingfullynonparametricmethods(Wangetal.2019),orpre-computing
thegraphstructuretouse(Deac,Lackenby,andVeli
ˇ
ckovi
́
c2022;Shirzad
etal.2023).Thesemethodsarealsopopularwhenmodifyingagiveninput
graph toe.g.relieve bottlenecks—an approachcommonly referred toas (non-
parametric)graph rewiring(Gasteiger,Weißenberger,andGünnemann2019;
Topping et al. 2021).
5.2.4The expressive power of GNNs, and the Weisfeiler-Lehman test
The remainingdesignchoiceofGNNstodiscussisthechoiceofaggregation
function,
L
.Whilestudyingthechoiceofneighbourhoodsallowedustorelate
manypopulararchitecturestoGNNs,studyingaggregatorswillallowusto
reason about the GNNs’ expressive power, as we will now discover.
142Chapter 5
φ(,{{,}})
φ(,{{,,}})
φ(,{{,}})
φ(,{{,,}})
φ(,{{,}})
Figure 5.6
HowtheWeisfeiler-Lehmanalgorithmfeaturisesagivengraphoffivenodesandsix
edges, withcategoricalfeatures comingfrom
1
(i.e., effectivelya featurelessgraph).
At each step, the colours are refined using injective mappings of neighbourhood multi-
sets, until thedistribution ofcolours stops changing –in this case, yieldingthe multiset
{{,,,,}}.Thishistogramof colourscanthenbecompared againstanothergraph’s.
ItwouldbeveryusefultounderstandwhethertheGNNflavourswedis-
cussedarecapableofrepresentinganypermutationequivariantfunctionwe
careaboutovergraphs,orifthereexistsaneedtostudyalternativeframe-
works.ToquantifythekindsoffunctionsaGNNcanexpress,wetypically
studythemfromtheperspectiveofdecidinggraphisomorphism:giventwo
non-isomorphicgraphs,G
1
andG
2
andaGNNembeddingthemintoalatent
spaceh,doesitholdthath
G
1
=h
G
2
?Ifnot,anykindofdistinctclassification
of these two graphs is hopeless
17
.
Tosimplifyouranalysis,wewillassumethatthegraphshavecategor-
icalnodefeatures,x
u
k
,thatis,thesefeaturesareone-hotvectorsof
kdimensions.Wecaninterpretthisasacolouringoftheirnodes.Recall
that,inourderivationsofar,weexpressGNNsusingtheabstractformula
h
u
=φ(x
u
,X
N
u
),wherethefeaturesoftheneighbourhoodarerepresentedas
L
v∈N
u
ψ(x
u
,x
v
). Inorderto maximisethecapacity of theGNN, notethath
u
willretainmaximalinformationaboutthegraph—andhence,havemaximal
expressivecapacity—whenφisinjectiveinitsinputs:thatis,twodifferent
neighbourhood multisets must be mapped to different vectors by φ.
Inthecategorical-featurescase,wecanobtaininjectivitybychoosinga
verysimplemessagefunction:ψ(x
u
,x
v
)=x
v
,andthenpickinganaggrega-
tor
L
that preserves allmultiset statisticsofX
N
u
, especiallyaroundrepeated
elements.Suchanaggregatorissummation(
L
=
P
),asitwillpreserve
informationaboutallcardinalitiesofrepeateditemsinthemultiset.Weaker
Graphs143
aggregators,suchasaveragingormaximisation,mayreducethesecardinali-
ties to their relative ratios (
1
|N
u
|
P
), or eliminate them altogether (max). K. Xu
etal.(2018)leveragethisinsighttoderivethegraphisomorphismnetwork
(GIN), a maximally-expressive GNN over categorical node features:
h
u
=φ
(
1+ε
)
x
u
+
X
v∈N
u
x
v
!
(5.63)
whereεRisalearnablescalardesignedtogiveinjectiverepresentationtothe
receivernode(u),andφisadeepmultilayerperceptron(whichcanrepresent
aninjectivefunction).TheexpressivepowerofGNNsisthenequivalentto
theWeisfeiler-Lehman graphisomorphismtest(WeisfeilerandLeman1968),
aclassical algorithmingraphtheory whichiterativelyrefines thecoloursofthe
nodes usingan (injective) hashof the neighbours’colours (Morriset al.2019).
An illustration of how this algorithm operates is provided in Figure 5.6.
5.2.5*More expressive GNNs
Whiletheequivalenceofmessagepassing(asdescribedinEquation5.56)
totheWeisfeiler-Lehmantestiscertainlyasatisfyingresult,itleavesmuch
tobedesired.Forexample,itimpliesthatmessagepassingneuralnetworks
provablycannotdistinguisha6-cyclefromtwotriangles;the
colourrefinementprocedurewillneverupdateanycolour,asallneighbour-
hoods "locallylook the same".In some applications, thisis amajor drawback:
forexampleinorganicchemistrywheremoleculesaremodeledasgraphs,
cyclic structures such as aromatic rings are abundant and important.
TherearemultiplewaystoextendtheexpressivepowerofGNNsthatwe
willdiscussnext;notethatnoneofthemwillimplythattheequivalenceof
GNNstotheWLtestdoesn’thold,astheywillmodifyassumptionsconcerning
theinput providedto theGNN. Infact, wemay oftenleveragespecific limiting
examples—like the 6-cycle/triangle one above—to derive meaningful ways in
which such limitations can be overcome in general.
5.2.5.1FeatureaugmentationWestartfromperhapsthemostobvious
reasonwhytheWLtestmayfail–toomanyneighbourhoodsbeinglocally
identical.Thisissuearisesbecausetherearenodistinguishingfeatures inthe
nodes; sometimes,the graphmayeven befeatureless, as wediscussed. Italso
immediatelyhintsatacheapwaytosurpassthislimitation:simplyaugment
the graph with more descriptive features.
Anapproachlikethis"pre-colours"different nodes(oredges)ofthegraph
byattachingsomeadditionalfeaturestothem,giving theWLalgorithm(and
144Chapter 5
hencealsoGNNs!)additionalinformationtoworkwithwhendistinguish-
ingnon-isomorphicgraphs.Thesepre-coloursdonothavetobeparticularly
meaningful— alreadyrandomlyfeaturisingthe nodesisprovably sufficient to
increaseexpressivepower(Sato,Yamada,andKashima2021).Randomfea-
turesallowmessagepassingtoidentifyanodeinitsownk-hopneighbourhood,
empowering GNNs to detect cycles of a certain cardinality.
Naturally,featureaugmentationalsoworksvery wellwhenperformedina
moreinformedmanner(R.Murphyetal.2019);itisevenpossibletocount
substructuresweexpecttocareabout(e.g.aromaticringsinmolecules)and
use these counts for additional featurisation (Bouritsas et al. 2022).
5.2.5.2MessagepassingmodulationWhileintroducingadditionalstruc-
turalfeaturesboostsexpressivepower,theGNNisnotcommittedtousing
suchfeatures.As such,another promisingclass ofmodels injectssuch features
directly into theequationofthemessagefunctionψ; in that way, the model is
forced to take this information into account.
There aremany representative instancesofsuch modulatedmessagemech-
anisms;wecallouttwoearlysuccessfulexamples.Firstly,directionalgraph
networks(Beainietal.2021),whereinLaplacianeigenvectorsareusedto
definecanonical"flows"onagraph,andtheyexplicitlyguidethescaling
factorsofvariousincomingmessages.Secondly,theLSPEmodel(Dwivedi
etal.2021)hasdedicatedmessagepassingoverthepositionalfeaturesit
leverages, once again coercing the model to process them nontrivially.
SuchideashavealsopermeatedintographTransformers,withperhaps
themostprominentexamplebeingtheGraphormer(C.Yingetal.2021).
Graphormers computeall-pairs shortestpaths betweennodes inthe graph,and
use them as an explicit bias for the self-attention operator.
5.2.5.3ComputationgraphdesignWhilepre-computinginteresting
graphpropertiesandthenleveragingthem(eitherasadditionalfeaturesor
throughmodifyingmessagesdirectly)isanattractiveapproach,italsoarguably
requiresadegreeoffeatureengineering.Itmightbemorefavourableto
empower themodeltoextractmeaningfulstructurebyitself—butitprovably
cannot do soover many input graphs,owingto theequivalence tothe WLtest.
This motivates a broad class of methods that modifythe input graph of a GNN
in a way that will allow message passing to better detect certain substructures.
We’vealreadycoveredsomeoftheseapproachesinSection5.2.3.3under
theumbrellaof"graphrewiring";onekeydifferenceisthatapproacheswe
willdiscuss hereoftenintroduceadditionalnodesintothegraph.Suchnodes
(often referredto asvirtual nodes)carry representationsthat promotestructure
Graphs145
1
23
Subgraph1
Subgraph2
Subgraph3
2
1
3
2
1
3
2
1
3
1 3
1 2
2 3
101
011
111
110
111
011
111
110
101
S
n
S
m
Figure 5.7
AsubgraphGNNoperatesoverseveralsampledsubgraphsofthegiveninputgraph
simultaneously.ESAN(Bevilacquaetal.2021)isasubgraphGNNthatoperatesina
way that respects boththe permutation symmetries of the nodes (S
n
) and the subgraphs
(S
m
)—hence, it is equivariant to the product group S
n
×ばつS
m
.
discovery;whenappropriatelyplaced,theycanprovablyimprovethequality
of the GNN’s dataflow (Wilson, Bechler-Speicher, and Veli
ˇ
ckovi
́
c 2024).
Perhapstheeasiestplacetostarttrackingtheseapproachesisbystudy-
ingextensionstotheWeisfeiler-Lehmantestitself.Itwasindeedextended
inthelate1970s
18
toahierarchyofincreasinglymorepowerfulhigher-
dimensional k-WLteststhat operateon k-tuples ofnodes. ChristopherMorris
etal.(2019)andHaggaiMaronetal.(2018;2019)showeddifferentdesigns
ofhigher-dimensionalGNNarchitecturesequivalenttok-WLtests;however,
suchmodelslosetheadvantageoflocalityandlinearcomplexityofthe
standardmessage-passingGNN
19
.Whilesuchapproachestakeintoaccount
featurisingallrelevant k-tuplesindiscriminately,itisalsopossibleto perform
moretargetedmessagepassingoverspecifictopologicalsubstructuresinthe
graph, such as simplicial andcellular complexesof the graph (Bodnar, Frasca,
Wang, et al. 2021; Bodnar, Frasca, Otter, et al. 2021).
While approaches inspired by theabove typically addnodes and edges into
thegraph,itisalsopossibletorecoverimprovedexpressivepowerbyselec-
tivelydeletingnodesoredges—effectively,bymessagepassingover selected
subgraphsoftheinputgraph(Figure5.7).Oneparticularlyinterestingrepre-
sentativesubgraphGNNfromtheperspectiveofequivariancesisESAN;the
equivariantsubgraphaggregationnetwork(Bevilacquaetal.2021).ESAN
operatesovermultisetsofsubgraphs,whosesymmetryarisesfromboththe
structureofeachconstituentsubgraphaswellasthemultisetasthewhole.
146Chapter 5
The principle ofequivariancerequiresthatchanging the order ofm subgraphs
in the multiset (permutation group S
m
) and the order of the n nodes inthe sub-
graphs(permutationgroupS
n
)yieldsanequivalent representation,orinother
words,thearchitectureisequivariantw.r.t.theproductgroup
20
G=S
m
×ばつS
n
.
This example serves asa relevantinstance ofcombining symmetries withinthe
samearchitecture;wewillexploremoreinvolvedinstancesofthiswhenwe
discuss geometric graphsin latter Chapters.
5.2.5.4Synchronisationand"clocks"Thusfar,wehavediscussedsev-
eral ways inwhich themessage passingequation may beedited: itsparametric
functions(ψ,φ),aggregationfunction(
L
),features(x
u
),andcomputation
graph(N
u
).Itmayseemasifwe’vecoveredalltheanglespossiblebydoing
so, but in fact, there remains onehidden but important aspect – thesynchroni-
sation betweenthedifferent stepsof themechanism. Inreality, Equation 5.56
stores embeddings over time:
h
(t+1)
u
=φ
x
(t)
u
,
M
v∈N
u
ψ
x
(t)
u
,x
(t)
v
!
,(5.64)
wheretNidentifiesthecurrenttime-stepoftheGNN,andx
(t)
u
correspond
totheGNNinputembeddingsavailableattimet—notethatwemaysimply
equate x
(t+1)
u
=h
(t+1)
u
in most cases, though they may also differ.
Inalmostallcases,Equation5.64isexecutedsynchronously:allh
(t+1)
u
embeddings arecomputed simultaneously, in parallel– inessence,there isno
distinctionbetweenthenotionoftime-stepand thenotionofGNNlayer.But
whatifwerelaxedthisassumption,andallowedspecificsubsetsofmessages
to be computed or aggregated at one step, rather than all-at-once?
Itturnsoutthatthismodification,evenwhenonlyonemessageatatime
isprocessed(seeFigure5.8),canyieldimprovementstoexpressivepower.
Thiseffect wasfirststudiedandprovedbyFaberandWattenhofer(2024)for
theirGwACasynchronousarchitecture.Theargumentisasfollows:clearly,
anasynchronousGNNcanmatchtheexecutionofasynchronousone—we
justneedtouseanexecutionorderthatmatchestherelevantsynchronisation
points.Assuch,themodelisatleast1-WLexpressive.Further,forspecific
execution orders, it can be shown that it is possible to exactly trace the outline
ofparticularcyclestructuresinthegraph,oreven exactlyidentifyeachnode
(aswe didwithrandomfeaturespreviously),allowingtodiscernstructures that
1-WL cannot.
GwAC does nottake avery strongstance on theexecution order in itsasyn-
chronoustrace,butseveralotherarchitectureshavesinceemergedthat do.One
Graphs147
m
ba
m
bc
m
db
m
bb
x
b
x
a
x
c
x
d
:m
ba
=ψ
x
b
,x
a
:m
bc
=ψ
x
b
,x
c
:x
b
=φ
x
b
,m
bc
:m
bb
=ψ
x
b
,x
b
:x
b
=φ
x
b
,m
ba
:m
db
=ψ
x
d
,x
b
:x
b
=φ
x
b
,m
bb
:x
d
=φ
x
d
,m
db
Figure 5.8
Anexecutiontraceofafully-asynchronousGNN:ateachtime-step(denotedby
advancingclock),exactlyonemessageiseithercomputedoraggregated.Notethat
the snapshot of the node features dynamically changes whenever φis invoked.
such architectureis theCo-GNN (Finkelshtein etal. 2024), whereineach node
can select whether to send or receive messages at a particular GNN layer.
While powerful, asynchronous GNNs do not neatly align with the currently
prevalentmassively-parallelapproachestomachinelearningmodels.There-
fore,amiddle-groundapproachthatcanexploitsuchparallelism,whilestill
havingcleartraitsofasynchronouscomputation,wouldbeveryattractive.
Dudziketal.(2024)haveshownthatthisisindeedpossibletobuildsuch
amodel,throughtheconceptofasynchronyinvariance:keepingthemodel
fully synchronous, but constrainingitto give identicalembeddingseven if we
madeitsexecutiontraceasynchronousincertainways.Thisapproachneatly
resonateswiththeideasofsymmetrythatweexploredthusfar;however,it
requiresworkingwithgeneralisationsofgroups(monoids),andassuchwe
will only study it significantly later in this book—namely, in its final Chapter.
5.2.5.5ContinuousnodefeaturesThereisnodoubtthatthescaffold
offeredbythek-WLhierarchyhashadoutsizedimpacttothedevelopment
ofgraphrepresentationlearning.Aswe’vejustseen,thisalgorithmfamily,
andtheobservationthatGNNsnaturallyaligntoit,hasinformedcountless
contributionsinfeatureengineering,architecturedesign,graphrewiring,and
148Chapter 5
evenasynchronousmodels.Itsflexibilitydoesnotcomewithoutlimitations,
however—andperhapsthemostglaringoftheseisitsassumptionondis-
cretefeaturesinthenodes.Formanyscientificandindustrialtasksarising
from real-world data, features we have available will often be real-valued. We
willonlyfocusonthisparticularextension,andhowitleadstotranscending
theWLhierarchy.Thatsaid,weremarkseveralotherapproachesthatarenot
well-aligned with k-WL emerged in recent times; e.g., Tönshoff et al. (2021).
While K.Xuetal.(2018) leftthestudyofaligningGNNswithreal-valued
features to decidinggraph isomorphism tofuture work, it wasalreadyevident
fromtheworkofZaheeretal.(2017)thatextendingsuchresultseveninthe
domain ofsets was goingto bedifficult. Theessence ofdifficultiesgoverning
thistransitionwerefinallyexposedandsettledbyCorsoetal.(2020).The
keyissue arises in thelargerflexibilityof theobjects that needto be injectively
aggregated:whileintheWLsettingwedealtwithaggregatingmultisetsofone-
hot vectors—and hence all we had to track for injectivity was the preservation
oftheircardinalities—withreal-valuedfeatures,non-isomorphicgraphscan
have clashing embeddings from unexpected directions.
Toillustratethis,letusfirstrecallthat,accordingto1-WL,aggregator
choicesfor
L
areordered byexpressivity asmax<
1
|N
u
|
P
<
P
,with
P
being
maximally expressive. Now, assume thatour inbound messagesare in Rrather
than
k
.Itisrelativelystraightforward toplayaroundwithexamplesofreal-
valued, non-isomorphic multisetswhere notonly thishierarchy isbroken, but
some may not be injectively distinguished by any of these three aggregators:
{{4,1,1}} vs. {{3,3,0}}: distinguished by max but not by mean or sum.
{{4,2,2,0}} vs. {{4,4,0,0}}: cannot be distinguished by either max, mean or
sum—however, they can be distinguished by standard deviation.
Following these leads, with a littlehelp from standard facts inalgebraic topol-
ogy
21
,it ispossibletoshowthatnosingleaggregatorchoice for
L
issufficient
to injectively embedreal-valued neighbourhoods. Infact, itcanbe shown that
thenumberofaggregators,N,neededforinjectivityisequaltothemaximal
neighbourhood size, max
u∈V
|N
u
|, in ourdataset. While it isnot yet clearwhat
property thesecollectionsof aggregators need tosatisfyto beinjective, Corso
etal.(2020)showedthatthefirstNmomentsoftheneighbourhoodmultiset
(whose elements are modelled usinga random variable X), defined as follows:
M
1
=E(X)M
n
=
n
p
E[(XE(X))
n
](n>1),(5.65)
Graphs149
wouldserveasasufficientcollectionofaggregators.Theyleveragedthis
insighttoproposetheprincipalneighbourhoodaggregation(PNA)archi-
tecture,whichretainsonlythenumericallystablemoments(n{1,2,},
amounting to mean, standard deviation, min and max).
Concurrentlywiththesedevelopments,Wagstaffetal.(2019)successfully
settledtheextensiontoreal-valuedfeaturesforpureunorderedsets,success-
fullyextendingresultsfromZaheeretal.(2017).Remarkably,theyfinda
very compatible result—proving that it is possible to injectively embed sets of
real-valuedfeatureswithasingleaggregator,solongasthenumberoflatent
features islarger thanor equalto themaximalsetsizeinthedataset.Thefact
that bothCorso etal. (2020)and Wagstaffet al.(2019) relyon themaximal set
size hints at a deep connection between these results
22
.
PNAoffersimprovedexpressivityofGNNsthrough theuseofmultiplefixed
aggregators.Whatif,instead,wehadaparametricaggregator?Severalpro-
posals attemptto introducescalarparameters inan otherwiserigidaggregator
backbone(Pellegriniet al.2020). Itis alsopossible toreplace aggregators with
a scaffold relying on fully-generic neuralnetworks (suchas MLPs) serving as
the
L
operator;however,caremustbetakenthatsuchnetworkssupporta
commutativemonoidstructure(OngandVeli
ˇ
ckovi
́
c2022).Asmonoidsare
structuresthat generalisegroups andhence outoffocus forthe GDLblueprint,
we will defer this particular example to the latter Chapters of the book.
5.2.6What lies beyond message passing?
Throughout theprevious five subsections,we made clearinroads towards sur-
passingthe1-WLlimitationonexpressivityforGNNsasspecifiedbyEquation
5.56. This raises clear questionsas towhethermessage passingissufficient to
describe all possible permutation-equivariant layers we might come across.
Forsomeapproacheswediscussed,itiscleartheequationstillapplies:
featureaugmentationdoesnotchangetheoperationsofthemodel,message
passingmodulationconditionsfunctionslikeψ(x
u
,x
v
)toψ
G,(u,v)
(x
u
,x
v
)—
makingthem dependonthegraph structurearoundnodes uandv,and methods
like PNA amount to representing
L
using a combination of aggregators.
Forotherapproaches,thesituationislessclear,andthejuryisstillvery
much outconcerning theuniversality ofmessage passing; noteven theauthors
ofthisbookareinagreement
23
.Tostimulateausefuldiscussion,wewill
presentaclaimthatmessagepassingisuniversaluptoachangeofcomputation
graph;the argument was firstwritten up by Veli
ˇ
ckovi
́
c (2022)and popularised
by Francesco Di Giovanni in his NeurIPS’22 Workshop talk.
Theclaimis asfollows(Figure 5.9):let GNN
θ
(G,X) beanyneuralnetwork
withlearnable parametersθ, appliedover agraph Gwith(node/edge/graph...)
150Chapter 5
A
B
Y
A
B
GNN
R
MPNN
Figure 5.9
SchematicoftheconjectureformalisedbyFrancescoDiGiovanniathisNeurIPS’22
GraphLearning FrontiersWorkshoptalk,illustrated onthe caseof asimplicial complex
GNN (Bodnar, Frasca, Wang, et al. 2021). Such models perform message passing over
substructures – spanningmore than twovertices, andhence not conforming toa binary
messagefunctionsuchasψ(x
u
,x
v
).However, itispossibletoequivalentlyre-express
such computationby firstrewiring thegraph—in thiscase, bycreating new nodesthat
willrepresenteachsimplicialcomplex,appropriatelywiringthemtotheirconstituent
nodes—and then performing the usualmessage passing equation. It is conjectured that
every (graph) neural network may be expressed in this manner.
features X. Then, it is conjectured that, for some Rand θ
:
GNN
θ
(G,X)=MPNN
θ
(R(G),R(X))(5.66)
whereMPNN
θ
(G
,X
)isamessagepassingGNNasspecifiedinEquation
5.56,withparametersspecifiedbyθ
,neighbourhoodsspecifiedbythegraph
G
andinputfeaturesspecifiedbyX
.Therewiringfunction,R:G×ばつR
n
in
G×ばつR
n
out
,modulates theinput graphin anappropriate way. Notethat weallow
Rto act onthe node features aswell, as e.g. theoperation may introduce new
nodes into the graph, which then need to be appropriately initialised.
If the conjecturein Equation 5.66holds, then themessage passing primitive
wouldbe universal—perhapsforall ofdeeplearningoverdiscretestructures
24
.
Forthespecificcomputationgraphmodulationswepresentedearlier,wecan
indeed propose how to construct appropriate rewiring operations:
Formessagepassingoversubstructures(suchask-tuples),wecaninvent
newnodeswhichrepresentthesesubstructures,andthenconnectthemtoall
Graphs151
oftheirconstituentnodes(inamannerthatdependsonwhetherpermutation
invariance holds across the substructure).
ForsubgraphGNNs,wecanduplicateeachnodeoncepersubgraph,and
connect the various copies of a node together using edges of a separate type.
ForasynchronousGNNs,wecanartificiallyslowdownmessagesby
introducing slow nodes in-between specific edges (Azabou et al. 2023).
Whiletheseexamplesofferreasonablemotivation,interestingedgecases
withunclearsemanticsarestillample;forexample,messagefunctionsmay
becomputedasaresultofoptimisinganembeddingforcertainproperties
(Bartunov,Fuchs,andLillicrap2022),oreventheentireoutputofcertain
continuous-layerGNNsmayberecoveredbyanalyticallysolvingforsome
constraints(Xhonneux,Qu,andTang2020).Further,therearemanypropos-
als formachinelearning models over domains exhibiting richerstructure than
graphs, such assheaves (Bodnar et al. 2022),bundles (Bamberger et al.2024)
andgeneraltopologicalspaces(Papamarkou etal.2024).Anyattemptatfor-
mally demonstratingwhethermessage passingcanbeauniversal operatorfor
deeplearningwillneedtotakesuchworksintoaccount,butwebelievesuch
efforts (in either direction) will be very much worth it!
5.3Case Study: Recommender Systems And Social Networks
Thefirstpopularised large-scale industrialapplicationsofgraphrepresentation
learning have occurred within social networks, primarily in thecontext of rec-
ommender systems. Recommenders are tasked with decidingwhichcontent to
servetousers,potentiallydependingontheirprevioushistoryofinteractions
ontheservice.ThisistypicallyrealisedbyfirstusingaGNNtoembedthe
propertiesofeachitemofcontent(andpotentiallyindividualusers),u,intoa
vector,h
u
,leveragingagraphcomprisingknownuser-content(e.g.aproduct
purchase)orcontent-content(e.g.productsco-purchased together)interactions
to perform message passing over.
Notethat,inaproductionenvironment,suchgraphsarefrequentlyhetero-
geneous: we might have multiple types of nodes (e.g. user, product...) as well
asmultipletypesofedges(e.g.purchase,impression,co-purchase...)—and
thisinformationshouldconditionthemessagepassingoperation.Thereexist
severalwaystodothisinGNNs:eithersimplypassingnode/edgetypeasa
categoricalfeature,ormaintainingseparatemessagefunctionparametersfor
each edge type, or anything in between (C. Zhang et al. 2019).
Oncecomputed,theseembeddingsareoptimisedthroughalinkprediction
objective (Figure5.10): supervisethe embeddingsofvarious nodes (piecesof
content) such thatthey are kept closetogetherif they are deemed related (e.g.
152Chapter 5
h
u
h
c
h
d
h
v
h
a
h
b
/
u∈V
(u,v)∈E
Link
Predictor
(e.g.h
a
h
b
)
y
ab
B
Figure 5.10
Illustrationofatypicalrecommendersystemsettingwithgraphneuralnetworks.In
suchsettings,nodesu∈Vmaycorrespondtousers,contentorproducts,andedges
(u,v)∈Emay correspondto purchases,impressions, co-purchases, orsimilar. A GNN
first embeds all nodesinto a shared latentspace (h
u
for node u∈V),after which a link
predictionobjectiveisoptimised;forexample,wemightoptimisetheinnerproduct
h
a
h
b
topredictwhetheruserawouldpurchaseproductb(y
ab
),givenobservedpur-
chase logs. This prediction can then be used to drive recommendations downstream.
commonlyviewedtogether).Thentheproximityoftwonodes’embeddings
(e.g. theirinnerproduct,h
u
h
v
) canbeinterpretedas theprobabilitythatthey
arelinkedbyanedgeinthecontentgraph.Then,foranycontentqueriedby
users,one approachcouldserveits nearestneighboursin theembeddingspace.
Importantly,atthescalesfacedbymanydeployedrecommendersystems,
naïvely deploying amessagepassingmodelisfrequentlyprohibitively expen-
sive.Thecostsmaycomeeitherfromavailablememory–keepingentire
graphsofmillionsorevenbillionsofentitiesinmemorymaybeimpossible
– or from required latency – the model’s forward pass may simply be tooslow
foraseamless userexperience.Akey earlyexampleof aGNN mindful ofsuch
issuesisGraphSAGE(Hamilton,Ying,andLeskovec2017),whichdoesnot
feedtheentiregraphintotheGNNatonce—instead,itprocessesabatchof
nodesatonce,anddynamicallysubsampleseachofthosenodes’neighbour-
hoods
25
toruntheGNNover(Figure 5.11).Beyondthe subsamplingapproach,
twootherpopularmechanismsforbuildingscalableGNNsincludepartition-
ing the graph across multiple training steps(Chiang et al. 2019) or simplifying
Graphs153
h
1
h
2
h
3
Figure 5.11
GraphSAGE(Hamilton,Ying,andLeskovec2017)processesabatchofthreenodes,
aggregatingtheirK-hopneighbourhoods(here,K =2).Foreachofthethreenodes,it
randomlysamples,withreplacement,acertainnumberofk-orderneighboursofeach
node(for1kK;here,threefork=1,andtwofork=2),toformthefinalsub-
sampledgraph.This graphisthenusedtosteertheGNN’s messagepassing.Onlythe
messages necessary tocompute the centralnode’s featurevectors, h
u
, are computed,in
ordertooptimisememoryusage.Notethat,whiledrawnhereasalldistinct,thesampled
nodes may be duplicated or overlapping, within as well as across the sampled patches.
themessagepassingequationbyleveragingconceptsfromdiffusion(F.Wu
et al. 2019; Rossi, Frasca, et al. 2020) or PageRank (Bojchevski et al. 2020).
Amongthepioneers ofthis methodologyis theAmericanimagesharingand
socialmediacompanyPinterest:besidespresentingoneofthefirstsuccess-
fuldeploymentsofGNNsinproduction, theirmethod, PinSage
26
,successfully
madegraph representationlearningscalableto graphsofmillionsof nodesand
billions of edges(R. Ying et al. 2018). Relatedapplications, particularly inthe
space of productrecommendations,soonfollowed. PopularGNN-backed rec-
ommendersthathadbeendeployedinproductionincludeAlibaba’sAligraph
(Zhuetal.2019),Amazon’sP-Companion(Haoetal.2020),Spotify’s2T-
HGNN (DeNadaiet al.2024),LinkedIn’s LiGNN(Borisyuketal.2024) and
Snapchat’sGiGL(Zhaoetal.2025).GNNshadalsobeendeployedindustrially
beyondthecontext ofrecommenders—forexample,theypowertheexpected
timeofarrival(ETA)predictionswithinGoogleMaps(Derrow-Pinionet
al. 2021).All of thesedeployments takentogether
27
, itis not far-fetchedto say
thatmillionsofpeoplearecomingintocontactwithgraphmachinelearning
systems on a daily level.
Within the context of content analysis on social networks, another notewor-
thyeffortisFabulaAI,whichisamongthefirstGNN-basedstartupstobe
acquired(in2019,byTwitter).Thestartup,foundedbyoneoftheauthorsof
thetextandhisteam,developednoveltechnologyfordetectingmisinformation
154Chapter 5
A
house
of
x
A
x
house
x
of
FFN
FFN
FFN
h
A
h
house
h
of
z
A
z
house
z
of
R
n
FFN(x)=W
2
σ(W
1
x+b
1
)+b
2
R
k
P
w∈{A,house,of}
softmax((Qh
of
)
Kh
w
)Vh
w
cards
of
house
Figure 5.12
An illustrationoftrainingalarge languagemodel (LLM)onthenext-tokenprediction
task for the input text "Ahouseofcards". Themodern LLM stack interchanges
twotypesoflayer:afeedforward network(FFN)appliedtoeachtokenrepresentation
inisolation,andaself-attentionlayer,allowingtokenstoexchangetheirembeddings
usinganattentionalGNN. Notethatit isoften necessaryto restrictthealloweddataflow
so thatatoken mayonlyinteract withthetokens precedingthem;this protectsagainst
leaking privilegedinformation when alltoken representationsin thetext aresupervised
at once.Accordingly, modern LLMsaregraph neural networksoperating over graphs
of tokens. See also Figure 5.13 for a zoomed-in view of the individual operations.
on socialnetworks (Montiet al.2019). Fabula’ssolution consistsof modelling
the spreadof aparticularnewsitem bythenetworkof userswho sharedit.The
usersareconnectedifoneofthemre-sharedtheinformationfromtheother,
but also if they follow each other on the social network. This graph is then fed
intoa graphneural network,which classifiestheentire graphas either‘true’ or
‘fake’content –with labelsbased onagreementbetweenfact-checkingbodies.
Besides demonstrating strong predictivepowerwhich stabilises quickly(often
withinafew hoursofthenews spreading),analysingtheembeddingsofindi-
vidualuser nodesrevealedclear clusteringof userswho tendto shareincorrect
information, exemplifying the well-known ‘echo chamber’ effect.
5.4Case Study: Large Language Models
AtthetimeofwritingthisChapter—andsurelyforasignicantamountof
timeafter—themostprevalentartificialintelligencesystemisthelargelan-
guagemodel (LLM;Figure 5.12).These systemshave outstandingcapabilities
inlanguage,demonstratingcoherentandusefulanswerstovariousprompts
ofinteresttousers;thiscapabilityisreflectedbothintheirwidespreadusage
(throughsystemssuchasGemini,ClaudeandChatGPT)andthemattracting
significant scientific interest.Naturally, it would be highlyusefulto be able to
describe and reason about LLMs within the framework proposed in this book.
Graphs155
As itturns out,we alreadycharted allofthe necessarycomponents toderive
the geometricdeep learning propertiesof LLMs. Infact, weeven hintedat this
in Section 5.2.3.2: attentional GNNs over complete graphs, i.e.
h
u
=φ
x
u
,
M
v∈V
a(x
u
,x
v
)ψ(x
v
)
!
,(5.67)
mayalreadybesufficienttodescribetheTransformerarchitecture(Vaswani
et al.2017), which formsthe keybackbone ofcontemporary languagemodels.
In this setting, the nodesofthegraphcorrespondtotokens (fundamental units
oftext—theymaybewordsbutalso,e.g.,wordfragments),andtheirinput
features, x
u
, are one-hot indicators specifying which token is at position u.
Therearetwokeymissingpieces:detailinghowdot-productself-attention
conformstothisequation,andappropriatelytakingcareofcausalmasking:
both of which we detail next.
5.4.1Dot-product self attention
Thedot-productself-attentionoperator
28
,asusedby Transformers, isdefined
as follows (for general neighbourhoodN
u
):
z
u
=
X
v∈N
u
α
uv
v
v
(5.68)
where v
v
=Vx
v
(the value vector of v) and α
uv
is the softmax-normalised dot-
product
29
between q
u
=Qx
u
and k
v
=Kx
v
:
e
uv
=q
u
k
v
α
uv
=
expe
uv
P
v∈N
u
expe
uv
(5.69)
NotethatQ,K andVaretheonlyparametriccomponentsofthisoperator.
Fromthisanalysis, wecanaligntheequationto attentionalGNNs aspresented
inEquation5.67–set
L
=
P
,letψ(x
v
)=Vx
v
andleta(x
u
,x
v
)=α
uv
,where
the softmax-normalisationis implied. Transformerstypically interchange self-
attentionlayerswithfeedforwardlayers,whicharemultilayerperceptrons
applied, in parallel,to each token’s embeddingseparately—much likein Deep
Sets
30
. Such a multilayer perceptron takes the role of φ, e.g.:
φ(x
u
,z
u
)=W
2
σ
(
W
1
(x
u
+z
u
)+b
1
)
+b
2
(5.70)
whereW
·
,b
·
arethelearnableweightandbiasparametersofeachlayer,σ
isanactivationfunction,andtheinputtokenfeaturesareresiduallyaddedto
the attendedvector z
u
. ThisMLP might alsofeature normalisation operations,
such as layer normalisation (Ba, Kiros, and Hinton 2016).
While thisderivation(Figure 5.13)already aligns Transformers tothe atten-
tional GNN flavour, we find it an instructive exercise to demonstrate how they
156Chapter 5
x
1
x
2
x
3
q
3
q
2
q
1
k
2
k
1
k
3
v
1
v
2
v
3
Q
K
V
e
11
e
12
e
13
e
21
e
22
e
23
e
31
e
32
e
33
R
R
R
0
−∞−∞
00
−∞
000
L
softmax
θ
α
12
α
11
α
13
α
21
α
22
α
23
α
31
α
32
α
33
z
2
z
1
z
3
h
1
h
2
h
3
Norm
FFN
Norm
FFN
Norm
FFN
L
L
Figure 5.13
ZoominginontheschematicfromFigure5.12,thisFigurepresentstheindividual
operations that commonlyfeature in themodern Transformer block.The dataflow also
emphasiseshowsuchmodelscanbeseenasgraphneuralnetworks:theadjacency
matrix isapplied inthe formof anattention mask(settingnon-edge logitsto –); the
valuevectorsv
u
maybeseenasmessages,andtheattentionalcoefficientsα
uv
deter-
minetheinteraction strengthswhenaggregatingthese messages.Thelatterpartofthe
block (normalisationand feedforwardnetwork)corresponds tothe GNN’s update func-
tion, φ. Note, the figure alsodepicts the frequently used rotary position embedding (Su
et al. 2024) as a rotation matrix R(u,v) applied to the dot product e
uv
=q
u
R(u,v)k
v
.
can be expressed as amessage passing GNN. Thisderivation will work for all
softmax-normalised attentionalGNNs, even ifnot usingdot-product attention.
To help us in this endeavour, let’s write Equation 5.68 as follows:
z
u
=
P
v∈N
u
(expe
uv
)v
v
P
v∈N
u
expe
uv
(5.71)
The key insight is that both thenumerator and the denominator in thisfraction
canbecomputedwithsimplesum-aggregatedlocalmessages.Thisallows us
toexpressourmessage asacombinationoftwomessages,(n,d),wherenR
k
is thenumerator, anddRthe denominator.Both nand d canbe computedas
functions of x
u
and x
v
only, and therefore conform to the required form.
Nowit onlyremains todefine ouraggregationfunction,
L
.Note thatin this
case, we need to define how our combined messages are aggregated
31
; i.e.,
(n,d)(n
,d
)=
n+n
,d+d
(5.72)
Graphs157
Once aggregated, it remains to divide the numerator by the denominator—this
can beperformed withinthe definition ofφ. Hence,Transformers are indeed
(attentional andmessage-passing) graph neural networks
32
(Joshi 2020).
Whyattentional?It isuseful topause fora momentand ponderwhy wasthe
attentional GNNflavour soimpactfulin sucharich naturallanguage domain,
asopposedtomoregenericmessagepassingarchitectures.Whilewecannot
definitively settle this question here, we do offer a few conversation-starters.
Ononehand,wordsaremeaninglesswithoutcontext:asourinputnode
featuresaresimplyone-hotindicesintoatoken vocabulary,itisunlikelythat
complexmessagesarenecessary
33
tomakesenseofthem.Thesuccessofatten-
tioninthisspacereinforcestheideathatonecaneasilyreasonaboutword
representations by leveraging weighted sums of other words’ representations.
Onanotherhand,wemustacknowledgethemassivescaleatwhichTrans-
formersaredeployedtoday,withmanylargelanguagemodelssporting
hundredsofbillionsofparameters.Thisisanothersettingthatpresentsan
advantageforattentionalGNNs:especiallywhenusinglinearordot-product
attentionwithoutgivenedgefeatures,novectors(onlyscalars)needtomate-
rialise on theedgesofthegraph, saving significant memory
34
. In contrast,the
mostgeneralmessagepassingGNNsrequireexplicitlycomputingmessage
vectors and storing them in each edge.
Finally,thespecificcomputationsofTransformers—whichconsistof,e.g.,
many carefullyplacedfullmatrixproducts—alignvery wellwiththemodern
AI hardwarestack(e.g.GPUsandTPUs)andhence,Transformers are GNNs
that are currently winning the hardware lottery (Hooker 2021)!
5.4.2Causal masking
Itisprecisely theaforementioned scalabilityoflanguagemodeltrainingwhich
yieldsthesecondkeyconstrainttoconsider.TheparametersofanLLMare
typicallypre-trainedusinganext-tokenpredictionobjective:givenaspecific
sequence oftokens, predictthe next onein thesequence. This objectiveis then
deployed over tokenised texts scraped across the entire Internet.
Given theincrediblescales involved,itis critical thatthis training proceeds
inamanner whichis asparallelisable aspossible. Becauseof this,the modelis
typically trainedoverentire chunks ofcontiguous textat once,and supervision
is applied on all of the tokens simultaneously.
Thisdecisionenablesfurtherscalabilitybutlimitsthealloweddataflow
betweentokens;namely,itnecessitatesthatinformationflowsinafeedfor-
wardmannerthroughout themodel. Shouldany informationflowfromfuture
tokens towardspast oneswhile also supervisingon alltokens atonce, it would
158Chapter 5
x
1
x
2
x
3
x
4
x
5
h
(1)
1
h
(1)
2
h
(1)
3
h
(1)
4
h
(1)
5
h
(2)
1
h
(2)
2
h
(2)
3
h
(2)
4
h
(2)
5
h
(3)
1
h
(3)
2
h
(3)
3
h
(3)
4
h
(3)
5
x
1
x
2
x
3
x
4
x
5
h
(1)
1
h
(1)
2
h
(1)
3
h
(1)
4
h
(1)
5
h
(2)
1
h
(2)
2
h
(2)
3
h
(2)
4
h
(2)
5
h
(3)
1
h
(3)
2
h
(3)
3
h
(3)
4
h
(3)
5
x
6
Figure 5.14
Two typical harmful effects on information propagation in decoder-only Transformers:
overmixing(left) andoversquashing(right). Overmixingleadstoimportant information
(depicted in red) being diffusedtoo quickly acrossthe token embeddings. Oversquash-
ingleadstooverly relyingonearliertokens,astheyhavesubstantiallymorepathsto
the final tokenembedding (depicted inblue for anearly token vs. redfor a latetoken).
Both issues occur when attentional coefficients, α
uv
, are insufficiently sharp.
allowthe model to"cheat" on thenext-tokenprediction objective(information
about the next token would leak into the token tasked with predicting it).
Enforcing thisconstraintin an LLMamounts to maskingthe allowed token
adjacencytoalwaysbealower-triangularmatrix—oftenalsoreferredtoas
acausalmask.Notethatthisisaformofnonparametricgraphrewiring:it
sets neighbourhoodsN
u
={v |vu}. In Figure5.13, this isexemplifiedby the
relevant rewiredadjacencymatrixbeingpassedastheattentionmask(setting
logits in the u-th row that are not in N
u
to –).
Taking allof these constraintstogether, we have now fully specifiedtheso-
called decoder-only Transformer
35
, which is the essence within the generative
pre-trained Transformer (GPT) (Radford et al. 2018) framework.
5.4.3Analysing LLM graph dataflow
Nowweknowthatdecoder-onlyTransformers(theprevalentLLMarchitecture
at the timeof writing) are actuallysecretly (?) graph machine learning models
underthehood.Anaturalquestionarises:canweleveragethisknowledgeto
make sense of how data propagates within LLMs – and potentiallyeven make
meaningful improvements by tinkering with their graph topology?
Wecananswerthisquestionintheaffirmative—infact, thespecificchoice
ofgraphstructureimpliedbyacausalmaskturnsouttobeharmfulfor
dataflow,fromtheperspectiveof(m)anymetricsofgraph"health"knownto
graph theory. Wewill providetwo examples ofthese issuesfor illustrativepur-
poses(seealsoFigure5.14);whileweremaincognisantoftherapidpaceof
research in this space, we hope they serve as useful inspiration.
Firstly,becauseofthefully-connectedstructureofthiscausalgraph,mes-
sages are always passedfullyacrossall allowed pairs of tokens. Thiscanlead
toachallengingphenomenonknownasovermixing(Barberoetal.2025),
Graphs159
whereinthemodellosesthefidelityofinformationaboutindividualtokens,
as they rapidly mix with other tokens’ embeddings across different layers.
Acomplementaryissuearisesduetothecausalityofthecomputational
graph,coupledwiththefactthatonlythelasttoken’sembeddingisusedto
predictthenextone.Thecausalgraphnecessarilyinducesaformoftoken
asymmetry,withearliertokenshavingsignificantlymorepathstothelast
token’sembeddingthanlaterones.Thisleadstotheover-squashingproblem
(Barberoetal.2024):asinformationisasymmetricallycompressedintothe
final token’s embedding (after L layers), h
(L)
n
, early tokens have a significantly
higher influence over it (in termsof the Jacobian, h
(L)
n
/x
v
, for input token at
positionv).Thisasymmetryissopronouncedthatitispossibletoshowthat, in
thelimitwhere thenumber ofdecoder-onlyTransformerlayers goesto infinity
(L→∞), only the first token influences the final predictions.
Whilebothoftheaboveissuespresentimportantimpedimentstothe
model’sdataflow,theymayin principlebefixedbyappropriately chosenatten-
tional coefficients, α
uv
. For example, thecoefficients maybe selected suchthat
theypreserve themost vulnerablepathsfor longer,or delaymixingby concen-
trating attentionin specificplaces.In bothof thesecases,a desirableproperty
is sharpness—beingable tochoose disproportionately highervaluesof α
uv
for
asmallersubsetoftokens.Whilesuchsharpcircuitsmayindeedbediscov-
ered withintrained LLMs(Elhageet al.2021), theyonly tendto appearin data
distributions"familiar"tothemodelaftertraining.Itcan,infact,beproved
usingstandardresultsinrealanalysis(Veli
ˇ
ckovi
́
cetal.2025)thatthecoeffi-
cientsα
uv
mustalldispersebelowany chosenpositive εforsufficientlylarge
input positionuat inference time(when going beyond themaximal input size
encountered during training).
Manyoftheseissuesarewellfamiliartoresearchersingraphrepresenta-
tion learning; overmixing is frequently referred toas oversmoothing
36
(Rusch,
Bronstein,andMishra2023),whereasoversquashingisknownbythesame
name(AlonandYahav2020).Whilethiscertainlyinvitesaplethoraofinter-
estingmethodsfromthefield,itisalsoimportanttoacknowledgetheunique
challengesthatcomefromthefeedforwardstructureofthisparticulargraph,
which are seldomstudied. Initial stepstowards bridging this gap have recently
been made (Vitvitskyi et al. 2025), but a lot more work is needed.
Suggested Further Reading
Graph neural networks, as well asgraph theory moregenerally, are veryexcit-
ing andvibrant areas ofmathematics andcomputer science thatare wellworth
studying further than what our book has space for.
160Chapter 5
Foragracefulintroductiontoboth,wewarmlyrecommendthebookfrom
Hamilton(2020).WhileitissomewhatdatedinthespecificGNNmodelsit
studies,itprovidesagentleandintuitiveoverviewofcomputingimportant
metricson agraph,aswell asmanyrelevantmodelsthat precededgraphneural
networks (such as node embedding techniques).
Then,shouldyouwishtodivedeeperintothemanyexcitingfacetsof
graphmachinelearning,thebookofL.Wuetal.(2022)isanexcellentand
comprehensive reference which still holds up very well today.
InparallelwithdivingdeeperintoGNNs,orevenbeforedoingso,we
believeitisworthwhiletogainastrongfoundationingraphtheory. Thiswill
beofsignificantvalue:bothingainingimportantintuitionsforunderstand-
ingGNNs,andforinspirationonwaystofurtherimprovethem.For this,we
recommend the foundational text from Bollobás (1998).
Exercises
1.A linear permutation-equivariant functionon a set is afunction of the form F(X)=
FXsatisfyingFPX=PFXforeverypermutationmatrixP.Showthatanylin-
earpermutation-equivariantfunctioncanbewrittenasalinearcombinationofthe
identity and average functions, i.e. FX=(I+11
)X.
2.Let Xbeafeature spacecontaining acountable setofpossible values. Thismeans
that every possible set ofvalues coming from Xcan be represented by a bitmask in
B
X
(where B={0,1} is the set of bits), and any permutation-invariant real function
over sucha set must be representable asf:B
X
R. Show that such a function can
alwaysbeexpressedasaDeepSetsmodel,i.e.,f(X)=φ
P
i∈V
ψ
(
x
i
)
,forsome
choice of ψ:X→Rand φ:RR, for every set V∈B
X
.
3.Considertherecipeforbuildingpermutation-equivariantgraphneuralnetworks
F(X,A) fromEquation 5.53.Recallhowit leverages a localfunction φ(x
u
,X
N
u
),
applied to theneighbourhood of eachnode u∈Vinparallel. Prove that, if φis per-
mutationinvariant, thatis,φ(x
u
,PX
N
u
)=φ(x
u
,X
N
u
),then Fwillbepermutation
equivariant, thatis, F(PX,PAP
)=PF(X,A),for every permutation matrixP.
(Note that, when N
u
⊂V, the space of Pdiffers across the two equations.)
4.ApopularwaytoimplementattentionalGNNscomputesattentionallogitsas
a(x
u
,x
v
)=σ
a
W[x
u
x
v
]
,wherea:R
l
,W:R
l×ばつ2k
aretheparameters,:R
n
×ばつ
R
m
R
n+m
isvector concatenation,andσ:RRisamonotonicactivation func-
tion. Another popularvariant appliestwo of these operations inreverse: a(x
u
,x
v
)=
a
σ
W[x
u
x
v
]
.Explain,usingaproof,whichofthetwovariantsis moreexpres-
sive(Hint:considertherankingbetweenattentionallogitsacrossallthenodesin
the neighbourhood).Which of thetwo formsis more scalable,and why?(Note that,
as the softmax function is monotonic, you may ignore it for this question.)
5.RecallthatsubgraphGNNsshouldbeequivarianttoG=S
n
×ばつS
m
,whereS
n
isthe
n-elementpermutationgroup,nisthenumberofnodesinthegraphandmisthe
Graphs161
numberofitssubgraphswehavesampled.Assumingthattheinputtoasubgraph
GNNisanodefeaturematrix XR
n×ばつk
andanadjacencytensor AR
n×ばつn×ばつm
,write
outtheequationsuchaG-equivariantGNNshouldsatisfy,and apossible implemen-
tation of sucha GNN thatallowssubgraph representations to meaningfullyinteract.
Provethat,foraspecificchoiceofmsubgraphs,yourmodelcanaccuratelydecide
isomorphism for a particular pair of graphs where the 1-WL test would fail.
6.Expresstheequationsofdot-productself-attention,asleveraged withinadecoder-
onlyTransformerinthemessagepassingframework(i.e.,byspecifyingfunctions
ψ, φand
L
in Equation 5.56).
7.Considerasingle-layerdecoder-onlyTransformer,withmodelequationsas
describedinSection5.4.1.Further,assumethatitsfeedforwardnetwork,φ,isκ
1
-
Lipschitz;thatis,
∂φ(x
u
,z
u
)
x
u
κ
1
and
∂φ(x
u
,z
u
)
z
u
κ
1
,andfurtherassumethat
V
2
κ
2
,whereVisthematrixtransformingtokenfeaturesintovalues(i.e.,
v
u
=Vx
u
). Prove that, if vu:
h
u
x
v
κ
1
(
δ
uv
+κ
2
α
uv
)
(5.73)
whereα
uv
istheattentionalcoefficientbetweenvandu,andδ
uv
is1ifu=vand0
otherwise.Then,using thisresult, showthat,in aL-layer decoder-onlyTransformer:
h
(L)
n
x
v
C
X
k
1
v
X
k
2
k
1
···
X
k
L–1
k
L–2
̄α
(L)
n,k
L–1
L–1
Y
l=2
̄α
(l)
k
l
,k
l–1
!
̄α
(1)
k
1
,v
(5.74)
whereC>0isaconstant,k
1
,...,k
L–1
areintermediatenodesonaparticularpath
oflengthL+1fromuton,and ̄α
(l)
uv
aresimplefunctionsofα
(l)
uv
—theattentional
coefficientsbetween vand uin thel-th layer. Thisresult impliesthat adeep decoder-
onlyTransformeristopologicallybiasedtopaymoreattentiontowardsearly tokens,
as they will have substantially more paths to the final token (n).
Notes
1
Forexample,photostakenintherealworldaretwo-dimensionalprojec-
tionsof(atleast!)three-dimensionalscenes,oftenfeaturingobjectsthatare
inherently not grid-structured themselves.
2
Dependingontheapplicationfield,nodesmayalsobecalledvertices,
andedgesareoftenreferredtoaslinksorrelations.Wewillusetheseterms
interchangeably.
3
Thereareexactlyn!suchpermutations,soS
n
is,evenformodestn,avery
large group.
4
WeusetheboldnotationforourfunctionF(X)toemphasiseitoutputs
node-wise vector features and is hence a matrix-valued function.
5
Whenthegraphisundirected,i.e.(u,v)∈Eiff(v,u)∈E,theadjacency
matrix is symmetric, A=A
.
162Chapter 5
6
PAP
is the representation of S
n
acting on matrices.
7
Asawaytoemphasisethefactthatourfunctionsoperatingovergraphs
now needtotakeintoaccounttheadjacencyinformation, weusethenotation
f(X,A).
8
ThiscorrespondstotheBellnumberB
4
,whichcountsthenumberofways
to partitiona setof 4 elements,in this case givenby the4-indices (u,v),(u
,v
)
indexing a linear map acting on the adjacency matrix.
9
Often, the node u itself is included in its own neighbourhood.
10
Amultiset,denoted{{...}},isasetwherethesameelementcanappear
morethanonce.Thisisthecaseherebecausethefeaturesofdifferentnodes
can be equal.
11
Moreprecisely,wecannotdefineanon-trivialcoarseningassumingset
structurealone.Thereexistestablishedapproachesthatinfertopological
structure from unordered sets, and those can admit non-trivial coarsening.
12
Thoughmanyworksmakeadistinctionbetween"spatial"and"spectral"
GNNs, we find that this distinction is rather artificial andmust be taken with a
grain of salt. Many proposed spectralfilters are expressible in terms ofsimple
matrixoperations(e.g.polynomialsofthegraphLaplacian)andhenceboil
downtosimplespatialoperationsthatfallunderthe"convolutional"flavour
presented here.
13
Mostcommonly,ψandφarelearnableaffinetransformationswith
activationfunctions;e.g.ψ(x)=Wx+b;φ(x,z)=σ
(
Wx+Uz+b
)
,where
W,U,b arelearnableparametersandσisanactivationfunctionsuchasthe
rectifiedlinearunit.Theadditionalinputofx
u
toφrepresentsanoptional
skip-connection, which is often very useful.
14
Itis worthyto notethatthis flavourdoesnot expresseveryGNNlayer thatis
convolutional(in the sense of commutingwith the graph structure), but covers
most such approaches proposed in practice.
15
This is a direct consequence of the permutation invariance of
L
.
16
Itisalsoappropriatetoapplythemessage-passingflavour.Whilepopu-
lar forphysicssimulationsand relational reasoning(e.g. Battagliaetal.2016;
Santoro etal. 2017),theyhave not beenas widelyused asTransformers. This
islikelyduetothememoryissuesassociatedwithcomputingvectormes-
sagesoveracompletegraph,orthefactthatvector-basedmessagesareless
interpretable than the "soft adjacency" provided by self-attention.
17
Note that,due tothe assumption ofGNNs’ permutation invariance in h
G
, it
is certain that the embeddings of isomorphic graphs will be equal.
18
László Babai (2016) credits the invention of the k-WL tests to himself with
Rudolf Mathonand independently, Neil Immermanand EricLander (the latter
wastrainedasamathematicianbutismorebroadlyknownforhisworksin
Graphs163
genomicsandbiologyandalsoastheScienceAdvisorinPresidentBiden’s
administration).
19
Thek-GNNarchitectureintroducedbyMorrisetal.(2019)incursO(n
k
)
memorycomplexity.Inafollow-upwork,Morris,Rattan,andMutzel(2020)
alsodevisedalocalversionofk-GNNsbasedonaggregationinlocalneigh-
bourhoods,takingthesparsityoftheunderlyinggraphintoaccountandshowed
thatthislocalversionhas atleast thesame poweras theordinary k-WL.Invari-
antGraphNetworks(IGN)introducedbyMaronetal.(2018)arebasedon
k-ordertensorsandarek-WL-equivalent.IGNsarederivedfromadifferent
variantof k-WL and are more advantageous in terms oftheir complexity com-
paredtok-GNNs. Inparticular, the3-WL-equivalentIGNhas"only"quadratic
complexity.
20
ESANitselfisaninstanceoftheDeepSetsofSymmetricObjects(DSS)
architecture proposed by Maron et al. (2020).
21
Corsoetal.(2020)provethatnosingleaggregatorcaninjectivelyembed
real-valued neighbourhoodsbyleveraging theBorsuk-Ulamtheorem (Borsuk
1933).Itcanbeusedtoconclude,forexample,thattheremustalwaysexist
twopointson oppositesides ofthe Earthwith anidentical continuousproperty
(such as temperature, atmospheric pressure, etc.).
22
This connection wasbriefly exploredand highlighted by two of the authors
ofthesepapers(FuchsandVeli
ˇ
ckovi
́
c2023)—werecommenditasagood,
lightweight introductionto characterising expressivityof functionson sets and
graphs.
23
Specifically,PetarVeli
ˇ
ckovi
́
cisafirmbelieverthatthemessagepassing
primitive can be usedto express everything of interest indeeplearning—even
ifsometimesitisnotthemostnaturalperspectiveoflookingattherelevant
structures. He often callsupon the idea thateven fundamental algorithms such
as matrix multiplication canbeobserved as a formofmessagepassing,where
dataalongonedimensionofthefirstmatrixareexchangingmessageswith
dataalongacorrespondingdimensionofthesecond—followedbyanappro-
priatesum-aggregationoperation.MichaelBronsteinhasoftenrespondedto
this bysayingthat,whendrillingdown tothescaleofmatrixmultiplies,such
argument is hardly practically useful.
24
Thisimplicitlyassumesthatanydiscretestructurecanbereasonedabout
using some kind of graph.
25
Such anapproachmayhaveunexpected benefitstothemodel’sexpressive
power—as we discussed within the context of subgraph GNNs.
26
It’s worthy to note thatGraphSAGE (Hamilton, Ying, andLeskovec 2017)
was explicitlydevelopedinordertoprototypeaGNNsystemcapableofrun-
ninginproductionatPinterest—whichlaterbecamePinSage.Pinteresthad
164Chapter 5
also presented follow-up work, PinnerSage (Pal et al. 2020), which effectively
integrates user-specific contextual information into the recommender.
27
Notethatthelistaboveonlyshowspublishedacademicpapersdetailing
how variouscompanieshavedeployedGNNsinproduction.Itexcludes less-
detailedblogpostswhichdescribeGNN deploymentwithinproductslikeUber
EatsandAirBnb,aswellasGNNplatformslikeTF-GNN(Ferludinetal.2022)
that directly state"Manyproduction models at Google use TF-GNN" without
providing specifics.
28
The discussion withinthissection focusses squarely onanalysinga single-
headdotproductattentionoperator;however,itissimpletoadaptitfor
multi-headattention,avariantmorecommonlyusedinmodernTransform-
ers(wheretheattentionoperatorisreplicatedacrosskindependentheads).
Todoso, simplyfactoriseallthe relevantfunctionsand parameterstooperate
across multipleheads, similarto how we’ve analysedprincipal neighbourhood
aggregation.
29
TheTransformerpaperreferstothisas"scaleddot-productattention",
astheattentionallogitsareadditionallyscaledbyafactorof1/
d
k
;thatis,
e
uv
=
1
d
k
q
u
k
v
,whered
k
isthedimensionalityofthekeyvectorsk
v
.Thisis
introduced to allow richer gradients at larger scales of dimensionality.
30
Interestingly,bothDeepSets(Zaheeretal.2017)andTransformers
(Vaswanietal.2017)werepresentedatthesameconference:N(eur)IPS’17
in Long Beach, California.
31
Itisalsopossibletodefinethisoperationinawaythatdoesnotdelay
therenormalisationforlater;forexamplebymaintaininganonlineaverage:
(n,d)(n
,d
)=
dn+d
n
d+d
,d+d
.
32
Interestingly, beyond thepermutation equivariancewhich isimplied bythis
derivation, thedot-productattentionisalsorotation invariant:ifwerotateall
key andqueryvectorsbythesamerotationmatrix,we willobtainexactly the
same attentionlogits (sincerotations donot affect distances).This observation
will be leveraged in future Chapters.
33
This may no longer be the case when nodes become entire concepts.
34
AverygoodexampleofthisprincipleistheGATv2architecture(Brody,
Alon, andYahav2022). GATv2s areattentionalGNNs thatleveragedeep mul-
tilayerperceptronsastheirattentionfunction,a,allowingthemtoprovably
expresscertainpatternsofattentionthatareoutofreachofdotproductself-
attention.However,thecomputationoftheintermediatelayersofdeepMLPs
requiresstoring latentembeddings inthe graph’sedges, makingthe modelless
practical for use over large, fully-connected graphs of language tokens.
Graphs165
35
Thename"decoder-only"referstotherootsoftheTransformerarchitec-
ture(Vaswanietal.2017).Itwasinitiallyusedformachinetranslation—a
supervisedlearningtask—ratherthannext-tokenprediction—aself-supervised
learning task.Whenused for machinetranslation, it was notnecessary for the
entirearchitecturetohaveacausalmask:therecouldbeadedicatedcomponent
(theencoder)whichsetsN
u
=V,alongwitha causallymaskedcomponent(the
decoder), setting N
u
={v |vu}, generatingthe translatedsentence word-for-
word. Whenonlythe decoderpart isused, werecover theGPT setting.When
onlythe encoderpartisused,we recovertheBERTsetting(Devlinetal.2019).
36
In fact,thefinding thatdecoder-only Transformers overmix information—
in spiteof theirattentional coefficients—is similarto thediscoverythat graph
attentionnetworksoversmooth(X.Wuetal.2023)inspiteofthesame
mechanism.

AltStyle によって変換されたページ (->オリジナル) /