Foundations of Equivariant Deep Learning

•

Wedefinesymmetriesasstructure-preservingtransformationsofan

object, and study examples.

•

Symmetriesofalearningproblem aretransformationsoftheinputspace

that leave the output invariant.

•

Toexploitsymmetriesindeeplearning,weuseequivariantandinvariant

layers.

•

Symmetriesformgroups.Inthischapterwestudybasicnotionsfrom

group theory.

•

Groupsactondataviagroupactionsorrepresentations.Inthischapter

we study basic notions from representation theory.

•

Westudyequivariantmappingsandequivariantnetworksin general,and

provide an overview of common equivariant neural network layers.

Inthischapterwebeginourstudyofthefoundationsofgeometricdeep

learning.Thecentralconceptissymmetry:atransformationthatleavesall

relevantstructureofanobjectunchanged.Wewilllookatthewayinwhich

symmetries arisein machinelearning, andstudy thecentral algebraic structure

usedinthemathematicalstudyofsymmetries:thegroup.Wedefinegeneral

andlineargroupactions/representations,whichtellushowtheelementsof

anabstractgroupcanberepresentedbyconcrete(linear)maps.Usinggroup

representations, wecan specifyhow agroup actson various kinds ofgeomet-

ricdata,fromimagesto pointclouds totensor fields.Nextwe discussinvariant

andequivariantmaps,whicharethebuildingblocksofequivariantnetworks

and the primary toolswe use to leverage knowledge ofsymmetries to improve

thestatisticalefficiencyofdeeplearningmethods.Weendthechapterwith

theequivariancecookbook,whichprovidesanumberofcommonkindsof

equivariant maps from which equivariant networks can be constructed.

Ourfocusinthischapterisonthealgebraic(group-theoretic)aspectsof

GDL, whereas the next chapter will coverthe geometric ones.Concretely, this

meansthatinthepresentchapterwe considerdatathatlives inaplainvector

62Chapter 3

spacewithagroupofsymmetriesactingonit,butwithouttoomuchfurther

structure, while inthe nextchapter this vector spaceis taken to be thespace of

signals(functions)onaspace(e.g.graph,manifold,etc.).Thus,fornow,we

cantalkaboutsymmetries,groups,representationsandequivariance,butwe

willnot considerconceptssuch aslocalityandconvolution, whichdependon

the structure of the underlying space.

3.1Symmetry Transformations

Letusbeginbygivingageneraldefinitionofsymmetry,beforeweconsider

the role it plays in geometric deep learning. Just as how the best way to define

"vector"is"elementofavectorspace"(definingthethingbyitsrelationto

otherthingsofthesamekind

),onecoulddefine"symmetry"as"elementof

a symmetry group". Butalthough such a definition wouldbe general, compact

andprecise,itwouldnotbeveryintuitive.Sowebeginwithamoreinfor-

maldefinition,andthenshowhowthenotionofgroup(tobedefined)neatly

captures the idea.

Themodernconceptofsymmetryappliestoawiderangeofabstract

mathematicalstructures,butletusfirstconsiderthefamiliarexampleofan

equilateraltriangle(fig.3.1,left).Wesaythatrotationby120

◦

isasym-

metryofthetriangle,becauseafterapplyingthistransformationthetriangle

looksexactlyasitdidbefore.Thesameistrueforreflectionaboutoneof

thethree bisectors,aswellas arbitrarysequencesofrotationsand reflections.

Furthermore, the inverse of any of these symmetries is itself also a symmetry.

Figure 3.1

Euclidean symmetries (left) and general isometries (right) of triangle.

Whendrawnonpaper,theoriginalandtransformedtrianglelookexactly

thesame.However,forotherwaysofencodingorrepresentingthetriangle,

wemayfindthatthereareactuallysomesuperficialchanges.Forinstance,if

weencodethetriangleasanorderedlistofthreecornerpointsin(x,y)coor-

dinates,thenarotationby2π/3ofeachcoordinatevectorleadstoanoverall

permutation (cyclic shift) ofthe three corners.Nevertheless, if weforget about

Foundations of Equivariant Deep Learning63

theordering,westillhavethesamesetofcorners.Sincewedonotconsider

the orderof thecorners to bepart ofthe essentialstructure of thetriangle, the

rotated triangle(list of cornercoordinates) should beconsidered equivalentto

theonewestartedwith. Thisphenomenoniskeyto geometricdeeplearning,

becauseanysensiblecomputationonsuchdata(includingforinstancethose

performedbyaneuralnetwork)shouldtreatequivalentinputsequivalently,

even if they have superficial differences.

In thisexample andalsoin general,a symmetrytransformationis aspecial

kindofstructure-preservingmap. Whatismeantby structureishighlycontext-

dependent,butinorder todiscusssymmetriesonehastochoose somenotion

ofstructureofinterest.IfwearedoingEuclideangeometry,thestructureof

interestisEuclideandistancestructure,andsoherestructure-preservingmap

meansdistance-preservingmap,akaisometryorEuclideantransformation.

More precisely, g:R

→R

is an isometry wrt distance metric d, if

d(gu,gv)=d(u,v),

Forallu,v∈R

.Theisometriesoftheplaneconsistofrotations,reflections,

translations, and all possible compositions of these.

Applying an arbitrary isometry to ourtriangle, we obtain a new but congru-

ent(orisomorphic

)triangle(Fig.3.1,right).Whenthetransformedtriangle

isnot justcongruent butactually thesame astheone westartedwith, wecall

thetransformationasymmetryofthetriangle.Inotherwords,asymmetry

transformation is a way in which the triangle is congruent to itself.

With thisin mindwe couldsay ingeneralthat asymmetry isa wayin which

an object is equivalent (or isomorphic) to itself, or more prosaically:

A symmetry transformation (or automorphism) of an object is an

invertible and structure-preserving map from the object to itself.

Oftentimes,weareinterestedinthesymmetriesofaspace.Considerfor

exampletheplaneendowedwithaEuclideandistancemetric.Thisspacehasas

symmetriesall isometriesoftheplane, i.e.allrotations,reflections andtrans-

lations.Theyaresymmetriesbecausetheyaremapsfromtheplanetoitself,

theyareinvertible,and theypreservedistances.In thiscase ithappensto bethe

case that all structure-preserving maps are invertible, but in general one has to

add this as an additional requirement.

The same set may be endowed with different structures, leading to different

symmetries. For instance we could see the plane as:

1.A set, so that symmetries are bijections (invertible maps; permutations),

64Chapter 3

2.A vector space, so that symmetries are invertible linear transformations,

3.An affine space

, so thatsymmetries are invertible affine maps (linearmaps

plus translation),

4.Atopologicalspace,sothatsymmetriesarehomeomorphisms(continuous

maps with continuous inverse).

5.A smoothmanifold, so thatsymmetries are diffeomorphisms (smooth maps

with smooth inverse).

Wecouldcontinuethelist,regardingtheplaneeachtimeasaninnerproduct

space,Euclideanmetric space,andsoon.Thosenotfamiliarwiththese notions

shouldnotworry:theonlypointwewishtomakeisthatwhatcountsasa

symmetry depends on exactly what we consider to be the structure of interest.

3.2Symmetries of Learning Problems

Howdosymmetriesappearinmachinelearning,andwhyaretheyrelevant?

Inthissectionwelookattwokindsofsymmetriesthatariseinsupervised

learning: symmetriesof the target function (manifestedas invarianceor equiv-

ariance)andsymmetriesoftheparameterization.Theformerwillbeused

throughoutthisbooktodevelopstatisticallyefficientnetworksthatrespect

suchsymmetries,whereasthelatterhasrelevanceforoptimizationbutwill

not be the focus of this book.

3.2.1Invariance

In supervised learning (section 2.1) the goal is touse data samples to approxi-

matethetargetfunctionf:X→YmappingfrominputspaceXtooutputspace

Y. Forconcreteness we will consider theproblem of classification,where Yis

thesetofclasslabels,but wecouldjustaswellconsideracontinuousoutput

space in case of regression.

In the previous section, we defined a symmetry of an objectas an invertible

structure-preservingtransformationoftheobject.Forlearningproblemssuch

asclassification,theobjectofinterestistheinputspaceX,andweconsider

thelabelfunctionftodefinethestructureofinterest("classstructure").A

symmetryofthelearningproblemisthusaninvertibleandlabel-preserving

mapg:X→X.Tosaythatgpreservesthelabel,orleavesitinvariant,isto

saythat forall x∈X, wehave (f◦g)(x)=f(x). Thatis,applyingg andthen fto

anyinput x∈Xgivesthesameresult asjust applyingfdirectly. Indiagramatic

Foundations of Equivariant Deep Learning65

Figure 3.2

Rotation is a symmetry of the label function in image classification.

form, we can express this as:

(3.7)

Acommutativediagramlikethis expresses thefactthat ifwe composemaps

(arrows)alonganydirectedpathwiththesamebeginandendpoints,theresult-

ing composite maps arethe same. Hence this diagram expresses the fact that g

followed by fequals f.

As an example of a label symmetry, consider image classification, where X

is the space of planar images and Yis a set of labels. Typically, the class label

doesnot dependonthepositionand orientationoftheobject,so atranslation

or rotation ofthe image, will bea symmetry (see fig.3.2). Similarly,if we are

classifyinga pointcloud,then Xisa spaceofpoint sequences.Typically,the

label doesnot dependon the orderof thepoints, soany permutationof points

will bea symmetry. Additionally,it maybe thatrotating and translatingall of

the points simultaneously does not change the label.

The examples of symmetries mentioned above are quite natural, but in gen-

eralthere willbemanymore. Forinstance, ifwehavetwoinputs x,x

′

∈Xwith

thesamelabel(i.e.f(x)=f(x

′

)),wecandefineatransformationgthatswaps

them, whileleaving all otherpoints unchanged.This will bea symmetryof f,

but it is unlikely we would know if we didn’t already know f!

If somehow we came to know all symmetries ofa label function f, learning

becomesalmostentirelytrivial.Thereasonisthatifweobserveadatapoint

xwith label,weimmediatelylearn thelabelof allpointsx

′

inthe sameclass,

becausethereexistsasymmetrythatswapsxandx

′

.Sointheorywewould

needonlyonelabelperclass.Thisexamplenicelydemonstrateshowknowl-

edgeofsymmetriescan inprinciplebeusedtoimprovedataefficiency,butit

also suggests thatif we knew all symmetries ofthe label functionwe wouldn’t

really need learning. In other words, for problems where we do need learning,

66Chapter 3

Figure 3.3

Oneinvariantandthreeequivariantcomputervisionproblems:imageclassification,

oriented keypoint detection, image segmentation, optical flow estimation.

weshould notexpectto beable toknowallofthesymmetries apriori. Inchap-

ter4,wewillconsideranaturalclassofsymmetriesarisingfromthedomain

on whichthedata lives,andwhich weoften know abouta priori.For now we

willsimplyassumeknowledgeofsomesymmetrieswithoutworryingabout

the specific kind or how we have come to know them.

3.2.2Equivariance

Forinvariant learningproblems, theoutput remainsunchanged when weapply

asymmetrytransformationtotheinput.Thereisanotherclassofproblems

where insteadthe target functionis equivariant, meaningthat theoutput trans-

formsalong withthe input(seefig. 3.3forexamples). Inimage segmentation

forinstance,theinputisanRGBimageandtheoutputisasegmentation

mask,whichassignsalabeltoeachpixel.Whenweshifttheinput,theout-

put should also shift. Another example is image registration, where the goal is

to output a transformation that puts the imagein a canonical pose (e.g. putting

abrainscaninuprightpositionandperhapsapplyingadeformationsothat

brain regions can be compared voxel-wise). Similarly, learned or hand-crafted

keypoint detectors(e.g.SIFT;Lowe1999; LencandVedaldi2016) shouldbe

equivariant:whentheimageistransformed,thekeypointposition,scale,and

orientation transforms along with it.

Astheseexamplesshow,theoutputwillingeneralbeadifferentkindof

geometric quantitythan the input,meaning that itlivesin a different space and

transforms ina different wayunder thesame symmetry. For instance,the same

rotationcanact onaninputsignal(e.g.brainscanwith n

voxels)andonthe

outputposevariable(e.g.a3×ばつ3rotationmatrix).Whatitmeansto"rotate"

avoxelgrid or3×ばつ3 matrixis different. Forthis reasonwewill formalizethe

concepts ofgroup(e.g. rotations)and group representations /actions (how to

Foundations of Equivariant Deep Learning67

applyarotationtovariouskindsofquantities)laterinthischapter.Fornow,

wewillsimplywritegforasymmetry,andwriteρ

(g):X→Xandρ

(g):

Y→Yfor the way it acts on input and output space.

Equivariance of the target function f:X→Ycan then be stated as:

f◦ρ

(g)=ρ

(g)◦f.(3.8)

Thisisanequalityofmappings,whichmeansthatthemappingsontheleft

and right-handside shouldagree on allinputs x∈X. Hence,this equationsays

that ifwe applya symmetrytransformationg toan inputx usingρ

and then

apply f, we get the same outputas applying ffirst andthen applying the same

symmetry g to the output using ρ

We can also express equivariance indiagramatic form (appendix E), so that

we can easily see the domains and codomains of the maps involved:

(g)

∗

(g)

(3.9)

To saythat this diagram commutesis to saythat starting inthe bottom leftX,

going up and then right is the same as right and then up.

Wecanobtainyetanotherwayofexpressingequivariancebymultiplying

eq. (3.8) on the left by ρ

(g)

–1

and cancelling:

(g)

–1

◦f◦ρ

(g)=f.

Thisshows mostclearlythatalsointhecaseofequivariancewecanthinkof

gasasymmetry,becauseitleavesfinvariantwhenactingontheinputand

outputspacesimultaneously,similartotheinvarianceequationf◦ρ

(g)=f.

Indeed onecan seethat invariance (eq.(3.7)) isa specialcase ofequivariance

(eq. (3.9)) where ρ

(g)=id

is the identity map for all g∈G.

3.2.2.1EquivarianceandDataEfficiencyAswithinvariance,equivari-

anceputsaconstraintonthehypothesisspaceF,therebyreducingthe

statisticalcomplexityoflearning(chapter2).Themoresymmetrieswehave,

themoreconstraintsweget,andthesmallertheremaininghypothesisspace.

Althoughwehavesofardiscussedequivarianceofthenetworkasawhole,

inpracticeonewouldbuildanequivariantnetworkbycomposinglearnable

equivariantlayers,eachofwhichissubjecttoaconstraintthatreducesthe

number of parameters in that layer.

To getter abetter understanding of therole of equivariance in pattern recog-

nition,considerfig.3.4.Inthisexample,aneuralnetworklayerlearnsa

68Chapter 3

style-agnosticrepresentationofthepattern‘A’(twoinputsx,x

′

∈X,repre-

senting two styles of ‘A,’ are mapped into the same point, y=f(x)=f(x

′

)∈Y).

However,whentheinputpatternsaretransformedbyg(inourexample,a

rotationρ

(g)xandρ

(g)x

′

),ageneralneuralnetworkdoesnotnecessarily

generalizeina symmetry-consistentmanner, i.e.,itmightbe thatf(ρ

(g)x)=

f(ρ

(g)x

′

Byimposing equivariance,weguarantee thatthemodel generalizesina way

that is consistent with the symmetry: from the equivariance condition,

f(ρ

(g)x)=ρ

(g)f(x)

f(ρ

(g)x

′

)=ρ

(g)f(x

′

);

togetherwiththeassumptionf(x)=f(x

′

),itfollowsthatf(ρ

(g)x)=

f(ρ

(g)x

′

Figure 3.4

Example of the role of equivariance in pattern recognition.

Inadditiontotheseheuristicarguments,thereisagrowingbodyofwork

in learning theory that formally characterize the benefits of equivariance (Lyle

etal.2019;Sannai,Imaizumi,andKawano2021;ElesedyandZaidi2021;

Behboodi, Cesa, and Cohen 2022).

3.2.3Symmetries of the Parameterization

There isanother wayin whichsymmetries appear inmachine learning.When

we approximate the true labelfunction f:X→Yby a parameterizedfunction

f:X×ばつΘ→Y,wemaybeintroducingsymmetriesoftheparameterization.

Foundations of Equivariant Deep Learning69

Whereasinput-spacesymmetriesactonX,aparameter-spacesymmetryacts

onΘ.Amappingh:Θ→Θisasymmetryoftheparameterizationifforall

u∈Xand θ∈Θ:

f(u,hθ)=f(u,θ).(3.10)

Inotherwords,themappingsf(–,θ)andf(–,hθ)arethesame;theparameter

vectors θand hθcorrespond to the same classifier.

Inneuralnetworks, onecanusuallypermutehidden units(alongwiththeir

incoming and outgoingweights), without changingthe input-output mapping.

Furthermore, forcertain kindsofnon-linearities like ReLU(x)=max(x,0),we

canscaletheinputandinversescaletheoutputofanonlinearitywithout

changing the function: α

–1

ReLU(αx,0)=ReLU(x).

Knowledgeofparameter-spacesymmetriescanbeusefulinoptimization,

Bayesianinference,andmodelaveraging(Kuninetal.2021;Ainsworth,

Hayase,andSrinivasa2022;Weietal.2022;Kurleetal.2022;Zhaoet

al.2022;Zhao etal. 2023;Wei andLau 2023).In someapplications itisuseful

foroneneuralnetworktoprocesstheweightsofanother,inwhichcasethe

formershould respecttheweight-spacesymmetries(Navon etal.2023;Zhou

etal.2023).Intherestofthisbookwewillprimarilyfocusoninputspace

symmetries and how to use them to improve statistical efficiency.

3.3Symmetry Groups

Havingdefinedsymmetriesandseenhowtheyemergein machinelearning,we

willnowbeginourstudyofthemathematicalmachineryusedtothinkabout

symmetriesinasystematicandgeneralway.Webegininthissectionwith

theconceptofasymmetrygroup,beforestudyinggrouprepresentationsand

equivariant maps.

Thesetofallsymmetriesofanobjecthasanumberofsalientproperties,

which have beendistilled intoa few abstractgroupaxioms. Thefirstproperty

tonoteisthattheidentitytransformation("donothing")isalwaysasymme-

try,inatrivialway.Secondly,symmetriesmaybecombinedtoobtainnew

symmetries:ifgandharetwosymmetries,thentheircompositiong◦h(first

applyinghandtheng)isalsoasymmetry,andthiscompositionoperationis

associative.Finally,bydefinitionsymmetriesarealwaysinvertible,andthe

inverseisalsoasymmetry.Weencouragethereadertoverifyforthemselves

thatthesepropertiesindeedholdforsymmetriesaswehavedefinedthemin

section 3.1.

Itispossibletodefineasymmetrygroupasasetofsymmetrytransforma-

tionsg:X→X,satisfyingtheseaxioms(identity,closureundercomposition

andinverses).Howeveritturnsout,asitoftendoes,thatinordertostudya

70Chapter 3

groupitisnotsoimportantwhatsymmetriesare(i.e.whattheydotoele-

mentsofX),butratherhowtheycomposewithothersymmetries.Sothe

mostcommondefinitionofgroupdoesnotsaywhatthegroupelementsare

(symmetries), but only encodes the compositional structure:

AgroupisasetGalongwithabinaryoperation◦:G×ばつG→G

calledcompositionorthegroupoperation(forbrevity,denotedby

juxtaposition g◦h=gh) satisfying the following axioms:

Associativity: (gh)k=g(hk) for all g,h,k∈G.

Identity: there exists a unique e∈Gsatisfying eg=ge=g for all g∈G.

Inverse:Foreachg∈Gthereisauniqueinverseg

–1

∈Gsuchthat

–1

g=e.

Notethataswithfunctioncompositionormatrixmultiplicationbutunlike

multiplicationofnumbers,thegroupoperationneednotbecommutative,i.e.

we mayhavegh=hg.Groups forwhich gh=hg forall g,h∈Garecalled com-

mutativeorAbelian.

Groupscanbefiniteorinfinite,discreteorcontinuous.

So-calledLiegroups,likethegroupofrotationsinR

,aresmoothmani-

folds,meaningwecandocalculusonthem(moreonthisinlaterchapters).

Section 3.3.3showsa number ofexamples of groups.It is worth checkingfor

each one that it indeed satisfies the axioms of a group.

Asmentionedbefore,wehavedefinedagroupasanabstractobject,meaning

wehavenotstatedwhatthegroupelementsare(e.g.symmetrytransfor-

mationsofsomeobject,aswedefinedtheminsection3.1),onlyhowthey

compose.Bycontrast,aconcretegroup(sometimescalledsymmetrygroup)

wouldhaveaselementssymmetrytransformations(section3.1)g:X→X,

andthegroupoperationisgivenbyfunctioncomposition.In thespecialcaseof

matrix groups, the group elements are matrices

A∈G⊂R

n×ばつn

, and the group

operationisgivenbymatrixmultiplication.Startingfromaconcretegroup

oftransformations,weobtainanabstractgroupbysimplyforgettingthatthe

elementsarefunctionsormatrices,whilerememberingthecompositionrule

◦:G×ばつG→G.Asitturnsout,knowinghowgroupelementscomposeisall

you need to study every aspect of the group itself.

In practice, almost all of the groups weconsider in this book can be defined

as matrix groups, so let us state the definition explicitly:

Foundations of Equivariant Deep Learning71

An n-dimensionalreal matrix group is a setof matricesG⊂R

n×ばつn

sat-

isfying the following axioms:

Closure: For all A,B∈G, the matrix product ABis also in G.

Identity: the identity matrix Iis in G.

Inverse: AllA∈Gareinvertible (fullrank)and thematrix inverse A

–1

is also in G.

In amatrix group,the compositionoperation is givenby matrix-matrix mul-

tiplication,whichisautomaticallyassociative,andsowedonotneedtoadd

thisaxiom.Inanabstractgroupwhereweexplicitlydefinethecomposition

rule asa map◦:G×ばつG→G, thecomposition of two elements inGis always in

G, but since the matrix product of two elements of G⊂R

n×ばつn

need not be in G,

we need to add this closure axiom explicitly for matrix groups.

Whereasaconcretegroupismadeupoffunctionsg:X→Xthatcanbe

appliedtoelementsx∈X,thedataspecifyinganabstract groupdoesnottellus

howto act on geometric data. This abilitycan be reintroduced by defining one

ormoregrouprepresentations(orgroupactions),asdiscussedinsection3.4.

Grouprepresentationsareusefulevenwhenwearedealingwithamatrix

group,becauseoftentimeswewantthesamegrouptoactondifferentkinds

ofgeometricquantities,suchastheinputs,outputsandinternalfeaturesofa

neural network.

3.3.1Subgroups

Wenotedin section3.1thatwhatcanbe calledasymmetrydependsonwhat

structurewewanttopreserve.Whenweaddstructure,therewillbefewer

symmetries; our group is reduced to a subgroup.

LetGbeagroup.AsubgroupH⊂GisasubsetofGthatisitselfa

group. Concretely, thismeans thatHisclosed undercomposition and

inverses:when wecompose twoelementsin Htheresultis againin H,

and the inverse of any element in His also in H.

For any group, the subsetH={e} is asubgroup, and similarlyany group is

a(non-proper)subgroupof itself.Foraslightlyless trivialexample, consider

the groupof planartranslations (R

,+), andthe subgroup givenby translations

along the x-axis preserving the y-axis.

72Chapter 3

AclassicalexamplecomesfromtheErlangenprogram:Kleinnotedthat

differentgeometries are associated with different groups,which in some cases

formsubgroupsthatpreserveprogressivelymoregeometricstructure.For

instance,thegroupofEuclideantransformationsisasubgroupoftheaffine

group,whichisasubgroupoftheprojectivegroup.Projective,affineand

Euclidean geometry are concernedwith the study of invariantsof their respec-

tivegroups,suchasdistancesforEuclideangeometry,ratiosofsurfaceareas

for affine geometry, and cross-ratios for Projective geometry.

3.3.2*Mappings between Groups: Homomorphisms and Isomorphisms

Whenever we study a new kind mathematicalgadget, it is a good idea to think

abouttheappropriatenotionofstructure-preservingmapping betweengadgets.

In thecaseof groups,these arecalled grouphomomorphisms.Theyaremaps

between groups that preserve the compositional structure of a group:

LetGandHbegroups. Agroup homomorphismismapping φ:G→H

such that for all g,h∈G:

φ(gh)=φ(g)φ(h).

When interpreting the equationabove, note thaton the left handside we are

composing g and h using thecomposition operation in G, whereas on the right

handsidewearecomposingφ(g)andφ(h)inH.Onecanshowthatagroup

homomorphism maps the identity of Gto the identity of H.

Ifwehavetwogrouphomomorphismsφ

:G→Handφ

:H→K,then

their composition φ

◦φ

:G→Kis also a homomorphism. Moreover, for any

groupGtheidentitymapid

:G→Gisahomomorphism,whichshowsthat

groups and group homomorphisms form a category (appendix E).

Onecanshowthattheimageofagrouphomomorphism,imφ={h∈

H|∃g∈G:φ(g)=h},isasubgroupofH.Similarly,thekernelkerφ={g∈

G|φ(g)=e

}isasubgroupofG.Theinterestedreadermaystudythe

firstisomorphismtheorem,whichrelatestheimageandkernelofagroup

homomorphism.

Group homomorphisms always preserve positive compositional relations of

theformgh=k,buttheyneednotpreservenegativerelationsoftheformgh=k.

Asanextreme example,forany groupsGandH,themappingφ(g)=e

that

sends every element of Gto the identity of H, is a group homomorphism.

Whenagrouphomomorphismdoespreservedistinctness,i.e. g=h⇒φ(g)=

φ(h)itis calledaninjectivegrouphomomorphismormonomorphism.Hence

onecan thinkof aninjectivegrouphomomorphism aspointingto aG-shaped

Foundations of Equivariant Deep Learning73

subgroupofH,labellingtheelementsofthissubgroupbyelementsofG

without duplicates.

Ifamappingisbothinjective(mappingeachelementofthedomaintoa

differentelementofthecodomain)andsurjective(reachingeveryelementof

the codomain)it isbijective,or invertible.A bijectivegroup homomorphismis

calledagroupisomorphism. Isomorphicgroupsareforallintentsand purposes

thesame,becausewecantranslateanystatementaboutoneofthemtoan

equallytruestatementabouttheother,andsoitisverycommontoconsider

them as the same group.

3.3.3Examples of Groups

Throughoutthisbookwewillencountervariouskindsofsymmetrygroups;

commutativeandnon-commutative,finiteandinfinite,discreteandcontinuous.

Inordertobuildintuitionforthegeneralconceptofgroup,andtobecome

familiarwiththegroupswewillconsiderlateroninthisbook,wepresent

inthissectionanoverviewofthemostimportantones.Itisnotnecessaryto

understandandbecomefamiliarwithallofthese;itissufficientfornowto

look at a few examples to make the concept of group feel less abstract.

3.3.3.1FiniteGroupsWhenstudyinganewconceptitishelpfultolook

at simple examples first. What is the simplestexampleof a group? Looking at

the axiomsof a group, wenotice that theidentity axiomtells usthat any group

hastohaveanidentity element,sowecannothave anemptygroupwithzero

elements.Canwehaveaone-elementgroup,withjusttheidentityelement

e?WedefineT={e},compositionbye◦e=e(theonlypossiblechoice)and

inverse by e

–1

=e (idem).By definition we havean identity, and an inversefor

all elements (juste), so the identity andinverse axioms aresatisfied. We check

thatcompositionisassociative: (ee)e=(e)e=e=e(e)=e(ee),soindeedTisa

group,calledthetrivialgroup.Asamatrixgroup,wemayrepresenteasthe

1×ばつ1 identity matrix I

=[1], ora higher dimensional identitymatrix, as these

satisfyII=IandI

–1

=Ijustliketheabstractgroupelemente.Whenwehave

anobjectwhosesymmetrygroupistrivial,itmeansthattheobjectdoesnot

have any non-trivial (non-identity) symmetries.

Whatabout agroupwithtwo elements?The groupshouldhaveanidentity

e,and letscallthe otheronem. Bytheidentityaxiom, weshouldhaveee=e,

em=m,andme=m,fixing3of the4compositions.Forthelast one,wehave to

choose betweenmm=mor mm=e. However, thefirst option runsinto trouble

becauseifwemultiplybothsidesoftheequationmm=mbym

–1

(however

we choose m

–1

) we find m=e, a contradiction. Thus we must have mm=e and

–1

=m.Wecanrealize thisgroupasa matrixgroupwith1×ばつ1matrices(i.e.

74Chapter 3

scalars)bychoosingG={1,–1}.Thisalsorevealsthegeometricmeaningof

this group:it consists ofthe mirror reflectionm andidentity e. Like thetrivial

group, the two-element group is commutative.

Perhapsthemostfundamentalclassoffinitegroupsarethepermutation

groups.Recallthatapermutationofanorderedlistofnelementsiswayto

rearrange("shuffle")them,forexampleswapping twoelements.Wecanrep-

resentapermutationasabijective(i.e.invertible)functionσ:n→n,where

n={1,...n}.Alternatively,wecanrepresentapermutationbyapermutation

matrixP,whichisamatrixwithasingle1ineachrowandcolumnandall

otherentries0.Whenactingonavectorx∈R

,thismatrixwillpermutethe

coordinates.

Thesetofallpermutationsofnelementsform agroupcalledthesymmet-

ricgroupS

(nottobeconfusedwiththegeneraltermsymmetrygroup).To

seethatS

isindeedagroup,notethat1)theidentityid

:n→nisaper-

mutation,2)thecompositeoftwopermutationsisapermutation(closure),

and3)apermutationhasaninverseandthisinverseisitselfapermutation.

Tospecifyapermutationσ,weneedtochooseσ(1)(npossibilities),σ(2)

(n–1 possibilities),etc.,upto σ(n)(onlyonechoiceleft). Thus,thegroupS

has n·(n–1)·(n–2)···2·1=n! elements. We canequivalently define S

as a

concrete group of mappings σ, as a matrix group, or as an abstract group.

Fortheinterestedreader,wementionedafewmorefinitegroups,with-

outgoingintodetail.Therotationalsymmetriesofregularpolygonssuchas

triangles,squares,pentagonsandhexagons(andsoon)arecapturedbythe

cyclicgroupsC

,andtheirrotationandreflectionsymmetriesbythedihe-

dralgroupsD

.Finitegroupsofsymmetriesinthreedimensionsincludethe

crystallographic point groups and the groupsof symmetries of Platonic solids.

AccordingtoCayley’stheorem

,anyfinitegroupwithnelementscanbe

viewedasasubgroupofS

.Furthermore,inamajormilestoneinmathemat-

ics

,the finitesimple groups(fromwhich allfinitegroups canbemade) have

been classified; all such groups belong to one of several infinite families, or to

a relatively short list of sporadic groups.

3.3.3.2InfiniteDiscreteGroupsGroupsneednotbefinite.Considerfor

instancethe setof integers Z={...,–2,–1,0,1,2,...}. With compositiongiven

byadditionand0astheidentity,thissetformsagroup.Geometrically,we

canthinkofthisgroupasagroupofdiscretetranslationsinonedimension.

Similarly, we canthink of Z

=Z×ばつZas agroup ofdiscrete translations intwo

dimensions.Inchapter6wewillconsiderthesegroupsasthesymmetriesof

regular grids (e.g. a grid of image pixels).

Foundations of Equivariant Deep Learning75

Inadditiontodiscretetranslationalsymmetries,someregulargrids/lat-

ticesalsohavediscreterotationalandreflectionsymmetry,suchasfour-fold

rotationalsymmetryofasquaregridandsix-foldrotationalsymmetryofa

hexagonalgrid.Wewillconsidersuchgroupsinchapter7whenwestudy

groupconvolution.Arelatedclassofgroupsarethe17wallpapergroups,

whichdescribethepossiblesymmetriesofwallpaperpatterns(Schattschnei-

der 1978). Their3D analogues are called(crystallographic) space groups, and

these have been classified into 230 distinct groups (Aroyo 2017).

3.3.3.3TheClassicalMatrixGroupsClassicalgroupsareanameofan

importantclassofgroupsthatoccurthroughoutmathematics,physics,and

computerscience,andseveralofwhichwillplayanimportantroleinthis

book.Theyweresonamedby(Weyl1939)inhisfamousbook"TheClassi-

cal Groups" on invariant polynomials, representation theory,and the Erlangen

programme. Mostgenerally defined, theclassical groups aregroups ofinvert-

iblematriceswithreal,complexorquaternionic

entries,possiblysubjectto

additionalconstraints(e.g.,volumepreservationdet(A)=1orpreservationof

certain kinds ofgeneralized inner product). Here wewill only consider certain

importantexamplesofrealandcomplexclassicalgroups;manymoreexamples

and details may be found e.g. in Goodman and Wallach (2009).

ThequintessentialmatrixgroupsarethegenerallineargroupsGL(n,R)

andGL(n,C)ofalln×ばつnrealorcomplexinvertiblematrices.Theseform

groups,becausetheyincludetheidentitymatrix,inversesforeachelement,

andbecause theproductoftwoinvertiblematrices isagaininvertible. They can

bethoughtofasthesymmetrygroupsofthevectorspacesR

andC

,being

the invertibletransformationsthat preserve theirvector spacestructure.

The

classicalgroups(andindeedallmatrixgroups)aresubgroupsofthegeneral

lineargroups,whichisthe"largest"ofthematrixgroups.Inthissense,the

generallineargroupplaysthesameroleformatrixgroupsasthesymmetric

group plays for finite groups.

FromGL(n,R)wecanobtainsubgroupsthatpreservenotjustthelinear

structure of R

, but also lengths,angles, orientation, and volume. Theorthog-

onalgroupO(n)contains onlythosematrices thatpreserve thestandardinner

productx

⊤

y.Thatis,weshouldhave(Ax)

⊤

(Ay)=x

⊤

yforallx,y∈R

,or

equivalently, A

⊤

A=I:

O(n)={A∈GL(n,R):A

⊤

A=I}.(3.11)

Thus, this group consists of rotations and reflections.

Geometrically,the innerproduct representstheanglebetweentwovectors

so orthogonal matrices preserve angles. The determinant of a matrix A tells us

76Chapter 3

GroupGL(n,R)SL(n,R)O(n)SO(n)

Constraintsdet(A)=0det(A)=1A

⊤

A=I

det(A)=1

⊤

A=I

Dimension

–1

n(n–1)

Mapping

Invariants

Volume,

Orientation

Volume,

Distance,

Angle

Volume,

Orientation,

Distance,

Angle

Table 3.1

Classicalrealmatrixgroupsandtheirproperties.Eachgroupisdefinedasthesetof

n×ばつnrealmatricessatisfyingaconstraint(row1).Anequalityconstraintmayelimi-

natedegreesoffreedom,leavingonlyacertainnumberofdimensions(row2).Note

thatO(n)andSO(n)havethesamedimensionality,becauseorthogonalmatriceshave

det(A)=±1sotheadditionaldet(A)=1constraintforSO(n)eliminatesonlyadiscrete

degree of freedom.In the mapping row we show for each group how a generic element

actsonthestandardbasise

=(1,0)(blue)ande

=(0,1)(red)forn=2dimensions.

Finally, we list the geometric properties left invariant by each group.All matrix groups

leaveinvariantthevectorspacestructure,meaningtheorigin0andalllinearrelations

z=αx+βy, and additionally the properties listed in the last row.

howmuch itexpands volume, with negative values indicatinga reversal ofori-

entation. Elementsof theorthogonalgroup have det(A)=±1,i.e., itpreserves

volume but not necessarily orientation (reflections have determinant –1).

ThespeciallineargroupSL(n,R)consistsofvolume-preservinglinear

maps:

SL(n,R)={A∈GL(n,R):det(A)=1}.(3.12)

It doesnot necessarilypreserve angles.Similarly, thespecial orthogonal group

consists of volume and inner-product preserving linear maps:

SO(n)={A∈GL(n,R):A

⊤

A=I,det(A)=1}.(3.13)

In table 3.1 we show these groups and some of their properties.

ThecomplexanalogsofSL(n,R),O(n)andSO(n)arecalledthecomplex

speciallineargroupSL(n,C),unitarygroupU(n)andspecialunitarygroup

SU(n).Whileinourbookwewillfocusprimarilyonrealmatrixgroups,we

should mention here thatthe unitary groups play animportant role in physics:

U(1)×ばつSU(2) isthe "electroweak symmetry" associatedwith electromagnetic

Foundations of Equivariant Deep Learning77

andweakforces,whereasSU(3)appearsinquantumchromodynamicsthat

describes strong interactions between quarks.

3.3.3.4MotiongroupsFurtherexamplesofgroupsthatareparticularly

important in applications are the Euclidean motion groups. Euclidean motions

canbedefinedasisometries(distance-preservingmaps)ofEuclideanspace

(e.g.R

withthestandardinnerproduct).Thisincludestranslations,rota-

tions,andreflections,andindeedonecanshowthatanyEuclideanmotion

isasequenceofthese.TheseformtheEuclideangroup,denotedE(n).When

reflections are excluded, it is called the special Euclidean group SE(n).

Itisnotpossibletoperformatranslationx7→x+t forx,t∈R

bymul-

tiplyingxbyann×ばつnmatrix,becauseA0=0foranymatrix.Forthis

reasonitiscommontoaddahomogeneouscoordinate,representingxby

x=(x

,...,x

,1)∈R

n+1

.AEuclideantransformationcanthenberepresented

by a matrix of shape n+1×ばつn+1 of the following form:





,(3.14)

whereR∈O(n)isann×ばつnorthogonalmatrix(rotationandreflection)and

t∈R

isatranslationvector.Onemayverifythatmultiplysuchamatrixby

xresultsintherequiredtransformationx7→Rx+t.Furthermore,thecom-

positionoperationofE(n)isgivenbymatrixmultiplication,andsoitcanbe

defined as a matrix group.

Byintroducinguniformscalingtransformations(replacingRbysRin

eq.(3.14)),weobtainthegroupofsimilaritytransformationsSim(n).The

groupofaffinetransformationscontainsinadditiontorotations,transla-

tions,andreflections,alsoscaleandsheartransformations.Usingthesame

homogeneouscoordinaterepresentation,wecanconvenientlyencodeaffine

transformations using the same form aseq. (3.14), but taking R∈GL(n,R) to

be an arbitrary invertible matrix rather than an orthogonal one.

Projectivetransformationscanalsobeencodedconvenientlyusinghomo-

geneouscoordinates.Ann-dimensionalprojectivetransformationmaybe

encodedasanon-singularn+1×ばつn+1matrix,andcompositionisgiven

bymatrixmultiplication.Inthisrepresentation,twomatricesrelatedbya

scalarmultiplicationA=λBaretobeconsideredequal.Toapplyaprojec-

tive transformationto a vector x, oneuses matrix-vector multiplication on the

homogeneouscoordinate vector

x,followedby adivision bythehomogeneous

coordinate(whichisnolongerequalto1ingeneral).Projectivetransforma-

tionsarea mainstayofclassicalcomputervision(HartleyandZisserman 2004)

and graphics. Table 3.2 lists several motions groups and their properties.

78Chapter 3

GroupPGL(n+1)Aff(n)(S)Sim(n)(S)E(n))

MatrixH





sQt





Domain

H∈GL(n+1)

A∈GL(n)

t∈R

Q∈(S)O(n)

t∈R

s∈R

Q∈(S)O(n)

t∈R

Dimension

(n+1)

–1n

n(n–1)

+n+1

n(n–1)

Mapping

Invariants

Colinearity,

Cross-ratio

Parallelism,

Volume

Ratio,Dis-

tanceRatio,

Angle

Volume,

Distance,

Angle

Table 3.2

Motiongroupsandtheirproperties.Motionsinndimensionsarerepresentedbyn+1×ばつ

n+1dimensionalmatricesusinghomogeneouscoordinates(seetext).Foreach group

we show theform ofthematrix, thedomainof thesub-matrices,andthe dimensionof

thegroup.NotethatfortheprojectivegroupPGL(n+1)(correspondingtoprojective

transformationsofn-dimensionalspace),anymatrixH∈GL(n+1)representsavalid

projective transformation, but matrices related by ascalar multiple should be identified

andthusthedimensionisreducedby1.Themappingrowshowshowasquaregrid

couldbemappedbyanelementintherespectivegroups,andtheinvariantsrowlists

several invariants of the group.

3.4Group Actions And Representations

Ingeometryingeneral,andgeometricdeeplearninginparticular,oneoften

wishes toperform computationsinvolving different kindsof geometric quanti-

ties, such as scalars,vectors, andtensors, as well as fieldsof such quantities.

The key thing that distinguishesand in fact characterizesthese and other types

of geometric quantitiesis not their dimensionality, butrather the way in which

theytransformundersymmetrytransformations.Theideathatthetypeofa

geometricquantityisdetermined bythe wayin whichthe groupof symmetries

acts on it was championed by Weyl (1939) and is widely used in physics.

Asanexample,considerthespecialorthogonalgroupSO(2)consistingof

planarrotationsaroundtheorigin.Itisintuitivelyobviousthatwecanapply

thesamerotationg∈SO(2)toapointonthecircleS

,ontheplaneR

,orto

aplanarimage(afunctionon theplane).Wearethusdealingwithonegroup

Foundations of Equivariant Deep Learning79

actingonseveraldifferentspaces.Moreover,thegroupcouldactindifferent

ways on the same space, for instance rotating clockwise or counterclockwise.

Ingeometricdeeplearning,allfeaturespacesareassociatedwithagroup

representation,whichtellsushowto transformdata livinginthatfeaturespace.

Thisincludestheinput space,outputspace,andallintermediatefeaturespaces.

Therearetwoequivalentwaystogiveprecisemeaningtotheideaofhav-

ingagroupGactonasetX.Thefirstapproachistodefineagroupaction

as a map α:G×ばつX→Xsatisfying certain criteria. Bycurrying we can obtain

fromαanothermapρ:G→[X,X](again satisfyingcertaincriteria)calleda

grouprepresentation,where[X,X]isthesetofmappingsfromXtoitself.

The reasonfor the nameis that eachabstract group elementg isrepresented in

thespaceXbyaconcretemappingρ(g):X→X.Itshouldbenotedthatthe

term"representation" hastraditionallybeen usedmostly fortheimportant case

whereρ(g)isalinearmapactingonavectorspaceX.Wewillusetheterm

representationforthegeneralconcept,andwherenecessarymakeadistinc-

tionbetweenlinearor Vect-representationsand generalor Set-representations.

Furthermore,sincegroupactionsandrepresentationsareequivalent(upto

currying), we will also permit ourselves to use these terms interchangeably.

Let us now look atthe aforementioned "certain criteria" thata group action

mustsatisfy.Withoutfurtherrestrictions,amapofthetypeα:G×ばつX→X

coulddoratherstrangethings.For example,theidentityelemente∈Gmight

notactastheidentityonX,i.e.,onecouldhaveα(e,x)=x.Also,ifweact

firstwithhandthenwithg,theresultmightbedifferent fromactingwithgh

(composinggandhin G).Thus, wewantagroupactiontorespectthestructure

ofthegroup,whichconsistsofadesignatedidentityelementandtheruleof

composition:

LetGbeagroupandletXbeaset.AgroupactionofGonXisa

mapping α:G×ばつX→Xsatisfying:

α(e,x)=x(Identity)

α(gh,x)=α(g,α(h,x)),(Composition)

for all g,h∈G, x∈X, and where e∈Gis the identity.

When thereis littlechance ofconfusion, groupactions canalso bedenoted

as g·x=α(g,x).

Ifwehaveagroupactionα,wecanobtainagrouprepresentationρby

definingρ(g)=α(g,–).Wecanalsodefinearepresentationwithoutreference

80Chapter 3

toanaction, butagain notallassignmentswilldo; ρmustsatisfytwo criteria

equivalent to the ones for group actions:

LetGbeagroup,letXbeaset,anddenoteby[X,X]thesetof

mappingsfromXtoX.A(Set-)representationofGonXisamap-

pingρ:G→[X,X]thatassigns toeachg∈Gamappingρ(g):X→X

satisfying:

ρ(e)=id

(Identity)

ρ(g)◦ρ(h)=ρ(gh),(Composition)

for all g,h∈G, and where e∈Gis the identity.

Thefirst axiom(identity)says thattheidentity e∈Gshouldberepresented

bytheidentitymappingonX,i.e.itshoulddonothing

.Thesecondaxiom

(composition) ensures thatρpreserves all positive relations of theform gh=k

that hold in the group

. This includes in particular relationssuch as gg

–1

=e=

–1

gthatexpressthatg

–1

istheinverseofg,whichimpliesthatρ(g)mustbe

an invertible map for any g∈G.

Negativerelations,oftheformgh=kmaynotbepreserved,however.For

instance, forany group wecan define thetrivial representation

, whichmaps

allelementsgtotheidentitymaponX.Onemaycheckthatthisisindeeda

validgroupaction.Anactioniscalledfaithfulwhenρisinjective(meaning

thatitisdifference-preserving:g=h⇒ρ(g)=ρ(h)).Inthiscase,allnegative

properties are also preserved.

Inmanycases,GandXhaveadditionalstructure,suchastopologicalor

smooth(Liegroup)structure,inwhichcaseonehastoadaptthedefinition

inordertomakethemathworknicely.Forinstance,byrequiringρtobea

continuousor smoothmap,andtaking [X,X] tobethe spaceofcontinuous/

smooth maps from Xto itself

In GDL, datausually live in a vector space, andmost of the representations

weencounterinpracticearelinear,meaningthatthemapsρ(g):X→Xare

all lineartransformations ofa vector spaceX. Aswe will see inlater chapters,

thisistrueeventhoughtheneuralnetworksweworkwitharehighlynon-

linear.Suchlinear representationsinvectorspacesare definedanalogouslyto

ordinary representations in sets:

Foundations of Equivariant Deep Learning81

LetGbeagroup,letXbeavectorspace,anddenoteby[X,X]the

set oflinear mappingsfromXtoX. Alinear representation(or Vect-

representation) ofGon Xisamapping ρ:G→[X,X]that assignsto

each g∈Ga linear mapping ρ(g):X→Xsatisfying:

ρ(e)=id

(Identity)

ρ(g)◦ρ(h)=ρ(gh),(Composition)

for all g,h∈G, and where e∈Gis the identity.

We notethat onecan equivalently definea (linear) group representationas a

group homomorphism(section 3.3.2) fromGto the groupof invertible (linear)

transformations of X.

InpracticeweworkinabasisofXrelativetowhichthelinearmaps

ρ(g):X→Xcanberepresentedbymatrices

.Hence,wecanalsodefine

ρasamatrix-valuedfunction.Then,composition◦correspondstomatrix

multiplication, and the application ρ(g)(x) is matrix-vector multiplication.

Linearrepresentationsareusefulbecausetheyturnabstractgrouptheory

intolinearalgebra,whichiswellunderstoodandforwhichwehavehighly

efficient specializedhardware.Evenwhenworkingwithagroupofmatrices,

it is useful to consider linear representations, and one should be careful to dis-

tinguishthe conceptsof matrixgroupsand linearrepresentations.Forinstance,

one couldconsider the groupSO(2) asa matrix group consistingof 2×ばつ2 rota-

tionmatrices,andhaveitactvian×ばつnmatricesρ(g)onthespaceofsignals

onthecircle,sampledatnpoints.Similarly,wecanconsiderthegroupof

d×ばつdpermutationmatrices,actingonthespaceofd×ばつdadjacencymatrices

by d

×ばつd

matrices ρ(g). We will cover these examples in more detail later.

Wehaveformallydefineda(linear)representationasamapρ:G→X

butitiscommontorefertotherepresentationspaceXastherepresentation

whenitisclearwhatρisassociatedwithX.Neverthelessitisimportantto

notethataccordingtoWeyl’sprinciple,thetypeofgeometricalquantityis

determinedbythe representation,notbythespacealone.Considerforexample

thegroup SO(2)of2×ばつ2rotationmatrices.An SO(2)-vectoris aquantityx∈

that transformsaccording tothe standard(vector) representation ρ

(g)=g.

Ontheotherhand,aquantityinthesamespaceR

iscalledapairofSO(2)-

scalars if it transformsaccording to the trivial representation ρ

(g)=I

(where

isthe2×ばつ2identitymatrix).Thustherepresentationdeterminesthetype

ofgeometricquantity,andaswewillseeamappingbetweentwotypesof

geometric quantities / representations is called an equivariant map.

82Chapter 3

3.4.1Examples of Representations

Itis hightimeto considersomeexamplesof grouprepresentationsand group

actions.Most groupsconsidered inthisbook arematrixgroups, andforthese

groupsthereisanaturalchoiceofrepresentationoftencalledthestandard

representation,whereeachgroupelement(matrix)A∈Gisrepresentedby

itself: ρ(A)=A. This definition trivially satisfies theidentity and composition

requirements to qualify as a representation.

Asanexample,considerthematrixgroupSO(2)={R∈R

2×ばつ2

|RR

⊤

I,det(R)=1}of2×ばつ2rotationmatrices.Clearly,eachrotationmatrixR∈

SO(2)definesalinearmappingR

→R

givenbymatrixvectormultiplica-

tion: x7→Rx; thisis the standardrepresentation ρ(R)=Rof SO(2). Notethat

wecanequivalentlydefineSO(2) asthesetofrotation angles[0,2π)withcom-

positiongivenbyadditionmodulo2π.Inthatcasewecannotdefineρ(θ)=θ

becauseθ∈[0,2π]isanangleandnotamatrix,butwecanstilldefinean

equivalent representation:

ρ(θ)=



cos(θ)–sin(θ)

sin(θ)cos(θ)



(3.15)

Usingtherulesoftrigonometryonecanshowthatmatrixmultiplicationthen

mimics angle-addition: ρ(θ)ρ(θ

′

)=ρ(θ+θ

′

), and that ρ(0)=I.

Similarly,thestandardrepresentationofthepermutationgroupS

mapseach

abstract permutation to amatrix that permutes thecoefficients of the vector on

whichitacts.Forinstance,ifπ

isthepermutationthatswapsthefirsttwo

elements of a length-3 sequence, the corresponding permutation matrix is:

ρ(π





010

100

001





(3.16)

ThisrepresentationisusefulinGDLapplicationsinvolvingsetsandgraphs

(chapter5),whereeachofthennodes/elementsisequippedwithafeature

∈Rthatwewish topermute.Indeedwe findthat(ρ(π)x)

π(i)

.If instead

ofone scalarper elementwe havea d-dimensional vectorx

∈R

wecan stack

them into amatrix X∈R

n×ばつd

and act onit with the samematrix: X7→ρ(π)X.

Equivalently, we canstack theminto onecolumn vector x∈R

in whichcase

we use an equivalent representation matrix with didentical blocks:

(π)=







ρ(π)







(3.17)

Foundations of Equivariant Deep Learning83

ThepermutationgroupS

canalsoacton matricesofsizen×ばつn,by simul-

taneouslypermutingtherowsandcolumns.Ifρ(π)isthestandardmatrix

representation of S

defined above, we get a representation on matricesvia the

tensorrepresentation(section3.6.6):(ρ⊗ρ)(π)A=ρ(π)Aρ(π)

⊤

.Thisrepre-

sentationis usefulforinstance fordescribingthe transformationbehaviour of

the adjacency matrix or edge-feature matrix of a graph.

Anotherrepresentationofthepermutationgroupisthesign(orparity)

representation.Anypermutationπcanbedecomposedintoasequenceof

transpositions(swaps ofapair ofelements),and althoughthisdecomposition

is notunique, the numberof transpositions N(π) isalwaysthe same.The sign

representationisgivenbyρ(π)=(–1)

N(π)

,i.e.itis1foranevennumberof

transpositions, and –1 for an odd number of transpositions.

ForanygroupGwecandefinethetrivialrepresentationρ(g)=1,which

mapseveryelementtothe1×ばつ1identitymatrix.Onemayverifythatthis

isindeedarepresentation.Ingeometricdeeplearning,weusethetrivial

representation for features that should be invariant.

ThematrixdeterminanthasthepropertythatdetI =1anddetAB=

detAdetB foranysquare matricesAand Bof thesame size. Hence,it defines

arepresentationofthegenerallineargroupandanysubgroupthereof(and

hence,ofanymatrixgroup).Forexample,foranelementoftheorthogo-

nalgroupA∈O(n),thedeterminanttellsuswhetherA flipsthehandedness

(detA=–1)orleavesitunchangeddetA=1).Flippinghandednesstwiceis

the same as not changing it, and indeed (–1)·(–1)=1.

Theexamples aboveare alllinearrepresentations,soletusconsideraSet-

representation (groupaction). For any group G, the composition◦:G×ばつG→G

definesanactionofGonthesetX=G,i.e.onitself.Writtenasarepresen-

tation, thissend g∈Gtothe mappingρ(g):G→Gdefined byρ(g)=g◦–, i.e.

ρ(g)(h)=g◦h.

A finalexample thatwillplay animportant roleinthe next chapterandthe

restofthebook,istheactionofagrouponaspaceoffunctions.Ifwehave

agroupGactingonanyspaceΩ,wecannaturallydefineanactionofGon

thespaceofreal-valuedfunctionsx:Ω→R.DenotingtheactionofGonΩ

by g·u, this is defined as: (ρ(g)f)(u)=f(g

–1

·u). For instance, if Gis the group

of rotations andtranslations of theplane Ω=R

, thenρisthe way in whichG

acts on the space of grayscale images on R

3.5Equivariant Maps & Equivariant Nets

Havingdefinedgroupsandrepresentations,wearefinallyreadytogivea

proper mathematical definition of equivariance:

84Chapter 3

Letf:X→Ybeamappingofsets(orvectorspaces),Gagroup,

andletρ

:G→[X,X]andρ

:G→[Y,Y]betwo(linear)represen-

tationsofGactingonthedomainXandcodomainYoff.Thenfis

G-equivariantif for all g∈G:

f◦ρ

(g)=ρ

(g)◦f.(3.18)

An equivariant(linear) mapcan also becalled amapping or morphism

of representations, written f:(X,ρ

)→(Y,ρ

Unlike our preliminary discussion of equivariance in section 3.2.2, this def-

initionmakesitclearthattheequivarianceconstraintshouldbesatisfiedfor

everyelement ina group, andthat ρ

are notarbitrary mapsbutshould satisfy

the requirements for being a representation (section 3.4).

If we have two equivariant maps then their composite is also equivariant:

◦f

◦ρ

(g)=f

◦ρ

(g)◦f

=ρ

(g)◦f

◦f

(3.19)

Asalwaysinmathematicsandprogramming,wecanonlycomposefunctionsif

the codomain(output space) of thefirst map matchesthe domain(input space)

ofthe second.Forexample,wecancomposef

→X

andf

→X

obtain f

◦f

→X

(read: f

after f

), but we cannotform f

◦f

because f

expects input of type X

but f

returns output of type X

instead.

It isvery importantto notethat when composingequivariant maps,we must

ensurethat notonlythedomain/codomain matchassetsorvectorspaces, but

alsoas representations,orelsethe compositewillnotbe equivariant.Inother

words,weshouldviewanequivariantmapf

notasamappingofsetsbutof

representations. Soin the exampleabove,f

maps fromrepresentation (X

,ρ

)

to (X

,ρ

), andf

maps from(X

,ρ

) to(X

,ρ

). Sincethe output representa-

tion of f

matches the inputrepresentation of f

, we cancompose them and the

composite f

◦f

will be an equivariant map from (X

,ρ

) to (X

,ρ

Followingthisrule,wecancreatearbitrarycomputationgraphsfrom

equivariant maps, i.e. create equivariant networks:

Definition 1 (Equivariant Network)Let G bea group. A G-equivariantnet-

workisacomputationgraph(DAG),whereeverynodeisassociatedwitha

vector spaceX

and representationρ

of Gin X

, andall mappings are equiv-

ariant. Thatis, every nodeis computed asa functionof itsparentsin thegraph

bya(possibly parameterized)mappingthatisequivariant withrespect tothe

representations acting on the input and output space of the map.

Foundations of Equivariant Deep Learning85

Thus,whendesigninganequivariantnetwork,wenotonlyhavetochoose

the featurespaces X

(e.g. how manydimensions it has),but alsohow wewant

ittotransform

(exceptfortheinputandoutputspace,wheretherepresen-

tationisusuallydeterminedbytheproblem).Everymappinginthenetwork

shouldbeequivariant,includingnon-linearities,normalizationlayers,learn-

able linearlayers,etc., becauseeven asinglenon-equivariant layerbreaks the

equivariance of the network as a whole.

One can createan invariant network simply by makingthe last layer invari-

ant,i.e.choosingtherepresentationoftheoutputspacetobethetrivial

one.Forexampleinaclassicalconvolutionalnetwork,onecouldchoose

aglobalaveragepoolinglayer(averagingoverspace),whichproducesa

translation-invariant output

Mappings in ageneral computation graph can have multipleinputs and pro-

duce multipleoutputs. Inthis casea simultaneous transformationg appliedto

allinputsusingtheirrepresentationshouldleadtoacorrespondingtransfor-

mation ofthe outputs. Suchmulti-variablemappings can alsobe considered as

mappings witha singleinput andoutputthat isthe concatenationof in/output

vectorspaces. In anequivariant networkeach of those vectorspaces has a rep-

resentationassociated withit,andtheirconcatenation willtransformaccording

toa block-diagonalrepresentationobtainedbystacking representationsalong

the diagonal (see section 3.6.1).

Thefeaturespacesofequivariantnetworksusuallyconsistofanumbern

ofcopies(calledthe multiplicity)ofafew (typicallylow-dimensional)repre-

sentationsρ

actingonR

.Asdiscussedinsection3.4,eachrepresentation

correspondsto atypeofgeometricfeature (e.g.scalar,vector,tensor). Thus,

a geometric featurespace consists of a numberof copies ofgeometric features

ofvarioustypes.Thismeansthatthechannels/dimensionsofageometric

featurevectoraregroupedattwolevels:eachdimensionbelongstoapartic-

ulargeometricfeaturevectorofdimensiond

(e.g.a3Dvector),andalln

geometricfeaturevectorsofthistypebelongtoasingleisotypicsubspaceof

dimensionn

.Forexample,wecouldhavetwogeometricfeaturevectors

1,1

1,2

of type ρ

(forming an isotypic subspace ofdimension 2d

), and one

feature vector v

2,1

of type ρ





1,1

1,2

2,1





7→





(g)









1,1

1,2

2,1





(3.20)

Inthenextchapterwewillconsiderfeaturespacesconsistingofsignalsona

domain,in whichcasethegeometric quantitiesarenotjuststacked"vertically"

as wehave done here,but also"horizontally". For example, apoint cloud(??)

86Chapter 3

consistsof3Dpointfeatures(thestandardrepresentationofSE(3))stacked

horizontally, and wehavea permutation groupacting along thehorizontal axis

bypermutingpoints.Aplanarvectorfield(chapter7)consistsof2Dvectors

transforming accordingtothe standardrepresentation ofSO(2), andaddition-

ally thegroup SE(2)acts bymoving vectors to anew positionon theplane as

well as rotating them.

Inadditiontogeometricfeatures,alearnablemappingtakesparametersas

input.Parametersshouldingeneralbeconsideredasinvariantquantities(trans-

formingaccordingtothetrivialrepresentation),andsodon’trequirespecial

treatment and don’t impose any constraints on the layer.

3.6The Equivariance Cookbook

Wehaveseenhowequivariantnetworkscanbeconstructedfromequivariant

buildingblocks,buthavenotdiscussedthebuildingblocksthemselves.Here

we considera numberof common neuralnetwork layers andtheir equivariant

counterparts.

3.6.1Concatenation, Direct Sums, and Geometric Feature Channels

A common operationin neural networksis to concatenate a number ofvectors

∈R

,usuallydenotedx=concat(x

,...,x

).WhenthevectorspacesR

areassociatedwithrepresentationsρ

ofthesamegroupG,theconcatenated

vector xtransforms according to a block diagonal representation:

ρ(g)=block-diag(ρ

,...,ρ







(g)







(3.21)

Inotherwords,concat(ρ

(g)x

,...,ρ

(g)x

)=ρ(g)concat(x

,...,x

).Foreach

g∈G,thematrixρ(g)hasblocksofsized

×ばつd

onthediagonal,andzeros

elsewhere.

Inmore abstractand basis-freelanguage, concatenatingvectorscorresponds

tothedirectsumofvectorspaces,andblock-diagonalstackingofrepresen-

tationscorrespondstoadirectsumofrepresentations.Wewillnotgointo

thishere,butnotethattheseoperationsaredenotedby

≃R

and

ρ=

, respectively.

Oneexamplewhere directsums appearin applicationsis whenwe dealwith

pointcloud data.Hereour dataconsistsof anumberofpoints x

∈R

,which

wecanstackintoonebigvectorx=concat(x

,...,x

).Eachpointisassoci-

atedwith arepresentationof therotationgroup G=SO(3),namelythe standard

representation(section3.4.1)ρ

(R)x

=Rx

(whereRisarotationmatrix).

Thus, thewhole pointcloud transformsas ρ(R)=block-diag(ρ

,...,ρ

), i.e.a

Foundations of Equivariant Deep Learning87

matrix withn copiesof Ralong thediagonal. Ifwe want toaddsome rotation-

invariantfeatures,suchasatimestamp,wecandosobyaddinganumberof

copies of the trivial representationρ

(g)=1 along the diagonal.

Insteadof stackingthex

intoa flatvector,wecouldalso stackthevectors

intoamatrixX∈R

3×ばつn

,inwhichcasetheactionisgivenbyX7→RX.For

consistencyhowever,itismathematicallymoreconvenienttoalwayswork

withthematrix-vectorform,evenifinourimplementationwemaychoose

a different shape array for our data.

Concatenation can be performed explicitly as a computation step in the net-

work,butcanalsobeusedtodescribefeaturespacesconsistingofmultiple

geometric quantities mathematically.In conventional neural networks, feature

spacescanbedescribedasaconcatenationofscalarfeaturesorinthecase

ofconvolutionalnetworks,ofchannels.Inequivariantneuralnetworks,one

typicallyconsidersfeaturespacesconsistingofmultiplekindsofgeometric

features(representations)ofvarioustype,eachoccuringwithsomemulti-

plicity.Forinstance,SO(3)rotationequivariantnetworkmightuseacertain

number of scalar features, 3D vectors, and 3×ばつ3 tensors as features. If thei-th

featuretype hasdimensiond

andwe have c

copiesof it,thetotaldimension

of the representation space would be d=

3.6.2Vector Addition, Scalar Multiplication & Linear Combinations

Anothercommonoperationisvectoraddition:z=x+y∈R

.Thisoperation

isequivariant,provided thatxandynotonlyhavethesamedimension dbut

also transform according to the same representation ρ. In that case we have:

ρ(g)x+ρ(g)y=ρ(g)(x+y),(3.22)

that is, vector addition is equivariant.

Similarly,multiplyingbyaninvariantscalar(i.e.eitheraconstantoran

input-dependentbutgroupinvariantquantity)isalsoequivariant.Thereason

isthatscalarmultiplicationcommuteswithmatrixmultiplication:ρ(g)αx=

αρ(g)x.Combiningvectoradditionandscalarmultiplication,wecantake

linearcombinationsofvectors,andthisisequivariantaswellwhentherep-

resentationsarethesameandscalarsareinvariant:z=

,forscalars

∈R.

3.6.3Change of Basis

Ifwehavead-dimensionalmatrixrepresentationρandanarbitrarynon-

singular(invertible)matrixA∈R

d×ばつd

,wecancreateanewrepresentationby

changing the basis:

′

(g)=Aρ(g)A

–1

88Chapter 3

We verify that ρ

′

is indeed a representation:

′

(g)ρ

′

(h)=Aρ(g)A

–1

Aρ(h)A

–1

=Aρ(g)ρ(h)A

–1

=Aρ(gh)A

–1

=ρ

′

(gh).

Tworepresentationsρandρ

′

relatedinthiswayarecalledequivalentor

isomorphic representations.

Whenρandρ

′

areequivalentbyachangeofbasisA,thenAisan

equivariant linear map between them:

Aρ(g)=Aρ(g)A

–1

A=ρ

′

(g)A.(3.23)

3.6.4Permutation representations and pointwise nonlinearities

Atypicalneuralnetworkconsistsmostlyoflearnablelinearmapsandfixed

elementwisenonlinearities. Inan equivariantnetwork,allmaps mustbe equiv-

ariant,includingnonlinearities.Whereastheequivarianceconstraintapplied

to learnable linear layerssimply leads to a reduction in thenumber of parame-

ters,nonlinearities generallydon’thave anyparameters andsowe mustchoose

themtobeequivariant.Equivarianceisexpressedrelativetogrouprepresen-

tationsactingontheinputandoutputspaceofthenonlinearmap,andso

therepresentationsandnonlinearitymustbechosentogether.Inthissection

andseveralthatfollow,weconsidervariouskindsofnonlinearitiesandthe

representations to which they are equivariant.

Considera discretegroupG, afeaturespaceX=R

anda representationρ

ofGactingonit.Ifρisapermutationrepresentation,meaningthatρ(g)isa

permutationmatrixforallg∈G,thenanypointwisenonlinearityisequivari-

ant withrespect toρacting onthe inputand outputspace

. Thisis becausea

pointwisenonlinearityactsthesameway oneachcoordinate,andapermuta-

tionsimply shuffles thecoordinates,so itdoesnot matterifwe shufflebefore

or after applying the nonlinearity.

Ingraph andsetneuralnetworks (chapter5),wehaveapermutationgroup

actingbypermutationmatrices,andsowecanfreelyusearbitrarypoint-

wisenonlinearities.Similarly,discretetranslationsofgrids(chapter6)such

asimagesandaudiosignalsareanexampleofpermutationrepresentations,

becauseadiscreteshiftsimplymoveseachpixeltoanotherposition(i.e.

dimension)without changingthe numericalvalues.Thus, pointwisenonlinear-

ities in convolutional networks do not break discrete translation equivariance.

Foundations of Equivariant Deep Learning89

Moregenerally,wegetpermutationrepresentationswhenworkingwith

scalar signalson a spaceΩhaving a group action(chapter 4) suchas a homo-

geneousspace(chapter7).However,ifthedomainiscontinuousthenwe

areforcedtodiscretizeandasaresultaliasingcanoccur,whichwillbreak

equivariance tocontinuous transformations. Sincenonlinearities can introduce

introducehigh-frequencystructureintothesignal,theyarelikelytoleadto

aliasing. In order to preserve equivariance to pointwise nonlinearities it is thus

necessary to take measures to avoid aliasing (Cesa, Lang, and Weiler 2022).

3.6.5Gating and Attention

Whentherepresentationmatricesρ(g)arenot permutationmatrices, pointwise

nonlinearitieswillingeneralnotbeequivariant.Forinstance,considerthestan-

dardrepresentationof arotationθ∈[0,2π) asa2×ばつ2matrix(eq.(3.15))acting

on R

. Aπ/2 rotationmaps (x,y)7→(–y,x) forexample,and thisoperation will

notcommutewithReLUbecauseofthesignflip.Similarly,afieldofsuch

quantities (a vector field; see chapter 7) will require special nonlinearities.

In cases likethese, gating nonlinearities can be used.In conventional neural

networks,gatingreferstomultiplyingascalarorvectorfeaturebyascalar

valuecalledthegate,which istypicallysquashedintothe (0,1)or (–1,1)range

by a sigmoid or tanh nonlinearity:

gate(x,g)=sigmoid(g)⊙x.(3.24)

Herex,g∈R

arefeaturevectors,and⊙indicateselementwisemultiplica-

tion.Sometimes,asinglescalargateg∈Risusedtoscaleanentirefeature

vector x∈R

orasub-vector ofit.GatingisusedinLSTMandGRUblocks

(HochreiterandSchmidhuber1997;Chungetal.2014),squeeze-and-excite

blocks (Hu et al. 2020), and many other architectures.

Inequivariantnetworks,gatingcanbeusedonequivariantfeatures,provided

thatthegateisinvariantandweuseonescalargatetomultiplyawholegeo-

metricfeaturevector(notonegate percomponentofthevector). Asnotedin

section3.6.2, multiplyingbyaninvariant scalarisequivariant. Invariantscould

be producedbyan invariant linearmap(section 3.6.8)which isa specialcase

ofanequivariantmap.Alternatively,iftherepresentationmatricesρ(g)are

orthogonal(asis oftenthe case)one couldcompute thenorm ofgeometric fea-

turevectors,or moregenerallyaninvariant polynomial(Hilbert1890; Derksen

and Kemper 2002;Villaret al.2021). Thenone canapply anonlinearity such

assigmoid ortanh(clearly,theoutput willstillbeinvariant). Finally,one can

multiplya geometricfeaturevectorx

(transformingaccording toarepresenta-

tion ρ

) bya scalargate. Since scalarand matrixmultiplication commute,this

is always equivariant.

90Chapter 3

Attention isan operationthat likegating involvesmultiplicative interactions

betweendata-dependentquantities,butadditionallyinvolvesanaveragingstep.

LetX∈R

N×ばつD

beamatrixofNfeaturevectorsofdimensionD,andletA∈

M×ばつN

be a matrixof logits orunnormalized attention weights. Thenattention

is computed as:

attention(X,A)

n=1

softmax(A)

,(3.25)

wherethesoftmaxisappliedalongtheNaxis,thusproducinganormalized

distribution over the Nfeature vectors X

n,:

There are two kinds of equivariance we may wishthe attention operation to

have.Ingraphneuralnetworksand(non-autoregressive)transformers(chap-

ter5)onwishesthisattentiontobeequivarianttopermutationsoftheNfeature

vectorsandattentionweights;thisisautomaticallysatisfied.Inapplications

involvinggeometry(e.g.geometricgraphs,??,onemaywishattentiontobe

equivariantto agroupG(e.g. rotations)actingoneachfeaturevectorX

n,:

iden-

tically. Sinceattention computes a weighted average (section 3.6.2) of vectors

withthe samerepresentation,it isequivarianttosuchtransformationsprovided

thatthelogitsA areinvariant.Inthecaseofself-attention,whereA iscom-

puted by inner products, this is guaranteed whenever the keys and values have

the same orthogonal group representation.

3.6.6Tensor product nonlinearities

Gatingnonlinearities,wherewemultiplyascalarandsomeothergeomet-

ricquantity,isaspecialcaseofamoregeneraloperationcalledthetensor

productofrepresentations.Givenvectorsx

∈R

,theirtensorproduct(or

outerproduct)T=

livesinR

×ばつd

.Theelementatindex(j

,...,j

)

istheproductofthecorrespondingelementsofthevectors:T

,...,j

i,j

Forexample,thetensorproductoftwocolumnvectorsisamatrixgivenby

x⊗y=xy

⊤

,i.e.(x⊗y)

.Wecanlinearlycombinetensors,sothe

tensors of a given shape form a vector space

Whenthevectorsx

haveagrouprepresentationassociatedwiththem,the

spaceoftensorsalsonaturallybecomesarepresentation.Applyingg∈Gto

x,yusing their representations ρ

,ρ

, we find:

(ρ

(g)x)⊗(ρ

(g)y)=(ρ

(g)x)(ρ

(g)y)

⊤

=ρ

(g)(xy

⊤

)ρ

(g)

⊤

.(3.26)

Thus,T=x⊗y∈R

×ばつd

ismappedtoρ

(g)Tρ

(g)

⊤

,andindeedthis

definesagrouprepresentationonR

×ばつd

.Wecouldwritethisasρ(g)T=

(g)Tρ

(g)

⊤

,butitshould benotedthatρ(g)isa linearmapfromthespace

of tensors R

×ばつd

to itself, i.e. as a matrix it has shape d

×ばつd

Foundations of Equivariant Deep Learning91

Mathematicallyitisoftenconvenienttoglossovertheparticularshape

(sincereshapingdoesn’treallydoanything),butwhenimplementingalgo-

rithms theshape doesmatter. If we flattent=vec(T)∈R

, wecan takeρ(g)

tobe amatrix ofsized

×ばつd

.The venerable MatrixCookbook(Petersen

and Pedersen 2012, equation 520) tells us that:

vec(ρ

(g)Tρ

(g)

⊤

(

(g)⊗ρ

(g)

)

vec(T),(3.27)

where ρ

(g)⊗ρ

(g) denotes the tensor / Kronecker product of matrices:

A⊗B=







B···a







(3.28)

for matrices A∈R

m×ばつn

and B∈R

p×ばつq

In summary, we canwrite the actionof Gon thespace of matrices eitheras

T7→ρ

(g)Tρ

(g)

⊤

,orinflattenedformast7→ρ

(g)⊗ρ

(g)t.Thisiscalled

thetensorproductofrepresentationsρ

andρ

,andcanbeextendedtoarbitrary

products

Whenpairwisetensorproductsareusedasanonlinearity,the dimensionality

isgreatlyincreasedwitheachapplication.Henceitiscommontocom-

binetensorproductswiththeClebsch-Gordandecomposition(Kondor,Lin,

andTrivedi2018),wherea(fixed)changeofbasisisappliedtothetensor

productinordertoblock-diagonalizeitintoirreduciblerepresentations(see

section 3.6.9), afterwhich only a limited numberof low-order representations

is retained.

Another way in whichthe tensor representation comes up is in graph neural

networks,whereeachgraphonnnodesisdefinedbyanadjacencymatrix

A∈R

n×ばつn

(inpractice,theentriesofthematrixarebinary,butonecanalso

considergeneraledgeornode-pairfeatures).Whenwepermutethenodesof

thegraph,wehavetopermuteboththerowsandcolumnsoftheadjacency

matrix.This correspondstothe tensorrepresentationρ

⊗ρ

,where ρ

(P)=P

isthestandardmatrixrepresentationofthepermutationgroup.Thatis,the

adjacency matrix transforms as A7→PAP

⊤

3.6.7Equivariant Normalization Layers

Normalizationlayers,suchasBatch-,Layer-,Group-andInstance-

Normalization, areanimportant partof modernnetwork architecturesinclud-

ingconvolutionalandattention-basedmodels(IoffeandSzegedy2015;Ba,

Kiros,andHinton2016;WuandHe2018;Ulyanov,Vedaldi,andLempitsky

2016).Normalizingfeaturesthroughoutthe networkcan improve optimization

92Chapter 3

speedandstability,andimprovegeneralization.Generallyspeaking,normal-

izationlayerscomputeameanμandstandard deviationσoffeatures,usethem

tonormalizethefeaturessothattheyhavemeanzeroandstandarddeviation

one, andthen optionally applya learnedshift and scaleoperation. They differ

intheaxesoverwhichthemeanandvariancearecomputed,andhencethe

shape of μand σ.

Regardlessoftheaxesoverwhichnormalizationisperformed,afew

simplerulessufficetomakenormalizationlayersequivariant.Asnotedin

section3.6.2,wearealwayspermittedtolinearlycombinegeometricfea-

turevectors,providedthattheyallhavethesamerepresentationandthatthe

scalar coefficients are invariant. Thus, we cancompute the mean ofgeometric

featurevectorsofthesamerepresentation,andsubtractitfromthosefeature

vectors, andtheresultwill beequivariantwiththesamerepresentation again.

Weshouldhowevernotaverageoverthecomponentsofasinglegeometric

feature vector; the components are not in themselves geometric vectors with a

representationandsotheirmeanisnotageometricquantityeither.Thestan-

darddeviationisafunctionofthesumofsquaresofthe(mean-subtracted)

components,andthisquantity(i.e.thenorm)willbeinvariantaslongasthe

representation is orthogonal.

Letusdescribean equivariant layer-normalizationlayer.Consideranarray

X∈R

N×ばつC×ばつD

,whereNisthebatchsize,Cisthenumberofchannels(multi-

plicitY),andDisthedimensionalityofgeometricfeaturevectors

.Weassume

that eachvector X

transforms accordingtothe sameorthogonalrepresenta-

tion. We will normalize over the Caxis separately for each N. Thus, our mean

has shape N ×ばつD and is computed as:

c=1

ncd

That is,we computeNmeanvectors of dimensionD byaveragingoverthe C

axis.Wesubtractthisfrom X toobtainX

ncd

–μ

.Then,wecompute

thesquarednormforeachgeometricfeaturevector,and computetheir average:

c=1

d=1

ncd

.(3.29)

Finally, we divide: Y

ncd

√

+ε

It is common toapply a gain and bias after normalization.As noted, scaling

byaninvariantgainisalwayspermitted,providedoneusesthesamegainforall

coefficients ofageometricfeaturevector.Applyingabiasisonly equivariant

ifthebiasvectorsatisfiesρ(g)b=bforallg∈G,i.e.ifthebiasisinvariant.

Foundations of Equivariant Deep Learning93

Dependingonρ,thismaymean thatithastobe0,whileinothercasesthere

may be a space of solutions of some dimensionality.

3.6.8Equivariant Linear Maps

One of the most common type of maps in neural networks are learnable linear

maps,suchasfullyconnectedlayers,convolutions,depth-wiseconvolutions,

and1×ばつ1convolutions.Inthelinearcaseitisoftenpossibletounderstand

thespaceofequivariantmapsandparameterizeit.Inlaterchapterswewill

considervariouscaseswithspecialstructure,suchasplanar convolution,group

convolution,andgraphconvolution.Inthissectionweconsiderthecaseofa

general linear map, also known as a fully connected layer.

Consider agroup G, matrix representationsρ

(g)∈R

n×ばつn

and ρ

(g)∈R

m×ばつm

ofG,andamatrixW∈R

m×ばつn

.ForW tobeequivariant,ithastosatisfythe

equationρ

(g)W=Wρ

(g)forallg∈G.Thissystemofequationsislinear

inW,andsowecansolveforWtofindabasisforthespaceofsolutions.

IfW

∈R

m×ばつn

(fori=1,...,k)isabasisforthespaceofsolutions,thenany

equivariant matrixWcan bewritten asa linearcombination ofbasis matrices:

i=1

, and any such linear combination is equivariant.

Itiscommontoaddalearnablebiasvectortotheoutput,computingthe

affine mapWx+binsteadofthelinearmapWx,butthisisnotnecessarily

equivariantevenifWis.Wecanreducetheaffinecasetothelinearoneby

appending a constant 1 dimension to x. Thisquantity is invariant, so we add a

constant1block(trivialrepresentation)ofsize1×ばつ1 toρ

aswell,yieldinga

n+1 dimensional matrix representation. Then, solving for Was before yields

an n+1×ばつm matrix containing the equivariant weight matrix and bias vector.

Dependingonthegroupandtherepresentations,itmightbepossibleto

solvethe equivariance constraintanalytically, and thishas been done formany

groupsofinterests.Indeed,aswewilldiscussinthenextsection,represen-

tationtheorypresentsuswithasystematicwaytounderstandthespaceof

equivariantlinearmapsbetweentworepresentationsbybreakingthemdown

intoirreduciblerepresentationsforwhichequivariant linearmapstakeavery

simple form.

Nevertheless,itcansometimesbemorepracticaltosolvetheequivariance

equationsnumerically.Todoso,we wouldlike tofirstbringthe equivariance

equationin thestandard matrix-vector formAw=0, wherew=vec(W)is the

flattenedweightmatrix andA is amatrixofcoefficients derivedfromρ

and

.We firstwrite theconstraint asρ

(g)Wρ

(g)

–1

–W=0. Next, wevectorize

(flatten) usingthe vec operator. Usingthe generalfact (Petersen andPedersen

(2012),Section10.2.2.,equation520)thatvec(AXB)=(W

⊤

⊗A)vec(X),

94Chapter 3

where ⊗is the Kronecker product eq. (3.28), we obtain:

(ρ

(g)

⊤

⊗ρ

(g)–I

⊗I

)vec(W)=0(3.30)

Thus,foreachgroupelementgwegetaconstraintmatrix(ρ

(g)

⊤

⊗ρ

(g)–

⊗I

)ofsizenm×ばつnm.Assumingthegroupisfinitewith|G|elements,we

can stackthese matricesinto constraintmatrix Aof size |G|nm×ばつnm. Wecan

thenuseastandardmethodsuch asSVDtosolvethelinearsystemAw=0and

obtain our (flattened)basis matrices w

=vec(W

). If thegroup is not finite,it

isoftensufficienttosamplealargenumberofgroupelementsandsolvethe

system for those.

Performingthe linearcombinationW=

i=1

canbe expensivewhen

arelargematrices, orwhenthenumberof basismatriceskislarge. Once

training iscomplete, onecan perform thelinear combinationsonce and simply

storeWas parameters,asina conventionalneuralnetwork, butduring train-

ingthisoperationmustbeperformedduringeachforwardpass.Fortunately,

thematricesW

canoftenbe chosentobesparsewithnon-overlappingspar-

sitypatterns.Asaresult,thematrixW willdisplayaweight-sharingpattern

(Ravanbakhsh, Schneider, and Poczos 2017).

One exampleis convolution, which, as we discussin Chapter 6 corresponds

to a matrixwhere the entries on eachdiagonal are the same. Anotherexample,

discussedinChapter5arepermutation-equivariantlinearmaps.Considera

linear mapWfrom R

to itselfthat is equivariantto permutationsg∈S

acting

bypermutationofcoordinates.Onecanshowthatsuchamaphasonlytwo

parameters (Zaheer et al. 2017):

W=θ

+θ

,(3.31)

whereW

istheidentitymatrix,andW

=11

⊤

–I

isthematrixwith

zero diagonalandone everywhere else.Thus, insteadof computingthelinear

combination, wecould inthis casecopy theweight θ

across thediagonal, and

copy θ

to all off-diagonal entries ofW.

Asnotedinsection3.5andfurtherstudiedinsection3.6.1,itiscommon

in equivariantnetworks touse block-diagonal representations.In thiscase the

constraint for Wsplits into blocks as well:



(g)0

0ρ

(g)



(g)

–1

0ρ

(g)

–1







(3.32)

Theequivariance constraintcanthenbesolvedforeachblock ofWseparately,

sincethe(i,j)-thblockonlyneedstobeequivariantwithrespecttothei-th

block of ρ

and the j-th block of ρ

(g)W

(g)=W

.(3.33)

Foundations of Equivariant Deep Learning95

Furthermore, once we have solved these equations for all blocks, we can form

thefullmatrixW bylinearlycombiningbasisblocksandparameters,which

is more efficient than linearly combining full basis matrices.

3.6.9*Irreducible Representations

We have seenthat wecancombine representationsinto alarger oneby stack-

ingthemblock-diagonally,andwehave seenthatwecanchangethebasisof

arepresentationtoobtainanequivalentone.Usingtheseoperations,wecan

createratheralotofrepresentations.Itisnaturaltoaskwhetherwecango

in the other direction:Given some arbitrary representation,can we change the

basistomakeitblock-diagonal(withthesameblockstructureforallg∈G)?

And could we do this recursively on the blocks, until the representation matri-

ces arediagonal? These questionslead us toone ofthe most centraland useful

concepts in representationtheory, namelythe concept ofirreducible represen-

tations (irreps).Wewill seehow irreps providea systematicunderstanding of

the space representations and the equivariant linear maps between them.

Todefineirreduciblerepresentationswemustfirstdefineinvariantsub-

spaces.Letρbearepresentation inavectorspaceV.Aninvariantsubspace isa

linear subspaceW ⊂Vsuch thatfor allg∈Gand w∈W,we have ρ(g)w∈W.

So, whilevectors w∈Ware notinvariant themselves (ρ(g)w=w in general),

they stayinsidethesubspaceWwhen actedonbyany groupelement,andso

we say that the subspace Wis invariant.

For any representation ρon V, the subspaces W ={0} andW=Vare invari-

ant,sothesearecalledtrivialinvariantsubspaces.Arepresentationwitha

non-trivialinvariantsubspaceiscalledreducible,andonewithoutsucha

subspace is called irreducible.

Ifwehaveareduciblerepresentationρwithnon-trivialinvariantsubspace

W ⊂V,wecanfind achangeofbasissothatρbecomesblockupper-triangular:

′

(g)=Aρ(g)A

–1



(g)ρ

(g)

0ρ

(g)



(3.34)

Avectoroftheform

w=(w,0)willremainofthisform:ρ

′

(g)

w=(ρ

(g)w,0),

i.e. it remains in the subspace W.

Ifitispossibletofindabasiswhereadditionallyρ

(g)=0,i.e.whereρ

′

isblock-diagonal,wesaythatρisdecomposable,andotherwiseitiscalled

indecomposable.Irreduciblerepresentationsarealwaysindecomposable,but

theconverseisnottrueingeneral.However,undersomeadditionalassump-

tions,whichholdformostcasesofpracticalinterestinGDL,theconverse

istrue,whichmeanswecankeepdecomposingarepresentationuntilallthe

blocks are irreducible (and thus indecomposable).In particular, this is true for

96Chapter 3

(complex)representationoffinitegroups,andforfinite-dimensionalunitary

representations.

Since (underbenign conditions)a representationcan bedecomposed intoa

direct sumof irreps, onewould liketo know allirreps for thegroup of interest.

Then,wewouldhavea completeunderstanding ofthe spaceof representations

ofourgroup.Fortunately,theirrepsofmostgroupswecareabouthavebeen

classified, and sousually this is amatter of looking up theirreps in a reference

work suchas VilenkinandKlimyk (1991).Computing theirrepsby handcan

for a group of interest can also be educational

Inchapter6wewillconsidergrids,andgroupsofshiftsactingonthem.

Thesegroups arecommutative,and suchgroupshaveone-dimensional(com-

plex) irreps,sothatany representationofsuchagroupcanbe fullydiagonal-

ized.Takeforinstance thegroupof 1DcyclicshiftsC

(section3.3.3.1), which

hasirrepsof theformρ

(t)=e

2πikt/n

(whereiis theimaginaryunit,t∈C

isa

translation and k∈{0,...,n–1}).

ThosefamiliarwithFourieranalysismayrecognizetheirrepsρ

asthe

basis functionsused inthe Fourier transform,and indeedrepresentation theory

revealsavastgeneralizationofFourieranalysisthatappliestoalargeclass

ofgroups.Eachgrouphasadifferentsetof(inequivalent)irrepsρ

indexed

bysomesetΛcalledthespectrum.Thevaluesλcanbeinterpretedas(gen-

eralized)frequencies,anddependingonthegroup,thesetΛcouldbefinite,

countablyinfinite(e.g.Λ=Z),uncountablyinfinite(e.g.Λ=RorΛ=C).In

chapter 7 we will discuss this topic in more depth.

Thespectrumalsogivesusaconvenientwaytodescribeanarbitrary

representationinabasis-independentway.Ifwehaveafullyreduciblerep-

resentationρofagroupGwithspectrumΛ,wecandecomposeitasadirect

sumofirreps:ρ=

forsomelistoffrequenciesλ

∈Λ.Anevenmore

compactdescriptionistocountthemultiplicitym

,i.e.howofteneachirrep

occursin ρ.Thelistof frequenciesormultiplicitiescan becalledthetype of

ρ.When wewish tobemore precise,wecan alsoinclude thechangeof basis

matrixAthatdecomposesρintoirrepsaspartofthetype,sothatthetype

precisely identifies one representation ρ.

3.6.10*Schur’s Lemma

Knowing how representationsdecompose intoirrepsisveryhelpful inunder-

standingthespaceofequivariantmaps.Wesawinsection3.6.8(eq.(3.32))

thatifwehavetworepresentationsρ

,ρ

thatsplitintoblocksρ

,ρ

,thena

(ρ

,ρ

)-equivariantlinearmapW splitsinto(ρ

,ρ

)-equivariantblocksW

So by decomposing therepresentations into irreps, we canreduce the problem

ofcharacterizing thespaceofequivariantlinearmaps tothecaseofirreps. In

Foundations of Equivariant Deep Learning97

otherwords, ifweknow how toparameterizeequivariantW

,allwe needto

doisstackthemtoformanequivariantW.Schur’slemmatellsushowto

parameterize the space of equivariant linear maps W

between irreps i,j.

Schur’slemmahastwopartswhichdependondifferentconditions,sowe

treat them separately.

Lemma2(SchurI)LetGbeagroup,andletρ

,ρ

beirreduciblerepre-

sentationsofGonvectorspacesV

overafieldK(e.g.realorcomplex

numbers). Then a(ρ

,ρ

)-equivariant linearmap is eitheran isomorphismor

the zero map.

Insimpleterms,assumingweareworkingwithfinite-dimensionalvector

spaces and concrete matrices, to say that representations ρ

,ρ

are isomorphic

or equivalent is to saythat an appropriate change of basis (section 3.6.3) turns

one into the other. That is, W

is invertible. Conversely if the representations

arenotequivalent,thatmeansthereisnoisomorphismbetweenthem,andso

according to Schur any equivariant linear map must bezero. Thus, already we

know thattheblocksW

ofW corresponding toinequivalentirrepsmustbe

zero.

Lemma3(SchurII)letG,ρ

,ρ

beasinSchurI,andassumeadditionally

thatV

andis finitedimensional, andthatKisalgebraically closed(e.g.

complexbutnotrealnumbers).Thena(ρ

,ρ

)-equivariantlinearmapisa

scalar multiple of the identity, i.e. W=cIwhere c∈K.

Thus,atleastinthecaseofcomplexrepresentations,weknowhowto

parameterizetheequivariant linearmaps W

forirrepsi,j bya singlecomplex

parameter!Thesituationforrealrepresentationsisalittlebitmoreinvolved,

but one can show thatthe space of equivariant linear maps between real irreps

iseither 1,2or 4dimension, dependingonwhether thereal representationisof

real,complexorquaternionictype.Thiscanbeascertainedviaamapcalledthe

Frobenius-Schurindicator.MoremathematicaldetailscanbefoundinSerre

1977;Etingofetal.2011.Inpractice,itmaybemostconvenienttocompute

theequivariantlinearmapsforgivenirrepsnumericallyusingthetechnique

presented insection 3.6.8,or usea librarythat hasalready implementedthem

forthegroupofinterest(GeigerandSmidt2022;Geigeretal.2022;Cesa,

Lang, and Weiler 2022).

3.7Summary

To be expanded.

•

Introduced the notion of symmetry in the context of learning problems.

98Chapter 3

•

Symmetriesformagroup,actondataviarepresentations.Algebraicprop-

ertiesofthisgroupcanbeusedtodefineafirst‘blueprint’forequivariant

networks.

•

Nextsteps:‘augment’symmetrieswith‘near-symmetries’leveragingfurther

physical stucture in the problem.

•

Next Chapter augments the algebraic priors with geometric priors.

Exercises

1.Consider a function f:X→Yand assume that it is equivariant, i.e. that:

f(ρ

(g)x)=ρ

(g)f(x),(3.35)

for g any element of some group Gand actions ρ

and ρ

of G.

Provethatiftwoinputshavethesameoutputf(x)=f(x

′

),thenthetransformed

inputs ρ

(g)x and ρ

(g)x

′

also have the same output.

2.Consideralabelfunctionf:X→YbetweenfinitesetsX,Y.Howmanysymme-

tries (invariances) does fhave?

3.Consider an unknownlabel function f:X→Ybetween finite sets, and assume that

we know the full symmetry group Gof f. What is the minimum numberof labelled

examples (x,y) we need to determine f?

4.Consideragroupactionα(g,x)andshowthatthemapf

(x)=α(g,x)isinverseto

(x)=α(g

–1

,x).

5.Consider thecyclic groupC

=e,r,r

,where eis theidentityandr isa90-degree

rotation.Createatablewiththe4groupelementsalongtherowsandcolumns,

andfillentryi,jwiththeproduct(groupcomposition)ofthecorrespondinggroup

elements. You have written the multiplication table for C

6.Givenapermutationσ:[1,...,n]→[1,...,n],how doweproducethe correspond-

ingn×ばつnpermutationmatrixP

?Showthatthecompositionofmapsσ◦σ

′

corresponds to matrix multiplication P

′

7.Showthatgivenmatrixrepresentationsρ

,...,ρ

ofG,themapρ(g)=

block-diag(ρ

,...,ρ

) is also a representation. (Sec. 3.6.1).

8.ConsiderG-invariantfunctionf:X→Y.Wesaythatfiscompleteifforanyx,x

′

forwhichthereexistsnog∈Gsuchthatgx=x

′

,wehavef(x)=f(x

′

).Showthatif

f:X→Yisacompleteinvariantmapping,anyinvariantmappingf

′

:X→Zcan

be written as f

′

(x)=f

′′

(f(x)) for some mapping f

′′

:Y→Z.

9.Show that a group homomorphism maps the identity to the identity.

10.ShowthatthenormalizationoperationY

ncd

√

+ε

definedinSec3.6.7is

equivariant.

Foundations of Equivariant Deep Learning99

Notes

Morespecifically,vectorsareabstractobjectsthancanbeadded(vector

addition)andscaled(scalarmultiplication)accordingtocertainrules.One

oftenhearsvectorsdefinedas"arrows"or"arraysofnumbers";mathematically

speaking, noneof thesedefinitions is correct:the formerrequires an additional

structurecalledinnerproduct(whichdefinesdirection),thelatterrequiresan

additionalassumptionofabasis(inwhichavectorcanberepresentedusing

coordinates).

InthecontextofEuclideangeometry,congruentandisomorphicmeanthe

samething,butcongruentismorecommoninelementarycourseswhereas

isomorphism is applied throughout mathematics.

For example,alllinearmaps preservethestructureof alinearspaceinthat

theypreserve linearrelations: αx+βy=z⇒α(Ax)+β(Ay)=Az. However,

onlytheinvertiblelinearmapsfromavectorspacetoitselfcouldbecalled

symmetries (automorphisms) of that space.

Anaffinespaceislikeavectorspace,exceptthatitdoesnothavea

designated origin 0 that must be preserved by symmetries

Thiskindofnotationiscommonincategorytheoryandisexplainedin

Appendix E.

After the Norwegian mathematician Niels Henrik Abel (1802–1829).

We mostly work finite-dimensional real matrices, but one may also consider

groups of complex matrices, and infinite-dimensional matrices.

Cayley’s theoremis a special caseof a deep result incategory theory called

the Yoneda Lemma.

Theclassificationoffinitesimplegroupsspanstensofthousandofpages

and was completed largely between 1955 and 2004.

Quaternionsarenon-commutativehypercomplexnumbersoftheforma+

bi+cj+dk. They are often used as a description of 3D rotations.

By"preserving vector spacestructure"we meanthatlinear mapsApreserve

linearrelations:αx+βy=z⇒α(Ax)+β(Ay)=Azforscalarsα,βandvectors

x,y,z.

More precisely, ∠(x,y)=cos

–1

⊤

y/∥x∥∥y∥).

Afieldinthiscontextisascalar-,vector-,ortensor-valuedfunctiononsome

geometric domain.

Forexample,physicistswouldoftensaythat"atensorisanobjectthat

transformsasatensor".Inrepresentation-theoreticterms,onecouldsaythat

tensors arequantities that transformaccording to atensor representationof the

general linear group.

100Chapter 3

Recallthatmappingslikeρ(e)andid

areequalifandonlyiftheygive

equaloutputsforallinputs,sotheIdentityaxiomcouldalsobewrittenas

ρ(e)(x)=x for all x∈X.

Moreprecisely, topreservetheserelationsmeans thatgh=k⇒ρ(g)◦ρ(h)=

ρ(k).

Despiteits name,thetrivialrepresentationhas manyimportant uses,since

it is the representation we use for any invariant quantity.

One cangive ageneral definitionusing thecategory-theoreticnotion ofan

"action of a group object internal to a category".

Whenthevectorspaceisinfinitedimensional,asitwillbewhenweconsider

spacesofsignals/functionsonacontinuousdomain,the"matrix"willbe

infinite dimensional as well.

Learningthegrouporitsrepresentationisaninterestingandimportant

active area of research.

In/equivariance inCNNs holdsonly upto bordereffectsand aliasingunless

specifically addressed;see Azulay andWeiss 2019;Zhang 2019; Vasconcelos

et al.2021; Karras etal. 2021; Weiler and Cesa2019a; Cesa, Lang,and Weiler

2022

Onecanalsoconsiderrepresentationsbysignedpermutationmatri-

ces,whicharematriceswithasingle1or–1ineachrowandcolumn.

Forsuchrepresentations,onecanusetheCReLUnonlinearityCReLU(x)=

(ReLU(x),ReLU(–x)).SeeTacoSCohenandWelling(2017a)andShanget

al. (2016)

Inmathematics andphysics, oneusuallyconsiderstensor productsofvec-

torsanddualvectors(whichhavedifferentrepresentationsassociatedwith

them), leadingto tensors ofmixed type,with somenumber of upperand lower

indices. Here we will not consider dual vectors, though.

Hereweassumethatallgeometricfeaturevectorsareofthesamedimen-

sion and have the samerepresentation. The approach can be generalizedto the

casewherewehaveC

copiesofvariousrepresentationsρ

byapplyingthe

calculation for each λseparately.

Oneapproachwouldbetoexplicitlyconstructandthendecomposethe

regular representation (Gacting onthe space of functions on G). According to

thePeter-Weyltheorem,undersuitableconditions,theregularrepresentation

decomposesintoadirectsumofallirreps,withmultiplicityequaltotheir

dimension.