Grids

•

Gridsareacommon,particulartypeofgraphs,thatcomewithadditional

geometric priors.

•

These priorsare capturedby thetranslation group—enablingthe useof

oriented features.

•

We describehowFourieranalysis provides thenatural toolto fullychar-

acterise translation-equivariantmaps, and we present canonical instances

including convolutional and recurrent neural networks.

•

Going beyond pure translations,we demonstrate howstudying equivari-

ancetotimewarpingenforcestheusageofgatinginRNNs,leadingto

the popular long short-term memory model.

Thesecondtypeofobjectsweconsideraregrids.Itisfairtosaythatthe

impactofdeeplearninghasbeenparticularlydramaticincomputervision,

naturallanguageprocessing,andspeechrecognition.Theseapplicationsall

share a geometric common denominator: an underlying grid structure.

Gridsareaparticularcaseofgraphs,buttheycomewithspecificstructure

thatsetsthemapart.Onthe onehand,grids comewithacanonicalarrangement

ofnodesthatmakethepermutationinvariancefromgeneralgraphsunneces-

sary.Ontheotherhand,machinelearningproblemsdefinedovergridsoften

exhibitastronggeometricprior,encodedbythetranslationgroup:anobject

identitydoesnotdependonitsabsolutespatiallocation,oranaminoacid

sequenceencodessimilarstructuresindependentlyofthelocationoftheir

associated genes in the genome.

Thetranslationgroupprovidesaparticularlysimpleinstantiationofthe

GDLblueprint,thankstotheharmonicanalysisperspectivebrought byFourier

analysis. Indeed, since the translation group is commutative, we can explicitly

characterizetheclassoflinearequivariantmaps(thatis,convolutions)using

asharedeigendecomposition—preciselytheFouriertransform.Moreover,

gridsare discretizationsoflow-dimensionalphysical domains,wherelocality

is properly defined.

168Chapter 6

Figure 6.1

An illustration of the one-dimension grid, using boundary conditions.

Inthischapter,wewillfirstderivethefundamentallinksbetweenconvo-

lutionsandFouriertransforms.Thenwewilldiscussthedistinctionbetween

thegraphandthegrid formalism,andfinallymove todescriberepresentative

architectures for image and sequence applications.

6.1Domain

A grid Dof size Nin s dimensionscan be best viewed asthe set Z

⊗s

, contain-

ings-tuples ofintegers (j

,...,j

)modulo N.It willbeconvenientto view the

domainashavingno boundary,andtointerpretthegridisadiscretizationofan

underlyingtorusT

.Thisdomain isitselfagroup,withtheoperation

inherited bythe additivemodular structureof eitherZ

or R/Z:givenu,v∈D,

thepointu+viswell-definedinD.Wedenotethisgroupasthetranslation

group.

LetusnowinvestigatehowtheGDLblueprintinstantiatesinthissetting.

For ease of exposition, we will focus on the one-dimensional s=1 setting.Let

usrecallthat thekeyquestionsweneedtoaddress are:(i)characterizetheclass

of linear equivariants associatedwiththe symmetry group,and(ii) implement

locality.

CirculantmatricesandConvolutionsWecanviewtheone-dimensional

gridasaringgraph(illustratedinFigure6.1),withnodesindexedby

0,1,...,N –1moduloN,and theadjacency matrixwith elementsa

u,u+1modN

1 and zero otherwise.

Asmentionedearlier,whileitistemptingtoreduceourselvestothegen-

eralgraphformulationseeninChapter5,therearetwomaindifferencesthat

setourpresentsettingapart.First,eachnodeuhasidenticalconnectivity,to

itsneighboursu–1andu+1,andthusstructure-wiseindistinguishablefrom

theothers.Inthelanguageofgrouptheory,wesaythatthegridisahomo-

geneousspace:Foranytwopointsuandvinthedomain,thereisagroup

elementg∈Gthatmapsutog.u=v.Next,sincethenodesofthegridhave

afixedordering,wealsohaveafixedorderingoftheneighbours:wecan

callu–1the‘leftneighbour’and u+1the‘rightneighbour’.More generally,

thegridallowstodefineorientation,inawaythatitisgloballyconsistent

acrossallneighborhoods.Ifweuseourpreviousapproachfordesigninga

Grids169

equivariantfunctionFusingalocalaggregationfunctionφ,wenowhave

f(x

)=φ(x

u–1

u+1

)ateverynodeofthegrid:φdoesnotneedtobe

permutation invariant anymore. For a particularchoice of alinear transforma-

tion φ(x

u–1

u+1

)=θ

–1

u–1

+θ

u+1

, we can write F(X) asa matrix

product,

F(X)=







–1













n–2

n–1







Notethisveryspecialmulti-diagonalstructurewithoneelementrepeated

alongeach diagonal,sometimesreferred toas "weightsharing" inthemachine

learning literature.

Moregenerally,given avectorθ=(θ

,...,θ

n–1

),a circulant matrixC(θ)=

(θ

u–vmod n

) isobtained by appendingcircularly shifted versions of thevectorθ.

Circulant matrices are synonymous with discrete convolutions,

(x⋆θ)

n–1

v=0

vmod n

(u–v)mod n

asonehasC(θ)x=x⋆θ.Aparticularchoiceofθ=(0,1,0,...,0)

⊤

yieldsa

specialcirculantmatrixthatshiftsvectorstotherightbyoneposition.This

matrix is called the (right) shift or translation operator and denoted by S.

Circulant matricescan becharacterised bytheir commutativity property:the

productofcirculantmatricesiscommutative,i.e.C(θ)C(η)=C(η)C(θ)for

anyθandη.Sincetheshiftisacirculantmatrix,wegetthefamiliartranslation

or shift equivariance of the convolution operator,

SC(θ)x=C(θ)Sx.

Suchcommutativitypropertyshouldnotbesurprising,sincetheunderlying

symmetrygroup(thetranslationgroup)isAbelian.Moreover,theopposite

directionistrueaswell,i.e.amatrixiscirculantiffitcommuteswithshift.

This,inturn,allowsustodefineconvolutionasatranslationequivariantlin-

earoperation,andisaniceillustrationofthepowerofgeometricpriorsand

theoverallphilosophyofGeometricML:convolutionemergesfromthefirst

principle of translational symmetry.

Notethatunlikethesituationonsetsandgraphs,thenumberoflinearly

independentshift-equivariantfunctions(convolutions)growswiththesizeof

the domain (sincewe have one degree of freedomin each diagonal ofa circu-

lantmatrix).However,thescaleseparationpriorguaranteesfilters canbelocal,

170Chapter 6

resultingin thesameΘ(1)-parameter complexity perlayer,aswe willverifyin

Section6.2.1whendiscussingtheuseof theseprinciplesintheimplementation

of Convolutional Neural Network architectures.

DerivationofthediscreteFouriertransformInChapter4weintroduced

Fourier transforms, and we have already mentioned its connection to convolu-

tion: the fact thatthe Fourier transform diagonalises theconvolution operation

isanimportantpropertyusedinsignalprocessingtoperformconvolutionin

thefrequencydomainasanelement-wiseproductoftheFouriertransforms.

Let us verify this important property in our setting of one-dimensional grids.

Forthispurpose,recallafactfromlinear

algebrathat(diagonalisable)

matrices arejoinly diagonalisable iff they mutually commute.Inother words,

thereexistsa commoneigenbasisfor allthecirculantmatrices,inwhichthey

differonlyby theireigenvalues.Wecan thereforepickonecirculant matrixand

computeitseigenvectors—weareassuredthatthesewillbetheeigenvectors

of allother circulant matricesas well. Itis convenienttopick theshift operator

S, for which the eigenvectors happen to be the discrete Fourier basis

√



1,e

2πik

4πik

,...,e

2πi(N–1)k



⊤

,k=0,1,...,N –1,

whichwecanarrangeintoanN ×ばつNFouriermatrixΦ=(φ

,...,φ

N–1

Indeed, we verify directly that Sφ

2πi/N

Multiplicationby Φ

∗5

gives theDiscreteFourier Transform(DFT), andby

Φthe inverse DFT,

√

N–1

u=0

–

2πiku

√

N–1

k=0

2πiku

Since allcirculantmatricesare jointlydiagonalisable,

they arealso diago-

nalised by the Fourier transform and differ only in their eigenvalues. Since the

eigenvalues of the circulantmatrixC(θ)aretheFourier transform ofthefilter

(see e.g. Bamieh 2018),

θ=Φ

∗

θ, we obtain the Convolution Theorem:

C(θ)x=Φ







n–1







∗

x=Φ(

θ⊙

Because the Fourier matrix Φhas a special algebraicstructure, the products

⋆

xandΦxcan becomputed withO(nlogn)complexityusingaFastFourier

Transform (FFT) algorithm. This is one of the reasons why frequency-domain

filteringissopopularinsignalprocessing;furthermore,thefilteristypically

designed directly in the frequency domain, so the Fourier transform

θis never

explicitly computed.

Grids171

BesidesthedidacticvalueofthederivationoftheFouriertransformandcon-

volution we have done here,it provides a scheme togeneralise these concepts

tographs. Realisingthattheadjacency matrixofthering graphisexactly the

shiftoperator, onecan candevelopthegraph Fouriertransform andananalogy

oftheconvolutionoperatorbycomputingtheeigenvectorsoftheadjacency

matrix (see e.g. Sandryhaila and Moura 2013).

EarlyattemptstodevelopgraphneuralnetworksbyanalogytoCNNs,some-

timestermed‘spectralGNNs’,exploitedthisexactblueprint.

Wewillseein

Sections8thatthisanalogyhassomeimportantlimitations.Thefirstlimita-

tioncomesfromthefactthatagridisfixed,andhenceallsignalsonitcan

berepresentedinthesameFourierbasis.Incontrast,ongeneralgraphs,the

Fourier basis depends on the structure of the graph. Hence, we cannot directly

compareFouriertransformsontwodifferentgraphs—aproblemthattrans-

latedintoalackofgeneralisationinmachinelearningproblems.Secondly,

multi-dimensionalgrids,whichareconstructedastensorproductsofone-

dimensionalgrids,retaintheunderlyingstructure:theFourierbasiselements

andthecorrespondingfrequencies(eigenvalues)canbe organisedinmultiple

dimensions.Inimages,forexample,wecannaturallytalkabouthorizontaland

vertical frequencyand filtershavea notionofdirection. Ongraphs,thestruc-

tureoftheFourierdomainisone-dimensional,aswecanonlyorganisethe

Fourierbasis functions bythe magnitude ofthe correspondingfrequencies. As

a result, graph filters are oblivious of direction orisotropic.

Derivation ofthe continuousFourier transformWecan revisit theprevi-

ousanalysisinthecontinuoussettingwherethegridisreplacedbythereal

line. Like inChapter 4, weconsiderfunctions defined onΩ=Randthe trans-

lation operator (S

f)(u)=f(u–v) shifting fby some position v. Applying S

theFourierbasisfunctionsφ

(u)=e

iξu

yields,byassociativityof theexponent,

iξu

iξ(u–v)

–iξv

iξu

i.e., φu

(u) is the complex eigenvector of S

with the complex eigenvalue e

–iξv

–exactlymirroringthesituationwehadinthediscretesetting.SinceS

isa

unitary operator(i.e., ∥S

x∥

=∥x∥

for any pand x∈L

(R)), any eigenvalueλ

must satisfy |λ|=1, which corresponds preciselyto the eigenvalues e

–iξv

found

above.Moreover,thespectrumofthetranslationoperatorissimple,meaning

thattwofunctionssharingthesameeigenvaluemustnecessarilybecollinear.

Indeed, supposethat S

f=e

–iξ

fforsome ξ

. Takingthe Fourier transformin

both sides, we obtain

∀ξ,e

–iξv

f(ξ)=e

–iξ

f(ξ) ,

which implies that

f(ξ)=0 for ξ=ξ

, thus f=αφ

172Chapter 6

ForagenerallinearoperatorCthatistranslationequivariant(S

C=CS

we have

iξu

=CS

iξu

–iξv

iξu

implying thatCe

iξu

is alsoan eigenfunction

of S

with eigenvalue e

–iξv

, from

whereit follows fromthesimplicity ofspectrumthat Ce

iξu

=βφ

(u);in other

words,theFourier basisistheeigenbasisofalltranslationequivariantopera-

tors.Asaresult,CisdiagonalintheFourierdomainandcanbeexpressed

asCe

iξu

(ξ)e

iξu

,where

(ξ)isatransferfunctionactingondifferent

frequencies ξ. Finally, for an arbitrary function x(u), by linearity,

(Cx)(u)=C

+∞

–∞

x(ξ)e

iξu

dξ=

+∞

–∞

x(ξ)

(ξ)e

iξu

dξ

+∞

–∞

(v)x(u–v)dv=(x⋆p

)(u),

wherep

(u)isthe inverseFouriertransformof

(ξ). Itthus followsthatevery

linear translation equivariant operator is a convolution.

6.2Model

Nowthatwehavecompletelycharacterizedtheclassoflinearequivariant

maps,givenbyconvolutionsx7→x⋆h,andthatwecanfurthermoreimpose

localityofthesemapsbysimplyconstrainingthesupportofthefilterh,we

candirectlydeploytheGDLblueprintfromChapter4,yieldingthehouse-

holdCNNs(ConvolutionalNeuralNetworks)andRNNs(recurrentneural

networks) architectures.

6.2.1Convolutional Neural Networks

Convolutional NeuralNetworks areperhapsthe earliestand mostwellknown

exampleofdeeplearningarchitecturesfollowingtheblueprintofGeometric

DeepLearning.InSection6.1wehavefullycharacterisedtheclassoflinear

and local translation equivariant operators, given by convolutions C(θ)x=x⋆

θwithalocalisedfilterθ

.Letusfirstfocusonscalar-valued(‘single-channel’

or ‘grayscale’)discretised images,where the domainis thegrid Ω=[H]×ばつ[W]

with u=(u

) and x∈X(Ω,R).

AnyconvolutionwithacompactlysupportedfilterofsizeH

×ばつW

canbe

written as alinearcombination of generators θ

1,1

,...,θ

, given for exam-

pleby theunit peaksθ

)=δ(u

–v,u

–w). Anylocal linearequivariant

map is thus expressible as

F(x)=

v=1

w=1

C(θ

)x,(6.75)

Grids173

0111000

0011100

0001110

0001100

0011000

0110000

1100000

101

010

101

C(θ)

z}|{

+θ

14341

12433

12341

13311

33110

x?θ

101

010

101

×ばつ1×ばつ0×ばつ1

×ばつ0×ばつ1×ばつ0

×ばつ1×ばつ0×ばつ1

Figure 6.2

TheprocessofconvolvinganimagexwithafilterC(θ).Thefilterparametersθcan

be expressed as a linear combination of generators θ

which, incoordinates, correspondsto thefamiliar 2Dconvolution(seeFigure

6.2 for an overview):

F(x)

a=1

b=1

u+a,v+b

.(6.76)

Otherchoicesofthebasisθ

arealsopossibleandwillyieldequivalent

operations(forpotentiallydifferentchoicesofα

).Apopularexampleare

directionalderivatives:θ

)=δ(u

)–δ(u

–v,u

–w),(v,w)=(0,0)

takentogetherwiththelocalaverageθ

.Infact,directional

derivativescanbeconsideredagrid-specificanalogueofdiffusionprocesses

ongraphs,whichwerecoverifweassumeeachpixel tobeanodeconnected

to its immediate neighbouring pixels in the grid.

Whenthe scalarinputchannelis replacedbymultiplechannels (e.g.,RGB

colours,ormoregenerallyanarbitrarynumberoffeaturemaps),thecon-

volutionalfilterbecomesaconvolutionaltensorexpressingarbitrarylinear

combinationsofinputfeaturesintooutputfeaturemaps.Incoordinates,this

can be expressed as

F(x)

uvj

a=1

b=1

c=1

jabc

u+a,v+b,c

,j∈[N] ,(6.77)

where M and Nare respectively thenumber ofinput andoutput channels.This

basicoperationencompassesabroadclassofneuralnetworkarchitectures,

which, as we will show in the next section, have had a profound impactacross

manyareasofcomputervision,signalprocessing,andbeyond.Here,rather

174Chapter 6

−3−2−10123

Figure 6.3

ReLU,oftenconsidereda‘modern’architecturalchoice,wasalreadyusedinthe

Neocognitron(Fukushima 1980).Rectificationis equivalenttotheprincipleofdemodu-

lation,which isfundamentalinelectricalengineeringas thebasisformanytransmission

protocols,suchas FMradio; andalso hasaprominentrole inmodels forneuronal activ-

ity.

than dissecting the myriad ofpossiblearchitectural variants of CNNs, we pre-

fer to focus onsome of the essential innovations that enabled their widespread

use.

EfficientmultiscalecomputationAsdiscussedintheGDLtemplatefor

generalsymmetries,extractingtranslationinvariantfeaturesoutofthecon-

volutionaloperatorFrequiresanon-linearstep;e.g.theobiquituousReLU

activationillustratedinFigure6.3.Convolutionalfeaturesareprocessed

through anon-linear activation functionσ, actingelement-wise on theinput—

i.e.,σ:X(Ω)→X(Ω),asσ(x)(u)=σ(x(u)).Perhapsthemostpopularexample

atthetimeofwritingistheRectifiedLinearUnit(ReLU):σ(x)=max(x,0).

Thisnon-linearity effectivelyrectifiesthe signals,pushing theirenergytowards

lowerfrequencies,andenablingthecomputationofhigh-orderinteractions

across scales by iterating the construction.

Alreadyin theearlyworksofFukushima1980andLeCunetal.1998, CNNs

and similar architectures had a multiscale structure, where after each convolu-

tionallayer(6.77)oneperformsagridcoarseningP:X(Ω)→X(Ω

′

),where

thegridΩ

′

hascoarserresolutionthanΩ.Thisenablesmultiscalefilterswith

effectivelyincreasingreceptivefield,yetretainingaconstantnumberofparam-

etersperscale.Several signalcoarsening strategiesP(referredtoaspooling)

maybeused, themostcommonareapplying alow-passanti-aliasingfilter(e.g.

local average) followed by grid downsampling, or non-linear max-pooling.

Insummary,a‘vanilla’CNNlayer canbeexpressedas thecompositionof

thebasicobjectsalreadyintroducedinourGeometricDeepLearningblueprint:

h=P(σ(F(x))) ,(6.78)

Grids175

i.e.an equivariant linearlayer F,a coarseningoperationP, andanon-linearity

σ.Itisalsopossibletoperformtranslationinvariantglobalpoolingopera-

tionswithinCNNs.Intuitively,thisinvolveseachpixel—which,afterseveral

convolutions, summarises apatch centeredaround it—proposing the final rep-

resentationoftheimage

,withtheultimatechoicebeingguidedbyaform

ofaggregation oftheseproposals.Apopularchoicehereistheaveragefunc-

tion, asits outputswill retainsimilar magnitudesirrespectiveof theimage size

(Springenberg et al. 2014).

Prominentexamplesfollowing thisCNNblueprint (someofwhich wewill

discuss next) are displayed in Figure 6.4.

DeepandResidualNetworksACNNarchitecture,initssimplestform,

isthereforespecifiedbyhyperparameters(H

)

k≤K

,withM

k+1

andp

=0,1indicatingwhethergridcoarseningisperformedornot.While

allthesehyperparametersareimportantinpractice,aparticularlyimportant

questionistounderstandtheroleofdepthKinCNNarchitectures,andwhat

arethe fundamentaltradeoffs involvedinchoosing suchakeyhyperparameter,

especially in relation to the filter sizes (H

While a rigorous answer to this question isstill elusive, mounting empirical

evidence collectedthroughouttherecentyearssuggestsafavourabletradeoff

towardsdeeper(largeK)yetthinner(small(H

))models

.Inthiscontext,

acrucialinsightbyHeetal.2016wastoreparametriseeachconvolutional

layertomodelaperturbationofthepreviousfeatures,ratherthanageneric

non-linear transformation:

h=P

(

x+σ(F(x))

)

.(6.79)

Theresultingresidualnetworksprovideseveralkeyadvantagesoverthepre-

viousformulation.Inessence,theresidualparametrisationisconsistentwith

theviewthatthedeepnetworkisadiscretisationofanunderlyingcontinu-

ous dynamical system,modelled as anordinary differential equation (ODE)

Crucially,learningadynamicalsystembymodelingitsvelocityturnsoutto

bemucheasierthanlearningitspositiondirectly.Inourlearningsetup,this

translates intoan optimisationlandscapewithmore favorablegeometry, lead-

ingto theabilitytotrain muchdeeperarchitecturesthanwaspossiblebefore.

Aswillbediscussedinfuturework,learningusingdeepneuralnetworks

definesanon-convexoptimisationproblem,whichcanbeefficientlysolved

usinggradient-descentmethodsundercertainsimplifyingregimes.Thekey

advantage of the ResNet parametrisation has been rigorouslyanalysedinsim-

plescenarios(HardtandMa2016),andremainsanactiveareaoftheoretical

investigation. Finally, NeuralODEs (R. T.Chen et al.2018) area recentpopu-

lararchitecturethatpushestheanalogywithODEsevenfurther,bylearningthe

120

SOFT

224

256

384

256

4096

1000

SOFT

Input

3x3

Conv

1x1

BatchNorm

ReLU

128

Averagepooling

SOFT

6464

128128

I/2

256256

I/4

512512

I/8

10241024

I/16

BottleneckConv

512512512512

I/8

256256256256

I/4

128128128128

I/2

64646464

Softmax

Figure 6.4

Prominentexamplesof CNNarchitectures. Top-to-bottom:LeNet (LeCunetal. 1998),

AlexNet(Krizhevsky,Sutskever, andHinton 2012),ResNet (He etal. 2016) andU-Net

(Ronneberger,Fischer,and Brox2015).Drawn usingthe PlotNeuralNetpackage (Iqbal

2018).

Grids177

parametersof theODE

x=σ(F(x))directly andrelying onstandardnumerical

integration.

NormalisationAnotherimportantalgorithmicinnovationthatboostedthe

empiricalperformanceofCNNssignificantlyisthenotionofnormalisation.In

early modelsof neural activity, itwashypothesised thatneurons perform some

formoflocal‘gaincontrol’,wherethelayercoefficientsx

arereplacedby

̃x

=σ

–1

⊙(x

–μ

).Here, μ

andσ

encodethe firstandsecond-ordermoment

information ofx

, respectively. Further, they canbe eithercomputedglobally

or locally.

In the context ofDeepLearning, this principle was widelyadopted through

the batchnormalisation layer (Ioffe and Szegedy2015)

, followed by several

variants(Ba,Kiros,andHinton2016;SalimansandKingma2016;Ulyanov,

Vedaldi,andLempitsky2016;Cooijmansetal.2016;WuandHe2018).

Despitesomeattemptstorigorouslyexplainthebenefitsofnormalisationin

termsofbetterconditionedoptimisationlandscapes(Santurkaret al.2018),a

general theorythat canprovideguiding principlesis stillmissing at thetime of

writing.

DataaugmentationWhileCNNsencodethegeometricpriorsassociated

with translation invariance and scaleseparation, they do notexplicitly account

for other known transformations that preserve semantic information, e.g light-

ing orcolor changes,or smallrotationsand dilations.A pragmaticapproach to

incorporate these priors with minimal architectural changes is to perform data

augmentation, where one manually performs said transformationstotheinput

images and adds them into the training set.

Data augmentationhas been empiricallysuccessful and iswidely used—not

onlytotrainstate-of-the-artvisionarchitectures,butalsotopropupseveral

developmentsinself-supervisedandcausalrepresentationlearning(T.Chen

etal.2020;Grilletal.2020;Mitrovicetal.2020).However,itisprovably

sub-optimalintermsofsamplecomplexity(Mei,Misiakiewicz,andMonta-

nari 2021;Bietti, Venturi,and Bruna 2021);a moreefficient strategy considers

instead architectures with richer invariance groups, as discussed in Chapter 7.

6.2.2Recurrent Neural Networks

Ourdiscussionhasthusfaralwaysassumedtheinputstobesolelyspatial

acrossagivendomain.However,inmanycommonusecases,theinputscan

alsobeconsideredsequential(e.g.video,textorspeech).Inthiscase,we

assume that the inputconsistsof arbitrarily many steps, wherein at each stept

we are provided with an input signal, which we represent as X

(t)

∈X(Ω

(t)

178Chapter 6

Whileingeneralthedomaincanevolveintimetogetherwiththesignals

onit,itistypicallyassumedthatthedomainiskeptfixedacrossallthet,i.e.

(t)

=Ω.Here,we will exclusively focus onthiscase, but note that exceptions

arecommon. Socialnetworks areanexample whereoneoftenhastoaccount

forthedomainchangingthroughtime,asnewlinksareregularlycreatedas

wellaserased.Thedomaininthissettingisoftenreferredtoasadynamic

graph (D. Xu et al. 2020; Rossi, Chamberlain, et al. 2020).

Often,theindividualX

(t)

inputswillexhibitusefulsymmetriesandhence

maybenontriviallytreatedbyanyofourpreviouslydiscussedarchitectures.

Somecommonexamplesinclude:videos(Ωisafixedgrid,andsignalsarea

sequence of frames);fMRI scans (Ωis a fixedmesh representing the geometry

of the brain cortex, where different regions are activated at different times as a

responsetopresentedstimuli);andtrafficflownetworks(Ωisafixedgraphrep-

resenting the road network, on which e.g. the average traffic speed is recorded

at various nodes).

Letusassumeanencoderfunctionf(X

(t)

)providinglatentrepresentations

atthelevelofgranularityappropriatefortheproblemandrespectfulofthe

symmetriesoftheinputdomain.Asanexample

,considerprocessingvideo

frames:thatis,ateachtimestep,wearegivenagrid-structuredinputrep-

resentedasann×ばつdmatrixX

(t)

,wherenisthenumberofpixels(fixedin

time) anddisthe numberofinputchannels(e.g.d=3forRGBframes). Fur-

ther,weareinterestedinanalysisatthelevelofentireframes,inwhichcase

itisappropriatetoimplementfasatranslationinvariantCNN,outputtinga

k-dimensional representation z

(t)

=f(X

(t)

) of the frame at time-step t.

Wearenowleftwiththetaskofappropriatelysummarisingasequenceof

vectorsz

(t)

acrossallthesteps.Acanonicalwaytodynamicallyaggregate

this informationin away thatrespects thetemporal progressionof inputsand

alsoeasilyallowsforonlinearrivalofnoveldata-points,isusingaRecurrent

Neural Network(RNN).

What wewillshow hereisthatRNNsareaninter-

esting geometric architecture tostudy in their own right,since they implement

a rather unusual type of symmetry over the inputs z

(t)

SimpleRNNsAteachstep,therecurrentneuralnetworkcomputesanm-

dimensionalsummaryvectorh

(t)

ofalltheinputstepsuptoandincludingt.

This(partial)summaryiscomputedconditionalonthe currentstep’sfeatures

andthepreviousstep’ssummary,throughasharedupdatefunction,R:R

×ばつ

→R

, as follows (see Figure 6.5 for a summary):

(t)

=R(z

(t)

(t–1)

)(6.80)

and, asboth z

(t)

and h

(t–1)

are flat vector representations, Rmay bemost easily

expressedasasinglefully-connectedneuralnetworklayer(oftenknownas

Grids179

(0)

RRRR

...

(1)

(2)

(3)

(4)

(3)

(2)

(1)

ffff

(1)

(2)

(3)

(4)

(1)

(2)

(3)

...

Figure 6.5

IllustrationofprocessingvideoinputwithRNNs.EachinputvideoframeX

(t)

processedusingasharedfunctionf—e.g.atranslationinvariantCNN—intoaflat

representationz

(t)

.ThentheRNNupdatefunctionRisiteratedacrossthesevectors,

iterativelyupdatingasummaryvectorh

(t)

whichsummarisesalltheinputsuptoand

includingz

(t)

.Thecomputationisseededwithaninitialsummaryvectorh

(0)

,which

may be either pre-determined or learnable.

SimpleRNN

; see Elman (1990) and Jordan (1997)):

(t)

=σ(Wz

(t)

+Uh

(t–1)

+b)(6.81)

whereW∈R

k×ばつm

,U∈R

m×ばつm

andb∈R

arelearnableparameters,andσis

anactivationfunction.Whilethisintroducesloopsinthenetwork’scompu-

tationalgraph,inpracticethenetworkisunrolledforanappropriatenumber

ofsteps,allowingforbackpropagationthroughtime(RobinsonandFallside

1987; Werbos 1988; Mozer 1989) to be applied.

Thesummaryvectorsmaythenbeappropriatelyleveragedforthedown-

streamtask—ifapredictionisrequiredateverystepofthesequence,thena

180Chapter 6

sharedpredictor maybe appliedtoeachh

(t)

individually. Forclassifying entire

sequences,typicallythefinalsummary,h

(T)

,ispassedtoaclassifier.Here,T

is the length of the sequence.

Specially, theinitial summary vector is usually either settothe zero-vector,

i.e.h

(0)

=0,oritismadelearnable.Analysingthemannerinwhichtheini-

tialsummaryvectorissetalsoallowsustodeduceaninterestingformof

translation equivariance exhibited by RNNs.

TranslationequivarianceinRNNsSinceweinterprettheindividualsteps

tasdiscretetime-steps,theinputvectorsz

(t)

canbeseenaslivingona

one-dimensional

gridoftime-steps.Whileitmightbeattractivetoattempt

extending ourtranslationequivariance analysisfromCNNshere,itcannotbe

done in a trivial manner.

Toseewhy, letus assumethatwehaveproduced anewsequencez

′(t)

(t+1)

byperformingaleft-shiftofoursequencebyonestep.Itmightbetempting

toattemptshowingh

′(t)

(t+1)

,asoneexpectswithtranslationequivariance;

however,thiswillnotgenerallyhold.Considert=1;directlyapplyingand

expanding the update function, we recover the following:

′(1)

=R(z

′(1)

(0)

)=R(z

(2)

(0)

)(6.82)

(2)

=R(z

(2)

(1)

)=R(z

(2)

,R(z

(1)

(0)

))(6.83)

Hence,unlesswecanguaranteethath

(0)

=R(z

(1)

(0)

),wewillnotrecover

translation equivariance. Similar analysis can then be done for steps t>1.

Fortunately,withaslightrefactoringofhowwerepresentz,andfora

suitablechoiceofR,itispossibletosatisfytheequalityabove,andhence

demonstrateasettinginwhichRNNsareequivarianttoshifts.Ourprob-

lemwas largely oneofboundary conditions:theequality above includesz

(1)

whichourleft-shiftoperationdestroyed.Toabstractthisproblemaway,we

willobservehowanRNNprocessesanappropriatelyleft-paddedsequence,

(t)

, defined as follows:

(t)

(

0t≤t

′

(t–t

′

)

t>t

′

Suchasequencenowallowsforleft-shifting

byuptot

′

stepswithoutdestroy-

ing anyof theoriginal inputelements. Further, notewe donot needto handle

right-shiftingseparately; indeed,equivariance torightshifts naturallyfollows

from the RNN equations.

WecannowagainanalysetheoperationoftheRNNoveraleft-shifted

versonof

(t)

,whichwedenoteby

′(t)

(t+1)

,aswedidinEquations

Grids181

6.82–6.83:

′(1)

=R(

′(1)

(0)

)=R(

(2)

(0)

)

(2)

=R(

(2)

(1)

)=R(

(2)

,R(

(1)

(0)

))=R(

(2)

,R(0,h

(0)

))

wherethesubstitution

(1)

=0holdsaslongast

′

≥1,i.e.aslongasany

paddingisapplied

.Now,wecanguaranteeequivariancetoleft-shiftingby

one step (h

′(t)

(t+1)

) as long as h

(0)

=R(0,h

(0)

Said differently, h

(0)

must be chosen to be a fixed pointof a function γ(h)=

R(0,h).Ifthe update function Ris conveniently chosen, then notonly can we

guarantee existence of such fixed points, but we can even directly obtain them

by iterating the application of R until convergence; e.g., as follows:

=0h

k+1

=γ(h

),(6.84)

wheretheindexkreferstotheiterationofRinourcomputation,asopposed

tothetheindex(t)denotingthetimestepoftheRNN.IfwechooseRsuch

thatγisacontractionmapping

,suchaniterationwillindeedconvergeto

auniquefixedpoint.Accordingly,wecantheniterateEquation(6.84)until

k+1

, and we can seth

(0)

. Note that this computationisequivalent to

left-padding the sequence with "sufficiently many" zero-vectors.

Depth inRNNsItisalso easyto stackmultipleRNNs—simply usethe h

(t)

vectorsasaninputsequenceforasecondRNN.Thiskindofconstruction

isoccasionallycalleda"deepRNN",whichispotentiallymisleading.Effec-

tively, dueto the repeated applicationofthe recurrent operation, even a single

RNN "layer" has depth equal to the number of input steps.

Thisoftenintroducesuniquelychallenginglearningdynamicswhenopti-

misingRNNs,aseachtrainingexampleinducesmanygradientupdatesto

thesharedparametersoftheupdatenetwork.Herewewillfocusonperhaps

themostprominentsuchissue—thatofvanishingandexplodinggradients

(Bengio,Simard,andFrasconi1994)—whichisespeciallyproblematicin

RNNs,giventheirdepthandparametersharing.Further,ithassingle-handedly

spurredsomeofthemostinfluentialresearchonRNNs.Foramoredetailed

overview,wereferthereadertoPascanu,Mikolov,andBengio(2013),who

have studied the training dynamics ofRNNs in great detail, and exposed these

challenges from a variety of perspectives: analytical, geometrical, and thelens

of dynamical systems.

Toillustratevanishinggradients,consideraSimpleRNNwithasigmoidal

activationfunctionσ,whosederivativemagnitude|σ

′

|isalwaysbetween0

and1;seeFigure6.6.Multiplyingmanysuchvaluesresultsingradientsthat

182Chapter 6

−6−5−4−3−2−10123456

0.2

0.4

0.6

0.8

Figure 6.6

Illustrationof thelogistic function,σ(x)=

1+exp(–x)

.Another popularchoice isthe hyper-

bolictangent,σ(x)=tanhx.TheyarecalledsigmoidalduetothedistinctS-shapeof

their plots.

quicklytendtozero,implyingthatearlystepsintheinputsequencemaynot

be able to have influence in updating the network parameters at all.

Forexample,considerthenext-wordpredictiontask(commonine.g.predic-

tive keyboards), and theinput text "Petar is Serbian. He was bornon ...[long

paragraph]...Petarcurrentlylivesin".Here,predictingthenext

wordas"Serbia"mayonlybereasonablyconcludedbyconsideringthevery

startoftheparagraph—butgradientshavelikelyvanishedbythetimethey

reach this input step, making learning from such examples very challenging.

Deepfeedforwardneuralnetworkshavealsosufferedfromthevanishing

gradient problem, until theinvention ofthe ReLU activation(which has gradi-

ents equalto exactly zeroor one—thusfixing thevanishinggradient problem).

However,in RNNs,usingReLUsmay easilyleadtoexplodinggradients,as the

outputspaceoftheupdatefunctionisnowunbounded,andgradientdescent

will updatethe cellonce forevery input step,quickly building upthe scaleof

theupdates.Historically,thevanishinggradientphenomenonwas recognised

early onas a significantobstacle in theuse of recurrentnetworks. Copingwith

thisproblemmotivatedthedevelopmentofmoresophisticatedRNNlayers,

which we describe next.

6.3Case Study: Long Short-Term Memory

Akeyinventionthatsignificantlyreducedtheeffectsofvanishinggradients

in RNNs is thatof gating mechanisms, which allow the network to selectively

overwriteinformationinadata-drivenway.Prominentexamplesofthesegated

RNNs includethe Long Short-TermMemory (LSTM; Hochreiterand Schmid-

huber(1997))andtheGatedRecurrentUnit(GRU;Choetal.(2014)).Here

wewillprimarilydiscusstheLSTM—specifically,thevariantpresentedby

Graves (2013)—inorderto illustratetheoperations of suchmodels. Concepts

fromLSTMseasilycarry over toothergated RNNs.Throughoutthissection,

Grids183

(t)

(t−1)

×ばつ+

tanh

×ばつ

(t)

×ばつ

(t)

(t−1)

(t)

LSTM

Figure 6.7

Thedataflowofthelong short-termmemory(LSTM),withitscomponents andmemory

cell(M)clearlyhighlighted.Basedonthecurrentinputz

(t)

,previoussummaryh

(t–1)

and

previouscell state c

(t–1)

, theLSTM predictsthe updatedcell statec

(t)

and summaryh

(t)

it will likely be useful to refer toFigure6.7, which illustrates all of the LSTM

operations that we will discuss in text.

TheLSTMaugmentstherecurrentcomputationbyintroducingamemory

cell,which storescellstate vectors, c

(t)

∈R

, thatare preserved betweencom-

putationalsteps.TheLSTMcomputessummaryvectors,h

(t)

,directlybased

onc

(t)

,andc

(t)

is,inturn,computedusingz

(t)

(t–1)

andc

(t–1)

.Critically,the

cell isnot completelyoverwrittenbasedon z

(t)

and h

(t–1)

, whichwould expose

thenetwork tothesameissuesastheSimpleRNN.Instead,acertainquantity

ofthepreviouscellstatemayberetained—andtheproportionbywhichthis

occurs is explicitly learned from data.

JustlikeinSimpleRNN,wecomputefeaturesbyusingasinglefully-

connectedneuralnetworklayeroverthecurrentinputstepandprevious

summary:

(t)

=tanh(W

(t)

(t–1)

)(6.85)

But,as mentioned,wedonot allowallofthisvectortoenterthecell—hence

whywecallitthevectorofcandidatefeatures,anddenoteitas

(t)

.Instead,

theLSTMdirectly learnsgatingvectors,whicharereal-valuedvectors inthe

range[0,1],anddecidehowmuchofthesignalshouldbeallowedtoenter,

exit, and overwrite the memory cell.

Threesuchgatesarecomputed,allbasedonz

(t)

andh

(t–1)

:theinputgate

(t)

,which computestheproportion ofthe candidatevectorallowedto enterthe

cell;theforgetgatef

(t)

,whichcomputestheproportionofthepreviouscell

184Chapter 6

statetoberetained,andtheoutputgateo

(t)

,whichcomputestheproportion

ofthenewcellstatetobeusedforthefinalsummaryvector.Typicallyallof

these gates arealso derivedusing asinglefully connectedlayer,albeit withthe

logisticsigmoidactivation logistic(x)=

1+exp(–x)

,inordertoguaranteethatthe

outputs are in the [0,1] range

(t)

=logistic(W

(t)

(t–1)

)(6.86)

(t)

=logistic(W

(t)

(t–1)

)(6.87)

(t)

=logistic(W

(t)

(t–1)

)(6.88)

Finally,thesegatesareappropriatelyappliedtodecodethenewcellstate,

(t)

,whichisthenmodulated bythe outputgateto producethe summaryvector

(t)

, as follows:

(t)

⊙

(t)

⊙c

(t–1)

(6.89)

(t)

⊙tanh(c

(t)

)(6.90)

where⊙iselement-wisevectormultiplication.Appliedtogether,Equations

(6.85)–(6.90)completelyspecifytheupdaterulefortheLSTM,whichnow

takes into account the cell vectorc

(t)

as well

(t)

)=R(z

(t)

,(h

(t–1)

))

Notethat,asthevaluesoff

(t)

arederivedfromz

(t)

andh

(t–1)

—andtherefore

directlylearnablefromdata—theLSTMeffectivelylearnshowtoappropri-

atelyforgetpastexperiences.Indeed,thevaluesoff

(t)

directlyappearinthe

backpropagation update forall the LSTMparameters (W

∗

), allowing

the network toexplicitly control, in a data-driven way, the degree of vanishing

for the gradients across the time steps.

Besidestacklingthevanishinggradientissuehead-on,itturnsoutthat

gatedRNNsalsounlockaveryusefulformofinvariancetotime-warping

transformations, which remains out of reach of SimpleRNNs.

Time warping invariance of gated RNNsWe willstart by illustrating, ina

continuous-timesetting

,whatdoesitmeantowarptime,andwhatisrequired

ofarecurrentmodelinordertoachieveinvariancetosuchtransformations.

Our exposition will largely follow the work of TallecandOllivier (2018), that

initially describedthis phenomenon—and indeed,they wereamong thefirst to

actually study RNNs from the lens of invariances.

Letusassumea continuoustime-domainsignalz(t), onwhichwewouldlike

toapply anRNN.Toalignthe RNN’sdiscrete-time computationofsummary

vectorsh

(t)28

with ananalogue in thecontinuous domain,h(t),we willobserve

Grids185

Figure 6.8

Time-warpingoperationscanbesimple,suchastimerescaling;e.g.τ(t)=0.7t(dis-

playedhere),which,inadiscretesetting,wouldamounttonewinputsbeingreceived

every∼1.43steps.However,italsoadmitsawidespectrumofvariably-changing

samplingrates,e.g.samplingmayfreelyaccelerateordeceleratethroughoutthetime

domain.

its linear Taylor expansion:

h(t+δ)≈h(t)+δ

dh(t)

(6.91)

and, setting δ=1, we recover a relationship between h(t) and h(t+1), which is

exactlywhattheRNNupdatefunctionR(Equation6.80)computes.Namely,

the RNN update function satisfies the following differential equation:

dh(t)

=h(t+1)–h(t)=R(z(t+1),h(t))–h(t)(6.92)

We would likethe RNN tobe resilient tothe way inwhich the signalis sam-

pled(e.g.bychangingthetimeunitofmeasurement),inordertoaccountfor

anyimperfections orirregularitiestherein.Formally, wedenotea timewarping

operationτ:R

→R

,asanymonotonicallyincreasingdifferentiablemapping

betweentimes.Thenotationτischosenbecausetimewarpingrepresentsan

automorphism of time. See Figure 6.8 for an illustration of warping.

Further,westatethataclassofmodelsisinvarianttotimewarpingif,for

anymodel ofthe classand any suchτ,there existsanother (possiblythesame)

modelfromtheclassthatprocessesthewarpeddatainthesamewayasthe

original model did in the non-warped case.

This isa potentiallyvery useful property. Ifwehave anRNN classcapable

ofmodellingshort-termdependencieswell,andwecanalsoshowthatthis

classisinvarianttotimewarping,thenweknowitispossibletotrainsucha

modelinawaythatwillusefullycapturelong-termdependenciesaswell(as

theywouldcorrespondtoatimedilationwarpingofasignalwithshort-term

dependencies).Aswewillshortlysee,itisnocoincidencethatgatedRNN

modelssuchasthe LSTMwereproposedtomodellong-rangedependencies.

Achieving timewarpinginvarianceistightlycoupledwithpresenceofgating

mechanisms, such as the input/forget/output gates of LSTMs.

186Chapter 6

Whentimegetswarpedbyτ,thesignalobservedbytheRNNattime

tisz(τ(t))and,toremaininvarianttosuchwarpings,itshouldpredictan

equivalently-warpedsummary functionh(τ(t)).Using Taylorexpansion argu-

ments once more, we derive a form of Equation 6.92 for the warped time, that

the RNN update R should satisfy:

dh(τ(t))

dτ(t)

=R(z(τ(t+1)),h(τ(t)))–h(τ(t))(6.93)

However,theabovederivativeiscomputedwithrespecttothewarpedtime

τ(t),andhencedoesnottakeintoaccounttheoriginalsignal.Tomakeour

model take into account the warping transformation explicitly, we need to dif-

ferentiatethe warpedsummaryfunction withrespecttot.Applying thechain

rule, this yields the following differential equation:

dh(τ(t))

dτ(t)

R(z(τ(t+1)),h(τ(t)))–

dτ(t)

h(τ(t))(6.94)

and,forour(continuous-time)RNNtoremaininvarianttoanytimewarping

τ(t), it needs to be able toexplicitly represent the derivative

dτ(t)

, which is not

assumedknownupfront!WeneedtointroducealearnablefunctionΓ which

approximates this derivative. For example, Γcould be aneural network taking

into account z(t+1) and h(t) and predicting scalar outputs.

Now,remarkthat,fromthepointofviewofadiscreteRNNmodelunder

timewarping,itsinputz

(t)

willcorrespondtoz(τ(t)),anditssummaryh

(t)

willcorrespondtoh(τ(t)).Toobtaintherequiredrelationshipofh

(t)

toh

(t+1)

inordertoremaininvarianttotimewarping,wewilluseaone-stepTaylor

expansion of h(τ(t)):

h(τ(t+δ))≈h(τ(t))+δ

dh(τ(t))

and, onceagain, settingδ=1 andsubstituting Equation 6.94,then discretising:

(t+1)

(t)

dτ(t)

R(z

(t+1)

(t)

)–

dτ(t)

(t)

dτ(t)

R(z

(t+1)

(t)



1–

dτ(t)



(t)

Finally,weswap

dτ(t)

withtheaforementionedlearnablefunction,Γ.This

gives us the required form for our time warping-invariant RNN:

(t+1)

=Γ(z

(t+1)

(t)

)R(z

(t+1)

(t)

)+(1–Γ(z

(t+1)

(t)

))h

(t)

(6.95)

We may quicklydeduce that SimpleRNNs (Equation 6.81)are nottimewarp-

inginvariant,giventhattheydonotfeaturethesecondterminEquation

Grids187

6.95.Instead,theyfullyoverwriteh

(t)

withR(z

(t+1)

(t)

),whichcorresponds

to assuming no time warping at all;

dτ(t)

=1, i.e. τ(t)=t.

Further,ourlinkbetweencontinuous-timeRNNsandthediscreteRNN

basedonRrestedontheaccuracyoftheTaylorapproximation,whichholds

only if the time-warpingderivative is not toolarge, i.e.,

dτ(t)

≲1. The intuitive

explanationofthisis:ifourtimewarpingoperationevercontractstimeina

waythat makes timeincrements (t→t+1) largeenough thatintermediate data

changesarenotsampled,themodelcanneverhopetoprocesstime-warped

inputsinthesamewayasoriginalones—itsimplywouldnothaveaccessto

thesameinformation.Conversely,timedilationsofanyform(which,indis-

creteterms, correspondtointerspersingtheinputtime-serieswithzeroes)are

perfectly allowed within our framework.

Combinedwithourrequirementofmonotonicallyincreasingτ(

dτ(t)

>0),

wecanboundtheoutputspaceofΓas0<Γ(z

(t+1)

(t)

)<1,whichmotivates

the use of the logistic sigmoid activation for Γ, e.g.:

Γ(z

(t+1)

(t)

)=logistic(W

(t+1)

(t)

)

exactlymatchingtheLSTMgatingequations(e.g.Equation6.86).Themain

differenceisthatLSTMscomputegatingvectors,whereasEquation6.95

impliesΓshouldoutputascalar.Vectorisedgates(Hochreiter1991)allow

tofitadifferentwarpingderivativeineverydimensionofh

(t)

,allowingfor

reasoning over multiple time horizons simultaneously.

It is worth taking a pause here to summarise what we have done. By requir-

ing that our RNN classis invariantto (non-destructive) time warping,we have

derived the necessary form thatitmust have (Equation 6.95), and showed that

itexactlycorrespondstotheclassofgatedRNNs.Thegates’primaryrole

underthisperspectiveistoaccuratelyfitthederivative

dτ(t)

ofthewarping

transformation.

The notionof classinvariance issomewhatdistinct fromthe invariances we

studiedpreviously.Namely,oncewetrain agatedRNNona time-warpedinput

with τ

(t), in many cases we cannot zero-shot transfer it to a signal warped by

a different τ

(t).Rather, class invariance only guaranteesthat gated RNNsare

powerfulenough tofit bothof thesesignals inthe samemanner,butpotentially

withvastlydifferentmodelparameters.Thatbeingsaid,therealisationthat

effectivegating mechanismsare tightlyrelatedto fittingthe warpingderivative

canyieldusefulprescriptionsforgatedRNNoptimisation,aswenow briefly

demonstrate.

Forexample,we canoftenassume thatthe rangeof thedependencies weare

interested in tracking within our signal will be in the range [T

] time-steps.

188Chapter 6

ByanalysingtheanalyticsolutionstoEquation6.94,itcanbeshownthat

thecharacteristicforgettingtimeofh

(t)

byourgatedRNNisproportionalto

Γ(z

(t+1)

(t)

)

.Hence,wewouldlikeourgatingvaluestoliebetween

order to effectively remember information within the assumed range.

Further,ifweassumethatz

(t)

andh

(t)

areroughlyzero-centered—which

isacommonby-productofapplyingtransformationssuchaslayernormali-

sation(Ba,Kiros,andHinton2016)—wecanassumethatE[Γ(z

(t+1)

(t)

)]≈

logistic(b

).Controllingthebiasvectorofthegatingmechanismishencea

very powerful way of controlling the effective gate value

Combiningthetwo observations,weconcludethatanappropriaterange of

gatingvaluescanbeobtainedbyinitialisingb

∼–log(U(T

)–1),where

Uis theuniform real distribution. Sucha recommendation was dubbedchrono

initialisation by Tallec and Ollivier (2018), andhas been empirically shownto

improve the long-range dependency modelling of gated RNNs.

Exercises

1.We noted that a matrix is circulantif and only if it commuteswith the shift operator

S. Let us prove this fundamental result for 1D grids.

(i)Write down the explicit elements of the right shift matrix Sof Nelements.

(ii)LetCbeanarbitrary N×ばつN matrix. Writedowntheelements ofthe resultingmatrices

SCand CS.

(iii)By requiring translation equivariance (SC=CS), deduce the relationshipbetween c

and c

i+1,j+1

(iv)Usethisrelationshiptoconclude thattheelementsc

dependonly on(i–j)modN,

proving that Cmust be a circulant.

2.Convolutionalneuralnetworksoftenusestridedconvolutions(orpooling)to

coarsen thegrid.For example, astride ofs=2drops every secondpixel of theout-

putgrid.Show thatanarchitecturewithastrideof2is nolongerequivarianttoall

translations, and characterise which translations it is equivariant to.

3.Giventhatgridsareaspecialcaseofgraphs,wemayattempttoexpressaCNN

usingthelanguageofgraphneuralnetworks(GNNs)fromChapter5.Considera

1Dcyclicalgridwithnnodesasaringgraph,comprisingnodesV={0,1,...,n–

1}andneighbourhoodsN

={(u–1)modn,u,(u+1)modn},withnodefeatures

∈R

being typical RGB values.

(i)Assume we use a GCN (graph convolutional network) model over this graph:

=φ

v∈N

√

ψ(x

)

Grids189

ThisGNNflavourhassomeoftheclosesttiestothespectralpropertiesunderly-

ingCNNs,andthereforeitrepresentsavaluableattempt.However,itsadlycannot

express allpossible convolutionsover images(evenif we expand theneighbourhoods).

Characterise which image CNNs are representable using the GCN.

(ii)WecanattemptfixingthisbymovingtoamoreexpressiveflavourofGNNs;recall

that the most expressive option are message passing GNNs:

=φ

v∈N

ψ(x

)

for apermutation-invariant aggregator

. Provideasimplecounter-example showing

that even this expressive GNN cannot reliably simulate all CNN filters over the grid.

(iii)WhatallowsCNNstobypassthisrestriction?Leverage thisassumption toprovidea

correctiontothemessage-passingequationthatcanrepresentallpossibleCNNsover

the grid.

4.In this question we willstudy time rescalings, a special case oftime warping trans-

formations.Timerescalingshavetheformτ(t)=αt,whereα∈(0,1].Itiseasyto

verifythatrescalingsmonotonicallydilatetime,satisfyingtherequirementsfora

non-destructive timewarping operation. We will startby studying continuous-space

RNNs, taking input signals z(t) to continuous summaries h(t).

(i)StatethelinearTaylorexpansionofh(t)aroundt.Underwhichconditionsisthis

agoodapproximation?UsetheTaylorexpansiontorelateh(t)toh(t+1),usingthe

discrete RNN update, R(z

(t+1)

(t)

) and the summary derivative

dh(t)

(ii)Nowintroduce theproposedtime rescaling,and derivethe requiredcondition forclass

invariance to such operations.

(iii)Derive the update rule for a discrete RNN that is invariant to time rescaling.

(iv)Assume thata time-rescaling-invariant RNNhas beenpre-trained overdata whichhas

beenrescaledusingτ

(t)=α

t.Isitpossibletozero-shottransferthisRNNtomake

predictions over a signal that has been rescaled by τ

(t)=α

t? If so, how?

Notes

Because ofthe periodicboundary conditions,it isa circularor cyclicconvo-

lution. In signalprocessing, θis often referredto as the"filter,"and in CNNs,

its coefficients are learnable.

The left shiftoperator is givenby S

⊤

. Obviously, shifting left and thenright

(orviceversa)doesnotdoanything,whichmeansSisorthogonal:S

⊤

=I.

We must additionally assume distincteigenvalues, otherwise there might be

multiplepossible diagonalisations.Thisassumptionissatisfiedwithourchoice

of S.

190Chapter 6

Sisorthogonalbutnon-symmetric,hence,itseigenvectors areorthogonal

but the eigenvalues are complex (roots of unity).

Notethattheeigenvectorsarecomplex,soweneedtotakecomplex

conjugation when transposing Φ.

SincetheFouriertransformisanorthogonalmatrix(Φ

∗

Φ=I),geomet-

ricallyitactsasachangeofthesystemofcoordinatesthatamountstoan

N-dimensionalrotation. Inthis systemof coordinates("Fourier domain"), the

action of a circulant Cmatrix becomes element-wise product.

Ingraphsignalprocessing,theeigenvectorsofthegraphLaplacianare

oftenusedasanalternativeoftheadjacencymatrixtoconstructthegraph

Fourier transform,see Shumanet al.2013.On grids,both matriceshave joint

eigenvectors,butongraphstheyresultsinsomewhatdifferentthoughrelated

constructions.

Eigenfunction is synonymous with ‘eigenvector’ and is used when referring

to eigenvectors of continuous operators.

The spectralcharacterisation ofthe translationgroup isa particular caseof a

moregeneralresultinFunctionalAnalysis,theStone’sTheorem,whichderives

an equivalent characterisation for any one-parameter unitary group.

Recall, C(θ) is a circulant matrix with parameters θ.

Notethatweusuallyimaginex andθ

as2Dmatrices,butinthisequa-

tion,bothxandθ

havetheirtwocoordinatedimensionsflattenedinto

one—making xa vector, and C(θ

) a matrix.

CNNs whichonly consist ofthe operations mentionedin thisparagraph are

oftendubbed"all-convolutional".Incontrast,manyCNNsflattentheimage

acrossthespatialaxesandpassthemtoanMLPclassifier,oncesufficient

equivariantandcoarseninglayershavebeenapplied.Thislosestranslation

invariance.

Historically,ResNetmodelsarepredatedbyhighwaynetworks(Srivas-

tava,Greff,andSchmidhuber2015),whichallowformoregeneralgating

mechanisms to control the residual information flow.

In thiscase, theResNet isperforming aForward Eulerdiscretisation ofan

ODE:

x=σ(F(x))

We notethat normalisingactivations of neuralnetworks hasseen attention

evenbeforetheadventofbatchnormalisation.See,e.g.,LyuandSimoncelli

2008.

Whetherthe domainisconsideredstatic ordynamicconcernstime scales:

e.g.,aroadnetworkdoeschangeovertime(asnewroadsarebuiltandold

ones aredemolished), but significantlyslower comparedto theflow oftraffic.

Grids191

Similarly,insocialnetworks,changesinengagement(e.g.Twitterusersre-

tweetingatweet)happenatamuchhigherfrequencythanchangesinthefollow

graph.

Wedonotlose generalityin ourexample; equivalentanalysiscanbe done

e.g. for node-level outputs on a spatiotemporal graph; theonlydifference is in

the choice of encoder f(which will then be a permutation equivariant GNN).

Notethatthez

(t)

vectorscanbeseenaspointsonatemporalgrid:hence,

processingthemwithaCNNisalsoviableinsomecases.Transformersare

also increasingly popular models for processing generic sequential inputs.

In spiteof theirname, SimpleRNNsare remarkablyexpressive.Forexam-

ple,itwasshownbySiegelmannandSontag(1995)thatsuchmodelsare

Turing-complete,meaningthattheycanlikelyrepresentanycomputationwe

may ever be able to execute on computers.

Notethatthisconstructionisextendabletogridsinhigherdimensions,

allowingusto,e.g.,processsignalslivingonimagesinascanlinefashion.

Suchaconstruction powereda popularseriesofmodels,such asthe PixelRNN

from Oord, Kalchbrenner, and Kavukcuoglu (2016).

Note that equivalent analyses willarise if we use adifferentpadding vector

than 0.

In a very similar vein, we can derive equivariance to left-shifting by s steps

as long as t

′

≥s.

Contractionsarefunctionsγ:X→Xsuchthat,undersomenorm∥·∥on

X,applyingγcontractsthedistancesbetweenpoints:forallx,y∈X,and

someq∈[0,1),itholdsthat∥γ(x)–γ(y)∥≤q∥x–y∥.Iteratingsuchafunc-

tionthennecessarily convergesto aunique fixedpoint, asa directconsequence

of Banach’s Fixed Point Theorem (Banach 1922).

Notethatwehavesettheactivationfunctiontotanhhere;asLSTMsare

designedto amelioratethevanishinggradient problem,it isnowappropriate to

use a sigmoidal activation.

Notethatthethreegatesarethemselvesvectors,i.e.i

(t)

∈[0,1]

Thisallowsthemtocontrolhowmucheachofthemdimensionsisallowed

through the gate.

ThisisstillcompatiblewiththeRNNupdateblueprintfromEquation

(6.80);simplyconsiderthesummaryvectortobetheconcatenationofh

(t)

and c

(t)

; sometimes denoted by h

(t)

∥c

(t)

Wefocusonthecontinuoussettingasitwillbeeasiertoreasonabout

manipulations of time there.

We will use h(t) to denote acontinuous signal at time t, and h

(t)

to denote a

discrete signal at time-step t.

192Chapter 6

ThisinsightwasalreadyspottedbyGersandJ.Schmidhuber(2000)and

Jozefowicz,Zaremba,andSutskever(2015),whoempiricallyrecommended

initialising theforget-gatebias ofLSTMs toa constantpositive vector, suchas