6
Grids
Gridsareacommon,particulartypeofgraphs,thatcomewithadditional
geometric priors.
These priorsare capturedby thetranslation group—enablingthe useof
oriented features.
We describehowFourieranalysis provides thenatural toolto fullychar-
acterise translation-equivariantmaps, and we present canonical instances
including convolutional and recurrent neural networks.
Going beyond pure translations,we demonstrate howstudying equivari-
ancetotimewarpingenforcestheusageofgatinginRNNs,leadingto
the popular long short-term memory model.
Thesecondtypeofobjectsweconsideraregrids.Itisfairtosaythatthe
impactofdeeplearninghasbeenparticularlydramaticincomputervision,
naturallanguageprocessing,andspeechrecognition.Theseapplicationsall
share a geometric common denominator: an underlying grid structure.
Gridsareaparticularcaseofgraphs,buttheycomewithspecificstructure
thatsetsthemapart.Onthe onehand,grids comewithacanonicalarrangement
ofnodesthatmakethepermutationinvariancefromgeneralgraphsunneces-
sary.Ontheotherhand,machinelearningproblemsdefinedovergridsoften
exhibitastronggeometricprior,encodedbythetranslationgroup:anobject
identitydoesnotdependonitsabsolutespatiallocation,oranaminoacid
sequenceencodessimilarstructuresindependentlyofthelocationoftheir
associated genes in the genome.
Thetranslationgroupprovidesaparticularlysimpleinstantiationofthe
GDLblueprint,thankstotheharmonicanalysisperspectivebrought byFourier
analysis. Indeed, since the translation group is commutative, we can explicitly
characterizetheclassoflinearequivariantmaps(thatis,convolutions)using
asharedeigendecomposition—preciselytheFouriertransform.Moreover,
gridsare discretizationsoflow-dimensionalphysical domains,wherelocality
is properly defined.
168Chapter 6
Figure 6.1
An illustration of the one-dimension grid, using boundary conditions.
Inthischapter,wewillfirstderivethefundamentallinksbetweenconvo-
lutionsandFouriertransforms.Thenwewilldiscussthedistinctionbetween
thegraphandthegrid formalism,andfinallymove todescriberepresentative
architectures for image and sequence applications.
6.1Domain
A grid Dof size Nin s dimensionscan be best viewed asthe set Z
s
N
, contain-
ings-tuples ofintegers (j
1
,...,j
s
)modulo N.It willbeconvenientto view the
domainashavingno boundary,andtointerpretthegridisadiscretizationofan
underlyingtorusT
s
=R
s
/Z
s
.Thisdomain isitselfagroup,withtheoperation
inherited bythe additivemodular structureof eitherZ
N
or R/Z:givenu,v∈D,
thepointu+viswell-definedinD.Wedenotethisgroupasthetranslation
group.
LetusnowinvestigatehowtheGDLblueprintinstantiatesinthissetting.
For ease of exposition, we will focus on the one-dimensional s=1 setting.Let
usrecallthat thekeyquestionsweneedtoaddress are:(i)characterizetheclass
of linear equivariants associatedwiththe symmetry group,and(ii) implement
locality.
CirculantmatricesandConvolutionsWecanviewtheone-dimensional
gridasaringgraph(illustratedinFigure6.1),withnodesindexedby
0,1,...,N –1moduloN,and theadjacency matrixwith elementsa
u,u+1modN
=
1 and zero otherwise.
Asmentionedearlier,whileitistemptingtoreduceourselvestothegen-
eralgraphformulationseeninChapter5,therearetwomaindifferencesthat
setourpresentsettingapart.First,eachnodeuhasidenticalconnectivity,to
itsneighboursu–1andu+1,andthusstructure-wiseindistinguishablefrom
theothers.Inthelanguageofgrouptheory,wesaythatthegridisahomo-
geneousspace:Foranytwopointsuandvinthedomain,thereisagroup
elementgGthatmapsutog.u=v.Next,sincethenodesofthegridhave
afixedordering,wealsohaveafixedorderingoftheneighbours:wecan
callu–1the‘leftneighbour’and u+1the‘rightneighbour’.More generally,
thegridallowstodefineorientation,inawaythatitisgloballyconsistent
acrossallneighborhoods.Ifweuseourpreviousapproachfordesigninga
Grids169
equivariantfunctionFusingalocalaggregationfunctionφ,wenowhave
f(x
u
)=φ(x
u–1
,x
u
,x
u+1
)ateverynodeofthegrid:φdoesnotneedtobe
permutation invariant anymore. For a particularchoice of alinear transforma-
tion φ(x
u–1
,x
u
,x
u+1
)=θ
–1
x
u–1
+θ
0
x
u
+θ
1
x
u+1
, we can write F(X) asa matrix
product,
F(X)=
θ
0
θ
1
θ
–1
θ
–1
θ
0
θ
1
.
.
.
.
.
.
.
.
.
θ
–1
θ
0
θ
1
θ
1
θ
–1
θ
0
x
0
x
1
.
.
.
x
n–2
x
n–1
Notethisveryspecialmulti-diagonalstructurewithoneelementrepeated
alongeach diagonal,sometimesreferred toas "weightsharing" inthemachine
learning literature.
Moregenerally,given avectorθ=(θ
0
,...,θ
n–1
),a circulant matrixC(θ)=
(θ
uvmod n
) isobtained by appendingcircularly shifted versions of thevectorθ.
Circulant matrices are synonymous with discrete convolutions,
1
(xθ)
u
=
n–1
X
v=0
x
vmod n
θ
(uv)mod n
asonehasC(θ)x=xθ.Aparticularchoiceofθ=(0,1,0,...,0)
yieldsa
specialcirculantmatrixthatshiftsvectorstotherightbyoneposition.This
matrix is called the (right) shift or translation operator and denoted by S.
2
Circulant matricescan becharacterised bytheir commutativity property:the
productofcirculantmatricesiscommutative,i.e.C(θ)C(η)=C(η)C(θ)for
anyθandη.Sincetheshiftisacirculantmatrix,wegetthefamiliartranslation
or shift equivariance of the convolution operator,
SC(θ)x=C(θ)Sx.
Suchcommutativitypropertyshouldnotbesurprising,sincetheunderlying
symmetrygroup(thetranslationgroup)isAbelian.Moreover,theopposite
directionistrueaswell,i.e.amatrixiscirculantiffitcommuteswithshift.
This,inturn,allowsustodefineconvolutionasatranslationequivariantlin-
earoperation,andisaniceillustrationofthepowerofgeometricpriorsand
theoverallphilosophyofGeometricML:convolutionemergesfromthefirst
principle of translational symmetry.
Notethatunlikethesituationonsetsandgraphs,thenumberoflinearly
independentshift-equivariantfunctions(convolutions)growswiththesizeof
the domain (sincewe have one degree of freedomin each diagonal ofa circu-
lantmatrix).However,thescaleseparationpriorguaranteesfilters canbelocal,
170Chapter 6
resultingin thesameΘ(1)-parameter complexity perlayer,aswe willverifyin
Section6.2.1whendiscussingtheuseof theseprinciplesintheimplementation
of Convolutional Neural Network architectures.
DerivationofthediscreteFouriertransformInChapter4weintroduced
Fourier transforms, and we have already mentioned its connection to convolu-
tion: the fact thatthe Fourier transform diagonalises theconvolution operation
isanimportantpropertyusedinsignalprocessingtoperformconvolutionin
thefrequencydomainasanelement-wiseproductoftheFouriertransforms.
Let us verify this important property in our setting of one-dimensional grids.
Forthispurpose,recallafactfromlinear
3
algebrathat(diagonalisable)
matrices arejoinly diagonalisable iff they mutually commute.Inother words,
thereexistsa commoneigenbasisfor allthecirculantmatrices,inwhichthey
differonlyby theireigenvalues.Wecan thereforepickonecirculant matrixand
computeitseigenvectors—weareassuredthatthesewillbetheeigenvectors
of allother circulant matricesas well. Itis convenienttopick theshift operator
S, for which the eigenvectors happen to be the discrete Fourier basis
4
φ
k
=
1
N
1,e
2πik
N
,e
4πik
N
,...,e
2πi(N–1)k
N
,k=0,1,...,N –1,
whichwecanarrangeintoanN ×ばつNFouriermatrixΦ=(φ
0
,...,φ
N–1
).
Indeed, we verify directly that Sφ
k
=e
2πi/N
φ
k
.
Multiplicationby Φ
5
gives theDiscreteFourier Transform(DFT), andby
Φthe inverse DFT,
ˆ
x
k
=
1
N
N–1
X
u=0
x
u
e
2πiku
N
x
u
=
1
N
N–1
X
k=0
ˆ
x
k
e
+
2πiku
N
.
Since allcirculantmatricesare jointlydiagonalisable,
6
they arealso diago-
nalised by the Fourier transform and differ only in their eigenvalues. Since the
eigenvalues of the circulantmatrixC(θ)aretheFourier transform ofthefilter
(see e.g. Bamieh 2018),
ˆ
θ=Φ
θ, we obtain the Convolution Theorem:
C(θ)x=Φ
ˆ
θ
0
.
.
.
ˆ
θ
n–1
Φ
x=Φ(
ˆ
θ
ˆ
x)
Because the Fourier matrix Φhas a special algebraicstructure, the products
Φ
xandΦxcan becomputed withO(nlogn)complexityusingaFastFourier
Transform (FFT) algorithm. This is one of the reasons why frequency-domain
filteringissopopularinsignalprocessing;furthermore,thefilteristypically
designed directly in the frequency domain, so the Fourier transform
ˆ
θis never
explicitly computed.
Grids171
BesidesthedidacticvalueofthederivationoftheFouriertransformandcon-
volution we have done here,it provides a scheme togeneralise these concepts
tographs. Realisingthattheadjacency matrixofthering graphisexactly the
shiftoperator, onecan candevelopthegraph Fouriertransform andananalogy
oftheconvolutionoperatorbycomputingtheeigenvectorsoftheadjacency
matrix (see e.g. Sandryhaila and Moura 2013).
EarlyattemptstodevelopgraphneuralnetworksbyanalogytoCNNs,some-
timestermed‘spectralGNNs’,exploitedthisexactblueprint.
7
Wewillseein
Sections8thatthisanalogyhassomeimportantlimitations.Thefirstlimita-
tioncomesfromthefactthatagridisfixed,andhenceallsignalsonitcan
berepresentedinthesameFourierbasis.Incontrast,ongeneralgraphs,the
Fourier basis depends on the structure of the graph. Hence, we cannot directly
compareFouriertransformsontwodifferentgraphs—aproblemthattrans-
latedintoalackofgeneralisationinmachinelearningproblems.Secondly,
multi-dimensionalgrids,whichareconstructedastensorproductsofone-
dimensionalgrids,retaintheunderlyingstructure:theFourierbasiselements
andthecorrespondingfrequencies(eigenvalues)canbe organisedinmultiple
dimensions.Inimages,forexample,wecannaturallytalkabouthorizontaland
vertical frequencyand filtershavea notionofdirection. Ongraphs,thestruc-
tureoftheFourierdomainisone-dimensional,aswecanonlyorganisethe
Fourierbasis functions bythe magnitude ofthe correspondingfrequencies. As
a result, graph filters are oblivious of direction orisotropic.
Derivation ofthe continuousFourier transformWecan revisit theprevi-
ousanalysisinthecontinuoussettingwherethegridisreplacedbythereal
line. Like inChapter 4, weconsiderfunctions defined onΩ=Randthe trans-
lation operator (S
v
f)(u)=f(uv) shifting fby some position v. Applying S
v
to
theFourierbasisfunctionsφ
ξ
(u)=e
iξu
yields,byassociativityof theexponent,
S
v
e
iξu
=e
iξ(uv)
=e
–iξv
e
iξu
,
i.e., φu
ξ
(u) is the complex eigenvector of S
v
with the complex eigenvalue e
–iξv
–exactlymirroringthesituationwehadinthediscretesetting.SinceS
v
isa
unitary operator(i.e., S
v
x
p
=x
p
for any pand xL
p
(R)), any eigenvalueλ
must satisfy |λ|=1, which corresponds preciselyto the eigenvalues e
iξv
found
above.Moreover,thespectrumofthetranslationoperatorissimple,meaning
thattwofunctionssharingthesameeigenvaluemustnecessarilybecollinear.
Indeed, supposethat S
v
f=e
–iξ
0
v
fforsome ξ
0
. Takingthe Fourier transformin
both sides, we obtain
ξ,e
–iξv
ˆ
f(ξ)=e
–iξ
0
v
ˆ
f(ξ) ,
which implies that
ˆ
f(ξ)=0 for ξ=ξ
0
, thus f=αφ
ξ
0
.
172Chapter 6
ForagenerallinearoperatorCthatistranslationequivariant(S
v
C=CS
v
),
we have
S
v
Ce
iξu
=CS
v
e
iξu
=e
–iξv
Ce
iξu
,
implying thatCe
iξu
is alsoan eigenfunction
8
of S
v
with eigenvalue e
–iξv
, from
whereit follows fromthesimplicity ofspectrumthat Ce
iξu
=βφ
ξ
(u);in other
words,theFourier basisistheeigenbasisofalltranslationequivariantopera-
tors.Asaresult,CisdiagonalintheFourierdomainandcanbeexpressed
asCe
iξu
=
ˆ
p
C
(ξ)e
iξu
,where
ˆ
p
C
(ξ)isatransferfunctionactingondifferent
frequencies ξ. Finally, for an arbitrary function x(u), by linearity,
(Cx)(u)=C
Z
+
ˆ
x(ξ)e
iξu
dξ=
Z
+
ˆ
x(ξ)
ˆ
p
C
(ξ)e
iξu
dξ
=
Z
+
p
C
(v)x(uv)dv=(xp
C
)(u),
9
wherep
C
(u)isthe inverseFouriertransformof
ˆ
p
C
(ξ). Itthus followsthatevery
linear translation equivariant operator is a convolution.
6.2Model
Nowthatwehavecompletelycharacterizedtheclassoflinearequivariant
maps,givenbyconvolutionsx7→xh,andthatwecanfurthermoreimpose
localityofthesemapsbysimplyconstrainingthesupportofthefilterh,we
candirectlydeploytheGDLblueprintfromChapter4,yieldingthehouse-
holdCNNs(ConvolutionalNeuralNetworks)andRNNs(recurrentneural
networks) architectures.
6.2.1Convolutional Neural Networks
Convolutional NeuralNetworks areperhapsthe earliestand mostwellknown
exampleofdeeplearningarchitecturesfollowingtheblueprintofGeometric
DeepLearning.InSection6.1wehavefullycharacterisedtheclassoflinear
and local translation equivariant operators, given by convolutions C(θ)x=x
θwithalocalisedfilterθ
10
.Letusfirstfocusonscalar-valued(‘single-channel’
or ‘grayscale’)discretised images,where the domainis thegrid Ω=[H]×ばつ[W]
with u=(u
1
,u
2
) and x∈X(Ω,R).
AnyconvolutionwithacompactlysupportedfilterofsizeH
f
×ばつW
f
canbe
written as alinearcombination of generators θ
1,1
,...,θ
H
f
,W
f
, given for exam-
pleby theunit peaksθ
vw
(u
1
,u
2
)=δ(u
1
v,u
2
w). Anylocal linearequivariant
map is thus expressible as
11
F(x)=
H
f
X
v=1
W
f
X
w=1
α
vw
C(θ
vw
)x,(6.75)
Grids173
0111000
0011100
0001110
0001100
0011000
0110000
1100000
x
?
101
010
101
C(θ)
z}|{
θ
11
+θ
13
+θ
22
+θ
31
+θ
33
=
14341
12433
12341
13311
33110
x?θ
101
010
101
×ばつ1×ばつ0×ばつ1
×ばつ0×ばつ1×ばつ0
×ばつ1×ばつ0×ばつ1
Figure 6.2
TheprocessofconvolvinganimagexwithafilterC(θ).Thefilterparametersθcan
be expressed as a linear combination of generators θ
vw
.
which, incoordinates, correspondsto thefamiliar 2Dconvolution(seeFigure
6.2 for an overview):
F(x)
uv
=
H
f
X
a=1
W
f
X
b=1
α
ab
x
u+a,v+b
.(6.76)
Otherchoicesofthebasisθ
vw
arealsopossibleandwillyieldequivalent
operations(forpotentiallydifferentchoicesofα
vw
).Apopularexampleare
directionalderivatives:θ
vw
(u
1
,u
2
)=δ(u
1
,u
2
)–δ(u
1
v,u
2
w),(v,w)=(0,0)
takentogetherwiththelocalaverageθ
0
(u
1
,u
2
)=
1
H
f
W
f
.Infact,directional
derivativescanbeconsideredagrid-specificanalogueofdiffusionprocesses
ongraphs,whichwerecoverifweassumeeachpixel tobeanodeconnected
to its immediate neighbouring pixels in the grid.
Whenthe scalarinputchannelis replacedbymultiplechannels (e.g.,RGB
colours,ormoregenerallyanarbitrarynumberoffeaturemaps),thecon-
volutionalfilterbecomesaconvolutionaltensorexpressingarbitrarylinear
combinationsofinputfeaturesintooutputfeaturemaps.Incoordinates,this
can be expressed as
F(x)
uvj
=
H
f
X
a=1
W
f
X
b=1
M
X
c=1
α
jabc
x
u+a,v+b,c
,j[N] ,(6.77)
where M and Nare respectively thenumber ofinput andoutput channels.This
basicoperationencompassesabroadclassofneuralnetworkarchitectures,
which, as we will show in the next section, have had a profound impactacross
manyareasofcomputervision,signalprocessing,andbeyond.Here,rather
174Chapter 6
3210123
0
1
2
3
Figure 6.3
ReLU,oftenconsidereda‘modern’architecturalchoice,wasalreadyusedinthe
Neocognitron(Fukushima 1980).Rectificationis equivalenttotheprincipleofdemodu-
lation,which isfundamentalinelectricalengineeringas thebasisformanytransmission
protocols,suchas FMradio; andalso hasaprominentrole inmodels forneuronal activ-
ity.
than dissecting the myriad ofpossiblearchitectural variants of CNNs, we pre-
fer to focus onsome of the essential innovations that enabled their widespread
use.
EfficientmultiscalecomputationAsdiscussedintheGDLtemplatefor
generalsymmetries,extractingtranslationinvariantfeaturesoutofthecon-
volutionaloperatorFrequiresanon-linearstep;e.g.theobiquituousReLU
activationillustratedinFigure6.3.Convolutionalfeaturesareprocessed
through anon-linear activation functionσ, actingelement-wise on theinput—
i.e.,σ:X(Ω)→X(Ω),asσ(x)(u)=σ(x(u)).Perhapsthemostpopularexample
atthetimeofwritingistheRectifiedLinearUnit(ReLU):σ(x)=max(x,0).
Thisnon-linearity effectivelyrectifiesthe signals,pushing theirenergytowards
lowerfrequencies,andenablingthecomputationofhigh-orderinteractions
across scales by iterating the construction.
Alreadyin theearlyworksofFukushima1980andLeCunetal.1998, CNNs
and similar architectures had a multiscale structure, where after each convolu-
tionallayer(6.77)oneperformsagridcoarseningP:X(Ω)→X(Ω
),where
thegridΩ
hascoarserresolutionthanΩ.Thisenablesmultiscalefilterswith
effectivelyincreasingreceptivefield,yetretainingaconstantnumberofparam-
etersperscale.Several signalcoarsening strategiesP(referredtoaspooling)
maybeused, themostcommonareapplying alow-passanti-aliasingfilter(e.g.
local average) followed by grid downsampling, or non-linear max-pooling.
Insummary,a‘vanilla’CNNlayer canbeexpressedas thecompositionof
thebasicobjectsalreadyintroducedinourGeometricDeepLearningblueprint:
h=P(σ(F(x))) ,(6.78)
Grids175
i.e.an equivariant linearlayer F,a coarseningoperationP, andanon-linearity
σ.Itisalsopossibletoperformtranslationinvariantglobalpoolingopera-
tionswithinCNNs.Intuitively,thisinvolveseachpixel—which,afterseveral
convolutions, summarises apatch centeredaround it—proposing the final rep-
resentationoftheimage
12
,withtheultimatechoicebeingguidedbyaform
ofaggregation oftheseproposals.Apopularchoicehereistheaveragefunc-
tion, asits outputswill retainsimilar magnitudesirrespectiveof theimage size
(Springenberg et al. 2014).
Prominentexamplesfollowing thisCNNblueprint (someofwhich wewill
discuss next) are displayed in Figure 6.4.
DeepandResidualNetworksACNNarchitecture,initssimplestform,
isthereforespecifiedbyhyperparameters(H
f
k
,W
f
k
,N
k
,p
k
)
kK
,withM
k+1
=N
k
andp
k
=0,1indicatingwhethergridcoarseningisperformedornot.While
allthesehyperparametersareimportantinpractice,aparticularlyimportant
questionistounderstandtheroleofdepthKinCNNarchitectures,andwhat
arethe fundamentaltradeoffs involvedinchoosing suchakeyhyperparameter,
especially in relation to the filter sizes (H
f
k
,W
f
k
).
While a rigorous answer to this question isstill elusive, mounting empirical
evidence collectedthroughouttherecentyearssuggestsafavourabletradeoff
towardsdeeper(largeK)yetthinner(small(H
f
k
,W
f
k
))models
13
.Inthiscontext,
acrucialinsightbyHeetal.2016wastoreparametriseeachconvolutional
layertomodelaperturbationofthepreviousfeatures,ratherthanageneric
non-linear transformation:
h=P
(
x+σ(F(x))
)
.(6.79)
Theresultingresidualnetworksprovideseveralkeyadvantagesoverthepre-
viousformulation.Inessence,theresidualparametrisationisconsistentwith
theviewthatthedeepnetworkisadiscretisationofanunderlyingcontinu-
ous dynamical system,modelled as anordinary differential equation (ODE)
14
.
Crucially,learningadynamicalsystembymodelingitsvelocityturnsoutto
bemucheasierthanlearningitspositiondirectly.Inourlearningsetup,this
translates intoan optimisationlandscapewithmore favorablegeometry, lead-
ingto theabilitytotrain muchdeeperarchitecturesthanwaspossiblebefore.
Aswillbediscussedinfuturework,learningusingdeepneuralnetworks
definesanon-convexoptimisationproblem,whichcanbeefficientlysolved
usinggradient-descentmethodsundercertainsimplifyingregimes.Thekey
advantage of the ResNet parametrisation has been rigorouslyanalysedinsim-
plescenarios(HardtandMa2016),andremainsanactiveareaoftheoretical
investigation. Finally, NeuralODEs (R. T.Chen et al.2018) area recentpopu-
lararchitecturethatpushestheanalogywithODEsevenfurther,bylearningthe
1
32
6
28
16
10
1
120
1
84
10
SOFT
3
224
96
55
256
27
384
13
384
13
256
13
1
4096
1
4096
1000
SOFT
1
28
28
Input
16
28
3x3
Conv
16
28
16
28
+
16
28
16
28
16
28
+
16
28
16
28
16
28
+
16
28
32
28
32
28
1x1
32
28
BatchNorm
ReLU
+
32
28
128
23
23
Averagepooling
1
10
SOFT
6464
I
128128
I/2
256256
I/4
512512
I/8
10241024
I/16
BottleneckConv
512512512512
I/8
256256256256
I/4
128128128128
I/2
64646464
I
Softmax
Figure 6.4
Prominentexamplesof CNNarchitectures. Top-to-bottom:LeNet (LeCunetal. 1998),
AlexNet(Krizhevsky,Sutskever, andHinton 2012),ResNet (He etal. 2016) andU-Net
(Ronneberger,Fischer,and Brox2015).Drawn usingthe PlotNeuralNetpackage (Iqbal
2018).
Grids177
parametersof theODE
̇
x=σ(F(x))directly andrelying onstandardnumerical
integration.
NormalisationAnotherimportantalgorithmicinnovationthatboostedthe
empiricalperformanceofCNNssignificantlyisthenotionofnormalisation.In
early modelsof neural activity, itwashypothesised thatneurons perform some
formoflocal‘gaincontrol’,wherethelayercoefficientsx
k
arereplacedby
̃x
k
=σ
–1
k
(x
k
μ
k
).Here, μ
k
andσ
k
encodethe firstandsecond-ordermoment
information ofx
k
, respectively. Further, they canbe eithercomputedglobally
or locally.
In the context ofDeepLearning, this principle was widelyadopted through
the batchnormalisation layer (Ioffe and Szegedy2015)
15
, followed by several
variants(Ba,Kiros,andHinton2016;SalimansandKingma2016;Ulyanov,
Vedaldi,andLempitsky2016;Cooijmansetal.2016;WuandHe2018).
Despitesomeattemptstorigorouslyexplainthebenefitsofnormalisationin
termsofbetterconditionedoptimisationlandscapes(Santurkaret al.2018),a
general theorythat canprovideguiding principlesis stillmissing at thetime of
writing.
DataaugmentationWhileCNNsencodethegeometricpriorsassociated
with translation invariance and scaleseparation, they do notexplicitly account
for other known transformations that preserve semantic information, e.g light-
ing orcolor changes,or smallrotationsand dilations.A pragmaticapproach to
incorporate these priors with minimal architectural changes is to perform data
augmentation, where one manually performs said transformationstotheinput
images and adds them into the training set.
Data augmentationhas been empiricallysuccessful and iswidely used—not
onlytotrainstate-of-the-artvisionarchitectures,butalsotopropupseveral
developmentsinself-supervisedandcausalrepresentationlearning(T.Chen
etal.2020;Grilletal.2020;Mitrovicetal.2020).However,itisprovably
sub-optimalintermsofsamplecomplexity(Mei,Misiakiewicz,andMonta-
nari 2021;Bietti, Venturi,and Bruna 2021);a moreefficient strategy considers
instead architectures with richer invariance groups, as discussed in Chapter 7.
6.2.2Recurrent Neural Networks
Ourdiscussionhasthusfaralwaysassumedtheinputstobesolelyspatial
acrossagivendomain.However,inmanycommonusecases,theinputscan
alsobeconsideredsequential(e.g.video,textorspeech).Inthiscase,we
assume that the inputconsistsof arbitrarily many steps, wherein at each stept
we are provided with an input signal, which we represent as X
(t)
∈X(Ω
(t)
).
16
178Chapter 6
Whileingeneralthedomaincanevolveintimetogetherwiththesignals
onit,itistypicallyassumedthatthedomainiskeptfixedacrossallthet,i.e.
Ω
(t)
=Ω.Here,we will exclusively focus onthiscase, but note that exceptions
arecommon. Socialnetworks areanexample whereoneoftenhastoaccount
forthedomainchangingthroughtime,asnewlinksareregularlycreatedas
wellaserased.Thedomaininthissettingisoftenreferredtoasadynamic
graph (D. Xu et al. 2020; Rossi, Chamberlain, et al. 2020).
Often,theindividualX
(t)
inputswillexhibitusefulsymmetriesandhence
maybenontriviallytreatedbyanyofourpreviouslydiscussedarchitectures.
Somecommonexamplesinclude:videos(Ωisafixedgrid,andsignalsarea
sequence of frames);fMRI scans (Ωis a fixedmesh representing the geometry
of the brain cortex, where different regions are activated at different times as a
responsetopresentedstimuli);andtrafficflownetworks(Ωisafixedgraphrep-
resenting the road network, on which e.g. the average traffic speed is recorded
at various nodes).
Letusassumeanencoderfunctionf(X
(t)
)providinglatentrepresentations
atthelevelofgranularityappropriatefortheproblemandrespectfulofthe
symmetriesoftheinputdomain.Asanexample
17
,considerprocessingvideo
frames:thatis,ateachtimestep,wearegivenagrid-structuredinputrep-
resentedasann×ばつdmatrixX
(t)
,wherenisthenumberofpixels(fixedin
time) anddisthe numberofinputchannels(e.g.d=3forRGBframes). Fur-
ther,weareinterestedinanalysisatthelevelofentireframes,inwhichcase
itisappropriatetoimplementfasatranslationinvariantCNN,outputtinga
k-dimensional representation z
(t)
=f(X
(t)
) of the frame at time-step t.
Wearenowleftwiththetaskofappropriatelysummarisingasequenceof
vectorsz
(t)
acrossallthesteps.Acanonicalwaytodynamicallyaggregate
this informationin away thatrespects thetemporal progressionof inputsand
alsoeasilyallowsforonlinearrivalofnoveldata-points,isusingaRecurrent
Neural Network(RNN).
18
What wewillshow hereisthatRNNsareaninter-
esting geometric architecture tostudy in their own right,since they implement
a rather unusual type of symmetry over the inputs z
(t)
.
SimpleRNNsAteachstep,therecurrentneuralnetworkcomputesanm-
dimensionalsummaryvectorh
(t)
ofalltheinputstepsuptoandincludingt.
This(partial)summaryiscomputedconditionalonthe currentstep’sfeatures
andthepreviousstep’ssummary,throughasharedupdatefunction,R:R
k
×ばつ
R
m
R
m
, as follows (see Figure 6.5 for a summary):
h
(t)
=R(z
(t)
,h
(t–1)
)(6.80)
and, asboth z
(t)
and h
(t–1)
are flat vector representations, Rmay bemost easily
expressedasasinglefully-connectedneuralnetworklayer(oftenknownas
Grids179
h
(0)
RRRR
...
z
(1)
z
(2)
z
(3)
z
(4)
h
(4)
h
(3)
h
(2)
h
(1)
ffff
X
(1)
X
(2)
X
(3)
X
(4)
h
(1)
h
(2)
h
(3)
...
Figure 6.5
IllustrationofprocessingvideoinputwithRNNs.EachinputvideoframeX
(t)
is
processedusingasharedfunctionf—e.g.atranslationinvariantCNN—intoaflat
representationz
(t)
.ThentheRNNupdatefunctionRisiteratedacrossthesevectors,
iterativelyupdatingasummaryvectorh
(t)
whichsummarisesalltheinputsuptoand
includingz
(t)
.Thecomputationisseededwithaninitialsummaryvectorh
(0)
,which
may be either pre-determined or learnable.
SimpleRNN
19
; see Elman (1990) and Jordan (1997)):
h
(t)
=σ(Wz
(t)
+Uh
(t–1)
+b)(6.81)
whereWR
k×ばつm
,UR
m×ばつm
andbR
m
arelearnableparameters,andσis
anactivationfunction.Whilethisintroducesloopsinthenetwork’scompu-
tationalgraph,inpracticethenetworkisunrolledforanappropriatenumber
ofsteps,allowingforbackpropagationthroughtime(RobinsonandFallside
1987; Werbos 1988; Mozer 1989) to be applied.
Thesummaryvectorsmaythenbeappropriatelyleveragedforthedown-
streamtask—ifapredictionisrequiredateverystepofthesequence,thena
180Chapter 6
sharedpredictor maybe appliedtoeachh
(t)
individually. Forclassifying entire
sequences,typicallythefinalsummary,h
(T)
,ispassedtoaclassifier.Here,T
is the length of the sequence.
Specially, theinitial summary vector is usually either settothe zero-vector,
i.e.h
(0)
=0,oritismadelearnable.Analysingthemannerinwhichtheini-
tialsummaryvectorissetalsoallowsustodeduceaninterestingformof
translation equivariance exhibited by RNNs.
TranslationequivarianceinRNNsSinceweinterprettheindividualsteps
tasdiscretetime-steps,theinputvectorsz
(t)
canbeseenaslivingona
one-dimensional
20
gridoftime-steps.Whileitmightbeattractivetoattempt
extending ourtranslationequivariance analysisfromCNNshere,itcannotbe
done in a trivial manner.
Toseewhy, letus assumethatwehaveproduced anewsequencez
(t)
=z
(t+1)
byperformingaleft-shiftofoursequencebyonestep.Itmightbetempting
toattemptshowingh
(t)
=h
(t+1)
,asoneexpectswithtranslationequivariance;
however,thiswillnotgenerallyhold.Considert=1;directlyapplyingand
expanding the update function, we recover the following:
h
(1)
=R(z
(1)
,h
(0)
)=R(z
(2)
,h
(0)
)(6.82)
h
(2)
=R(z
(2)
,h
(1)
)=R(z
(2)
,R(z
(1)
,h
(0)
))(6.83)
Hence,unlesswecanguaranteethath
(0)
=R(z
(1)
,h
(0)
),wewillnotrecover
translation equivariance. Similar analysis can then be done for steps t>1.
Fortunately,withaslightrefactoringofhowwerepresentz,andfora
suitablechoiceofR,itispossibletosatisfytheequalityabove,andhence
demonstrateasettinginwhichRNNsareequivarianttoshifts.Ourprob-
lemwas largely oneofboundary conditions:theequality above includesz
(1)
,
whichourleft-shiftoperationdestroyed.Toabstractthisproblemaway,we
willobservehowanRNNprocessesanappropriatelyleft-paddedsequence,
̄
z
(t)
, defined as follows:
̄
z
(t)
=
(
0tt
z
(tt
)
t>t
Suchasequencenowallowsforleft-shifting
21
byuptot
stepswithoutdestroy-
ing anyof theoriginal inputelements. Further, notewe donot needto handle
right-shiftingseparately; indeed,equivariance torightshifts naturallyfollows
from the RNN equations.
WecannowagainanalysetheoperationoftheRNNoveraleft-shifted
versonof
̄
z
(t)
,whichwedenoteby
̄
z
(t)
=
̄
z
(t+1)
,aswedidinEquations
Grids181
6.82–6.83:
h
(1)
=R(
̄
z
(1)
,h
(0)
)=R(
̄
z
(2)
,h
(0)
)
h
(2)
=R(
̄
z
(2)
,h
(1)
)=R(
̄
z
(2)
,R(
̄
z
(1)
,h
(0)
))=R(
̄
z
(2)
,R(0,h
(0)
))
wherethesubstitution
̄
z
(1)
=0holdsaslongast
1,i.e.aslongasany
paddingisapplied
22
.Now,wecanguaranteeequivariancetoleft-shiftingby
one step (h
(t)
=h
(t+1)
) as long as h
(0)
=R(0,h
(0)
).
Said differently, h
(0)
must be chosen to be a fixed pointof a function γ(h)=
R(0,h).Ifthe update function Ris conveniently chosen, then notonly can we
guarantee existence of such fixed points, but we can even directly obtain them
by iterating the application of R until convergence; e.g., as follows:
h
0
=0h
k+1
=γ(h
k
),(6.84)
wheretheindexkreferstotheiterationofRinourcomputation,asopposed
tothetheindex(t)denotingthetimestepoftheRNN.IfwechooseRsuch
thatγisacontractionmapping
23
,suchaniterationwillindeedconvergeto
auniquefixedpoint.Accordingly,wecantheniterateEquation(6.84)until
h
k+1
=h
k
, and we can seth
(0)
=h
k
. Note that this computationisequivalent to
left-padding the sequence with "sufficiently many" zero-vectors.
Depth inRNNsItisalso easyto stackmultipleRNNs—simply usethe h
(t)
vectorsasaninputsequenceforasecondRNN.Thiskindofconstruction
isoccasionallycalleda"deepRNN",whichispotentiallymisleading.Effec-
tively, dueto the repeated applicationofthe recurrent operation, even a single
RNN "layer" has depth equal to the number of input steps.
Thisoftenintroducesuniquelychallenginglearningdynamicswhenopti-
misingRNNs,aseachtrainingexampleinducesmanygradientupdatesto
thesharedparametersoftheupdatenetwork.Herewewillfocusonperhaps
themostprominentsuchissue—thatofvanishingandexplodinggradients
(Bengio,Simard,andFrasconi1994)—whichisespeciallyproblematicin
RNNs,giventheirdepthandparametersharing.Further,ithassingle-handedly
spurredsomeofthemostinfluentialresearchonRNNs.Foramoredetailed
overview,wereferthereadertoPascanu,Mikolov,andBengio(2013),who
have studied the training dynamics ofRNNs in great detail, and exposed these
challenges from a variety of perspectives: analytical, geometrical, and thelens
of dynamical systems.
Toillustratevanishinggradients,consideraSimpleRNNwithasigmoidal
activationfunctionσ,whosederivativemagnitude|σ
|isalwaysbetween0
and1;seeFigure6.6.Multiplyingmanysuchvaluesresultsingradientsthat
182Chapter 6
6543210123456
0
0.2
0.4
0.6
0.8
1
Figure 6.6
Illustrationof thelogistic function,σ(x)=
1
1+exp(–x)
.Another popularchoice isthe hyper-
bolictangent,σ(x)=tanhx.TheyarecalledsigmoidalduetothedistinctS-shapeof
their plots.
quicklytendtozero,implyingthatearlystepsintheinputsequencemaynot
be able to have influence in updating the network parameters at all.
Forexample,considerthenext-wordpredictiontask(commonine.g.predic-
tive keyboards), and theinput text "Petar is Serbian. He was bornon ...[long
paragraph]...Petarcurrentlylivesin".Here,predictingthenext
wordas"Serbia"mayonlybereasonablyconcludedbyconsideringthevery
startoftheparagraph—butgradientshavelikelyvanishedbythetimethey
reach this input step, making learning from such examples very challenging.
Deepfeedforwardneuralnetworkshavealsosufferedfromthevanishing
gradient problem, until theinvention ofthe ReLU activation(which has gradi-
ents equalto exactly zeroor one—thusfixing thevanishinggradient problem).
However,in RNNs,usingReLUsmay easilyleadtoexplodinggradients,as the
outputspaceoftheupdatefunctionisnowunbounded,andgradientdescent
will updatethe cellonce forevery input step,quickly building upthe scaleof
theupdates.Historically,thevanishinggradientphenomenonwas recognised
early onas a significantobstacle in theuse of recurrentnetworks. Copingwith
thisproblemmotivatedthedevelopmentofmoresophisticatedRNNlayers,
which we describe next.
6.3Case Study: Long Short-Term Memory
Akeyinventionthatsignificantlyreducedtheeffectsofvanishinggradients
in RNNs is thatof gating mechanisms, which allow the network to selectively
overwriteinformationinadata-drivenway.Prominentexamplesofthesegated
RNNs includethe Long Short-TermMemory (LSTM; Hochreiterand Schmid-
huber(1997))andtheGatedRecurrentUnit(GRU;Choetal.(2014)).Here
wewillprimarilydiscusstheLSTM—specifically,thevariantpresentedby
Graves (2013)—inorderto illustratetheoperations of suchmodels. Concepts
fromLSTMseasilycarry over toothergated RNNs.Throughoutthissection,
Grids183
W
c
,U
c
,b
c
W
i
,U
i
,b
i
W
f
,U
f
,b
f
W
o
,U
o
,b
o
z
(t)
h
(t1)
×ばつ+
tanh
×ばつ
h
(t)
×ばつ
M
e
c
(t)
c
(t1)
i
(t)
o
(t)
f
(t)
c
(t)
LSTM
Figure 6.7
Thedataflowofthelong short-termmemory(LSTM),withitscomponents andmemory
cell(M)clearlyhighlighted.Basedonthecurrentinputz
(t)
,previoussummaryh
(t–1)
and
previouscell state c
(t–1)
, theLSTM predictsthe updatedcell statec
(t)
and summaryh
(t)
.
it will likely be useful to refer toFigure6.7, which illustrates all of the LSTM
operations that we will discuss in text.
TheLSTMaugmentstherecurrentcomputationbyintroducingamemory
cell,which storescellstate vectors, c
(t)
R
m
, thatare preserved betweencom-
putationalsteps.TheLSTMcomputessummaryvectors,h
(t)
,directlybased
onc
(t)
,andc
(t)
is,inturn,computedusingz
(t)
,h
(t–1)
andc
(t–1)
.Critically,the
cell isnot completelyoverwrittenbasedon z
(t)
and h
(t–1)
, whichwould expose
thenetwork tothesameissuesastheSimpleRNN.Instead,acertainquantity
ofthepreviouscellstatemayberetained—andtheproportionbywhichthis
occurs is explicitly learned from data.
JustlikeinSimpleRNN,wecomputefeaturesbyusingasinglefully-
connectedneuralnetworklayeroverthecurrentinputstepandprevious
summary:
24
e
c
(t)
=tanh(W
c
z
(t)
+U
c
h
(t–1)
+b
c
)(6.85)
But,as mentioned,wedonot allowallofthisvectortoenterthecell—hence
whywecallitthevectorofcandidatefeatures,anddenoteitas
e
c
(t)
.Instead,
theLSTMdirectly learnsgatingvectors,whicharereal-valuedvectors inthe
range[0,1],anddecidehowmuchofthesignalshouldbeallowedtoenter,
exit, and overwrite the memory cell.
Threesuchgatesarecomputed,allbasedonz
(t)
andh
(t–1)
:theinputgate
i
(t)
,which computestheproportion ofthe candidatevectorallowedto enterthe
cell;theforgetgatef
(t)
,whichcomputestheproportionofthepreviouscell
184Chapter 6
statetoberetained,andtheoutputgateo
(t)
,whichcomputestheproportion
ofthenewcellstatetobeusedforthefinalsummaryvector.Typicallyallof
these gates arealso derivedusing asinglefully connectedlayer,albeit withthe
logisticsigmoidactivation logistic(x)=
1
1+exp(–x)
,inordertoguaranteethatthe
outputs are in the [0,1] range
25
:
i
(t)
=logistic(W
i
z
(t)
+U
i
h
(t–1)
+b
i
)(6.86)
f
(t)
=logistic(W
f
z
(t)
+U
f
h
(t–1)
+b
f
)(6.87)
o
(t)
=logistic(W
o
z
(t)
+U
o
h
(t–1)
+b
o
)(6.88)
Finally,thesegatesareappropriatelyappliedtodecodethenewcellstate,
c
(t)
,whichisthenmodulated bythe outputgateto producethe summaryvector
h
(t)
, as follows:
c
(t)
=i
(t)
e
c
(t)
+f
(t)
c
(t–1)
(6.89)
h
(t)
=o
(t)
tanh(c
(t)
)(6.90)
whereiselement-wisevectormultiplication.Appliedtogether,Equations
(6.85)–(6.90)completelyspecifytheupdaterulefortheLSTM,whichnow
takes into account the cell vectorc
(t)
as well
26
:
(h
(t)
,c
(t)
)=R(z
(t)
,(h
(t–1)
,c
(t–1)
))
Notethat,asthevaluesoff
(t)
arederivedfromz
(t)
andh
(t–1)
—andtherefore
directlylearnablefromdata—theLSTMeffectivelylearnshowtoappropri-
atelyforgetpastexperiences.Indeed,thevaluesoff
(t)
directlyappearinthe
backpropagation update forall the LSTMparameters (W
,U
,b
), allowing
the network toexplicitly control, in a data-driven way, the degree of vanishing
for the gradients across the time steps.
Besidestacklingthevanishinggradientissuehead-on,itturnsoutthat
gatedRNNsalsounlockaveryusefulformofinvariancetotime-warping
transformations, which remains out of reach of SimpleRNNs.
Time warping invariance of gated RNNsWe willstart by illustrating, ina
continuous-timesetting
27
,whatdoesitmeantowarptime,andwhatisrequired
ofarecurrentmodelinordertoachieveinvariancetosuchtransformations.
Our exposition will largely follow the work of TallecandOllivier (2018), that
initially describedthis phenomenon—and indeed,they wereamong thefirst to
actually study RNNs from the lens of invariances.
Letusassumea continuoustime-domainsignalz(t), onwhichwewouldlike
toapply anRNN.Toalignthe RNN’sdiscrete-time computationofsummary
vectorsh
(t)28
with ananalogue in thecontinuous domain,h(t),we willobserve
Grids185
Figure 6.8
Time-warpingoperationscanbesimple,suchastimerescaling;e.g.τ(t)=0.7t(dis-
playedhere),which,inadiscretesetting,wouldamounttonewinputsbeingreceived
every1.43steps.However,italsoadmitsawidespectrumofvariably-changing
samplingrates,e.g.samplingmayfreelyaccelerateordeceleratethroughoutthetime
domain.
its linear Taylor expansion:
h(t+δ)h(t)+δ
dh(t)
dt
(6.91)
and, setting δ=1, we recover a relationship between h(t) and h(t+1), which is
exactlywhattheRNNupdatefunctionR(Equation6.80)computes.Namely,
the RNN update function satisfies the following differential equation:
dh(t)
dt
=h(t+1)–h(t)=R(z(t+1),h(t))–h(t)(6.92)
We would likethe RNN tobe resilient tothe way inwhich the signalis sam-
pled(e.g.bychangingthetimeunitofmeasurement),inordertoaccountfor
anyimperfections orirregularitiestherein.Formally, wedenotea timewarping
operationτ:R
+
R
+
,asanymonotonicallyincreasingdifferentiablemapping
betweentimes.Thenotationτischosenbecausetimewarpingrepresentsan
automorphism of time. See Figure 6.8 for an illustration of warping.
Further,westatethataclassofmodelsisinvarianttotimewarpingif,for
anymodel ofthe classand any suchτ,there existsanother (possiblythesame)
modelfromtheclassthatprocessesthewarpeddatainthesamewayasthe
original model did in the non-warped case.
This isa potentiallyvery useful property. Ifwehave anRNN classcapable
ofmodellingshort-termdependencieswell,andwecanalsoshowthatthis
classisinvarianttotimewarping,thenweknowitispossibletotrainsucha
modelinawaythatwillusefullycapturelong-termdependenciesaswell(as
theywouldcorrespondtoatimedilationwarpingofasignalwithshort-term
dependencies).Aswewillshortlysee,itisnocoincidencethatgatedRNN
modelssuchasthe LSTMwereproposedtomodellong-rangedependencies.
Achieving timewarpinginvarianceistightlycoupledwithpresenceofgating
mechanisms, such as the input/forget/output gates of LSTMs.
186Chapter 6
Whentimegetswarpedbyτ,thesignalobservedbytheRNNattime
tisz(τ(t))and,toremaininvarianttosuchwarpings,itshouldpredictan
equivalently-warpedsummary functionh(τ(t)).Using Taylorexpansion argu-
ments once more, we derive a form of Equation 6.92 for the warped time, that
the RNN update R should satisfy:
dh(τ(t))
dτ(t)
=R(z(τ(t+1)),h(τ(t)))–h(τ(t))(6.93)
However,theabovederivativeiscomputedwithrespecttothewarpedtime
τ(t),andhencedoesnottakeintoaccounttheoriginalsignal.Tomakeour
model take into account the warping transformation explicitly, we need to dif-
ferentiatethe warpedsummaryfunction withrespecttot.Applying thechain
rule, this yields the following differential equation:
dh(τ(t))
dt
=
dh(τ(t))
dτ(t)
dτ(t)
dt
=
dτ(t)
dt
R(z(τ(t+1)),h(τ(t)))–
dτ(t)
dt
h(τ(t))(6.94)
and,forour(continuous-time)RNNtoremaininvarianttoanytimewarping
τ(t), it needs to be able toexplicitly represent the derivative
dτ(t)
dt
, which is not
assumedknownupfront!WeneedtointroducealearnablefunctionΓ which
approximates this derivative. For example, Γcould be aneural network taking
into account z(t+1) and h(t) and predicting scalar outputs.
Now,remarkthat,fromthepointofviewofadiscreteRNNmodelunder
timewarping,itsinputz
(t)
willcorrespondtoz(τ(t)),anditssummaryh
(t)
willcorrespondtoh(τ(t)).Toobtaintherequiredrelationshipofh
(t)
toh
(t+1)
inordertoremaininvarianttotimewarping,wewilluseaone-stepTaylor
expansion of h(τ(t)):
h(τ(t+δ))h(τ(t))+δ
dh(τ(t))
dt
and, onceagain, settingδ=1 andsubstituting Equation 6.94,then discretising:
h
(t+1)
=h
(t)
+
dτ(t)
dt
R(z
(t+1)
,h
(t)
)–
dτ(t)
dt
h
(t)
=
dτ(t)
dt
R(z
(t+1)
,h
(t)
)+
1–
dτ(t)
dt
h
(t)
Finally,weswap
dτ(t)
dt
withtheaforementionedlearnablefunction,Γ.This
gives us the required form for our time warping-invariant RNN:
h
(t+1)
=Γ(z
(t+1)
,h
(t)
)R(z
(t+1)
,h
(t)
)+(1–Γ(z
(t+1)
,h
(t)
))h
(t)
(6.95)
We may quicklydeduce that SimpleRNNs (Equation 6.81)are nottimewarp-
inginvariant,giventhattheydonotfeaturethesecondterminEquation
Grids187
6.95.Instead,theyfullyoverwriteh
(t)
withR(z
(t+1)
,h
(t)
),whichcorresponds
to assuming no time warping at all;
dτ(t)
dt
=1, i.e. τ(t)=t.
Further,ourlinkbetweencontinuous-timeRNNsandthediscreteRNN
basedonRrestedontheaccuracyoftheTaylorapproximation,whichholds
only if the time-warpingderivative is not toolarge, i.e.,
dτ(t)
dt
1. The intuitive
explanationofthisis:ifourtimewarpingoperationevercontractstimeina
waythat makes timeincrements (tt+1) largeenough thatintermediate data
changesarenotsampled,themodelcanneverhopetoprocesstime-warped
inputsinthesamewayasoriginalones—itsimplywouldnothaveaccessto
thesameinformation.Conversely,timedilationsofanyform(which,indis-
creteterms, correspondtointerspersingtheinputtime-serieswithzeroes)are
perfectly allowed within our framework.
Combinedwithourrequirementofmonotonicallyincreasingτ(
dτ(t)
dt
>0),
wecanboundtheoutputspaceofΓas0<Γ(z
(t+1)
,h
(t)
)<1,whichmotivates
the use of the logistic sigmoid activation for Γ, e.g.:
Γ(z
(t+1)
,h
(t)
)=logistic(W
Γ
z
(t+1)
+U
Γ
h
(t)
+b
Γ
)
exactlymatchingtheLSTMgatingequations(e.g.Equation6.86).Themain
differenceisthatLSTMscomputegatingvectors,whereasEquation6.95
impliesΓshouldoutputascalar.Vectorisedgates(Hochreiter1991)allow
tofitadifferentwarpingderivativeineverydimensionofh
(t)
,allowingfor
reasoning over multiple time horizons simultaneously.
It is worth taking a pause here to summarise what we have done. By requir-
ing that our RNN classis invariantto (non-destructive) time warping,we have
derived the necessary form thatitmust have (Equation 6.95), and showed that
itexactlycorrespondstotheclassofgatedRNNs.Thegates’primaryrole
underthisperspectiveistoaccuratelyfitthederivative
dτ(t)
dt
ofthewarping
transformation.
The notionof classinvariance issomewhatdistinct fromthe invariances we
studiedpreviously.Namely,oncewetrain agatedRNNona time-warpedinput
with τ
1
(t), in many cases we cannot zero-shot transfer it to a signal warped by
a different τ
2
(t).Rather, class invariance only guaranteesthat gated RNNsare
powerfulenough tofit bothof thesesignals inthe samemanner,butpotentially
withvastlydifferentmodelparameters.Thatbeingsaid,therealisationthat
effectivegating mechanismsare tightlyrelatedto fittingthe warpingderivative
canyieldusefulprescriptionsforgatedRNNoptimisation,aswenow briefly
demonstrate.
Forexample,we canoftenassume thatthe rangeof thedependencies weare
interested in tracking within our signal will be in the range [T
l
,T
h
] time-steps.
188Chapter 6
ByanalysingtheanalyticsolutionstoEquation6.94,itcanbeshownthat
thecharacteristicforgettingtimeofh
(t)
byourgatedRNNisproportionalto
1
Γ(z
(t+1)
,h
(t)
)
.Hence,wewouldlikeourgatingvaluestoliebetween
h
1
T
h
,
1
T
m
i
in
order to effectively remember information within the assumed range.
Further,ifweassumethatz
(t)
andh
(t)
areroughlyzero-centered—which
isacommonby-productofapplyingtransformationssuchaslayernormali-
sation(Ba,Kiros,andHinton2016)—wecanassumethatE[Γ(z
(t+1)
,h
(t)
)]
logistic(b
Γ
).Controllingthebiasvectorofthegatingmechanismishencea
very powerful way of controlling the effective gate value
29
.
Combiningthetwo observations,weconcludethatanappropriaterange of
gatingvaluescanbeobtainedbyinitialisingb
Γ
–log(U(T
l
,T
h
)–1),where
Uis theuniform real distribution. Sucha recommendation was dubbedchrono
initialisation by Tallec and Ollivier (2018), andhas been empirically shownto
improve the long-range dependency modelling of gated RNNs.
Exercises
1.We noted that a matrix is circulantif and only if it commuteswith the shift operator
S. Let us prove this fundamental result for 1D grids.
(i)Write down the explicit elements of the right shift matrix Sof Nelements.
(ii)LetCbeanarbitrary N×ばつN matrix. Writedowntheelements ofthe resultingmatrices
SCand CS.
(iii)By requiring translation equivariance (SC=CS), deduce the relationshipbetween c
ij
and c
i+1,j+1
.
(iv)Usethisrelationshiptoconclude thattheelementsc
ij
dependonly on(ij)modN,
proving that Cmust be a circulant.
2.Convolutionalneuralnetworksoftenusestridedconvolutions(orpooling)to
coarsen thegrid.For example, astride ofs=2drops every secondpixel of theout-
putgrid.Show thatanarchitecturewithastrideof2is nolongerequivarianttoall
translations, and characterise which translations it is equivariant to.
3.Giventhatgridsareaspecialcaseofgraphs,wemayattempttoexpressaCNN
usingthelanguageofgraphneuralnetworks(GNNs)fromChapter5.Considera
1Dcyclicalgridwithnnodesasaringgraph,comprisingnodesV={0,1,...,n
1}andneighbourhoodsN
u
={(u–1)modn,u,(u+1)modn},withnodefeatures
x
u
R
3
being typical RGB values.
(i)Assume we use a GCN (graph convolutional network) model over this graph:
h
u
=φ
x
u
,
X
v∈N
u
1
d
u
d
v
ψ(x
v
)
!
.
Grids189
ThisGNNflavourhassomeoftheclosesttiestothespectralpropertiesunderly-
ingCNNs,andthereforeitrepresentsavaluableattempt.However,itsadlycannot
express allpossible convolutionsover images(evenif we expand theneighbourhoods).
Characterise which image CNNs are representable using the GCN.
(ii)WecanattemptfixingthisbymovingtoamoreexpressiveflavourofGNNs;recall
that the most expressive option are message passing GNNs:
h
u
=φ
x
u
,
M
v∈N
u
ψ(x
u
,x
v
)
!
,
for apermutation-invariant aggregator
L
. Provideasimplecounter-example showing
that even this expressive GNN cannot reliably simulate all CNN filters over the grid.
(iii)WhatallowsCNNstobypassthisrestriction?Leverage thisassumption toprovidea
correctiontothemessage-passingequationthatcanrepresentallpossibleCNNsover
the grid.
4.In this question we willstudy time rescalings, a special case oftime warping trans-
formations.Timerescalingshavetheformτ(t)=αt,whereα(0,1].Itiseasyto
verifythatrescalingsmonotonicallydilatetime,satisfyingtherequirementsfora
non-destructive timewarping operation. We will startby studying continuous-space
RNNs, taking input signals z(t) to continuous summaries h(t).
(i)StatethelinearTaylorexpansionofh(t)aroundt.Underwhichconditionsisthis
agoodapproximation?UsetheTaylorexpansiontorelateh(t)toh(t+1),usingthe
discrete RNN update, R(z
(t+1)
,h
(t)
) and the summary derivative
dh(t)
dt
.
(ii)Nowintroduce theproposedtime rescaling,and derivethe requiredcondition forclass
invariance to such operations.
(iii)Derive the update rule for a discrete RNN that is invariant to time rescaling.
(iv)Assume thata time-rescaling-invariant RNNhas beenpre-trained overdata whichhas
beenrescaledusingτ
1
(t)=α
1
t.Isitpossibletozero-shottransferthisRNNtomake
predictions over a signal that has been rescaled by τ
2
(t)=α
2
t? If so, how?
Notes
1
Because ofthe periodicboundary conditions,it isa circularor cyclicconvo-
lution. In signalprocessing, θis often referredto as the"filter,"and in CNNs,
its coefficients are learnable.
2
The left shiftoperator is givenby S
. Obviously, shifting left and thenright
(orviceversa)doesnotdoanything,whichmeansSisorthogonal:S
S=
SS
=I.
3
We must additionally assume distincteigenvalues, otherwise there might be
multiplepossible diagonalisations.Thisassumptionissatisfiedwithourchoice
of S.
190Chapter 6
4
Sisorthogonalbutnon-symmetric,hence,itseigenvectors areorthogonal
but the eigenvalues are complex (roots of unity).
5
Notethattheeigenvectorsarecomplex,soweneedtotakecomplex
conjugation when transposing Φ.
6
SincetheFouriertransformisanorthogonalmatrix(Φ
Φ=I),geomet-
ricallyitactsasachangeofthesystemofcoordinatesthatamountstoan
N-dimensionalrotation. Inthis systemof coordinates("Fourier domain"), the
action of a circulant Cmatrix becomes element-wise product.
7
Ingraphsignalprocessing,theeigenvectorsofthegraphLaplacianare
oftenusedasanalternativeoftheadjacencymatrixtoconstructthegraph
Fourier transform,see Shumanet al.2013.On grids,both matriceshave joint
eigenvectors,butongraphstheyresultsinsomewhatdifferentthoughrelated
constructions.
8
Eigenfunction is synonymous with ‘eigenvector’ and is used when referring
to eigenvectors of continuous operators.
9
The spectralcharacterisation ofthe translationgroup isa particular caseof a
moregeneralresultinFunctionalAnalysis,theStone’sTheorem,whichderives
an equivalent characterisation for any one-parameter unitary group.
10
Recall, C(θ) is a circulant matrix with parameters θ.
11
Notethatweusuallyimaginex andθ
vw
as2Dmatrices,butinthisequa-
tion,bothxandθ
vw
havetheirtwocoordinatedimensionsflattenedinto
one—making xa vector, and C(θ
vw
) a matrix.
12
CNNs whichonly consist ofthe operations mentionedin thisparagraph are
oftendubbed"all-convolutional".Incontrast,manyCNNsflattentheimage
acrossthespatialaxesandpassthemtoanMLPclassifier,oncesufficient
equivariantandcoarseninglayershavebeenapplied.Thislosestranslation
invariance.
13
Historically,ResNetmodelsarepredatedbyhighwaynetworks(Srivas-
tava,Greff,andSchmidhuber2015),whichallowformoregeneralgating
mechanisms to control the residual information flow.
14
In thiscase, theResNet isperforming aForward Eulerdiscretisation ofan
ODE:
̇
x=σ(F(x))
15
We notethat normalisingactivations of neuralnetworks hasseen attention
evenbeforetheadventofbatchnormalisation.See,e.g.,LyuandSimoncelli
2008.
16
Whetherthe domainisconsideredstatic ordynamicconcernstime scales:
e.g.,aroadnetworkdoeschangeovertime(asnewroadsarebuiltandold
ones aredemolished), but significantlyslower comparedto theflow oftraffic.
Grids191
Similarly,insocialnetworks,changesinengagement(e.g.Twitterusersre-
tweetingatweet)happenatamuchhigherfrequencythanchangesinthefollow
graph.
17
Wedonotlose generalityin ourexample; equivalentanalysiscanbe done
e.g. for node-level outputs on a spatiotemporal graph; theonlydifference is in
the choice of encoder f(which will then be a permutation equivariant GNN).
18
Notethatthez
(t)
vectorscanbeseenaspointsonatemporalgrid:hence,
processingthemwithaCNNisalsoviableinsomecases.Transformersare
also increasingly popular models for processing generic sequential inputs.
19
In spiteof theirname, SimpleRNNsare remarkablyexpressive.Forexam-
ple,itwasshownbySiegelmannandSontag(1995)thatsuchmodelsare
Turing-complete,meaningthattheycanlikelyrepresentanycomputationwe
may ever be able to execute on computers.
20
Notethatthisconstructionisextendabletogridsinhigherdimensions,
allowingusto,e.g.,processsignalslivingonimagesinascanlinefashion.
Suchaconstruction powereda popularseriesofmodels,such asthe PixelRNN
from Oord, Kalchbrenner, and Kavukcuoglu (2016).
21
Note that equivalent analyses willarise if we use adifferentpadding vector
than 0.
22
In a very similar vein, we can derive equivariance to left-shifting by s steps
as long as t
s.
23
Contractionsarefunctionsγ:X→Xsuchthat,undersomenorm∥·∥on
X,applyingγcontractsthedistancesbetweenpoints:forallx,y∈X,and
someq[0,1),itholdsthatγ(x)–γ(y)∥≤qxy.Iteratingsuchafunc-
tionthennecessarily convergesto aunique fixedpoint, asa directconsequence
of Banach’s Fixed Point Theorem (Banach 1922).
24
Notethatwehavesettheactivationfunctiontotanhhere;asLSTMsare
designedto amelioratethevanishinggradient problem,it isnowappropriate to
use a sigmoidal activation.
25
Notethatthethreegatesarethemselvesvectors,i.e.i
(t)
,f
(t)
,o
(t)
[0,1]
m
.
Thisallowsthemtocontrolhowmucheachofthemdimensionsisallowed
through the gate.
26
ThisisstillcompatiblewiththeRNNupdateblueprintfromEquation
(6.80);simplyconsiderthesummaryvectortobetheconcatenationofh
(t)
and c
(t)
; sometimes denoted by h
(t)
c
(t)
.
27
Wefocusonthecontinuoussettingasitwillbeeasiertoreasonabout
manipulations of time there.
28
We will use h(t) to denote acontinuous signal at time t, and h
(t)
to denote a
discrete signal at time-step t.
192Chapter 6
29
ThisinsightwasalreadyspottedbyGersandJ.Schmidhuber(2000)and
Jozefowicz,Zaremba,andSutskever(2015),whoempiricallyrecommended
initialising theforget-gatebias ofLSTMs toa constantpositive vector, suchas
1.

AltStyle によって変換されたページ (->オリジナル) /