Siamese neural network
A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.[1][2][3][4] Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints, but can be described more technically as a distance function for locality-sensitive hashing.[citation needed]
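The shared-weight idea can be sketched in a few lines of numpy. This is illustrative only: a random linear map followed by tanh stands in for a trained network, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # ONE weight matrix, shared by both branches

def embed(x):
    # Both inputs pass through the same weights: the "twin" property.
    # A random linear map + tanh stands in for a trained network here.
    return np.tanh(W @ x)

x_query = rng.standard_normal(8)     # e.g. a fresh input to compare
x_baseline = rng.standard_normal(8)  # e.g. a stored reference input

e_baseline = embed(x_baseline)       # often precomputed and cached
distance = np.linalg.norm(embed(x_query) - e_baseline)
print(distance)  # comparable output vectors, compared via a distance
```

Because both branches share `W`, the baseline embedding can be computed once and reused against many queries, which is the precomputation pattern described above.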
It is possible to build an architecture that is functionally similar to a twin network but implements a slightly different function. This is typically used for comparing similar instances in different type sets.[citation needed ]
Twin networks are useful wherever a similarity measure is needed, for example recognizing handwritten checks, detecting faces in camera images, and matching queries against indexed documents. Perhaps the best-known application of twin networks is face recognition, where known images of people are precomputed and compared to an image from a turnstile or similar source. It is not obvious at first, but there are two slightly different problems. One is recognizing a person among a large number of other persons, that is, the facial recognition problem; DeepFace is an example of such a system.[4] In its most extreme form this is recognizing a single person at a train station or airport. The other is face verification, that is, for example, verifying whether a photo in a passport matches the face of the passport's owner. The twin network might be the same, but the implementation can be quite different.
Learning
Learning in twin networks can be done with triplet loss or contrastive loss. For learning by triplet loss, a baseline vector (anchor image) is compared against a positive vector (a matching image) and a negative vector (a non-matching image). The negative vector forces learning in the network, while the positive vector acts like a regularizer. For learning by contrastive loss there must be a weight decay to regularize the weights, or some similar operation such as normalization.
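Both losses can be sketched in numpy on single examples (the margins, example vectors, and squared-Euclidean choice are illustrative, not from a cited implementation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # max(0, d(a, p) - d(a, n) + margin), with squared Euclidean d:
    # the negative drives learning, the positive acts like a regularizer.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def contrastive_loss(a, b, same, margin=1.0):
    # Similar pairs are pulled together; dissimilar pairs are pushed
    # apart until they clear the margin.
    d = np.linalg.norm(a - b)
    return d ** 2 if same else max(0.0, margin - d) ** 2

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # a matching example
negative = np.array([2.0, 0.0])   # a non-matching example

print(triplet_loss(anchor, positive, negative))  # 0.0 (negative already clears the margin)
print(contrastive_loss(anchor, positive, True))  # ~0.01 (small for a similar pair)
```

Note that once the negative is far enough past the margin, the triplet loss is zero and contributes no gradient, which is why hard-negative selection matters in practice.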
A distance metric for a loss function may have the following properties:[5]
- Non-negativity: δ(x, y) ≥ 0
- Identity of indiscernibles: δ(x, y) = 0 ⇔ x = y
- Symmetry (commutativity): δ(x, y) = δ(y, x)
- Triangle inequality: δ(x, z) ≤ δ(x, y) + δ(y, z)
In particular, the triplet loss algorithm is often defined with squared Euclidean distance at its core, which, unlike ordinary Euclidean distance, does not satisfy the triangle inequality.
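A one-dimensional counterexample shows the violation directly:

```python
def sq_euclid(x, y):
    # Squared Euclidean distance in one dimension, for clarity.
    return (x - y) ** 2

x, y, z = 0.0, 1.0, 2.0
print(sq_euclid(x, z))                    # 4.0
print(sq_euclid(x, y) + sq_euclid(y, z))  # 2.0 < 4.0, so the triangle inequality fails
```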
Predefined metrics, Euclidean distance metric
The common learning goal is to minimize a distance metric for similar objects and maximize it for distinct ones. This gives a loss function like
- δ(x^(i), x^(j)) = min ‖f(x^(i)) − f(x^(j))‖ if i = j, and max ‖f(x^(i)) − f(x^(j))‖ if i ≠ j
- i, j are indexes into a set of vectors
- f(·) is the function implemented by the twin network
The most common distance metric used is Euclidean distance, in case of which the loss function can be rewritten in matrix form as
- δ(x^(i), x^(j)) ≈ (x^(i) − x^(j))^T (x^(i) − x^(j))
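The matrix form is just the squared Euclidean norm of the difference vector, which a quick numpy check confirms (the example vectors are arbitrary):

```python
import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.0, 2.0, 5.0])

diff = xi - xj
quad = diff.T @ diff  # (x_i - x_j)^T (x_i - x_j)

# The quadratic form equals the squared Euclidean norm of the difference.
print(quad)                       # 5.0
print(np.linalg.norm(diff) ** 2)  # also ~5.0, up to floating point
```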
Learned metrics, nonlinear distance metric
A more general case is where the output vector from the twin network is passed through additional network layers implementing non-linear distance metrics.
- if i = j, then δ[f(x^(i)), f(x^(j))] is small; otherwise, δ[f(x^(i)), f(x^(j))] is large
- i, j are indexes into a set of vectors
- f(·) is the function implemented by the twin network
- δ(·) is the function implemented by the network joining outputs from the twin network
In matrix form, the above is often approximated as a Mahalanobis distance for a linear space:[6]
- δ(x^(i), x^(j)) ≈ (x^(i) − x^(j))^T M (x^(i) − x^(j))
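A minimal numpy sketch of the Mahalanobis form, with an arbitrary positive semi-definite M (the construction M = AᵀA is an illustrative choice, not prescribed by the source):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
M = A.T @ A  # positive semi-definite, so the quadratic form is non-negative

xi = rng.standard_normal(3)
xj = rng.standard_normal(3)

diff = xi - xj
d2 = diff @ M @ diff  # (x_i - x_j)^T M (x_i - x_j)

# Equivalent view: this Mahalanobis form is Euclidean distance after the
# linear map A, since diff^T A^T A diff = ||A diff||^2.
print(d2 >= 0)                                        # True
print(np.isclose(d2, np.linalg.norm(A @ diff) ** 2))  # True
```

Learning M (or equivalently the map A) from data is what distinguishes a learned metric from the predefined Euclidean case, where M is the identity.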
This can be further subdivided into at least unsupervised learning and supervised learning.
Learned metrics, half-twin networks
This form also allows the twin network to be more of a half-twin, with the two branches implementing slightly different functions:
- if i = j, then δ[f(x^(i)), g(x^(j))] is small; otherwise, δ[f(x^(i)), g(x^(j))] is large
- i, j are indexes into a set of vectors
- f(·), g(·) are the functions implemented by the half-twin network
- δ(·) is the function implemented by the network joining outputs from the twin network
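A half-twin can be sketched as two branches with different weights, and even different input sizes, that map into a common embedding space joined by one distance function (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
Wf = rng.standard_normal((4, 8))  # branch f: its own weights, 8-dim inputs
Wg = rng.standard_normal((4, 6))  # branch g: different weights, 6-dim inputs

def f(x):
    return np.tanh(Wf @ x)

def g(y):
    return np.tanh(Wg @ y)

# The two branches differ, but both map into the same 4-dim space,
# so a single distance function delta can join their outputs.
item_a = rng.standard_normal(8)
item_b = rng.standard_normal(6)
delta = np.linalg.norm(f(item_a) - g(item_b))
print(delta >= 0)  # True
```

This is the pattern used when comparing instances of different types, e.g. a text query against an indexed document, where one shared network cannot process both inputs.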
Twin networks for object tracking
Twin networks have been used in object tracking because of their two tandem inputs and similarity measurement. In object tracking, one input of the twin network is a user-preselected exemplar image; the other input is a larger search image. The twin network's job is to locate the exemplar inside the search image. By measuring the similarity between the exemplar and each part of the search image, the twin network produces a map of similarity scores. Furthermore, using a fully convolutional network, the process of computing each sector's similarity score can be replaced with a single cross-correlation layer.[7]
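The similarity map can be sketched as a plain cross-correlation in numpy. In real trackers the correlation runs on learned feature maps from the twin branches, not raw pixels as here; the image sizes and target location are made up for illustration:

```python
import numpy as np

def similarity_map(exemplar, search):
    # Slide the exemplar over the search image; each output entry is an
    # inner-product similarity score. This whole loop is exactly one
    # cross-correlation, which a single network layer can compute.
    eh, ew = exemplar.shape
    sh, sw = search.shape
    out = np.empty((sh - eh + 1, sw - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(exemplar * search[i:i + eh, j:j + ew])
    return out

search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0      # a bright 2x2 target at rows 3-4, cols 4-5
exemplar = np.ones((2, 2))  # exemplar matching that target
score = similarity_map(exemplar, search)
loc = np.unravel_index(score.argmax(), score.shape)
print(loc == (3, 4))  # True: the peak of the score map locates the target
```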
Since being first introduced in 2016, twin fully convolutional networks have been used in many high-performance, real-time object tracking neural networks, including CFNet,[8] StructSiam,[9] SiamFC-tri,[10] DSiam,[11] SA-Siam,[12] SiamRPN,[13] DaSiamRPN,[14] Cascaded SiamRPN,[15] SiamMask,[16] SiamRPN++,[17] and Deeper and Wider SiamRPN.[18]
References
- ^ Chicco, Davide (2020), "Siamese neural networks: an overview", Artificial Neural Networks, Methods in Molecular Biology, vol. 2190 (3rd ed.), New York City, New York, USA: Springer Protocols, Humana Press, pp. 73–94, doi:10.1007/978-1-0716-0826-5_3, ISBN 978-1-0716-0826-5, PMID 32804361, S2CID 221144012
- ^ Bromley, Jane; Guyon, Isabelle; LeCun, Yann; Säckinger, Eduard; Shah, Roopak (1994). "Signature verification using a "Siamese" time delay neural network" (PDF). Advances in Neural Information Processing Systems. 6: 737–744.
- ^ Chopra, S.; Hadsell, R.; LeCun, Y. (June 2005). "Learning a Similarity Metric Discriminatively, with Application to Face Verification". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. pp. 539–546. doi:10.1109/CVPR.2005.202. ISBN 0-7695-2372-2. S2CID 5555257.
- ^ a b Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. (June 2014). "DeepFace: Closing the Gap to Human-Level Performance in Face Verification". 2014 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1701–1708. doi:10.1109/CVPR.2014.220. ISBN 978-1-4799-5118-5. S2CID 2814088.
- ^ Chatterjee, Moitreya; Luo, Yunan. "Similarity Learning with (or without) Convolutional Neural Network" (PDF). Retrieved 7 December 2018.
- ^ Mahalanobis, P. C. (1936). "On the generalized distance in statistics" (PDF). Proceedings of the National Institute of Sciences of India. 2 (1): 49–55.
- ^ Bertinetto, Luca; Valmadre, Jack; Henriques, João F.; Vedaldi, Andrea; Torr, Philip H. S. (2016). "Fully-Convolutional Siamese Networks for Object Tracking". arXiv:1606.09549 [cs.CV].
- ^ "End-to-end representation learning for Correlation Filter based tracking".
- ^ "Structured Siamese Network for Real-Time Visual Tracking" (PDF).
- ^ "Triplet Loss in Siamese Network for Object Tracking" (PDF).
- ^ "Learning Dynamic Siamese Network for Visual Object Tracking" (PDF).
- ^ "A Twofold Siamese Network for Real-Time Object Tracking" (PDF).
- ^ "High Performance Visual Tracking with Siamese Region Proposal Network" (PDF).
- ^ Zhu, Zheng; Wang, Qiang; Li, Bo; Wu, Wei; Yan, Junjie; Hu, Weiming (2018). "Distractor-aware Siamese Networks for Visual Object Tracking". arXiv:1808.06048 [cs.CV].
- ^ Fan, Heng; Ling, Haibin (2018). "Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking". arXiv:1812.06148 [cs.CV].
- ^ Wang, Qiang; Zhang, Li; Bertinetto, Luca; Hu, Weiming; Torr, Philip H. S. (2018). "Fast Online Object Tracking and Segmentation: A Unifying Approach". arXiv:1812.05050 [cs.CV].
- ^ Li, Bo; Wu, Wei; Wang, Qiang; Zhang, Fangyi; Xing, Junliang; Yan, Junjie (2018). "SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks". arXiv:1812.11703 [cs.CV].
- ^ Zhang, Zhipeng; Peng, Houwen (2019). "Deeper and Wider Siamese Networks for Real-Time Visual Tracking". arXiv:1901.01660 [cs.CV].