Optimal matching
Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.
Algorithm
[edit ]Let {\displaystyle S=(s_{1},s_{2},s_{3},\ldots s_{T})} be a sequence of states {\displaystyle s_{i}} belonging to a finite set of possible states. Let us denote {\displaystyle {\mathbf {S} }} the sequence space, i.e. the set of all possible sequences of states.
Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators {\displaystyle a_{i}:{\mathbf {S} }\rightarrow {\mathbf {S} }}. In the most simple approach, a set composed of only three basic operations to transform sequences is used:
- one state {\displaystyle s} is inserted in the sequence {\displaystyle a_{s'}^{\rm {Ins}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s_{1},s_{2},s_{3},\ldots ,s',\ldots s_{T})}
- one state is deleted from the sequence {\displaystyle a_{s_{2}}^{\rm {Del}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s_{1},s_{3},\ldots s_{T})} and
- a state {\displaystyle s_{1}} is replaced (substituted) by state {\displaystyle s'_{1}}, {\displaystyle a_{s_{1},s'_{1}}^{\rm {Sub}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s'_{1},s_{2},s_{3},\ldots s_{T})}.
Imagine now that a cost {\displaystyle c(a_{i})\in {\mathbf {R} }_{0}^{+}} is associated
to each operator. Given two sequences {\displaystyle S_{1}} and {\displaystyle S_{2}},
the idea is to measure the cost of obtaining {\displaystyle S_{2}} from {\displaystyle S_{1}}
using operators from the algebra. Let {\displaystyle A={a_{1},a_{2},\ldots a_{n}}} be a sequence of operators such that the application of all the operators of this sequence {\displaystyle A} to the first sequence {\displaystyle S_{1}} gives the second sequence {\displaystyle S_{2}}:
{\displaystyle S_{2}=a_{1}\circ a_{2}\circ \ldots \circ a_{n}(S_{1})} where {\displaystyle a_{1}\circ a_{2}} denotes the compound operator.
To this set we associate the cost {\displaystyle c(A)=\sum _{i=1}^{n}c(a_{i})}, that
represents the total cost of the transformation. One should consider at this point that there might exist different such sequences {\displaystyle A} that transform {\displaystyle S_{1}} into {\displaystyle S_{2}}; a reasonable choice is to select the cheapest of such sequences. We thus
call distance
{\displaystyle d(S_{1},S_{2})=\min _{A}\left\{c(A)~{\rm {such~that}}~S_{2}=A(S_{1})\right\}}
that is, the cost of the least expensive set of transformations that turn {\displaystyle S_{1}} into {\displaystyle S_{2}}. Notice that {\displaystyle d(S_{1},S_{2})} is by definition nonnegative since it is the sum of positive costs, and trivially {\displaystyle d(S_{1},S_{2})=0} if and only if {\displaystyle S_{1}=S_{2}}, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal {\displaystyle c(a^{\rm {Ins}})=c(a^{\rm {Del}})}; the term indel cost usually refers to the common cost of insertion and deletion.
Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.
Criticism
[edit ]Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2] ), the main problem in the application of optimal matching is to appropriately define the costs {\displaystyle c(a_{i})}.
Software
[edit ]- TDA is a powerful program, offering access to some of the latest developments in transition data analysis.
- STATA has implemented a package to run optimal matching analysis.
- TraMineR is an open source R-package for analyzing and visualizing states and events sequences, including optimal matching analysis.
References and notes
[edit ]- ^ A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research], Vol. 29, 3-33. doi:10.1177/0049124100029001001
- ^ L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Archived 2006年10月24日 at the Wayback Machine Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003