ProbCons
Protein multiple-sequence alignment program

In bioinformatics and proteomics, ProbCons is open-source software for probabilistic consistency-based multiple alignment of amino acid sequences. It is regarded as one of the most accurate protein multiple sequence alignment programs, having repeatedly demonstrated a statistically significant advantage in accuracy over similar tools, including Clustal and MAFFT.[1][2]

Algorithm

The following describes the basic outline of the ProbCons algorithm.[3]

Step 1: Reliability of an alignment edge

For every pair of sequences $x$ and $y$, compute the probability that letters $x_i$ and $y_j$ are paired in $a^{*}$, an alignment that is generated by the model.

$$
\begin{aligned}
P(x_i \sim y_j \mid x, y)\ \overset{\mathrm{def}}{=}&\ \Pr[x_i \sim y_j \text{ in some } a \mid x, y] \\
=&\ \sum_{\text{alignment } a \text{ with } x_i \sim y_j} \Pr[a \mid x, y] \\
=&\ \sum_{\text{alignment } a} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a \mid x, y]
\end{aligned}
$$

(Here $\mathbf{1}\{x_i \sim y_j \in a\}$ is equal to 1 if $x_i$ and $y_j$ are paired in the alignment $a$ and 0 otherwise.)
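The sum over all alignments can be evaluated with forward and backward dynamic programming instead of explicit enumeration. The following is a minimal sketch of that idea, not the ProbCons implementation: it assumes a toy one-state model in which the weight of an alignment is the product of its column weights, and the function name `match_posteriors` and the parameters `match`, `mismatch`, and `gap` are hypothetical values chosen for illustration.

```python
def match_posteriors(x, y, match=2.0, mismatch=0.5, gap=0.3):
    """Return P[i][j] = P(x_i ~ y_j | x, y) under a toy one-state model.

    Assumption: an alignment's weight is the product of its column weights
    (match/mismatch for an aligned pair, gap for a letter facing a gap).
    """
    n, m = len(x), len(y)

    def w(i, j):
        # Weight of a column that pairs x[i] with y[j] (0-based indices).
        return match if x[i] == y[j] else mismatch

    # fwd[i][j]: total weight of all alignments of x[:i] with y[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * w(i - 1, j - 1)
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * gap
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * gap

    # bwd[i][j]: total weight of all alignments of x[i:] with y[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] += bwd[i + 1][j + 1] * w(i, j)
            if i < n:
                bwd[i][j] += bwd[i + 1][j] * gap
            if j < m:
                bwd[i][j] += bwd[i][j + 1] * gap

    z = fwd[n][m]  # normaliser: total weight over all alignments
    # Weight of all alignments containing the pair (i, j), divided by z.
    return [[fwd[i][j] * w(i, j) * bwd[i + 1][j + 1] / z for j in range(m)]
            for i in range(n)]
```

Calling `match_posteriors("HEAGAWGHEE", "PAWHEAE")` returns a 10-by-7 matrix of pairing probabilities. Because each letter is paired with at most one letter of the other sequence in any single alignment, every row and column of the matrix sums to at most 1.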

Step 2: Maximum expected accuracy

The accuracy of an alignment $a^{*}$ with respect to another alignment $a$ is defined as the number of common aligned pairs divided by the length of the shorter sequence.

Calculate the expected accuracy of a candidate alignment $a^{*}$ for each pair of sequences:

$$
\begin{aligned}
E_{\Pr[a \mid x, y]}(\operatorname{acc}(a^{*}, a)) &= \sum_{a} \Pr[a \mid x, y] \operatorname{acc}(a^{*}, a) \\
&= \frac{1}{\min(|x|, |y|)} \cdot \sum_{x_i \sim y_j \in a^{*}} \sum_{a} \mathbf{1}\{x_i \sim y_j \in a\} \Pr[a \mid x, y] \\
&= \frac{1}{\min(|x|, |y|)} \cdot \sum_{x_i \sim y_j \in a^{*}} P(x_i \sim y_j \mid x, y)
\end{aligned}
$$

This yields a maximum expected accuracy (MEA) alignment:

$$
E(x, y) = \arg\max_{a^{*}} \; E_{\Pr[a \mid x, y]}(\operatorname{acc}(a^{*}, a))
$$
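Since the expected accuracy is a sum of pairing probabilities over the pairs of the candidate alignment, the maximisation can be carried out with a Needleman-Wunsch-style recursion that has no gap penalties. The following is a minimal sketch under that reading; the function name `mea_alignment` and the list-of-lists input format are assumptions made for illustration.

```python
def mea_alignment(post):
    """Return (expected_accuracy, pairs) for a posterior matrix post[i][j].

    post[i][j] is assumed to hold P(x_i ~ y_j | x, y) with 0-based indices.
    pairs lists the aligned index pairs (i, j) in increasing order.
    """
    n, m = len(post), len(post[0])
    # best[i][j]: largest sum of posteriors over alignments of x[:i], y[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((best[i - 1][j - 1] + post[i - 1][j - 1], "pair"))
            if i > 0:
                options.append((best[i - 1][j], "skip_x"))
            if j > 0:
                options.append((best[i][j - 1], "skip_y"))
            best[i][j], move[i][j] = max(options, key=lambda o: o[0])

    # Trace back through the stored moves to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_x":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m] / min(n, m), pairs
```

For the posterior matrix `[[0.9, 0.1], [0.1, 0.8]]` this sketch pairs position 0 with 0 and position 1 with 1, giving an expected accuracy of (0.9 + 0.8) / 2 = 0.85.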

Step 3: Probabilistic Consistency Transformation

For all pairs of sequences $x, y$ from the set of all sequences $\mathcal{S}$, the match probabilities are now re-estimated using all intermediate sequences $z$:

$$
P'(x_i \sim y_j \mid x, y) = \frac{1}{|\mathcal{S}|} \sum_{z \in \mathcal{S}} \sum_{1 \leq k \leq |z|} P(x_i \sim z_k \mid x, z) \cdot P(z_k \sim y_j \mid z, y)
$$

This step can be iterated.
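Written with matrices, the inner sum over positions of $z$ is a matrix product, so the transformation averages the products $P_{xz} P_{zy}$ over all $z$. The following is a minimal sketch in pure Python. It assumes that the sum includes $z = x$ and $z = y$ and that a sequence paired with itself is represented by the identity matrix; the function names and the dictionary-based input format are hypothetical.

```python
def identity(n):
    """n-by-n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(post, lengths):
    """Apply one round of the consistency transformation.

    post[(x, y)] is the |x|-by-|y| posterior matrix for every ordered pair
    of distinct sequence names; lengths[name] is the sequence length.
    Assumption: a sequence compared with itself uses the identity matrix.
    """
    names = list(lengths)

    def get(a, b):
        return identity(lengths[a]) if a == b else post[(a, b)]

    new_post = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            total = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in names:  # every sequence acts as an intermediate
                prod = matmul(get(x, z), get(z, y))
                for i in range(lengths[x]):
                    for j in range(lengths[y]):
                        total[i][j] += prod[i][j]
            new_post[(x, y)] = [[v / len(names) for v in row] for row in total]
    return new_post
```

Because the function returns a dictionary in the same format as its input, iterating the step amounts to feeding the result back in, for example `post = consistency_transform(post, lengths)` repeated a fixed number of times.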

Step 4: Computation of guide tree

Construct a guide tree by hierarchical clustering, using the MEA score as the sequence similarity score. Cluster similarity is defined as a weighted average over pairwise sequence similarities.
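The clustering can be sketched as a greedy agglomerative procedure that repeatedly merges the two most similar clusters. The sketch below assumes that the weighted average is weighted by cluster size (as in UPGMA); the exact weighting used by ProbCons may differ, and the function name `guide_tree` and the `frozenset`-keyed similarity table are hypothetical.

```python
from itertools import combinations


def guide_tree(names, sim):
    """Build a guide tree as nested tuples from pairwise similarities.

    sim[frozenset({a, b})] is the similarity of sequences a and b, for
    example their MEA score. Assumption: cluster similarity is the
    size-weighted average of the similarities of the merged clusters.
    """
    size = {name: 1 for name in names}  # cluster -> number of sequences
    score = dict(sim)                   # frozenset({c1, c2}) -> similarity
    while len(size) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(size, 2), key=lambda p: score[frozenset(p)])
        merged = (a, b)
        for c in list(size):
            if c == a or c == b:
                continue
            score[frozenset((merged, c))] = (
                size[a] * score[frozenset((a, c))]
                + size[b] * score[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a]
        del size[b]
    return next(iter(size))
```

With three sequences `A`, `B`, and `C` whose similarities are 0.9 for `A`/`B`, 0.2 for `A`/`C`, and 0.4 for `B`/`C`, the sketch first merges `A` and `B` and then joins that cluster with `C`.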

Step 5: Compute MSA

Finally, compute the multiple sequence alignment (MSA) using progressive alignment or iterative alignment.

References

  1. ^ Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005). "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment". Genome Research. 15 (2): 330–340. doi:10.1101/gr.2821705. PMC 546535. PMID 15687296.
  2. ^ Roshan, Usman (2014-01-01). "Multiple Sequence Alignment Using Probcons and Probalign". In Russell, David J (ed.). Multiple Sequence Alignment Methods. Methods in Molecular Biology. Vol. 1079. Humana Press. pp. 147–153. doi:10.1007/978-1-62703-646-7_9. ISBN 9781627036450. PMID 24170400.
  3. ^ Lecture "Bioinformatics II" at University of Freiburg