MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

doi:10.1093/nar/gkf436

Comparative Study

Nucleic Acids Res. 2002 Jul 15;30(14):3059-66.

Authors

Kazutaka Katoh ¹, Kazuharu Misawa, Kei-ichi Kuma, Takashi Miyata

Affiliation

¹ Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan.

PMID: 12136088
PMCID: PMC135756
DOI: 10.1093/nar/gkf436

Abstract

A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.
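
The FFT step in (i) can be illustrated with a small sketch. It assumes a standard cross-correlation computed through the FFT, where a peak at lag k suggests that a region of one sequence matches a region of the other shifted by k sites. The property table `TOY_VOLUME`, the helper names and the use of a single property (the abstract mentions both volume and polarity) are illustrative assumptions, not values or code taken from MAFFT.

```python
import cmath

# Placeholder volume-like values for four residues (illustrative only,
# not the property values used by MAFFT).
TOY_VOLUME = {"G": 3.0, "A": 31.0, "L": 111.0, "W": 170.0}

def encode(seq, table=TOY_VOLUME):
    """Map residues to property values, centred on the table mean."""
    mean = sum(table.values()) / len(table)
    return [table[r] - mean for r in seq]

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalised)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def correlation(a, b):
    """Return {k: sum_n a[n] * b[n + k]} for every lag k with any overlap."""
    size = 1
    while size < len(a) + len(b):  # zero-pad so lags do not wrap onto each other
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [x.conjugate() * y for x, y in zip(fa, fb)]
    c = [v.real / size for v in fft(prod, inverse=True)]
    # Negative lags sit at the end of the circular result.
    return {k: c[k % size] for k in range(-(len(a) - 1), len(b))}

def best_lag(seq1, seq2):
    """Lag with the highest correlation between the two encoded sequences."""
    c = correlation(encode(seq1), encode(seq2))
    return max(c, key=c.get)

print(best_lag("WGAWL", "AAGWGAWLGG"))  # prints 3
```

In this toy input the first sequence occurs inside the second starting at position 3, which is the lag the correlation peak recovers. The FFT route costs O(N log N) rather than the O(N²) of evaluating every lag directly, which is the kind of saving the abstract attributes to the method.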

Figures

Figure 1

(A) A result of the FFT analysis. There are two peaks corresponding to two homologous blocks. (B) Sliding window analysis is carried out and the positions of homologous blocks are determined. Note that the actual window size is 30 (see text); it is set to 4 in (B) for simplicity.
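
The sliding window step in (B) can be sketched as follows, assuming a lag has already been chosen from a correlation peak. The function name `homologous_blocks`, the identity-fraction score and the `threshold` value are hypothetical stand-ins chosen for illustration; the scoring used in the paper is described in its main text, not in this caption.

```python
def homologous_blocks(seq1, seq2, lag, window=4, threshold=0.75):
    """Find segments where seq1[i] paired with seq2[i + lag] looks homologous.

    A window passes when its fraction of identical residue pairs reaches
    `threshold`; overlapping or touching windows are merged into one block.
    Returns a list of (start_in_seq1, start_in_seq2, length) tuples.
    """
    lo = max(0, -lag)                       # first seq1 index with a partner
    hi = min(len(seq1), len(seq2) - lag)    # one past the last such index
    blocks = []
    for start in range(lo, hi - window + 1):
        matches = sum(
            seq1[i] == seq2[i + lag] for i in range(start, start + window)
        )
        if matches / window < threshold:
            continue
        end = start + window
        if blocks and start <= blocks[-1][1]:
            blocks[-1][1] = end             # extend the previous block
        else:
            blocks.append([start, end])
    return [(s, s + lag, e - s) for s, e in blocks]

print(homologous_blocks("WGAWL", "AAGWGAWLGG", lag=3))  # prints [(0, 3, 5)]
```

The default `window=4` mirrors the simplified panel (B); the caption states that the real analysis uses a window of 30.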

Figure 2

(A) An example of the segment-level DP; (B) Reducing the area for DP on a homology matrix.

Figure 3

The plot of CPU time versus the average length of input sequences for three methods described in the text, FFT-NS-2, FFT-NS-i and NW-NS-2, and two existing methods, CLUSTALW and T-COFFEE. The average percent identities among input sequences are ∼35–85% (A) and ∼15–65% (B). The number of sequences is 40. The regression coefficient calculated from the power regression analysis is shown for each method. For all cases, default parameters were used, except for CLUSTALW, for which both the default setting (CLW18d) and the ‘quicktree’ option (CLW18q) were examined. All of the calculations were performed on a Linux operating system (Intel Xeon 1.7 GHz with 1 GB of memory). The gcc version 2.96 compiler was used with the optimization option ‘-O3’.
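
A power regression of this kind can be sketched as a least-squares line fit on log-transformed data, assuming the model T = a·L^b and assuming the reported coefficient corresponds to the exponent b. The function name `power_fit` and the data points are made up for the example; they are synthetic values generated from a known power law, not measurements from the paper.

```python
import math

def power_fit(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the log-log line is the exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic timings that follow T = 0.5 * L**2 exactly.
lengths = [100, 200, 400, 800]
times = [0.5 * L ** 2 for L in lengths]
a, b = power_fit(lengths, times)
print(round(b, 6))  # prints 2.0
```

Under these assumptions, an exponent near 2 would indicate quadratic growth of CPU time with sequence length, and a smaller exponent would indicate gentler scaling.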

Figure 4

The plot of CPU time versus the number of input sequences for two methods described in the text, FFT-NS-2 and FFT-NS-i, and two existing methods, CLUSTALW and T-COFFEE. The average percent identities among input sequences are ∼35–85% (A) and ∼15–65% (B). The average length of input sequences is 300. The regression coefficient calculated from the power regression analysis is shown for each method. For all cases, default parameters were used, except for CLUSTALW, for which both the default setting (CLW18d) and the ‘quicktree’ option (CLW18q) were examined. All of the calculations were performed on a Linux operating system (Intel Xeon 1.7 GHz with 1 GB of memory). The gcc version 2.96 compiler was used with the optimization option ‘-O3’.

Figure 5

The plot of sum-of-pairs score (8) versus the average distance of input sequences for five methods, FFT-NS-1, FFT-NS-2, FFT-NS-i, NW-NS-1 and NW-NS-2. The number of input sequences is 40, and sequence lengths are 200 sites on average. Vertical lines indicate the standard deviations of the scores. For all cases, default parameters were used.
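
One common way to compute a sum-of-pairs score against a reference alignment is sketched below. It assumes the score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment; the exact definition used for this figure is the one given in the cited reference (8), which may differ in detail. The function names are hypothetical.

```python
from itertools import combinations

def residue_pairs(alignment):
    """Collect all aligned residue pairs from a list of gapped rows.

    Each residue is identified as (sequence index, ungapped position), so
    the result does not depend on where gap columns fall.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = residue_pairs(reference)
    return len(ref & residue_pairs(test)) / len(ref)

reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
print(sum_of_pairs_score(test, reference))
```

Here the reference contains three aligned pairs and the test alignment reproduces two of them, so the score is 2/3; a test alignment identical to the reference scores 1.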

References

1. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
2. Sankoff,D. and Cedergren,R.J. (1983) Simultaneous comparison of three or more sequences related by a tree. In Sankoff,D. and Kruskal,J.B. (eds), Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, London, UK, pp. 253–264.
3. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
4. Barton,G.J. and Sternberg,M.J. (1987) A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327–337.
5. Berger,M.P. and Munson,P.J. (1991) A novel randomized iterative strategy for aligning multiple protein sequences. Comput. Appl. Biosci., 7, 479–484.
