Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
An up-to-date list of errata is here.
If you discover an error that isn't already on the list, please
email:
sean@eddylab.org.
Jacket blurb
The face of biology has been changed by the emergence of modern
molecular genetics. Among the most exciting advances are large-scale
genome sequencing efforts, such as the Human Genome Project, which are
producing an immense amount of data. The need to understand the data
is becoming ever more pressing. Demands for sophisticated analyses of
biological sequences are driving forward the newly created and
explosively expanding research area of computational molecular biology,
or bioinformatics.
Many of the most powerful sequence analysis methods are now based on
principles of probabilistic modeling. Examples of such methods include
the use of probabilistically derived score matrices to determine the
significance of sequence alignments, the use of hidden Markov models
as the basis for profile searches to identify distant members of
sequence families, and the inference of phylogenetic trees using
maximum likelihood approaches.
This book provides the first unified, up to date, and tutorial level
overview of sequence analysis methods, with particular emphasis on
probabilistic modelling. Pairwise alignment, hidden Markov models,
multiple alignment, profile searches, RNA secondary structure
analysis, and phylogenetic inference are treated at length.
Written by an interdisciplinary team of authors, the book is
accessible to molecular biologists, computer scientists, and
mathematicians with no formal knowledge of each others' fields. It
presents the state of the art in this important, new and rapidly
developing discipline.
Table of contents
Preface ix
Introduction 1
Sequence similarity, homology, and alignment 2
Overview of the book 2
Probabilities and probabilistic models 4
Further reading 10
Pairwise alignment 12
Introduction 12
The scoring model 13
Alignment algorithms 17
Dynamic programming with more complex models 28
Heuristic alignment algorithms 32
Linear space alignments 34
Significance of scores 36
Deriving score parameters from alignment data 41
Further reading 45
Markov chains and hidden Markov models 46
Markov chains 48
Hidden Markov models 51
Parameter estimation for HMMs 62
HMM model structure 68
More complex Markov chains 72
Numerical stability of HMM algorithms 77
Further reading 79
Pairwise alignment using HMMs 80
Pair HMMs 81
The full probability of x and y, summing over all paths 87
Suboptimal alignment 89
The posterior probability that xi is aligned to yj 91
Pair HMMs versus FSAs for searching 95
Further reading 98
Profile HMMs for sequence families 100
Ungapped score matrices 102
Adding insert and delete states to obtain profile HMMs 102
Deriving profile HMMs from multiple alignments 105
Searching with profile HMMs 108
Profile HMM variants for non-global alignments 113
More on estimation of probabilities 115
Optimal model construction 122
Weighting training sequences 124
Further reading 132
Multiple sequence alignment methods 134
What a multiple alignment means 135
Scoring a multiple alignment 137
Multidimensional dynamic programming 141
Progressive alignment methods 143
Multiple alignment by profile HMM training 149
Further reading 159
Building phylogenetic trees 160
The tree of life 160
Background on trees 161
Making a tree from pairwise distances 165
Parsimony 173
Assessing the trees: the bootstrap 179
Simultaneous alignment and phylogeny 180
Further reading 188
Appendix: proof of neighbour-joining theorem 190
Probabilistic approaches to phylogeny 192
Introduction 192
Probabilistic models of evolution 193
Calculating the likelihood for ungapped alignments 197
Using the likelihood for inference 205
Towards more realistic evolutionary models 215
Comparison of probabilistic and non-probabilistic methods 224
Further reading 231
Transformational grammars 233
Transformational grammars 234
Regular grammars 237
Context-free grammars 242
Context-sensitive grammars 247
Stochastic grammars 250
Stochastic context-free grammars for sequence modelling 252
Further reading 259
RNA structure analysis 260
RNA 261
RNA secondary structure prediction 267
Covariance models: SCFG-based RNA profiles 277
Further reading 297
Background on probability 299
Probability distributions 299
Entropy 305
Inference 311
Sampling 314
Estimation of probabilities from counts 319
The EM algorithm 323
Bibliography 326
Author index 345
Subject index 350