Departments of 1 Computer Science and Engineering and 5 Biochemistry and Molecular Biology and Center for Gene Regulation, The Pennsylvania State University, University Park, Pennsylvania USA 16802; 2 Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, California USA 94720; 3 Axys Pharmaceuticals, La Jolla, California USA 92037; 4 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas USA 77030
PipMaker (http://bio.cse.psu.edu) is a World-Wide Web site for comparing two long DNA sequences to identify conserved segmentsand for producing informative, high-resolution displays of theresulting alignments. One display is a percent identity plot (pip),which shows both the position in one sequence and the degree ofsimilarity for each aligning segment between the two sequencesin a compact and easily understandable form. Positions along thehorizontal axis can be labeled with features such as exons ofgenes and repetitive elements, and colors can be used to clarifyand enhance the display. The web site also provides a plot ofthe locations of those segments in both species (similar to adot plot). PipMaker is appropriate for comparing genomic sequencesfrom any two related species, although the types of informationthat can be inferred (e.g., protein-coding regions and cis-regulatoryelements) depend on the level of conservation and the time anddivergence rate since the separation of the species. Gene regulatoryelements are often detectable as similar, noncoding sequencesin species that diverged as much as 100-300 million years ago,such as humans and mice, Caenorhabditis elegans and C. briggsae,or Escherichia coli and Salmonella spp. PipMaker supports analysisof unfinished or "working draft" sequences by permitting one ofthe two sequences to be in unoriented and unorderedcontigs.
The availability of complete sequences of many microbial genomes and the proposed sequencing of the human (Collins et al.1998; Marshall 1999) and mouse (Battey et al. 1999) genomes arefueling a major revolution in the biological sciences (Lander1996). Comparative analysis based on alignments of these sequencesis one powerful tool for interpreting such genomic information.Such alignments can help achieve several goals of postsequencingfunctional analysis. These include determining all of the protein-codingsegments in both species, locating regulatory signals, understandingthe mechanisms and history of genome evolution, and deducing thesimilarities and differences in gene organization between thespecies ofinterest.
A variety of methods are available for identifying protein-coding segments in genomic sequences, using approaches such asfinding matches to ESTs and analyzing the inherent propertiesof the DNA sequence (Claverie 1997). The success rate of thesemethods when tested against reference sequences can be high (Burgeand Karlin 1997; Bailey et al. 1998). However, some genes aredifficult to find by these means. Novel genes that are expressedat very low levels, or are transcribed in only a few tissues and/orat a restricted time in development, may show no matches to sequencedatabases (as well as resisting experimental approaches).Despitetheir clear utility in identifying many genes, EST databases generatea relatively high rate of spurious matches (Bailey et al. 1998).In addition, database searches frequently miss some of a gene'sexons. Current gene-finding programs based on inherent sequenceproperties have different deficiencies. They frequently work poorlyfor genes that are alternatively spliced, overlap, or lie withinanother gene. This limitation may be critical because accordingto one recent estimate (Gelfand et al. 1999), at least 30% ofall human genes are spliced alternatively. Moreover, these methodsare adversely affected by sequencing errors and by interruptionsin the sequence, which is a serious deficiency for analysis of"working draft" sequences. Utilizing interspecies alignments canimprove the accuracy of exon assignments (Jang et al. 1999; Lianget al. 1999), and Bouck et al. (1998) show that this techniqueis effective when applied to working draftsequences.
Finding candidates for gene regulatory elements is even more difficult than identifying exons, because of the small size and(sometimes) low sequence specificity of protein-binding siteson DNA. However, as exemplified by work on both the HBB, BTK,and IL-4/IL-13 loci, highly conserved, noncoding regions can bereliable guides to cis-regulatory elements (Gumucio et al. 1996;Hardison et al. 1997b; Oeltjen et al. 1997; Cretu et al. 2000).
To provide tools for efficient identification of coding and regulatory elements in genomic DNA by comparative analysis, webuilt an automated server on the World-Wide Web. To accomplishthis, several obstacles had to be overcome. (1) The alignmentprogram must be able to analyze long sequence files, containingas many as millions of nucleotides. Bacterial genomes can be upto 6 million nucleotides, as can the regions of conserved syntenicloci in human and mouse chromosomes. Alignment programs that utilizememory space proportional to the sequence lengths allow such longfiles to be analyzed (Chao et al. 1994). (2) The alignment programmust be very fast, and the series of Blast programs achieve thisgoal (Altschul et al. 1990, 1997). (3) The enormous volume ofoutput containing all local alignments between two long sequencesmust be presented to the user in a compact, understandable form.We have introduced percent identity plots (pips) for this purpose(Hardison et al. 1997a).
In this paper we describe an automated server for generating alignments and pips. A pip shows the position in one sequenceof each aligning gap-free segment and plots its percent identity.As a complementary display, we also provide a plot of the positionof each aligning segment in both species. We refer to these asdot plots, even though matches shown in conventional dot plotsneed not be contained within a statistically significant alignmentand those in our plots are. Both displays allow rich annotationto be plotted along the appropriate axes to aid in correlatingaligning segments with functional or structural features of thesequence. We provide examples of the application of PipMaker forfinding exons and candidate regulatory elements in mammalian,nematode, and bacterial sequences. The server is able to comparea completed sequence from one species with an incomplete sequencefrom asecond.
Accessing and Using PipMaker
PipMaker is accessed by pointing a web browser to http://bio.cse.psu.edu, which is the menu page providing links to instructions,examples, a basic server, and an advanced server. At either PipMakerserver page, the user submits two sequence files, using the Browsefunction to select a file from the user's machine or by cut andpaste into the available windows. An additional file containingthe coordinates of interspersed repeats in the first sequenceshould also be submitted to avoid uninformative and time-consumingalignments among repeats. This Repeats file is generated by RepeatMasker(Smit and Green 1999) at http://ftp.genome.washington.edu/cgi-bin/RepeatMasker.An optional Exons file contains the positions of known or predictedexons, plus the name and transcriptional orientation of each gene.The Advanced PipMaker page allows colors to be added to the plotby supplying an Underlay file. Various options for generatingthe alignments (see below) are alsoavailable.
PipMaker returns the alignments generated by BlastZ in any or all of four different formats: a pip, a dot plot, a conventionaltextual alignment, and a compact listing of the coordinates ofthe aligning segments. One can choose among these outputs fromthe Advanced PipMaker page. For the pip, the program plots theposition (in the first sequence) and percent identity of eachgap-free segment of the alignments (Figs. 1C and 2). The top horizontalaxis is automatically decorated with the positions of repeats(from the Repeats file), and exons (from the Exons file). Thepositions of CpG islands are also computed and displayed alongthe top axis. The coordinates (lower horizontal axis) are thenucleotide positions in the first sequence. The image of the pipis rendered as a pdf file by default, although a PostScript filecan be requested. The results are returned to the user via e-mail.The image files can be viewed and printed with GhostScript (pdfor PostScript files) or other widely distributed programs suchas Adobe Acrobat Reader for pdf files.
Although pips are compact and highly informative about sequence features and aligning segments in the first sequence, theydo not show the positions of the alignments in the second sequence.Thus, an interesting alignment displayed in a pip may involvea sequence that is in a different position or has been invertedin the second sequence. Such information is best conveyed in atraditional dot-plot display, where the positions of alignmentsin both sequences are shown as diagonal lines (Figs. 1D and 3).
An Example from Mammals
A 100-kb sequence from human chromosome 5q31 (Frazer et al. 1997) was extracted from GenBank entries AC004500 and AC004775.It was analyzed by identifying repeats using the RepeatMaskerprogram (Smit and Green 1999), submitting the masked sequenceto GenScan (Burge and Karlin 1997), and performing Blast searches(Altschul et al. 1997) with the masked sequence or portions thereofagainst the nucleotide, EST, and protein sequence databases atNCBI. Information obtained from these sources was manually mergedin an attempt to identifygenes.
GenScan predicted six genes, which fell into three categories, depending on the nature of the database matches. Three of theGenScan predictions showed near identity with mRNA sequences inGenBank, two of which were for characterized genes (Ubiquitone-BPand GDF-9); the other was a full-insert mRNA sequence (GenBankaccession no. AF143867). Conceptual translations of two otherGenScan-predicted genes showed strong but not identical proteinmatches: one to a human protein named MLLT2 (GenBank accessionno. NP_005926), and the other to the apical protein (APX) of Xenopuslaevis (GenBank accession no. Q01613). GenScan also predicteda single-exon gene, portions of which match ESTs. A 4-kb regionaround positions 24-28K had numerous matches to the EST databasebut did not contain any GenScan-predicted exons or hits to otherdatabases.
A pip showing the alignments between the 100-kb sequence from human chromosome 5q31 and its homolog in mouse is given in Figure2. Features in the human sequence derived from analysis by RepeatMasker, GenScan predictions, and matches to ESTs are plotted alongthe top horizontal axis. The portions of the pip correspondingto exons with exact matches to GenBank mRNAs are in blue. Orangecorresponds to exons of GenScan-predicted genes (MLLT2 and APXhomologs), whose products have significant but not exact matchesto other proteins. Purple indicates the single-exon gene predictedby GenScan. One can easily discern clustered repeats, presumablyformed by recursive integration into closely linked sites, andmore dispersed repetitive elements. Although the significanceof these nonrandom distributions of repeats is currently not known,the inclusion of information about the positions and identityof repeats in the pips may aid in the discovery of informativecorrelations. The display also shows prominent CpG islands centeredat ~37K, 74K, 78K, and90K.
PipMaker provides useful sequence comparisons even when one of the sequences is unfinished. For example, the sequence in mousethat is homologous to this portion of human 5q31 has not beencompleted, but substantial portions are present as 14 separatecontigs. PipMaker compares the completed human sequence againstall 14 mouse sequences. The dot plot (Fig. 3), which is a complementarydisplay of exactly the same alignments shown in the pip in Figure2, shows the discontinuities in the mouse sequence more clearly.It also illustrates the fact that a given contig can be in eitherorientation relative to the completed sequence; this is an optionon the Advanced PipMaker page. In the pip, these numerous matchesare assembled in the order consistent with the completed humansequence. Inspection of the dot plot allows us to identify portionsof the human sequence that fall between regions that align withseparate mouse contigs. This absence of orthologous sequencesin the available mouse data is indicated by the dark gray shadingin Figures 2 and 3. The positions of the gray blocks were obtainedfrom the user-supplied Underlayfile.
Pips provide information complementary to GenScan and database hits for finding exons. Protein-coding exons show an averagepercent identity of ~85% for many comparisons between human andmouse genes (Makalowski and Boguski 1998). This is illustratedin Figure 2, as the high percent identity lines in the pip foralmost all of the exons (see, e.g., the MLLT2 homolog). Thus,when a high percent identity is found in the same region as anexon predicted by GenScan, one has even more confidence in thatassignment. Putative exons 5-8 of the APX homolog appear to becorrect, whereas the remaining predicted exons are suspect. Thehomolog to the GenScan-predicted exon ~90K has not been sequencedcompletely in mouse; thus, the matches do not extend through theentire predicted exon. In this case, the exon predicted by GenScan(purple portion of pip, Fig. 2) corresponds to EST hits (greenportions) and high percent identity alignments to two of the mousecontigs. This convergence of assignments is easy to see in thepip. It indicates that one should be confident of this exon assignment,even without BlastN or BlastP hits to known genes or theirproducts.
Another potential use for pips in exon assignments is illustrated by the region from 24K to 28.5K (Fig. 2). This segment ofthe human DNA shows many database matches to ESTs, including membersof the three different UniGene clusters listed. This appears tobe a complex region, and the fact that discrete segments are highlyconserved in mouse can be incorporated into the analysis. Forinstance, one may want to focus attention on the EST matches thatcoincide with the higher scoring alignments withmouse.
Pips are useful not only for finding exons but also for distinguishing protein-coding regions from untranslated regions ofthe exons. For example, the 3'-untranslated region of GDF-9, indicatedby a light gray box, shows a considerable decrease in percentidentity (see the region from ~41.3Kto 42.7K in Fig. 2). Thisis characteristic of many untranslated regions. In addition, protein-codingportions of exons tend not to be broken by gaps (see the MLLT2homolog portion of the pip), whereas the untranslated portionsof exons have more gaps in the alignment, which produce a seriesof shorter horizontallines.
Pips also reveal candidates for regulatory elements in regions that do not encode mRNA. For instance, the aligning segmentshighlighted in red around positions 2.8K, 63.7K, 64.8K, and 82.7Kin Figure 2 stand out as having high percent identities but donot coincide with any known or predicted exons. Analysis of noncodingregions with high percent identity has determined that frequentlythey are also conserved in other mammals and unique in the humangenome, which are two common features of long-range regulatoryelements (Cretu et al. 2000). Functional characteristics of conservednoncoding sequences in the BTK, HBB, and IL-4/IL-3 loci (Hardisonet al. 1997b; Oeltjen et al. 1997, Cretu et al. 2000) have demonstratedthat sequence conservation is a good predictor of regulatoryelements.
As illustrated in the dot plot (Fig. 3), the exons of a gene can be on different contigs. For instance, exon 1 of the mousehomolog to GDF-9 is on contig 38 and exon 2 is on contig 51. TheAdvanced PipMaker page includes a utility for joining the exonstogether into a putative coding sequence for each gene. In thisprocess, the joining of the second incomplete sequence is guidedby the matches to the first complete sequence. The putative codingsequences, returned by e-mail, can then be used in more thoroughstudies, such as an analysis of synonymous and nonsynonymoussubstitutions.
An Example from Nematodes
PipMaker need not be limited to comparisons between human and mouse sequences. As an example of a comparison between two relatednematode species, Caenorhabditis elegans and Caenorhabditis briggsae,Figure 4 shows the pip for the region containing the bli-4 locus,using exon annotations from the GenBank file. The bli-4 gene productproteolytically activates a number of physiologically importantpolypeptides. Many, but not all, of the putative exons show highpercent identity matches. In addition, three other strongly conservedsegments are seen between exons 12 and 14. The experimental analysisby Thacker et al. (1999) shows that these three segments are alternativelyspliced exons. In our experience, it is not uncommon to identifypreviously unrecognized exons by inspecting a pip.
An Example from Eubacteria: Escherichia coli vs. Salmonella typhimurium
The sequence of Escherichia coli K12 was completed in 1997 (Blattner et al. 1997). Two bacterial genera closely related toE. coli are Salmonella and Yersinia. The genomes of Yersinia pestisand three species of Salmonella have been "sample sequenced" attwofold coverage (for background, see McClelland and Wilson 1998;Wong et al. 1999). This results in ~90% of the genome being sequenced,but contained in many contigs, currently unlinked. However, byaligning a Salmonella or Yersinia sequence with the completedE. coli sequence, much useful information can be gathered, ina fashion similar to the comparison between human 5q31 and itsmulticontiguous mousehomolog.
A portion of the pip resulting from aligning the complete E. coli sequence with the Salmonella typhimurium sequence samplesis shown in Figure 5. This shows matches in the protein-codingregions and intergenic regions for several genes linked in E.coli. Colors have a special meaning for this display (see legendto Fig. 5), illustrating the flexibility of this feature.
The matches in intergenic regions include those between the oppositely transcribed glyA and hmpA genes. From the alignmentshown in Figure 5C, one can see that the promoters for each geneare conserved and long identical segments are seen between thepromoters. These conserved regions correspond to previously characterizedbinding sites for PurR and MetR (Lorenz and Stauffer 1996), thelatter of which is involved in the induction of hmpA by nitricoxide (Membrillo-Hernandez et al. 1998). This gene encodes a flavohemoglobinwith a nitric oxide dioxygenase activity (Gardner et al. 1998;Hausladen et al. 1998). In this intergenic region, all of theknown functional sites are conserved and the remaining portionsare not, illustrating the power of this approach for finding candidateregulatoryelements.
Advanced Features of PipMaker
Choice of Scoring Matrix
Users can choose different scoring matrices for BlastZ, appropriate for human versus mouse (i.e., where the first sequenceis from human and the second is from mouse), mouse versus human,or other. The default scoring matrix seems to work acceptablyfor other combinations of two species, for instance, human-Fuguor human-human.Chaining
With the default setting of Show all matches, it is possible for one region of the first sequence to align with several regionsof the second sequence because of duplications of a gene or anexon or because of incomplete masking of interspersed repeatsor low-complexity regions. Such duplications cause lines to appearsuperimposed in the pip. The converse situation, where one regionof the second sequence aligns with several regions in the first,also occurs but does not disrupt the pip. Advanced PipMaker providestwo options for eliminating such duplicate matches, each withits own strengths andweaknesses. If the Chaining option is chosen, then PipMaker will identify and plot only matches that appear in the same relative orderin the first and second sequences. This option should be usedonly if the genomic structures of the two sequences are knownto be conserved; otherwise a duplication might not be detected.(An inversion of a segment in one of the two sequences might alsogo undetected unless the user chooses the Search both strandsoption.) For an example of chaining, consider the pip shown in Figure 6A. Exon 7 of the first sequence has a number of matches in thesecond sequence, due to duplications of that exon. The dot-plotview of the entire alignment (Fig. 6B) also shows the duplication(at around position 7000 on the horizontal axis), as well as duplicationsin later exons. Figure 6, C and D, shows the results of specifyingthe Chaining option.Single Coverage
An alternative method for avoiding duplicate matches is provided by the Single coverage option, which selects a highest-scoringset of alignments such that any position in the first sequencecan appear in one alignment, at most. (There is no guarantee thatthe order of matching regions is identical in the two sequences.)The three dot plots in Figure 7 show a case where this optionworks better than chaining. Figure 7A shows all matches in a genecluster in which the first sequence has six copies of the geneand the second sequence has four. With chaining, only four ofthe genes can be matched, as shown in Figure 7B. The Single coverageoption selects one match for each region of the first sequence(Fig. 7C).The power and utility of interspecies comparisons of genomic sequences are now widely accepted, but accurate, easy-to-usetools for making such comparisons with very long DNA sequenceshave not been available. The PipMaker server now meets thatneed.
PipMaker uses the program BlastZ to compute local alignments. Comparing two sequences of ~150 kb each takes ~1 min. AlthoughBlastZ can align sequences of virtually unlimited size, as a practicalmatter we have placed a limit of 2 million base pairs (each) onthe files submitted to our webserver.
The results displayed as pips are useful for finding exons and candidate regulatory elements, as well as following patternsof sequence changes during evolution. Pips have at least threeadvantages compared to conventional dot plots. First, they includeinformation about the percent identity, so that higher-scoringalignments can be readily detected. Second, the occurrence ofgaps within an alignment is more apparent when the percent identitiesof the two gap-free segments differ. Third, the alignment informationis presented in a compact, but high-resolution,form.
Different kinds of information can be gleaned by comparisons of genomic DNA sequences from species that diverged at progressivelymore distant times. For instance, gene regulatory elements, frequentlybeing less well-conserved than protein-coding regions, are oftenidentified most easily using more closely related species. Interestingly,the estimated times of divergence of several informative speciespairs, such as E. coli/Salmonella and human-mouse, fall at ~100million years (Goodman et al. 1987; Ochman and Groisman 1994;Hedges et al. 1996; Ochman and Bergthorsson 1998). Identificationof exons and protein-coding regions can utilize species that separatedconsiderably long ago, such as human-Fugu. However, differentloci within a given species pair can show strikingly differentpatterns of conservation (Koop 1995; Hardison et al. 1997a). Ina few mouse and human loci, such as those encoding the alpha J segmentsof the T-cell receptor, some noncoding regions are more similarthan the protein-coding regions (Koop and Hood 1994). Analysisvia PipMaker makes these contrasting patterns quite obvious. Asthe human and mouse sequences are completed, some general rulesmay emerge with the potential to predict the patterns at otherloci. However, currently one has to approach this analysis ona locus-by-locus basis and investigate the patternsempirically.
The increasing availability of completed or very long genomic sequences from bacteria, fungi, plants, invertebrates, and vertebratescan be used to generate hypotheses about functions from the alignedsequences. This requires appropriate software for the analysis,and PipMaker should be a versatile tool to aid in this endeavor.This leads to the final phases of functional genomics, which areexperimental tests of thesehypotheses.
Description of Pips
Pips are perhaps best described with an example. Consider the human-mouse alignment in Figure 1A. Gaps divide the alignmentinto the five segments shown in Figure 1B. A pip depicts eachone of these segments as a horizontal line, the left-to-rightposition of which is determined by the human coordinates and thevertical placement of which indicates the percent identity, asshown in Figure 1C. This alignment can be seen in Figure 2 aroundexon 1 of the gene labeled MLLT2 homolog. The dot plot of thesame alignment, shown in Figure 1D, represents each of the gap-freesegments as a diagonal line indicating position in both the humanand mouse sequences but not the percentidentity.
Computing Alignments
The alignment engine currently used by PipMaker, called BlastZ, is an experimental variant of the Gapped Blast program (Altschulet al. 1997; Zhang et al. 1998). It is an entirely new implementationthat has been designed for aligning two very long sequences. Themajor algorithmic improvement essentially guarantees that computermemory will never be a constraining resource. (W. Miller, unpubl.)
Scoring Matrices
The default scoring parameters are match = 1, mismatch =-1, gap open = -6, and gap extension = -0.2. These empirically determinedvalues have been used in our alignment programs for 10 years,with reasonable success. We now provide additional scoring matricesfor aligning human and mouse genomic sequences, determined inthe following manner. We estimated neutral nucleotide substitutionpatterns between human and mouse from the 80,000 substitutionsin alignments of 600 kb of human DNA transposon fossils to theirwell-defined consensus sequences (A.F.A. Smit, A. Kas, and P.Green, pers. comm.). For this estimate, we assumed that neutralsubstitution biases were similar in rodent and primate evolutionand that the average substitution level since the rodent-primatesplit is 18% in human DNA and 35% in mouse DNA (Li et al. 1996).Because the mutation biases depend on the GC richness of the genomicregion, log-odd scoring matrices were derived for queries of differentGC content. Furthermore, given the higher substitution level inrodents, the scoring matrix is asymmetrical and different matricesare in use for human versus mouse than for mouse versus humancomparisons. Details of our approach can be found at the PipMakerwebsite.
Removing Duplicate Alignments
We have implemented two methods for removing redundant alignments between DNA sequences that are repeated in either or bothsequences, such as duplicated genes. For selecting a set of alignmentsthat forms a chain, that is, where the start point for each alignmentfollows (in both sequences) the end point of the preceding alignment,we used the method of Zhang et al. (1994). We prefer this algorithmto theoretically faster methods because it easily accommodatessmall overlaps between adjacent alignments and permits a varietyof "gap penalties" to be charged for offsets between two adjacentalignments in an optimalchain.
An alternative option in PipMaker is to find a highest scoring set of alignments subject to the constraint that no two coverthe same position in the first sequence (see the Single coverageoption described above). A straightforward algorithm solves thisproblem in time 0(N log N), where N denotes the number of initialalignments.
This work was supported by the National Library of Medicine (grants RO1LM05110 and RO1LM05773) and National Institutes ofHealth (grantRO1DK27635).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be herebymarked "advertisement" in accordance with 18 USC Section 1734solely to indicate thisfact.
6 Correspondingauthor.
E-MAIL webb@cse.psu.edu; FAX (814) 865-3176.
Received September 29, 1999; accepted in revised form February 1, 2000.