doi: 10.7717/peerj.332. eCollection 2014.

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

Jason W Sahl et al. PeerJ. 2014.

Abstract

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the rapid, large-scale, full-genome comparative analyses carried out by LS-BSR.

Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 min using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in 27-57 h, depending upon the alignment method, using 16 processors.

Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
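
The matrix described in the abstract can be illustrated with a small sketch. It assumes the usual definition of the BLAST score ratio: the bit score of a CDS aligned against a genome, divided by the bit score of that CDS aligned against itself. The function names, the input dictionaries, and the 0.8/0.4 presence and absence cutoffs are illustrative assumptions; this is not the LS-BSR source code and does not run BLAST.

```python
from typing import Dict, List


def blast_score_ratio(query_bits: float, reference_bits: float) -> float:
    """Ratio of a genome's best bit score to the CDS self-alignment bit score."""
    if reference_bits <= 0:
        raise ValueError("reference (self-alignment) bit score must be positive")
    return query_bits / reference_bits


def build_bsr_matrix(
    self_bits: Dict[str, float],
    genome_bits: Dict[str, Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Return {cds: {genome: BSR}}; a CDS with no hit in a genome scores 0.0."""
    return {
        cds: {
            genome: blast_score_ratio(hits.get(cds, 0.0), ref)
            for genome, hits in genome_bits.items()
        }
        for cds, ref in self_bits.items()
    }


def group_specific_markers(
    matrix: Dict[str, Dict[str, float]],
    in_group: List[str],
    out_group: List[str],
    present: float = 0.8,  # assumed cutoff for "conserved"
    absent: float = 0.4,   # assumed cutoff for "missing"
) -> List[str]:
    """CDSs conserved in every in-group genome and missing from every out-group genome."""
    return [
        cds
        for cds, row in matrix.items()
        if all(row[g] >= present for g in in_group)
        and all(row[g] < absent for g in out_group)
    ]


# Hypothetical bit scores for two CDSs across three genomes.
self_bits = {"cdsA": 200.0, "cdsB": 100.0}
genome_bits = {
    "g1": {"cdsA": 200.0, "cdsB": 90.0},
    "g2": {"cdsA": 190.0, "cdsB": 10.0},
    "g3": {"cdsA": 180.0},
}
matrix = build_bsr_matrix(self_bits, genome_bits)
markers = group_specific_markers(matrix, in_group=["g1"], out_group=["g2", "g3"])
```

With these made-up scores, cdsA is conserved in all three genomes, while cdsB is retained only in g1 and is therefore the kind of CDS a user-defined group comparison would report as a marker.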

Keywords: Bioinformatics; Comparative genomics; Genomics; Microbiology; Pathogens.

Figures

Figure 1. Time performance of the LS-BSR pipeline.
(A) 1000 Escherichia coli and Shigella genomes were randomly sub-sampled and analyzed using default LS-BSR parameters and 16 processors. Wall time was plotted against the number of genomes analyzed. The results demonstrate that the LS-BSR pipeline scales well with increasing numbers of genomes. (B) The same set of 100 E. coli genomes was processed with different numbers of processors and the wall time was plotted. The results demonstrate that using additional processors decreases the overall run time of LS-BSR.
Figure 2. The distribution of virulence factors and phylogenomic markers associated with a core single nucleotide polymorphism (SNP) phylogeny.
The core SNP phylogeny was inferred from a whole genome alignment produced by Mugsy (Angiuoli & Salzberg, 2011). Known virulence genes (Table S2) were screened against 96 Escherichia coli and Shigella genomes using BLASTN within LS-BSR. Clade specific markers were identified at defined nodes in the phylogeny (A through Q). Gene annotations for these markers are detailed in Table S2.
Figure 3. Comparison of LS-BSR cluster with core genome SNP phylogeny.
A comparison of 96 Escherichia coli/Shigella genomes between (A) a core single nucleotide polymorphism (SNP) phylogeny and (B) a cluster generated with the Multiple Experiment Viewer (Saeed et al., 2006) from BLAST Score Ratio (BSR) values that include the entire pan-genome. Colors assigned to each classical E. coli phylogroup were applied to the SNP phylogeny and transferred to the BSR cladogram. Shigella genomes are marked with a red circle.
Figure 4. Pan-genome plots generated from LS-BSR output.
Analyses were conducted on a set of 100 Escherichia coli genomes. The distribution of coding region sequences (CDSs) across the set of genomes surveyed is shown in (A). A supplemental script can be used to better understand the convergence of the core genome (B), the accumulation of CDSs (C), and the number of unique CDSs for each genome analyzed (D); each analysis was conducted with 100 random sub-samplings and means are depicted with red diamonds.
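
The core-genome convergence and CDS accumulation curves in Figure 4 can be approximated with a short sketch. It assumes each genome has already been reduced to the set of CDSs called present (for example by thresholding BSR values), shuffles the genome order a fixed number of times, and averages the running core and pan-genome sizes. The function name and input layout are hypothetical; the supplemental script mentioned in the caption may work differently.

```python
import random
from statistics import mean
from typing import Dict, List, Set, Tuple


def pan_core_curves(
    presence: Dict[str, Set[str]],
    replicates: int = 100,
    seed: int = 0,
) -> Tuple[List[float], List[float]]:
    """Mean core and pan-genome sizes after adding 1..N genomes in random order."""
    rng = random.Random(seed)
    genomes = sorted(presence)
    core_sizes: List[List[int]] = [[] for _ in genomes]
    pan_sizes: List[List[int]] = [[] for _ in genomes]
    for _ in range(replicates):
        order = rng.sample(genomes, len(genomes))  # one random sub-sampling order
        core = set(presence[order[0]])
        pan: Set[str] = set()
        for i, genome in enumerate(order):
            core &= presence[genome]  # CDSs shared by every genome so far
            pan |= presence[genome]   # CDSs seen in any genome so far
            core_sizes[i].append(len(core))
            pan_sizes[i].append(len(pan))
    return [mean(c) for c in core_sizes], [mean(p) for p in pan_sizes]


# Hypothetical presence calls for three genomes.
presence = {"g1": {"a", "b"}, "g2": {"a", "c"}, "g3": {"a"}}
core_curve, pan_curve = pan_core_curves(presence, replicates=100, seed=1)
```

Plotting `core_curve` and `pan_curve` against the number of genomes added gives curves of the kind shown in panels B and C: the core size can only shrink and the pan-genome size can only grow as genomes are added.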

References

    1. Altenhoff AM, Gil M, Gonnet GH, Dessimoz C. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS ONE. 2013;8:e53786. doi: 10.1371/journal.pone.0053786.
    2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
    3. Angiuoli SV, Salzberg SL. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 2011;27(3):334–342. doi: 10.1093/bioinformatics/btq665.
    4. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O. The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008;9:75. doi: 10.1186/1471-2164-9-75.
    5. Benedict MN, Henriksen JR, Metcalf WW, Whitaker RJ, Price ND. ITEP: an integrated toolkit for exploration of microbial pan-genomes. BMC Genomics. 2014;15:8. doi: 10.1186/1471-2164-15-8.