GenomeScope: fast reference-free genome profiling from short reads

doi:10.1093/bioinformatics/btx153

Bioinformatics. 2017 Jul 15;33(14):2202-2204.

Gregory W Vurture ¹, Fritz J Sedlazeck ², Maria Nattestad ¹, Charles J Underwood ¹, Han Fang ¹,³, James Gurtowski ¹, Michael C Schatz ¹,²

Gregory W Vurture et al. Bioinformatics. 2017.

Affiliations

¹ Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
² Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA.
³ Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA.

PMID: 28369201
PMCID: PMC5870704
DOI: 10.1093/bioinformatics/btx153

Abstract

Summary: GenomeScope is an open-source web tool that rapidly estimates the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content, from unprocessed short reads. These features are essential for studying genome evolution and help in choosing parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets spanning a wide range of genome sizes, heterozygosity levels and error rates.

Availability and implementation: http://genomescope.org, https://github.com/schatzlab/genomescope.git.

Contact: mschatz@jhu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.
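
As a rough illustration of why a k-mer profile carries genome size information, the sketch below estimates genome size as the total number of k-mer occurrences divided by the coverage at the main peak. This is a common back-of-the-envelope heuristic and not the GenomeScope model, which fits a statistical model to the whole profile. The function name, the `min_coverage` cutoff and the toy histogram are all assumptions made for this example.

```python
def estimate_genome_size(histogram, min_coverage=5):
    """Crude genome size estimate from a k-mer histogram.

    histogram maps k-mer multiplicity -> number of distinct k-mers
    observed at that multiplicity. min_coverage is a hypothetical
    cutoff used to discard low-multiplicity k-mers, which are
    typically dominated by sequencing errors.
    """
    solid = {c: n for c, n in histogram.items() if c >= min_coverage}
    if not solid:
        raise ValueError("no k-mers at or above min_coverage")
    # Main peak: the multiplicity shared by the most distinct k-mers.
    peak = max(solid, key=solid.get)
    # Total k-mer occurrences among the retained k-mers.
    total_kmers = sum(c * n for c, n in solid.items())
    return peak, total_kmers / peak


# Toy histogram (invented numbers): an error tail at low multiplicity,
# a main peak near 20x, and a few repeated k-mers at 40x.
toy = {1: 50000, 2: 4000, 18: 800, 19: 950, 20: 1000, 21: 950, 22: 800, 40: 50}
peak, size = estimate_genome_size(toy)
```

In a highly heterozygous diploid sample the tallest peak may be the heterozygous one at half the homozygous coverage, in which case this heuristic would overestimate genome size; handling such cases is the kind of problem a full model fit is meant to address.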

Figures

Fig. 1

(A) GenomeScope heterozygosity, total genome size, and unique genome size estimates: (left) twenty-seven simulated A. thaliana datasets with varying amounts of heterozygosity, sequencing error or read duplication; (middle) ten synthetic mixtures of real E. coli sequencing data; and (right) six genuine plant and animal sequencing datasets: L. calcarifer (Asian seabass), D. melanogaster (fruit fly), M. undulatus (budgerigar), A. thaliana Col-Cvi F1 (thale cress), P. bretschneideri (pear), C. gigas (Pacific oyster). Also displayed are the true simulated values (Simulated), the results from a mapping and variant calling pipeline (Mapping), and a whole genome alignment (DnaDiff) where available. (B) GenomeScope k-mer profile plot of the A. thaliana dataset showing the fit of the GenomeScope model (black) to the observed k-mer frequencies (blue). The unusual peak of very high frequency k-mers (∼10× coverage) was determined to be highly enriched for organelle sequences.

Grants and funding

R01 HG006677/HG/NHGRI NIH HHS/United States
