Stacks: building and genotyping Loci de novo from short-read sequences

G3 (Bethesda). 2011 Aug;1(3):171-82.

doi: 10.1534/g3.111.000240. Epub 2011 Aug 1.

Julian M Catchen , Angel Amores , Paul Hohenlohe , William Cresko , John H Postlethwait

PMID: 22384329
PMCID: PMC3276136
DOI: 10.1534/g3.111.000240

Julian M Catchen et al. G3 (Bethesda). 2011 Aug.

Abstract

Advances in sequencing technology provide special opportunities for genotyping individuals with speed and thrift, but the lack of software to automate the calling of tens of thousands of genotypes over hundreds of individuals has hindered progress. Stacks is a software system that uses short-read sequence data to identify and genotype loci in a set of individuals either de novo or by comparison to a reference genome. From reduced representation Illumina sequence data, such as RAD-tags, Stacks can recover thousands of single nucleotide polymorphism (SNP) markers useful for the genetic analysis of crosses or populations. Stacks can generate markers for ultra-dense genetic linkage maps, facilitate the examination of population phylogeography, and help in reference genome assembly. We report here the algorithms implemented in Stacks and demonstrate their efficacy by constructing loci from simulated RAD-tags taken from the stickleback reference genome and by recapitulating and improving a genetic map of the zebrafish, Danio rerio.

Keywords: Illumina; RAD-seq; RAD-tag; meiotic linkage map; zebrafish.

Figures

Figure 1

Stacks schematic. (A) The ustacks program forms stacks in an individual from short sequencing reads (cleaned by process_radtags.pl) that match exactly. (B) The ustacks program breaks down the sequence of each stack into k-mers and loads them into a dictionary. It then breaks down each stack again into k-mers and queries the k-mer dictionary to create a list of potentially matching stacks, which can be visualized as nodes in a graph connected by the nucleotide distance between them. (C) ustacks merges matched stacks to form putative loci. (D) ustacks matches secondary reads that were not initially placed in a stack against putative loci to increase stack depth. An SNP model in ustacks checks each locus at each nucleotide position for polymorphisms. (E) ustacks calls a consensus sequence and records SNP and haplotype data. (F) The cstacks program loads stacks from the parents of a genetic cross into a Catalog to create a set of all possible loci in a mapping cross. (G) sstacks matches map cross progeny against the Catalog to determine the haplotypes at each locus in every individual in the cross.
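
Steps (A) through (C) of the schematic can be illustrated with a small Python sketch. It is not the ustacks implementation: the function names, the `min_depth`, `k`, and `max_dist` parameters, and the use of union-find for merging are assumptions made for illustration, and reads are assumed to be of equal length.

```python
from collections import Counter, defaultdict
from itertools import combinations


def form_stacks(reads, min_depth=2):
    """(A) Group exactly matching reads; keep sequences seen >= min_depth times."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_depth}


def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(stacks, k):
    """(B) Build a k-mer dictionary and list stack pairs sharing any k-mer."""
    index = defaultdict(set)
    for seq in stacks:
        for km in kmers(seq, k):
            index[km].add(seq)
    pairs = set()
    for members in index.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def merge_stacks(stacks, k=4, max_dist=1):
    """(C) Merge candidate stacks within max_dist mismatches into putative loci."""
    parent = {s: s for s in stacks}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in candidate_pairs(stacks, k):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    loci = defaultdict(list)
    for s in stacks:
        loci[find(s)].append(s)
    return sorted(sorted(members) for members in loci.values())


reads = ["ACGTACGTAC"] * 3 + ["ACGTACGTAT"] * 2 + ["TTTTGGGGCC"] * 4 + ["ACGAACGTAC"]
stacks = form_stacks(reads)   # the singleton read is left out as a secondary read
loci = merge_stacks(stacks)   # the two ACGT... stacks differ at one base and merge
```

The k-mer dictionary only narrows the search: two stacks are compared by nucleotide distance only if they share at least one k-mer, which avoids an all-against-all comparison.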

Figure 2

Stacks web interface. (A) The interface allows a researcher to view observed haplotypes at each locus in all individuals. (B) Researchers can click each haplotype to view the stack itself. The interface provides extensive filtering facilities as well as the ability to annotate and export results in a number of formats, including Excel, JoinMap, and R/qtl.

Figure 3

Stacks simulation results. The stickleback reference genome was digested in silico by SbfI, and 60 bp reads were made from each direction from the 22,774 cut sites at several different sequencing depths with several different error rates. The left panel shows the number of (A) loci, (B) stacks, and (C) SNPs observed in the Stacks output. Loci that Stacks assembled incorrectly are displayed in a dark color, whereas loci containing repetitive sequences are shown in a crosshatch pattern. A comparison of the number of loci present in the dataset (A) vs. the number of stacks reconstructed (B) showed that ustacks collapsed repetitive loci but correctly reconstructed nearly all other loci at low and moderate error rates or at high coverage. The right panel shows the number of reads with a certain number of sequencing errors that were incorporated into correct stacks, incorrect stacks, and unused reads at a fixed sequencing coverage and error rates of (D) 0.5%, (E) 1%, and (F) 3%. As errors accumulated, Stacks excluded more reads, lowering the overall depth, whereas some reads accumulated enough errors to be incorporated into stacks that appeared to be correctly assembled but, in fact, joined stacks representing loci from which they did not originate (indicated by reads with more errors than allowed by the k-mer matching algorithm, four errors in the simulation).
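
The in silico digestion described in this caption can be approximated with the short Python sketch below. It is a simplification rather than the authors' simulation code: `digest_reads` and `add_errors` are hypothetical helper names, each read here starts at the beginning of the recognition sequence instead of modelling the exact cut position, and errors are modelled as uniform substitutions only.

```python
import random
import re

SBFI_SITE = "CCTGCAGG"  # SbfI recognition sequence (palindromic)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def digest_reads(genome, site=SBFI_SITE, read_len=60):
    """Return one read in each direction from every occurrence of site."""
    reads = []
    # A lookahead pattern also finds overlapping occurrences of the site.
    for match in re.finditer(f"(?={site})", genome):
        start = match.start()
        end = start + len(site)
        forward = genome[start:start + read_len]
        backward = reverse_complement(genome[max(0, end - read_len):end])
        # Skip reads that would run off either end of the sequence.
        if len(forward) == read_len:
            reads.append(forward)
        if len(backward) == read_len:
            reads.append(backward)
    return reads


def add_errors(read, rate, rng):
    """Substitute each base with a different base with probability rate."""
    out = []
    for base in read:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


genome = "A" * 20 + SBFI_SITE + "T" * 20
clean = digest_reads(genome, read_len=12)
noisy = [add_errors(r, 0.01, random.Random(0)) for r in clean]
```

Sampling each read several times before calling `add_errors` would mimic a given sequencing depth; that step is omitted here to keep the sketch short.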

Figure 4

Stacks depth of coverage distribution. (A) Correctly reconstructed stacks have a depth of coverage equal to twice the mean sequencing coverage because the simulation assumes diploid individuals. With no polymorphism or error (gray line), the depth of coverage distribution nearly matched the known simulation distribution (dotted red line), with the exception of repetitive loci, which created the long right tail of the distribution; the tail is truncated in the plot but extends further. After adding SNPs, ustacks failed to reconstruct a small number of loci (green arrow) as shown by the increase in stacks with a depth of coverage equal to the sequencing mean depth. (B–C) With the addition of sequencing error and increasing mean sequencing depth, most stacks were still properly reconstructed. Results showed a repeating pattern of improperly reconstructed stacks occurring at multiples of the mean sequencing depth corresponding to the number of loci improperly merged together. The increasing error rate caused a general loss of depth in the stacks (green vs. violet lines).
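
The depth pattern in this caption follows from simple arithmetic: if each allele copy is sequenced at the mean coverage, a stack built from n allele copies should have a depth near n times that mean. The Python sketch below applies this reading; `allele_copies` and `describe_stack` are hypothetical helpers, and treating `mean_coverage` as per-allele coverage is an interpretation of the caption, not something Stacks reports.

```python
def allele_copies(depth, mean_coverage):
    """Nearest whole number of allele copies implied by a stack's depth."""
    return max(1, int(depth / mean_coverage + 0.5))


def describe_stack(depth, mean_coverage):
    """Give a rough interpretation of a stack's depth under the diploid model."""
    copies = allele_copies(depth, mean_coverage)
    if copies == 1:
        return "single allele: alleles of a locus were likely not merged"
    if copies == 2:
        return "one diploid locus"
    return f"{copies} allele copies: likely repeats or merged loci"


# With a mean coverage of 20, depths near 20, 40, and 80 fall into
# the three categories above.
labels = [describe_stack(d, 20) for d in (19, 41, 80)]
```

Sequencing error lowers observed depth, as the caption notes, so real depths would sit somewhat below these multiples.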

Figure 5

Danio rerio RAD-tag map compared to the doubled haploid map. We constructed a RAD-seq genetic map of zebrafish (RADmap) using DNA from 42 individuals of the doubled haploid mapping panel (HSmap) that had been previously genotyped by microsatellites or single strand conformation polymorphism (Kelly et al. 2000; Woods et al. 2000; Woods et al. 2005). Stacks recovered the 25 zebrafish linkage groups (Figure S2) with lengths nearly identical to published values (3186 cM in the HSmap vs. 3160 cM in the RADmap). With 7861 markers, our RADmap had nearly twice as many markers as appeared in the HSmap (4073 markers). The inset shows the scale for marker density.

Figure 6

RADmap marker order is consistent with the sequenced zebrafish genome. A specific region on LG20 with no recombination in the RADmap spanned almost 10 Mb in the physical genome (inset). This recombination suppression could be due to a heterozygous inversion present in the genome of the mother of the gynogenetic HS mapping panel.
