pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data

Katharine L Korunes ¹, Kieran Samuk ²

Mol Ecol Resour. 2021 May;21(4):1359-1368. doi: 10.1111/1755-0998.13326. Epub 2021 Feb 5.

Affiliations

¹ Department of Evolutionary Anthropology, Duke University, Durham, NC, USA.
² Department of Biology, Duke University, Durham, NC, USA.

PMID: 33453139
PMCID: PMC8044049
DOI: 10.1111/1755-0998.13326

Erratum in

[No authors listed]. Mol Ecol Resour. 2022 Apr;22(3):1228-1229. doi: 10.1111/1755-0998.13571. Epub 2021 Dec 23. PMID: 34939740. No abstract available.

Abstract

Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are π and d_XY, which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in π and d_XY calculation: systematic bias generated by missing data of various types. Many popular methods for calculating π and d_XY operate on data encoded in the variant call format (VCF), which condenses genetic data by omitting invariant sites. When calculating π and d_XY using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of π and d_XY that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user-friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of π and d_XY in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of π and d_XY regardless of the form or amount of missing data. In summary, our software solves a long-standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
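The claim that the bias is proportional to the amount of missing data can be illustrated with a short worked equation. This is only a sketch under simplifying assumptions that are not stated in the abstract: a window of L sites, an observed subset O of size (1 - m)L, a common expected per-site diversity π at every site, and missingness that is independent of whether a site is variable. The symbols L, O, m, and the subscripts "naive" and "adj" are introduced here for illustration.

```latex
\begin{align}
\hat{\pi}_{\text{naive}} &= \frac{1}{L}\sum_{i \in \mathcal{O}} \pi_i,
& \mathbb{E}\!\left[\hat{\pi}_{\text{naive}}\right] &= \frac{(1-m)L\,\pi}{L} = (1-m)\,\pi, \\
\hat{\pi}_{\text{adj}} &= \frac{1}{|\mathcal{O}|}\sum_{i \in \mathcal{O}} \pi_i,
& \mathbb{E}\!\left[\hat{\pi}_{\text{adj}}\right] &= \frac{(1-m)L\,\pi}{(1-m)L} = \pi .
\end{align}
```

Under these assumptions, treating missing sites as invariant shrinks the expected estimate by the factor (1 - m): with 30% of sites missing (m = 0.3), the naive estimate is expected to be 70% of the true value, whereas dividing by the number of observed sites removes the bias.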

Keywords: bioinformatics/phyloinformatics; genomics/proteomics; molecular evolution; population genetics - empirical; software.

Figures

FIGURE 1

The logic and input/output of pixy demonstrated with a simple haploid example. (a) Comparison of two methods for computing π (or d_XY) in the face of missing data. These methods follow the first expression of Equation 1 but differ in how they calculate the numerator and denominator. In Case 1, all missing data are assumed to be present but invariant. This results in a deflated estimate of π. In Case 2, missing data are simply omitted from the calculation, both in terms of the number of sites (the final denominator) and the component denominators for each site (the n choose two terms). This results in an unbiased estimate of π. (b) The adjusted π method (Case 2) as implemented for VCFs in pixy. The example VCF (input) contains the same four haplotypes as (a). Invariant sites are represented as sites with no ALT allele, and greyed-out sites are those that failed to pass a genotype filter requiring a minimum number of reads covering the genotype (Depth ≥ 10 in this case).
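The two cases described in this caption can be sketched in a few lines of Python. This is a minimal illustration based only on the caption's wording, not pixy's actual implementation: the function names (`site_counts`, `pi_case1`, `pi_case2`), the use of `None` for a missing haploid genotype, and the toy data are all assumptions made here. The sketch also assumes at least two haplotypes per site.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Allele = Optional[str]  # None marks a missing (or filtered-out) haploid genotype


def site_counts(alleles: Sequence[Allele]) -> Tuple[int, int]:
    """Return (pairwise differences, pairwise comparisons) among called alleles."""
    called = [a for a in alleles if a is not None]
    diffs = sum(a != b for a, b in combinations(called, 2))
    comps = len(called) * (len(called) - 1) // 2  # n choose 2
    return diffs, comps


def pi_case1(sites: List[Sequence[Allele]], refs: Sequence[str]) -> float:
    """Case 1: treat every missing genotype as the reference allele,
    then average per-site diversity over all sites."""
    total = 0.0
    for alleles, ref in zip(sites, refs):
        filled = [ref if a is None else a for a in alleles]
        diffs, comps = site_counts(filled)
        total += diffs / comps
    return total / len(sites)


def pi_case2(sites: List[Sequence[Allele]]) -> float:
    """Case 2: drop missing genotypes from each site's n-choose-2 term and
    drop sites with no possible comparison from the site count."""
    per_site = []
    for alleles in sites:
        diffs, comps = site_counts(alleles)
        if comps > 0:
            per_site.append(diffs / comps)
    return sum(per_site) / len(per_site)


# Toy data: four haplotypes (columns) at four sites (rows).
sites = [
    ["A", "A", "T", "T"],      # fully called, variable
    ["C", "C", "C", "C"],      # fully called, invariant
    ["G", None, "A", None],    # two genotypes missing
    [None, None, None, None],  # site entirely missing
]
refs = ["A", "C", "G", "T"]  # hypothetical reference alleles

print(f"Case 1 (missing = reference): {pi_case1(sites, refs):.4f}")
print(f"Case 2 (missing omitted):     {pi_case2(sites):.4f}")
```

On this toy input, Case 1 evaluates to 7/24 and Case 2 to 5/9, so filling missing genotypes with the reference allele roughly halves the estimate. A production tool would also need to handle diploid genotypes and sites with fewer than two called alleles, which this sketch does not attempt.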

FIGURE 2

Comparison between pixy and existing methods in complete data sets. (a, b) The sampling distribution of π (a) and d_XY (b), as estimated from 10,000 simulated data sets using pixy and a variety of existing methods (see text for details). The red dotted line denotes the theoretical expectation for the mean of the sampling distribution, 4N_eμ = 0.04 (which is the same for π and d_XY in this particular case). The observed means of the sampling distributions are marked with inverted triangles. For clarity, estimates of π and d_XY above 0.100 are aggregated in the last bin ("0.100+"). (c, d) Direct comparisons between pixy’s estimates of π (c) and d_XY (d) and those from existing methods.

FIGURE 3

Comparison between pixy and existing methods in the presence of missing data. (a) π and (b) d_XY are shown as scaled estimates (each estimate is scaled by dividing by the estimate obtained from the parent data set with no missing data). Perfect congruence between estimates in the presence and absence of missing data is shown with the dotted line at y = 1. Estimates were obtained from data sets with varying proportions of missing genotypes (top row, a and b) and sites (bottom row, a and b).

FIGURE 4

Comparisons of estimates of π from whole genome data derived from 18 Anopheles gambiae individuals from the Ag1000G Burkina Faso (BFS) population. Each panel (a–d) depicts the estimates of π for the X chromosome performed using pixy (y-axis) and four other methods (x-axis, a–d). Points are coloured according to the proportion of missing data (of any type) calculated by pixy. The 1:1 line is shown in red.
