pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data
- PMID: 33453139
- PMCID: PMC8044049
- DOI: 10.1111/1755-0998.13326
Erratum in
- Mol Ecol Resour. 2022 Apr;22(3):1228-1229. doi: 10.1111/1755-0998.13571. Epub 2021 Dec 23. PMID: 34939740. No abstract available.
Abstract
Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are π and dXY, which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in π and dXY calculation: systematic bias generated by missing data of various types. Many popular methods for calculating π and dXY operate on data encoded in the variant call format (VCF), which condenses genetic data by omitting invariant sites. When calculating π and dXY using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of π and dXY that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user-friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of π and dXY in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of π and dXY regardless of the form or amount of missing data. In summary, our software solves a long-standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
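The bias described in the abstract can be illustrated with a small toy calculation. The sketch below is not pixy's code or interface; the function names (`pi_naive`, `pi_missing_aware`), the haploid 0/1 allele encoding, and the use of `None` for a missing call are all assumptions made for illustration. It contrasts an estimator that fills missing alleles with the reference allele against one that counts only comparisons between called alleles.

```python
from itertools import combinations

def pi_naive(sites, n_samples):
    """Hypothetical 'naive' estimator: missing alleles (None) are treated
    as the reference allele (0), and every site contributes all pairs."""
    n_pairs = n_samples * (n_samples - 1) // 2
    diffs = 0
    for site in sites:
        filled = [0 if allele is None else allele for allele in site]
        diffs += sum(a != b for a, b in combinations(filled, 2))
    return diffs / (n_pairs * len(sites))

def pi_missing_aware(sites):
    """Hypothetical missing-aware estimator: only pairs of called alleles
    are compared, and the denominator counts only those comparisons."""
    diffs = 0
    comparisons = 0
    for site in sites:
        called = [allele for allele in site if allele is not None]
        diffs += sum(a != b for a, b in combinations(called, 2))
        comparisons += len(called) * (len(called) - 1) // 2
    return diffs / comparisons if comparisons else float("nan")

# Toy data (assumed): 4 haploid samples at 4 sites; 0 = reference, 1 = alternate.
sites = [
    [0, 1, 0, 1],              # variant site, fully called
    [None, None, None, None],  # site with no calls at all
    [0, 0, 0, 0],              # invariant site, fully called
    [0, None, 1, 1],           # variant site with one missing call
]

print(pi_naive(sites, n_samples=4))
print(pi_missing_aware(sites))
```

On this toy input the naive estimate is 8/24 (about 0.33) and the missing-aware estimate is 6/15 (0.4): the uncalled site inflates the naive denominator as if it were known to be invariant, which pulls the estimate downward.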
Keywords: bioinformatics/phyloinformatics; genomics/proteomics; molecular evolution; population genetics - empirical; software.
© 2021 John Wiley & Sons Ltd.