[フレーム] [フレーム]
  • Loading metrics

Open Access

Peer-reviewed

Methods

Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets

  • Guo-An Qi ,

    Contributed equally to this work with: Guo-An Qi, Qi-Xin Zhang

    Roles Data curation, Formal analysis, Investigation, Validation, Visualization

    Affiliations Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China, Center for Laboratory Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China

  • Qi-Xin Zhang ,

    Contributed equally to this work with: Guo-An Qi, Qi-Xin Zhang

    Roles Data curation, Formal analysis, Investigation, Validation, Writing – review & editing

    Affiliation Department of Epidemiology, School of Public Health, Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China

  • Jingyu Kang,

    Roles Methodology, Validation

    Affiliation School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China

  • Tianyuan Li,

    Roles Methodology

    Affiliation School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China

  • Xiyun Xu,

    Roles Investigation, Validation

    Affiliation School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China

  • Zhe Zhang,

    Roles Funding acquisition, Investigation, Validation

    Affiliation Department of Animal Science, College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, China

  • Zhe Fan,

    Roles Formal analysis

    Affiliation School of Public Health (Shenzhen), Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China

  • Siyang Liu,

    Roles Funding acquisition, Investigation, Supervision

    Affiliation School of Public Health (Shenzhen), Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China

  • Guo-Bo Chen

    Roles Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Visualization, Writing – original draft

    * E-mail: chenguobo@gmail.com

    Affiliations Center for Laboratory Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China, Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, Zhejiang, China

Author summary

For a complex trait, heritability () gives the genetic determination of its variation. Given the emergence of biobank-scale data, a more powerful method is needed to estimate . Based on the framework of Haseman-Elston regression (RHE-reg), we integrate a fast randomization algorithm to estimate , and RHE-reg can tackle biobank-scale data, such as UK Biobank (UKB), very efficiently. Furthermore, we present an analytical solution that balances computational cost and precision of the estimation, a property that is important in dealing with biobank-scale data. We investigated the performance of the RHE-reg in simulated data and also applied it for 81 UKB quantitative traits; as tested in UKB data of nearly 300,000 unrelated individuals, it took on average about 4.5 hours to complete an estimation when used 10 CPUs. We extended the application of RHE-reg into distributed datasets when privacy is not compromised. As shown in UKB and simulated data the performance of RHE-reg was accurate in estimating . The software for estimating SNP-heritability for biobank-scale data is released.

Citation: Qi G-A, Zhang Q-X, Kang J, Li T, Xu X, Zhang Z, et al. (2025) Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets. PLoS Comput Biol 21(10): e1013568. https://doi.org/10.1371/journal.pcbi.1013568

Editor: Androniki Psifidi, Royal Veterinary College, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: January 12, 2025; Accepted: September 29, 2025; Published: October 21, 2025

Copyright: © 2025 Qi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All codes for simulation study and practical protocol are available on GitHub (https://github.com/gc5k/gear2). The genotype-phenotype data used in our analyses are available from UK Biobank (https://www.ukbiobank.ac.uk). The UKB data can be accessed following successful application at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.

Funding: This work was supported by the National Natural Science Foundation of China (32272832 to ZZ, and 31771392 to GBC), Shenzhen Basic Research Foundation (20220818100717002 to SL), Guangdong Basic and Applied Basic Research Foundation (2022B1515120080 to SL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abstract

Estimation of heritability has been a routine in statistical genetics, in particular with the increasing sample size such as biobank-scale data and distributed datasets, the latter of which has increasing concerns of privacy. Recently a randomized Haseman-Elston regression (RHE-reg) has been proposed to estimate SNP-heritability, and given sufficient iteration () RHE-reg can tackle biobank-scale data, such as UK Biobank (UKB), very efficiently. In this study, we present an analytical solution that balances iteration and RHE-reg estimation, which resolves the convergence of the proposed RHE-reg in high precision. We applied the method for 81 UKB quantitative traits and estimated their SNP-heritability and test statistics precisely. Furthermore, we extended RHE-reg into distributed datasets and demonstrated their utility in real data application and simulated data.

Introduction

Estimating heritability has been one of the central tasks in statistical genetics [1]. Given the increasing sequencing capability, high-throughput genetic data have been emerging in the form of biobank-scale [2], which challenges statistical computation, in particular, such as the estimation of heritability for complex traits. Conventional methods, such as REML, for estimating heritability, like the linear mixed model, often takes computational cost of , where is the sample size and is the number of markers. These costs can become infeasible in the context of biobank-scale data. Haseman-Elston regression (HE-reg) was originally proposed for linkage analysis [3]. After the nuclear correlation between sib pairs is replaced by the linkage disequilibrium (LD) for unrelated samples, the modified HE-reg can be used to estimate heritability and is much faster than REML [4]. Given that is often greater than given the current data, any calculation that is upon genetic relationship matrix (GRM) will be unfavorable even for HE-reg. To reduce the computational cost of estimating heritability, a randomized estimation of heritability has been introduced [5], called randomized Haseman-Elston regression (RHE-reg), which is a promising method that can be used for both single-trait and bi-trait analyses [6,7].

RHE-reg is built on a hybrid framework, which has favored analytical properties of the Haseman-Elston regression and the feasible computational cost of for biobank-scale data; is the round of iteration for RHE-reg. As pointed out by a recent systematic review, iteration control poses one of the challenges for RHE-reg [8]. However, the original report by Wu and Sankararaman did not give a clear solution for the round of iteration [5]. In this study, we investigated RHE-reg and found an analytical procedure to control , which can provide customized iteration for a given data.

Now, one of the trends is that genomic cohorts are mushroomed such as emerging non-invasive prenatal testing cohorts [9,10], but the bottleneck is how to share genomic data without compromising personal privacy [11]. As recently practiced, when genotypes have been masked in randomization, the randomized method has proven to be reliable in addressing genetic problems for distributed data, such as searching relatives [12,13]. Following this idea, it is found that after the randomization step, RHE-reg can be modified to estimate heritability for distributed datasets, reminiscent of vertical or horizontal federated learning [14].

Method description

Wu and Sankararaman proposed a randomized implementation for the Haseman-Elston regression (RHE-reg), which dramatically reduced the computational time from to in dealing with [5]; is the genetic relationship matrix for individuals on markers, and see its detailed definition in the section below. It is clear that a large , indicating more iteration of the presented algorithm, is helpful in improving precision, but it is unsolved how to get an estimate for and its role in determining the boundaries of key statistics, upon the standard errors of the randomized estimator [8]. This work is in general consistent with Wu and Sankararaman’s work, but we present the analytical sampling variance of the estimated and its corresponding test statistics after correction of some technical errors in their original work. We can consequently evaluate how influences the estimation of heritability and its corresponding score, and, as data can be very large, the control of is of theoretical as well as practical importance. An analytical resolution crystallizes a computational procedure, and we further extend the method to another two new scenarios, called vertical-RHE-reg, which is a global implementation for LD score regression [15], and horizontal-RHE-reg, which enables Federated Learning but we estimate heritability in distributed data without compromise of privacy [14].

Materials and methods

A framework for Randomized Haseman-Elston regression (RHE-reg)

In essence, Haseman-Elston regression is a kind of method of moments (MoM) estimation for heritability, and can provide equivalent estimates of heritability for complex traits after IBD is replaced with IBS [4,11]. As we extend the work of Wu and Sankararaman [5], we similarly assume that

in which is the standardized phenotype of the traits of interest, is the standardized genotype matrix of individuals, is the number of double allelic markers, is the cumulative effect related to each of the markers, is the residual effect, is an identity matrix, is the SNP heritability, and is an identity matrix, is the residual variance. Under the general assumption for a polygenic trait, it is easy to see that

is the genetic relationship matrix (GRM); the moment estimator, or randomized Haseman-Elston regression, is to minimize . Of , by differentiating and , respectively, we have the following normal equations:

(1)

The preliminary estimators for and are given as

(2)

For ease of discussion, we now only focus on the expression without adjustment of covariates. The denominator involves , a high-order function for GRM. Alternatively, according to the trace property of a matrix, it can be calculated that , a summation of the square of each element in . We proved that the expectation of , where is the effective number of markers that depicts the average squared Pearson’s correlations among all genomic markers as often used for measuring linkage disequilibrium [12,16,17]; a brief sketch of how can be transferred into is also presented in the section "Estimation for the effective number of markers" below. Therefore, the expectation for the preliminary estimator of is for a typical polygenic trait as established [4,18]; is the averaged LD between a marker and a causal variant, and is the averaged LD between any pair of markers – including the LD of a marker with itself. At first glance at Eq 2, it seems inevitable to compute , the computational cost of which is , a substantial cost given a large sample size, such as for UKB of about 500,000 samples. We obtain the estimate of according to the properties of matrix algebra, and is the exponential index and takes the value of 1, 2, 3, or 4 upon the application in this study.

(3)

Where is a linear estimator for , is a vector of length and each element of is sampled from the standard normal distribution, and is the round of iterations. Of note, the sampling variance of . As will be shown below, will be a plugin parameter in the analysis below, and we suggest a robust estimation of from rather than . Eq 3 is the most innovative part in the work of Wu and Sankararaman, and it is known as Girard-Hutchinson estimation for stochastic trace estimation [19, 20]. Of note, , which was incorrectly derived as in Wu and Sankararaman’s work [5], and to fix their problem directly led to the present work.

Randomized estimation for via RHE-Reg

When there is random mating, , and substituting the expressions given as Eq 3 into Eq 2, a randomized estimator of heritability is

(4)

The component in the denominator is no other than a shuffling nature of the estimation with rounds of resampling.

Sampling variance of RHE-reg

Of Eq 4, we have and , and their respective mean and variance are

The randomized estimator of can be seen as a ratio of , in which both and are variables, and, according to Delta method, its sampling variance can be expressed as , in which the covariance term can be zeroed out in this scenario [21]. So, we can obtain

(5)

For the definition of please refer to the section "Estimation for key parameters". As is a random variable, using Taylor approximation can be obtained by . .

(6)

in which the second term is the bias of the RHE-reg estimator. At the same time, we can also find the mean squared error (MSE), the summation of the sampling variance and squared bias, for as below

(7)

In this polynomial expression, as will be shown in the simulation and real data analysis, is largely upon the sampling variance, which can be further reduced with sufficient iterations (). As will be shown for UK Biobank examples, dynamically ranges from 10 to 200, even greater upon many factors.

Constructing test statistics

Given the estimation of heritability, we can construct the z-score statistic below:

(8)

in which , a quantity that will be zeroed out after sufficient iterations, and can be estimated from Eq 5. Obviously, when is large enough, the optimal z score is the following:

(9)

There is an obvious relationship between two z scores in Eq 8 and Eq 9 (practically ). Given we can predict optimal test statistic as below:

(10)

It means that after iteration the expectation of the test statistic is predictable in certain degree.

In summary, given iterations the test statistic observed is , which is subject to the realized values of , , and . is the expected optimized test statistic when is very large and zeroed out all uncertainty due to iteration. is a reconstruction of , and , indicating how a larger seem to bring out advantage in such as a more significant p-value.

Estimation for key parameters

There are several key quantities/parameters involved in the above equations for RHE-reg, and we present how to estimate them. These parameters are – effective number of markers, the trace of fourth-order GRM, and .

Estimation for the effective number of markers ()

can be decomposed into four terms upon (or ) and (or ), and according to Isserlis’s Theory [22], having integrated these four terms, we have

in which the effective number of markers and the squared Pearson’s correlation of LD between a pair of SNPs [23]. Often , and if all markers are in linkage equilibrium (see the note of Table 1 ). Here, is a population parameter, a summary statistic that encompasses allelic frequencies and linkage disequilibrium of makers. According to Eq 3, , we consequently propose a randomization algorithm, which estimates as below

Table 1. Table for high-order moments for different coding scheme for genotypes.

https://doi.org/10.1371/journal.pcbi.1013568.t001

(11)

A more detailed estimation procedure for can be found in our recent work [16]. See Note I in S1 Text for more details.

Estimation for

The benchmark estimation for is , a fourth-order summation of the eigenvalues of . However, it is computationally expensive when is large. There are two alternative choices to estimate . Method I: , and would affect its precision. Method II: , which uses the fourth-order randomized estimation in Eq 3. Both Method I and Method II can be realized via Eq 3. As will be shown below, Method II provides more stable estimates than Method I.

About —high-dimension structure of genetic architecture

For , , in which if is replaced by because is too expensive to constructed as aforementioned, we consequently have

(12)

can be estimated as in Eq 3 if is replaced by the phenotype itself; it is similarly for and . They reflect high-dimensional structure between and . So the sampling variance of is not only related to itself, but is eventually upon the high-order structure between and . See Note II in S1 Text for more discussion about .

About — the term determines the iteration B

We define the ratio as below

(13)

in which can be estimated as as above. However, it should be noticed that is a heavy penalty for higher heritability; for example, comparing with , leads to a 100-fold penalty for the latter in the numerator of Eq 13. Easily, we can estimate if we want to know how many iterations are needed to reach the preset ratio of

(14)

In practice, can take the value of 0.1 or 0.05 as in our simulation and real data analysis below.

Extended utilities for distributed GWAS datasets

Because datasets are often distributed across institutes, we consequently consider two scenarios for the application of RHE-reg in distributed datasets. As the estimation for (Eq 4) can be split into the numerator and the denominator, the numerator and the denominator are estimated from two different sources. In the other scenario, the whole dataset has been distributed into small slices at different institutes. We call the first scenario the vertical RHE-reg and the latter horizontal RHE-reg.

Vertical RHE-reg

Estimation for can be implemented in summary statistics that the numerator and the denominator can be from different components [18]. We denote the correspondingly heritability for this subtle difference, as well as all tilded symbols from a reference panel that is related to genotypes. Alternatively, Eq 4 can be rewritten as

(15)

the denominator . , in which has the dimension of ; is the genotype matrix of the reference sample that is employed to estimate . So, is (assuming )

The bias is , which will be zeroed out when increases (see Note III in S1 Text for more general situations). The corresponding test statistic is

in which . For the population of similar ancestry, the ratio is cancelled out after sufficient iteration, and leads to

Horizontal RHE-reg

For this application, it is assumed that the entire dataset is divided into institutes ( is subscript for and ). Consequently, , the whole data and are distributed in institutes, and the length of upon how the proportion of data has in the institute; , similarly the dimension of is in which is the number of individuals in the institute. One only needs to receive the mean and summation of square for each , and similarly for receiving the allele frequencies of the reference alleles of . So after scaling for and ,

(16)

, a matrix, can be generated from , by each institute, and consequently independently generate and without compromise of privacy; the subscript indicates Frobenius norm of a matrix that . Upon the precision requirement, after rounds of iterations, can be calculated so as to evaluate whether further iterations are needed. Unlike the vertical RHE-reg, the horizontal RHE-reg is identical to the RHE-reg under this simple scenario. An R script is attached for its detailed implementation (S1 Data).

Summary for RHE-reg

Now we discuss some computational issues about RHE-reg. So, eventually will creep into the RHE-reg. The focus here is to investigate how would affect the RHE-reg, in particular the stability of and z scores. All the above analyses are based on three computational units, , , and – if covariates are taken into account, and the operation between them lead to the whole computational procedure, of which their elementary operations can be implemented hierarchically (Table 2). We give an atlas for the computational route. Furthermore, we have covariates, and the covariate matrix is of dimensions. After inclusion of the covariates, the equations for stopping rules can be updated accordingly (see Note IV in S1 Text). We finish the description of the statistical approaches and go to their applications now.

Software

We have developed computer software that can handle biobank-scale algorithm presented in this study. The software reads genotype in binary format as defined in such as PLINK. For fast vector-matrix multiplication, Mailman algorithm is employed here [24]. We adapt the implementation of the Mailman algorithm from Agrawal’s fast PCA project [25]. It is known that using the Mailman algorithm the vector-matrix multiplication in is reduced from to . There is no conceptual obstacle to applying the method for genotype data in dosage format, but the Mailman algorithm cannot proceed in such a scenario. There are many matrix multiplication included, and in programming some suggested tips to take them out is as shown (S1 Fig).

Results

Simulation results

We conducted simulations to evaluate the aforementioned theoretical results under various parameters. The reference allele frequency was evenly sampled from 0.1 ~ 0.5, and was set three values of 0, 0.1, and 0.25, and all SNPs were considered causal after a typical polygenic model, which follows Normal distribution. 1) The linkage disequilibrium (Lewontin’s ) for each pair of consecutive SNPs were 0, 0.2, 0.4, 0.6, and 0.8 for consecutive SNPs. 2) We set three levels of unrelated samples 1,000, 5,000, and 10,000, respectively. 3) Three levels of SNP numbers 10,000, 50,000, and 100,000. These five parameters could totally carry out 45 simulation scenarios for each by our in-house simulation code, and its detailed implementation can be found in Zhang et al [16]. For each simulation scenario, we set the value of 10, 20, and 50 in order to find proper . , (as well as their allele frequencies), , and were considered to investigate how to determine . Although neither nor reaches real biobank-scale data, we investigate and summarize certain properties of RHE-reg under these 135 scenarios in the results below. The biobank-scale test is to be investigated in UK Biobank examples.

Result 1: Randomized estimation for

As shown in the method section, was appeared as one of the key parameters in determining the performance of the sampling of RHE-reg. The direct estimation of from the eigenvalues of was the golden standard, and we consequently compared Method I, , and Method II, , with its direct estimation. As shown in Fig 1, the above 135 simulation scenarios were compared with the direct estimation for . For Method I, increasing from 10 to 50 could increase the precision of the estimation. In contrast, Method II showed very consistent and high precision for the estimation of regardless of the sample size, an increasing of from 10 to 50 did not help improve precision. The advantage of Method II was probably because estimated as its mean, whereas as its sampling variance. So, hereafter we used Method II to estimate . Of note, as is computational expensive when is large so that only limited sample size and SNP numbers were tested in these 135 simulations; however, the principal results should be retained for an even larger sample size, as well as , but with more expensive computational cost in solving eigenvalues. Biobank-scale performance of the proposed method will be illustrated in UK Biobank 81 traits in Result 4.

Fig 1. Comparison for the estimation of .

The x-axis represents benchmark estimation for directly, and y-axis represents the estimation of using Method I or Method II respectively. The diagonal line (solid black) is for comparison. Each fitted line shows the correlation between all 135 estimations with their benchmark estimation .

https://doi.org/10.1371/journal.pcbi.1013568.g001

Result 2: MSE of RHE-reg

In Eq 7, , and we defined according to Eq 14. In Fig 2, we showed how MSE and could be reduced by for these 90 simulated scenarios (excluded 45 scenarios under ). We only illustrated the results for and 10,000, respectively, because was too small a sample size here for efficient convergence. The top row of Fig 2 illustrated how MSE were reduced by , and obviously a much larger reduced MSE because was turned down. Actually the bias term played little weight in MSE, which was dominantly determined by . was at least one or two order of magnitude compared with . In the second row of Fig 2 reflected how quickly vanished after iterations. Neither LD nor played an important role in determining MSE for the simulated scenarios, but the ratio between and mattered much as under the same sample size, more SNPs always inflated MSE.

Fig 2. Evaluation for the MSE of RHE-reg under the different simulation scenarios.

The top row (A-C) represents the comparison for MSE under different for 90 simulated scenarios, and the bottom row (D-F) represents the comparison for the ratio between and . In each panel, 90 simulated scenarios are split into 6 groups given different combination for sample sizes ( = 5,000, and 10,000) and SNP numbers ( = 10,000, 50,000, and 100,000). In each group 15 points can be split into 5 groups from left to right for different LD levels ( = 0, 0.2, 0.4, 0.6, and 0.8) and each LD group has three simulated (0, 0.1, and 0.25), respectively. In each panel, the grey line indicates the mean of the investigated values for the corresponding 15 scenarios.

https://doi.org/10.1371/journal.pcbi.1013568.g002

Result 3: Randomized estimation for and z-score

In result 3, we studied how could influence and its z-score. As the sampling variance of was reciprocal to the sample size under the null hypothesis and , it was obviously to see in simulation that: greater , and greater would help to bring out a more stable estimation for (Fig 3A-C). If we employed from as the benchmark, when sample size , there was very high consistent estimation for even (Fig 3C). is the sampling variance of REML when [26].

Fig 3. Estimation of and z-score after different .

Each plot illustrates the comparison of the estimated heritability (A-C) and z-score (D-F) given (x-axis) vs 20 and 50 (y-axis) under different sample size , 5,000, and 10,000, respectively. The black solid line is the reference line of y = x, and the coloured solid line is the fitted regression, which is printed in each plot. In each plot, there are 45 points in each colour as simulated.

https://doi.org/10.1371/journal.pcbi.1013568.g003

The availability of the score of the estimated heritability was important for statistical inference. We evaluated the influence of in determining the performance of the randomized algorithm (Fig 3D-F). It was known from the above analysis , so when the estimation of became stable the test statistic was stable too. So, was relative stable when (Fig 3E) or (Fig 3F). When the sample size was sufficiently large, a few iteration could guarantee high accuracy of the estimation. In addition, we also tested the estimation by setting , 0.75, and 0.9, respectively, and the results, as promised by our theory, were consistent to what observed.

Result 4: Application of horizontal RHE-reg

This study was to estimate heritability for distributed data as exact as a single piece of data. Two cohorts with 4,000 and 6,000 individuals, respectively, were generated to verify h-RHE-reg. was set the value of 0, 0.1, and 0.25, respectively. The effects of all 10,000 SNPs were sampled from the distribution . Heritability and z scores were estimated using individual-level RHE-reg as well as h-RHE-reg. was set of 10, 20, and 50. The genotypes of the two simulated cohorts were standardized by and for the -th locus, where was the average allele frequency. The phenotypes of the two cohorts were standardized by and , where and . Each simulation scenario had 10 repeats (see Source1.R and Source2.R in S1 Data for its implementation).

The estimates of heritability and its z score were consistent using individual-level RHE-reg and h-RHE-reg in all scenarios when the random vectors were the same (Fig 4). We also split the data into 2,000 and 8,000 individuals, and, as expected, the results were nearly identical and unbiased.

Fig 4. Application of horizontal RHE-reg in simulation studies.

Estimated heritability (A-C) and z-scores (D-F) obtained from RHE-reg ( axis) and h-RHE-reg ( axis) under different settings of 10, 20, and 50, respectively. Point colors represent the simulated heritability. Each scenario was repeated 10 times. The dashed line represents the identity line ().

https://doi.org/10.1371/journal.pcbi.1013568.g004

Real data analysis for UK Biobank

We chose the unrelated 292,223 British white who have no kinship found, as indicated by the genetic kinship provided in the UK Biobank (field 22021) for real data test [2]. After quality control, the inclusion criteria were: MAF > 0.01, missing call rate < 0.05 and Hardy-Weinberg proportion test p-value > 1e-6, whose genotype call rate > 0.95, and 525,460 autosome SNPs were included for analysis. We estimated heritability of the 81 quantitative traits, and included the top two principal components and sex as covariates.

We used two strategies to estimate heritability. In strategy I, denoted as B+ strategy hereafter, we set as a warm-up step to evaluate and was set of 0.05. After the warm-up of iteration, we then increased iteration by a step of 10, We then estimate final realized , , , and three kinds of scores until the convergence ratio of ; however, we set a hard stop for even if was still greater than 0.05. In strategy II, we directly set , 20, or 50 without further considering additional iteration anymore, and consequently denoted as , , and strategies hereafter.

Comparison of the UKB results between strategy I and II

Even a couple of traits were set to take a hard stop because their were greater than 200, the estimated for the 81 traits had a mean of 0.0518, which was very close to the preset (Fig 5A1). It indicated that our theory worked to control the precision of the sampling variance of RHE-reg. The trait "age of diabetes diagnosed" had , extremely large standard error compared to other traits, because of its smallest sample size of (Fig 5B1). In Fig 5C1, we got three z scores, which are score directly calculated given iterations (green colored, via Eq 8), the optimal z score when was infinitive (blue colored, via Eq 9), and the predicted score (pink colored, via Eq 10).

Fig 5. Randomized estimation for heritability for UK Biobank 81 quantitative traits.

A1-D1) The performance of RHE-reg given the respective number for each trait, , effective number of markers using randomized estimation (), (the vertical line covers 95% confidence interval), and scores estimated in three methods. Three z scores are plotted, the green colored z scores are directly estimated given iterations for each trait (Eq 8), the pink colored z scores are optimal z score (Eq 9), and the blue colored z scores are directly estimated given (Eq 10). A2-A4) Comparison for between that of and , 20, and 50, respectively. B2-B4) Comparison for between that of and , 20, and 50, respectively; the vertical and horizontal lines are the means of from x-axis and y-axis, respectively. C2-C4) Comparison for between and , 20, and 50, respectively; the fitted lines is printed on the top left corner of each plot. D2-D4) Comparison for the three pairwise scores. The green colored z scores are estimated in Eq 8 given B+ and the number of B as shown on the x-axis label, the pink colored z score are estimated in Eq 9, and blue colored ones in Eq 10, respectively.

https://doi.org/10.1371/journal.pcbi.1013568.g005

For comparison, we examined the corresponding statistics that were estimated under , , and , respectively. In strategy II, A larger led a smaller as expected (Fig 5A2-4). Interestingly, regardless the change of in strategy II, were very consistent to those estimated from strategy I, as shown that the fitted regression lines were very close to 1 (Fig 5B2-B4). Three types of z scores were compared (Fig 5C2-C4), and the optimal z scores from both strategies were nearly perfect (blue points and blue dashed lines). Then, as shown in Fig 5C4, the three kinds of z scores were nearly completely matched.

In addition, the estimates were also consistent with our previous results using a less efficient method [27], and see S1 Table for more details. The heritability estimated by the randomization algorithm exhibited a relative high degree of correlation (Pearson’s correlation coefficient of 0.77) with the previous estimates for 81 traits. Compared to the previous results, the was nearly consistent with the GRM-based estimates, and is with averaged 1.38% deviation after 10 iterations and further decreased to 1.23% deviation after 50 iterations (S1 Table).

We also compared the computational efficiency of RHE-reg with GCTA [28] and BOLT-REML [29] in estimating the heritability on BMI. The comparison was conducted on a sub-dataset in UKB with randomly selected 10,000 individuals and 523,945 SNP markers after filtration. The results indicate significant efficiency improvement in estimating the heritability of complex traits in biobank-scale datasets for RHE-reg, with computation times reduced by 96.6% and 83.8% compared to GCTA and BOLT-REML, respectively (S2 Table). More benchmark comparison of the computational performance could be found in earlier studies [18,5]. Even using a complete dataset, RHE-reg could also complete heritability estimation within an acceptable time (S3 Table). In our tested 81 UKB traits, with 10 threads, it on average took 453 mins to finish the analysis of a trait and the average iteration of .

Application of vertical RHE-reg

Of Eq 15, indicates that and can be from two independent sources. Consequently, we split each UKB trait evenly into halves to test the v-RHE-reg, and Eq 15 had four possible combinations: 1) split 1/1: both and were estimated from split 1; 2) split 2/1: was estimated from split 2 and split 1; 3) split 1/2: was estimated from split 1 and split 2; 4) both and were estimated from split 2. So, we had four estimators as below

Of each trait, its heritability and z score tests could be constructed within each split and between each split by exchanging the estimation, and consequently brought out v-RHE-reg. As shown in Eq 15, we compared the result for =10, 20, and 50, respectively, and observed consistent results between split 1 and split 2, and between split 1/2 and split 2/1.

Fig 6 showed the results of these four estimators under different . It illustrated that pairwise estimates against , and against , and as observed the pairwise estimates were quite consistent with each other both within and between splits.

Fig 6. Application of v-RHE-reg for 81 quantitative traits in UKB.

The top and the bottom row represent heritability and z score respectively. Each column illustrates results after iterations. In each plot, the coordinate for a grey point is heritability/z-score estimated from split 1 (x-axis) and split 2 (y-axis).

https://doi.org/10.1371/journal.pcbi.1013568.g006

Discussion

The presented study is developed on the randomized Haseman-Elston regression for the estimation of SNP-heritability proposed recently by Wu and Sankararaman (2018) [5]. They very smartly used a randomization approach – Girard-Hutchinson estimation, which significantly reduces the computational cost in estimating from to [20,19]. However, the drawbacks of their method may be its unclear property for , which further leads to obscure sampling variance of the estimated heritability. As discussed in a recent review, it has been obscure in the original RHE-reg since no closed-form solutions were provided to quantify the connection between and the estimation procedure [8]. After integrating analytical results for Haseman-Elston regression into this randomized framework [4], we present here a close-form solution for RHE-reg. Having provided the sampling variance, we are able to evaluate how influences the estimation procedure of RHE-reg precisely. In particular, a key element that is related to the sampling variance of , which is proportional to . It should be noticed under the null hypothesis that as established previously [4,26,18]. The quantity of is identical to the sampling variance of REML under the null hypothesis or that of modified Haseman-Elston regression [26,4,18]. Of note, the present study is focused on the presence of typical polygenic architecture because counterexample, albeit pathological, can be found when causal variants are distributed not random as discussed [4,27].

A nature extension of the method is to include multi-component, such as for the estimation for each chromosome. It is obvious that the method for deriving sampling variance should be extended for multi-components estimation if their corresponding and are in global linkage, or nearly, equilibrium, which is often the case for human populations [17]. Much advanced numerical tools, such as condition numbers, are needed to evaluate the approximation of the randomized algorithm [30]. Some inconsistency between GRM-based estimation and randomization estimation, such as the overall correlation of 0.77 for estimated heritability between Xu et al.’s results and the current result, may arise from the different covariates chosen [27]. In Xu et al.’s work, the heritability was estimated under the first two PCs corrected, while the current randomization method further took gender as extra covariate, this may cause the observed discrepancy especially in the gender-related traits.

In summary, the purpose of the present study is two-fold. First, we provide a method to balance iteration and precision of estimation, and an improved implementation of RHE-reg is realized. Secondly, we extend RHE-reg into the estimation of SNP-heritability for distributed data, which uses the controlled to synchronize the estimation across datasets. With increasing genomic cohorts but distributed in different institutes, it is now a trend to propose computational solutions without compromising privacy [31]. The enhanced RHE-reg framework can consequently have computational and analytical merits, and, as demonstrated, we further extend its utilities such as vertical- and horizontal RHE-reg, as demonstrated in this study. Given the increasing cry for genomic privacy, both vertical and horizontal RHE-reg will be meaningful in securing genomic information. However, given its traditionally very quantitative origin of statistical genetics, statistical routines may have competing, if not superior, solutions than those derived from available information technology [32,12].

It is straightforward to apply the estimation procedure for the estimation of dominance variance components both for individual-level data and summary statistics. The only update of the equation is to replace with . For each SNP, is coded 0, 2p, and 4p-2 for the genotype that counts 0, 1 and 2 reference alleles; and furthermore, is further scale by [33,34]. So for a pair of individual and After replacing with , all the above estimation procedure can be applied for . Furthermore, The effective number of markers in terms of is , a tetradic form of LD for a pair of SNPs.

Supporting information

S1 Text. Technical details on some mathematical derivations.

"Effective number of markers" (Note I); "Discussion of " (Note II); "Sampling variance for vertical RHE-reg" (Note III); "Adjustment for covariates" (Note IV); "Coding scheme and LD" (Note V).

https://doi.org/10.1371/journal.pcbi.1013568.s001

(DOCX)

S1 Fig. This figure gives the computational tips on how to program the software.

Some sequential operation of the matrix is suggested to make the program easy to write.

https://doi.org/10.1371/journal.pcbi.1013568.s002

(TIFF)

S1 Data. Implementation for Horizontal RHE-reg (R code: Source1. R and Source2.R).

https://doi.org/10.1371/journal.pcbi.1013568.s003

(ZIP)

S1 Table. SNP-heritability estimation for 81 UKB traits (XLSX).

https://doi.org/10.1371/journal.pcbi.1013568.s004

(XLSX)

S2 Table. The time cost for heritability estimation on BMI for RHE-reg, BOLT-REML and GCTA.

https://doi.org/10.1371/journal.pcbi.1013568.s005

(XLSX)

S3 Table. Computational time for 81 continuous traits using RHE-reg.

https://doi.org/10.1371/journal.pcbi.1013568.s006

(XLSX)

Acknowledgments

We thank the participants of the included cohorts and of UK Biobank for making this work possible (UKB application 41376).

References

  1. 1. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era--concepts and misconceptions. Nat Rev Genet. 2008;9(4):255–66. pmid:18319743
  2. 2. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
  3. 3. Haseman JK, Elston RC. The investigation of linkage between a quantitative trait and a marker locus. Behav Genet. 1972;2(1):3–19. pmid:4157472
  4. 4. Chen G-B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Front Genet. 2014;5:107. pmid:24817879
  5. 5. Wu Y, Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34(13):i187–94. pmid:29950019
  6. 6. Wu Y, et al. Fast estimation of genetic correlation for biobank-scale data. Am J Hum Genet. 2022;24:24–32.
  7. 7. Patel V, et al. Scientific applications leveraging randomized linear algebra. arXiv. 2025.
  8. 8. Tang M, Wang T, Zhang X. A review of SNP heritability estimation methods. Brief Bioinform. 2022;23(3):bbac067. pmid:35289357
  9. 9. Yang Z, Hu L, Zhen J, Gu Y, Liu Y, Huang S, et al. Genetic basis of pregnancy-associated decreased platelet counts and gestational thrombocytopenia. Blood. 2024;143(15):1528–38. pmid:38064665
  10. 10. Xiao H, et al. Genetic analysis of 104 pregnancy phenotypes in 39, 194 Chinese women. medRxiv. 2023.
  11. 11. Chen G-B, Liu S, Zhang L, Huang T, Tang X, Li Y, et al. Building and sharing medical cohorts for research. Innovation (Camb). 2024;5(3):100623. pmid:38665391
  12. 12. Zhang Q-X, Liu T, Guo X, Zhen J, Yang M-Y, Khederzadeh S, et al. Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression. PLoS Genet. 2024;20(1):e1011037. pmid:38206971
  13. 13. Chen G-B, Lee SH, Robinson MR, Trzaskowski M, Zhu Z-X, Winkler TW, et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet. 2016;25(1):137–46. pmid:27552965
  14. 14. McMahan HB, et al. Communication-efficient learning of deep networks from decentralized data. arXiv. 2017.
  15. 15. Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–5. pmid:25642630
  16. 16. Zhang Q-X, Jayasinghe D, Zhang Z, Lee SH, Xu H-M, Chen G-B. Precise estimation of in-depth relatedness in biobank-scale datasets using deepKin. Cell Rep Methods. 2025;5(6):101053. pmid:40436020
  17. 17. Huang X, Zhu T-N, Liu Y-C, Qi G-A, Zhang J-N, Chen G-B. Efficient estimation for large-scale linkage disequilibrium patterns of the human genome. Elife. 2023;12:RP90636. pmid:38149842
  18. 18. Zhou X. A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann Appl Stat. 2017;11(4):2027–51. pmid:29515717
  19. 19. Hutchinson MF. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat Comput. 1989;18:1059–76.
  20. 20. Girard A. A fast ?Monte-Carlo cross-validation? procedure for large least squares problems with noisy data. Numer Math. 1989;56(1):1–23.
  21. 21. Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sunderland, MA, USA: Sinauer Associates, Inc; 1998.
  22. 22. Isserlis L. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika. 1918;12(1/2):134.
  23. 23. Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136(2):245–57. pmid:18704696
  24. 24. Liberty E, Zucker SW. The mailman algorithm: A note on matrix-vector multiplication. Inf Process Lett. 2009;109:179–82.
  25. 25. Agrawal A, Chiu AM, Le M, Halperin E, Sankararaman S. Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet. 2020;16(5):e1008773. pmid:32469896
  26. 26. Visscher PM, Hemani G, Vinkhuyzen AAE, Chen G-B, Lee SH, Wray NR, et al. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genet. 2014;10(4):e1004269. pmid:24721987
  27. 27. Xu T, Qi G-A, Zhu J, Xu H-M, Chen G-B. Subsampling Technique to Estimate Variance Component for UK-Biobank Traits. Front Genet. 2021;12:612045. pmid:33747041
  28. 28. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. pmid:21167468
  29. 29. Loh P-R, Bhatia G, Gusev A, Finucane HK, Bulik-Sullivan BK, Pollack SJ, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat Genet. 2015;47(12):1385–92. pmid:26523775
  30. 30. Horn RA, Johnson CR. Matrix Analysis. 2 ed. New York: Cambridge University Press. 1994.
  31. 31. Elhussein A, Baymuradov U, NYGC ALS Consortium, Elhadad N, Natarajan K, Gürsoy G. A framework for sharing of clinical and genetic data for precision medicine applications. Nat Med. 2024;30(12):3578–89. pmid:39227443
  32. 32. Wang S, Kim M, Li W, Jiang X, Chen H, Harmanci A. Privacy-aware estimation of relatedness in admixed populations. Brief Bioinform. 2022;23(6):bbac473. pmid:36384083
  33. 33. Zhu Z, Bakshi A, Vinkhuyzen AAE, Hemani G, Lee SH, Nolte IM, et al. Dominance genetic variation contributes little to the missing heritability for human complex traits. Am J Hum Genet. 2015;96(3):377–85. pmid:25683123
  34. 34. Vitezica ZG, Legarra A, Toro MA, Varona L. Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics. 2017;206(3):1297–307. pmid:28522540

AltStyle によって変換されたページ (->オリジナル) /