The Brazil Bolsa Família program is a conditional cash transfer program aimed at reducing short-term poverty through direct cash transfers and at fighting long-term poverty by increasing human capital among low-income Brazilian households. Eligibility for Bolsa Família benefits depends on a cutoff rule, which classifies it as a regression discontinuity (RD) design. Following Li, Mattei and Mealli (2015) and Branson and Mealli (2019), we formally describe the Bolsa Família RD design as a local regular design (Imbens and Rubin (2015)) within the potential outcome approach. Under this framework, causal effects can be identified and estimated on an unknown but well-defined subpopulation where the following RD assumptions hold: a local overlap assumption, a local SUTVA, and a local ignorability (unconfoundedness) assumption. The potential advantages of this probabilistic framework over local regression methods based on continuity assumptions concern the causal estimands that can be targeted, the design and the analysis, as well as the interpretation and generalizability of the results. To identify the subpopulation for which we can draw valid causal inference, we propose a Bayesian model-based finite mixture approach to clustering that probabilistically classifies observations into the subpopulations where the RD assumptions do and do not hold on the basis of the observed data. This approach: (a) accounts for the uncertainty in subpopulation membership, which is typically neglected; (b) does not impose any constraint on the shape of the subpopulation; (c) allows targeting causal estimands other than average treatment effects (ATEs); and (d) is robust to a certain degree of manipulation/selection of the forcing variable. We apply our proposed approach to assess causal effects of the Bolsa Família program on leprosy incidence in 2009 for Brazilian households that registered in the Brazilian National Registry for Social Programs for the first time in 2007–2008. We find evidence that being eligible for the program reduces the risk of leprosy.
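As a toy illustration of the probabilistic classification step, the sketch below fits a two-component Gaussian mixture to a simulated forcing variable by EM and reads off posterior membership probabilities for the component closest to the eligibility cutoff; the Gaussian form, the variable names, and the absence of covariates are simplifying assumptions, not the authors' full Bayesian model.

```python
# Minimal sketch (not the authors' implementation): classify units into a
# "local" subpopulation near the eligibility cutoff versus the rest, using a
# two-component Gaussian mixture on the forcing variable fitted by EM.
import numpy as np

rng = np.random.default_rng(0)
forcing = np.concatenate([rng.normal(0.0, 0.5, 300),   # units near the cutoff
                          rng.normal(3.0, 1.0, 700)])  # units far from it
cutoff = 0.0

# EM for a two-component univariate Gaussian mixture.
w, mu, sd = np.array([0.5, 0.5]), np.array([0.0, 2.0]), np.array([1.0, 1.0])
for _ in range(200):
    dens = np.stack([w[k] / (sd[k] * np.sqrt(2 * np.pi)) *
                     np.exp(-0.5 * ((forcing - mu[k]) / sd[k]) ** 2)
                     for k in range(2)])
    resp = dens / dens.sum(axis=0)            # P(component k | forcing value)
    w = resp.mean(axis=1)
    mu = (resp * forcing).sum(axis=1) / resp.sum(axis=1)
    sd = np.sqrt((resp * (forcing - mu[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1))

# Probabilistic subpopulation membership: uncertainty is carried, not discarded.
near_cutoff_prob = resp[np.argmin(np.abs(mu - cutoff))]
print(near_cutoff_prob[:5])
```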
To investigate causal impacts, many researchers use controlled pre-post designs that compare over-time differences between a population exposed to a policy change and an unexposed comparison group. However, researchers using these designs often disagree about the "correct" specification of the causal model, perhaps most notably in analyses to identify the effects of gun policies on crime. To help settle these model specification debates, we propose a general identification framework that unifies a variety of models researchers use in practice. In this framework, which nests "brand name" designs like difference-in-differences as special cases, we use models to predict untreated outcomes and then correct the treated group’s predictions using the comparison group’s observable prediction errors. Our point identifying assumption is that treated and comparison groups would have equal prediction errors (in expectation) under no treatment. To choose among candidate models, we propose a data-driven procedure based on models’ robustness to violations of this point identifying assumption. Our selection procedure averages over candidate models, weighting by each model’s posterior probability of being the most robust, given its differential average prediction errors in the pre-period. This approach offers a way out of debates over the "correct" model by choosing based on robustness instead and has the desirable property of being feasible in the "locked box" of preintervention data only. We apply our methodology to the gun policy debate, focusing specifically on Missouri’s 2007 repeal of its permit-to-purchase law, and provide a software package () for implementation.
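The core correction step can be illustrated in a few lines: fit any candidate model to each group's pre-period data, predict untreated post-period outcomes, and difference the two groups' prediction errors. The linear-trend model, the simulated crime-rate numbers, and the variable names below are illustrative assumptions, not the authors' estimator or package.

```python
# Minimal numerical sketch (illustrative, not the paper's estimator or package):
# predict untreated post-period outcomes from a pre-period model, then correct
# the treated group's prediction error by the comparison group's prediction error.
import numpy as np

rng = np.random.default_rng(1)
t_pre, t_post = np.arange(2000, 2007), np.arange(2007, 2012)

# Simulated outcomes with a common trend; a hypothetical -3.0 effect starts in
# 2007 for the treated group. All names and numbers are made up.
trend = lambda t: 50 + 0.8 * (t - 2000)
y_treat = np.concatenate([trend(t_pre), trend(t_post) - 3.0]) + rng.normal(0, 0.5, 12)
y_comp = np.concatenate([trend(t_pre), trend(t_post)]) + rng.normal(0, 0.5, 12)

# Candidate prediction model: linear time trend fit to each group's pre-period.
def predict_post(y_pre, t_pre, t_post):
    b1, b0 = np.polyfit(t_pre, y_pre, 1)
    return b0 + b1 * t_post

err_treat = y_treat[7:] - predict_post(y_treat[:7], t_pre, t_post)
err_comp = y_comp[7:] - predict_post(y_comp[:7], t_pre, t_post)

# Point identification: equal expected prediction errors under no treatment,
# so the differential prediction error estimates the effect on the treated.
print("estimated effect per post year:", err_treat - err_comp)
```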
Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health, where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. We use nonparametric Bayesian methodology that is flexible enough to capture a wide range of data generating processes. Additionally, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures; therefore, we introduce novel estimands that summarize the nature and extent of the heterogeneity and propose estimation procedures for them. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that the approach works well in simulated examples, both with and without heterogeneity. Our approach is motivated by a study of the health effects of simultaneous exposure to the components of PM2.5, where we find that the negative health effects of exposure to environmental pollutants are exacerbated by low socioeconomic status, race, and age.
Contrastive dimension reduction methods have been developed for case-control study data to identify variation that is enriched in the foreground (case) data X relative to the background (control) data Y. Here we develop contrastive regression for the setting where there is a response variable r associated with each foreground observation. This situation occurs frequently when, for example, the unaffected controls do not have a disease grade or intervention dosage but the affected cases do, as in autism severity, solid tumor stages, polyp sizes, or warfarin dosages. Our contrastive regression model captures shared low-dimensional variation between the predictors in the case and control groups and then explains the case-specific response variables through the variance that remains in the predictors after the shared variation is removed. We show that, in one single-cell RNA sequencing dataset on cellular differentiation in chronic rhinosinusitis with and without nasal polyps and in another single-nucleus RNA sequencing dataset on autism severity in postmortem brain samples from donors with and without autism, our contrastive linear regression performs feature ranking and identifies biologically informative predictors associated with response that cannot be identified using other approaches.
Surrogate selection is an experimental design that, without sequencing any DNA, can restrict a sample of cells to those carrying certain genomic mutations. In immunological disease studies, this design may provide a relatively easy way to enrich a lymphocyte sample with cells relevant to the disease response, because the emergence of neutral mutations is associated with the proliferation history of clonal subpopulations. A statistical analysis of clonotype sizes provides a structured, quantitative perspective on this useful property of surrogate selection. Our model specification couples within-clonotype birth-death processes with an exchangeable model across clonotypes. Beyond enrichment questions about the surrogate selection design, our framework enables a study of the sampling properties of elementary sample diversity statistics; it also points to new statistics that may usefully measure the burden of somatic genomic alterations associated with clonal expansion. We examine statistical properties of immunological samples governed by the coupled model specification, and we illustrate calculations in surrogate selection studies of melanoma and in single-cell genomic studies of T cell repertoires.
Observational zero-inflated count data arise in a wide range of areas such as genomics. One of the common research questions is to identify causal relationships by learning the structure of a sparse directed acyclic graph (DAG). While structure learning of DAGs has been an active research area, existing methods do not adequately account for excessive zeros and, therefore, are not suitable for modeling zero-inflated count data. Moreover, it is often interesting to study differences in the causal networks for data collected from two experimental groups (control vs. treatment). To explicitly account for zero-inflation and identify differential causal networks, we propose a novel Bayesian differential zero-inflated negative binomial DAG (DAG0) model. We prove that the causal relationships under the proposed DAG0 are fully identifiable from purely observational, cross-sectional data, using a general proof technique that is applicable beyond the proposed model. Bayesian inference based on parallel-tempered Markov chain Monte Carlo is developed to efficiently explore the multimodal posterior landscape. We demonstrate the utility of the proposed DAG0 by comparing it with state-of-the-art alternative methods through extensive simulations. An application to a single-cell RNA-sequencing dataset, generated under two experimental groups, yields results that appear consistent with existing knowledge. A user-friendly R package that implements DAG0 is available at https://github.com/junsoukchoi/BayesDAG0.git.
Crowdfunding is a powerful tool for individuals or organizations seeking financial support from a vast audience. Despite widespread adoption, managers often lack information about the dynamics of their platforms. Hawkes processes have been used to represent self-exciting phenomena in a wide variety of empirical fields but have not been applied to crowdfunding platforms in a way that could help managers understand user behaviors. In this paper we extend the Hawkes process to capture important features of crowdfunding contributions and apply the model to analyze data from two donation-based platforms. For each user-item pair, the continuous-time conditional intensity is modeled as the superposition of a self-exciting baseline rate and a mutual excitation by preferential attachment, both depending on prior user engagement and attenuated by a power-law decay of user interest. The model is thus structured around two time-varying features: contribution count and item popularity. We estimate the parameters that govern the dynamics of contributions from 2000 items and 164,000 users over several years. We identify a bottleneck in the user contribution pipeline, measure the force of item popularity, and characterize the decline in user interest over time. A contagion effect is introduced to assess the impact of item popularity on contribution rates. This mechanistic model lays the groundwork for enhanced crowdfunding platform monitoring based on the evaluation of counterfactual scenarios and the formulation of dynamics-aware recommendations.
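A minimal sketch of such a conditional intensity is shown below: one user-item pair whose rate combines a baseline, self-excitation from the user's own past contributions, and a popularity term from all contributions to the item, each discounted by a power-law decay. The functional form and parameter names are generic placeholders, not the fitted model from the paper.

```python
# Minimal sketch of a Hawkes-style conditional intensity for one user-item pair,
# assuming (illustratively) a baseline, self-excitation from the user's own past
# contributions, a preferential-attachment term driven by item popularity, and a
# power-law decay of interest. Parameter names label generic roles only.
import numpy as np

def intensity(t, user_times, item_times, mu=0.05, alpha=0.3, beta=0.1, gamma=1.2):
    """Conditional intensity lambda(t) for a user-item pair.

    user_times: past contribution times of this user to this item
    item_times: past contribution times of all users to this item (popularity)
    """
    user_times = np.asarray(user_times, float)
    item_times = np.asarray(item_times, float)
    past_user = user_times[user_times < t]
    past_item = item_times[item_times < t]
    self_excite = alpha * np.sum((t - past_user + 1.0) ** (-gamma))
    popularity = beta * np.sum((t - past_item + 1.0) ** (-gamma))
    return mu + self_excite + popularity

print(intensity(10.0, user_times=[1.0, 4.0], item_times=[0.5, 2.0, 3.0, 9.0]))
```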
Insurance losses due to flooding can be estimated by simulating and then summing losses over a large number of locations and a large set of hypothetical years of flood events. Replicated realisations lead to Monte Carlo return-level estimates and associated uncertainty. The procedure, however, is highly computationally intensive. We develop and use a new Bennett-like concentration inequality to provide conservative but relatively accurate estimates of return levels. Bennett’s inequality accounts for the different variances of each of the variables in a sum but uses a uniform upper bound on their support. Motivated by the variability in the total insured value of risks within a portfolio, we incorporate both individual upper bounds and variances and obtain tractable concentration bounds. Simulation studies and application to a representative portfolio demonstrate a substantial tightening compared with Bennett’s bound. We then develop an importance-sampling procedure that repeatedly samples annual losses from the distributions implied by each year’s concentration inequality, leading to conservative estimates of the return levels and their uncertainty using orders of magnitude less computation. This enables a simulation study of the sensitivity of the predictions to perturbations in quantities that are usually assumed fixed and known but, in truth, are not.
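For reference, the classical Bennett bound that the paper refines can be evaluated directly; the sketch below computes the conservative exceedance level it implies for a portfolio's annual total loss under an assumed uniform cap and assumed per-location variances. All numbers are illustrative, and this is the textbook inequality, not the authors' per-risk refinement.

```python
# Minimal sketch of the classical Bennett tail bound for a sum of independent,
# bounded losses, and the conservative return level it implies.
import numpy as np
from scipy.optimize import brentq

def bennett_tail(t, var_sum, a):
    """P(S - E[S] >= t) <= exp(-(var/a^2) * h(a*t/var)), h(u) = (1+u)log(1+u) - u."""
    u = a * t / var_sum
    h = (1 + u) * np.log1p(u) - u
    return np.exp(-var_sum / a ** 2 * h)

# Hypothetical portfolio of per-location annual losses: means, variances, uniform cap.
means = np.full(10_000, 1_000.0)
variances = np.full(10_000, 250_000.0)
a = 50_000.0                      # uniform upper bound on each centred loss

var_sum = variances.sum()
p = 1 / 200                       # 200-year return level of the annual total

# Conservative return level: smallest exceedance t whose Bennett bound equals p.
t_star = brentq(lambda t: bennett_tail(t, var_sum, a) - p, 1e-6, 1e9)
print("conservative 200-year total loss:", means.sum() + t_star)
```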
Forecasting and controlling PM2.5 emissions is crucial for environmental protection and public health. To analyze regional and seasonal effects in PM2.5 emissions using the Beijing multisite air quality dataset, which comprises large-scale distributed cluster/longitudinal data and high-dimensional covariates, we develop a unified cluster subsampling method for generalized linear models (GLMs) to downsize the data volume and reduce the computational burden. To incorporate the within-subject correlation, a weighted generalized estimating equation under an informative working correlation structure is considered, and a novel optimal subsampling criterion covering both A- and L-optimality is proposed. For low-dimensional GLMs, the resulting optimal subsample estimators are consistent and asymptotically normal, with explicitly derived asymptotic covariance matrices. For the preconceived low-dimensional parameter in high-dimensional GLMs, a quasi decorrelated score function is developed to mitigate the effect of nuisance parameter estimation. Our proposed method is evaluated by simulation. By applying our method to the Beijing multisite air quality dataset, we reveal that PM2.5 emissions in the southern part of Beijing have a U-shaped seasonal effect, in the order winter, spring, summer, and autumn, and a regional aggregation effect in southeastern Beijing in winter.
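The generic flavour of optimality-based subsampling can be sketched quickly for an ordinary logistic regression: a uniform pilot fit gives rough residuals, which define non-uniform subsampling probabilities, and a weighted fit on the informative subsample approximates the full-data estimator. This is the standard OSMAC-style recipe on independent data, offered only as orientation; it omits the paper's clustered/longitudinal weighting and decorrelated score.

```python
# Minimal sketch of optimality-based subsampling for a plain logistic regression
# (a generic recipe, not the paper's cluster/GEE criterion): probabilities
# proportional to |y_i - p_i| * ||x_i|| from a small pilot fit.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d = 100_000, 5
X = rng.normal(size=(n, d))
beta_true = np.array([0.5, -1.0, 0.3, 0.0, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

# Step 1: uniform pilot subsample to get a rough estimate of beta.
pilot = rng.choice(n, 1_000, replace=False)
pilot_fit = LogisticRegression(C=1e6, fit_intercept=False).fit(X[pilot], y[pilot])
beta_pilot = pilot_fit.coef_.ravel()

# Step 2: subsampling probabilities from pilot residuals and covariate norms.
p_hat = 1 / (1 + np.exp(-X @ beta_pilot))
scores = np.abs(y - p_hat) * np.linalg.norm(X, axis=1)
probs = scores / scores.sum()

# Step 3: draw the informative subsample and fit a weighted estimator on it.
idx = rng.choice(n, 5_000, replace=True, p=probs)
model = LogisticRegression(C=1e6, fit_intercept=False)
model.fit(X[idx], y[idx], sample_weight=1 / probs[idx])
print(model.coef_.ravel())
```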
Sets of trajectories that begin or end at the same point, in the form of a bouquet, appear in several real-world problems, such as the dispersion of volcanic ash or forecasts of hurricane paths. Our interest in this type of trajectory focuses on studying the biogeography of airborne microorganisms and their ability to colonise soils recently exposed due to climate change. For these functional data, we introduce a new integrated depth measure () that allows central and outlying curves in a dataset to be identified. First, circular local depths () are calculated on concentric circles around the common point, and, in a second step, these values are integrated along the curves, yielding the trajectory depth. Under mild conditions, both depth measures have good properties and are strongly consistent. In addition, we propose an efficient algorithm for working with large datasets. Finally, we apply this new technique to find the main routes followed by air masses carrying microorganisms to Byers Peninsula (Livingston Island, Antarctica).
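The two-step recipe (local depth on each concentric circle, then integration along the curve) can be mimicked with a simple kernel-based local depth, as in the sketch below for simulated bouquet trajectories; the kernel score stands in for the circular local depth of the paper and is an illustrative assumption.

```python
# Minimal sketch (not the authors' definition): score how central each bouquet
# trajectory is at each of K concentric circles around the common origin, then
# average the scores along the curve to obtain an integrated trajectory depth.
import numpy as np

rng = np.random.default_rng(3)
n_curves, n_radii = 50, 20
# Trajectories as 2-D positions at increasing radii from the common origin.
angles = rng.normal(0.0, 0.4, size=(n_curves, 1)) + rng.normal(0, 0.05, (n_curves, n_radii))
radii = np.linspace(0.1, 1.0, n_radii)
xy = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=-1)  # (n, K, 2)

def integrated_depth(xy):
    n, K, _ = xy.shape
    local = np.empty((n, K))
    for k in range(K):
        d = np.linalg.norm(xy[:, None, k, :] - xy[None, :, k, :], axis=-1)
        h = np.median(d[d > 0])
        local[:, k] = np.exp(-(d / h) ** 2).mean(axis=1)   # kernel-based local depth
    return local.mean(axis=1)                              # integrate along the curve

depth = integrated_depth(xy)
print("most central curve:", depth.argmax(), " most outlying:", depth.argmin())
```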
A statistical framework we call CQUESST (Carbon Quantification and Uncertainty from Evolutionary Soil STochastics), which models carbon sequestration and cycling in soils, is applied to a long-running agricultural experiment that controls for crop type, tillage, and season. The experiment, known as the Millennium Tillage Trial (MTT), ran on 42 field-plots for 10 years, from 2000 to 2010; here CQUESST is used to model soil carbon dynamically in six pools, in each of the 42 agricultural plots, on a monthly time step for a decade. We show how CQUESST can be used to estimate soil-carbon cycling rates under different treatments. Our methods provide much-needed statistical tools for quantitatively inferring the effectiveness of different experimental treatments on soil-carbon sequestration. The decade-long data are of multiple observation types, and these interacting time series are ingested into a fully Bayesian model that has a dynamic stochastic model of multiple pools of soil carbon at its core. CQUESST’s stochastic model is motivated by the deterministic RothC soil-carbon model based on nonlinear difference equations. We demonstrate how CQUESST can estimate soil-carbon fluxes for different experimental treatments while acknowledging uncertainties in soil-carbon dynamics, in physical parameters, and in observations. An important outcome of our modeling is the inference of cropping-specific decay rates, with evidence suggesting that soil-carbon decay rates vary as a function of land-management practices. CQUESST is implemented efficiently in a probabilistic programming language, exploiting its parallelization capabilities, and it scales well to large numbers of field-plots using software libraries that allow computation to be shared over multiple nodes of high-performance computing clusters.
In the information age, it has become increasingly common for data containing records about overlapping individuals to be distributed across multiple sources, making it necessary to identify which records refer to the same individual. The goal of record linkage is to estimate this unknown structure in the absence of a unique identifying attribute. We introduce a Bayesian hierarchical record linkage model for spatial location data, motivated by the estimation of individual-specific growth-size curves for conifer species using data derived from overlapping LiDAR scans. Annual tree growth estimates depend on correctly identifying unique individuals across scans in the presence of noise. We formalize a two-stage modeling framework connecting the record linkage model and a flexible downstream individual tree growth model that provides robust uncertainty quantification and propagation through both stages of the modeling pipeline via an extension of the linkage-averaging approach of Sadinle (Ann. Appl. Stat. 12 (2018) 1013–1038). In this paper we discuss the two-stage model formulation, outline the computational strategies required to achieve scalability, assess the model performance on simulated data, and fit the model to a bi-temporal dataset derived from LiDAR scans of the Upper Gunnison Watershed, provided by the Rocky Mountain Biological Laboratory, in order to evaluate the impact of key topographic covariates on the growth behavior of conifer species in the Southern Rocky Mountains (USA).
We propose a novel nonparametric Bayesian approach for meta-analysis with event time outcomes. The model is an extension of linear dependent tail-free processes. The extension includes a modification to facilitate (conditionally) conjugate posterior updating and a hierarchical extension with a random partition of studies. The partition is formalized as a Dirichlet process mixture. The model development is motivated by a meta-analysis of cancer immunotherapy studies. The aim is to validate the use of relevant biomarkers in the design of immunotherapy studies. The hypothesis is about immunotherapy in general, rather than about a specific tumor type, therapy and marker. This broad hypothesis leads to a very diverse set of studies being included in the analysis and gives rise to substantial heterogeneity across studies.
Approaches for estimating genetic effects at the individual level often focus on analyzing phenotypes at a single time point, with less attention given to longitudinal phenotypes. This paper introduces a mixed modeling approach that includes both genetic and individual-specific random effects and is designed to estimate genetic effects on both the baseline and slope for a longitudinal trajectory. The inclusion of genetic effects on both baseline and slope, combined with the crossed structure of genetic and individual-specific random effects, creates complex dependencies across repeated measurements for all subjects. These complexities necessitate the development of novel estimation procedures for parameter estimation and individual-specific predictions of genetic effects on both baseline and slope.
We employ an Average Information Restricted Maximum Likelihood (AI-ReML) algorithm to estimate the variance components corresponding to genetic and individual-specific effects for the baseline levels and rates of change of a longitudinal phenotype. The algorithm is used to characterize the prostate-specific antigen (PSA) trajectories of participants who remained prostate cancer-free in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Understanding genetic and individual-specific variation in this population will provide insights for determining the role of genetics in cancer screening. Our results reveal significant genetic contributions to both the initial PSA levels and their progression over time, highlighting the role of these genetic factors in the variability of PSA across unaffected individuals. We show how genetic factors can be used to identify individuals prone to large baseline PSA values and steeply increasing PSA trajectories among individuals who are prostate cancer-free. In turn, we can identify groups of individuals who have a high probability of falsely screening positive for prostate cancer under well-established cutoffs for early detection based on the level and rate of change of this biomarker. The results demonstrate the importance of incorporating genetic factors when monitoring PSA for more accurate prostate cancer detection.
When a promising subgroup is identified from an unsuccessful trial with a broad target population, we often need to evaluate, and possibly confirm, the selected subgroup with a follow-up study, typically a validation trial on the subgroup. In this paper we focus on the panitumumab study and ask how to utilize data from both trials to improve the efficiency of subgroup evaluation without incurring selection bias. We propose a new resampling-based approach to quantify and remove selection bias and then to combine data from both trials for valid and efficient inference on the subgroup effect. The proposed method is model-free and asymptotically sharp. We apply the proposed method to analyze the panitumumab trial and show how much data combination can help improve the analysis of clinical trials when a promising subgroup is identified from part of the data, accelerating the delivery of new treatments to the patients in need.
With the growing prevalence of diabetes and the associated public health burden, it is crucial to identify modifiable factors that could improve patients’ glycemic control. In this work we examine associations between medication usage, concurrent comorbidities, and glycemic control, utilizing data from continuous glucose monitors (CGMs). CGMs provide high-frequency interstitial glucose measurements, but reducing the data to simple statistical summaries is common in clinical studies, resulting in substantial information loss. Recent advancements in the Fréchet regression framework make it possible to utilize more of this information by treating the full distributional representation of CGM data as the response, while sparsity regularization enables variable selection. However, the methodology does not scale to large datasets. Crucially, rigorous inference is not possible because the asymptotic behavior of the underlying estimates is unknown, while the application of resampling-based inference methods is computationally infeasible. We develop a new algorithm for sparse distributional regression by deriving a new explicit characterization of the gradient and Hessian of the underlying objective function, while also utilizing rotations on the sphere to perform feasible updates. The updated method is up to more than 10,000-fold faster than the original approach, opening the door to applying sparse distributional regression to large-scale datasets and enabling previously unattainable resampling-based inference. We combine our algorithm with stability selection to perform variable selection inference on CGM data from patients with type 2 diabetes and obstructive sleep apnea. We find a significant association between sulfonylurea medication and glucose variability, without evidence of an association with mean glucose. We also find that overnight oxygen desaturation variability has a stronger association with glucose regulation than overall oxygen desaturation levels.
Dementia is globally one of the leading causes of death and the primary cause of dependency and disability in senior citizens. Life expectancy with dementia, defined as the average remaining lifespan for cases with dementia, is a key epidemiological concept in geriatrics. In spite of its significance for medical research and policy-making, this measure has not been studied for people with dementia in the Canadian population. We employ data from the Canadian Study of Health and Aging, a nationwide cross-sectional study on geriatrics with follow-up for survival, to study life expectancy among elderly Canadians with dementia. Although practically more feasible, survival data collected under such a sampling mechanism suffer from two forms of bias: selection bias due to left truncation, also known as survivor bias, and bias owing to loss to follow-up. While the latter is often inevitable in longitudinal studies, when the subjects under study may drop out before the terminating event occurs, the former is a structural cross-sectional sampling bias that arises because long-term survivors are favoured by such a sampling mechanism. To the best of our knowledge, life expectancy and margins of error under these two types of bias have not hitherto been studied in the literature. Taking these complexities into account, we study the nonparametric maximum likelihood estimator of age-specific life expectancy and its uniform margins of error. Based on this estimator, we devise the first two-sample method for constructing uniform margins of error for the difference in life expectancy between two groups of patients, which is then applied to scrutinise the effects of various covariates. Our methodology enjoys robustness and high efficiency while avoiding restrictive constraints. Simulation studies are conducted to validate the performance of the proposed procedures. Our analysis provides novel information on the progression of the disease in Canada, revealing the pronounced effects of sex and type of dementia on life expectancy. A comprehensive body of theoretical results, essential for paving the way for methodological development and beyond, is documented in the Supplementary Material.
A time-to-event analysis is advocated for examining associations between time-varying environmental exposures and preterm birth in cohort studies. While the identification of preterm birth depends entirely on gestational age, the true gestational age is rarely known in practice. The obstetric estimate (OE) and gestational age based on the date of last menstrual period (LMP) are two commonly used measurements, but both suffer from various sources of error. Uncertainties in gestational age result in both outcome misclassification and measurement error in time-varying exposures, which can introduce serious bias in health effect estimates. Motivated by the lack of validation data in large population-based studies, we develop a hierarchical Bayesian model that utilizes the two error-prone gestational age estimates to examine the effects of time-varying exposures on the risk of preterm birth while accounting for uncertainties in the estimates. The proposed approach introduces two discrete-time hazard models for the latent true gestational ages that are preterm (<37 weeks) or term (≥37 weeks). Two multinomial models are then adopted to characterize the misclassifications resulting from using OE-based and LMP-based gestational age. The proposed modeling framework permits the joint estimation of preterm birth risk factors and of parameters characterizing gestational age misclassification without validation data. We apply the proposed method to a birth cohort based on birth records from Kansas in 2010. Our analysis finds robust positive associations between exposure to ozone during the third trimester of pregnancy and preterm birth, even after accounting for gestational age uncertainty.
Atherosclerosis is a chronic, multifaceted disease that affects multiple arterial systems. Its progression is primarily driven by low-density lipoprotein (LDL) cholesterol accumulation, which promotes localized arterial lesion formation. These lesions can lead to severe complications, including ischemic heart disease (IHD) and stroke. Both genetic factors, particularly single nucleotide polymorphisms (SNPs), and age-related changes in body composition significantly influence LDL levels, generating extensive ultrahigh-dimensional covariates from functional and scalar mixtures (UDFSM), which may be stored at different sites due to the massive amount of data and the different data representations. To analyze the impact of genetic and physiological variables on LDL levels, we first separately extract features from ultrahigh-dimensional functional and scalar covariates in an unsupervised manner. Then we propose a novel regression model that incorporates these features, which may be correlated due to the underlying correlations in the ultrahigh-dimensional covariates comprising both functional and scalar mixtures. Our methodology employs a factor regression model with an additive multiple-index component to sufficiently and effectively capture latent feature-response variable relationships. We enhance model interpretability and account for covariate correlations by imposing column sparsity and low-rank structures on the regression coefficients matrix, thereby incorporating structural information to improve efficiency and robustness. This distribution-agnostic approach to the response variable ensures greater flexibility and versatility. For model fitting we develop a sieve likelihood-based framework that leverages the problem’s inherent structure to provide efficient and robust estimates. We apply our method to the Avon Longitudinal Study of Parents and Children (ALSPAC) dataset, achieving high prediction accuracy for LDL levels and identifying significant SNPs and anthropometric measures affecting LDL. We specifically examine how various anthropometric measures influence LDL levels over ages. We further extend our analysis to identify key parental and individual characteristics that influence adult body mass index (BMI).
Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies that identify clusters of sepsis patients corresponding to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. Clinicians therefore rely on clusters having a concrete medical interpretation, usually in the form of clinically meaningful regions of the sample space with direct implications for practitioners. In this article we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.
The diagnosis and treatment of cancer can evoke a variety of adverse emotions. Online health communities (OHCs) provide a safe platform for cancer patients and those close to them to express emotions without fear of judgement or stigma. In the literature, linguistic analysis of OHCs is usually limited to a single disease and based on methods with various technical limitations. In this article we analyze posts from September 2003 to September 2022 on eight cancers that are publicly available at the American Cancer Society’s Cancer Survivors Network (CSN). We propose a novel network analysis technique based on low-rank matrices. The proposed approach decomposes the emotional expression semantic networks into an across-cancer time-independent component (which describes the "baseline" shared by multiple cancers), a cancer-specific time-independent component (which describes cancer-specific properties), and an across-cancer time-dependent component (which accommodates temporal effects on multiple cancer communities). For the second and third components, respectively, we consider a novel clustering structure and a change-point structure. A penalization approach is developed, and its theoretical and computational properties are carefully established. The analysis of the CSN data leads to sensible networks and deeper insights into emotions for cancer overall and for specific cancer types.
Verbal autopsies (VAs) are extensively used to investigate the population-level distributions of deaths by cause in low-resource settings without well-organized vital statistics systems. Computer-based methods are often adopted to assign causes of death to deceased individuals based on the interview responses of their family members or caregivers. In this article we develop a new Bayesian approach that extracts information about cause-of-death distributions from VA data considering the age- and sex-related variation in the associations between symptoms. Its performance is compared with that of existing approaches using gold-standard data from the Population Health Metrics Research Consortium. In addition, we compute the relevance of predictors to causes of death based on information-theoretic measures.
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is common in cancer genomics where the molecular information is usually accompanied by cancer-specific clinical information. Existing grouped clustering methods only consider the shared variables, thereby ignoring valuable information from the cancer-specific variables. To allow for these cancer-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model, which leads to an efficient posterior inference algorithm. We illustrate our model with extensive simulations and a real pan-gastrointestinal cancer dataset. The cancer-specific clinical variables included carcinoembryonic antigen level, patients’ body mass index, and the number of cigarettes smoked per day. These important clinical variables refine the clusters of gene expression data and allow us to identify finer subclusters, which is not possible in their absence. This refinement aids in the better understanding of tumor progression and heterogeneity. Moreover, our proposed method is applicable beyond the field of cancer genomics to a general grouped clustering framework in the presence of group-specific idiosyncratic variables.
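For readers unfamiliar with the representation the model builds on, the sketch below generates truncated stick-breaking weights for a standard Dirichlet process; the GLocal extension (not shown) modifies this construction to share atoms globally while allowing group-specific, "local" behavior.

```python
# Minimal sketch of the standard stick-breaking construction of a Dirichlet
# process, the representation the GLocal Dirichlet process builds on (this is
# the textbook DP, not the proposed global-local extension).
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking: w_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(4)
weights = stick_breaking_weights(alpha=2.0, n_atoms=25, rng=rng)
atoms = rng.normal(0.0, 3.0, size=25)          # atoms drawn from a base measure G0
print(weights.sum(), (weights * atoms).sum())  # weights nearly sum to 1 when truncated
```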
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet fitting these models to partially observed data can be a difficult task due to intractability of the marginal likelihood, even for classic Markovian models. To remedy this issue, this article introduces a novel data-augmented Markov chain Monte Carlo sampler for exact Bayesian inference under the stochastic susceptible-infectious-removed model, given only discretely observed counts of infections. In a Metropolis–Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the target process and from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently and easily scales to outbreaks with thousands of infections. We prove that our sampler is uniformly ergodic and find empirically that it mixes much faster than existing single-site samplers. We apply the algorithm to fit a semi-Markov susceptible-infectious-removed model to the 2013–2015 outbreak of Ebola Haemorrhagic Fever in Guéckédou, Guinea.
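The forward model targeted by the sampler can be simulated directly; the sketch below runs a Gillespie simulation of a Markovian stochastic SIR epidemic and tabulates discretely observed (here weekly) infection counts, the kind of data the paper assumes. It shows the data-generating process only, not the data-augmented MCMC.

```python
# Minimal sketch of the Markovian stochastic SIR model: a Gillespie simulation
# producing event times, from which discretely observed infection counts
# (the only data assumed observed) can be tabulated.
import numpy as np

def gillespie_sir(beta, gamma, S, I, R, t_max, rng):
    t, times, infections = 0.0, [0.0], [0]
    while I > 0 and t < t_max:
        rate_inf = beta * S * I / (S + I + R)
        rate_rem = gamma * I
        t += rng.exponential(1.0 / (rate_inf + rate_rem))
        if rng.random() < rate_inf / (rate_inf + rate_rem):
            S, I = S - 1, I + 1
            times.append(t)
            infections.append(1)
        else:
            I, R = I - 1, R + 1
    return np.array(times), np.array(infections)

rng = np.random.default_rng(5)
times, inf_events = gillespie_sir(beta=0.3, gamma=0.1, S=990, I=10, R=0, t_max=100, rng=rng)
# Discretely observed data: weekly counts of new infections.
weekly = np.histogram(times[inf_events == 1], bins=np.arange(0, 101, 7))[0]
print(weekly)
```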
Quantile regression is a powerful tool in epidemiological studies, where interest lies in inferring how different exposures affect specific percentiles of the distribution of a health or life outcome. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or nonparametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Further, neither approach is well suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model, we derive optimal and interpretable linear estimates and uncertainty quantification for each model-based conditional quantile. Our approach introduces a quantile-focused squared error loss, which enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, variable selection, and inference over frequentist and Bayesian competitors. We use these tools to identify and quantify the heterogeneous impacts of multiple social stressors and environmental exposures on educational outcomes across the full spectrum of low-, medium-, and high-achieving students in North Carolina.
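One way to see the posterior-summarization idea is in the sketch below: fit any Bayesian regression (here an illustrative conjugate normal linear model with a flat prior), compute its model-based conditional quantiles at the observed covariates, and project them onto a linear function of the covariates by least squares. With this homoscedastic backbone the projection just shifts the intercept; the value of the construction shows up with richer, heteroscedastic models, and the model, loss, and names here are illustrative assumptions rather than the paper's derivation.

```python
# Minimal sketch: model-based conditional quantiles from a fitted Bayesian
# regression, summarized by a least-squares projection onto a linear function
# of the covariates (the squared-error projection a quantile-focused loss leads to).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, d = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=1.5, size=n)

# Illustrative backbone model: normal linear regression with a flat prior.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - X.shape[1]))

# Model-based conditional quantile Q_tau(y | x) under the fitted model.
tau = 0.9
q_tau = X @ beta_hat + sigma_hat * norm.ppf(tau)

# Optimal *linear* summary of that quantile surface: least-squares projection onto X.
gamma_tau = np.linalg.lstsq(X, q_tau, rcond=None)[0]
print("interpretable linear coefficients for the 0.9 quantile:", gamma_tau)
```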
It is of substantial interest to study health disparity associations with COVID-19 death rates. Although high-quality individual-level COVID-19 epidemiological data have been difficult to collect on a national scale, all United States (U.S.) counties have reported total COVID-19 death counts. A standard ecological analysis would then regress county total death counts on county-level covariates such as age, sex, and race percentages. However, such an analysis is limited by ecological bias and the ecological fallacy, whereby estimated county-level associations differ from individual-level associations. Fortunately, state-level age-, sex-, and race-specific COVID-19 death counts are also available for all U.S. states, so this information can be integrated with county-level data for more informative ecological analyses. We propose an approximate log-linear random effects model to jointly model county-level total death counts and state-level age-, sex-, and race-specific death counts. We then develop a penalized composite log-likelihood method for parameter estimation and perform simulation studies to evaluate our proposed approach. Lastly, we analyze COVID-19 death data from the entire U.S., show how incorporating state-level counts can prevent ecological bias and fallacy, and illustrate the heterogeneity in health disparity associations across different U.S. states.
To optimize mobile health interventions and advance domain knowledge on intervention design, it is critical to understand how the intervention effect varies over time and with contextual information. This study aims to assess how a push notification suggesting physical activity influences individuals’ step counts using data from the HeartSteps micro-randomized trial (MRT). The statistical challenges include the time-varying treatments and longitudinal functional step count measurements. We propose the first semiparametric causal excursion effect model with varying coefficients to model the time-varying effects within a decision point and across decision points in an MRT. The proposed model incorporates double time indices to accommodate the longitudinal functional outcome, enabling the assessment of time-varying effect moderation by contextual variables. We propose a two-stage causal effect estimator that is robust against a misspecified high-dimensional outcome regression nuisance model. We establish asymptotic theory and conduct simulation studies to validate the proposed estimator. Our analysis provides new insights into individuals’ change in response profiles (such as how soon a response occurs) due to the activity suggestions, how such changes differ by the type of suggestions received, and how such changes depend on other contextual information such as being recently sedentary and the day being a weekday.
To study the neurophysiological basis of attention deficit hyperactivity disorder (ADHD), clinicians use electroencephalography (EEG), which records neuronal electrical activity on the cortex. Instead of focusing on single-channel spectral power, a novel framework for investigating interactions (dependence) between channels in the entire network is proposed. Although dependence measures such as coherence and partial directed coherence (PDC) are well explored in studying brain connectivity, these measures capture only linear dependence. Moreover, in designed clinical experiments, these dependence measures are observed to vary across subjects, even within a homogeneous group. To address these limitations, we propose the mixed-effects functional-coefficient autoregressive (MXFAR) model, which captures between-subject variation by incorporating subject-specific random effects. The advantages of the MXFAR model are the following: (i) it captures potential nonlinear dependence between channels; (ii) it is nonparametric and hence flexible and robust to model misspecification; (iii) it can capture differences between groups when they exist; (iv) it accounts for variation across subjects; (v) the framework easily incorporates well-known inference methods from mixed-effects models; (vi) it can be generalized to accommodate various covariates and factors. We then formulate a novel nonlinear spectral measure, the functional partial directed coherence (fPDC), to extract dynamic cross-dependence patterns at different frequency oscillations. Finally, we apply the proposed MXFAR-fPDC framework to analyze multichannel EEG signals and report novel findings on altered brain functional networks in ADHD patients.
Brain connectivity characterizes interactions between different regions of a brain network during resting-state or performance of a cognitive task. In studying brain signals, such as electroencephalograms (EEG), one formal approach to investigating connectivity is through an information-theoretic causal measure called transfer entropy (TE). To enhance the functionality of TE in brain signal analysis, we propose a novel methodology that captures cross-channel information transfer in the frequency domain. Specifically, we introduce a new measure, the spectral transfer entropy (STE), to quantify the magnitude and direction of information flow from a band-specific oscillation of one channel to another band-specific oscillation of another channel. The main advantage of our proposed approach is that it formulates TE in a novel way to perform inference on band-specific oscillations while maintaining robustness to the inherent problems associated with filtering. In addition, an advantage of STE is that it allows adjustments for multiple comparisons to control false positive rates. Another novel contribution is a simple yet efficient method for estimating STE using vine copula theory. This method can produce an exact zero estimate of STE (which is the boundary point of the parameter space) without the need for bias adjustments. With the vine copula representation, a null copula model, which exhibits zero STE, is defined, thus enabling straightforward significance testing through standard resampling. Lastly, we demonstrate the advantage of the proposed STE measure through numerical experiments and provide interesting and novel findings on the analysis of EEG data in a visual-memory experiment.
There is keen interest in the field of biomechanics in identifying unique strategies for negotiating specific movement tasks after lower-limb joint injury. Finite mixture models are flexible methods that are commonly used for carrying out this type of task. A recent focus in the model-based clustering literature is to highlight the difference between the number of components in a mixture model and the number of clusters. The number of clusters is more relevant from a practical standpoint, but, to date, the focus of prior distribution formulation has been on the number of components. This can make prior elicitation on the number of clusters challenging when prior information exists, which is the case in the biomechanical study considered here. In light of this, we develop a finite mixture methodology that permits eliciting prior information directly on the number of clusters in a flexible and intuitive way. This is done by employing an asymmetric Dirichlet distribution as a prior on the weights of a finite mixture. Further, a penalized-complexity-motivated prior is employed for the Dirichlet shape parameter. We illustrate the ease with which prior information can be elicited via our construction and the flexibility of the resulting induced prior on the number of clusters. In addition to applying the method to the biomechanical data, we also demonstrate its utility using numerical experiments and the galaxies dataset.
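The link between a Dirichlet prior on the mixture weights and the induced prior on the number of occupied clusters is easy to visualize by simulation, as in the sketch below; the particular asymmetric shape vector and sample size are illustrative assumptions rather than the penalized-complexity construction of the paper.

```python
# Minimal sketch of how a prior on mixture weights induces a prior on the number
# of *clusters* (occupied components): draw weights from an asymmetric Dirichlet,
# draw component labels for n observations, and count the distinct labels.
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_components, n_sims = 100, 10, 5_000
alpha = np.array([2.0] + [0.2] * (n_components - 1))   # asymmetric Dirichlet shape

counts = np.zeros(n_components + 1)
for _ in range(n_sims):
    w = rng.dirichlet(alpha)
    labels = rng.choice(n_components, size=n_obs, p=w)
    counts[len(np.unique(labels))] += 1

induced_prior = counts / n_sims                         # Monte Carlo approximation
for k, p in enumerate(induced_prior):
    if p > 0.001:
        print(f"P(number of clusters = {k}) ≈ {p:.3f}")
```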
Unexpected failures in engineering systems lead to expensive maintenance actions and should be avoided if at all possible. This is particularly true for wind turbine systems, for which unexpected failures not only demand costly repairs but also cause long downtime. Motivated by this need, we present an accumulation method for early fault warning and failure anticipation. Our research shows that one critical element enabling early warning is to accumulate the small-magnitude symptoms resulting from gradual changes in an engineering system such as a wind turbine. Our idea is inspired by the classical cumulative sum method, or CUSUM, but we have to redesign the accumulation mechanism to tackle unique challenges in wind turbine data. The new accumulation method is applied to two real wind turbine datasets, one with gearbox failures and the other with generator failures, and demonstrates superior performance compared with CUSUM.
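For orientation, the classical one-sided CUSUM that serves as the starting point can be written in a few lines, as below: deviations of a monitored statistic above a reference value are accumulated, and an alarm is raised when the cumulative sum crosses a threshold. The drift pattern, reference value, and threshold are illustrative, and the redesigned accumulation mechanism of the paper is not reproduced here.

```python
# Minimal sketch of the classical one-sided CUSUM: small positive deviations of a
# monitored statistic from its in-control mean are accumulated, and an alarm is
# raised when the accumulation crosses a threshold.
import numpy as np

rng = np.random.default_rng(8)
# Monitored statistic: in control for 300 steps, then a small upward drift
# (a gradual fault symptom) for 200 steps.
x = np.concatenate([rng.normal(0.0, 1.0, 300),
                    rng.normal(0.0, 1.0, 200) + np.linspace(0.0, 1.5, 200)])

k, h = 0.25, 8.0        # allowance (reference value) and decision threshold
s, alarm = 0.0, None
for t, xt in enumerate(x):
    s = max(0.0, s + xt - k)      # accumulate symptoms, reset at zero
    if s > h:
        alarm = t
        break
print("alarm raised at step:", alarm)
```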
International comparisons of hierarchical time series data sets based on survey data, such as annual country-level estimates of school enrollment rates, can suffer from large amounts of missing data due to differing coverage of surveys across countries and across times. A popular approach to handling missing data in these settings is through multiple imputation, which can be especially effective when there is an auxiliary variable that is strongly predictive of and has a smaller amount of missing data than the variable of interest. However, standard methods for multiple imputation of hierarchical time series data can perform poorly when the auxiliary variable and the variable of interest have a nonlinear relationship. Performance can also suffer if the multiple imputations are used to estimate an analysis model that makes different assumptions about the data compared to the imputation model, leading to uncongeniality between analysis and imputation models. We propose a Bayesian method for multiple imputation of hierarchical nonlinear time series data that uses a sequential decomposition of the joint distribution and incorporates smoothing splines to account for nonlinear relationships between variables. We compare the proposed method with existing multiple imputation methods through a simulation study and an application to secondary school enrollment data. We find that the proposed method can lead to substantial performance increases for estimation of parameters in uncongenial analysis models and for prediction of individual missing values.
Educational assessments are valuable tools for measuring student knowledge and skills, but their validity can be compromised when test takers exhibit changes in response behavior due to factors such as time pressure. To address this issue, we introduce a novel latent factor model with change-points for item response data, designed to detect and account for individual-level shifts in response patterns during testing. This model extends traditional item response theory (IRT) by incorporating person-specific change-points, which enables simultaneous estimation of item parameters, person latent traits, and the location of behavioral changes. We evaluate the proposed model through extensive simulation studies, which demonstrate its ability to accurately recover item parameters, change-point locations, and individual ability estimates under various conditions. Our findings show that accounting for change-points significantly reduces bias in ability estimates, particularly for respondents affected by time pressure. Application of the model to two real-world educational testing datasets reveals distinct patterns of change-point occurrence between high-stakes and lower-stakes tests, providing insights into how test-taking behavior evolves during the tests. This approach offers a more nuanced understanding of test-taking dynamics, with important implications for test design, scoring, and interpretation.
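A stripped-down version of the change-point idea is sketched below: responses follow a standard 2PL IRT model up to a person-specific change-point and degrade to random guessing afterwards, and the change-point is located by maximizing the log-likelihood over candidate positions, with item parameters and ability treated as known. Both the post-change guessing mechanism and the known-parameter simplification are illustrative assumptions, not the proposed latent factor model.

```python
# Minimal sketch: a respondent answers under a 2PL IRT model up to a change-point,
# then guesses at random; scanning candidate change-points by log-likelihood
# recovers the shift (item parameters and ability treated as known for simplicity).
import numpy as np

rng = np.random.default_rng(9)
n_items = 40
a = rng.uniform(0.8, 2.0, n_items)          # discriminations
b = rng.normal(0.0, 1.0, n_items)           # difficulties
theta, true_cp = 1.0, 28                    # ability and true change-point

p_irt = 1 / (1 + np.exp(-a * (theta - b)))  # 2PL success probabilities
p = np.where(np.arange(n_items) < true_cp, p_irt, 0.5)
y = rng.binomial(1, p)

def loglik(cp, y, theta):
    p = np.where(np.arange(n_items) < cp, 1 / (1 + np.exp(-a * (theta - b))), 0.5)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

cps = np.arange(1, n_items + 1)
ll = np.array([loglik(cp, y, theta) for cp in cps])
print("estimated change-point:", cps[ll.argmax()], " (true:", true_cp, ")")
```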
Social network platforms today generate vast amounts of data, including network structures and a large number of user-defined tags, which reflect users’ interests. The dimensionality of these personalized tags can be ultrahigh, posing challenges for modeling in targeted preference analysis. Traditional categorical feature screening methods overlook the network structure, which can lead to an incorrect feature set and suboptimal prediction accuracy. This study focuses on feature screening for network-involved preference analysis based on ultrahigh-dimensional categorical tags. We introduce the concepts of self-related features and network-related features, defined as those directly related to the response and those related to the network structure, respectively. We then propose a pseudo-likelihood ratio feature screening procedure that identifies both types of features. Theoretical properties of this procedure under different scenarios are thoroughly investigated. Extensive simulations and a real data analysis on Sina Weibo validate our findings.
Addressing misreporting of participation in social programs, which is common and has increased in all major surveys, is important for studying the intergenerational effects of policies. In this paper we propose a practical estimator for a quantile regression model with endogenous one-sided misreporting. The identification of the model uses a parametric first stage and information related to participation and misreporting. We show that the estimator is consistent and asymptotically normal. We also establish that a bootstrap procedure is asymptotically valid for approximating the distribution of the estimator. Simulation studies show the small-sample behavior of the estimator in comparison with other methods. Finally, we illustrate the approach using U.S. survey data to estimate the intergenerational effect of a mother’s participation in welfare on her daughter’s adult income.