Statistics
Showing new listings for Thursday, 2 July 2026
- [1] arXiv:2607.00051 [pdf, html, other]
Title: Spatio-Temporal Gaussian Process for Building Terrain-Incorporating Wind Power Curves
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Accurate modeling of wind turbine power curves is crucial for optimal wind farm operation. Nearly all existing power curve models focus on temporal variables such as wind speed and temperature while overlooking the influence of terrain covariates, which govern inflow wind conditions and thus also affect wind power production. This paper proposes a nonparametric spatio-temporal Gaussian process model that integrates temporal environmental covariates with spatial terrain features. The model falls in the category of spatio-temporal Gaussian process models with data on a grid. The challenge to be addressed is that spatio-temporal modeling requires a certain temporal alignment among the data, a property that the wind farm data does not have. Our solution strategy is to construct a shared representative temporal covariate set which not only aligns the temporal inputs but also has a size an order of magnitude smaller than the original data size. With this transformation, our resulting model is able to employ a separable kernel structure that captures both spatial and temporal dependencies. Empirical analysis on a real wind farm dataset shows that our method improves predictive accuracy over existing baselines and can be used to quantify the various impacts of the terrain characteristics on turbine performance.
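The separable kernel structure mentioned in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes squared-exponential kernels for both the spatial (terrain) and temporal inputs; the function names, length-scales, and toy inputs are hypothetical. For data on a grid, a separable covariance is the Kronecker product of the two smaller kernel matrices.

```python
import math

def rbf_kernel(xs, length_scale):
    """Squared-exponential kernel matrix for a list of scalar inputs."""
    return [[math.exp(-0.5 * ((a - b) / length_scale) ** 2) for b in xs] for a in xs]

def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of lists."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]

# Hypothetical toy inputs: one terrain feature for 3 turbines and
# 2 shared representative temporal covariate values.
terrain = [0.0, 1.0, 2.5]
times = [0.0, 0.5]

K_space = rbf_kernel(terrain, length_scale=1.0)
K_time = rbf_kernel(times, length_scale=0.3)

# Covariance over all (turbine, time) pairs: a 6 x 6 matrix whose entry for
# the pairs (i, k) and (j, l) equals K_space[i][j] * K_time[k][l].
K = kron(K_space, K_time)
```

The practical appeal of this structure is that only the two small factors need to be stored and factorized, which is why aligning the temporal inputs onto a shared set matters.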
- [2] arXiv:2607.00128 [pdf, html, other]
Title: Similarity-Based Prediction for Digital Twins: Panel Data, Theory, and Applications
Comments: 32 pages, 1 figure
Subjects: Methodology (stat.ME)
Prediction from sequential panel data is central to digital-twin modeling, where new panels arrive over time and the predictive system is updated sequentially. Existing methods often rely on temporal proximity, which can fail when similar input-output patterns recur at nonadjacent times or when recent panels differ from the target panel. We propose State-Local Prediction (StaLoP), a nonparametric dynamic panel prediction framework that utilizes information through target-local predictive compatibility. StaLoP represents panels using target-local state vectors, compares historical and target panels via empirical discrepancy scores to determine relevance weights for the target point, and combines these weights with covariate localization. Theoretical results, including bias-variance characterization, asymptotic normality, simultaneous prediction bands, and a target-local-GDF-corrected MSPE criterion for panel and model selection, are developed. Extensive simulations validate the performance of StaLoP and support its theoretical properties. Applications to sequence prediction, simulator calibration, variable selection, and county-to-county migration-flow forecasting demonstrate improved out-of-sample prediction and provide scientific insights into the underlying applications.
- [3] arXiv:2607.00188 [pdf, html, other]
Title: Quantile regression with measurement errors
Subjects: Methodology (stat.ME)
We devise a novel estimator for a general quantile regression model with normal measurement errors in the covariates. The method is applicable to both linear and nonlinear quantile regressions and does not impose the quantile requirement on multiple quantile levels simultaneously. We circumvent the difficulties caused by discontinuity in quantile regression through kernel smoothing, and overcome the nonlinearity inherent in quantile regression by extending to the complex domain and using moment generating functions. We show that the resulting estimator achieves the standard root-$n$ consistency and asymptotic normality under mild conditions. The performance of the proposed method is illustrated via numerical simulations and a real data example on cherry blossom times in Japan in 2024. This is the first consistent estimator in a general quantile regression problem with normal measurement errors.
- [4] arXiv:2607.00214 [pdf, html, other]
Title: A Short Review of Estimators for the GLM predictive of Laplace Bayesian Neural Networks
Subjects: Statistics Theory (math.ST)
This short review examines the primary approaches for estimating the predictive distribution of Laplace-approximated Bayesian neural networks, with particular focus on the Generalized Linear Model (GLM) formulation. We survey the landscape of estimation strategies, from exact GLM computations requiring full Jacobian evaluations to Monte Carlo approximations that trade computational cost for statistical efficiency. The review covers the theoretical foundations of the Laplace approximation, the Kronecker-factored approximate curvature (KFAC) method for scalable posterior inference, and the various predictive estimation techniques developed in the literature. We provide a unified presentation that clarifies the relationships between methods and highlights their respective computational and statistical trade-offs.
- [5] arXiv:2607.00222 [pdf, html, other]
Title: Causal Inference for All: Marginal Estimands for Outcomes Truncated by Death
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
In longitudinal studies, outcomes of interest are often truncated by death, meaning that they are only observed or well-defined conditional on intercurrent events such as survival. Existing strategies face a trade-off: causally interpretable estimands, such as survivor average causal effects, target a latent subgroup, whereas while-alive and composite summaries apply to the full population but are difficult to interpret as causal effects on the non-mortality outcome. We address these challenges by introducing methodology for a new set of estimands that (i) concern the entire population, (ii) remain causally interpretable, and (iii) leverage the longitudinal data commonly available in studies with outcomes truncated by death. The set of estimands includes single-world marginal separable effects that generalize conditional separable effects to full-population summaries. We develop identification and estimation results for these estimands and apply the methodology in a reanalysis of a prostate cancer trial, highlighting how different estimands can yield different treatment conclusions.
- [6] arXiv:2607.00224 [pdf, html, other]
Title: Sample Complexities of Estimating Gumbel--Max Watermark Proportions with and without Reduction to Pivotal Statistics
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Watermarking promises a statistical trace of large language model (LLM) use, but real documents, after editing or paraphrasing, rarely arrive as purely human-written or purely machine-generated. This motivates a quantitative question beyond detection: what proportion of a document is generated from a pre-specified watermarked LLM? We study this watermark proportion estimation problem under the Gumbel--max watermarking mechanism, treating the next-token prediction (NTP) distributions as unknown and arbitrary nuisance parameters subject to a non-degeneracy condition. We compare two observation regimes: in the full observation regime, the estimator observes the pseudorandom vector and the selected token at each position; under the more popular setting of pivotal reduction, it observes only a scalar pivot, which follows a one-dimensional Uniform--Beta mixture distribution. Under pivotal reduction, we develop a Laguerre-polynomial estimator and establish a matching information-theoretic lower bound for the sample complexity. For full observation, we introduce an event-counting estimator and show a matching lower bound, yielding a substantially smaller sample complexity. As our results imply, although reducing to pivotal statistics is an elegant and widely used procedure, it is not always sample-efficient for estimating the proportion of watermarks.
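As a rough illustration of the pivotal reduction described in this abstract, the sketch below simulates the scalar pivot under the Gumbel-max mechanism. It is not the paper's estimator. It assumes the commonly used form of the mechanism (select the token maximizing $U_w^{1/P_w}$ and take $U$ at the selected token as the pivot) and, for simplicity, a single fixed NTP distribution, whereas the paper treats these distributions as unknown and varying. All function names are hypothetical.

```python
import random

def gumbel_max_step(probs, rng):
    """One watermarked step: draw uniforms, select argmax of U_w ** (1 / P_w).

    Returns the selected token index and the scalar pivot U_selected.
    """
    u = [rng.random() for _ in probs]
    token = max(range(len(probs)), key=lambda w: u[w] ** (1.0 / probs[w]))
    return token, u[token]

def unwatermarked_step(probs, rng):
    """One non-watermarked step: the token is drawn independently of the uniforms."""
    u = [rng.random() for _ in probs]
    token = rng.choices(range(len(probs)), weights=probs)[0]
    return token, u[token]

rng = random.Random(0)
probs = [0.5, 0.3, 0.2]  # hypothetical fixed next-token distribution
n = 20000

wm = [gumbel_max_step(probs, rng) for _ in range(n)]
human = [unwatermarked_step(probs, rng) for _ in range(n)]

# Watermarked pivots are Beta(1 / P_w, 1) given the selected token w, so they
# skew toward 1; non-watermarked pivots are Uniform(0, 1).
mean_wm = sum(y for _, y in wm) / n
mean_human = sum(y for _, y in human) / n
freq_token0 = sum(1 for t, _ in wm if t == 0) / n
```

Under these assumptions the watermarked pivot has mean $\sum_w P_w/(1+P_w)$, about 0.73 for the toy distribution, against 0.5 without the watermark. A document mixing both sources therefore produces the Uniform-Beta mixture of pivots whose mixing proportion is the estimation target.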
- [7] arXiv:2607.00230 [pdf, html, other]
Title: Waiting time analysis in a finite-capacity multi-server systems with dynamic priorities, dynamically evolving customer types, and abandonment
Subjects: Applications (stat.AP); Probability (math.PR)
In many service systems, an estimation of customers' waiting times for the service can assist in decision making focused on enhancing the operational efficiency, improving the customers' experience, and ensuring efficient resource allocation. In this paper, we study the customers' waiting times in a finite-capacity service system with a finite number of parallel servers and a shared waiting area. We consider two customer types, Type 1 and Type 2, with dynamic admission priorities, dynamically evolving customer type, and abandonment. We model the system under such assumptions using a continuous-time Markov chain (CTMC) and develop a methodology based on Krylov subspace approximation methods to evaluate the conditional waiting time distributions of Type 1 and Type 2 customers in the system. This methodology (CTMC-Krylov) offers a scalable computational approach that is well suited for analysing large complex systems. Next, we model this system using a quasi-birth-and-death (QBD) process and derive analytical expressions building on matrix-analytic methods to evaluate the conditional and long-run waiting time distributions using recursion. We illustrate the practical applicability of our models in a hospital system through a suite of numerical examples based on a large dataset obtained from a tertiary referral hospital in Australia, considering two types of patients, complex (Type 1) and other (Type 2).
- [8] arXiv:2607.00261 [pdf, html, other]
Title: Worst-Case Maximal Inequalities for Heavy-tailed Random Vectors
Subjects: Statistics Theory (math.ST)
This paper establishes finite-sample worst-case maximal inequalities for averages of independent centered heavy-tailed random vectors. The object of interest is the expected top-$k$ Euclidean norm of the sample average, which includes the expected coordinate-wise maximum as the special case $k=1$. Under coordinate-wise variance constraints and tail-envelope constraints, the worst-case value is characterized up to universal constants over the class of distributions satisfying a finite $q$-th envelope moment condition. Analogous bounds are obtained for the sub-Weibull envelope class and the marginal sub-Weibull class.
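For readers unfamiliar with the quantity, the sketch below computes a top-$k$ Euclidean norm under the assumption that it means the Euclidean norm of the $k$ largest-magnitude coordinates; the abstract does not spell out the definition, so this reading and the function name are assumptions.

```python
import math

def top_k_norm(x, k):
    """Euclidean norm of the k largest-magnitude coordinates of x (assumed definition)."""
    largest = sorted((abs(v) for v in x), reverse=True)[:k]
    return math.sqrt(sum(v * v for v in largest))

x = [0.3, -2.0, 1.5, 0.1]  # hypothetical sample-average vector

max_abs = top_k_norm(x, 1)        # 2.0, the coordinate-wise maximum in absolute value
top_two = top_k_norm(x, 2)        # 2.5, from sqrt(2.0**2 + 1.5**2)
full = top_k_norm(x, len(x))      # ordinary Euclidean norm of x
```

Under this reading, $k=1$ recovers the maximum absolute coordinate and $k$ equal to the dimension recovers the full Euclidean norm, so the family interpolates between the two.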
- [9] arXiv:2607.00317 [pdf, html, other]
Title: Economic Disparities and Their Relationship to Destructive Health Behaviors in Five Western U.S. States
Comments: 24 pages, 6 figures, 3 tables
Subjects: Applications (stat.AP)
In this paper, we examine the relationships between economic variables and adverse health outcomes in the counties of the western states of Washington, Idaho, Oregon, California, and Nevada, with specific emphasis on how suicide rate relates to such economic variables. Data were first gathered from Census and County Health Rankings for the entire United States (for website use and usefulness for future research), then cleaned and regression-imputed. Various exploratory data analysis methods were then applied, including PCA, clustering, correlation analysis, linear fits, and LASSO. PCA and clustering suggested that counties may group according to broader state-level economic patterns, although political interpretations would require additional electoral data. Correlation analysis, LASSO, and linear fits identified the destructive health variables most strongly connected with economic variables (in terms of $R^2$ and correlation values), the economic variables that are most and least important in predicting suicide rate, and the possible relationships between suicide rate and these economic variables.
- [10] arXiv:2607.00320 [pdf, other]
Title: From Spectral Methods to Sample Complexity Bounds for Fourier Neural Operators
Comments: 66 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We establish approximation and learning guarantees for Fourier neural operators (FNOs) applied to time-$T$ solution operators of dissipative evolution equations. The analysis builds on the premise that FNOs can efficiently approximate and learn solution operators whenever these operators admit stable and accurate spectral discretizations. To formalize this idea, we introduce classes of evolution operators defined through spectral methods and derive FNO approximation bounds and polynomial sample complexity guarantees for these classes. For equations with polynomial nonlinearities, the learning rates depend primarily on the smoothness of the input space and the dimension of the physical domain. Our results hold uniformly over broad families of dissipative equations, rather than for a single fixed PDE, and apply in particular to the Navier--Stokes, Allen--Cahn, and Cahn--Hilliard equations. For equations with non-polynomial smooth nonlinearities, we prove that polynomial sample complexity still holds with rates that now additionally depend on the smoothness of the nonlinear terms and the dissipation strength. Overall, we connect classical spectral approximation theory with modern operator learning and explain when FNOs can learn nonlinear evolution operators efficiently.
- [11] arXiv:2607.00330 [pdf, html, other]
Title: Ergodicity and High-Frequency Inference for Hybrid Switching Lévy-Driven Stochastic Differential Equations
Subjects: Statistics Theory (math.ST); Probability (math.PR)
Hybrid switching Lévy-driven stochastic differential equations with pure-jump noise and state-dependent switching rates are studied under high-frequency observation. A three-stage inference procedure is proposed for the drift, scale, and switching-rate parameters, combining a staged Gaussian quasi-likelihood with an intensity-type contrast. Checkable sufficient conditions for weighted exponential ergodicity are established for the hybrid process; the proof does not rely on Brownian smoothing, but uses a fixed skeleton-chain argument combining small-jump accessibility and regime connectivity. Under ergodicity and the high-frequency sampling scheme, consistency, joint asymptotic normality, and a polynomial-type large deviation inequality are proved for the full estimator. The joint limit exhibits a transparent covariance structure: the drift and scale blocks are coupled through the third moment of the driving Lévy noise, whereas the switching-rate block is asymptotically uncorrelated with the continuous-coefficient blocks. Numerical experiments for models driven by normal inverse Gaussian noise illustrate the finite-sample behavior of the proposed estimators.
- [12] arXiv:2607.00331 [pdf, html, other]
Title: Coupling Precipitation Forecasting and Early Warning with Reverse-Martingale Recurrent Neural Networks
Comments: 34 pages, 5 figures
Subjects: Applications (stat.AP)
Precipitation forecasts are judged by accuracy, but the decisions they support -- when to restrict water, when to warn of drought -- turn on noticing when a local regime is becoming abnormal, which forecast scores alone do not reveal. We ask whether one recurrent model can do both with little or no loss in forecast skill. We add a backward-coherence (reverse-martingale) penalty that keeps the network's hidden state smooth when read backward in time; the size of the resulting reconstruction defect becomes an online warning signal, monitored by a sequential change-point detector. The design is deliberately conservative. On real daily station data from four contrasting climates -- monsoonal Taiwan, semi-arid Texas, temperate Germany, and Mediterranean Anatolia (Turkey) -- the model matches a standard network's forecast skill everywhere, and makes the hidden state markedly steadier in every region. The novelty is the added information: on these real droughts the signal can alarm well ahead of the operational SPI-3 index, giving lead that neither the forecast nor the index provides. This benefit is not uniform across the four regions -- large in one, partial in two others, and near-absent in the fourth. We offer the hydroclimatic character of drought onset, whether it precedes or merely coincides with the rainfall deficit, as a plausible explanation to be tested in future work, supported by a controlled synthetic study with known onset times. The contribution is thus a new and conservative way to read precipitation records: no loss in forecast skill, a steadier model, and an early-warning signal beyond the standard index.
- [13] arXiv:2607.00350 [pdf, html, other]
Title: Robust Estimation and Inference with Selective Borrowing in Hybrid Controlled Trials: A Tutorial with SelectiveIntegrative and intFRT
Subjects: Methodology (stat.ME)
Hybrid controlled trials (HCTs) augment randomized controlled trials (RCTs) with external controls (ECs) to improve statistical efficiency when RCTs face limited sample sizes, slow accrual, or ethical constraints. However, valid use of ECs requires careful adjustment for covariate shift and outcome drift, as inappropriate borrowing may introduce bias and compromise inference. This tutorial provides a practical workflow for estimation and inference in HCTs. We first present a statistical analysis roadmap covering estimands, identification assumptions, eligibility alignment, matching, full and selective borrowing strategies, and both asymptotic inference and randomization tests. We then demonstrate step-by-step implementation using the SelectiveIntegrative and intFRT packages. The workflow is illustrated using a synthetic lung cancer dataset included in the intFRT package that mimics the CALGB 9633 trial and ECs from the National Cancer Database. The tutorial aims to help applied statisticians conduct transparent, interpretable, and reproducible HCT analyses that improve efficiency while maintaining valid inference.
- [14] arXiv:2607.00373 [pdf, html, other]
Title: Confidence Intervals for the Risk Difference in Combined Unilateral and Bilateral Data Incorporating a Distribution-Based Approach
Comments: 23 pages, 3 figures, 8 tables
Subjects: Methodology (stat.ME)
Combined unilateral and bilateral binary outcomes frequently arise in studies involving paired organs. The risk difference is a clinically interpretable measure for comparing treatment effects between groups. Existing confidence interval methods are primarily based on asymptotic normality and may fail to adequately reflect finite-sample distributional features, particularly skewness. To address this issue, we propose a distribution-based confidence interval derived from the probability distribution of the risk difference estimator and a modified MOVER procedure that accounts for intra-subject correlation. Their performances are compared with those of commonly used asymptotic methods through extensive simulation studies. Across a broad range of parameter settings, all methods exhibited satisfactory performance as sample size increased. The proposed distribution-based interval achieved coverage probabilities close to the nominal level with interval widths comparable to those of existing procedures. In small sample settings, it was able to capture skewness in the sampling distribution that was not reflected by methods relying on asymptotic normality. Analyses of two real-world datasets demonstrated the practical applicability of the competing methods and yielded consistent inferential conclusions. The proposed approach provides an alternative framework for interval estimation of the risk difference in studies involving combined unilateral and bilateral binary outcomes.
- [15] arXiv:2607.00376 [pdf, html, other]
Title: Distributed Prediction under Heterogeneity with Unidentifiable Parameter
Subjects: Methodology (stat.ME); Optimization and Control (math.OC); Statistics Theory (math.ST)
Predicting a response based on covariates is a fundamental problem in statistics and machine learning. However, profound difficulties arise when the underlying low-dimensional structural parameters are unidentifiable, as typified in dimension reduction contexts. Specifically, estimating these non-identifiable parameters inherently introduces severe nonconvexity. In distributed settings, this difficulty is further compounded by the challenges of data heterogeneity and communication cost. To overcome these intertwined barriers, we propose a novel distributed semiparametric framework. We formulate an adaptive homogeneity pursuit utilizing a trace-similarity penalty to effectively address data heterogeneity. To resolve the ensuing severe nonconvexity and communication bottlenecks, we introduce an invex relaxation technique coupled with a multi-step local update algorithm, ensuring stable convergence to global optimality with significantly reduced communication overhead. Theoretically, we establish a non-asymptotic model-free prediction error bound and prove that our estimator achieves a two-phase minimax optimal convergence rate and a sharper model-free prediction error bound. Furthermore, we provide theoretical guarantees for algorithmic convergence and communication efficiency. Extensive simulations and a real-world multi-center medical application validate the superiority of our method.
- [16] arXiv:2607.00470 [pdf, html, other]
Title: Neural Network-Based Estimation of Time-Dependent Parameters in AR(p) Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We investigate a forecasting framework based on a simple discrete-time dynamic model with coefficients varying in time. The parameters of the model are recovered within a deep learning framework, which makes it possible to retain a transparent parametric structure while simultaneously accounting for complex and nonstationary patterns in the observed phenomenon. Our analysis covers two specifications of the noise process. Besides the standard Gaussian setting, we also consider Laplace-distributed noise, which can offer a more adequate description in the presence of heavier tails and sharper local fluctuations. For both cases, we formulate the predictive scheme of the model and analyze the associated uncertainty quantification, including the construction of prediction intervals. The results illustrate that a relatively simple model, when combined with time-dependent parameter estimation, can serve as a mathematically tractable and practically flexible tool for forecasting complex dynamics under different noise assumptions. The general model is stated for TVAR($p$), while the prediction-interval formulas and the numerical experiments are developed for the TVAR(1) case.
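To make the contrast between the two noise specifications concrete, here is a minimal sketch of one-step-ahead prediction intervals for a TVAR(1) model $x_{t+1} = a_t x_t + \varepsilon_{t+1}$. It is not the paper's formulas: it assumes the coefficient $a_t$ and the noise scale are known (in the paper they are estimated by a neural network) and ignores estimation uncertainty. The function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

def gaussian_interval(a_t, x_t, sigma, alpha=0.05):
    """Central (1 - alpha) interval when the noise is N(0, sigma^2)."""
    center = a_t * x_t
    half = NormalDist().inv_cdf(1 - alpha / 2) * sigma
    return center - half, center + half

def laplace_interval(a_t, x_t, b, alpha=0.05):
    """Central (1 - alpha) interval when the noise is Laplace(0, b).

    For Laplace noise, P(|eps| > q) = exp(-q / b), so q = -b * log(alpha).
    """
    center = a_t * x_t
    half = -b * math.log(alpha)
    return center - half, center + half

a_t, x_t, sigma = 0.8, 2.0, 1.0
b = sigma / math.sqrt(2)  # Laplace scale with the same variance (2 * b^2 = sigma^2)

gauss_lo, gauss_hi = gaussian_interval(a_t, x_t, sigma)
lap_lo, lap_hi = laplace_interval(a_t, x_t, b)
```

At equal variance and a 95% level, the Laplace interval comes out wider than the Gaussian one (half-width about 2.12 versus 1.96 here), which reflects the heavier tails the abstract refers to.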
- [17] arXiv:2607.00586 [pdf, html, other]
Title: Optimal scaling of MCMC algorithms: exploiting the symmetry of the Metropolis-Hastings formula
Comments: 23 pages, 3 figures
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Probability (math.PR)
We present a simple, yet general approach to study the scaling properties of Metropolised MCMC sampling algorithms as the dimensionality increases. The study relies ultimately on the symmetry of the Metropolis-Hastings formula. Our findings contain, as particular cases, many known results for the Random Walk Metropolis, MALA and other algorithms. In addition, they provide, in an easy way, new optimal scaling results for a variety of proposal mechanisms, including implicit proposals and proposals generated with the help of differential equation integrators. The analysis applies to targets that are products of a given, not necessarily univariate distribution, and also to cases where the different terms in the product are scaled differently. We show how to construct gradient-based MALA-like proposals where the variance of the proposal as the dimension $d$ increases may be taken as $O(1/d^\mu)$, with $\mu>0$ arbitrarily small, to be compared with the values $\mu = 1$ for Random Walk Metropolis and $\mu=1/3$ for MALA.
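The $\mu = 1$ scaling for Random Walk Metropolis can be checked numerically with a short sketch. This illustrates only the classical baseline, not the paper's new proposals. It assumes a standard normal product target, a chain started in stationarity, and the classical constant $\ell = 2.38$, for which the limiting acceptance rate is known to be about 0.234; the function name and run lengths are hypothetical.

```python
import math
import random

def rwm_acceptance_rate(d, ell, n_steps, rng):
    """Empirical acceptance rate of Random Walk Metropolis on a N(0, I_d) target.

    The proposal variance is ell^2 / d, i.e. O(1 / d^mu) with mu = 1.
    """
    sigma = ell / math.sqrt(d)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # start in stationarity
    log_p = -0.5 * sum(v * v for v in x)
    accepted = 0
    for _ in range(n_steps):
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        log_q = -0.5 * sum(v * v for v in y)
        # Symmetric proposal: accept with probability min(1, pi(y) / pi(x)).
        if math.log(1.0 - rng.random()) < log_q - log_p:
            x, log_p = y, log_q
            accepted += 1
    return accepted / n_steps

rate = rwm_acceptance_rate(d=50, ell=2.38, n_steps=4000, rng=random.Random(1))
```

Keeping the proposal variance proportional to $1/d$ holds the acceptance rate roughly stable as $d$ grows. A smaller exponent $\mu$, as for MALA or the proposals constructed in the paper, permits larger steps in high dimension.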
- [18] arXiv:2607.00645 [pdf, other]
Title: Approximate full-conformal multi-task regression with reproducing kernels
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Multi-task regression aims at jointly solving multiple regression problems, called tasks. Compared to solving each task separately, better performance can be achieved as long as the tasks are sufficiently related. Full-conformal prediction is a framework that formulates a data-dependent prediction-region containing the unknown output-vector at any prescribed confidence level. However, explicit computation of this prediction-region is intractable in general since it requires training infinitely many predictors. The present work focuses on multi-task regression in a Reproducing Kernel Hilbert Space (RKHS) of vector-valued functions. This computational issue is addressed by designing an approximating prediction-region containing the full-conformal one. This construction is carried out in two scenarios: (i) when the inter-task covariance-matrix is known, and (ii) when this matrix is estimated. In terms of volume, the tightness of this approximation is assessed theoretically by means of an upper-bound in the first scenario. It is also shown empirically to improve upon the split-conformal prediction on synthetic data in both scenarios.
- [19] arXiv:2607.00722 [pdf, html, other]
Title: How does academic performance affect self-efficacy? Interpretable modelling through latent academic achievement
Comments: Main manuscript: 25 pages (including references). Supplementary material: 19 pages
Subjects: Methodology (stat.ME); Applications (stat.AP)
There is increasing evidence of a directional relationship from academic performance to self-efficacy. We develop a Bayesian model for investigating this relationship when academic performance is measured on an ordinal scale and self-efficacy on a continuous scale. The model allows latent academic achievement to enter the self-efficacy regression as a predictor, while Bayesian variable selection identifies factors associated with either response. The resulting conditional formulation yields an interpretable regression characterisation of how latent academic achievement relates to self-efficacy. Furthermore, it enables a tailored partially collapsed Gibbs sampler that analytically integrates out the regression coefficients when updating the variable inclusion indicators. Simulation studies demonstrate that the proposed conditional formulation and tailored sampler improve sampling efficiency and variable-selection performance relative to a recent, more general joint Gaussian copula regression formulation. We apply the methodology to data from the Longitudinal Study of Australian Children, a landmark national cohort study covering children's education, social and emotional wellbeing, health and family circumstances. The model and analysis shed light on how latent academic achievement relates to self-efficacy in Australian children, and reveal that the two outcomes differ markedly in the range of covariates associated with each outcome.
- [20] arXiv:2607.00847 [pdf, html, other]
Title: Transfert learning and adaptive LASSO quantile
Subjects: Methodology (stat.ME); Computation (stat.CO)
For quantile regression, we propose an estimation method that transfers knowledge using two $L_1$ penalties based on an estimator obtained from a source database. The proposed transfer learning estimator satisfies the properties of consistency and sparsity. Its convergence rate and asymptotic behavior are studied in several scenarios. This knowledge transfer results in a shorter computation time than that of the standard adaptive LASSO estimator. Another advantage of our method is that it can be applied to models with non-Gaussian errors. We also propose an algorithm for computing the adaptive transfer LASSO quantile estimator. The simulations confirm the theoretical results and demonstrate that the adaptive learning estimator, calculated using the proposed algorithm, is more competitive than the LASSO estimators. Finally, we illustrate the practical utility of the proposed transfer learning estimator and algorithm using a real-data application involving the physicochemical properties of protein tertiary structures.
- [21] arXiv:2607.00877 [pdf, html, other]
Title: Hierarchical Variational Kalman Filtering
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT)
Traditional variational Kalman filtering with unknown noise statistics suffers from inconsistent process covariance estimation and slow convergence speed, limiting its practical utility. To address these issues, we introduce a surrogate variable representing the process-noise-free state, which enables explicit modeling and inference of process noise statistics. In addition, we reformulate the conventional coordinate ascent variational inference (CAVI) as a marginalized maximum a posteriori problem, followed by a single-step hyperparameter fitting. This reformulation obviates the need for multiple inner iterations inherent to CAVI and decouples the design of the covariance tracking filters. Consequently, this architecture permits the deployment of higher-order filters for covariance tracking and enables sliding-window hyperparameter estimation. Notably, when this window encompasses all historical data, the covariance tracking estimator intrinsically operates as a zero-phase filter. Numerical simulations validate the theoretical framework, demonstrating the enhanced convergence speed and superior estimation accuracy compared with existing methods.
- [22] arXiv:2607.00907 [pdf, html, other]
Title: Beyond the Flow: A Bayesian Latent Clustering Framework for Shared Micro-mobility Users in Venice
Comments: 24 pages, 10 figures
Subjects: Applications (stat.AP)
Research on shared micro-mobility is based on trip modeling and user data. User segmentation in shared micro-mobility systems is traditionally studied by aggregating trip-level observations into user-specific summary measures before applying clustering techniques. Such aggregation can obscure trip-level variability and lead to ecological fallacies if results are interpreted as applying to individual records. We propose a Bayesian finite mixture model for multivariate categorical count data that clusters users directly from repeated trip-level observations while preserving the full categorical structure of individual travel behavior. This approach focuses on identifying heterogeneous mobility users from high-dimensional categorical trip behavior while accounting for uncertainty in cluster assignments. Users are the fundamental unit of analysis for exploring latent cluster patterns. The model represents each user with a product-multinomial likelihood with latent cluster membership. The methodology is illustrated using a one-year trip record of shared bikes and e-bikes from the Municipality of Venice, Italy, comprising over 220,000 trips made by more than 11,000 recurrent users. The analysis identifies eight distinct latent mobility profiles corresponding to localized, commuter-oriented, tourist-oriented, central, and inter-zonal travel behaviors. The proposed framework provides a flexible and computationally scalable approach for clustering repeated categorical observations and is readily applicable to other large-scale behavioral and transportation datasets.
- [23] arXiv:2607.00915 [pdf, html, other]
Title: Simulating Node Manipulations in Gaussian Graphical Models: The GGMNIRA Framework for Continuous and Ordinal Psychological Network Data
Subjects: Methodology (stat.ME)
In psychological network analysis, centrality indices are commonly used to evaluate the importance of nodes within a network. However, centrality only captures the static topological position of a node, and there is no sufficient theoretical justification for assuming that it reflects a node's influence on network dynamics. The NodeIdentifyR Algorithm (NIRA) offers an alternative by systematically applying simulated manipulations to node intercepts within the Ising model to evaluate nodes' projected importance, but this algorithm is restricted to binary data, and the manipulated parameter lacks a clear theoretical meaning outside the context of psychopathology. To address these limitations, we propose the Gaussian Graphical Model NodeIdentifyR Algorithm (GGMNIRA), which manipulates a node's conditional mean and uses Kullback-Leibler (KL) divergence to quantify the change in network distribution before and after manipulation, thereby extending this simulated manipulation logic to the Gaussian graphical model framework, which is applicable to continuous and ordinal data. Around this algorithm, we further developed a correlation stability coefficient and a nonparametric bootstrap difference test for KL divergence, with corresponding interpretive thresholds established through simulation studies. The framework was also extended to bridge Gaussian graphical models and moderated Gaussian graphical models, enabling its application to multi-construct comorbidity networks and to contexts involving moderation effects. All methods are implemented in the R package "GGMNIRA".
- [24] arXiv:2607.00980 [pdf, html, other]
-
Title: An Instrumental Variable Approach to Account for Informative Treatment Switching in Real-world EvidenceSubjects: Methodology (stat.ME)
Reproducible and generalizable assessment of treatment decisions requires principled handling of subsequent treatment switching that may inform expected outcomes and shift across cohorts and over time. To effectively account for informative treatment switching, we propose an instrumental variable approach that characterizes the poorly documented expected outcomes at switching as unmeasured confounding. After establishing the baseline treatment as a viable instrumental variable, we constructed an estimating equation based on the association between the centered instrumental variable and a martingale-style residual process that identifies the treatment effect under a structural cumulative survival model. Our proposed method is doubly robust, i.e., valid whenever either the baseline propensity model or the no-switching outcome model is consistently estimated. Co-training of the treatment effect parameter and the survival outcome regression model eliminated the requirement of observing a no-switching subset under semi-parametric additive hazards models. We further developed a baseline-survival-corrected cross-fitting approach to incorporate general machine learning models for estimating nuisance models. Numerical results demonstrated the validity of our method in various settings where a basket of benchmark solutions produced biased or contradictory results. We applied our method to a comparison of high-efficacy vs standard-efficacy disease-modifying treatments as second-line therapy for multiple sclerosis.
- [25] arXiv:2607.00995 [pdf, html, other]
-
Title: Deep Multitask Learning for Mixed-Type Outcomes with Shared SparsitySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Most existing multitask learning approaches are limited by their reliance on task-specific loss functions tailored to the scale and type of each outcome. When outcomes differ across tasks, these losses are generally not directly comparable, which makes it difficult to formulate a unified objective and may limit information sharing across tasks. We propose a multitask transformation framework in which task-specific responses may differ through unknown monotone transformations. Motivated by high-dimensional biological applications in which the predictor dimension may diverge with the sample size while only a common subset of predictors is informative, we consider shared sparsity across tasks. Under this framework, we estimate the target functions and identify important predictors by optimizing a smoothed rank-based criterion with a group-Lasso penalty, implemented through a multitask deep neural network with a shared first layer. We establish nonasymptotic excess-risk bounds and variable-selection consistency for the proposed estimator. Simulation studies show that the proposed method achieves competitive prediction and variable-selection performance compared with competing approaches. Analyses of gene-expression studies with continuous, binary, and mixed outcomes further illustrate that the proposed method improves prediction and identifies biologically meaningful shared predictors.
- [26] arXiv:2607.01010 [pdf, other]
-
Title: Function-Counting Theory for Low-Dimensional Data StructuresComments: 49 pages, 7 figuresSubjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Combinatorics (math.CO)
The success of deep learning models in classification and regression is widely attributed to the low-dimensional structure that real-world data tend to exhibit, despite their high-dimensional representation. This work attempts to provide a mathematical framework for binary classification on low-dimensional data, building on Cover's (1965) function-counting theory. With our framework, we aim to address the question of how the low-dimensional structure of the data affects the classification capabilities of learning models. Cover's theory relies on a general position assumption that blinds it to the underlying data structure. We refine this assumption to account for the low-dimensionality of the data and derive dichotomy counts that reflect the data structure. We further extend Cover's separation capacity and problem of generalization to the low-dimensional setting, enabling the impact of the underlying data structure on both to be analyzed.
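As background for the abstract above, Cover's classical count under the general position assumption can be computed directly. The sketch below covers only that classical baseline, $C(N,d) = 2\sum_{k=0}^{d-1}\binom{N-1}{k}$ for homogeneous linear separators; it does not reproduce the paper's refined low-dimensional counts, and the function names are illustrative.

```python
from math import comb


def cover_count(n_points: int, dim: int) -> int:
    """Classical Cover (1965) count: number of dichotomies of n_points in
    general position in R^dim that a hyperplane through the origin can realize."""
    if n_points < 1 or dim < 1:
        raise ValueError("n_points and dim must be positive integers")
    # math.comb returns 0 when k > n_points - 1, so large dim is handled.
    return 2 * sum(comb(n_points - 1, k) for k in range(dim))


def separable_fraction(n_points: int, dim: int) -> float:
    """Fraction of all 2**n_points dichotomies that are linearly separable."""
    return cover_count(n_points, dim) / 2 ** n_points


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(n, separable_fraction(n, dim=5))
```

With `dim=5`, every dichotomy is separable up to 5 points, and exactly half are separable at 10 points, which is the classical capacity $N = 2d$.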
- [27] arXiv:2607.01057 [pdf, html, other]
-
Title: Characterizing and Identifying Separable Graphical ModelsComments: 69 pages, 7 figures, complete paper currently under submissionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study a broad class of graphical models whose independencies correspond to vertex separation in mixed graphs with directed, undirected, and bidirected edges, which are capable of encoding independence structures arising from feedback, latent, and selection mechanisms. In particular, we introduce separable graphs, in which each missing edge implies the existence of a separating set for its endpoints, and essentially separable graphs, which are graphs that are separation-equivalent to a separable graph. We show that these models include many existing graph families used to define graphical models and provide several characterizations of separable graphs and essentially separable graphs. We also provide multiple characterizations of separation equivalence for separable graphs. One is a graphical characterization in terms of ordinary graph properties, extending earlier results for specific subfamilies. Another is a separational characterization depending only on graph separation properties. Finally, we provide a canonical representation for the equivalence classes of essentially separable graphs and develop an algorithm that, under suitable assumptions, identifies the equivalence class of any essentially separable graph.
New submissions (showing 27 of 27 entries)
- [28] arXiv:2606.27525 (cross-list from econ.GN) [pdf, html, other]
-
Title: Measuring Racial Disparities in Rent Growth Under Algorithmic Landlord Concentration in U.S. MetrosComments: Code available at: this https URLSubjects: General Economics (econ.GN); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
The 2024 Department of Justice antitrust complaint against RealPage, Inc. named five major residential REITs for coordinating algorithmic rent pricing across hundreds of thousands of apartment units in major US metropolitan areas. This paper studies whether census-tract-level corporate landlord concentration (CLC), measured from SEC EDGAR 10-K property filings geocoded to census tracts, the first such application in the literature, is associated with rent growth over 2019-2023, and whether that association is larger in majority-minority neighborhoods. Rent outcomes are measured using the Zillow Observed Rent Index (ZORI). To account for the possibility that corporate landlords preferentially locate in neighborhoods already seeing rent appreciation, all regressions control for a novel Algorithmic Housing Burden Index (AHBI), a composite of pre-existing rent burden and market tightness from ACS data. Across 665 census tracts in ten US metropolitan areas, doubling REIT concentration is associated with 2.8 percentage points higher rent growth (p = 0.086, p = 0.030, HC1 robust). This association is significantly stronger in majority-minority tracts. Within the same metro, high-CLC majority-minority tracts are associated with 5.9 percentage points higher rent growth than comparable white tracts (p = 0.039). An XGBoost model predicts 44 percent of out-of-sample rent growth variance, with SHAP analysis independently confirming that CLC's contribution is positive in minority tracts and negative in white tracts. Taken together, these findings provide the first tract-level evidence consistent with corporate landlord concentration being associated with disproportionately higher rent growth in communities of color.
- [29] arXiv:2607.00149 (cross-list from math.PR) [pdf, html, other]
-
Title: Uniform-in-time Propagation-of-Chaos for Stein Variational Gradient DescentComments: 56 pagesSubjects: Probability (math.PR); Machine Learning (stat.ML)
We study uniform-in-time propagation-of-chaos for continuous-time Stein Variational Gradient Descent (SVGD). Classical finite-time propagation-of-chaos estimates for mean-field systems typically deteriorate rapidly with time and therefore do not directly explain the long-time relation between the finite-particle system and its mean-field limit. We obtain two complementary classes of uniform-in-time propagation-of-chaos results.
For broad distributional metrics, we introduce a cutoff strategy which combines finite-time propagation-of-chaos estimates up to an $N$-dependent horizon with independent quantitative long-time convergence estimates for the finite-particle and mean-field SVGD flows. This yields uniform-in-averaging-time propagation-of-chaos bounds in Langevin kernel Stein discrepancy, Wasserstein-1 distance, and Wasserstein-2 distance, with logarithmic or iterated-logarithmic rates depending on the metric, target and kernel class.
We also develop a finite-dimensional theory for matrix-valued finite-rank kernels. For Gaussian targets with bilinear kernels, the SVGD dynamics close exactly on first and second moments, yielding genuine uniform-in-physical-time parametric propagation-of-chaos rates in finite-dimensional Stein-feature metrics. We then prove a conjugacy principle showing that these feature-level estimates transfer to conjugate target-kernel pairs under orientation-preserving diffeomorphisms, thereby extending the theory to broad classes of nonlinear, including multimodal, targets.
Together, these results highlight the contrast between generic distributional metrics, for which our general approach yields logarithmic rates, and closed finite-dimensional Stein observables, for which parametric $N^{-1/2}$ propagation-of-chaos rates persist uniformly in time.
- [30] arXiv:2607.00152 (cross-list from cs.LG) [pdf, html, other]
-
Title: GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation IdentityComments: 18 pages, 10 figures, 4 tables. Code and data: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt's sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks measures the disagreement: largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update, the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. This paper confirms the intuition on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
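The three operations described above can be illustrated on a single group of right-or-wrong rewards. This is a deliberately stripped-down sketch: it computes only per-answer advantages and a keep/discard decision, omits clipping, token-level weighting, and everything else in the real training objectives, and uses function names invented for this example.

```python
import math


def group_stats(rewards):
    """Mean and (population) standard deviation of one prompt's rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return mean, math.sqrt(var)


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style: center the rewards, then divide by the group std."""
    mean, std = group_stats(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: center the rewards but drop the division."""
    mean, _ = group_stats(rewards)
    return [r - mean for r in rewards]


def dapo_keep_group(rewards):
    """DAPO-style filter: discard groups whose std is zero."""
    _, std = group_stats(rewards)
    return std > 0.0


if __name__ == "__main__":
    split = [1, 1, 0, 0]      # answers disagree evenly
    unanimous = [1, 1, 1, 1]  # all answers agree
    for group in (split, unanimous):
        print(group, group_stats(group)[1],
              dr_grpo_advantages(group), dapo_keep_group(group))
```

For binary rewards with a fraction $p$ of correct answers, the group standard deviation is $\sqrt{p(1-p)}$, so it peaks for the evenly split group and vanishes for the unanimous one, where every centered advantage is zero.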
- [31] arXiv:2607.00207 (cross-list from math.OC) [pdf, other]
-
Title: Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient DescentSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
We develop a framework for analyzing the learning dynamics of $\ell_2$-adversarial training of single-index models on Gaussian mixtures in the high-dimensional limit under streaming stochastic gradient descent (SGD). We derive deterministic equivalents for a broad class of statistics of the SGD iterates, including the adversarial risk and distance to adversarial optimality, in terms of the solution to a system of ODEs. We use them to study two idealized learning rate schedules: the Polyak stepsize and exact line search. In the case of $\ell_2$-adversarial least squares with a single class, we show that, unlike noiseless standard least squares, no constant learning rate guarantees monotone descent of SGD towards a minimizer of the adversarial risk. We identify anisotropic covariance and a mismatch in ridge parameters as the main sources of suboptimality of exact line search relative to the Polyak stepsize. We also introduce a stochastic differential equation (SDE), called adversarial homogenized SGD, that captures the evolution of statistics of the iterates of SGD. For $\ell_2$-adversarial least squares, using this SDE, we show the evolution of the risk is equivalent, up to dimension-free constants, to that of SGD on standard least squares with an adaptive learning rate and adaptive $\ell_2$-regularization. When the dynamics converge, the limiting adversarial risk and SGD iterate are determined by a fixed-point equation, with the limiting iterate being equivalent to the solution of a ridge regression problem whose regularization parameter is the limiting effective regularization of SGD.
- [32] arXiv:2607.00252 (cross-list from cs.LG) [pdf, html, other]
-
Title: Distributionally Robust Linear Regression With Block Lewis WeightsComments: ICLR 2026. Comments welcome!Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
We present an algorithm for the group distributionally robust (GDR) least squares problem. Given $m$ groups, a parameter vector in $\mathbb{R}^d$, and stacked design matrices and responses $\mathbf{A}$ and $\mathbf{b}$, our algorithm obtains a $(1+\varepsilon)$-multiplicative optimal solution using $\widetilde{O}(\min\{\mathsf{rank}(\mathbf{A}),m\}^{1/3}\varepsilon^{-2/3})$ linear-system-solves of matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ for block-diagonal $\mathbf{B}$. Our technical methods follow from a recent geometric construction, block Lewis weights, that relates the empirical GDR problem to a carefully chosen least squares problem and an application of accelerated proximal methods. Our algorithm improves over known interior point methods for moderate accuracy regimes and matches the state-of-the-art guarantees for the special case of $\ell_{\infty}$ regression. We also give algorithms that smoothly interpolate between minimizing the average least squares loss and the distributionally robust loss.
- [33] arXiv:2607.00275 (cross-list from cs.LG) [pdf, html, other]
-
Title: Entropy-Regularized Probabilistic Gates for Sparse Model Discovery in Scarce-Data Federated LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Federated Learning (FL) is a distributed machine learning (ML) paradigm with collaboration among multiple clients without sharing data. FL is challenging under data heterogeneity and partial client participation. Learning sparse models is useful for communication and computational efficiency in FL, but it is especially difficult in the small-sample high-dimensional regime (d >> N) where optimization can yield parameter configurations that fail to generalize to unseen test data. While magnitude-based pruning doesn't account for uncertainty exploration in the parameter space, a formulation with probabilistic gates and an L0 constraint allows sampling from competing sparse configurations during training. In this work, we study entropy regularization of gate distributions as a mechanism to maintain uncertainty in sparse federated optimization by preventing early commitment to sparse support. We examine its impact under data heterogeneity, client participation heterogeneity, and sparsity. Experiments on synthetic and real-world benchmarks show consistent improvements over federated iterative hard thresholding (Fed-IHT) and pruning after dense federated averaging (FedAvg) training, both in statistical performance on test data and in sparsity recovery accuracy.
- [34] arXiv:2607.00280 (cross-list from cs.LG) [pdf, html, other]
-
Title: Understanding Guest Preferences and Optimizing Two-sided Marketplaces: Airbnb as an ExampleComments: 5 pages, 3 figures. Presented at the KDD 2024 Workshop on Two-Sided Marketplace Optimization, Barcelona, SpainSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Econometrics (econ.EM); Applications (stat.AP)
Airbnb is a community based on connection and belonging -- many hosts on Airbnb are everyday people who share their worlds to provide guests with the feeling of connection and being at home; Airbnb strives to connect people and places. Among our efforts to connect guests and hosts, we provide tools to enable hosts to set competitive prices, which helps improve affordability for guests while helping hosts get more bookings. We also personalize the guest experience to show them the listings that match their needs.
To help inform these efforts, we combine economic modeling and causal inference techniques to understand how guests book stays based on the prices hosts set, among other factors, and how that preference varies across different guests and listings. Such understanding helps us identify opportunities for Airbnb to support the marketplace and better connect guests and hosts. For example, understanding how much guests respond to different prices helps optimize the tools that we provide to hosts, in order to enable hosts to choose and set competitive prices that further balance demand and supply. As another example, understanding heterogeneity in guest preferences helps us personalize the guest experience and better match them with the listings that meet their needs, based on how much they respond to different prices and other factors.
- [35] arXiv:2607.00312 (cross-list from econ.EM) [pdf, html, other]
-
Title: Post-selection inference for network structureSubjects: Econometrics (econ.EM); Methodology (stat.ME)
Researchers often use the density of connections between groups of agents, such as communities, blocs, or markets, to characterize the structure of a social or economic network. In many cases, these groups are selected using the network data, making conventional fixed-group inference procedures potentially invalid. To address this issue, we develop two new confidence intervals that are universally valid post-selection in the sense that they guarantee simultaneous coverage asymptotically over all pairs of groups whose relative sizes do not vanish. Our first interval builds on a strategy of \cite{berk2013valid}. Our second interval is based on a Talagrand-type concentration inequality for empirical processes. Both intervals are simple to compute and scalable to large networks, but a key technical contribution of our paper is to show that only the second interval achieves the best-possible width asymptotically up to a constant factor. Three empirical illustrations show that accounting for selection can matter in practice. Some evidence for homophily in a social network and a hub-and-spoke structure in a trade network survives our correction, while evidence for disjoint market segments in a worker transition network does not.
- [36] arXiv:2607.00479 (cross-list from cs.LG) [pdf, other]
-
Title: Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain GeneralizationSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Transformer-based large models have demonstrated remarkable generalization abilities across different tasks by leveraging a context-aware attention module for in-context learning. With richer context, transformers adapt more effectively to the current use case without any parameter updates. However, the quadratic computational and memory complexity with respect to context length significantly slows data processing in softmax transformers. Linear transformers were proposed to address this issue by reducing the complexity to linear dependence on context length, but the design and understanding of the feature mapping in linear attention, from a theoretical viewpoint, remain unclear. In this paper, we investigate the approximation and generalization abilities of linear transformers under a two-staged sampling process from domain generalization. We show that linear transformers perform in-context learning as learning a mapping from context distributions to response functions. A dimension-independent convergence rate is obtained for our generalization analysis, which also exhibits the tradeoff between the regularities of data distributions and latent features. Guided by our theoretical framework, we propose a new perspective on activation and loss design for linearizing pretrained softmax large language models.
- [37] arXiv:2607.00510 (cross-list from cs.LG) [pdf, html, other]
-
Title: Prototype Language ModelsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Knowing which training examples drive outputs is fundamental to auditing, correcting, and understanding language models, yet for modern LLMs this remains expensive, approximate, and largely post-hoc. Standard language models generate tokens through a dense network pathway, causing training data's influence to be distributed across parameters rather than organized along explicit, traceable components. We introduce a prototype language model architecture, Prototypes for Interpretable Sequence Modeling (PRISM), that forms each prediction via a sparse, non-negative mixture of learned prototypes, trained with clustering objectives that anchor each prototype to coherent neighborhoods of training examples. Across architectures from 130M to 1.6B parameters trained on up to 50B tokens, prototype language models either surpass or remain within 2.5 percentage points on average downstream accuracy of matched dense baselines. We show that sparse prototype structure localizes curvature in the loss landscape, yielding a more tractable Hessian and enabling training data attribution that is ~500x faster than post hoc baselines when consuming equivalent memory. Calibrating linear prototype controllers can improve downstream accuracy by roughly 3 points while tracing those corrections back to training neighborhoods, and targeted prototype suppression can remove model behaviors without finetuning or measurable loss in generation quality.
- [38] arXiv:2607.00512 (cross-list from cs.LG) [pdf, html, other]
-
Title: From Structural Equation Modelling to Double Machine Learning: Robustness Analysis for Survey-Based ResearchComments: 21 pages, 1 figure, 13 tablesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Structural equation modelling (SEM) is widely used in survey-based business and information systems research to assess latent constructs and theory-driven structural relationships. However, SEM path significance is obtained within a particular model specification and may not show whether findings remain stable under alternative estimation frameworks. This study develops and demonstrates a staged robustness analysis framework that connects SEM, ordinary least squares (OLS) regression, and Double Machine Learning (DML). SEM is first used to refine the measurement structure and estimate the robustness-baseline SEM model, in which the full theory-specified structural path system is retained for downstream robustness analysis before final structural path evaluation. OLS regression is then applied to SEM-derived construct scores as a transparent regression benchmark. Finally, DML-style residualisation is used to examine whether each tested focal relationship remains stable after flexible machine-learning-based adjustment for observed controls. Learner-sensitivity checks compare Random Forest, Gradient Boosting, and Support Vector Machine learners, and selected reverse-direction diagnostics are used to examine directional sensitivity. The framework is demonstrated using a FinTech Digital Customer Intimacy survey model. The findings identify which relationships are stable across SEM, OLS, and DML-style checks, and which require more cautious interpretation. A reproducible Google Colab workbook and generated result files are publicly available, providing a reusable template that researchers and students can adapt to other survey-based latent-construct studies. The paper contributes a practical robustness workflow and interpretation guide for survey-based researchers seeking to complement SEM with conventional and machine-learning-based robustness checks.
- [39] arXiv:2607.00531 (cross-list from cs.LG) [pdf, html, other]
-
Title: Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular OptimizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy's own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative rather than restrictive throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SRxSim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.
- [40] arXiv:2607.00669 (cross-list from math.NA) [pdf, html, other]
-
Title: Convolutional Symmetric AutoEncoders: enhancing latent stability via differential geometryComments: 28 pages, 17 figuresSubjects: Numerical Analysis (math.NA); Machine Learning (stat.ML)
Autoencoders (AEs) have emerged as powerful tools for non-linear dimensionality reduction, often surpassing traditional linear methods such as Proper Orthogonal Decomposition (POD) in scenarios characterized by slowly decaying Kolmogorov $n$-widths. In the realm of Reduced-Order Modelling (ROM), these models are increasingly utilized to learn low-dimensional representations of solution manifolds associated with parametric Partial Differential Equations (PDEs). However, the high expressivity of AEs presents a challenge: although trained networks typically minimize reconstruction error, they often struggle to capture the essential properties necessary for building accurate and robust ROMs. Recent works by arXiv:2307.15288v2 and arXiv:2506.11641v1 have tackled this challenge in fully connected AEs by proposing representation-consistent architectures, which preserve some of the properties belonging to POD. This study builds upon that concept by extending representation consistency to convolutional layers. We introduce a novel class of symmetric Convolutional AutoEncoders (CAEs) designed to embody the primary properties of manifold parametrization mappings. When integrated into a ROM framework, this architecture demonstrates significantly improved predictive capabilities. Specifically, we compared the performance of the ROMs based on classical and symmetric CAEs on three one-dimensional academic test cases, namely the linear advection, viscous Burgers, and Kuramoto-Sivashinsky equations. Numerical results demonstrate that our proposed symmetric approach consistently yields more accurate latent trajectories, lower reconstruction errors, and enhanced model robustness.
- [41] arXiv:2607.00897 (cross-list from cs.IT) [pdf, other]
-
Title: Recovery of Planted SubgraphsComments: COLT 2026; 101 pagesSubjects: Information Theory (cs.IT); Probability (math.PR); Statistics Theory (math.ST)
Understanding the fundamental limits of recovering planted subgraphs in random graphs is a central challenge in high-dimensional statistics and theoretical computer science. While existing work has largely focused on special subgraph families such as cliques, bicliques, or dense blocks, the exact recovery of a general planted subgraph in Erdős-Rényi random graphs remains poorly understood. In this paper, we study the exact recovery of an arbitrary planted subgraph $\Gamma = \Gamma_n$ embedded in a dense Erdős-Rényi random graph $\mathcal{G}(n,q_n)$, where edges within $\Gamma$ are present independently with probability $p_n > q_n$.
Our main results identify sharp conditions under which exact recovery is possible with high probability, and we establish matching lower bounds showing the necessity of these conditions. The resulting statistical threshold is characterized by a new graph-theoretic quantity, which we term the \emph{minimal maximum subgraph density}. This quantity is defined as the maximum subgraph density of the smallest induced balanced subgraph of $\Gamma$.
We then turn to the problem of recovery under polynomial-time constraints. We propose a computationally efficient recovery algorithm that applies to arbitrary planted subgraphs and analyze its performance in terms of certain spectral properties of the adjacency matrix. In addition, we derive computational lower bounds for recovery using the low-degree polynomial framework, establishing regimes where recovery is statistically possible but computationally hard. Finally, we consider several extensions of our setting, including recovery in semi-random models and weaker notions of recovery.
- [42] arXiv:2607.01171 (cross-list from cs.LG) [pdf, html, other]
-
Title: Decision-Aware Training for Sample-Based Generative ModelsSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Sample-based generative models are increasingly used for probabilistic forecasting in high-stakes decision settings, yet their training objectives are blind to the decision maker's cost structure. These models are commonly trained with strictly proper scoring rules, such as the energy score, which allocate their training signal in proportion to data density, with no awareness of where forecast errors are most costly for downstream decisions. We therefore propose decision-aware training for sample-based generative models, augmenting the energy score objective with a differentiable decision loss that directly penalises the cost incurred by acting on the model's forecast. This combined loss is theoretically grounded, as the decision loss is itself a proper scoring rule. We validate our method on one synthetic and two real-world tasks, showing targeted improvements in cost-sensitive regions while retaining full probabilistic forecasts.
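The energy score mentioned above can be estimated directly from model samples. The sketch below uses the standard sample form $\frac{1}{m}\sum_i \lVert x_i - y\rVert - \frac{1}{2m^2}\sum_{i,j}\lVert x_i - x_j\rVert$; it includes the $i=j$ terms for simplicity, which is one of several common estimator variants. The paper's decision loss is not reproduced here, and `energy_score` is an illustrative name.

```python
from math import dist


def energy_score(samples, y):
    """Sample-based energy score of a forecast (list of sample vectors)
    against one observed outcome y. Lower is better."""
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one sample")
    # Accuracy term: average distance from the samples to the observation.
    term_obs = sum(dist(x, y) for x in samples) / m
    # Spread term: average pairwise distance among the samples (V-statistic).
    term_pair = sum(dist(a, b) for a in samples for b in samples) / (m * m)
    return term_obs - 0.5 * term_pair


if __name__ == "__main__":
    forecast = [[0.0], [2.0]]
    print(energy_score(forecast, [1.0]))  # observation between the samples
    print(energy_score(forecast, [5.0]))  # observation far from the samples
```

Both terms depend only on how the samples sit relative to the observation and to each other, so the score by itself does not reflect where errors are costly for a downstream decision, which is the gap the abstract targets.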
Cross submissions (showing 15 of 15 entries)
- [43] arXiv:2406.00730 (replaced) [pdf, html, other]
-
Title: Assessing survival models by interval testing with Poisson-binomial distributions
Comments: Main: 13 pages. Total: 15 pages
Subjects: Methodology (stat.ME)
Selecting appropriate parametric survival models is often a pivotal part of a regulatory submission for new pharmaceutical products. With recent developments in complex survival approaches, the number of suitable models is increasing, making model selection more challenging. Common approaches to model selection include AIC, BIC, and expert opinion on survival extrapolation. However, these approaches primarily assess relative goodness-of-fit, providing limited insight into where, and to what extent, a fitted model is incompatible with the observed data. We propose evaluating survival models using Poisson-binomial distributions across specified time intervals. Two interval selection approaches, censor-defined intervals and 10 evenly-spaced intervals, are presented with worked examples. A simulation exercise, targeting two proposed test statistics across 12 standard scenarios (with different data maturity and patient numbers), demonstrated that for every scenario the empirical Type I error did not exceed the nominal 5% level. Our proposed model selection technique goes beyond classical approaches by highlighting time intervals where models perform poorly.
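To illustrate the Poisson-binomial idea, here is a minimal sketch that assumes each at-risk patient in a time interval has a model-implied event probability `probs[i]`. The event count in the interval then follows a Poisson-binomial distribution. The two-sided tail check below is an illustrative choice, not one of the paper's two test statistics, and the function names are hypothetical.

```python
import numpy as np

def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli(p_i) trials.

    Built by repeated convolution: adding one trial with probability p
    convolves the current PMF with [1 - p, p].
    """
    pmf = np.array([1.0])
    for p in probs:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

def two_sided_p_value(probs, observed):
    """Doubled smaller tail probability of the observed event count (illustrative)."""
    pmf = poisson_binomial_pmf(probs)
    lower = pmf[: observed + 1].sum()   # P(N <= observed)
    upper = pmf[observed:].sum()        # P(N >= observed)
    return min(1.0, 2.0 * min(lower, upper))

# Usage: five patients at risk in one interval, three observed events.
interval_probs = [0.05, 0.10, 0.10, 0.20, 0.15]
print(two_sided_p_value(interval_probs, observed=3))
```

A small p-value in a given interval would flag that interval as one where the fitted model disagrees with the observed event count.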
- [44] arXiv:2406.11584 (replaced) [pdf, html, other]
-
Title: Modeling cyclicality and intransitivity in paired comparisons data
Comments: 49 pages, 5 tables, 3 Figures
Subjects: Methodology (stat.ME)
Paired comparison data arise in ranking problems, decision analysis, sports analytics, recommendation systems, and many other applications in which alternatives are evaluated by comparing two items at a time. Standard models typically impose a transitive preference profile induced by a vector of merits. In many empirical settings, however, preference relations exhibit cyclic and intransitive patterns that cannot be adequately represented by a global ranking. This paper develops a framework for modeling cyclicality and departures from transitivity. The proposed approach decomposes a preference profile into orthogonal transitive and cyclic components and provides a geometric characterization of the associated parameter space. The cyclic component is represented using an overcomplete dictionary of elementary cycles, so that identifying cyclic structure and the intransitivities it may induce becomes a sparse model selection problem. We propose a method for recovering sparse cyclic structure and establish large-sample guarantees for estimation and model recovery. The analysis clarifies the relationship between cyclicality, intransitivity, and several notions of transitivity used in paired comparison theory. By explicitly modeling cyclic structure, the proposed framework can improve estimation, ranking, interpretation, and prediction. The methodology is evaluated through simulations and illustrated with an empirical application.
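One standard way to split a preference profile into orthogonal transitive and cyclic parts is the least-squares projection sketched below. It assumes the profile is stored as a skew-symmetric matrix `M` over all pairs (for example, pairwise log-odds). The function name is hypothetical, and the paper's own construction, including its dictionary of elementary cycles, may differ.

```python
import numpy as np

def transitive_cyclic_split(M):
    """Split a skew-symmetric preference matrix into transitive and cyclic parts.

    The transitive part has the form T[i, j] = s[i] - s[j] for a merit vector s.
    For a complete comparison design, the least-squares merits are the row means of M.
    The cyclic part is the residual, which has zero row sums.
    """
    M = np.asarray(M, dtype=float)
    if not np.allclose(M, -M.T):
        raise ValueError("M must be skew-symmetric")
    merits = M.mean(axis=1)
    transitive = merits[:, None] - merits[None, :]
    cyclic = M - transitive
    return merits, transitive, cyclic

# Usage: a transitive profile from merits (1, 0, -1) plus a rock-paper-scissors cycle.
s_true = np.array([1.0, 0.0, -1.0])
T_true = s_true[:, None] - s_true[None, :]
C_true = 0.5 * np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]], dtype=float)
merits, T, C = transitive_cyclic_split(T_true + C_true)
print(merits)            # estimated merit vector
print(np.sum(T * C))     # Frobenius inner product of the two parts
```

The two parts are orthogonal in the Frobenius inner product, so the squared norm of `M` splits into a transitive share and a cyclic share.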
- [45] arXiv:2410.02050 (replaced) [pdf, html, other]
-
Title: A fast, flexible simulation framework for Bayesian adaptive designs -- the R package BATSS
Subjects: Computation (stat.CO); Methodology (stat.ME)
The use of Bayesian adaptive designs for randomised controlled trials has been hindered by the lack of software readily available to statisticians. We have developed a new software package (Bayesian Adaptive Trials Simulator Software - BATSS) for the statistical software R, which provides a flexible structure for the fast simulation of Bayesian adaptive designs for clinical trials. We illustrate how the BATSS package can be used to define and evaluate the operating characteristics of Bayesian adaptive designs for various types of primary outcomes (e.g., those that follow a normal, binary, Poisson or negative binomial distribution) and can incorporate the most common types of adaptations: stopping treatments (or the entire trial) for efficacy or futility, and Bayesian response adaptive randomisation - based on user-defined adaptation rules. Other important features of this highly modular package include: the use of (Integrated Nested) Laplace approximations to compute posterior distributions, parallel processing on a computer or a cluster, customisability, adjustment for covariates and a wide range of available conditional distributions for the response.
- [46] arXiv:2504.15388 (replaced) [pdf, other]
-
Title: Deep learning with missing data
Comments: 57 pages, 13 figures
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
In the context of multivariate nonparametric regression with missing covariates, we propose Pattern Embedded Neural Networks (PENNs), which can be applied in conjunction with any existing imputation technique. In addition to a neural network trained on the imputed data, PENNs pass the vectors of observation indicators through a second neural network to provide a compact representation. The outputs are then combined in a third neural network to produce final predictions. Our main theoretical result exploits an assumption that the observation patterns can be partitioned into cells on which the Bayes regression function behaves similarly, and belongs to a compositional Hölder class. It provides a finite-sample excess risk bound that holds for an arbitrary missingness mechanism, and in combination with a complementary minimax lower bound, demonstrates that our PENN estimator attains in typical cases the minimax rate of convergence as if the cells of the partition were known in advance, up to a poly-logarithmic factor in the sample size. Numerical experiments on simulated, semi-synthetic and real data confirm that the PENN estimator consistently improves, often dramatically, on standard neural networks without pattern embedding. Code to reproduce our experiments, as well as a tutorial on how to apply our method, is publicly available.
- [47] arXiv:2506.23213 (replaced) [pdf, html, other]
-
Title: Nuisance parameters and elliptically symmetric distributions: a geometric approach to parametric and semiparametric efficiency
Subjects: Statistics Theory (math.ST); Signal Processing (eess.SP)
Elliptically symmetric distributions are a classic example of a semiparametric model where the location vector and the scatter matrix (or a parameterization of them) are the two finite-dimensional parameters of interest, while the density generator represents an \textit{infinite-dimensional nuisance} term. This basic representation of the elliptic model can be made more accurate, rich, and flexible by considering additional \textit{finite-dimensional nuisance} parameters. Our aim is therefore to investigate the deep and counter-intuitive links between statistical efficiency in estimating the parameters of interest in the presence of both finite and infinite-dimensional nuisance parameters. Previous seminal works have addressed this problem by leveraging a general result: if the statistical model has a specific group invariance, then the projection operator onto the semiparametric nuisance tangent space can be asymptotically expressed as a conditional expectation with respect to the maximal invariant sub-$\sigma$-algebra. In this article, we show that, for the statistical model of elliptical distributions, the projection operator can be explicitly computed without relying on the above-mentioned asymptotic approximation. This allows us to obtain original results also for the case in which the location vector and the scatter matrix are parameterized by a finite-dimensional vector that can be partitioned in two sub-vectors: one containing the parameters of interest and the other containing the nuisance parameters. As an example, we illustrate how the obtained results can be applied to the well-known "low-rank" parameterization. Furthermore, while the theoretical analysis will be developed for Real Elliptically Symmetric (RES) distributions, we show how to extend our results to the case of Circular and Non-Circular Complex Elliptically Symmetric (C-CES and NC-CES) distributions.
- [48] arXiv:2508.00937 (replaced) [pdf, html, other]
-
Title: A General Approach to Visualizing Uncertainty in Statistical Graphics
Subjects: Methodology (stat.ME); Graphics (cs.GR); Machine Learning (cs.LG)
We present a general approach to visualizing uncertainty in static 2-D statistical graphics. If we treat a visualization as a function of its underlying quantities, uncertainty in those quantities induces a distribution over images. We show how to aggregate these images into a single visualization that represents the uncertainty. The approach can be viewed as a generalization of sample-based approaches that use overlay. Notably, standard representations, such as confidence intervals and bands, emerge with their usual coverage guarantees without being explicitly quantified or visualized. As a proof of concept, we implement our approach in the IID setting using resampling, provided as an open-source Python library. Because the approach operates directly on images, the user needs only to supply the data and the code for visualizing the quantities of interest without uncertainty. Through several examples, we show how both familiar and novel forms of uncertainty visualization can be created. The implementation is not only a practical validation of the underlying theory but also an immediately usable tool that can complement existing uncertainty-visualization libraries.
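The image-aggregation idea in this abstract can be sketched with a toy renderer. Everything below is an assumption for illustration: `render_mean_marker` is a hypothetical one-row "plot" that lights the pixel at the sample mean, and pixelwise averaging over bootstrap resamples is just one possible aggregation rule. The authors' Python library and its API are not reproduced here.

```python
import numpy as np

def render_mean_marker(data, width=50, lo=0.0, hi=1.0):
    """Hypothetical renderer: a 1-D image with a single lit pixel at the sample mean."""
    img = np.zeros(width)
    pos = (np.mean(data) - lo) / (hi - lo)
    col = int(np.clip(round(pos * (width - 1)), 0, width - 1))
    img[col] = 1.0
    return img

def aggregate_bootstrap_images(data, render, n_boot=200, seed=0):
    """Render each bootstrap resample and average the resulting images pixelwise."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    images = [
        render(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ]
    return np.mean(images, axis=0)

# Usage: the aggregated image spreads intensity over plausible positions of the mean.
sample = np.linspace(0.2, 0.8, 30)
image = aggregate_bootstrap_images(sample, render_mean_marker)
print(np.nonzero(image)[0])   # pixel columns that received any intensity
```

Because the renderer is passed in as a function, the same aggregation loop applies to any plot that can be rasterized to an array of fixed shape.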
- [49] arXiv:2510.06995 (replaced) [pdf, html, other]
-
Title: Root Cause Analysis of Outliers in Unknown Cyclic Graphs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
We study the propagation of outliers in cyclic causal graphs with linear structural equations, tracing them back to one or several "root cause" nodes. We show that it is possible to identify a short list of potential root causes provided that the perturbation is sufficiently strong and propagates according to the same structural equations as in the normal mode. This shortlist consists of the true root causes together with those of their parents that lie on a cycle with them. Notably, our method does not require prior knowledge of the causal graph and yields encouraging results on simulated data and real data from biology and cloud computing.
- [50] arXiv:2511.01960 (replaced) [pdf, html, other]
-
Title: Unifying Statistical and Mathematical Modeling Through a Causal Inference Lens
Subjects: Other Statistics (stat.OT); Other Quantitative Biology (q-bio.OT)
Within the biological, physical, and social sciences, there are two broad quantitative traditions: statistical and mathematical modeling. Both traditions have the common pursuit of advancing our scientific knowledge, but these traditions have developed largely with distinct languages and inferential frameworks. This paper uses the notion of identification from causal inference, a field originating from the statistical modeling tradition, to develop a shared language. I first review foundational identification results for statistical models and then extend these ideas to mathematical models. Central to this framework is the use of bounds, ranges of plausible numerical values, to analyze both statistical and mathematical models. I discuss the implications of this perspective for the interpretation, comparison, and integration of different modeling approaches, and illustrate the framework with a simple pharmacodynamic model for hypertension. To conclude, I describe areas where the approach taken here should be extended in the future. By formalizing connections between statistical and mathematical modeling, this work contributes to a shared framework for quantitative science. My hope is that this work will advance interactions between these two traditions.
- [51] arXiv:2511.21534 (replaced) [pdf, html, other]
-
Title: A Sensitivity Analysis Framework for Causal Inference Under Interference
Subjects: Methodology (stat.ME)
In many applications of causal inference, the treatment received by one unit may influence the outcome of another, a phenomenon referred to as interference. Although there are several frameworks for conducting causal inference in the presence of interference, practitioners often lack the data necessary to adjust for its effects. In this paper, we propose a weighting-based sensitivity analysis framework that can be used to assess the systematic bias arising from ignoring interference. Unlike most of the existing literature, we allow for the presence of unmeasured confounding, and show that the combination of interference and unmeasured confounding is a notable challenge to causal inference. We also study a third factor contributing to systematic bias: lack of transportability. Our framework enables practitioners to assess the impact of these three issues simultaneously through several easily interpretable sensitivity parameters that can reflect a wide range of intuitions about the data.
- [52] arXiv:2512.24152 (replaced) [pdf, other]
-
Title: Fast Score-Based Sampling via Log-Concave Reductions
Comments: Accepted to the COLT 2026 Conference, San Diego, CA
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Sampling based on score diffusions has led to striking empirical results, and has attracted considerable attention from various research communities. It depends on availability of (approximate) Stein score functions for various levels of additive noise. We show how in some generality, the availability of scores allows the general problem to be "reduced" to sampling from an adaptively constructed sequence of $K$ strongly log-concave (SLC) sub-problems. The reduction is simple, constructive and algorithm-independent, so that any SLC sampler can be used as a subroutine. Various bounds on score-based sampling complexity follow directly: for instance, high-accuracy SLC samplers yield $\tilde{\mathcal{O}}(K \sqrt{d} \operatorname{polylog}(1/\varepsilon))$ guarantees for accuracy $\varepsilon$ in dimension $d$, where randomized midpoint SLC schemes yield $\tilde{\mathcal{O}}(K d^{1/3} \operatorname{poly}(1/\varepsilon))$ guarantees. When the original distribution itself is SLC, we prove that $K \leq 1 + \log_2(\kappa)$, thereby obtaining the first efficient procedure with logarithmic dependence on condition number $\kappa$; for general distributions, the quantity $K$ depends on the geometry of score Hessian across the trajectory. Our analysis is direct and simple, involving techniques and insights complementary to those in standard analyses of discretized diffusions.
- [53] arXiv:2601.07668 (replaced) [pdf, other]
-
Title: The Role of Confounders and Linearity in Ecological Inference: A Reassessment
Comments: 41 pages, revised. July 2026
Subjects: Applications (stat.AP)
Estimating conditional means using only the marginal means available from aggregate data is known as the ecological inference problem. We reassess this literature, arguing that it has understudied two issues: how practitioners should control for confounding, and how methodologists can leverage the linearity inherent in the structure of the problem. On the former, we formalize ignorability conditions like those in causal inference and outline consistent plug-in estimators: These are credible when covariates make the ignorability condition plausible. On the latter, we show that aggregation restricts the target function to be partially linear. Such linearity clarifies the connections between King's (1997) methodology, its predecessors, and subsequent developments. That motivates a recent doubly-robust technique that enters covariates flexibly while leveraging linearity. Finally, we test these methods in datasets where the ground truth is fortuitously observed. In these common applications, all methods tested were prone to overestimating racial polarization and underestimating split-ticket voting.
- [54] arXiv:2603.00827 (replaced) [pdf, html, other]
-
Title: Minimax convergence rates of a binary plug-in type classification procedure for time-homogeneous SDE paths under low-noise conditions
Comments: 41 pages
Subjects: Statistics Theory (math.ST)
The study of minimax convergence rates for classification procedures adapted to SDE paths is rarely addressed in the literature. Only one paper established optimal convergence rates for a binary classifier for SDE paths constructed from the white noise model. In this paper, we consider a more complex diffusion model with space-dependent drift and diffusion coefficients where the drift depends on the class and the diffusion coefficient is common to all classes. We establish, under the low-noise condition, a faster convergence rate over a Hölder space. This result will require the establishment of an exponential inequality, which is essential to obtain the expected rate. We then study the lower bound on the excess risk of the empirical classifier.
- [55] arXiv:2603.20467 (replaced) [pdf, html, other]
-
Title: Goal-oriented learning of stochastic differential equations using error bounds on path-space observables
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Stochastic differential equations (SDEs), which serve as the governing equations for dynamical systems in a broad range of applications, can become cost-prohibitive for numerical simulation at scales necessary for quantifying key properties. Surrogate models of the drift function of an SDE, learned from data of the high-fidelity system, are routinely used to increase the efficiency of simulation and prediction of properties. However, standard choices of loss function for learning the surrogate model fail to provide error guarantees in certain path-dependent observables, such as transition times. This paper introduces an error bound for path-space observables and employs it as a novel variational loss for the goal-oriented learning of the drift function of an SDE. We show the error bound holds for a broad class of observables, including mean first hitting times on unbounded time domains. We derive an analytical gradient of the goal-oriented loss by leveraging the formula for Fréchet derivatives of expected path functionals, which remains tractable for implementation in stochastic gradient descent schemes. We demonstrate that surrogate models of overdamped Langevin systems developed via goal-oriented learning achieve improved accuracy in predicting the statistics of a first hitting time observable and robustness to distributional shift in the data.
- [56] arXiv:2604.18742 (replaced) [pdf, html, other]
-
Title: JASPER: Joint Bayesian Analysis of Spatial Expression via Regression
Comments: 43 pages; 5 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)
Spatially resolved transcriptomics is a fast-developing set of technologies that enables the measurement of localized gene expression across spatial locations in a sample. Detecting spatially varying genes is critical for analyzing such data, yet existing methods often fail to account for inter-gene correlations, leading to inflated false positive and false negative rates. Additionally, most prominent methods rely on predefined spatial covariance kernels, making them sensitive to the complexity of spatial expression patterns. Motivated by a human breast cancer dataset, we address these limitations in existing literature through JASPER (Joint Bayesian Analysis of SPatial Expression via Regression), a Bayesian framework that jointly models spatial expression patterns across multiple genes using a spatial basis function regression approach. We demonstrate the superior performance of JASPER compared to existing methods in several real-world spatial transcriptomic datasets and supporting simulation experiments. JASPER identifies genes with stronger spatial correlation and greater biological relevance, as validated by overlap comparison, enrichment analysis, and pathway analysis using independent biological databases. Our results highlight the ability of JASPER to improve the statistical and biological interpretability of spatial transcriptomics data, making it a powerful tool for uncovering spatial gene expression patterns in complex biological systems.
- [57] arXiv:2605.03264 (replaced) [pdf, html, other]
-
Title: Efficient Propose-Test-Release for Optimal Differentially Private Estimation
Comments: 20 pages, 3 figures
Subjects: Methodology (stat.ME)
Differential privacy (DP) is a rigorous framework that protects the participation of individuals in a dataset by controlling information leakage through released estimators. It brings a challenge for statisticians: DP uniformly considers all possible datasets, whereas statistical practice often downweights atypical or rare outcomes. The conceptual challenge is especially pronounced in sensitivity analysis, where atypical datasets introduce markedly high sensitivity, even for a basic estimator such as ordinary least squares. The standard DP recipe adds noise governed by this large overall sensitivity, which causes excessive loss in accuracy. We introduce an efficient Propose-Test-Release (ePTR) pipeline, which tests the dataset via a user-designed Safety Lower Bound, and then probabilistically releases the estimator based on the local sensitivity level. This flexible pipeline enables substantially simpler DP mechanisms for many problems. To illustrate, we study basic estimators for Bayes classification, linear regression, and kernel regression. Each estimator can be highly sensitive to atypical datasets, yet admits simple ePTR-based algorithms that achieve minimax optimality. In numerical studies, these ePTR estimators demonstrate improved accuracy against popular DP baselines under privacy guarantees.
- [58] arXiv:2605.26608 (replaced) [pdf, html, other]
-
Title: Maximum-Likelihood Estimation of Hyperedge-Triggered Hawkes Processes via a Closed-Form EM Algorithm
Comments: 13 pages, 6 figures, 2 tables; revised version with updated figures and layout
Subjects: Methodology (stat.ME)
Hypergraph effects in event streams are difficult to estimate because a group-level burst can often be explained either by direct higher-order excitation or by a collection of ordinary pairwise Hawkes interactions. This paper studies maximum-likelihood estimation for a hyperedge-triggered Hawkes process, in which the conditional intensity is excited both by individual past events and by the completion of a multi-node firing pattern within a short temporal window. We derive a closed-form EM algorithm based on latent branching responsibilities and a piecewise compensator for the most-recent-anchor hyperedge mechanism. The compensator corrects the naive integral that overcounts superseded pattern completions. For independently parameterised candidate hyperedges, the EM updates are closed form; when a low-rank CP parameterisation is imposed, the hyperedge factors are updated by block-coordinate ascent on the same expected complete-data objective, yielding a generalised EM implementation. Synthetic experiments show near-unbiased recovery under a time-rescaling-validated simulator, stable EM convergence, identifiable trigger-window structure, and the expected O(n^2) event-count scaling of the prototype implementation. The main statistical limitation is not numerical optimisation but identifiability: when pairwise and hyperedge components are supported on the same co-firing events, likelihood gains can be hard to attribute. Held-out analyses on retina and primary visual-cortex spike-train datasets show stable positive candidate-count BIC differences for the two cortical datasets and more fragile evidence for the retina dataset as the candidate set expands. Code and reproducibility scripts are available at this https URL.
- [59] arXiv:2605.29200 (replaced) [pdf, html, other]
-
Title: Approximating full conformal prediction: distribution free guarantees via the tournament correction
Comments: 23 pages, 2 figures
Subjects: Methodology (stat.ME)
Conformal prediction is a framework for providing prediction intervals with distribution-free validity, guaranteeing predictive coverage for data drawn from any distribution. Its two main variants are full conformal prediction and split conformal prediction (also called transductive and inductive). Full conformal prediction is widely considered to be statistically more efficient (since split conformal prediction requires data splitting, and therefore can lead to wider prediction intervals due to the resulting loss in sample size), but its implementation is computationally prohibitive, as it requires the underlying model to be refit for every candidate value in the response space. Existing computational shortcuts, such as using a discrete grid of values to approximate the full conformal prediction construction, frequently lack theoretical guarantees on marginal coverage and can fail in practice.
To address this limitation, we introduce a novel class of approximations to the full conformal prediction method, based on the idea of \emph{tournaments}, which enables the construction of prediction sets with a rigorous marginal coverage guarantee of $1-2\alpha$. Under stability conditions, the theoretical coverage guarantee tightens to approximately $1-\alpha$. This new framework generalizes the existing method of leave-one-out cross-conformal prediction, while allowing for flexible use of various existing approximation strategies.
- [60] arXiv:2605.30253 (replaced) [pdf, html, other]
-
Title: Wasserstein Contraction of Coordinate Ascent Variational Inference
Comments: 30 pages + 4 pages appendix, 3 figures. V3 includes new results on multi block algorithms, analysis on discrete spaces, and new applications
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC); Probability (math.PR); Computation (stat.CO)
We study the non-asymptotic contraction in Wasserstein distance of the sequential, parallel, and random-scan coordinate ascent variational inference algorithms. This is shown to hold under a functional smoothness condition of the optimality maps and a transportation-information inequality at their fixed points. Our results are sharp and general, and as opposed to those based on global strong log-concavity assumptions, they allow for local convergence on smooth, non-smooth, and discrete manifolds, including within the context of data augmentation. We consider many applications in statistical physics and Bayesian statistics. These include pairwise Markov Random field models such as Ising and Curie-Weiss, unbalanced Bayesian Gaussian Mixture Models, high-dimensional Bayesian Probit Regression, and high-dimensional Logistic Regression with Pólya-Gamma random variables (i.e. Jaakkola-Jordan's algorithm). In many of these models, these represent the first available convergence results of their kind.
- [61] arXiv:2606.20299 (replaced) [pdf, html, other]
-
Title: Statistical Properties of Training & Generalization
Comments: 32 pages, 3 figures. Part of the VERaiPHY initiative
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
Deep learning has managed to evade numerous intuitions from classical statistics to achieve unprecedented performance on a number of real-world tasks. In this article, we investigate the key features and surprises of deep learning from a physics-informed perspective, taking care to point out and justify where possible the many choices inherent in constructing a deep learning model. In particular, we review the phenomenon of neural scaling laws and discuss their interplay with the constraints and inductive biases which may be present when applying machine learning to problems in physics.
- [62] arXiv:2606.25169 (replaced) [pdf, other]
-
Title: Laplace-Fisher Gate Identities for Optimal Matrix-Gated Blended Score Estimation
Comments: Provisional report
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
Sampling from an unnormalized target by reversing an Ornstein-Uhlenbeck diffusion requires the score of each noise-perturbed marginal. Tweedie's identity and a target-score identity give unbiased finite-reference estimators for this score. Scalar blends can reduce variance, but are too rigid for singular or strongly anisotropic targets. We cast blended score estimation as conditional risk minimization over matrix-valued blending coefficients, or gates, and derive the variance-optimal gate
$G^*(y,t) = \alpha_t^2 \left(\alpha_t^2 I_d + \gamma_t \, \mathbb{E}[H_0(X_0) \mid Y_t = y]\right)^{-1}$, $H_0 = -\nabla^2 \log p_0$.
Here $\alpha_t = e^{-t}$ and $\gamma_t = 1 - e^{-2t}$. We call this formula the Laplace-Fisher Gate Identity (LFGI). Since the Tweedie-TSI disagreement has conditional mean zero, the gate changes estimator variance without changing its expected value. We give the Gaussian special case and prove finite-reference consistency and stability bounds for estimating the gate from weighted reference samples.
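As a quick check of the gate formula, suppose the target is Gaussian, $p_0 = \mathcal{N}(\mu, \Sigma)$. The symbols $\mu$ and $\Sigma$ are assumptions introduced here for illustration, and the paper's own statement of the Gaussian case may be phrased differently. The Hessian $H_0$ is then constant, so the conditional expectation drops out:

```latex
\begin{align*}
H_0(x) &= -\nabla^2 \log p_0(x) = \Sigma^{-1}, \\
G^*(y,t) &= \alpha_t^2 \left(\alpha_t^2 I_d + \gamma_t \Sigma^{-1}\right)^{-1}
          = \alpha_t^2 \, \Sigma \left(\alpha_t^2 \Sigma + \gamma_t I_d\right)^{-1}, \\
\Sigma = I_d \;&\Longrightarrow\; G^*(y,t) = \frac{\alpha_t^2}{\alpha_t^2 + \gamma_t} I_d = e^{-2t} I_d,
\quad \text{since } \alpha_t^2 + \gamma_t = e^{-2t} + 1 - e^{-2t} = 1.
\end{align*}
```

In this sketch the gate does not depend on $y$, equals $I_d$ at $t = 0$, and shrinks to zero as $t \to \infty$.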
We apply the finite-reference LFGI estimator to normalized density evaluation for Bayesian inverse problems. When MCMC pilot samples and derivative information are available, LFGI uses these byproducts to construct a normalized posterior-density surrogate. The surrogate enables posterior-energy evaluation, model-evidence estimation, and downstream density-based diagnostics. On a PDE-constrained inverse-problem benchmark, the LFGI surrogate improves posterior-density calibration and sampling diagnostics relative to the other tested score-estimator classes. Experiments using LFGI with known model evidence check absolute evidence calibration in both Gaussian and non-Gaussian settings.
- [63] arXiv:2606.31190 (replaced) [pdf, html, other]
-
Title: Semiparametric Efficiency in Sequential Experiments: Characterization and Design via Average Propensity
Subjects: Methodology (stat.ME)
Modern experiments, including evaluations of AI-enabled services and platform interventions, often depart from independent and identically distributed (i.i.d.) sampling because assignments may be adaptive, balanced across covariates, or subject to rollout constraints such as exposure, fairness, and budget limits. This paper studies the efficiency benchmark for estimating causal targets in such sequential experiments. We show that every non-anticipating design induces an average propensity score, and we establish a semiparametric lower bound: for regular locally unbiased estimators, attainable precision is bounded by the i.i.d. efficiency benchmark evaluated at this induced score. The average propensity score thereby serves as a common benchmark and design target, allowing sequential experimental design to be viewed as choosing or learning an efficient allocation rule, with operational constraints entering through the admissible set when present. We then develop implementable batched adaptive designs that approach this benchmark through two complementary mechanisms. The first uses regression adjustment based on efficient influence functions; for general smooth estimands it attains the benchmark under standard nuisance-rate conditions, while for linear functionals of outcome means it achieves a sharp second-order rate. The second uses adaptive covariate balancing to attain the same benchmark through the assignment mechanism, enabling simple moment-based estimation. Both routes require only a small number of policy updates, making them compatible with delayed feedback and easier to monitor in operational deployments. Numerical experiments and an empirical study of AI medical-assistant evaluation demonstrate the practical efficiency gains, including in multi-treatment settings. Overall, the paper provides a unified framework for characterizing and designing efficient sequential experiments.
- [64] arXiv:2212.04814 (replaced) [pdf, html, other]
-
Title: The Generalized Falsification Adaptive Set for Violations of the Exclusion Restriction and Exogeneity
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
The falsification adaptive set (FAS) as proposed by Masten and Poirier (2021) provides an identified set for a treatment effect when the baseline model is falsified, assuming invalid instruments violate exclusion only. We show that whether an invalid instrument is a confounder or collider has important consequences: incorrect treatment can cause the FAS to exclude the true parameter. We derive pattern-specific falsification adaptive sets for each combination of violations and propose a generalized FAS as their union, containing the true parameter value if any instrument is valid. We illustrate our results with the roads and trade application of Duranton et al. (2014).
- [65] arXiv:2504.06299 (replaced) [pdf, html, other]
-
Title: Explainability in multimodal deep transformation models for stroke outcome prediction
Lisa Herzog, Jonas Brändli, Maurice Schneeberger, Loran Avci, Nordin Dari, Martin Hänsel, Hakim Baazaoui, Pascal Bühler, Susanne Wegener, Beate Sick
Comments: Accepted at MICCAI 2026
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
Multimodal prediction models based on imaging and clinical data are increasingly used for clinical decision support, yet their interpretability remains limited. We present multimodal Deep Transformation Models (DTMs) combining statistical approaches and neural networks to achieve strong predictive performance while preserving interpretability for tabular data. A key contribution of this work is the adaption of the xAI methods Grad-CAM and Occlusion to DTMs relying on 3D CNNs, enabling interpretation of the image branch through the generation of explanation maps. We developed DTMs to predict functional independence three months after stroke using diffusion-weighted imaging and clinical data from 407 patients. In a ten-fold cross-validation, the models achieved state-of-the-art predictive performance (AUC 0.81 [0.75, 0.87]) while maintaining interpretability for tabular features, with functional independence before stroke and stroke severity on admission emerging as the strongest predictors. Explanation maps from both xAI methods highlighted consistent regions, including frontal lobe areas which are known to be associated with age, a strong predictor of functional outcome. Notably, these regions disappeared once age was included as an explicit tabular predictor. Similarity analyses of explanation maps revealed distinct spatial patterns, providing meaningful insights into stroke pathophysiology, systematic error analysis and hypothesis generation.
- [66] arXiv:2504.09951 (replaced) [pdf, html, other]
-
Title: Towards Weaker Variance Assumptions for Stochastic Optimization
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.
- [67] arXiv:2509.19235 (replaced) [pdf, html, other]
-
Title: On the Performance of THz Wireless Systems over $\alpha$-$\mathcal{F}$ Channels with Beam Misalignment, Mobility and Hardware Impairments
Subjects: Signal Processing (eess.SP); Statistics Theory (math.ST)
This paper investigates the performance of terahertz (THz) wireless systems over the $\alpha$-$\mathcal{F}$ fading channels with beam misalignment, mobility and hardware impairments. New expressions are derived for the probability density, cumulative distribution, and higher-order moments of the instantaneous signal-to-noise ratio (SNR). Building upon the aforementioned expressions, we extract novel formulas for the outage probability (OP), average symbol error probability, and average channel capacity. Asymptotic expressions are also derived, providing useful insights into system performance in the high-SNR regime. Furthermore, an upper bound on the capacity metric is obtained. Monte Carlo simulation results are presented to validate the developed analytical framework.
- [68] arXiv:2603.08001 (replaced) [pdf, html, other]
-
Title: Amortized Maximum Inner Product Search with Learned Support Functions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of a vector taken within a database (the keys) that best aligns with a given query. We propose amortized MIPS: a regression-based approach that trains neural networks to directly predict MIPS solutions, amortizing the cost of repeatedly solving MIPS for queries drawn from a known distribution over a fixed key database. Our key insight is that the MIPS value function is the support function of the set of keys, a well-studied convex function whose gradient yields the optimal key. This motivates two complementary amortized models: SupportNet, an input-convex neural network trained to regress the support function, and KeyNet, a vector-valued network that directly regresses the optimal key. SupportNet can serve as a cluster router, steering queries toward relevant database partitions, while KeyNet can be used as a drop-in replacement for the original query, fed directly to off-the-shelf indexing pipelines. Our experiments on the BEIR benchmark show that, for document embeddings, learned SupportNets and KeyNets significantly improve IVF match rates when accounting for compute effort, whether measured in FLOPs, number of probes, or wall-clock time. Our code is available at: this https URL.
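To make the support-function view concrete, the following is a minimal brute-force sketch, not the paper's SupportNet or KeyNet. It assumes a tiny hand-picked key set and hypothetical helper names (`inner`, `support_function`, `numerical_gradient`). It evaluates h(q) = max_k <q, k> exactly and checks numerically that the gradient of h at a query with a unique maximizer is that maximizing key.

```python
def inner(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def support_function(query, keys):
    """Return (h(q), best key), where h(q) = max over keys of <q, k>."""
    best_key = max(keys, key=lambda k: inner(query, k))
    return inner(query, best_key), best_key


def numerical_gradient(f, q, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at q."""
    grad = []
    for i in range(len(q)):
        plus = list(q)
        minus = list(q)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


# Hypothetical toy database of keys and a single query (illustration only).
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0), (0.6, 0.7)]
query = [2.0, 1.0]

value, best = support_function(query, keys)
grad = numerical_gradient(lambda q: support_function(q, keys)[0], query)

print("MIPS value:", value)
print("optimal key:", best)
print("gradient of support function:", grad)
```

In this toy setting the finite-difference gradient matches the brute-force optimal key up to numerical error, which is the property that motivates regressing either the support function or the key directly.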
- [69] arXiv:2604.14669 (replaced) [pdf, html, other]
-
Title: Zeroth-Order Optimization at the Edge of Stability
Comments: ICML 2026
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
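As an illustration of the kind of method discussed, here is a minimal sketch of one common form of the two-point estimator (a central difference along a random Gaussian direction) and a plain ZO-GD loop. The quadratic loss, its diagonal Hessian (1, 4), the smoothing radius `mu`, the step size, and all function names are assumptions chosen for this example; the paper's exact estimator and stability condition are not reproduced here.

```python
import random


def quadratic_loss(x, hessian_diag=(1.0, 4.0)):
    """Toy loss f(x) = 0.5 * sum_i h_i * x_i^2 with an assumed diagonal Hessian."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(hessian_diag, x))


def two_point_estimate(f, x, u, mu=1e-3):
    """Central-difference two-point estimate: ((f(x+mu*u) - f(x-mu*u)) / (2*mu)) * u."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_gd(f, x0, step_size, num_steps, seed=0):
    """Zeroth-order gradient descent using only function evaluations."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(num_steps):
        # Fresh standard Gaussian search direction at every step.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        g = two_point_estimate(f, x, u)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


x0 = [1.0, 1.0]
x_final = zo_gd(quadratic_loss, x0, step_size=0.01, num_steps=500)
initial_loss = quadratic_loss(x0)
final_loss = quadratic_loss(x_final)

print("initial loss:", initial_loss)
print("final loss:", final_loss)
```

The step size here is deliberately small so that the toy run is well inside the stable regime; increasing `step_size` is one way to experiment with where the iterates stop contracting on a given Hessian spectrum.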
- [70] arXiv:2606.03665 (replaced) [pdf, html, other]
-
Title: Sparse Tree-Based Aggregation for Time Series Regressions
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
High-dimensional time series regressions are often regularized to produce sparse coefficients. We show that temporal aggregation provides a powerful alternative to reduce dimensionality in high-order autoregressions and mixed-frequency regressions. To this end, we propose StarTime (Sparse Tree-based Aggregation for Time Series), a convex penalization method that uses a temporal tree to arrange lags hierarchically from high to low frequency. StarTime then flexibly determines whether coefficients are aggregated at possibly varying frequencies, sparse, or a combination thereof. We provide new error bounds for StarTime, demonstrate improved estimation accuracy and recovery of aggregation and sparsity in simulations relative to benchmarks, and illustrate StarTime's relevance for financial and macroeconomic applications.
- [71] arXiv:2606.10111 (replaced) [pdf, html, other]
-
Title: Nonlinear Bayesian Estimator for Parameter Learning: A Fixed-Point Characterization
Comments: 32 pages, 9 figures
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between estimating each component using the updated prior of the other, obtained from that component's plug-in estimate statistics from the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments, showing that the dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.
- [72] arXiv:2606.18218 (replaced) [pdf, other]
-
Title: Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds
Subjects: Probability (math.PR); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study finite-horizon queue peaks in generalized switches, a standard stochastic-network model in which many queues share constrained service resources. Arrivals may be dependent, nonstationary, and responsive to the system history; the only load condition is uniform interior slack, meaning the conditional mean arrival vector stays in a fixed contraction of the capacity region. We show that this slack reshapes the finite-time peak law for drift-minimizing scheduling policies such as MaxWeight. The square-root envelope that is sharp without slack persists only up to a geometry-dependent threshold; beyond that threshold, the running maximum grows only logarithmically with the horizon, both with high probability and in expectation.
The mechanism is self-normalization: in the current queue direction, the projected fluctuation scale is normalized by the stabilizing drift scale. This removes capacity geometry from the logarithmic coefficient, while geometry remains in the threshold. Matching lower bounds show that both the logarithmic term and a geometric threshold are unavoidable. When finite-time state-space collapse is available, the threshold can be sharpened using local bottleneck geometry. For generalized input-queued switches, we obtain finite-time peak bounds with tight logarithmic coefficients. Simulations illustrate the two-phase envelope, local geometric refinements, and variance-sensitive improvements predicted by the theory.
- [73] arXiv:2606.21639 (replaced) [pdf, html, other]
-
Title: A new classification method based on Minimum Spanning Trees
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Minimum Spanning Trees have been used in unsupervised learning, particularly in clustering tasks, due to their ability to recognize clusters by removing edges that are considered inconsistent in defining those clusters. This paper aims to study the use of Minimum Spanning Trees in supervised learning. Specifically, we propose a classification algorithm based on Minimum Spanning Trees. To improve its performance, we introduce a robust version of the method that is also computationally more efficient. We evaluate the effectiveness of our proposed method through an extensive simulation study. We also apply the proposed methodology to a real-world case study involving aircraft trajectories.
- [74] arXiv:2606.30789 (replaced) [pdf, html, other]
-
Title: Predictable GRPO: A Closed-Form Model of Training Dynamics
Rajat Ghosh, Datta Nimmaturi, Aryan Singhal, Vaishnavi Bhargava, Henry Wong, Johnu George, Debojyoti Dutta
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop a first-principles reduced-order model of these dynamics. Under a single mean-field assumption that summarizes the policy by its expected reward, we reduce the GRPO update to a stochastically forced damped oscillator whose mass, damping, and stiffness are fixed in closed form by the optimizer hyperparameters together with a single measured curvature scale -- momentum supplies the inertia, off-policy lag erodes the damping, and the group size enters, to leading order, as a noise temperature. The reduction has three consequences. First, it subsumes the empirical single-exponential saturation law as its overdamped limit, recasting the fitted plateau, timescale, and size exponent as the fixed point, inverse stiffness, and curvature-scaling exponent of the underlying potential, and adding, through the retained inertial term, the slow-start phase the single exponential cannot represent. Second, it yields predictions tied to independently measurable quantities rather than fitted ones: group-size invariance of the deterministic trajectory with a $1/G$ stationary fluctuation, a sharp stability threshold in the refresh interval, and an overdamped-to-oscillatory transition. Third, it furnishes diagnostics that separate failure modes a reward curve alone conflates -- reward hacking, advantage degeneracy, policy concentration, and dynamical instability. Across three models and two group sizes, the closed-form trajectory fits training reward to $R^2 \geq 0.91$ and the mean trajectory is group-size invariant to leading order -- on both the reward curve and out-of-distribution transfer to eight math benchmarks -- while the within-group reward spread retains a residual $G$-dependence that the leading-order temperature picture does not capture.