
Boosting K-nearest neighbor regression performance for longitudinal data through a novel learning approach

BMC Bioinformatics volume 26, Article number: 232 (2025)

Abstract

Background

Longitudinal studies often require flexible methodologies for predicting response trajectories based on time-dependent and time-independent covariates. To address the complexities of longitudinal data, this study proposes a novel extension of K-Nearest Neighbor (KNN) regression, referred to as Clustering-based KNN Regression for Longitudinal Data (CKNNRLD).

Methods

In CKNNRLD, data are first clustered using the KML algorithm (K-means for longitudinal data), and the nearest neighbors are then searched within the relevant cluster rather than across the entire dataset. The theoretical framework of CKNNRLD was developed and evaluated through extensive simulation studies. Ultimately, the method was applied to a real longitudinal spirometry dataset.

Results

Compared to the standard KNN, CKNNRLD demonstrated improved prediction accuracy, shorter execution time, and reduced computational burden. According to the simulation findings, the CKNNRLD method took less time than the standard KNN implementation for N > 100, and it predicted the longitudinal responses more accurately and precisely. For instance, CKNNRLD ran approximately 3.7 times faster than the standard KNN in the scenario with N = 2000, T = 5, D = 2, C = 4, E = 1, and R = 1. Since the KNN method needs all of the training data to identify the nearest neighbors, it becomes slow as the number of individuals in a longitudinal study increases (for N > 500).

Conclusion

The CKNNRLD algorithm significantly improves accuracy and computational efficiency for predicting longitudinal responses compared to traditional KNN methods. These findings highlight its potential as a valuable tool for researchers with large longitudinal datasets.

Background

Machine learning has revolutionized data analysis across various fields, including medical research, by enabling flexible, data-driven predictions and pattern recognition. Among machine learning approaches, non-parametric methods have gained significant attention due to their flexibility, as they require fewer assumptions compared to traditional parametric models. These methods are particularly useful when the underlying data distribution is complex or unknown [1].

One widely used non-parametric method for regression and classification is the K-Nearest Neighbors (KNN) algorithm. KNN is a simple yet effective method relying on distance-based similarity to make predictions. It has been extensively applied in cross-sectional data analysis, with numerous studies focusing on improving its performance. However, a significant drawback of KNN is that it is a "lazy learner," meaning it does not build a model during training but instead stores all training data and makes predictions at runtime. This results in high computational costs, especially for large datasets, because each new prediction requires scanning the entire dataset to identify the nearest neighbors. These inefficiencies have motivated various enhancements to the KNN algorithm [2,3,4].

Several studies have proposed modifications to improve KNN regression in cross-sectional settings. For example, Bharambe et al. (2012) introduced BINER, a regression algorithm that integrates binary search with KNN to enhance efficiency [5]. Dubey and Pudi proposed CLUEKR, a clustering-based KNN regression method that first groups data and then searches for neighbors within relevant clusters [2]. Other notable improvements include Ougiaroglou and Evangelidis’s homogeneous clustering-based KNN [6], Al-Helali et al.’s KNNVWC algorithm with variable-width clusters [7], and Song et al.’s approach of reducing the training data size to improve computational efficiency [8]. More recent advancements include feature-extraction techniques [9], adaptive clustering modifications [10], and refined nearest-neighbor selection [11]. While these methods improve KNN performance for cross-sectional data, they do not explicitly address the challenges of longitudinal data, where repeated measurements introduce intra-subject correlation.

Longitudinal studies play a crucial role in medical research, capturing multiple observations for each subject over time. Unlike cross-sectional data, longitudinal data exhibit intra-subject correlation, which must be accounted for in the analysis [12,13,14]. Traditional statistical models, such as linear mixed-effects models (LMMs) [15], generalized linear mixed-effects models (GLMMs) [8], and generalized estimating equations (GEEs) [16], have been developed to address this correlation structure. Over the past few decades, these methods have been refined, bridging the gap between theoretical advancements and practical applications [17]. However, with the increasing availability of computational power, machine-learning techniques have emerged as powerful alternatives for analyzing complex longitudinal data.

Despite the growing interest in machine learning for longitudinal analysis, direct applications of KNN regression to predict longitudinal responses have received limited attention. In many longitudinal studies, the objective is not only to interpret the relationship between independent and dependent variables but also to predict trends or trajectories of variables flexibly. To address this gap, we propose CKNNRLD (Clustering-based KNN Regression for Longitudinal Data). This algorithm improves KNN regression for longitudinal data by first clustering variable trajectories using the longitudinal k-means method (KML) before performing nearest-neighbor searches within relevant clusters. By structuring the search space in this manner, our approach enhances computational efficiency while preserving the flexibility of KNN regression and addressing intra-subject correlation.

Trajectory clustering in longitudinal studies is a well-established approach for understanding patterns of change over time. Several methods, including group-based trajectory modeling (GBTM), growth mixture modeling (GMM), and longitudinal k-means (KML), have been used for clustering trajectories [18]. Among these, KML has demonstrated consistent performance in recovering underlying clustering structures [18]. Genolini et al. (2010–2016) developed R packages to facilitate KML-based clustering for longitudinal data [19,20,21]. Leveraging KML for trajectory clustering allows CKNNRLD to enhance KNN regression by reducing intra-subject correlation while maintaining the advantages of a non-parametric approach.

The core contribution of this study is the generalization of the KNN regression framework to longitudinal data, achieved by integrating a time-aware clustering mechanism with trajectory-level distance metrics. To our knowledge, this specific formulation has not been addressed in the existing literature. In contrast to earlier methods, such as CLUEKR or traditional clustering-based KNN, our approach explicitly models temporal dynamics and intra-subject correlation, which are critical in longitudinal studies [2].

In the current study, we introduced and evaluated CKNNRLD through theoretical formulation, simulation studies, and real-world application to spirometry data from Iranian Bafq iron ore workers. The methods section details the theoretical framework, data simulation process, and implementation strategy. The results section presents the performance evaluation of CKNNRLD using simulated and real-world datasets, demonstrating its potential as an effective tool for analyzing longitudinal data.

Methods

Variable-trajectory

Let \(y_{it} \in {\mathbb{R}}\) denote the observed response value for subject \(i \in \left\{ {1, \ldots ,n} \right\}\) at time point \(t \in \left\{ {1,\ldots,T_{i} } \right\}\), where \(T_{i}\) is the number of observations for subject i. The response trajectory for subject i can be represented as the vector \(Y_{i} = (y_{i1} ,y_{i2} , \ldots ,y_{{iT_{i} }} )^{{\text{T}}}\).

The collection of these trajectories for all subjects can be organized in a response matrix.

\(Y = [Y_{1} ,Y_{2} ,...,Y_{n} ]^{{\text{T}}}\), with each row corresponding to a subject’s longitudinal response profile.

$$ Y = \begin{pmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{pmatrix} = \begin{bmatrix} y_{11} & \cdots & y_{1T_{1}} \\ y_{21} & \cdots & y_{2T_{2}} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nT_{n}} \end{bmatrix} $$
(1)

Distances in the D quantitative covariates and the T measurement times

Let \(X_{it}\) denote the covariate vector for subject i at time t, and \(X_{it}^{\left( k \right)}\) its k-th component. We use \(X_{i.}\) to represent the longitudinal covariate matrix for subject i and \(X_{..}\) to denote the full covariate matrix for all subjects across all time points.

Therefore, \(X_{it} = (X_{it}^{\left( 1 \right)} , X_{it}^{\left( 2 \right)} , \ldots , X_{it}^{\left( D \right)} ) \in {\mathbb{R}}^{D}\) denotes the D-dimensional covariate vector for subject i at time point t. To quantify the similarity between two such observations, a distance metric can be used. A common choice is the Euclidean distance, defined as follows:

$$ d_{t} (X_{it} ,X_{jt} ) = \sqrt {\sum\limits_{k = 1}^{D} {\left( {X_{it}^{(k)} - X_{jt}^{(k)} } \right)^{2} } } $$
(2)

Alternatively, the Mahalanobis distance can be used to account for correlation among the predictors:

$$ d\left( {X_{it} ,X_{jt} } \right) = \sqrt {(X_{it} - X_{jt} )^{T} \Sigma^{ - 1} (X_{it} - X_{jt} )} $$
(3)

where Σ is the covariance matrix of the D-dimensional covariates across subjects.

Let \( X_{i..} ,X_{j..} \in {\mathbb{R}}^{T \times D}\) denote the longitudinal covariate matrices for subjects i and j, respectively. Then \(d(X_{i..} ,X_{j..} )\) represents the distance between these two matrices, computed by suitably aggregating temporal or feature-wise distances. Two approaches can be used to define this distance. In the first, the covariate vectors at each of the T time points are compared, yielding T distances that are then combined using an aggregating function. In the second, the time series of each of the D covariates are compared, yielding D distances that are then combined using an aggregating function.

More formally, to compute a distance between \(X_{i..}\) and \(X_{j..}\) according to the first method, for each fixed time point t, let \(X_{it} ,X_{jt} \in {\mathbb{R}}^{D}\) denote the covariate vectors for subjects i and j, respectively, at time t. The distance between these vectors is defined as:

$$ d_{t} (X_{it} ,X_{jt} ) = \sqrt {\sum\limits_{k = 1}^{D} {\left( {X_{it}^{(k)} - X_{jt}^{(k)} } \right)^{2} } } $$
(4)

This corresponds to the Euclidean distance between the covariate vectors at time t for subjects i and j, extracted from their respective longitudinal matrices \(X_{i..}\) and \(X_{j..}\). The result is a vector of T distances \(\left( {d_{1} \left( {X_{i1} ,X_{j1} } \right),d_{2} \left( {X_{i2} ,X_{j2} } \right), \ldots ,d_{T} \left( {X_{iT} ,X_{jT} } \right)} \right)\). These T distances can then be combined using a function that algebraically corresponds to a norm \(\left\| \cdot \right\|\) of the distance vector. Finally, the distance between \(X_{i..}\) and \(X_{j..}\) is

$$ d\left( {X_{i..} ,X_{j..} } \right) = \left\| {d_{1} \left( {X_{i1} ,X_{j1} } \right),d_{2} \left( {X_{i2} ,X_{j2} } \right), \ldots ,d_{T} \left( {X_{iT} ,X_{jT} } \right)} \right\| $$
(5)

To compute the distance \(d^{\prime}\) between the longitudinal covariate matrices \(X_{i..}\) and \(X_{j..}\) according to the second approach, we define a distance \(d_{k} \left( {X_{i.}^{\left( k \right)} , X_{j.}^{\left( k \right)} } \right)\) between the time series of the k-th covariate for the two subjects, where \(X_{i.}^{\left( k \right)} ,X_{j.}^{\left( k \right)} \in {\mathbb{R}}^{T}\) are the temporal trajectories of covariate k for subjects i and j, respectively. A simple aggregation sums these distances over all covariates:

$$ d^{\prime}(X_{i..} ,X_{j..} ) = \sum\limits_{k = 1}^{D} d_{k} (X_{i.}^{\left( k \right)} ,X_{j.}^{\left( k \right)} ) $$
(6)

More generally, computing each \(d_{k}\) separately yields a vector of D distances \(\left( d_{1} \left( {X_{i.}^{\left( 1 \right)} ,X_{j.}^{\left( 1 \right)} } \right),d_{2} \left( {X_{i.}^{\left( 2 \right)} ,X_{j.}^{\left( 2 \right)} } \right), \ldots ,d_{D} \left( {X_{i.}^{\left( D \right)} ,X_{j.}^{\left( D \right)} } \right) \right)\).

Then, we combine these D distances using a function that algebraically corresponds to a norm \(\left\| \cdot \right\|\) of the distance vector. Finally,

$$ d^{\prime}(X_{i..} ,X_{j..} ) = \left\| {d_{1} \left( {X_{i.}^{\left( 1 \right)} ,X_{j.}^{\left( 1 \right)} } \right),d_{2} \left( {X_{i.}^{\left( 2 \right)} ,X_{j.}^{\left( 2 \right)} } \right), \ldots ,d_{D} \left( {X_{i.}^{\left( D \right)} ,X_{j.}^{\left( D \right)} } \right)} \right\| $$
(7)

The choice of the norm \(\left\| \cdot \right\|\) can lead to the definition of many distances. However, when \(\left\| \cdot \right\|\) is the standard p-norm (i.e., the Minkowski distance with parameter p), the two methods \(d\) and \(d^{\prime}\) lead to the same result:

$$ d(X_{i..} ,X_{j..} ) = d^{\prime}(X_{i..} ,X_{j..} ) $$

The proof is based on [20]:

$$ d(X_{i..} ,X_{j..} ) = \sqrt[p]{{\sum\limits_{t} {\left( {d_{t} \left( {X_{it} ,X_{jt} } \right)} \right)^{p} } }} = \sqrt[p]{{\sum\limits_{t} {\left( {\sqrt[p]{{\sum\limits_{k} {\left| {X_{it}^{\left( k \right)} - X_{jt}^{\left( k \right)} } \right|^{p} } }}} \right)^{p} } }} $$
$$ = \sqrt[p]{{\sum\limits_{t} {\sum\limits_{k} {\left| {X_{it}^{\left( k \right)} - X_{jt}^{\left( k \right)} } \right|^{p} } } }} = \sqrt[p]{{\sum\limits_{k} {\left( {\sqrt[p]{{\sum\limits_{t} {\left| {X_{it}^{\left( k \right)} - X_{jt}^{\left( k \right)} } \right|^{p} } }}} \right)^{p} } }} $$
$$ = \sqrt[p]{{\sum\limits_{k} {\left( {d_{k} \left( {X_{i.}^{\left( k \right)} ,X_{j.}^{\left( k \right)} } \right)} \right)^{p} } }} = d^{\prime}\left( {X_{i..} ,X_{j..} } \right) $$
(8)

Therefore, the Minkowski distance, a generalization of the Euclidean distance, is used in this study as the basis for computing similarity between multivariate profiles [20].

$$ d\left( {X_{i..} ,X_{j..} } \right) = \sqrt[p]{{\sum\limits_{t,k} {\left| {X_{it}^{\left( k \right)} - X_{jt}^{\left( k \right)} } \right|^{p} } }} $$
(9)

The Euclidean distance is obtained by setting p = 2, the Manhattan distance by setting p = 1, and the maximum distance by passing to the limit \(p \to + \infty\)[20].
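As a concrete illustration, the following R sketch computes the Minkowski distance of Eq. (9) between two subjects' covariate matrices; the function name and example values are ours for illustration and are not part of the kml package.

```r
# Minkowski distance of order p between two T x D covariate matrices
# (rows = time points, columns = covariates), as in Eq. (9).
# Illustrative helper; not taken from the kml package.
minkowski_traj <- function(Xi, Xj, p = 2) {
  stopifnot(all(dim(Xi) == dim(Xj)))
  sum(abs(Xi - Xj)^p)^(1 / p)
}

# Example: two subjects observed at T = 3 time points on D = 2 covariates
Xi <- matrix(c(1.2, 1.5, 1.9, 60, 61, 63), ncol = 2)
Xj <- matrix(c(1.0, 1.4, 2.1, 58, 62, 62), ncol = 2)
minkowski_traj(Xi, Xj, p = 2)  # Euclidean distance (p = 2)
minkowski_traj(Xi, Xj, p = 1)  # Manhattan distance (p = 1)
```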

K-Nearest Neighbors (KNN) regression

KNN is a non-parametric method for regression and classification that belongs to the family of local-averaging approaches. In this approach, the regression function \(m(x_{i} ) = E(Y_{i} \left| {x_{i} )} \right.\) is estimated by averaging the values \(y_{j}\) of data points whose predictors \(x_{j}\) are close to the target point \(x_{i}\). That is,

$$ \hat{m}\left( {x_{i} } \right) = \sum\limits_{j = 1}^{n} {w_{j} } \left( {x_{i} ;x_{1} ,\ldots,x_{n} } \right)y_{j} = \left( {w_{1} (X),w_{2} (X),\ldots,w_{n} (X)} \right)\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix} $$
(10)

where the weights \(w_{j} \left( {x_{i} ;x_{1} ,...,x_{n} } \right)\) are non-negative and decrease as the distance between \(x_{j}\) (\(j = 1,...,n\)) and \(x_{i}\) increases. This approach estimates the value \(y_{i}\) corresponding to a target point \(x_{i}\) by borrowing information from nearby observations, assigning greater weight to observations closer to \(x_{i}\) and less weight to those farther away [22].

KNN regression for longitudinal data

In longitudinal studies, the Y values for each individual consist of multiple measurements taken over time, but the central problem of relating the independent variables to the response remains. As stated in Eq. (1), \(Y\) is a matrix of variable trajectories whose i-th row contains the response trajectory of the i-th individual. To estimate the variable-trajectory vector of a new individual using KNN regression, denoted \(\hat{y}_{it}\), the nearest neighbors of the covariate trajectory \(X_{i.}\) are identified using the distance metric in Eq. (9). The predicted response trajectory is then computed as the weighted average of the neighbors’ outcomes [23]. To predict the value of the response variable for subject i at time t, we use the average response of its k-nearest neighbors, identified from the longitudinal distances between their covariate trajectories \(X_{i.}\):

$$ \hat{y}_{it} = \sum\limits_{{j \in N_{k} \left( i \right)}} {\frac{1}{k}y_{jt} } $$
(11)

where \(\hat{y}_{it}\) denotes the predicted response for subject i at time t, obtained via local averaging over the responses of the k-nearest neighbors at time t, and \(N_{k}(i)\) denotes the set of indices corresponding to the k-nearest neighbors of subject i, based on the longitudinal distance between their covariate trajectories.

Alternatively, a weighted average can be used:

$$ \hat{m}(X_{it} ) = \sum\limits_{{j \in N_{k} \left( i \right)}} {w_{j} ,円 y_{jt} } , \qquad w_{j} = \frac{{1/d\left( {X_{i \cdot } ,X_{j \cdot } } \right)}}{{\sum\limits_{l \in N_{k} \left( i \right)} {1/d\left( {X_{i \cdot } ,X_{l \cdot } } \right)} }} $$
(12)

where \(d(X_{i \cdot } ,X_{j \cdot } )\) denotes the longitudinal distance between the covariate trajectories of subjects i and j, and the weights are normalized inverse distances that emphasize closer neighbors. The method requires that the covariates \(X_{i.}\) be time-dependent and sufficiently explain Y. We will refer to this method as KNNRLD (KNN Regression for Longitudinal Data).
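To make the KNNRLD prediction step concrete, the following R sketch implements Eqs. (11) and (12) under the assumption of balanced data; minkowski_traj() is the illustrative helper defined above, and all other names are ours rather than an existing package API.

```r
# Minimal sketch of KNNRLD (Eqs. 11-12): predict the response trajectory of a
# query subject as the (inverse-distance weighted) average of the response
# trajectories of its K nearest neighbours. X_train is a list of T x D covariate
# matrices, Y_train a matrix of response trajectories (rows = subjects,
# columns = time points).
knnrld_predict <- function(X_query, X_train, Y_train, K = 5, p = 2,
                           weighted = TRUE) {
  d  <- sapply(X_train, function(Xj) minkowski_traj(X_query, Xj, p = p))
  nn <- order(d)[seq_len(K)]                    # indices of the K nearest neighbours
  if (weighted) {
    w <- 1 / pmax(d[nn], .Machine$double.eps)   # inverse-distance weights (Eq. 12)
    w <- w / sum(w)
  } else {
    w <- rep(1 / K, K)                          # simple average (Eq. 11)
  }
  colSums(w * Y_train[nn, , drop = FALSE])      # predicted trajectory over time
}
```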

Some drawbacks related to KNN regression

Although KNN has proved to be a ubiquitous classification and regression tool with good scalability, it has some drawbacks. One of the most significant is that it is a lazy learner, i.e., it uses all the training data at runtime [2]. Another issue is that KNN regression is prone to the curse of dimensionality. KNN estimators are universally consistent, meaning that

$$ {\rm E}(\hat{m}_{k} - m)^{2} \to 0\;as\;n \to \infty $$
(13)

However, KNN regression estimators may perform poorly when the dimension of the input space (i.e., the number of covariates, p) is high, because in high-dimensional spaces the nearest neighbors are no longer close to the target point, which can lead to significant errors. It can be shown that

$$ {\rm E}(\hat{m}_{k} \left( x \right) - m\left( x \right))^{2} \le n^{{ - \frac{2}{2 + P}}} $$
(14)

This convergence rate is relatively slow and deteriorates as the dimension P increases. Therefore, if we require the estimation error to be at most \(\varepsilon > 0\), that is,

$$ {\rm E}(\hat{m}_{k} \left( x \right) - m\left( x \right))^{2} \le \varepsilon $$
(15)

Therefore, the number of samples n should be chosen so that

$$ n \ge \varepsilon^{{ - \frac{2 + P}{2}}} $$
(16)

which grows exponentially as P increases [24].
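A quick numerical illustration of this bound (using the exponent above and ignoring constants): for a target error of ε = 0.1, the required sample size grows from tens of observations at P = 1 to astronomically large values by P = 20.

```r
# Required sample size n >= eps^(-(2 + P)/2) for a target error eps,
# illustrating the exponential growth with the covariate dimension P
# (constants ignored; purely illustrative).
eps <- 0.1
P   <- c(1, 2, 5, 10, 20)
ceiling(eps^(-(2 + P) / 2))
# approximately: 32, 100, 3163, 1e6, 1e11
```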

To generalize KNN regression to longitudinal data, first consider the case in which none of the independent variables changes over time. The sample size can then be established from Eq. (16) exactly as for KNN regression with cross-sectional data, because the nearest-neighbor search reduces to an ordinary P-dimensional KNN regression. Next, assume instead that all the independent variables used in KNN regression with longitudinal data are time-dependent. In this case, we return to Eq. (8), where

$$ d\left( {X_{i..} ,X_{j..} } \right) = \sqrt[p]{{\sum\limits_{t} {\left( {d_{t} \left( {X_{it} ,X_{jt} } \right)} \right)^{p} } }} = \sqrt[p]{{\sum\limits_{t} {\sum\limits_{k} {\left| {X_{it}^{\left( k \right)} - X_{jt}^{\left( k \right)} } \right|^{p} } } }} $$
(17)

That is, the distance between subjects i and j decomposes into per-time-point distances. Therefore, for each measurement time \(t = 1,2, \ldots ,T\), the required sample size can be obtained exactly as in the cross-sectional case. When only some of the independent variables depend on time, the required sample size lies between these two cases.

As mentioned, the curse of dimensionality in a KNN estimator is that the nearest neighbors must be close to the target point. High dimensionality can lead to significant errors. This problem can be partially mitigated by clustering the data before running the K-Nearest Neighbors (KNN) algorithm. Clustering helps by grouping similar subjects, thereby confining the nearest-neighbor search to localized regions in the feature space. As a result, the influence of high dimensionality is reduced because distance computations are performed within smaller, denser subsets rather than across the entire space. This strategy is particularly effective for longitudinal data, where local structures often carry more meaningful temporal relationships. By clustering subjects based on proximity in the P-dimensional space, the KNN algorithm operates within more homogeneous groups. Additionally, distance calculations are performed on smaller subsets of data (as opposed to full n-dimensional matrices), which improves computational efficiency—especially for large-scale longitudinal datasets [20, 25].

Enhancing KNN regression accuracy through clustering in longitudinal data

By clustering individuals based on similar temporal patterns and covariate structures, the overall variability within each group is reduced. This approach ensures that subjects with comparable characteristics are analyzed together, leading to a more consistent intra-cluster correlation. As a result, applying KNN regression within each cluster becomes more stable and reliable. Rather than implementing KNN across the entire dataset, restricting it to pre-defined clusters allows the algorithm to find neighbors with more similar longitudinal structures. This localized approach improves predictive accuracy by ensuring comparisons among more relevant observations. When individuals within a cluster exhibit shared underlying characteristics, such as similar baseline values or progression trends, clustering effectively accounts for these latent factors. It reduces the necessity for explicitly modeling random effects, simplifying the regression process while capturing subject-specific influences [12, 21, 25].

The objective of CKNNRLD is to estimate the value of a longitudinal outcome, yit, using observed covariates, Xit, through a cluster-based nearest-neighbor regression framework. This framework aggregates information from similar time-aligned individuals to produce continuous-valued predictions.

Clustering-based KNN regression for longitudinal data

The algorithm comprises both preprocessing and processing steps, as outlined below.

Preprocessing steps

  1. Clustering variable trajectories using the K-means for longitudinal data (KML) method [19, 20].

  2. Finding the optimal number of clusters using the Calinski criterion [26] or other criteria available in the kml package [19, 20] (at this step, users may instead set the number of clusters based on prior knowledge of the data).

  3. Determining the optimal K for KNN regression by k-fold cross-validation (if the clusters are homogeneous, the same K can be used for all clusters).

  4. Finding the representative of each cluster (the mean vector, or mean matrix for time-dependent data, of the corresponding covariate variables in each cluster).

  5. Saving the information from the previous four steps.

Processing steps

  6. Finding the cluster in which the query point (a query vector for non-time-dependent data, a query matrix for time-dependent data) has the highest probability of occurrence, i.e., the smallest distance to a cluster representative.

  7. Applying KNNRLD based on the weighted average of the K nearest neighbors of the query variables within that cluster.

  8. Returning the value calculated in the previous step, a new variable trajectory, as the output.

According to the above steps, the KNN regression algorithm based on clustering for longitudinal data is illustrated in Fig. 1. The kml package includes functions for imputing missing values, a detailed description of which is given in the related article [20]; if there are trajectories with missing values, they can be handled before step 1 using these functions. In our implementation, the optimal number of clusters was selected using internal validation indices such as the Calinski-Harabasz and silhouette scores available in the kml R package. This data-driven strategy helped ensure that the chosen C aligned with the actual data structure.

Fig. 1

The algorithm flow plot of the CKNNRLD. KNNRLD: KNN regression for longitudinal data; CKNNRLD: clustering-based KNN regression for longitudinal data; CV: cross-validation
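The sketch below outlines how the pre-processing and processing steps could be wired together in R, assuming balanced data stored as an n ×ばつ T response matrix Y_train and a list X_train of T ×ばつ D covariate matrices. Clustering the response trajectories with kml, the kml calls (clusterLongData(), kml(), getClusters()) and the choice of the mean covariate matrix as cluster representative reflect our reading of the method; knnrld_predict() and minkowski_traj() are the illustrative helpers sketched earlier, and this is not the authors' implementation.

```r
# Sketch of the CKNNRLD pre-processing and processing steps (assumptions above).
library(kml)

## Pre-processing ------------------------------------------------------------
cld <- clusterLongData(traj = Y_train,
                       idAll = paste0("s", seq_len(nrow(Y_train))),
                       time = seq_len(ncol(Y_train)))
kml(cld, nbClusters = 2:6, nbRedrawing = 20, toPlot = "none")  # step 1: KML clustering
C <- 4                                          # step 2: e.g., chosen by the Calinski criterion
cluster <- as.integer(getClusters(cld, nbCluster = C))

# Step 4: cluster representatives = element-wise mean covariate matrix per cluster
reps <- lapply(seq_len(C), function(c)
  Reduce(`+`, X_train[cluster == c]) / sum(cluster == c))

## Processing ----------------------------------------------------------------
cknnrld_predict <- function(X_query, K = 5, p = 2) {
  d_rep <- sapply(reps, function(Rc) minkowski_traj(X_query, Rc, p = p))
  c_hat <- which.min(d_rep)                     # step 6: closest cluster representative
  idx   <- which(cluster == c_hat)              # step 7: KNNRLD within that cluster
  knnrld_predict(X_query, X_train[idx], Y_train[idx, , drop = FALSE], K = K, p = p)
}
```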

Relation to previous work

CKNNRLD represents a significant advancement over existing clustering-based KNN methods by specifically addressing the unique challenges of longitudinal data analysis. While prior approaches such as CLUEKR (Dubey et al. [2]) have demonstrated the utility of combining clustering and KNN in cross-sectional settings, they lack the mechanisms necessary to handle temporal dependencies, intra-subject correlations, and irregular time points [2]. A concise comparison between CLUEKR and CKNNRLD is presented in Supplementary Table S1, summarizing key methodological differences in data type, clustering logic, and regression strategy [2].

Our method builds upon longitudinal clustering techniques, such as KML (k-means for longitudinal data), by introducing several key innovations.

Time-sensitive distance metrics: CKNNRLD employs trajectory-level distance calculations that capture both the shape and timing of individual trajectories, ensuring meaningful comparisons across subjects with sparse or irregular measurements.

Cluster-specific local regression models: By fitting regression models within clusters of similar temporal patterns, CKNNRLD adapts to heterogeneity across subgroups and improves prediction accuracy.

Scalability and flexibility: The method remains computationally efficient and non-parametric, making it suitable for large-scale longitudinal datasets without relying on strict model assumptions.

To our knowledge, no previous method has combined these components into a unified, interpretable framework for KNN-based regression on longitudinal data. Our theoretical and empirical results demonstrate that CKNNRLD offers superior performance to standard KNN approaches in terms of both accuracy and robustness in the presence of temporal complexity.

Simulation methods

The data-generating procedure first defined the number of clusters (C) characterizing the longitudinal change to be modeled in the data. Three or four clusters (C = 3, 4) with distinct longitudinal patterns (trajectories) were specified. A trajectory \(y_{ij}\) belonging to cluster c was generated from the general linear model below:

$$ y_{ij} = \beta_{0c} + \beta_{1c} t_{j} + \beta_{2jcd} X_{ijcd} + Z_{0i} + e_{ij} \;\;i = 1,...,N;j = 0,...,T $$
(18)

Here, \(\beta_{0c}\) and \(\beta_{1c}\) are the intercept and slope for cluster c, respectively. \(Z_{0i}\) denotes the trajectory-specific random intercept, i.e., the deviation of subject i from its cluster trajectory, with \(Z_{0i} \sim N(0,R)\), and \(e_{ij}\) represents independent random noise with \(e_{ij} \sim N(0,E)\). In this context, \(t_{j}\) denotes the j-th time point at which the longitudinal measurements (either for the response or the covariates) are observed for a given subject. In balanced data, \(t_{j}\) can be interpreted as a fixed and common time index across subjects; in unbalanced settings, it may vary across individuals.

The simulation then varied the number of repeated measurements or time points (T = 3, 5, 10) and the number of simulated subjects (N = 100, 500, 2000), where \(X_{ijcd}\) is a fictitious covariate that depends on time j and cluster c for the i-th subject. The scenarios included different values for the true number of clusters (C = 3 and C = 4) to examine the model’s sensitivity to the clustering structure. The index d refers to the d-th covariate, and the dimension of the covariates was set to D = 2 or 3. Here, N refers to the number of subjects (or trajectories) at each time point in each cluster, yielding a total of T ×ばつ C ×ばつ N ×ばつ D data points in each data set. For all time points and subjects, measurement error (\(e_{ij}\)) and random noise (\(Z_{0i}\)) were added to the data, drawn from normal distributions with mean 0 and SD 1 (E = 1, R = 1) for the low measurement error and low random noise condition, (E = 5, R = 1) for the high measurement error and low random noise condition, or (E = 1, R = 5) for the low measurement error and high random noise condition, to simulate differences in cluster homogeneity and deviations from linearity (see Table 1).

Table 1 Summarizing the simulation scenarios

The total number of scenarios in the simulation design was 2 (clusters) ×ばつ 3 (time points) ×ばつ 3 (numbers of subjects) ×ばつ 2 (covariate dimensions) ×ばつ 3 (levels of error and random noise) = 108. A new series of simulated datasets was constructed for each of the three prediction methods, with r = 10,000 replicated data sets per scenario. The R package latrend (version 1.6.1), which can generate data according to this model via its utility function generateLongData(), was used for this purpose [27]. This function generates datasets based on a mixture of linear mixed models (see Table 1).
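For readers who prefer a self-contained illustration, the base-R sketch below generates data from the mixture model in Eq. (18). It is not the latrend::generateLongData() call used in the study; the covariate distribution, the effect sizes, and the convention that each cluster contains N subjects are our own assumptions.

```r
# Minimal base-R sketch of the data-generating model in Eq. (18): cluster-specific
# intercepts and slopes, a subject-level random intercept Z0 ~ N(0, R) and
# measurement error e ~ N(0, E); E and R are standard deviations, as in Table 1.
simulate_scenario <- function(N = 500, T = 5, C = 4, D = 2, E = 1, R = 1,
                              beta0 = seq(0, 6, length.out = C),
                              beta1 = seq(-1, 1, length.out = C),
                              beta2 = rep(0.5, D)) {
  t <- 0:(T - 1)
  do.call(rbind, lapply(seq_len(N * C), function(i) {
    c_i <- ((i - 1) %% C) + 1                      # cluster label of subject i
    z0  <- rnorm(1, 0, R)                          # random intercept
    X   <- matrix(rnorm(T * D, mean = c_i), T, D)  # fictitious time-varying covariates
    y   <- beta0[c_i] + beta1[c_i] * t + as.numeric(X %*% beta2) + z0 + rnorm(T, 0, E)
    data.frame(id = i, cluster = c_i, time = t, y = y, X)
  }))
}

dat <- simulate_scenario(N = 100, T = 5, C = 4, D = 2, E = 1, R = 1)
head(dat)
```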

Figure 2 illustrates three scenarios with common parameters N = 500, T = 10, D = 3, and C = 4 to compare the effects of random noise and measurement error under the three conditions.

Fig. 2

Examples of simulated data with T = 10, D = 3, C = 4, N = 500 and E = 1, R = 1 (panel A), E = 1, R = 5 (panel B), and E = 5, R = 1 (panel C)

The simulation study was designed to reflect the structure and conditions of the real-world longitudinal dataset analyzed in the application section. In particular, the choice of the number of covariates (D = 2 and D = 3) was motivated by the typical dimensionality observed in clinical data, where a limited number of predictors are measured repeatedly over time. Accordingly, the scenarios were constructed to mimic realistic noise levels, time structures, and clustering patterns encountered in longitudinal respiratory datasets.

The criteria for evaluating different methods for predicting new observations

Assessment of BIAS

Bias is the deviation of an estimate from the true quantity and indicates the performance of the assessed methods. One assessment of bias is the difference between the average estimate and the true value, which in this study was computed as \(BIAS = \hat{y}_{ij} - y_{ij}\), where \(\hat{y}_{ij}\) is the predicted variable trajectory of the i-th subject at time j, estimated from the covariates under each method, and \(y_{ij}\) is its actual value [28].

Assessment of accuracy

The classical Mean Squared Error (MSE) is typically computed as:

$$ MSE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {\hat{y}_{i} - y_{i} } \right)}^{2} $$
(19)

However, in this study, a theoretical version of MSE was used to decompose the error into squared bias and variance components:

$$ Theoretical\;MSE = BIAS^{2} + SE^{2} $$
(20)

SE is the standard error, representing the empirical standard deviation of the estimator across simulations. This formulation, adapted from Burton et al. [28], captures both systematic and random error components in the estimator’s performance.

Assessment of coverage probability (CP)

The coverage of a confidence interval is the proportion of times that the obtained confidence interval contains the true specified parameter value. The coverage should be approximately equal to the nominal coverage rate, e.g., 95 percent of samples for 95 percent confidence intervals, to properly control the type I error rate when testing a null hypothesis of no effect. In this study, CP refers to the proportion of times the \(100(1 - \alpha )\%\) confidence interval (with \(\alpha = 0.05\)) \(\hat{y}_{ij} \pm Z_{{1 - \frac{\alpha }{2}}} SE\left( {\hat{y}_{ij} } \right)\) includes \(y_{ij}\), for i = 1,..., N, j = 0,..., T. If the parameter estimates are relatively unbiased, then narrower confidence intervals imply more precise estimates, suggesting gains in efficiency and power [28].
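The three criteria can be computed with a few lines of base R; the sketch below evaluates one (i, j) cell across r simulation replicates, with object names of our choosing.

```r
# Sketch of the evaluation criteria for one (i, j) cell across r simulation
# replicates: BIAS, theoretical MSE = BIAS^2 + SE^2, and coverage probability
# of the 95% interval y_hat +/- z * SE. `y_hat` is a vector of r predictions
# and `y_true` the corresponding true value (illustrative names).
evaluate_cell <- function(y_hat, y_true, alpha = 0.05) {
  bias <- mean(y_hat) - y_true
  se   <- sd(y_hat)                         # empirical SE across replicates
  mse  <- bias^2 + se^2                     # theoretical MSE decomposition
  z    <- qnorm(1 - alpha / 2)
  cp   <- mean(y_hat - z * se <= y_true & y_true <= y_hat + z * se)
  c(BIAS = bias, MSE = mse, CP = cp)
}
```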

The criteria for the evaluation of clustering

We used four criteria to examine the relationship between the accuracy of the estimates and the correct detection of clusters in the CKNNRLD method. Three of these are well-known criteria for evaluating clustering accuracy: the Rand index [29], the Adjusted Rand index [30], and the Nowak index [31]. These three indices were calculated with the clusterSim R package (version 0.51-5) and its partition-comparison function [32]. In each simulation scenario, the number of clusters (C) was selected using standard internal validation indices such as the Calinski-Harabasz score to ensure meaningful and stable clustering [26]. The fourth criterion concerns the correct detection of the number of clusters: the proportion of simulation iterations in which the CKNNRLD algorithm correctly recognized the number of clusters, out of the total number of iterations in each simulation scenario.

Finally, Pearson’s correlation was used to assess the impact of the clustering step in the CKNNRLD algorithm on the accuracy and correctness of its estimates across all 108 scenarios.
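For illustration, the Rand and Adjusted Rand indices can be computed directly from the contingency table of two partitions, as in the base-R sketch below (the study itself used the partition-comparison function of the clusterSim package).

```r
# Base-R sketch of two clustering-agreement criteria. `true` and `est` are
# integer vectors of cluster labels for the same subjects.
rand_index <- function(true, est) {
  n  <- length(true)
  ct <- table(true, est)                    # contingency table of the two partitions
  sum_ij <- sum(choose(ct, 2))
  sum_i  <- sum(choose(rowSums(ct), 2))
  sum_j  <- sum(choose(colSums(ct), 2))
  total  <- choose(n, 2)
  rand   <- (total + 2 * sum_ij - sum_i - sum_j) / total
  expected <- sum_i * sum_j / total
  adj_rand <- (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
  c(Rand = rand, AdjustedRand = adj_rand)
}

rand_index(true = rep(1:3, each = 10), est = sample(rep(1:3, each = 10)))
```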

Execution time

Two prediction scenarios were considered: (1) predicting a single new observation, and (2) predicting a random sample of observations equal to 50% of N. In each scenario, the execution time ratios of KNNRLD and LMM relative to CKNNRLD were used to compare the execution times of the methodologies.
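A simple way to obtain such ratios, sketched below with base R's system.time() and the illustrative prediction helpers defined earlier, is to time each method on the same held-out set (X_test here is assumed to hold the covariate matrices of the new observations).

```r
# Timing sketch: ratio of KNNRLD to CKNNRLD execution time on a held-out set
# X_test (a list of covariate matrices for the new observations).
t_knn  <- system.time(lapply(X_test, knnrld_predict,
                             X_train = X_train, Y_train = Y_train, K = 5))["elapsed"]
t_cknn <- system.time(lapply(X_test, cknnrld_predict, K = 5))["elapsed"]
unname(t_knn / t_cknn)   # a ratio greater than 1 means CKNNRLD is faster
```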

Table 2 BIAS, MSE, and CP for 108 scenarios and three methods

Results on simulated data

Table 2 presents the values of BIAS, Mean Squared Error (MSE), and coverage probability (CP) related to all three CKNNRLD, KNNRLD, and LMM methods for 108 simulation scenarios.

Figure 3 compares three methods—CKNNRLD, KNNRLD, and LMM—using boxplots for bias, MSE, and CP across various simulation scenarios. LMM consistently shows higher bias and MSE, particularly as the number of clusters, time points, and error levels increase. In contrast, CKNNRLD and KNNRLD exhibit lower bias and MSE, demonstrating more stable and accurate predictions. Coverage probability is also better maintained by CKNNRLD and KNNRLD. Increasing the number of subjects reduces the mean squared error (MSE) and bias for all methods, but CKNNRLD and KNNRLD remain more robust across different conditions. Higher measurement error and random noise have a more negative impact on LMM than on the KNN approaches.

Fig. 3

Boxplots for BIAS (top panel), MSE (middle panel), and CP (bottom panel) by simulation parameters (from left to right: N, C, T, D, (R, E)) and overall mean

CKNNRLD yielded lower MSE values than KNNRLD in 89.9% of scenarios, a lower bias than the LMM approach in all scenarios, and a lower MSE than the LMM approach in 65% of the scenarios (Table 2). In the CKNNRLD algorithm, an increase in the number of trajectories (N) resulted in a decrease in bias and MSE, while the CP became more consistent, approaching CP = 0.95. Conversely, the accuracy of estimates using the LMM method declined with increasing sample size (N). Similarly, the accuracy and precision of the KNNRLD approach improved as N increased, except for bias at N = 500 compared to N = 100 (Fig. 3).

The average execution time of the CKNNRLD methodology was double that of the KNNRLD method for predicting a single new longitudinal observation. In comparison, the average pure calculation time for CKNNRLD (excluding clustering time and proximity determination for new data) was less than half (0.35) of the time required for the KNNRLD technique. For a random sample of size 50% of N, the average execution time ratio of the KNNRLD algorithm to the CKNNRLD technique was less than one (0.32) for N = 100; however, it was around 1.5 times (1.52) for N = 500 and roughly three times (3.04) for N = 2000 (see Fig. 4).

Fig. 4

Box plots of the ratio of the run time of the KNNRLD and LMM algorithms to that of CKNNRLD, for fifty percent of observations treated as a new set of observations, by simulation parameters (from left to right: N, C, T, D, (R, E)) and overall mean

Supplementary Fig. S1 illustrates the sensitivity of CKNNRLD to misspecification in the number of clusters (ΔC). Specifically, it compares the mean squared error (MSE) when the true number of clusters is correctly selected (ΔC = 0) versus when C is under or overestimated by one (ΔC = ± 1). Results show that in 73% of simulation scenarios, the optimal C was selected and yielded lower MSE. In the remaining 27%, a modest increase in MSE was observed.

The Pearson correlation coefficients between clustering accuracy criteria and estimate accuracy criteria in the CKNNRLD method are presented in Table 3. The results indicate significant negative correlations between clustering accuracy indices (Rand, Adjusted Rand, and Nowak) and bias and Mean Squared Error (MSE), suggesting that improved clustering accuracy is associated with lower estimation errors. The strongest correlations are observed for the Rand index, with bias (− 0.376) and MSE (− 0.320), highlighting its relevance in ensuring precise estimation. The proportion of matched clusters also shows a weaker but significant negative correlation with bias (− 0.199) and MSE (− 0.180). However, coverage probability (CP) does not exhibit strong correlations with clustering accuracy measures, indicating that while clustering quality impacts bias and mean squared error (MSE), it has a minimal influence on the coverage of confidence intervals. These findings underscore the significance of effective clustering in enhancing the predictive performance of CKNNRLD and demonstrate that the quality of clustering has a direct impact on the regression performance of CKNNRLD. However, the method remains robust when standard validation indices are used to determine cluster structure, even in the presence of moderate noise.

Table 3 Pearson correlation of clustering accuracy criteria by estimate accuracy criteria in CKNNRLD method

Applications to real data

The severity and prognosis of respiratory diseases are primarily determined by the results of pulmonary function tests, particularly spirometry [33]. Accurate interpretation of spirometry results relies on standardized reference values that account for population-specific factors such as race, age, and height. However, respiratory scales derived from spirometry do not adhere to a linear model, as lung volumes exhibit a skewed distribution influenced by age-related and height-dependent variations. Understanding these complex patterns is crucial for precise diagnosis, monitoring disease progression, and evaluating the impact of environmental and occupational exposures on respiratory health [34,35,36,37]. Forced vital capacity (FVC) is a key indicator that reflects both restrictive and obstructive pulmonary impairments, making it essential for monitoring long-term respiratory health. Tracking changes in lung function over time is essential, particularly in occupational environments where workers may be exposed to airborne pollutants, dust, and other respiratory hazards that contribute to long-term pulmonary decline. Continuous monitoring enables the identification of early signs of impairment, facilitating timely interventions, workplace safety improvements, and the development of preventive measures to reduce the risk of chronic respiratory diseases [37, 38].

In this section, we investigate the longitudinal pattern of spirometry variables among Iranian Bafq iron ore workers. The Forced Vital Capacity (FVC) variable, measured in liters, was assessed for 274 employees of this mine over three consecutive years: 2022, 2023, and 2024. Age, height, and exposure to tobacco (in pack-years; a pack-year is a clinical quantification of cigarette smoking used to measure a person’s exposure to tobacco) were considered key features. These features can be used to predict the trajectory of change for a new individual, as they play a crucial role in determining lung function trajectories over time. Given the non-linear nature of pulmonary changes, incorporating these variables into predictive models allows for a more personalized assessment of respiratory health. By leveraging machine learning techniques, such as CKNNRLD and KNNRLD, it becomes possible to estimate future spirometry values, identify individuals at risk of lung function decline, and implement early interventions to mitigate potential respiratory impairments. For this purpose, the data were divided into two sets: train and test. The variable trajectories of 30 subjects (11%) were used as the test set, and the remaining data were used as the training set for the CKNNRLD and KNNRLD methods. In Fig. 5, the variable trajectories are shown for the 244 training subjects (Panel A), and the optimal number of clusters was determined using the kml package and various criteria (Panel B). Finally, for each of the 30 subjects in the test dataset, the predictions from both methods and the actual values of the variable trajectories are illustrated in Panel C of Fig. 5. Internal cluster validation on the real spirometry dataset was performed using the Calinski-Harabasz criterion within the kml package. The maximum value was observed at C = 4 (CH ≈ 351.7), with C = 3 (CH ≈ 344.3) also yielding a competitive score, supporting the selection of a small number of well-separated longitudinal clusters.

Fig. 5

A Trajectories for training data; B KML clustering trajectory for training data; C Prediction trajectory for test data. KML, K-means for longitudinal data; FVC, forced vital capacity
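The workflow for this application can be sketched as follows, reusing the illustrative CKNNRLD helpers from the Methods section; fvc_mat and covariate_list are placeholders for the real spirometry data, which are not reproduced here, and the pre-processing sketch (KML clustering and cluster representatives) is assumed to have been re-run on the training split before prediction.

```r
# Sketch of the spirometry application: hold out 30 workers, cluster the
# training FVC trajectories, then predict each test trajectory with the
# illustrative cknnrld_predict() helper. Placeholder objects:
#   fvc_mat        - 274 x 3 matrix of FVC (2022-2024)
#   covariate_list - list of per-worker T x D matrices (age, height, pack-years)
set.seed(1)
test_id <- sample(nrow(fvc_mat), 30)               # ~11% of 274 workers held out
Y_train <- fvc_mat[-test_id, ]
X_train <- covariate_list[-test_id]
X_test  <- covariate_list[test_id]

pred <- t(sapply(X_test, cknnrld_predict, K = 5))  # predicted FVC trajectories
mean((pred - fvc_mat[test_id, ])^2)                # out-of-sample mean squared error
```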

Discussion

In some situations, a longitudinal study’s objective is not to evaluate the link between independent and dependent variables; instead, we are searching for a flexible way to predict trends or varied trajectories using various features. In this work, we enhanced KNN regression for predicting responses from longitudinal data by introducing the Clustering-Based KNN Regression for Longitudinal Data (CKNNRLD) algorithm. The benefit of CKNNRLD is that it requires less computational effort and exhibits lower complexity. Instead of directly searching for nearest neighbors in the entire dataset, the algorithm first clusters the data using KML, a method for grouping longitudinal variable trajectories, and then finds the cluster where the query point should be located. According to the simulation findings, the CKNNRLD method took less time than the KNNRLD implementation for N > 100, and it also predicted the longitudinal responses more accurately and precisely than the comparable approach. Since the KNNRLD method requires all the training data to locate the nearest neighbors, it operates extremely slowly as the number of individuals in longitudinal research increases (N > 500). Moreover, the simulation study was designed to explore a wide range of data configurations by varying the number of clusters (C), time points (T), dimensionality (D), and noise levels across 108 scenarios. The consistent performance of CKNNRLD across these settings, as presented in Table 2 and Figs. 3 and 4, demonstrates the model’s robustness and serves as an implicit sensitivity analysis concerning key modeling parameters.

A longitudinal clustering step before performing KNN regression can mitigate intra-subject correlation, although its effectiveness depends on the underlying correlation structure. Clustering enhances stability by grouping individuals with similar temporal patterns and covariate structures, reducing cluster variability and facilitating more accurate KNN predictions. By ensuring that neighbors share comparable longitudinal structures, clustering enables improved estimation while implicitly accounting for subject-specific effects, thereby reducing the need for explicit random effects modeling [12, 20, 25].

The core contribution of CKNNRLD lies in generalizing the KNN regression framework to longitudinal settings by integrating a time-aware clustering mechanism with trajectory-level distance metrics, which has not been explored in prior work. This addresses a critical gap in the literature by providing a non-parametric, scalable tool tailored for sparse, irregular, and temporally dependent data. However, several limitations must be taken into account. The approach assumes that intra-subject correlation is primarily driven by clusterable factors, which may not hold if individual trajectories exhibit significant heterogeneity, such as random slopes or nonstationary processes. Additionally, even after clustering, residual correlation within subjects may persist, requiring supplementary adjustments such as mixed-effects modeling, autoregressive corrections, or weighting strategies. While longitudinal clustering enhances the robustness of KNN regression in handling intra-subject correlation, additional refinement techniques may be needed to fully account for subject-specific variability and unobserved individual effects [39].

While deep learning methods, such as recurrent neural networks (RNNs) and Gaussian processes (GPs), have demonstrated exceptional capabilities in modeling complex temporal dynamics, their application in clinical settings is often limited. These models typically require large, high-quality datasets and involve extensive hyperparameter tuning, which can be infeasible in practice. Moreover, the interpretability of deep learning approaches remains a significant barrier to their adoption in healthcare. Recent studies have highlighted these concerns, emphasizing the need for transparent and scalable alternatives in clinical predictive modeling [40, 41]. In contrast, CKNNRLD offers a lightweight, interpretable framework suitable for moderate-sized longitudinal datasets. By enhancing traditional KNN regression with clustering-based trajectory alignment, our method maintains high prediction accuracy while remaining computationally tractable and more aligned with the practical requirements of clinical environments.

A clustering-based KNN regression approach can be used to predict the trajectory of a new individual’s response based on baseline independent variables, with accuracy depending on the nature of these variables. Suppose the independent variables remain constant over time. In that case, prediction is straightforward, as the individual can be assigned to a similar cluster, allowing their response trajectory to be inferred from similar subjects. When independent variables change predictably over time, such as age, the model can still generate reasonable predictions by adjusting for expected variations. However, if the independent variables fluctuate unpredictably, the reliability of predictions decreases, as clustering-based KNN regression primarily relies on historical patterns and does not explicitly model dynamic individual-level changes. Therefore, this approach is most effective when independent variables are stable or follow predictable trends over time. Adaptive weighting in KNN, utilizing techniques such as locally weighted regression, ensures that recent observations have a greater influence on estimates. By integrating this approach, the predictive accuracy of clustering-based KNN regression can be significantly improved, even in cases where independent variables exhibit unpredictable variations [42].

The results indicated that CKNNRLD achieved optimal prediction accuracy when the specified number of clusters matched the true value. When C was misspecified by ± 1, performance degradation was generally minimal, with an increase in the mean squared error (MSE) of less than 10%. Substantial degradation was observed only when C was severely mis-specified (e.g., using C = 2 when the true value was C = 4). In 73% of scenarios, the Calinski-Harabasz index picked the true C. These findings suggest that CKNNRLD is reasonably robust to moderate deviations in the number of clusters, particularly when standard internal validation techniques are employed to guide cluster selection. The relationship between clustering quality and prediction accuracy was evaluated using Rand, Adjusted Rand, and Nowak indices. Table 3 summarizes Pearson correlation coefficients between clustering quality and model errors (BIAS, MSE), showing significant negative correlations that confirm improved clustering leads to better prediction accuracy. Although noisy or improperly clustered data can affect performance, our validation-based cluster selection maintains model robustness across moderate variations in clustering. In addition to these findings, two specific safeguards help further mitigate this dependency: (1) cluster quality is validated using both statistical indices (e.g., Calinski-Harabasz, Silhouette) and visual examination of trajectory patterns (Fig. 5), and (2) the subsequent weighted KNN step enhances robustness, as evidenced by minimal degradation in prediction accuracy (MSE increase < 15%) even under high-noise conditions (Table 2). In the simulation study, while only D = 2 and 3 were tested, the model remained stable, supporting its potential resilience to moderate levels of dimensionality. We acknowledge that additional simulations on higher-dimensional datasets would further validate the method’s effectiveness under the curse of dimensionality. This is a valuable direction for future work.

We tested the applicability of the proposed approach using spirometry data from the Iranian Bafq iron ore mine, which provided a real-world longitudinal dataset for evaluating the effectiveness of clustering-based KNN regression. The dataset consisted of repeated pulmonary function measurements from workers, allowing us to assess how well the method captured individual trajectories and intra-subject correlation over time. The resultant method may be utilized in various medical research applications, particularly in predictive modeling of disease progression, occupational health monitoring, and personalized medicine. By analyzing repeated measurements in individuals or specific communities, this approach can help identify risk factors, predict long-term health outcomes, and inform the optimization of intervention strategies. Similar to its application in this study, the method can be extended to various fields, including epidemiology, chronic disease management, and public health surveillance, enabling more accurate predictions of physiological changes in response to environmental or occupational exposure. In the real-world case study, the primary objective was to demonstrate the feasibility and interpretability of the CKNNRLD method. Since comparative evaluation was already extensively conducted under controlled simulation scenarios, repeating the same comparisons on observational data was deemed unnecessary.

Future research could focus on refining the CKNNRLD algorithm by incorporating adaptive clustering techniques to account for more complex intra-subject correlation structures. Exploring hybrid models that integrate clustering-based KNN regression with mixed-effects or autoregressive models could further improve predictive accuracy, especially in cases where individual trajectories exhibit significant variability. Additionally, future studies could investigate the application of this method in broader medical and epidemiological contexts, such as chronic disease progression and personalized treatment planning. Extending CKNNRLD to handle dynamic independent variables through adaptive weighting or time-varying feature selection may enhance its applicability to real-world longitudinal datasets with unpredictable variations.

Although interpretability was not explicitly evaluated in this study, the cluster centers of CKNNRLD offer natural insights into subgroup patterns. For example, in the spirometry application, clusters could represent distinct trajectories of lung function decline (e.g., ‘stable FVC,’ ‘rapid decline with age’), allowing clinicians to associate predictions with prototypical cases. Since predictions are based on neighboring subjects within the same cluster, the model retains a degree of transparency and traceability. To enhance clinical interpretability, future research could explore ways to quantify the contribution of individual features to both cluster assignment and local prediction behavior. Model-agnostic explanation tools, such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations), could be applied within clusters to assess the influence of covariates. These tools may enhance transparency by revealing how specific subject-level features influence predicted trajectories. Additionally, visualization techniques—such as trajectory overlays or t-SNE cluster maps (t-distributed Stochastic Neighbor Embedding)—could be valuable in interpreting local structure and neighborhood relationships [43,44,45].

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Abbreviations

CKNNRLD:

Clustering-based KNN regression for longitudinal data

KNNRLD:

KNN regression for longitudinal data

LMM:

Linear mixed effect model

GLMM:

Generalized linear mixed effect models

GEE:

Generalized estimating equations

DT:

Decision trees

RF:

Random forests

SVM:

Support vector machines

KNN:

K-nearest neighbor

CLUEKR:

Clustering-based efficient KNN regression

KNNVWC:

KNN approach based on various-widths clustering

CK-NN:

Clustered k-nearest neighbors approach for large-scale classification

MSE:

Mean squared error

KML:

K-means for longitudinal data

CP:

Coverage probability

GBTM:

Group-based trajectory modeling

GMM:

Growth mixture modeling

FVC:

Forced vital capacity

RNN:

Recurrent neural networks

GP:

Gaussian processes

References

  1. Klemelä JS. Multivariate nonparametric regression and visualization: with R and applications to finance. Hoboken: John; 2014.

  2. Dubey H, Pudi V. CLUEKR: Clustering based efficient kNN regression. In: Advances in knowledge discovery and data mining: 17th Pacific-Asia conference, PAKDD 2013, Gold Coast, April 14–17, 2013, Proceedings, Part I. Berlin: Springer; 2013. p. 450–8.

  3. Halder RK, Uddin MN, Uddin MA, Aryal S, Khraisat A. Enhancing K-nearest neighbor algorithm: a comprehensive review and performance analysis of modifications. J Big Data. 2024;11:113.

  4. Dhanabal S, Chandramathi S. A review of various k-nearest neighbor query processing techniques. Int J Comput Appl. 2011;31:14–22.

  5. Bharambe S, Dubey H, Pudi V. BINER: BINary search based efficient regression. In: Machine learning and data mining in pattern recognition: 8th international conference, MLDM 2012, Berlin: Springer; 2012. July 13–20, 2012. Proceedings 8. p. 76–85.

  6. Ougiaroglou S, Evangelidis G. Efficient k-NN classification based on homogeneous clusters. Artif Intell Rev. 2014;42:491–513.

  7. Al-Helali B, Chen Q, Xue B, Zhang M. A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data. Soft Comput. 2021;25:5993–6012.

  8. Song Y, Liang J, Lu J, Zhao X. An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing. 2017;251:26–34.

  9. Cui L, Zhang Y, Zhang R, Liu QH. A modified efficient KNN method for antenna optimization and design. IEEE Trans Antennas Propag. 2020;68:6858–66.

  10. Gallego AJ, Rico-Juan JR, Valero-Mas JJ. Efficient k-nearest neighbor search based on clustering and adaptive k values. Pattern Recognit. 2022;122: 108356.

  11. Ullah R, Emaduddin SM. ck-NN: A clustered k-nearest neighbours approach for large-scale classification. 2020.

  12. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. Hoboken: John Wiley & Sons; 2012.

  13. Angali KA, Loeloe MS, Akhoond MR, Daneshkhah A, Borazjani F. Early feeding and growth pattern in infants: using a three-variate longitudinal model derived from Gaussian copula function. Epidemiol Biostat Public Health. 2018;15:e12908–11.

  14. Loeloe MS, Akhoond MR, Ahmadi Angali K, Borazjani F. Modeling multivariate longitudinal data using vine pair copula constructions. J Adv Math Model. 2023;13:448–66.

  15. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–74.

  16. Verbeke G, Lesaffre E. A linear mixed-effects model with heterogeneity in the random-effects population. J Am Stat Assoc. 1996;91:217–21.

  17. Diggle P, Diggle PJ, Heagerty P, Liang K-Y, Zeger S. Analysis of longitudinal data. Oxford: Oxford University Press; 2002.

  18. Verboon P, Pat-El R. Clustering longitudinal data using R: A Monte Carlo study. Methodology. 2022;18:144–63.

  19. Genolini C, Falissard B. KmL: k-means for longitudinal data. Comput Stat. 2010;25:317–28.

  20. Genolini C, Alacoque X, Sentenac M, Arnaud C. kml and kml3d: R packages to cluster longitudinal data. J Stat Softw. 2015;65:1–34.

  21. Genolini C, Ecochard R, Benghezal M, Driss T, Andrieu S, Subtil F. kmlShape: an efficient method to cluster longitudinal data (time-series) according to their shapes. PLoS ONE. 2016;11: e0150738.

  22. Sohil F, Sohail MU, Shabbir J. An introduction to statistical learning with applications in R: by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, New York: Springer Science and Business Media; 2013, 41ドル.98. 2022.

  23. Xiang D, Qiu P, Pu X. Nonparametric regression analysis of multivariate longitudinal data. Stat Sin. 2013;769–89.

  24. Kpotufe S. k-NN regression adapts to local intrinsic dimension. Adv Neural Inf Process Syst. 2011;24.

  25. Sheetal A, Jiang Z, Di Milia L. Using machine learning to analyze longitudinal data: a tutorial guide and best-practice recommendations for social science researchers. Appl Psychol. 2023;72:1339–64.

  26. Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat Methods. 1974;3:1–27.

  27. Den Teuling N, Pauws S, van den Heuvel E. latrend: a framework for clustering longitudinal data. arXiv preprint arXiv:2402.14621; 2024.

  28. Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006;25:4279–92.

  29. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50.

  30. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218.

  31. Nowak E. Wskaźnik podobieństwa wyników podziałów [An index of the similarity of partition results]. Przegląd Statystyczny. 1985;32:41–8.

  32. Walesiak M, Dudek A. Searching for optimal clustering procedure for a data set. 2015.

  33. Jian W, Gao Y, Hao C, Wang N, Ai T, Liu C, et al. Reference values for spirometry in Chinese aged 4–80 years. J Thorac Dis. 2017;9:4538.

  34. Braun L, Wolfgang M, Dickersin K. Defining race/ethnicity and explaining difference in research studies on lung function. Eur Respir J. 2013;41:1362–70.

  35. Strippoli M-PF, Kuehni CE, Dogaru CM, Spycher BD, McNally T, Silverman M, et al. Etiology of ethnic differences in childhood spirometry. Pediatrics. 2013;131:e1842–9.

  36. Quanjer PH, Stanojevic S, Cole TJ, Baur X, Hall GL, Culver BH, et al. Multi-ethnic reference values for spirometry for the 3–95-yr age range: the global lung function 2012 equations. 2012.

  37. Loeloe MS, Sefidkar R, Tabatabaei SM, Mehrparvar AH, Jambarsang S. Machine learning-based spirometry reference values for the Iranian population: a cross-sectional study from the Shahedieh Persian Cohort. Front Med. 2025;12:1480931.

  38. American Conference of Governmental Industrial Hygienists (ACGIH), Committee on Industrial Ventilation. Industrial ventilation: a manual of recommended practice. ACGIH; 1995.

  39. Zhang S. Challenges in KNN classification. IEEE Trans Knowl Data Eng. 2021;34:4663–75.

  40. Waring J, Lindvall C, Umeton R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif Intell Med. 2020;104: 101822.

  41. Sendak MP, D’Arcy J, Kashyap S, Gao M, Nichols M, Corey K, et al. A path for translation of machine learning products into healthcare delivery. EMJ Innov. 2020;10:19–172.

  42. Anava O, Levy K. k-nearest neighbors: from global to local. Adv Neural Inf Process Syst. 2016;29.

  43. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.

  44. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. p. 1135–44.

  45. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

Acknowledgements

The authors would like to thank the participants for taking part in this research.

Funding

The authors confirm that no funding, grants, or other financial support was received during the preparation of this manuscript.

Author information

Authors and Affiliations

  1. Center for Healthcare Data Modeling, Department of Biostatistics and Epidemiology, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran

    Mohammad Sadegh Loeloe, Reyhane Sefidkar & Sara Jambarsang

  2. Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran

    Seyyed Mohammad Tabatabaei

  3. Applied Biomedical Research Center, Basic Sciences Research Institute, Mashhad University of Medical Sciences, Mashhad, Iran

    Seyyed Mohammad Tabatabaei

  4. Industrial Diseases Research Center, Occupational Medicine Department, Shahid Sadoughi University of Medical Sciences, Yazd, Iran

    Amir Houshang Mehrparvar

  5. Occupational Medicine Department, Shahid Rahnemoon Hospital, Farrokhi Ave, Yazd, Iran

    Amir Houshang Mehrparvar

Authors
  1. Mohammad Sadegh Loeloe
  2. Seyyed Mohammad Tabatabaei
  3. Reyhane Sefidkar
  4. Amir Houshang Mehrparvar
  5. Sara Jambarsang

Contributions

M.S.L. contributed to the conceptualization and design of the study, developed the CKNNRLD algorithm, and conducted the theoretical analysis. S.M.T. assisted in the methodology and implementation of the simulation studies, providing critical feedback on the algorithm’s performance evaluation. R.S. conducted the data analysis and interpretation of results, and assisted in writing and editing the manuscript. A.H.M. provided expertise in occupational health data applications and contributed to the discussion of the algorithm’s implications in real-world contexts. S.J. led the manuscript writing and coordinated the submission process, ensuring all authors contributed to the final version of the manuscript.

Corresponding author

Correspondence to Sara Jambarsang.

Ethics declarations

Ethics approval and consent to participate

The research underwent ethical review by the Research Ethics Committee of the School of Public Health at Shahid Sadoughi University of Medical Sciences (Code: IR.SSU.SPH.REC.1402.069) and complies with the Declaration of Helsinki. The research protocol was approved by the locally appointed ethics committee, and all participants provided written informed consent to take part in this study, including authorization for the use of information obtained from the study and their medical records for research purposes. Personal information was kept confidential by the researchers, and the results were presented in general terms to ensure privacy and anonymity.

Consent for publication

All authors have given informed consent to the publication of this article in this journal.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Loeloe, M.S., Tabatabaei, S.M., Sefidkar, R. et al. Boosting K-nearest neighbor regression performance for longitudinal data through a novel learning approach. BMC Bioinformatics 26, 232 (2025). https://doi.org/10.1186/s12859-025-06205-1
