Group uncorrelated variables into subsets using correlation matrix

Question 1

I am trying to group uncorrelated variables into subsets. Using the correlation matrix, I check each variable against the variables already in a subset. If its correlation with any of them is at or above a threshold (0.8 here), I try the next subset, and if no existing subset fits I create a new one; otherwise I add the variable to the current subset. At the end, no two variables within a subset are correlated above the threshold. The code below works, but when the number of variables is high (> 20,000) it takes more than two hours to run. Is there any way to make it faster, or to run some of the operations in parallel?

corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5,5, byrow = TRUE)
rownames(corr) <- colnames(corr) <- LETTERS[1:5]
#corr <- cor(t(dataset)) %>% abs()
vars <- rownames(corr)
list_data <- list() # must exist before the first assignment below
list_data[[1]] <- vars[1]
for(i in 2:length(vars)){
  message(vars[i])
  added <- 1 # 1 = not yet placed in any subset
  for(j in 1:length(list_data)){
    cur_list <- list_data[[j]]
    flag <- 1 # 1 = no conflict with subset j so far
    for(k in 1:length(cur_list)){
      corr_data <- corr[vars[i], cur_list[k]]
      if(corr_data >= 0.8){
        flag <- 0
        break
      }
    }
    if(flag == 0) next
    else {
      list_data[[j]] <- c(cur_list, vars[i])
      added <- 0
      break
    }
  }
  # no existing subset fits: start a new one
  if(added == 1) list_data[[j+1]] <- vars[i]
}

I have added an example input with five variables. My real data has around 21,000 variables, which makes the code really slow.

Answer

rownames(corr) <- colnames(corr) <- 1:ncol(corr)
vars <- rownames(corr)
vars <- as.integer(vars)
list_data2 <- list()
list_data2[[1]] <- vars[1]
t1 <- proc.time()
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- corr[vars[i], ]
  for (j in 1:length(list_data2)) {
    cur_list <- list_data2[[j]]
    flag <- 1L
    for (k in 1:length(cur_list)) {
      corr_data <- corr2[cur_list[k]]
      if (corr_data >= 0.8) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data2[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data2[[j + 1L]] <- vars[i]
}

Don't use column/row names to subset the matrix; use integers (the positions of the columns/rows).
Subset the row once in the outer loop (line: corr2 <- corr[vars[i], ]).
Afterwards you can get the names back from the indexes, if needed:

n <- ncol(corr) # number of variables
your_names <- paste0('v', seq_len(n)) # example names
name_list <- lapply(list_data2, function(x) your_names[x])
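The first point above (integer positions instead of names) can be checked with a rough sketch like the one below. The 200 x 200 matrix, the names v1..v200 and the repetition count are made up purely for illustration; the timings depend on the machine and R version, so treat them as indicative only.

```r
# Hypothetical 200 x 200 matrix, used only to compare the two indexing styles.
set.seed(1)
n <- 200
m <- matrix(runif(n * n), n, n)
rownames(m) <- colnames(m) <- paste0("v", seq_len(n))

# The same cell, looked up by name and by position.
by_name <- m["v7", "v42"]
by_pos  <- m[7, 42]

# Rough timing of repeated single-element lookups (elapsed seconds).
reps <- 1e5
t_name <- system.time(for (r in seq_len(reps)) m["v7", "v42"])[["elapsed"]]
t_pos  <- system.time(for (r in seq_len(reps)) m[7, 42])[["elapsed"]]
c(by_name = t_name, by_position = t_pos)
```

Both lookups return the same value; only the cost of resolving the index differs, and that cost is paid on every iteration of the innermost loop.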

Update

Another huge improvement is to do the comparison once, outside the loop, and to remove the names from the resulting matrix; subsetting unnamed matrices and vectors is much faster.

vars <- 1:ncol(corr)
list_data3 <- list()
list_data3[[1]] <- vars[1]
t1 <- proc.time()
compar <- unname(corr) >= 0.8 # do comparison outside loop
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- compar[vars[i], ]
  for (j in 1:length(list_data3)) {
    cur_list <- list_data3[[j]]
    flag <- 1L
    for (k in seq_along(cur_list)) { # little bit faster
      corr_data <- corr2[cur_list[k]]
      if (corr_data) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data3[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data3[[j + 1L]] <- vars[i]
}
t2 <- proc.time()
(t2 - t1)[3] # ~10 sec for 20k*20k symmetric matrix
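As a variation on the same idea, the innermost k loop can be written as a single any() call over the precomputed logical row. This is only a sketch, not part of the timed code above: the function name group_uncorrelated and its threshold argument are hypothetical, it assumes corr already holds absolute correlations, and it has not been benchmarked against the explicit loop (which can stop at the first conflict, so it may well be faster on some data).

```r
# Hypothetical helper: same greedy grouping, with the inner loop replaced by any().
group_uncorrelated <- function(corr, threshold = 0.8) {
  # Logical matrix computed once: TRUE where a pair is too strongly correlated.
  high <- unname(corr) >= threshold
  groups <- list(1L)
  for (i in seq_len(ncol(corr))[-1]) {
    placed <- FALSE
    for (j in seq_along(groups)) {
      # Variable i fits group j if it conflicts with none of its members.
      if (!any(high[i, groups[[j]]])) {
        groups[[j]] <- c(groups[[j]], i)
        placed <- TRUE
        break
      }
    }
    # No existing group fits: start a new one.
    if (!placed) groups[[length(groups) + 1L]] <- i
  }
  groups
}

# The five-variable example from the question.
corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,
                 0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5, 5, byrow = TRUE)
res <- group_uncorrelated(corr)
res_names <- lapply(res, function(g) LETTERS[g])
```

On the example matrix this gives the groups (A, C), (B, E) and (D), which is what the loop versions produce for the same input.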

minem · 992 reputation · 1 gold badge, 8 silver badges, 12 bronze badges · Accepted Answer · 2023-01-30 16:27:40Z
