\$\begingroup\$

I am trying to group uncorrelated variables into subsets. Using the correlation matrix, I check each variable against the existing subsets: if its correlation with every member of a subset is below a threshold (0.8 here), I add it to that subset; if no existing subset qualifies, I start a new list. At the end, no two variables within a subset are correlated above the threshold. The code below works fine. However, when the number of variables is high (> 20,000), it takes more than two hours to run. Are there any suggestions to make it faster, or to do some operations in parallel?

corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5,5, byrow = TRUE)
rownames(corr) <- colnames(corr) <- LETTERS[1:5]
#corr <- cor(t(dataset)) %>% abs()
vars <- rownames(corr)
list_data <- list()  # initialise the list of subsets before assigning into it
list_data[[1]] <- vars[1]
for(i in 2:length(vars)){
  message(vars[i])
  added <- 1  # 1 = vars[i] has not been placed in any subset yet
  for(j in 1:length(list_data)){
    cur_list <- list_data[[j]]
    flag <- 1  # 1 = no member of this subset is too correlated with vars[i]
    for(k in 1:length(cur_list)){
      corr_data <- corr[vars[i], cur_list[k]]
      if(corr_data >= 0.8){
        flag <- 0
        break
      }
    }
    if(flag == 0) next
    else {
      list_data[[j]] <- c(cur_list, vars[i])
      added <- 0
      break
    }
  }
  # no subset accepted vars[i]: start a new one
  if(added == 1) list_data[[j+1]] <- vars[i]
}

I have added example input data with five variables. In my real data, the number of variables is around 21,000, which makes the code really slow.

Mast
asked Jan 29, 2023 at 19:14
\$\endgroup\$

1 Answer

\$\begingroup\$
rownames(corr) <- colnames(corr) <- 1:ncol(corr)
vars <- rownames(corr)
vars <- as.integer(vars)
list_data2 <- list()
list_data2[[1]] <- vars[1]
t1 <- proc.time()
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- corr[vars[i], ]
  for (j in 1:length(list_data2)) {
    cur_list <- list_data2[[j]]
    flag <- 1L
    for (k in 1:length(cur_list)) {
      corr_data <- corr2[cur_list[k]]
      if (corr_data >= 0.8) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data2[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data2[[j + 1L]] <- vars[i]
}
  • Don't use column/row names to subset the matrix; use integers (the positions of the columns/rows).

  • We can subset the row once in the outer loop (line: corr2 <- corr[vars[i], ]).

  • Afterwards we can get the names back from the indexes, if needed:

your_names <- paste0('v', seq_len(ncol(corr))) # example names, one per variable
name_list <- lapply(list_data2, function(x) your_names[x])
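The first point can be checked with a rough timing sketch. Everything here is an assumption made for illustration: the toy matrix `m`, its size, the names in `nm`, and the helper functions `by_name` and `by_pos`. Actual timings depend on the machine and the matrix size, so none are quoted.

```r
# Hypothetical toy matrix; size and names are illustrative only
set.seed(1)
n  <- 300L
m  <- matrix(runif(n * n), n, n)
nm <- paste0("v", seq_len(n))
dimnames(m) <- list(nm, nm)

# The same random cells, addressed by position and by name
rows_int <- sample.int(n, 20000L, replace = TRUE)
cols_int <- sample.int(n, 20000L, replace = TRUE)
rows_chr <- nm[rows_int]
cols_chr <- nm[cols_int]

# Element-by-element lookup by name (as in the question's inner loop)
by_name <- function() {
  s <- 0
  for (k in seq_along(rows_chr)) s <- s + m[rows_chr[k], cols_chr[k]]
  s
}

# Element-by-element lookup by integer position
by_pos <- function() {
  s <- 0
  for (k in seq_along(rows_int)) s <- s + m[rows_int[k], cols_int[k]]
  s
}

t_name <- system.time(res_name <- by_name())
t_pos  <- system.time(res_pos  <- by_pos())
print(rbind(by_name = t_name[1:3], by_position = t_pos[1:3]))
```

Both functions read exactly the same cells, so they return the same sum; only the lookup cost differs, because each name-based access first has to match the character string against the dimnames.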

Update

Another huge improvement is to do the comparison once, outside the loop, and to strip the names from the resulting logical matrix with unname(); subsetting an unnamed matrix or vector is much faster.

vars <- 1:ncol(corr)
list_data3 <- list()
list_data3[[1]] <- vars[1]
t1 <- proc.time()
compar <- unname(corr) >= 0.8 # do comparison outside loop
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- compar[vars[i], ]
  for (j in 1:length(list_data3)) {
    cur_list <- list_data3[[j]]
    flag <- 1L
    for (k in seq_along(cur_list)) { # little bit faster
      corr_data <- corr2[cur_list[k]]
      if (corr_data) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data3[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data3[[j + 1L]] <- vars[i]
}
t2 <- proc.time()
(t2 - t1)[3] # ~10 sec for 20k*20k symmetric matrix
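As a possible further simplification (not timed here, so treat it as a sketch rather than a measured improvement), the innermost `k` loop can be replaced by a single vectorized `any()` over the members of the current subset. The function name `group_uncorrelated` and its `threshold` argument are hypothetical; the greedy logic is meant to match the loops above.

```r
# Hypothetical helper: greedy grouping with a vectorized membership check
group_uncorrelated <- function(corr, threshold = 0.8) {
  too_high <- unname(corr) >= threshold  # logical matrix, computed once
  n <- ncol(too_high)
  stopifnot(n >= 1L)
  groups <- list(1L)                     # the first variable starts subset 1
  for (i in seq_len(n)[-1L]) {
    row_i  <- too_high[i, ]              # subset the row once per variable
    placed <- FALSE
    for (j in seq_along(groups)) {
      # TRUE if variable i clashes with no member of subset j
      if (!any(row_i[groups[[j]]])) {
        groups[[j]] <- c(groups[[j]], i)
        placed <- TRUE
        break
      }
    }
    # no subset accepted variable i: open a new one
    if (!placed) groups[[length(groups) + 1L]] <- i
  }
  groups
}

# The five-variable example from the question
corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,
                 0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5, 5, byrow = TRUE)
groups <- group_uncorrelated(corr)
name_list <- lapply(groups, function(g) LETTERS[g])
# groups holds the index sets (1, 3), (2, 5), (4), i.e. {A, C}, {B, E}, {D}

# Sanity check: no pair inside a subset reaches the threshold
valid <- all(vapply(groups, function(g) {
  sub <- corr[g, g, drop = FALSE]
  all(sub[upper.tri(sub)] < 0.8)
}, logical(1)))
```

Unlike the explicit loop, `any()` inspects every member of the subset instead of stopping at the first clash, so whether it is actually faster depends on how large the subsets grow; it is worth timing on the real data before adopting it.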
answered Jan 30, 2023 at 16:27
\$\endgroup\$