I am trying to group uncorrelated variables into subsets. Using the correlation matrix, I check each variable against the variables already placed in a list. If its correlation with any of them reaches a threshold (0.8 here), I move on to the next list, creating a new one if no existing list fits; otherwise I add the variable to the current list. At the end, the variables within each subset are not correlated with one another. I have written the code below and it works fine. However, when the number of variables is large (> 20,000), it takes more than two hours to run. Is there any way to make it faster, or to run some of the operations in parallel?
corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5,5, byrow = TRUE)
rownames(corr) <- colnames(corr) <- LETTERS[1:5]
#corr <- cor(t(dataset)) %>% abs()
vars <- rownames(corr)
list_data <- list() # must exist before the first element is assigned
list_data[[1]] <- vars[1]
for(i in 2:length(vars)){
  message(vars[i])
  added <- 1
  for(j in 1:length(list_data)){
    cur_list <- list_data[[j]]
    flag <- 1
    for(k in 1:length(cur_list)){
      corr_data <- corr[vars[i], cur_list[k]]
      if(corr_data >= 0.8){
        flag <- 0
        break
      }
    }
    if(flag == 0) next
    else {
      list_data[[j]] <- c(cur_list, vars[i])
      added <- 0
      break
    }
  }
  if(added == 1) list_data[[j+1]] <- vars[i]
}
I have added example input data with five variables. In my real data, the number of variables is around 21,000, which makes the code really slow.
1 Answer
rownames(corr) <- colnames(corr) <- 1:ncol(corr)
vars <- rownames(corr)
vars <- as.integer(vars)
list_data2 <- list()
list_data2[[1]] <- vars[1]
t1 <- proc.time()
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- corr[vars[i], ]
  for (j in 1:length(list_data2)) {
    cur_list <- list_data2[[j]]
    flag <- 1L
    for (k in 1:length(cur_list)) {
      corr_data <- corr2[cur_list[k]]
      if (corr_data >= 0.8) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data2[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data2[[j + 1L]] <- vars[i]
}
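Don't use column/row names to subset the matrix; use integers (the positions of the columns/rows) instead.
We can subset the row once in the outer loop (the line corr2 <- corr[vars[i], ]) instead of indexing the full matrix in the innermost loop.
Afterwards, we can get the names back from the indices, if needed: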
your_names <- paste0('v', seq_len(ncol(corr))) # example names; substitute your own
name_list <- lapply(list_data2, function(x) your_names[x])
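To make the name mapping concrete, here is a minimal sketch on the five-variable example from the question. It assumes the original names are saved before the dimnames are dropped; orig_names, corr_int and index_groups are illustrative names, and index_groups is the grouping I get when tracing the loop by hand for this example at the 0.8 threshold.

```r
# Example matrix from the question, built with its dimnames in place
corr <- matrix(c(1,    0.9, 0,   0.83, 0.9,
                 0.9,  1,   0.2, 0.9,  0.1,
                 0,    0.2, 1,   0.1,  0.9,
                 0.83, 0.9, 0.1, 1,    0.9,
                 0.9,  0.1, 0.9, 0.9,  1),
               5, 5, byrow = TRUE,
               dimnames = list(LETTERS[1:5], LETTERS[1:5]))

orig_names <- rownames(corr)  # keep the names before dropping them
corr_int <- unname(corr)      # unnamed copy, to be indexed by position only

# Hand-traced result of the grouping loop for this example (assumption)
index_groups <- list(c(1L, 3L), c(2L, 5L), 4L)

# Translate positions back to variable names
name_groups <- lapply(index_groups, function(idx) orig_names[idx])
print(name_groups)
```

Saving the names first matters because overwriting rownames(corr) with 1:ncol(corr) discards them.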
Update
Another huge improvement is to do the comparison once, outside the loop, and to remove the names from the resulting matrix, because subsetting unnamed matrices/vectors is much faster.
vars <- 1:ncol(corr)
list_data3 <- list()
list_data3[[1]] <- vars[1]
t1 <- proc.time()
compar <- unname(corr) >= 0.8 # do comparison outside loop
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- compar[vars[i], ]
  for (j in 1:length(list_data3)) {
    cur_list <- list_data3[[j]]
    flag <- 1L
    for (k in seq_along(cur_list)) { # little bit faster
      corr_data <- corr2[cur_list[k]]
      if (corr_data) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data3[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data3[[j + 1L]] <- vars[i]
}
t2 <- proc.time()
(t2 - t1)[3] # ~10 sec for 20k*20k symmetric matrix
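A possible further simplification, which I have not timed: since the comparison is already a logical matrix, the innermost k loop can be replaced by a single any() call on the indexed row. The sketch below wraps the same greedy grouping in a function; group_uncorrelated() and its argument names are hypothetical, and it assumes the matrix has at least one variable and no NA entries.

```r
# Greedy grouping with the innermost loop replaced by any().
# group_uncorrelated() is an illustrative name, not from any package.
group_uncorrelated <- function(corr, threshold = 0.8) {
  too_high <- unname(corr) >= threshold  # logical matrix, computed once
  groups <- list(1L)                     # first variable starts the first group
  for (i in seq_len(ncol(too_high))[-1]) {
    row_i <- too_high[i, ]
    placed <- FALSE
    for (j in seq_along(groups)) {
      # join group j only if variable i is below the threshold
      # against every current member of that group
      if (!any(row_i[groups[[j]]])) {
        groups[[j]] <- c(groups[[j]], i)
        placed <- TRUE
        break
      }
    }
    # no existing group fits, so open a new one
    if (!placed) groups[[length(groups) + 1L]] <- i
  }
  groups
}

# Example matrix from the question
corr <- matrix(c(1,    0.9, 0,   0.83, 0.9,
                 0.9,  1,   0.2, 0.9,  0.1,
                 0,    0.2, 1,   0.1,  0.9,
                 0.83, 0.9, 0.1, 1,    0.9,
                 0.9,  0.1, 0.9, 0.9,  1),
               5, 5, byrow = TRUE)

groups <- group_uncorrelated(corr)
print(lapply(groups, function(idx) LETTERS[idx]))  # map positions back to names
```

The any() version checks every member of a group instead of stopping at the first hit, so whether it beats the explicit loop with break on a 20k*20k matrix depends on the group sizes; it is worth timing both with proc.time() on your own data.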