1
\$\begingroup\$

I have a dataframe of cases that score 0-1 on a group of binary attributes.

What I want to do is extract all possible ombinations of attribute triplets (e.g. A/B/C, A/B/D... out of A-E) and then sum for each possible combination triplet the number of times a case in the original dataframe matched those attributes.

Using dplyr logic as well as lapply I can solve this problem but the performance is very bad, especially for bigger dataframes and more possible attributes. My real dataframe leads to a test matrix of >1000 possible triplets and the function performs very bad on this.

Please help me optimize the code while ideally staying within the dplyr framework as much as possible.

library(tidyverse)
# Create a test data frame and vector of relevant variables
test_df <- data.frame(ID = c(1,2,3,4), Target = c(1,1,0,0),F_A = c(1,0,0,1),F_B = c(0,1,0,1),F_C = c(1,1,0,0),F_D = c(0,1,1,0),F_E = c(1,0,0,1))
invars = c("F_A","F_B","F_C","F_D","F_E")
NumOfElements = 3
# Create a full matrix of all relevant variables in NumOfElements-combinations
combn(invars,NumOfElements) %>%
 t() %>%
 as.data.frame() %>%
 rowid_to_column("ID") %>%
 select(ID, T1 = V1, T2 = V2, T3 = V3) %>%
 unite("Test",starts_with("T"),sep = "|",remove = FALSE,na.rm = TRUE) %>%
 {.} -> test_matrix
# Brute Force Function to calculate number of all IDs that fullfill the test rules
bruteForce_size = function(rule_iterator,source_df,invars){
 source_df %>%
 pivot_longer(cols = c(-ID,-Target), names_to = "Affinity", values_to = "Value") %>%
 mutate(Value = ifelse(Value ==1, Affinity,NA_character_)) %>%
 pivot_wider(names_from = Affinity, values_from = Value) %>%
 unite("Test",invars,sep = "|",remove = FALSE,na.rm = TRUE) %>%
 mutate(Size = as.numeric(rule_iterator == Test)) %$%
 sum(Size)
}
# Calculate and attach sizes to test_matrix
test_matrix %>%
 mutate(Size = unlist(lapply(Test, bruteForce_size, test_df)))
asked Aug 17, 2021 at 14:36
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Maybe something like this, with base functions:

rez <- apply(test_matrix[, c('T1', 'T2', 'T3')], 1, function(x) {
 y <- test_df[, x]
 sum(rowSums(y) == 3)
}, simplify = T)
rez # vector
test_matrix$Size <- rez
test_matrix
answered Aug 30, 2021 at 13:46
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.