Mapping variables in two data frames using a third data frame

Question 1

I have two data frames, df1 and df2 of identical structure. The first three columns, id, form, and instance identify the participant and form(s). The remaining variable columns, var1, var2, and var3, contain analytic data and are largely identical except for a few slight discrepancies.

I have a third data frame, map that identifies the id, form(s), and variables that have discrepancies, but does not contain the values of these discrepancies. I would like to use the map data to create a final data frame, df.final that appends two columns and places the discrepant values from df1 and df2.

Below I am providing sample data as well as a clunky-but-working for loop that creates the desired df.final. It is very slow (takes several hours to run on the full dataset) - so slow in fact it is functionally unusable (this ideally needs to be updated in near real time). I am hoping someone better at coding than me can provide alternative, faster solutions.

(Note that given the simplicity of the example data, there are alternate ways to compare df1 and df2, but assume these don't work and using map is the only option.)

# Example data, df1
df1 <- data.frame(id = rep(sprintf("K00%s", 0:9), each = 3),
 form = rep(c("A","B", "B"), times = 10),
 instance = rep(c("None", "1", "2"), times = 10),
 var1 = sample(LETTERS, 30, replace = TRUE),
 var2 = rnbinom(30, mu = 1, size = 0.02),
 var3 = sample(c("Apples", "Oranges", "Pears"), 30, replace = TRUE))
# Sample data df2, same as df1 but with slight discrepancies
df2 <- df1
df2[15, 4] <- "A"
df2[c(4, 6, 8), 5] <- c(11,15,16)
df2[27:28, 6] <- "Bannanas"
# Example "Map" that only indicates what ID, form, instance, and variable is discrepant
map <- data.frame(id = c("K004", "K001", "K001", "K002", "K008", "K009"),
 form = c("B","A", "B", "B", "B", "A"),
 instance = c("2", "None", "2", "1", "2", "None"),
 variable = c("var1", rep("var2", 3), "var3", "var3"))
# id form instance variable
# 1 K004 B 2 var1
# 2 K001 A None var2
# 3 K001 B 2 var2
# 4 K002 B 1 var2
# 5 K008 B 2 var3
# 6 K009 A None var3
## - - - - - - - - - - - - - - - - - - - 
# Currently working for-loop, but VERY slow in full data
## - - - - - - - - - - - - - - - - - - - 
df.final <- data.frame(matrix(NA, ncol = 6))
for (i in 1:nrow(map)){
 keepcols <- c("id", "form","instance", map[i,4])
 m1 <- merge(map[i,], df1[ , keepcols], by = keepcols[-4])
 m2 <- merge(m1, df2[, keepcols], by = keepcols[-4])
 df.final[i,] <- m2
}
names(df.final) <- c(names(map), "df1_entry","df2_entry")
df.final
# id form instance variable df1_entry df2_entry
# 1 K004 B 2 var1 P A
# 2 K001 A None var2 0 11
# 3 K001 B 2 var2 0 15
# 4 K002 B 1 var2 0 16
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas

Question 2

What is the motivation for this code? It looks obfuscated and therefore hard to review.

Question 3

This code identifies discrepancies in two datasets and outputs a report identifying the discrepancies in each dataset. As indicated in the question, the for loop is what needs review. I would respectfully ask you to reconsider your downvote as it is quite straightforward and I closely considered the sites guidelines. The only thing that is not verbatim is the sample data as the real data is tens of thousands of rows.

Question 4

what is the purpose of the map table? Roughly what needs to be done is pivot_longer df1 and df2, anti_join twice (each direction) and merge those results

Question 5

You do not mention on how you create the map table, but from start with df1 and df2 to the end result I suggest to skip the whole for loop.

Efficient approach would be to:

convert df1 and df2 in long format
use anti_join to get the differences
merge the differences

Code

dfl1 <- df1 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
 pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
dfl2 <- df2 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
 pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
merge(anti_join(dfl1, dfl2), anti_join(dfl2, dfl1), by = c("id", "form", "instance", "variable"), suffixes = c("_df1", "_df2"))
# id form instance variable entry_df1 entry_df2
# 1 K001 A None var2 0 11
# 2 K001 B 2 var2 0 15
# 3 K002 B 1 var2 0 16
# 4 K004 B 2 var1 I A
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas

Merijn van Tilborg Merijn van Tilborg 1812 bronze badges · Accepted Answer · 2022-03-02 15:27:45Z

You do not mention on how you create the map table, but from start with df1 and df2 to the end result I suggest to skip the whole for loop.

Efficient approach would be to:

convert df1 and df2 in long format
use anti_join to get the differences
merge the differences

Code

dfl1 <- df1 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
 pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
dfl2 <- df2 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
 pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
merge(anti_join(dfl1, dfl2), anti_join(dfl2, dfl1), by = c("id", "form", "instance", "variable"), suffixes = c("_df1", "_df2"))
# id form instance variable entry_df1 entry_df2
# 1 K001 A None var2 0 11
# 2 K001 B 2 var2 0 15
# 3 K002 B 1 var2 0 16
# 4 K004 B 2 var1 I A
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas

Stack Exchange Network

Mapping variables in two data frames using a third data frame

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Mapping variables in two data frames using a third data frame

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions