I have two data frames, df1
and df2
of identical structure. The first three columns, id
, form
, and instance
identify the participant and form(s). The remaining variable columns, var1
, var2
, and var3
, contain analytic data and are largely identical except for a few slight discrepancies.
I have a third data frame, map
that identifies the id, form(s), and variables that have discrepancies, but does not contain the values of these discrepancies. I would like to use the map
data to create a final data frame, df.final
that appends two columns and places the discrepant values from df1
and df2
.
Below I am providing sample data as well as a clunky-but-working for
loop that creates the desired df.final
. It is very slow (takes several hours to run on the full dataset) - so slow in fact it is functionally unusable (this ideally needs to be updated in near real time). I am hoping someone better at coding than me can provide alternative, faster solutions.
(Note that given the simplicity of the example data, there are alternate ways to compare df1
and df2
, but assume these don't work and using map
is the only option.)
# Example data, df1
df1 <- data.frame(id = rep(sprintf("K00%s", 0:9), each = 3),
form = rep(c("A","B", "B"), times = 10),
instance = rep(c("None", "1", "2"), times = 10),
var1 = sample(LETTERS, 30, replace = TRUE),
var2 = rnbinom(30, mu = 1, size = 0.02),
var3 = sample(c("Apples", "Oranges", "Pears"), 30, replace = TRUE))
# Sample data df2, same as df1 but with slight discrepancies
df2 <- df1
df2[15, 4] <- "A"
df2[c(4, 6, 8), 5] <- c(11,15,16)
df2[27:28, 6] <- "Bannanas"
# Example "Map" that only indicates what ID, form, instance, and variable is discrepant
map <- data.frame(id = c("K004", "K001", "K001", "K002", "K008", "K009"),
form = c("B","A", "B", "B", "B", "A"),
instance = c("2", "None", "2", "1", "2", "None"),
variable = c("var1", rep("var2", 3), "var3", "var3"))
# id form instance variable
# 1 K004 B 2 var1
# 2 K001 A None var2
# 3 K001 B 2 var2
# 4 K002 B 1 var2
# 5 K008 B 2 var3
# 6 K009 A None var3
## - - - - - - - - - - - - - - - - - - -
# Currently working for-loop, but VERY slow in full data
## - - - - - - - - - - - - - - - - - - -
df.final <- data.frame(matrix(NA, ncol = 6))
for (i in 1:nrow(map)){
keepcols <- c("id", "form","instance", map[i,4])
m1 <- merge(map[i,], df1[ , keepcols], by = keepcols[-4])
m2 <- merge(m1, df2[, keepcols], by = keepcols[-4])
df.final[i,] <- m2
}
names(df.final) <- c(names(map), "df1_entry","df2_entry")
df.final
# id form instance variable df1_entry df2_entry
# 1 K004 B 2 var1 P A
# 2 K001 A None var2 0 11
# 3 K001 B 2 var2 0 15
# 4 K002 B 1 var2 0 16
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas
1 Answer 1
You do not mention on how you create the map table, but from start with df1 and df2 to the end result I suggest to skip the whole for loop.
Efficient approach would be to:
- convert df1 and df2 in long format
- use anti_join to get the differences
- merge the differences
Code
dfl1 <- df1 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
dfl2 <- df2 %>% mutate(var2 = as.factor(var2)) %>% # just make sure all are the same type
pivot_longer(cols = 4:6, names_to = "variable", values_to = "entry")
merge(anti_join(dfl1, dfl2), anti_join(dfl2, dfl1), by = c("id", "form", "instance", "variable"), suffixes = c("_df1", "_df2"))
# id form instance variable entry_df1 entry_df2
# 1 K001 A None var2 0 11
# 2 K001 B 2 var2 0 15
# 3 K002 B 1 var2 0 16
# 4 K004 B 2 var1 I A
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas
for
loop is what needs review. I would respectfully ask you to reconsider your downvote as it is quite straightforward and I closely considered the sites guidelines. The only thing that is not verbatim is the sample data as the real data is tens of thousands of rows. \$\endgroup\$