Identify changes between two data.frames, explain deltas for x columns

Question 1

I wrote the following snippet to deal with a situation in which I have two datasets (usually the current version versus a previous one) and I need to identify what changes have occured: usually explaining the deltas of one specific column.

I want to identify what rows have been added, what rows have been removed as well as what (manually specified) variables have changed from the old to the new version.

I've done it like so:

Return the most recent version, but containing all rows in both x and y. This, thus, consists of adding the 'missing' or 'lost' rows from the previous version to the new version.
I can manually specify the keys that will uniquely identify a row in df.old and df.new;
Delta columns are added for variables of my choice; for each of these specified variables the output will contain 'x.delta, x.old, x.new' columns at the end;
I wanted to be able to specify additional columns that would be filled for the 'missing rows'. As the output is the most current representation, rows appearing only in the 'old' version by default only consist of the original keys.

So, in short, compare these two data.frames and identify how the difference between the sums of 'qsec' can be allocated to each row:

Original data.frame:

 id1 id2 hp cyl qsec
1 Mazda RX4 Maz 110 6 16.46
2 Mazda RX4 Wag Maz 110 6 17.02
3 Datsun 710 Dat 181 4 33.00
4 Hornet 4 Drive Hor 110 6 19.44
7 Duster 360 Dus 245 8 15.84
8 Merc 240D Mer 62 4 20.00

'New data.frame'

 id1 id2 hp cyl qsec
1 Mazda RX4 Maz 110 6 16.46
2 Mazda RX4 Wag Maz 110 6 17.02
3 Datsun 710 Dat 93 4 18.61
4 Hornet 4 Drive Hor 110 6 19.44
5 Hornet Sportabout Hor 175 8 17.02
6 Valiant Val 105 6 20.22

Function call with output:

df.ch <- df.changes(df.old, 
 df.new, 
 KEYS=c("id1", "id2"), 
 VAL = c("qsec"), # Values for which I want a delta column
 retain.columns=c("cyl")) # Columns for which the NAs should be 
 # filled with the 'old' data if rows were lost
 # (e.g. see 'HP' and 'qsec' in the output)
 id1 id2 row.changed hp cyl qsec qsec.delta qsec.new qsec.old
1 Hornet Sportabout Hor 00. New 175 8 17.02 17.02 17.02 0.00
2 Valiant Val 00. New 105 6 20.22 20.22 20.22 0.00
3 Duster 360 Dus 05. Lost NA 8 NA -15.84 0.00 15.84
4 Merc 240D Mer 05. Lost NA 4 NA -20.00 0.00 20.00
5 Mazda RX4 Maz 10. Retained 110 6 16.46 0.00 16.46 16.46
6 Mazda RX4 Wag Maz 10. Retained 110 6 17.02 0.00 17.02 17.02
7 Datsun 710 Dat 10. Retained 181 4 33.00 14.39 33.00 18.61
8 Hornet 4 Drive Hor 10. Retained 110 6 19.44 0.00 19.44 19.44 
# Check that all is good:
round(sum(df.x$qsec) - sum(df.y$qsec) + sum(df.ch$qsec.delta), 0) == 0
[1] TRUE

Below is the code I wrote: I'm new to writing these reusable chunks, I'd appreciate some feedback on what I could do more effectively.

To generate the example data.frames:

#Setup example dataframes
require(dplyr); require(tidyr)
df <- mtcars %>% 
 mutate(id1 = row.names(mtcars),
 id2 = substr(row.names(mtcars), 1, 3)) %>%
 select(id1, id2, hp, cyl, qsec)
# Select some rows
df.old <- df[c(1:4,7:8),]
df.new <- df[c(1:6),]; rm(df)
# Change a value in A ==> Should be identified via the script
df.new$qsec[df.new$id1=="Datsun 710"] = 33
df.new$hp[df.new$id1=="Datsun 710"] = 181

The function:

df.changes <- function(df.old, df.new, 
 KEYS=c("id"),
 VAL=NULL,
 retain.columns=NULL) {
 require(dplyr)
 require(tidyr)
 # Make sure everything is possible
 if(sum(!KEYS %in% names(df.old))>0 |
 sum(!KEYS %in% names(df.new))>0) {
 for(key in KEYS[!KEYS %in% names(df.old)]){
 print(paste0("Key `", key, "` not in df.old"))
 }
 for(key in KEYS[!KEYS %in% names(df.new)]){
 print(paste0("Key `", key, "` not in df.new"))
 }
 stop("Specified keys do not appear in both data.frames")
 }
 if(sum(!VAL %in% names(df.old))>0 |
 sum(!VAL %in% names(df.new))>0) {
 for(column in VAL[!VAL %in% names(df.old)]){
 print(paste0("Column `", column, "` not in df.old"))
 }
 for(column in VAL[!VAL %in% names(df.new)]){
 print(paste0("Key `", column, "` not in df.new"))
 }
 stop("Not all values required for delta columns appear in both data.frames")
 }
 if(sum(!retain.columns %in% names(df.old))>0 |
 sum(!retain.columns %in% names(df.new))>0) {
 for(column in retain.columns[!retain.columns %in% names(df.old)]){
 print(paste0("Column `", column, "` not in df.old"))
 }
 for(column in retain.columns[!retain.columns %in% names(df.new)]){
 print(paste0("Key `", column, "` not in df.new"))
 }
 stop("Not all columns specified for 'retain.columns' appear in both data.frames")
 }
 # Create key representations
 keys.old <-df.old[,KEYS] 
 keys.new <-df.new[,KEYS]
 # Identify mutations
 retained <- intersect(keys.old, keys.new)
 new <- setdiff(keys.new, keys.old)
 lost <- setdiff(keys.old, keys.new)
 # Make a master table
 if(nrow(retained)>0) retained <- cbind(retained, changetype = "10. Retained")
 if(nrow(lost)>0) lost <- cbind(lost, changetype = "05. Lost")
 if(nrow(lost)>0) new <- cbind(new, changetype = "00. New")
 keys.z <- rbind(retained, lost, new)
 # If we're not proceeding: export only this smaller data.frame
 df.z <- keys.z
 # Generate the 'delta' columns for the values specified
 if (!is.null(VAL)){
 df.old.val <- select(df.x, one_of(KEYS), one_of(VAL))
 df.new.val <- select(df.y, one_of(KEYS), one_of(VAL))
 # Create a data.frame with [KEYS, a.delta, a.new, a.old, b.delta...]
 old_new_vals <- keys.z %>% 
 left_join(df.old.val, by=KEYS) %>%
 left_join(df.new.val, by=KEYS) %>%
 gather(valuetype, value, -changetype, -one_of(KEYS)) %>%
 mutate(valuetype = gsub("\\.x", ".old", valuetype),
 valuetype = gsub("\\.y", ".new", valuetype)) %>%
 separate(valuetype, into = c("column", "version"), sep="\\.") %>%
 spread(key=version, value=value) %>%
 mutate(new = ifelse(is.na(new), 0, new),
 old = ifelse(is.na(old), 0, old)) %>%
 mutate(delta = new - old) %>%
 gather(valuetype, value, -column, -changetype, -one_of(KEYS)) %>%
 unite(colname, column, valuetype, sep=".") %>%
 spread(key=colname, value = value) %>%
 select(-changetype)
 df.z <- keys.z %>%
 left_join(df.y, by=KEYS) %>%
 left_join(old_new_vals, by=KEYS) %>%
 mutate(changetype = as.character(changetype)) %>%
 rename(row.changed = changetype) %>%
 arrange(row.changed)
 }
 # Now identify for which columns the 'lost' rows should be 
 # supplemented with the data from the original data.frame
 if(!is.null(retain.columns)){
 missing.values <- df.z %>%
 filter(!complete.cases(.[,retain.columns])) %>%
 select(one_of(KEYS), one_of(retain.columns)) %>%
 left_join(df.old[,c(KEYS, retain.columns)], by=KEYS)
 df.z <- df.z %>% 
 left_join(missing.values, by=KEYS)
 # TODO: Make this prettier, surely the loop isn't required?
 for(var in retain.columns){
 df.z[[var]] <- ifelse(is.na(df.z[[var]]), 
 df.z[[paste0(var, ".y")]],
 df.z[[var]])
 df.z[[paste0(var, ".x")]] <- NULL
 df.z[[paste0(var, ".y")]] <- NULL
 }
 }
 return(df.z)
}

Question 2

I feel you wrote some very complicated code when the hardest part (from an algorithm point of view) should be a single merge of your two data.frames. So my rewrite is centered around a call to the base merge function. The only trick is to add an is column to both data.frames before merging so the output data.frame M will contain two is.new and is.old columns telling us from which of the two input data.frames the output rows are coming from. From here, M contains everything you need to know, and forming the output is just a few lines of vectorized operations to add rows.changed, the new data, the retained data, and the comparisons.

I hope you will agree that it is much more readable and easier to maintain this way. dplyr (similarly data.table) is a great tool for speeds or for making clean/concise code, but I don't think it applied well here.

Also notice how I used stopifnot in a concise manner for doing all your input checks.

df.changes <- function(df.old, df.new, 
 KEYS = c("id"),
 VAL = NULL,
 retain.columns = NULL) {
 # input checks 
 stopifnot(KEYS %in% names(df.old),
 KEYS %in% names(df.new),
 VAL %in% names(df.old),
 VAL %in% names(df.new),
 retain.columns %in% names(df.new),
 retain.columns %in% names(df.old))
 # add columns to help us track new/old provenance
 N <- transform(df.new, is = TRUE)
 O <- transform(df.old, is = TRUE)
 # merge
 M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
 M$is.new <- !is.na(M$is.new) # replace NA with FALSE
 M$is.old <- !is.na(M$is.old) # replace NA with FALSE
 # this will be our output
 O <- M[KEYS]
 # add rows.changed
 O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
 ifelse(is.old, "05. Lost",
 "00. New")))
 # add data from new
 original.vars <- setdiff(names(df.new), KEYS)
 for (var in original.vars)
 O[[var]] <- M[[paste0(var, ".new")]]
 # modify data for retain.columns
 for (var in retain.columns)
 O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
 M[[paste0(var, ".old")]])
 # add comparisons
 for (var in VAL) {
 old.var <- paste0(var, ".old")
 new.var <- paste0(var, ".new")
 del.var <- paste0(var, ".delta")
 O[[del.var]] <- M[[new.var]] - M[[old.var]]
 O[[old.var]] <- M[[old.var]]
 O[[new.var]] <- M[[new.var]]
 }
 # reorder rows
 O[order(O$row.changed), ]
}

Question 3

I tried to get rid of all the loops, making it more difficult than it should have been! Your solution is indeed much easier to understand! Thank you!

flodel flodel 3,5551 gold badge16 silver badges15 bronze badges · Accepted Answer · 2015-06-21 14:50:37Z

I feel you wrote some very complicated code when the hardest part (from an algorithm point of view) should be a single merge of your two data.frames. So my rewrite is centered around a call to the base merge function. The only trick is to add an is column to both data.frames before merging so the output data.frame M will contain two is.new and is.old columns telling us from which of the two input data.frames the output rows are coming from. From here, M contains everything you need to know, and forming the output is just a few lines of vectorized operations to add rows.changed, the new data, the retained data, and the comparisons.

I hope you will agree that it is much more readable and easier to maintain this way. dplyr (similarly data.table) is a great tool for speeds or for making clean/concise code, but I don't think it applied well here.

Also notice how I used stopifnot in a concise manner for doing all your input checks.

df.changes <- function(df.old, df.new, 
 KEYS = c("id"),
 VAL = NULL,
 retain.columns = NULL) {
 # input checks 
 stopifnot(KEYS %in% names(df.old),
 KEYS %in% names(df.new),
 VAL %in% names(df.old),
 VAL %in% names(df.new),
 retain.columns %in% names(df.new),
 retain.columns %in% names(df.old))
 # add columns to help us track new/old provenance
 N <- transform(df.new, is = TRUE)
 O <- transform(df.old, is = TRUE)
 # merge
 M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
 M$is.new <- !is.na(M$is.new) # replace NA with FALSE
 M$is.old <- !is.na(M$is.old) # replace NA with FALSE
 # this will be our output
 O <- M[KEYS]
 # add rows.changed
 O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
 ifelse(is.old, "05. Lost",
 "00. New")))
 # add data from new
 original.vars <- setdiff(names(df.new), KEYS)
 for (var in original.vars)
 O[[var]] <- M[[paste0(var, ".new")]]
 # modify data for retain.columns
 for (var in retain.columns)
 O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
 M[[paste0(var, ".old")]])
 # add comparisons
 for (var in VAL) {
 old.var <- paste0(var, ".old")
 new.var <- paste0(var, ".new")
 del.var <- paste0(var, ".delta")
 O[[del.var]] <- M[[new.var]] - M[[old.var]]
 O[[old.var]] <- M[[old.var]]
 O[[new.var]] <- M[[new.var]]
 }
 # reorder rows
 O[order(O$row.changed), ]
}

I tried to get rid of all the loops, making it more difficult than it should have been! Your solution is indeed much easier to understand! Thank you!

Stack Exchange Network

Identify changes between two data.frames, explain deltas for x columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Identify changes between two data.frames, explain deltas for x columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions