Use a data frame of functions to calculate values using specific columns in another data frame

Question 1

Background

This routine is used in a package that calculates tree (as in Christmas tree) volumes for various species codes (spcd) and geographic regions. The equation forms and coefficients vary by species and region, so I have a dataframe of functions along with their respective species and region that calculate the volume based off of the height (ht) of the tree and diameter (dbh).

Data Setup

Note: In my package, this part is taken care of by other functions, this is just to create a reproducible example (please ignore the sloppyness)

I have a data frame that includes a column of functions, along with some information about "where" to apply those functions in another data frame.

The functions (in reality these are more complex):

func1 <- function(dbh,ht){dbh^2 + ht}
func2 <- function(dbh,ht){dbh^2 - ht}

The data frame (in reality this data frame is much longer):

spcd <- c(122, 122, 141, 141)
region <- c('OR_W', 'OR_E', 'OR_W', 'OR_E')
funcs_df <- data.frame(spcd, region, funcs)
funcs_df$funcs <- c("func1", "func2", "func1", "func2")

Then, I have another frame that has some information, including the spcd and region that should match the values in func_df:

spcd <- c(122, 141, 141, 122, 141, 122)
region <- c('OR_W', 'OR_E', 'OR_W', 'OR_E', 'OR_W', 'OR_W')
dbh <- c(12, 13, 15, 11, 10, 21)
ht <- c(101, 121, 100, 99, 88, 76)
tree_df <- data.frame(spcd, region, dbh, ht)

Applying the Functions

This is the part I would prefer feedback on.

First, I split the tree_df into distinct groups based on spcd and region so I can apply the functions that correspond to these distinct groups.

tree_split <- split(tree_df, list(tree_df$region, tree_df$spcd))

Then, I create an empty data frame to append to.

new_tree <- data.frame()

Next, (and this is where things get messy) I loop through each group, grab the top left cell that acts as a "key" to get the equation from the func_df and use mapply on each group (with some conditionals to handle NA values).

for (group in tree_split) {
 # Get the 'group key'
 region <- group$region[1]
 spcd <- group$spcd[1]
 # Get the equation from eqs
 eq <- funcs_df$funcs[which((funcs_df$spcd == spcd & funcs_df$region == 
 region))]
 # Convert func string into actual function
 eq <- eq[[1]]
 eq <- eval(parse(text=eq))
 # Apply the equation to each record in the group
 group$cvts <- mapply(eq, group$dbh, group$ht)
 # Append to new_tree
 new_tree <- rbind(new_tree ,group)
}

Discussion

This results in the desired output with the new cvts outputs according to each function defined in the dataframe:

 spcd region dbh ht cvts
4 122 OR_E 11 99 22
1 122 OR_W 12 101 245
6 122 OR_W 21 76 517
2 141 OR_E 13 121 48
3 141 OR_W 15 100 325
5 141 OR_W 10 88 188

I have a few concerns with this approach:

The old adage "if you write a for-loop you are doing it wrong" seems to apply here. Is there some way I could reduce this for-loop to some sort of apply or mapply type function?
Grabbing the key from a cell (see "# Get the 'group key'" comment above) seems sloppy. Is there a way to get this 'group key' in a more formal fashion?

Other advice is, of course, welcome.

Question 2

Can you confirm if your functions are all vectorized, like it is the case in your reproducible example?

Question 3

@flodel that is the case.

Question 4

did you get a chance to review and test my answer?

Question 5

@flodel I will try to get a chance to review it sometime this week

Question 6

I would suggest you merge your data.frames so you can split the data per function:

mg <- merge(funcs_df, tree_df)
sp <- split(mg, mg$funcs)
print(sp)
# $func1
# spcd region funcs dbh ht
# 2 122 OR_W func1 12 101
# 3 122 OR_W func1 21 76
# 5 141 OR_W func1 15 100
# 6 141 OR_W func1 10 88
# 
# $func2
# spcd region funcs dbh ht
# 1 122 OR_E func2 11 99
# 4 141 OR_E func2 13 121

Then, you just have to call each function once (since you confirmed the functions are vectorized):

cvts_list <- Map(function(fun, x) fun(x$dbh, x$ht),
 fun = mget(names(sp)), x = sp)
print(cvts_list)
# $func1
# [1] 245 517 325 188
# 
# $func2
# [1] 22 48

and stack the results into a new column, using unsplit:

mg$cvts <- unsplit(cvts_list, mg$funcs)
print(mg)
# spcd region funcs dbh ht cvts
# 1 122 OR_E func2 11 99 22
# 2 122 OR_W func1 12 101 245
# 3 122 OR_W func1 21 76 517
# 4 141 OR_E func2 13 121 48
# 5 141 OR_W func1 15 100 325
# 6 141 OR_W func1 10 88 188

Question 7

Consider merging the two dataframes then use by, the method designed to split a dataframe by one or more factors. As the object-oriented wrapper to tapply, by tends to be a more streamlined handler than split...lapply or split ... for since you can attach a function directly to pass subsetted dataframes into it.

Then, take the list of dataframes returned from by and bind them with a do.call instead of initalizing an empty dataframe and iteratively expanding it in a loop.

merged_df <- merge(funcs_df, tree_df, by=c("spcd", "region"))
process_func <- function(df) {
 # Get the equation from eqs
 eq <- df$funcs[[1]]
 # Convert func string into actual function
 eq <- eval(parse(text=eq))
 # Apply the equation to each record in the group
 df$cvts <- mapply(eq, df$dbh, df$ht)
 return(df)
}
df_list <- by(merged_df, list(merged_df$region, merged_df$spcd), FUN=process_func)
finaldf <- do.call(rbind, df_list)

flodel flodel 3,5551 gold badge16 silver badges15 bronze badges · Accepted Answer · 2018-01-06 18:36:22Z

I would suggest you merge your data.frames so you can split the data per function:

mg <- merge(funcs_df, tree_df)
sp <- split(mg, mg$funcs)
print(sp)
# $func1
# spcd region funcs dbh ht
# 2 122 OR_W func1 12 101
# 3 122 OR_W func1 21 76
# 5 141 OR_W func1 15 100
# 6 141 OR_W func1 10 88
# 
# $func2
# spcd region funcs dbh ht
# 1 122 OR_E func2 11 99
# 4 141 OR_E func2 13 121

Then, you just have to call each function once (since you confirmed the functions are vectorized):

cvts_list <- Map(function(fun, x) fun(x$dbh, x$ht),
 fun = mget(names(sp)), x = sp)
print(cvts_list)
# $func1
# [1] 245 517 325 188
# 
# $func2
# [1] 22 48

and stack the results into a new column, using unsplit:

mg$cvts <- unsplit(cvts_list, mg$funcs)
print(mg)
# spcd region funcs dbh ht cvts
# 1 122 OR_E func2 11 99 22
# 2 122 OR_W func1 12 101 245
# 3 122 OR_W func1 21 76 517
# 4 141 OR_E func2 13 121 48
# 5 141 OR_W func1 15 100 325
# 6 141 OR_W func1 10 88 188

Stack Exchange Network

Use a data frame of functions to calculate values using specific columns in another data frame

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Use a data frame of functions to calculate values using specific columns in another data frame

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions