
I have a performance issue in R. I have a function that iterates over a dataframe for different levels of "site" and "method". The function samples 1 to 1000 interactions (rows), converts these to matrices and calculates a value of connectance for each matrix. This is repeated n times. The function runs exactly as I want, returning a dataframe with connectance values for 1 to 1000 interactions, when it is repeated a small number of times. The problem is that when I increase the number of repetitions (say to 100), the function runs progressively slower.

df <- read.table(text = "bird_sp plant_sp value site method
 1 species_a plant_a 1 a m
 2 species_a plant_a 1 a m
 3 species_b plant_b 1 a m
 4 species_b plant_b 1 a m
 5 species_c plant_c 1 a m
 6 species_a plant_a 1 b m
 7 species_a plant_a 1 b m
 8 species_b plant_b 1 b m
 9 species_b plant_b 1 b m
 10 species_c plant_c 1 b m
 11 species_a plant_a 1 a f
 12 species_a plant_a 1 a f
 13 species_b plant_b 1 a f
 14 species_b plant_b 1 a f
 15 species_c plant_c 1 a f
 16 species_a plant_a 1 b f
 17 species_a plant_a 1 b f
 18 species_b plant_b 1 b f
 19 species_b plant_b 1 b f
 20 species_c plant_c 1 b f", header = TRUE)
xDegrees <- function(df, size, numRep){
  #Loading required library
  require(bipartite)
  #Creating vector of unique combinations
  df <- within(df, {SiteMethod <- paste(site, method, sep = ":")})
  #Creating empty dataframe
  connectMatrix <- as.data.frame(matrix(rep(0, 4), ncol = 4))
  colnames(connectMatrix) <- c("Site", "Method", "Size", "connectance")
  #Beginning of matrix
  k <- 1
  #Beginning subsetting loop
  for(i in 1:length(unique(df$SiteMethod))){
    #subsetting dataset
    dfSub <- subset(df, SiteMethod == unique(df$SiteMethod)[i])
    #Storing site for matrix
    site <- as.character(dfSub[1,]$site)
    #Storing method for matrix
    method <- as.character(dfSub[1,]$method)
    for(l in 1:numRep){
      #Beginning calculation loop
      for(j in 1:length(size)){
        #show progress
        print(paste("S:M", i, j, "completed", sep = " "))
        #The size being calculated
        subSize <- size[j]
        #generate random samples and convert to matrices
        rows <- sample(1:nrow(dfSub), subSize, replace = T)
        intlist <- dfSub[rows,]
        mat <- with(intlist, tapply(value, list(plant_sp, bird_sp), sum))
        mat[is.na(mat)] <- 0
        #network level function to calculate connectance
        con <- networklevel(mat, index = c("connectance"))
        #Stitch matrix together
        connectMatrix[k,] <- c(site, method, subSize, con)
        #Update row
        k <- k + 1
      }
    }
  }
  #Return complete matrix
  return(connectMatrix)
}
#run the function. 1:1000 interactions, 100 reps
stuff <- xDegrees(df, size = 1:1000, numRep = 100)

Any ideas on how to speed this up?

asked Dec 13, 2017 at 22:58

2 Answers


Consider by to slice your dataframe by the site and method factors and then pass each subset into an adjusted xDegrees function. Other than the group slicing, the other major change is replacing the two inner for loops with a single sapply call over a vector built with rep. The code below is, of course, untested.

# Loading required library
require(bipartite)

xDegrees <- function(dfSub, size, numRep){
  # Retrieving current site
  s <- as.character(dfSub$site[[1]])
  # Retrieving current method
  m <- as.character(dfSub$method[[1]])
  # Build one long vector of sample sizes covering all random runs
  sample_sizes <- rep(size, numRep)
  # Generate random samples, convert to matrices and compute connectance
  con_vec <- sapply(sample_sizes, function(n) {
    rows <- sample(1:nrow(dfSub), n, replace = TRUE)
    intlist <- dfSub[rows, ]
    mat <- with(intlist, tapply(value, list(plant_sp, bird_sp), sum))
    mat[is.na(mat)] <- 0
    # network level function to calculate connectance
    networklevel(mat, index = "connectance")
  })
  # con_vec has length(size) * numRep elements, one per entry of sample_sizes
  return(data.frame(method = m, site = s, subsize = sample_sizes, con = con_vec))
}

df_List <- by(df, df[, c("site", "method")], FUN = function(d)
  xDegrees(d, size = 1:1000, numRep = 100))
final_df <- do.call(rbind, df_List)
answered Dec 14, 2017 at 5:01

Without installing bipartite and getting into the exact details of what the calculations are doing, a few things jump out at me that might help you.

Growing the Object (connectMatrix)

It looks like you're in Circle 2 - Growing Objects of the R Inferno.
Specifically, the line where you've got:

#Stitch matrix together 
connectMatrix[k,] <- c(site, method, subSize, con)

In that scenario, every time you add another row to connectMatrix, R has to enlarge the object, copying it and fragmenting the memory you're using, which, as you've observed, starts out okay but then really drags. There are other write-ups of that same issue around, including a much less intricate Stack Overflow question.
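As a minimal sketch of the alternative (using the names from your function; nGroups and totalRows are illustrative helpers), you can allocate the full result once, since the total number of rows is known up front, and then fill row k instead of appending:

# Pre-allocate one row per site:method group, repetition and size
nGroups <- length(unique(df$SiteMethod))
totalRows <- nGroups * numRep * length(size)
connectMatrix <- data.frame(Site = character(totalRows),
                            Method = character(totalRows),
                            Size = numeric(totalRows),
                            connectance = numeric(totalRows),
                            stringsAsFactors = FALSE)

# ...then, inside the innermost loop, write into the pre-allocated row:
connectMatrix[k, ] <- list(site, method, subSize, con)
k <- k + 1

Using list() rather than c() on the right-hand side also keeps Size and connectance numeric instead of coercing every column to character.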

nested for loops

You've got three layers of nested for loops

for(i in 1:length(unique(df$SiteMethod))){
  #subsetting dataset
  ...
  ...
  ...
  for(l in 1:numRep){
    #Beginning calculation loop
    for(j in 1:length(size)){
      ...
      ...
    }
  }
}

The outer for loop jumps out at me as something the plyr package might be good for: essentially you're splitting the main dataframe into subsets, then doing things to each subset (the meat of your second and third for loops). plyr was designed for exactly this split-apply-combine pattern, and there's a really good paper (Wickham's "The Split-Apply-Combine Strategy for Data Analysis") on how to use it.

You may be able to do two (or maybe even three) layers of splitting-applying-combining.
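As a rough, untested sketch of that idea with plyr (ddply and .() come from plyr; oneConnectance is an illustrative helper standing in for the body of your two inner loops):

library(plyr)
library(bipartite)

# Illustrative helper: connectance for one random draw of n interactions
oneConnectance <- function(dfSub, n) {
  rows <- sample(nrow(dfSub), n, replace = TRUE)
  mat <- with(dfSub[rows, ], tapply(value, list(plant_sp, bird_sp), sum))
  mat[is.na(mat)] <- 0
  networklevel(mat, index = "connectance")
}

# Split by site and method, apply the sampling to each subset, combine the pieces
result <- ddply(df, .(site, method), function(dfSub) {
  sizes <- rep(1:1000, times = 100)   # 1:1000 interactions, 100 reps
  data.frame(Size = sizes,
             connectance = sapply(sizes, oneConnectance, dfSub = dfSub))
})

That replaces both the manual subsetting and the row-by-row growing of the result in one pass.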

If the data lend themselves to grouping operations, those tend to help, too, but it doesn't seem like it based on the comments in the code.

Parallel processing

If, after you've optimized for not growing the object and chunked the code for splitting/combining (or another method of getting to the same outcome), you're still not at the level of performance you're looking for, you can journey into parallel application.

Basically, parallel application works well when you're already using apply or plyr approaches, because you can then convert over to parallel::mcmapply or one of the other parallel apply functions.

For example, we use mcmapply in production at work to take a 30-minute process down to about 6 minutes. In my experience, parallel processing is only worth it once you've done everything else you can to speed the code up without it: parallelizing tends to be more work to get right, and, as a bonus, the other optimizations usually make it easier to convert the process to run on multiple cores.
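For illustration only, assuming the oneConnectance helper sketched above and a Unix-alike machine (mcmapply relies on forking, which isn't available on Windows), the inner sapply could become something like:

library(parallel)

# Run the per-sample connectance calculations across several cores;
# dfSub stands for whichever site:method subset is being processed.
sizes <- rep(1:1000, times = 100)
con_vec <- mcmapply(oneConnectance,
                    n = sizes,
                    MoreArgs = list(dfSub = dfSub),
                    mc.cores = max(1, detectCores() - 1))

The mc.cores value is just a placeholder; benchmark on your own machine before settling on a core count.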

Finally, Efficient R is a good general resource on optimizing R processes.

answered Dec 14, 2017 at 4:22
  • upvoted for good links (Commented Dec 14, 2017 at 4:32)
