I have a performance issue in R. I have a function that iterates over a dataframe for different levels of "site" and "method". The function samples 1 to 1000 interactions (rows), converts these to matrices, and calculates a value of connectance for each matrix. This is repeated n times. The function runs exactly as I want, returning a dataframe of connectance values for 1 to 1000 interactions, when the number of repetitions is small. The problem is that when I increase the number of repetitions (say, to 100), the function runs progressively more slowly.
df <- read.table(text = "bird_sp plant_sp value site method
1 species_a plant_a 1 a m
2 species_a plant_a 1 a m
3 species_b plant_b 1 a m
4 species_b plant_b 1 a m
5 species_c plant_c 1 a m
6 species_a plant_a 1 b m
7 species_a plant_a 1 b m
8 species_b plant_b 1 b m
9 species_b plant_b 1 b m
10 species_c plant_c 1 b m
11 species_a plant_a 1 a f
12 species_a plant_a 1 a f
13 species_b plant_b 1 a f
14 species_b plant_b 1 a f
15 species_c plant_c 1 a f
16 species_a plant_a 1 b f
17 species_a plant_a 1 b f
18 species_b plant_b 1 b f
19 species_b plant_b 1 b f
20 species_c plant_c 1 b f", header = TRUE)
xDegrees <- function(df, size, numRep) {
  # Loading required library
  require(bipartite)
  # Creating vector of unique site:method combinations
  df <- within(df, {SiteMethod <- paste(site, method, sep = ":")})
  # Creating empty dataframe
  connectMatrix <- as.data.frame(matrix(rep(0, 4), ncol = 4))
  colnames(connectMatrix) <- c("Site", "Method", "Size", "connectance")
  # Beginning of matrix
  k <- 1
  # Beginning subsetting loop
  for (i in 1:length(unique(df$SiteMethod))) {
    # Subsetting dataset
    dfSub <- subset(df, SiteMethod == unique(df$SiteMethod)[i])
    # Storing site for matrix
    site <- as.character(dfSub[1, ]$site)
    # Storing method for matrix
    method <- as.character(dfSub[1, ]$method)
    for (l in 1:numRep) {
      # Beginning calculation loop
      for (j in 1:length(size)) {
        # Show progress
        print(paste("S:M", i, j, "completed", sep = " "))
        # The size being calculated
        subSize <- size[j]
        # Generate random samples and convert to matrices
        rows <- sample(1:nrow(dfSub), subSize, replace = TRUE)
        intlist <- dfSub[rows, ]
        mat <- with(intlist, tapply(value, list(plant_sp, bird_sp), sum))
        mat[is.na(mat)] <- 0
        # Network-level function to calculate connectance
        con <- networklevel(mat, index = c("connectance"))
        # Stitch matrix together
        connectMatrix[k,] <- c(site, method, subSize, con)
        # Update row
        k <- k + 1
      }
    }
  }
  # Return complete matrix
  return(connectMatrix)
}
#run the function. 1:1000 interactions, 100 reps
stuff <- xDegrees(df, size = 1:1000, numRep = 100)
Any ideas on how to speed this up?
2 Answers
Consider by to slice your dataframe by the site and method factors and then pass each dataframe subset into an adjusted xDegrees function. Other than the group slice, the other major change is combining the for loops with rep and passing the result into one sapply call. The code below is of course not thoroughly tested.
# Loading required library
require(bipartite)

xDegrees <- function(dfSub, size, numRep) {
  # Retrieving current site
  s <- as.character(dfSub$site[[1]])
  # Retrieving current method
  m <- as.character(dfSub$method[[1]])
  # Build one long vector of sample sizes covering every repetition
  sample_iter <- rep(size, numRep)
  # Generate random samples, convert to matrices, and compute connectance
  con_vec <- sapply(sample_iter, function(n) {
    rows <- sample(nrow(dfSub), n, replace = TRUE)
    intlist <- dfSub[rows, ]
    mat <- with(intlist, tapply(value, list(plant_sp, bird_sp), sum))
    mat[is.na(mat)] <- 0
    # Network-level function to calculate connectance
    networklevel(mat, index = "connectance")
  })
  # sample_iter and con_vec have the same length (length(size) * numRep),
  # so the columns line up row by row
  data.frame(method = m, site = s, subsize = sample_iter, con = con_vec)
}

df_List <- by(df, df[, c("site", "method")], FUN = function(d)
  xDegrees(d, size = 1:1000, numRep = 100))
final_df <- do.call(rbind, df_List)
Without installing bipartite and getting into the exact details of what the actual calculations are looking to do, there are a few things that jump out at me that might help you.
Growing the Object (connectMatrix)
It looks like you're in Circle 2 - Growing Objects of the R Inferno.
Specifically, the line where you've got:
#Stitch matrix together
connectMatrix[k,] <- c(site, method, subSize, con)
In that scenario, every time you add another row to connectMatrix, R has to copy the growing object into a new, larger block of memory, which, as you've observed, starts out okay but then starts to really drag. Here's another resource on that same issue, and a much less intricate Stack Overflow question.
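As a minimal sketch (assuming the total number of rows can be computed up front from size, numRep, and the number of site:method combinations), you could pre-allocate plain vectors of the final length, fill them by index inside the loops, and build the data frame once at the end:
# Sketch only: pre-allocate result vectors once, before the loops
totalRows <- length(unique(df$SiteMethod)) * numRep * length(size)
siteVec   <- character(totalRows)
methodVec <- character(totalRows)
sizeVec   <- numeric(totalRows)
conVec    <- numeric(totalRows)

# ...inside the innermost loop, replace connectMatrix[k,] <- c(...) with:
#   siteVec[k]   <- site
#   methodVec[k] <- method
#   sizeVec[k]   <- subSize
#   conVec[k]    <- con
#   k <- k + 1

# ...and after the loops, assemble the result once:
connectMatrix <- data.frame(Site = siteVec, Method = methodVec,
                            Size = sizeVec, connectance = conVec)
Filling plain vectors by index avoids both the repeated copying from growing connectMatrix and the per-row overhead of assigning into a data frame.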
Nested for loops
You've got three layers of nested for loops:
for (i in 1:length(unique(df$SiteMethod))) {
  # subsetting dataset
  ...
  for (l in 1:numRep) {
    # Beginning calculation loop
    for (j in 1:length(size)) {
      ...
    }
  }
}
The outside for loop jumps out at me as something the plyr package might be good for: essentially, you're splitting the main dataframe into subsets, then doing things to each one (the meat of your second and third for loops). plyr was designed for exactly this splitting-applying-and-combining, and there's a really great paper on how to implement it. You may be able to do two (or maybe even three) layers of splitting-applying-combining; see the sketch after the next paragraph.
If the data lend themselves to grouping operations, those tend to help, too, but it doesn't seem like it based on the comments in the code.
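Here is a rough sketch of that outer-level split-apply-combine (untested; repConnectance() is a hypothetical helper name I'm using for the work of your two inner loops on a single site/method subset):
library(plyr)
library(bipartite)

# Hypothetical helper: sampling and connectance for one site/method subset
repConnectance <- function(dfSub, size, numRep) {
  sizes <- rep(size, numRep)
  con <- sapply(sizes, function(n) {
    intlist <- dfSub[sample(nrow(dfSub), n, replace = TRUE), ]
    mat <- with(intlist, tapply(value, list(plant_sp, bird_sp), sum))
    mat[is.na(mat)] <- 0
    networklevel(mat, index = "connectance")
  })
  data.frame(Size = sizes, connectance = con)
}

# Split df by site and method, apply the helper, and combine into one data frame
result <- ddply(df, .(site, method), repConnectance, size = 1:1000, numRep = 100)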
Parallel processing
If, after you've optimized for not growing the object and chunked the code for splitting/combining (or another method of getting to the same outcome), you're still not at the level of performance you're looking for, you can journey into parallel application.
Basically, parallel application works well when you're using apply or plyr approaches and you can convert over to mcmapply or one of the other parallel approaches.
For example, we use mcmapply in production at work to take a 30-minute process and trim it down to about 6 minutes. Parallel processing tends to be worth it (in my personal experience) only once you've done everything else you can to speed things up without parallelizing: it tends to be more work to get right, and, as a bonus, doing the other optimizations first tends to make it easier to convert the process to run on multiple cores.
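For instance, here is a rough sketch of that conversion (assuming the hypothetical repConnectance() helper sketched above, and a Unix-like system, since mclapply() relies on forking and falls back to a single core on Windows); it uses mclapply() rather than mcmapply() because we only map over a single list of subsets:
library(parallel)

# One data frame per site:method combination
subsets <- split(df, list(df$site, df$method), drop = TRUE)

# Each subset runs on its own core; mc.cores sets the number of workers
results <- mclapply(subsets, function(d) {
  cbind(site = as.character(d$site[1]), method = as.character(d$method[1]),
        repConnectance(d, size = 1:1000, numRep = 100))
}, mc.cores = 4)

parallel_df <- do.call(rbind, results)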
Finally, Efficient R is a good general resource on optimizing R processes.
- upvoted for good links (C8H10N4O2, Dec 14, 2017 at 4:32)