Overview
I have 165,000 *.rds files of around 200,000 obs x 13 variables each. Each file corresponds to a unique grid (2.5 square mile data), so I need to keep the files separated by grid number. Currently, I'm loading each individual file, performing some function on the data, and resaving it to another directory; however, this took up to a week to process, and I'm trying to improve the code to run in parallel. I recently started 10 individual RStudio sessions and divided the grids across the 10 sessions to run in parallel. That worked fine and only took a day to run through, but it is not very efficient, and I would like to get it running in parallel using only one R session.
The following code is a simple example I made that closely resembles the basic function of the loop. This is not the exact code as it would not be useful to post all the files and all the code.
Question
Is this the proper way to run in parallel when loading individual files, applying a function to the data, and re-saving to a directory? It seems to be working, but I don't know if it is the correct way to do this.
RDS Files:
Code
# Multicore
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Function
check <- function(x){
df <- readRDS(paste0("./", x))
df$var <- df$var*2
saveRDS(df, paste0("./temp/", x))
}
# Directory where *.rds files are
files <- list.files("./")
# for loop
for (i in unique(files)){
check(i)
}
# foreach loop
foreach(i = unique(files)) %dopar% check(i)
2 Answers
Parallelization is not necessarily implemented nicely in R. However, it is far better to use R's batch processing than to open 10 RStudio sessions, as you saw (it is less of a resource drain per task).
Cores, cores, where art thou?
The first thing I would do is find out how many cores you have access to. Within this script, 4 seem to be allocated. It is very important that the number of parallel workers is at most the total number of cores on your system, so that the parallel jobs can be appropriately scheduled.
# Find out how many cores exist
parallel::detectCores()
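A common convention (my own addition, not something from the original post) is to leave one core free so the machine stays responsive while the workers run:

library(parallel)
# Use all but one of the detected cores; the -1 is just a convention, not a requirement
n_workers <- max(1, detectCores() - 1)
cl <- makeCluster(n_workers)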
Speed
Speed-wise, I find the foreach iterator to consume more time than I really care for. The reason is that it has a lot of unnecessary overhead and combine operations that tend to dynamically grow objects. So, I normally just interface directly with the parLapply function from R's parallel package.
Code
I realize that this is just a condensed example; however, without seeing everything, we really cannot give very specific feedback.
With that being said, you will probably end up with an error running this code since i is not defined within check(); only x is in the function's scope.
Try to return some value to the foreach loop, e.g. the i value from check, or a "pass-i" / "fail-i" string, to figure out if a process fails.
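As a rough sketch of that idea (the tryCatch wrapper and the pass/fail strings are my own illustration, not the submitter's code):

check <- function(x){
  tryCatch({
    df <- readRDS(paste0("./", x))
    df$var <- df$var * 2
    saveRDS(df, paste0("./temp/", x))
    paste0("pass-", x)                 # value returned to the foreach result on success
  }, error = function(e) paste0("fail-", x))
}
# results is then a list of "pass-..." / "fail-..." strings you can inspect afterwards
results <- foreach(i = unique(files)) %dopar% check(i)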
Another bit to note is that the script is currently set to output to the input directory. Try to create dedicated directories for input and output. The reason for this is that if you were to re-run your script, the processed files would be included within the next run (unless the pattern specified by list.files is more specific to the file nomenclature that you have).
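For example (a sketch only; the ./input and ./output directory names are placeholders I made up):

in_dir  <- "./input"
out_dir <- "./output"
# Only pick up .rds files, and only from the dedicated input directory
files <- list.files(in_dir, pattern = "\\.rds$")
check <- function(x){
  df <- readRDS(file.path(in_dir, x))
  df$var <- df$var * 2
  saveRDS(df, file.path(out_dir, x))
}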
Also, it seems like you are using spatial data. Is this a raster image contained within the .rds? If so, could you get away with using a raster stack instead of loading the whole raster into memory?
Aside
If you are interested in learning more about how to parallelize with R, I would recommend looking over this slide deck. (Disclaimer: I wrote it.)
Edit
Per the submitter's comment that using lapply is not possible: please note that a for loop in R is, in this case, the same as using an lapply. There are some benefits (speed being one) that typically make lapply better than a for loop. Furthermore, the foreach here is really a cast over the parXapply statements.
library(parallel)
cl <- makeCluster(detectCores())
# Function
check <- function(x){
df <- readRDS(paste0("./", x))
df$var <- df$var*2
saveRDS(df, paste0("./temp/", x))
}
# Directory where *.rds files are
files <- list.files("./")
# Obtain unique files
i = unique(files)
# lapply statement
lapply(i, FUN=check)
# parLapply statement (parallel version of the lapply above)
parLapply(cl, X=i, fun=check)
# Release the workers when finished
stopCluster(cl)
Edit 2
This edit is meant to show how to export functions or variables, per the submitter's issue. Since the submitter has not made available the functions or variables needed within the parallelization, this is a generic example.
library(parallel)
cl <- makeCluster(detectCores())
# Load packages on cluster's R sessions
clusterEvalQ(cl, library(pkgname))  # pkgname is a placeholder for whatever package the workers need
# Export functions or variables to cluster's R session
func <- function(a){
out <- a*a
return(out)
}
v = 1:10
clusterExport(cl,c("func","v"))
stopCluster(cl)
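To close the loop on this generic example, here is a sketch of my own showing how the exported func and v would actually be used on the workers; this call would go before the stopCluster(cl) line above:

# func was shipped to the workers by clusterExport() above;
# v is passed in here as the X argument, so each worker evaluates func on its share of v
res <- parSapply(cl, v, func)
res  # 1 4 9 16 25 36 49 64 81 100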
- Thanks for this answer. I've corrected the x mistake in check, and the output directory is now ./temp/. This is a mistake in this code and not my original; sorry. These are not raster images, but data.frames. And finally, I am not able to use an apply function because of the way in which the for loop is set up, which is why I'm choosing foreach. Also, thanks for the slide deck; I'll give it a read. – Amstell, Jan 4, 2016 at 17:48
- The for loop is synonymous with the lapply. That is, you can typically write 1:n or c("a","b","c") as element increments of both. Furthermore, the foreach here is calling one of the parXapply functions behind the scenes to generate results. (See the doParallel and foreach package source.) – coatless, Jan 4, 2016 at 19:45
- This is very helpful and certainly what I am looking for. I tried your method on my original data and ran into problems. Can you not use outside functions within parLapply? It's telling me there is a missing function, but it's defined correctly. – Amstell, Jan 4, 2016 at 20:12
- You can; I've updated the post above. The functions just must be exported to, or run on, the cluster before using parXapply. You should have run into this issue with foreach() as well (e.g. .export = "base"). If that is it, please feel free to accept the answer. – coatless, Jan 4, 2016 at 21:24
Well, this does not use doParallel as in the question, but I think it will be a productive approach.
I would create the following files:
myfun.R - just does the transformation you want.
args <- commandArgs(trailingOnly = TRUE)
df <- readRDS(args[[1]])
df$var <- df$var*2
saveRDS(df, file=args[[2]])
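Run by hand, a single conversion would look like this (grid_00001.rds is just a made-up file name for illustration):

Rscript myfun.R data/grid_00001.rds tmp/grid_00001.rds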
Makefile - defines the relationships
all: $(patsubst data/%.rds, tmp/%.rds, $(wildcard data/*.rds))
tmp/%.rds: data/%.rds myfun.R
	Rscript myfun.R $< $@
The first rule, all, describes what you need in total. The second tells make what needs to be done to convert a single file. It's handy to have the program as a dependency, because then make will be able to figure out which files to update if you change the program.
Then go to a shell in the directory (RStudio on Windows comes with its own) and type
make all -j4
This will figure out which files need conversion and use 4 processes to convert them.
Alternatively, you can set the RStudio project to use make; then you will have a build button which will do a make all, though I'm not sure whether you can set the -j4 option there.
- I like this way of doing it, and it seems more intuitive to what I need. Could you possibly break down the syntax, such as $(wildcard data/*.rds) and $< $@, or point me in the direction of some documentation to handle this? Also, is the makefile filename just Makefile? Is there an extension, or how does make know which file to make? – Amstell, Jan 4, 2016 at 18:44
- I just ran this and it works perfectly. I'll need to start using makefiles as part of my workflow. I wonder if it would be possible to add a progress bar? Cycling through 165,000 files, it would be nice to see where it is at. – Amstell, Jan 4, 2016 at 19:39
- @Amstell the main documentation is under gnu.org/software/make, but it's not hard to come by documentation for it. Basically, the wildcard command gets all the files, like in the shell. $< is replaced by the first dependency (in your case a filename in the data directory), and $@ is replaced by the target name (in your case the output file); both are sent to the R script as input. – bdecaf, Jan 4, 2016 at 21:16
- Thank you! I really like this answer and have already used it to process data using 20 processes on a large machine I have. It was done in a matter of minutes. I'll be incorporating this into my workflow. Thanks! – Amstell, Jan 4, 2016 at 21:17
- ...make (parallel option!) heavy workflow with lots of small R instances in parallel: you definitely don't need a full RStudio for each process.
- i was a typo; I've fixed it. And yes, I have saved to a temp directory, but I left it out here. Can you define a make? Not sure what you mean by this.