Overview
I have 165,000 *.rds files of around 200,000 obs x 13 variables each. Each file corresponds to a unique grid (2.5 square mile data), so I need to keep the files separated by grid number. Currently, I'm loading each individual file, performing some function on the data, and resaving it to another directory; however, this took up to a week to process, and I'm trying to improve the code to run in parallel. I recently started 10 individual RStudio sessions and divided the grids across the 10 sessions to run in parallel. That worked fine and only took a day to run through, but it is not very efficient, and I would like to get it running in parallel using only one R session.
The following code is a simple example I made that closely resembles the basic function of the loop. This is not the exact code as it would not be useful to post all the files and all the code.
Question
Is this the proper way to run in parallel when loading individual files, applying a function to the data, and re-saving to a directory? It seems to be working, but I don't know if it is the correct way to do this.
RDS Files:
Code
# Multicore
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Function
check <- function(x){
df <- readRDS(paste0("./", x))
df$var <- df$var*2
saveRDS(df, paste0("./temp/", x))
}
# Directory where *.rds files are
files <- list.files("./")
# for loop
for (i in unique(files)){
check(i)
}
# foreach loop
foreach(i = unique(files)) %dopar% check(i)
2 Answers
Parallelization is not necessarily implemented nicely in R. However, it is far better to use R's batch processing than to open 10 RStudio sessions, as you saw (it is less of a resource drain per task).
Cores, cores, where art thou?
The first thing I would do is find out how many cores you have access to. Within this script, 4 seem to be allocated. It is very important that the number of parallel workers is at most the total number of cores on your system, so that the parallel jobs can be appropriately scheduled.
# Find out how many cores exist
parallel::detectCores()
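A common convention (my own addition, not something from the original post) is to leave one core free so the machine stays responsive while the workers run:

library(parallel)
# Use all but one of the detected cores; the -1 is just a convention, not a requirement
n_workers <- max(1, detectCores() - 1)
cl <- makeCluster(n_workers)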
Speed
Speed-wise, I find the foreach iterator to consume more time than I really care for. The reason is that it has a lot of unnecessary overhead and combine operations that tend to dynamically grow objects. So, I normally just interface directly with the parLapply function from R's parallel package.
Code
I realize that this is just a condensed example; however, without seeing everything, we really cannot give very specific feedback.
With that being said, you will probably end up with an error running this code since i is not defined within check(); only x is in the function's scope.
Try to return some value to the foreach loop, e.g. the i value from check, or a "pass-i" / "fail-i" string, to figure out if a process fails.
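As a rough sketch of that idea (the tryCatch wrapper and the pass/fail strings are my own illustration, not the submitter's code):

check <- function(x){
  tryCatch({
    df <- readRDS(paste0("./", x))
    df$var <- df$var * 2
    saveRDS(df, paste0("./temp/", x))
    paste0("pass-", x)                 # value returned to the foreach result on success
  }, error = function(e) paste0("fail-", x))
}
# results is then a list of "pass-..." / "fail-..." strings you can inspect afterwards
results <- foreach(i = unique(files)) %dopar% check(i)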
Another bit to note is that the script is currently set to output to the input directory. Try to create dedicated directories for input and output. The reason for this is that if you were to re-run your script, the processed files would be included within the next run (unless the pattern specified by list.files is more specific to the file nomenclature that you have).
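For example (a sketch only; the ./input and ./output directory names are placeholders I made up):

in_dir  <- "./input"
out_dir <- "./output"
# Only pick up .rds files, and only from the dedicated input directory
files <- list.files(in_dir, pattern = "\\.rds$")
check <- function(x){
  df <- readRDS(file.path(in_dir, x))
  df$var <- df$var * 2
  saveRDS(df, file.path(out_dir, x))
}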
Also, it seems like you are using spatial data. Is this a raster image contained within the .rds? If so, could you get away with using a raster stack instead of loading the whole raster into memory?
Aside
If you are interested in learning more about how to parallelize with R, I would recommend looking over this slide deck. (Disclaimer: I wrote it.)
Edit
Per the submitter's comment that using lapply is not possible: please note that a for loop in R is, in this case, the same as using an lapply. There are some benefits (speed being one) that typically make lapply better than a for loop. Furthermore, the foreach here is really a cast over the parXapply statements.
library(parallel)
cl <- makeCluster(detectCores())
# Function
check <- function(x){
df <- readRDS(paste0("./", x))
df$var <- df$var*2
saveRDS(df, paste0("./temp/", x))
}
# Directory where *.rds files are
files <- list.files("./")
# Obtain unique files
i = unique(files)
# lapply statement
lapply(i, FUN=check)
# parLapply statement (parallel version of the lapply above)
parLapply(cl, X=i, fun=check)
# Release the workers when finished
stopCluster(cl)
Edit 2
This edit is meant to show how to export functions or variables, per the submitter's issue. Since the submitter has not made available the functions or variables needed within the parallelization, this is a generic example.
library(parallel)
cl <- makeCluster(detectCores())
# Load packages on cluster's R sessions
clusterEvalQ(cl, library(pkgname))  # pkgname is a placeholder for whatever package the workers need
# Export functions or variables to cluster's R session
func <- function(a){
out <- a*a
return(out)
}
v = 1:10
clusterExport(cl,c("func","v"))
stopCluster(cl)
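To close the loop on this generic example, here is a sketch of my own showing how the exported func and v would actually be used on the workers; this call would go before the stopCluster(cl) line above:

# func was shipped to the workers by clusterExport() above;
# v is passed in here as the X argument, so each worker evaluates func on its share of v
res <- parSapply(cl, v, func)
res  # 1 4 9 16 25 36 49 64 81 100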
- Thanks for this answer. I've corrected the x mistake in check, and the output directory is now ./temp/. This is a mistake in this code and not my original; sorry. These are not raster images, but data.frames. And finally, I am not able to use an apply function because of the way in which the for loop is set up, which is why I'm choosing foreach. Also, thanks for the slide deck; I'll give it a read. – Amstell, Jan 4, 2016 at 17:48
- The for loop is synonymous with the lapply. That is, you can typically write 1:n or c("a","b","c") as element increments of both. Furthermore, the foreach here is calling one of the parXapply functions behind the scenes to generate results. (See the doParallel and foreach package source.) – coatless, Jan 4, 2016 at 19:45
- This is very helpful and certainly what I am looking for. I tried your method on my original data and ran into problems. Can you not use outside functions within parLapply? It's telling me there is a missing function, but it's defined correctly. – Amstell, Jan 4, 2016 at 20:12
- You can; I've updated the post above. The functions just must be exported to, or run on, the cluster before using parXapply. You should have run into this issue with foreach() as well (e.g. .export = "base"). If that is it, please feel free to accept the answer. – coatless, Jan 4, 2016 at 21:24
Well, this does not use doParallel as in the question, but I think it will be a productive approach.
I would create the following files:
myfun.R - just does the transformation you want.
args <- commandArgs(trailingOnly = TRUE)
df <- readRDS(args[[1]])
df$var <- df$var*2
saveRDS(df, file=args[[2]])
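Run by hand, a single conversion would look like this (grid_00001.rds is just a made-up file name for illustration):

Rscript myfun.R data/grid_00001.rds tmp/grid_00001.rds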
Makefile - defines the relationships
all: $(patsubst data/%.rds, tmp/%.rds, $(wildcard data/*.rds))
tmp/%.rds: data/%.rds myfun.R
	Rscript myfun.R $< $@
The first rule, all, describes what you need in total. The second tells make what needs to be done to convert a single file. It's handy to have the program as a dependency, because then make will be able to figure out which files to update if you change the program.
Then go to a shell in the directory (RStudio on Windows comes with its own) and type
make all -j4
This will figure out which files need conversion and use 4 processes to convert them.
Alternatively, you can set the RStudio project to use make; then you will have a build button which will do a make all, though I'm not sure whether you can set the -j4 option there.
- I like this way of doing it, and it seems more intuitive to what I need. Could you possibly break down the syntax, such as $(wildcard data/*.rds) and $< $@, or point me in the direction of some documentation to handle this? Also, is the makefile filename just Makefile? Is there an extension, or how does make know which file to make? – Amstell, Jan 4, 2016 at 18:44
- I just ran this and it works perfectly. I'll need to start using makefiles as part of my workflow. I wonder if it would be possible to add a progress bar? Cycling through 165,000 files, it would be nice to see where it is at. – Amstell, Jan 4, 2016 at 19:39
- @Amstell the main documentation is under gnu.org/software/make, but it's not hard to come by documentation for it. Basically, the wildcard command gets all the files, like in the shell. $< is replaced by the first dependency (in your case a filename in the data directory), and $@ is replaced by the target name (in your case the output file); both are sent to the R script as input. – bdecaf, Jan 4, 2016 at 21:16
- Thank you! I really like this answer and have already used it to process data using 20 processes on a large machine I have. It was done in a matter of minutes. I'll be incorporating this into my workflow. Thanks! – Amstell, Jan 4, 2016 at 21:17
- ...make (parallel option!) heavy workflow with lots of small R instances in parallel: you definitely don't need a full RStudio for each process.
- i was a typo; I've fixed it. And yes, I have saved to a temp directory, but I left it out here. Can you define a make? Not sure what you mean by this.