<%@meta language="R-vignette" content="--------------------------------
%\VignetteIndexEntry{A Future for R: Common Issues with Solutions}
%\VignetteAuthor{Henrik Bengtsson}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{future}
%\VignetteKeyword{promise}
%\VignetteEngine{R.rsp::rsp}
%\VignetteTangle{FALSE}
--------------------------------------------------------------------"%>
# <%@meta name="title"%>

In the ideal case, all it takes to start using futures in R is to replace select standard assignments (`<-`) in your R code with future assignments (`%<-%`) and make sure the right-hand side (RHS) expressions are within curly brackets (`{ ... }`). Also, if you assign these to lists (e.g. in a for loop), you need to use a list environment (`listenv`) instead of a plain list.

However, as shown below, there are a few cases where you might run into some hurdles, but, as also shown, they are often easy to overcome. These are often related to global variables.

_If you identify other cases, please consider [reporting](https://github.com/futureverse/future/issues/) them so they can be documented here and possibly even be fixed._


## Issues with globals and packages

### Missing globals (false negatives)

If a global variable is used in a future expression that _conditionally_ overrides this global variable with a local one, the future framework fails to identify the global variable and therefore fails to export it, resulting in a run-time error. For example, although this works:

```r
plan(multisession)
reset <- TRUE
x <- 1
y %<-% { if (reset) x <- 0; x + 1 }
y
## [1] 1
```

the following does _not_ work:

```r
reset <- FALSE
x <- 1
y %<-% { if (reset) x <- 0; x + 1 }
y
## Error: object 'x' not found
```

It is recommended to avoid the above constructs, where it is ambiguous whether a variable is global or local. To force variable `x` to always be global, insert it at the very beginning of the future expression, e.g.
```r
reset <- FALSE
x <- 1
y %<-% { x; if (reset) x <- 0; x + 1 }
y
## [1] 2
```

_Comment:_ The goal is for a future version of the package to detect globals also in expressions where the local-global state of a variable is only known at run time.


#### do.call() - function not found

When calling a function using `do.call()`, make sure to specify the function as the object itself and not by name. This will help identify the function as a global object in the future expression. For instance, use

```r
do.call(file_ext, list("foo.txt"))
```

instead of

```r
do.call("file_ext", list("foo.txt"))
```

so that `file_ext()` is properly located and exported. Although you may not notice a difference when evaluating futures in the same R session, it may become a problem if you use a character string instead of a function object when futures are evaluated in external R sessions, such as on a cluster. It may also become a problem with futures evaluated with lazy evaluation if the intended function is redefined after the future is created but before it is resolved. For example,

```r
> library(future)
> library(listenv)
> library(tools)
> plan(sequential)
> pathnames <- c("foo.txt", "bar.png", "yoo.md")
> res <- listenv()
> for (ii in seq_along(pathnames)) {
+   res[[ii]] %<-% do.call("file_ext", list(pathnames[ii])) %lazy% TRUE
+ }
> file_ext <- function(...) "haha!"
> unlist(res)
[1] "haha!" "haha!" "haha!"
```


#### get() - object not found

The base R function `get()` can be used to get the value of an object by its name. For example, `get("pi", envir = baseenv())` will get the value of object `pi` in the 'base' environment, i.e. it corresponds to `base::pi`. If no other objects named `pi` exist on the search path, we could have used `get("pi")` and `pi`, respectively.
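As a quick check of the lookups just described, the following minimal sketch uses only base R. It assumes that no other object named `pi` masks the one in the base package; the environment `e` is a hypothetical example, introduced only to show that the result of `get()` depends on where the name is looked up.

```r
## Look up 'pi' by name in the base environment
from_base <- get("pi", envir = baseenv())
stopifnot(identical(from_base, base::pi))

## Assumption: nothing on the search path masks base::pi
from_search <- get("pi")
stopifnot(identical(from_search, pi))

## Hypothetical environment 'e' holding its own object named 'pi'
e <- new.env()
assign("pi", 3, envir = e)
from_e <- get("pi", envir = e)
from_e
## [1] 3
```

Because the name is passed as an ordinary character string, nothing in the code itself reveals which object will be read; that is only settled when `get()` runs.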
It is not unusual to see code snippets such as:

```r
> a <- 1:3
> b <- 4:6
> c <- 3:5
> my_sum <- function(var) { sum(get(var)) }
> y <- my_sum("a")
> y
[1] 6
```

If we attempt to call `my_sum()` via a future, we will get an error (if the future is resolved in an external R process):

```r
> library(future)
> plan(multisession)
> f <- future(my_sum("a"))
> y <- value(f)
Error in get(var) : object 'a' not found
```

This is because the static code inspection done on the future expression `my_sum("a")` does not reveal object `a` as a global object. In that expression alone, there are only three objects: the function `my_sum()`, the primitive function `(`, and the string `"a"`, and none of those are object `a`. The future framework will also scan these three objects for globals, which in this example means that it also scans `my_sum()`. This recursive search for globals will identify three additional globals, namely, the primitive function `{`, the function `sum()`, and the function `get()`, but, as before, none of these sources will identify `a` as a global object.

In order for `a` to be identified, the future framework would need to have a built-in understanding of how `get(var)` works, which would be a daunting task, especially if it needs to know how it acts for different data types of `var` and various choices of arguments `envir` and `inherits`. In fact, this can often not be inferred until run time, that is, it is not possible to identify what objects are needed without actually running the code. In short, it is not possible to automatically identify global variables specified via a character string.

The workaround is to tell the future framework what _additional_ globals are needed. This can be done via argument `globals` using:

```r
> f <- future(my_sum("a"), globals = structure(TRUE, add = "a"))
> y <- value(f)
> y
[1] 6
```

or by injecting variable `a` at the beginning of the future expression, e.g.
```r
> f <- future({ a; my_sum("a") })
> y <- value(f)
> y
[1] 6
```

Note that, independently of the future framework, it is often a bad idea to use `get()`, and the related functions `mget()` and `assign()`, in R code. Searching the archives of R forums, such as the R-help and R-devel mailing lists, will reveal numerous suggestions against using them. A good rule of thumb is:

> If you find yourself using `get()` in your code, take a step back, and reconsider your implementation. There is most likely a better solution available.

For example, consider this slightly more complex example:

```r
> a <- 1:3
> b <- 4:7
> c <- 3:5
> my_sum <- function(var) { sum(get(var)) }
> y <- sapply(c("a", "b", "c"), FUN = my_sum)
> y
 a  b  c
 6 22 12
```

Instead of using "free roaming" objects `a`, `b`, and `c`, it's better to put those values in a list (or in a data frame, if they are all of the same length):

```r
> data <- list(a = 1:3, b = 4:7, c = 3:5)
```

This will in turn allow us to perform the same calculations without having to use `get()`:

```r
> my_sum <- function(x) { sum(x) }
> y <- sapply(data, FUN = my_sum)
> y
 a  b  c
 6 22 12
```


#### glue::glue() - object not found

The future framework will fail to identify globals that are declared via character strings. The above section gives an example of this where `get()` is used and explains why it is not feasible to automatically identify string-embedded globals from such code. Another example is when using `glue()` from the **[glue]** package to generate strings dynamically, e.g.

```r
> library(glue)
> a <- 42
> s <- glue("The value of a is {a}.")
> s
The value of a is 42.
```

Attempting to perform the same via a future that is resolved in another R session will produce an "object not found" error:

```r
> library(glue)
> library(future)
> plan(multisession)
> a <- 42
> s %<-% glue("The value of a is {a}.")
> s
Error in eval(parse(text = text, keep.source = FALSE), envir) :
  object 'a' not found
```

As explained in the previous section, the workaround is to specify what additional global variables there are, which can be done as:

```r
> s %<-% glue("The value of a is {a}.") %globals% structure(TRUE, add = "a")
> s
The value of a is 42.
```

An alternative solution is to guide the future framework by adding the missing globals as "dummy" variables, e.g.

```r
> s %<-% { a; glue("The value of a is {a}.") }
> s
The value of a is 42.
```


### Missing packages (false negatives)

Occasionally, the static-code inspection of the future expression fails to identify packages needed to evaluate the expression. This may occur when an expression uses an S3 generic function that is part of one package whereas the required S3 method is in another package. For example, in the future below, the generic function `[` is used on the data.table object `DT`, which requires the S3 method `[.data.table` from the **[data.table]** package. However, the **future** and **[globals]** packages fail to identify **data.table** as a required package, which results in an evaluation error:

```r
> library(future)
> plan(multisession)
> library(data.table)
> DT <- data.table(a = LETTERS[1:3], b = 1:3)
> y %<-% DT[, sum(b)]
> y
Error: object 'b' not found
```

The above error occurs because, unlike the main R process, the R worker that evaluated the future expression does not have **data.table** loaded. Instead, the evaluation falls back to the `[.data.frame` method, which is not what we want.
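The fallback described above can be reproduced without any parallel backend. The following is a minimal base-R sketch, not how **data.table** implements `[`: the class `my_table` and its method `[.my_table` are hypothetical stand-ins, and removing the method imitates a worker on which the package providing it is not loaded. It assumes that no object named `b` exists in the calling environment.

```r
## Hypothetical class 'my_table' layered on top of data.frame
DT <- structure(
  data.frame(a = LETTERS[1:3], b = 1:3),
  class = c("my_table", "data.frame")
)

## Hypothetical S3 method that evaluates 'j' using the columns of 'x'
"[.my_table" <- function(x, i, j, ...) {
  eval(substitute(j), envir = as.list(x), enclos = parent.frame())
}

## With the method available, column 'b' is found inside the table
with_method <- DT[, sum(b)]
with_method
## [1] 6

## Remove the method; dispatch now falls back to `[.data.frame`,
## which evaluates sum(b) in the calling environment and fails
rm("[.my_table")
fallback <- tryCatch(DT[, sum(b)], error = function(e) e)
inherits(fallback, "error")
## [1] TRUE
```

The object itself is identical in both calls; only the set of methods available in the session differs, which is why the problem shows up on the worker and not in the main session.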
Until the future framework manages to identify **data.table** as a required package (which is the goal), we can guide the future framework by specifying additional packages needed:

```r
> y %<-% DT[, sum(b)] %packages% "data.table"
> y
[1] 6
```

or equivalently

```r
> f <- future(DT[, sum(b)], packages = "data.table")
> value(f)
[1] 6
```

Note, do _not_ use `library()` or `loadNamespace()` to resolve these problems. It is always better to use the above `packages` approach.


## '...' used in an incorrect context

In R, the `...` construct is used to refer to zero or more arguments. For example, we can use it as in:

```r
my_mean <- function(x, ...) mean(x, ...)
y <- my_mean(1:10, trim = 0.1, na.rm = FALSE)
```

This makes sure that the `trim` and the `na.rm` arguments are passed down to the `mean()` function. We can also use it to pass arguments in map-reduce calls to anonymous functions as in:

```r
X <- rnorm(10)
y <- lapply(X, FUN = function(x, ...) {
  round(x, ...)
}, digits = 3)
```

Note how `digits = 3` is passed to the anonymous function via its `...` argument, which is then passed on to `round()`, effectively calling `round(X[[1]], digits = 3)`, `round(X[[2]], digits = 3)`, and so on.

If we take this one step further, we might see things like:

```r
my_fcn <- function(X, ...) {                 ## outer '...'
  y <- lapply(X, FUN = function(x, ...) {    ## inner '...'
    round(x, ...)                            ## inner '...'
  }, ...)                                    ## outer '...'
  y
}

X <- rnorm(10)
y <- my_fcn(X, digits = 3)
```

In this case, we have two levels of `...` arguments: one for `my_fcn()` and one for the anonymous function. Note how the `...` arguments for `my_fcn()` are passed down to the anonymous function by specifying `...` as the final argument to `lapply()`.

The above is the ideal and proper way to pass down `...`. However, it is not uncommon to see `...` used as a global variable in anonymous functions. For example, you might find:

```r
my_fcn <- function(X, ...) {                 ## outer '...'
  y <- lapply(X, FUN = function(x) {
    round(x, ...)
    ## outer '...' as global variables
  })
  y
}
```

This will also work, because `...` becomes a global variable in the environment of the anonymous function. Although we know that relying on global variables is a bad idea, this one often slips through.

If we attempt to do the same with the future framework, or other parallel frameworks, it might not work. For example, using:

```r
my_fcn <- function(X, ...) {
  y <- future_lapply(X, FUN = function(x) {
    round(x, ...)
  })
  y
}
```

might result in the error:

```
Error: '...' used in an incorrect context
```

Even if you do not get this error, it is always a good idea to make sure `...` is passed as an argument all the way down to where it is used, e.g.

```r
my_fcn <- function(X, ...) {
  y <- future.apply::future_lapply(X, FUN = function(x, ...) {
    round(x, ...)
  }, ...)
  y
}
```


## Non-exportable objects

Certain types of objects are tied to a given R session and cannot be passed along to another R process (a "worker"). An example of a non-exportable object is an XML object of the **[xml2]** package. If we attempt to use those in parallel processing, we may get an error when the future is evaluated (or just invalid results, depending on how they are used), e.g.

```r
> library(future)
> plan(multisession)
> library(xml2)
> xml <- read_xml("