This is the parallelisation lab for RaukR. It will take you through some basic steps to parallelise your code and try to make you think about when and where you can use this tool.
You are highly encouraged to test things out yourself and tweak things to figure out how these methods behave.
The first thing we want to do is install the package required for the exercise.
install.packages("future")
The basic construct for a future looks like this:
a %<-% { expression(s) }
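As a small illustration of this construct, the sketch below creates a future with `%<-%` and then reads its value. The variable names `a` and `x` and the toy expression are only placeholders for this example.

```r
library(future)

# sequential is the default plan; it is set explicitly here for clarity.
plan(sequential)

# Create a future. The braces may hold several expressions,
# and the value of the last one becomes the value of `a`.
a %<-% {
  x <- 1:10
  sum(x)
}

# Asking for the value resolves the future (waiting for it if necessary).
print(a)
```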
Here is a computationally intensive task that samples numbers from 1:100, 200000000 times.
sample(100, 200000000, replace = TRUE)
Evaluating the computation time on my machine, it comes out taking about 5.4 seconds to run.
system.time({sample(100, 200000000, replace = TRUE)})
user system elapsed
5.173 0.160 5.369
Use the future package with plan(sequential), which is the default, and run the supplied sample() inside a future.
Due to a recent change in future, a warning is issued when random numbers are generated without a seed. We can ignore this by changing our options: options(future.rng.onMisuse="ignore").
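A minimal sketch of this setup could look as follows. The names `n` and `s` are assumptions made for the example, and `n` is deliberately much smaller than the 200000000 draws used in the lab so that the sketch runs quickly.

```r
library(future)

# Silence the warning about generating random numbers without a seed.
options(future.rng.onMisuse = "ignore")

# sequential is the default plan; set explicitly for clarity.
plan(sequential)

# Hypothetical, reduced problem size for a quick test run.
n <- 1e6

# Run the sampling inside a future.
s %<-% { sample(100, n, replace = TRUE) }

# Requesting the value resolves the future.
length(s)
```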
Use an approach from yesterday's lecture on benchmarking, or some other method you are comfortable with, to measure the time it takes simply to assign the future. Do not evaluate the future yet by asking for the outcome value.
Note: Do not attempt to measure time from within the future. Always wrap the timing code around the futures.
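One possible way to wrap the timing around a future is sketched below with base R's `system.time()`. The names `n`, `s`, `s2`, `t_assign` and `t_total` are assumptions for this example, and the reduced `n` is only there to keep the run short. How the two timings compare depends on the plan in use.

```r
library(future)
options(future.rng.onMisuse = "ignore")
plan(sequential)

# Hypothetical, reduced problem size.
n <- 1e6

# Time only the assignment of the future; the value is not requested here.
t_assign <- system.time({
  s %<-% { sample(100, n, replace = TRUE) }
})
print(t_assign)

# For comparison: time the assignment together with requesting the value.
t_total <- system.time({
  s2 %<-% { sample(100, n, replace = TRUE) }
  len <- length(s2)
})
print(t_total)
```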
Question 1: Split your sampling into multiple futures and compute the time again. Did it complete faster?
Question 2: Change to plan(multisession) or plan(multicore) according to your setup (operating system type, RStudio or just the console). Compute the time again for your multiple futures. Did it complete faster? Think about what the time it takes to compute implies.
Note: I was having some issues with plan(multisession) in RStudio. If this happens to you, you might want to start an R console from a terminal window instead.
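One way to choose a plan according to your setup is sketched below. It relies on `supportsMulticore()`, which future exports and which reports whether forked processing is considered supported in the current session. The choice of `workers = 2` is arbitrary and only for illustration.

```r
library(future)

# Forked (multicore) processing is not available on Windows and is
# disabled by default inside RStudio; fall back to multisession there.
if (supportsMulticore()) {
  plan(multicore, workers = 2)
} else {
  plan(multisession, workers = 2)
}

# Inspect what is available and what the current plan will use.
print(availableCores())
print(nbrOfWorkers())
n_workers <- nbrOfWorkers()

# Return to the default plan, which also shuts down background workers.
plan(sequential)
```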
Question 3: Ask for the outcome of your futures after their definitions, thus evaluating them. How does this influence the time it takes to perform the operations?
At this stage your code should, in pseudocode, look something like this:
plan(multisession)
timer(
a %<-% {sample expression}
b %<-% {sample expression}
#evaluate futures by requesting outcome values
a + b
)
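A runnable variant of this pseudocode might look like the sketch below. It is only one possible reading: `system.time()` stands in for `timer()`, the problem size `n` is reduced, and the two pieces are combined with `c()` so that the result is one long vector like the output of the original single `sample()` call. The names `n`, `half` and `res` are assumptions for this example.

```r
library(future)
options(future.rng.onMisuse = "ignore")
plan(multisession, workers = 2)

# Hypothetical, reduced problem size, split into two halves.
n <- 1e6
half <- n / 2

timing <- system.time({
  a %<-% { sample(100, half, replace = TRUE) }
  b %<-% { sample(100, half, replace = TRUE) }
  # Requesting the values waits until both futures are resolved.
  res <- c(a, b)
})
print(timing)
length(res)

# Return to the default plan and shut down the background workers.
plan(sequential)
```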
Question 4: If you have more than two availableCores(), split the sample() expression into even more futures. Does this influence the time to complete in the manner you thought?
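When the number of pieces depends on `availableCores()`, writing one `%<-%` line per piece gets unwieldy. The sketch below is an alternative that uses future's explicit `future()` and `value()` functions instead of the `%<-%` operator used elsewhere in this lab. The cap of four workers and the names `k`, `n`, `sizes`, `fs` and `res` are assumptions for this example.

```r
library(future)

# Use at most four workers, and no more than the machine offers.
k <- min(4L, as.integer(availableCores()))
plan(multisession, workers = k)

# Hypothetical, reduced problem size, divided into k chunk sizes.
n <- 1e6
sizes <- rep(n %/% k, k)
sizes[k] <- sizes[k] + n %% k  # the last chunk takes any remainder

timing <- system.time({
  # Create one future per chunk; seed = TRUE requests parallel-safe RNG.
  fs <- lapply(sizes, function(m) {
    future(sample(100, m, replace = TRUE), seed = TRUE)
  })
  # value() on a list of futures waits for all of them and returns a list.
  res <- unlist(value(fs))
})
print(timing)
length(res)

plan(sequential)
```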
Question 5: Is the error reported when the future is defined and still unresolved, or when it is resolved?
Question 6: What happens when you try to use that future later in your code?
Question 7: Can you perform other operations between defining your future and evaluating your future?
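If you want to experiment with these questions, a sketch like the following may be a useful starting point. It assumes a deliberately failing future named `bad` with a made-up error message, does some unrelated work after defining it, and then uses `tryCatch()` to capture whatever is signalled when the value is requested. The names `bad`, `other` and `msg` are hypothetical.

```r
library(future)
plan(multisession, workers = 2)

# Define a future whose expression fails (hypothetical error message).
bad %<-% { stop("boom") }

# Some unrelated work between defining the future and using it.
other <- sum(1:10)
print(other)

# Request the value and capture any error signalled at that point.
msg <- tryCatch(
  { bad; "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)

plan(sequential)
```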
To use futures in for loops we can use named indices to assign the futures to environments. This is pretty similar to assigning values to named indices with the normal assignment operator <-, the main differences being that we need to use new environments and that we can have multiple expressions for futures.
For example:
plan(multisession)
#Create a new environment
v <- new.env()
for (name in c("a", "b", "c")) {
v[[name]] %<-% {
#expression(s)
}
}
#Turn the environment back into a list
v <- as.list(v)
#To turn the list of vectors into the same format, one long vector, that we had above when running "a + b + c"
vec <- Reduce(c,v)
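Filled in with a concrete expression, the loop above could look like the sketch below. The problem size `n`, the piece names and the per-piece size `m` are assumptions chosen so that the example runs quickly.

```r
library(future)
options(future.rng.onMisuse = "ignore")
plan(multisession, workers = 2)

# Hypothetical, reduced problem size split into three equal pieces.
n <- 9e5
pieces <- c("a", "b", "c")
m <- n / length(pieces)

# Assign one future per name into a new environment.
v <- new.env()
for (name in pieces) {
  v[[name]] %<-% { sample(100, m, replace = TRUE) }
}

# Turning the environment into a list resolves the futures.
v <- as.list(v)

# Concatenate the pieces back into one long vector.
vec <- Reduce(c, v)
length(vec)

plan(sequential)
```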
Use this construct to split the sample() operation into however many smaller pieces you want. Do remember to transform your output back into the object we started with before parallelising the execution.
Now you know the basics of using the future package. With this you have already come a long way in lowering the threshold for implementing parallel methods and for spotting where a parallel solution fits the next time you run into one!
Try to apply parallelisation to your own code in a different context than we have done here. For example dividing up a plot or a large dataset. The possibilities are endless.
Check out the future demo visualisation of sequential vs multicore/multisession.
Using futures on HPC clusters with future.batchtools.
Using futures for your lapply statements with future.apply.
The recently published paper on future.
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] future_1.26.1 fontawesome_0.2.2 captioner_2.2.3 bookdown_0.27
## [5] knitr_1.39
##
## loaded via a namespace (and not attached):
## [1] parallelly_1.32.0 rstudioapi_0.13 magrittr_2.0.3 R6_2.5.1
## [5] rlang_1.0.2 fastmap_1.1.0 stringr_1.4.0 globals_0.15.0
## [9] tools_4.1.2 parallel_4.1.2 xfun_0.31 cli_3.3.0
## [13] jquerylib_0.1.4 htmltools_0.5.2 yaml_2.3.5 digest_0.6.29
## [17] sass_0.4.1 codetools_0.2-18 evaluate_0.15 rmarkdown_2.14
## [21] stringi_1.7.6 compiler_4.1.2 bslib_0.3.1 jsonlite_1.8.0
## [25] listenv_0.8.0
Built on: 15-Jun-2022 at 11:11:24.
2022 • SciLifeLab • NBIS • RaukR