Parallelisation in R

RaukR 2022 • Advanced R for Bioinformatics

Sebastian DiLorenzo

NBIS, SciLifeLab

RaukR 2022 • 1/9

The future package has companion backend packages, for example future.batchtools, which provides access to cluster schedulers such as Slurm, TORQUE, SGE and LSF. Futures are evaluated in a local environment, so assignments made inside a future do not change variables in the calling environment, just as in a function.

The framework was recently described in a published article. The main strength of future is that it appears to support most infrastructures and is written so that the developer does not decide which infrastructure the user has, as code written for parallel or foreach did. Finish by mentioning that the author describes this in detail in the article for those who want to know more.
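The note above about local evaluation can be illustrated with a minimal sketch. It assumes the future package is installed; the variable names x and y are made up for illustration.

```r
library(future)
plan(sequential)

x <- 1

# The assignment to x inside the future happens in a local environment,
# so the x in the calling environment is left untouched.
y %<-% {
  x <- 100
  x + 1
}

y  # 101
x  # 1
```

The behaviour mirrors a function body: the last expression is returned, and local assignments are discarded.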

Parallelisation

  • Save time by running independent (atomic) tasks in parallel

  • Divide tasks or datasets into smaller pieces

  • Can become a bottleneck if tasks depend directly on each other
RaukR 2022 • 3/9
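As a rough sketch of the "divide into smaller pieces" idea, the example below splits a vector into two chunks and sums each chunk in its own future. The choice of two workers and the names chunks, futures and partial_sums are assumptions made for illustration, not something taken from the slides.

```r
library(future)
plan(multisession, workers = 2)  # assumption: two background R sessions

x <- 1:100

# Divide the data into two pieces of equal size.
chunks <- split(x, cut(seq_along(x), breaks = 2, labels = FALSE))

# One future per piece; the pieces do not depend on each other.
futures <- lapply(chunks, function(chunk) future(sum(chunk)))

# Collecting the values blocks until each future is resolved.
partial_sums <- vapply(futures, value, FUN.VALUE = numeric(1))
total <- sum(partial_sums)
total  # 5050

plan(sequential)  # return to the default strategy
```

The final sum is the only step that depends on the other tasks, so it is the only place where the code has to wait.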

R-package: future

Other packages decide the parallelisation method during development. With future, the code stays the same and the user decides the parallelisation method.

  • Very simple
  • Uniform code, no matter the strategy
  • User defined parallelisation
  • R process stays unblocked while futures are being resolved
  • Works well on multiple architectures



Published 2021-06-08 in "The R Journal": A Unifying Framework for Parallel and Distributed Processing in R using Futures



Building block: variable %<-% {expression(s)}

RaukR 2022 • 4/9

R package developers rarely know who the end-users are and what compute resources they have. With future, instead of programming for one architecture, the code should work on most architectures. We will return to the unblocked R process later; in short, even while several things are being computed in parallel, the R code can continue unblocked until the resolved values are needed.

Text from article: The state of a future can either be unresolved or resolved. As soon as it is resolved, the value is available instantaneously. If the value is queried while the future is still unresolved, the current process is blocked until the future is resolved.

R package developers rarely know who the end-users are and what compute resources they have. Regardless, developers who wish to support parallel processing still face the problem of deciding which parallel framework to target, a decision which often has to be done early in the development cycle. This means deciding on what type of parallelism to support. This decision is critical because it limits the end-user’s options and any change, later on, might be expensive because of, for instance, having to rewrite and retest part of the codebase. A developer who wishes to support multiple parallel backends has to implement support for each of them individually and provide the end-user with a mechanism to choose between them.
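The unresolved/resolved states described in the notes can be sketched with the explicit future(), resolved() and value() functions, next to the %<-% building block from the slide. This is an illustration under the assumption of two multisession workers; the names f, state_checked, result and v are invented for the example.

```r
library(future)
plan(multisession, workers = 2)  # assumption: two background R sessions

# Explicit form: create the future, then ask for its value later.
f <- future({
  Sys.sleep(1)
  42
})

# resolved() checks the state without blocking the R process.
state_checked <- resolved(f)

# value() blocks until the future is resolved, then returns the result.
result <- value(f)

# Implicit form: the same idea written with the %<-% building block.
v %<-% {
  Sys.sleep(1)
  42
}
v  # querying v blocks until the value is available

plan(sequential)  # return to the default strategy
```

Between creating a future and querying its value, the main R process is free to run other code.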

Plans

RaukR 2022 • 5/9

Synchronous: existing or occurring at the same time.

Asynchronous: not existing or occurring at the same time.

sequential: One after another. The default. Very useful when first developing the code.

multisession: All operating systems. Futures are evaluated in background R sessions. The number of sessions is decided by availableCores().

multicore: Operating systems that support forking of processes, that is, all except Windows. Forks the existing R process rather than creating new sessions. The maximum number of forks is decided by availableCores().

cluster: Cluster environments, such as HPC. Uses the parallel package.
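The plans listed above are selected with plan(), while the code that creates futures stays unchanged. The sketch below assumes two multisession workers, and slow_square() is a hypothetical helper written only for this illustration.

```r
library(future)

# Hypothetical helper: the code never mentions a parallelisation strategy.
slow_square <- function(x) {
  r %<-% {
    Sys.sleep(0.5)
    x^2
  }
  r
}

plan(sequential)                  # one after another, the default
res_seq <- slow_square(4)

plan(multisession, workers = 2)   # background R sessions, all operating systems
res_par <- slow_square(4)

cores <- availableCores()         # cores available to this R process
forking_ok <- supportsMulticore() # FALSE where forking is not supported

plan(sequential)                  # return to the default strategy
```

Only the plan() calls differ between the two runs; slow_square() itself is the same in both.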

plan(sequential)


Building block: variable %<-% {expression(s)}

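library(future)   # provides plan() and %<-%
plan(sequential)

# Each future sleeps for 3 seconds; the last expression is its value.
a %<-% {
  Sys.sleep(3)
  a <- 1
}
b %<-% {
  Sys.sleep(3)
  b <- 2
}
a + b
## [1] 3
##    user  system elapsed
##   0.027   0.008   6.044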
RaukR 2022 • 6/9

In programming, a future is an abstraction for a value that may be available at some point in the future. The state of a future can either be unresolved or resolved. As soon as it is resolved, the value is available instantaneously. If the value is queried while the future is still unresolved, the current process is blocked until the future is resolved. It is possible to check whether a future is resolved or not without blocking.
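The slide shows timings but not the code that produced them. One way such timings could be obtained is to wrap the futures in system.time(); this is an assumption about the measurement, not the author's confirmed method, and the sleeps are shortened to one second here.

```r
library(future)
plan(sequential)

timing <- system.time({
  a %<-% {
    Sys.sleep(1)
    1
  }
  b %<-% {
    Sys.sleep(1)
    2
  }
  total <- a + b
})

total                # 3
timing[["elapsed"]]  # roughly the sum of both sleeps under plan(sequential)
```

Under plan(sequential) the two sleeps run one after another, so the elapsed time is about the sum of the two.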

plan(multisession) & plan(multicore)

plan(multicore)   # forked processes; not available on Windows
a %<-% {
  Sys.sleep(3)
  a <- 1
}
b %<-% {
  Sys.sleep(3)
  b <- 2
}
a + b
## [1] 3
##    user  system elapsed
##   0.043   0.057   3.062
availableCores()
## system
##     10
RaukR 2022 • 7/9

Note: To evaluate plan(multicore), the R Markdown document must be rendered from an R console in a terminal, because RStudio does not support multicore: rmarkdown::render("parallelisation_Sebastian.Rmd")
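Where forking is not available, for example on Windows or inside RStudio, plan(multisession) can be used with the same code. The sketch below assumes two workers and returns the process ID from each future to show that evaluation happens outside the main R process; the names main_pid and worker_pids are illustrative.

```r
library(future)
plan(multisession, workers = 2)  # assumption: two background R sessions

main_pid <- Sys.getpid()

a %<-% {
  Sys.sleep(1)
  Sys.getpid()
}
b %<-% {
  Sys.sleep(1)
  Sys.getpid()
}

worker_pids <- c(a, b)
worker_pids != main_pid  # both TRUE: evaluated in background sessions

plan(sequential)  # return to the default strategy
```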

plan(multisession) & plan(multicore)

plan(multicore)
a %<-% {
  Sys.sleep(3)
  a <- 1
}
b %<-% {
  Sys.sleep(3)
  b <- 2
}
c %<-% {
  Sys.sleep(3)
  c <- 3
}
...
}
k %<-% {
  Sys.sleep(3)
  e <- 5
}
a + b + c + d + e + f + g + h + j + k
## [1] 60
##    user  system elapsed
##   0.298   0.365   6.208
RaukR 2022 • 8/9
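When there are more futures than workers, creating a new future waits until a worker is free. The sketch below illustrates this general behaviour with an assumed two workers and four one-second futures; it is not a claim about what caused the timing on the slide.

```r
library(future)
plan(multisession, workers = 2)  # assumption: only two workers
workers <- nbrOfWorkers()

timing <- system.time({
  # Four futures of one second each; only two can run at a time,
  # so creating the third blocks until a worker is free.
  futures <- lapply(1:4, function(i) future({
    Sys.sleep(1)
    i
  }))
  values <- vapply(futures, value, FUN.VALUE = numeric(1))
})

sum(values)          # 10
timing[["elapsed"]]  # at least about two seconds with two workers

plan(sequential)  # return to the default strategy
```

Storing futures in a list also avoids writing one named variable per task.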

plan(cluster)

  • To some degree a wrapper around parallel::makeCluster()
  • For example:
    • 3 connected nodes (computers) named n1:n3
    • Each with 16 CPUs
plan(cluster, workers = c("n1", "n2", "n3"))

A specialised R package for interfacing with common HPC job schedulers exists: future.batchtools

RaukR 2022 • 9/9

Work in progress. Say you have access to three nodes, n1:n3. This will then create a set of copies of R running in parallel and communicating over sockets.

I have not tried this yet. Needing to work in parallel with, for example, 8 or 16 cores on the HPC is one thing; needing, for example, 3 whole nodes is a different use case.
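Without access to real nodes, plan(cluster) can still be tried locally. The sketch below assumes that two local worker processes created with parallel::makeCluster() stand in for the remote nodes; the hostnames n1:n3 from the slide appear only in comments and are not contacted.

```r
library(future)

# Assumption: two local worker processes stand in for remote nodes.
cl <- parallel::makeCluster(2)
plan(cluster, workers = cl)

pid %<-% Sys.getpid()
ran_on_worker <- pid != Sys.getpid()
ran_on_worker  # TRUE: the future was evaluated in a worker process

# On real nodes the hostnames could be passed instead, for example
# plan(cluster, workers = c("n1", "n2", "n3")) as on the slide, or
# plan(cluster, workers = rep(c("n1", "n2", "n3"), each = 16)) to ask
# for 16 workers per node (not run here).

plan(sequential)           # return to the default strategy
parallel::stopCluster(cl)  # shut down the local workers
```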

Thank you. Questions?

R version 4.1.2 (2021-11-01)

Platform: x86_64-apple-darwin17.0 (64-bit)

OS: macOS Big Sur 10.16


Built on : 15-Jun-2022 at 10:14:15

2022 • SciLifeLab • NBIS • RaukR

RaukR 2022 • 9/9
