class: center, middle, inverse, title-slide

# Parallelisation in R
## RaukR 2021 • Advanced R for Bioinformatics
### Sebastian DiLorenzo
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: overview

## Overview

.pull-left-50[
* [Parallelisation](#par)
* [future](#fut)
* [plans](#pla)
  + [sequential](#seq)
  + [multisession/multicore](#mul)
  + [cluster](#clu)
]

.pull-right-50[
<img src="https://raw.githubusercontent.com/HenrikBengtsson/future/develop/man/figures/logo.png" width="144" height="168">

.vsmall[https://github.com/HenrikBengtsson/future]
]

???

The future package has companion backend packages, for example future.batchtools, which provides access to cluster schedulers such as Slurm, Torque, SGE and LSF.

Futures are evaluated in a local environment, meaning that you can't change variables in there, just like in functions.

Finish with info that the package author (Henrik Bengtsson) has described this in detail, in a just-published article, for those who want to know more.

The big thing about futures is that they support most infrastructures, and the package is written so that you, the developer, do not decide which infrastructure the user has, unlike parallel and foreach.

---
name: par

## Parallelisation

.pull-left-50[
<br><br><br><br>
<img src="custom_assets/parallel.png">
]

.pull-right-30[
<br>
* Save time by doing atomic tasks in parallel

<br><br>

* Divide tasks or datasets into smaller pieces

<br><br>

* Can bottleneck if tasks are directly dependent
]

???

---
name: fut

## R-package: future

Other packages decide the parallelisation method during development. With future, the code stays the same and the USER decides the parallelisation method.

Pros:

* Very simple
* Uniform code, no matter the strategy
* User-defined parallelisation
* The R process is not blocked while futures are being resolved

Published 2021-06-08 in "The R Journal": [A Unifying Framework for Parallel and Distributed Processing in R using Futures](https://journal.r-project.org/archive/2021/RJ-2021-048/RJ-2021-048.pdf)

???

The state of a future can either be unresolved or resolved. As soon as it is resolved, the value is available instantaneously. If the value is queried while the future is still unresolved, the current process is blocked until the future is resolved.

R package developers rarely know who the end-users are and what compute resources they have. Regardless, developers who wish to support parallel processing still face the problem of deciding which parallel framework to target, a decision which often has to be made early in the development cycle. This means deciding on what type of parallelism to support. This decision is critical because it limits the end-user's options, and any change later on might be expensive because of, for instance, having to rewrite and retest part of the codebase. A developer who wishes to support multiple parallel backends has to implement support for each of them individually and provide the end-user with a mechanism to choose between them.

--

<br><br><br>

Building block: `variable %<-% {expression(s)}`

???

The state of a future can either be unresolved or resolved. As soon as it is resolved, the value is available instantaneously. If the value is queried while the future is still unresolved, the current process is blocked until the future is resolved.

R package developers rarely know who the end-users are and what compute resources they have. Regardless, developers who wish to support parallel processing still face the problem of deciding which parallel framework to target, a decision which often has to be made early in the development cycle. This means deciding on what type of parallelism to support. This decision is critical because it limits the end-user's options, and any change later on might be expensive because of, for instance, having to rewrite and retest part of the codebase. A developer who wishes to support multiple parallel backends has to implement support for each of them individually and provide the end-user with a mechanism to choose between them.
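---

## R-package: future

Behind the scenes, `%<-%` creates a future and an implicit promise for its value. A minimal sketch using the explicit `future()`, `resolved()` and `value()` functions illustrates the unresolved/resolved states mentioned above (the plan and sleep time are chosen only for illustration; plans are introduced on the next slides):

```r
library(future)
plan(multisession)

# Create a future: the expression starts evaluating in a background R session
f <- future({
  Sys.sleep(3)
  1 + 1
})

resolved(f)  # likely FALSE right away; the future is still being evaluated
value(f)     # blocks until the future is resolved, then returns 2
```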
---
name: pla

## Plans

<img src="custom_assets/plans.png">

???

Synchronous: existing or occurring at the same time. "Glaciations were approximately synchronous in both hemispheres."

Asynchronous: not occurring at the same time; controlling the timing of operations by the use of pulses sent when the previous operation is completed rather than at regular intervals.

Sequential: one after another. The default.

Transparent: for troubleshooting. Not covered.

multisession: all operating systems. Futures are evaluated in background R sessions. The number of sessions is decided by availableCores().

multicore: operating systems supporting forking of processes, i.e. all except Windows. Forks the existing R process rather than creating new sessions. The maximum number of forks is decided by availableCores().

cluster: cluster environments, such as an HPC. Uses the parallel package.

remote: connection to a separate R session on a separate machine, typically on a different network.

---
name: seq

## plan(sequential)

<br>

Building block: `variable %<-% {expression(s)}`

--

```r
future::plan(sequential)

a %<-% {
  Sys.sleep(3)
  a <- 1
}

b %<-% {
  Sys.sleep(3)
  b <- 2
}

a + b
```

--

```
## [1] 3
##    user  system elapsed 
##   0.039   0.003   6.045
```

---
name: mul

## plan(multisession) & plan(multicore)

```r
plan(multicore)

a %<-% {
  Sys.sleep(3)
  a <- 1
}

b %<-% {
  Sys.sleep(3)
  b <- 2
}

a + b
```

--

```
## [1] 3
##    user  system elapsed 
##   0.080   0.078   3.103
```

???

Note: to use plan(multicore), this R Markdown must be rendered from an R console in the terminal, as RStudio does not support multicore.

--

.pull-right-50[
```r
availableCores()
```

```
## system 
##      4
```
]

---

## plan(multisession) & plan(multicore)

```r
plan(multicore)

a %<-% {
  Sys.sleep(3)
  a <- 1
}

b %<-% {
  Sys.sleep(3)
  b <- 2
}

c %<-% {
  Sys.sleep(3)
  c <- 3
}

d %<-% {
  Sys.sleep(3)
  d <- 4
}

e %<-% {
  Sys.sleep(3)
  e <- 5
}

a + b + c + d + e
```

--

```
## [1] 15
##    user  system elapsed 
##   0.184   0.187   6.222
```

---
name: clu

## plan(cluster)

* To some degree a wrapper around `parallel::makeCluster()`
* For example:
  + 3 connected nodes (computers) named `n1:n3`
  + Each with 16 CPUs

```r
plan(cluster, workers = c("n1", "n2", "n3"))
```

--

A specialized R package for interfacing with common HPC job schedulers exists: `future.batchtools` (see the sketch on the next slide)

???

Work in progress. Say you have access to three nodes, n1:n3. This will then create a set of copies of R running in parallel and communicating over sockets between them.

I have not tried this yet; it is one thing to need to work in parallel with, for example, 8 or 16 cores on an HPC, but quite another use case to need three whole nodes.
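---

## plan(cluster)

With `future.batchtools`, each future is submitted as a job to an HPC scheduler instead of being run in a local worker. A minimal sketch, assuming a Slurm cluster and a local batchtools template file named `slurm.tmpl` (both are assumptions; adapt them to your system):

```r
library(future)
library(future.batchtools)

# Each future is submitted as a Slurm job, configured via the template file
plan(batchtools_slurm, template = "slurm.tmpl")

a %<-% {
  Sys.sleep(3)
  1
}

a
```

Corresponding plans exist for the other schedulers mentioned earlier, such as Torque, SGE and LSF.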
<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

<p>R version 4.1.0 (2021-05-18)</p>
<p>Platform: x86_64-apple-darwin17.0 (64-bit)</p>
<p>OS: macOS Big Sur 10.16</p>
<br>

Built on: <i class='fa fa-calendar' aria-hidden='true'></i> 15-Jun-2021 at <i class='fa fa-clock-o' aria-hidden='true'></i> 23:03:59

<b>2021</b> • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/) • [RaukR](https://nbisweden.github.io/workshop-RaukR-2106/)