Working with R scripts

class: center, middle, inverse, title-slide

# Working with R scripts
## Advanced R for Bioinformatics. Visby, 2018.
### Markus Mayrhofer (Presenter: Sebastian DiLorenzo)
### 11 June, 2018

---

class: spaced

## R scripts as standalone tools

???
In many ways this quote about the UNIX philosophy relates to the philosophy you should have for an R script.

* Data analysis with R is usually performed interactively using e.g. RStudio

???
Usually when you are **analyzing data** you use the **interactive view** and try different things as you go. But what if you have worked out something that you want to **do** for **many datasets**?

* Routine tasks can be executed from the terminal using R scripts

???
In this case it might be **efficient** to use an **Rscript**.

* R scripts can form powerful standalone tools

???
And like the quote says, it should do **one** thing and do it well. Because an **R script** can contain **multiple functions**, or "programs", that one thing can be quite **simple** or quite **advanced**. And like the text streams mentioned in the quote, R scripts often take input, which we will look at now.

---

## Executing an R script

* Easiest way: `source("myscript.R")` in R console (interactive session)

???
One way to execute an R script is to call `source("myscript.R")` from an interactive session, which **runs** whatever code is in the script. So whether it defines **functions** or **reads** a separate file and creates some new **object**, these will be in your **R environment** after sourcing the script.

* From command line: `Rscript myscript.R` (no interactive session)

???
You can also run the R script from the command line, or terminal, using the command **Rscript**. Not long ago people used **R CMD BATCH** for this, but nowadays Rscript is the usual choice.
Like source, this **executes** whatever code is in **myscript.R**, but there is **no interactive environment** for the **objects or functions** to end up in once the script finishes. The **code** in a script meant for Rscript is therefore usually **different** from code intended for **source**.

* As executable file: `path/myscript.R` if:
  + Script is executable: `chmod +x myscript.R`
  + First line in script is a hashbang e.g. `#!/usr/bin/env Rscript`
  + Script's path is included in call or `$PATH`

???
You can also execute the R script **itself** from the terminal.
To be run this way it must *meet three requirements*.
It must be **executable**.
It must start with this **special line**, which specifies how the file is executed when run on its own.
If you want to run it without giving the path, its folder must be in your $PATH variable.
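
The three requirements can be put together in a small sketch. The file name `myscript.R` follows the slide; the script body is only an illustration, and the shell commands are shown as comments so the block stays valid R.

```r
#!/usr/bin/env Rscript
# Hypothetical contents of myscript.R.
# The hashbang above must be the very first line of the file.
#
# One-time setup in the terminal:
#   chmod +x myscript.R
# Then run it with an explicit path:
#   ./myscript.R
# or by name, if its folder is listed in $PATH:
#   myscript.R

greeting <- paste("Hello from", R.version.string)
cat(greeting, "\n")
```

When the file is passed to `Rscript myscript.R` or to `source("myscript.R")`, the hashbang line is just an R comment and has no effect.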

---

## Providing arguments to an R script

* Passing arguments to the script allows for flexibility in settings and input data

???
**Often** when we use an R script, like I mentioned in the **beginning**, we want to **pass multiple files/samples** through it for efficiency reasons. The arguments **don't** have to be **files**; as with **functions**, they can also be **settings**.

+ `./myscript.R inputfile.vcf outputfile.vcf`

???
Here for example we are using the Rscript as an **executable** file, giving it an **inputfile** and specifying what we want the **outputfile** to be named.

* Packages are available that support long and short flags

+ `./myscript.R -i inputfile.vcf -o outputfile.vcf`

???
**Short flags** are when you give a single dash and usually a shortened version of the keyword, here *i for input* and *o for output* for example.

+ `./myscript.R --input inputfile.vcf --output outputfile.vcf`

???
And here **long flags** with *two dashes*

+ `./myscript.R --output outputfile.vcf --input inputfile.vcf`

???
A part of the **flexibility** of this is that you can give the flags in **any order**.

<!-- --

+ `./myscript.R --output inputfile.vcf`

???
Not sure what Markus is trying to show in this slide.

-->

+ `./myscript.R --output outputfile.vcf -i inputfile.vcf`

???
And you can also *mix* *long and short flags* in the same call, again in any order.

---

## Parsing arguments

The easy way

???
I am not sure I would call this the "easy" way; it is rather the **built-in**, or **standard**, way of doing it.

* Use `commandArgs()` to access the arguments passed to R at launch

```r
commandArgs()
```

```
## [1] "/usr/local/Cellar/r-x11/3.5.0/lib/R/bin/exec/R"
```

???
The built-in way is to use **commandArgs()** to capture whatever was **passed** to R when it was **executed**. To be **clear**, this is a call that you place **inside the R script file**.

+ Add `trailingOnly = TRUE` to suppress the first few items and get the arguments *you* passed to the script.

```r
commandArgs(trailingOnly = TRUE)
```

```
## character(0)
```

???
An argument that is **available** but **not on by default** when calling commandArgs() is **trailingOnly = TRUE**. It returns only the arguments that come **after** the ones belonging to **R itself**, that is, the ones you supplied to the script. You can see here that this removes the invocation shown in the first output. (Live demo?)
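
As a minimal sketch of how a script might use these trailing arguments, assuming exactly two positional arguments as in the `./myscript.R inputfile.vcf outputfile.vcf` call shown earlier. The helper name `parse_positional` is hypothetical and not part of base R.

```r
# Hypothetical helper: turn two positional arguments into a named list.
# By default it reads the arguments that were passed to the script.
parse_positional <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 2) {
    stop("Usage: myscript.R <inputfile> <outputfile>", call. = FALSE)
  }
  list(inputfile = args[1], outputfile = args[2])
}

# Simulate the call `./myscript.R inputfile.vcf outputfile.vcf`
# by handing the argument vector over explicitly.
opts <- parse_positional(c("inputfile.vcf", "outputfile.vcf"))
```

Taking the argument vector as a parameter makes the parsing easy to try out in an interactive session, where `commandArgs(trailingOnly = TRUE)` is usually empty.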

---

## Parsing arguments

The flexible way: short and long flags

???
So how do we do it with **flags**?

* Several packages are available: `getopt`, `optparse`, `argparser`, ...

* Define set of possible arguments at start of script:

```r
library(optparse)
my_options = list(
  make_option(c("-i", "--inputfile"), default='variants.vcf'),
  make_option(c("-o", "--outputfile"), default='variants_filtered.vcf')
)
```

???
Taking **optparse** as an example, you **create** your options with the **make_option** function and can set default values. We also see that each option can be given both a short and a long form.

* Parse arguments using your definition:

```r
parse_args(OptionParser(option_list=my_options))
```

```
## $inputfile
## [1] "variants.vcf"
## 
## $outputfile
## [1] "variants_filtered.vcf"
## 
## $help
## [1] FALSE
```

???
Then you pass the **my_options** object we defined to **OptionParser** and call **parse_args** on the result to **check the input** for those **flags**.
We also see an option, **help**, that we did not define. This is a **standard flag** that optparse always adds; when given, it prints a description of the arguments the script accepts.
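
To see the parser react to actual flags without leaving the R session, the `args` parameter of `parse_args()` can be given a character vector in place of the real command line. A sketch, assuming the `optparse` package is installed; the file names `in.vcf` and `out.vcf` and the help texts are made up for illustration.

```r
library(optparse)

# Same two options as on the slide, with help texts added.
my_options <- list(
  make_option(c("-i", "--inputfile"), default = "variants.vcf",
              help = "VCF file to read [default %default]"),
  make_option(c("-o", "--outputfile"), default = "variants_filtered.vcf",
              help = "VCF file to write [default %default]")
)
parser <- OptionParser(option_list = my_options)

# Simulate `./myscript.R --outputfile out.vcf -i in.vcf`:
# long and short flags mixed, given in reverse order.
opts <- parse_args(parser, args = c("--outputfile", "out.vcf", "-i", "in.vcf"))
```

In the script itself the `args` parameter is simply left out, so that `parse_args()` falls back to `commandArgs(trailingOnly = TRUE)`.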

---

## Text streams

* Text streams allow for piping of data through a set of applications without writing intermediate files.

+ `samtools mpileup -uf ref.fa aln.bam | bcftools call -mv | myPythonscript.py | myRscript.R > variants.vcf`

* To define and open a connection, read one line, and close it:

```r
input_con  <- file("stdin")
open(input_con)
oneline <- readLines(input_con, n = 1)
close(input_con)
```

* Or just read a `tibble` from text stream: `read_csv(file("stdin"))`
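
The one-line read above can be repeated until the stream is exhausted. A sketch, assuming VCF-like input where header lines start with `#`; the function name `count_variant_lines` is hypothetical, and a `textConnection` stands in for `stdin` so the example runs in an interactive session.

```r
# Hypothetical helper: count the non-header lines arriving on a connection.
count_variant_lines <- function(con) {
  n <- 0L
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break            # nothing left to read
    if (!startsWith(line, "#")) n <- n + 1L # skip header lines
  }
  n
}

# Stand-in for a piped stream. In a script this would be:
#   input_con <- file("stdin"); open(input_con)
demo_con <- textConnection(c("##fileformat=VCFv4.2",
                             "#CHROM\tPOS",
                             "1\t100",
                             "1\t200"))
n_variants <- count_variant_lines(demo_con)
close(demo_con)
```

Reading line by line keeps memory use low for large streams; for small inputs, a single `readLines(con)` call without `n` reads everything at once.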

---

## Text streams

* Writing results to a stream:

+ Any `stdout` produced by the code (`print()`, `cat()`, etc) can be piped to a new process: `...myRscript.R | myNewScript`

+ or written to a file: `...myRscript.R > output.csv`

* To write a `tibble` as a text stream: `cat(format_csv(my_tibble))`
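
A base R alternative, sketched under the assumption that the result is a plain data frame: `write.csv()` accepts a connection, so it can write directly to `stdout()`. The helper name `write_csv_stream` is hypothetical. Status messages are sent with `message()`, which writes to `stderr` and so stays out of the piped data.

```r
# Hypothetical helper: write a data frame as CSV to a connection,
# by default the standard output stream.
write_csv_stream <- function(df, con = stdout()) {
  write.csv(df, file = con, row.names = FALSE, quote = FALSE)
}

variants <- data.frame(chrom = c("1", "2"), pos = c(100L, 200L))

# Capture the output in a text connection to inspect it here.
# In a script, write_csv_stream(variants) sends it to stdout.
demo_con <- textConnection(NULL, "w")
write_csv_stream(variants, demo_con)
csv_lines <- textConnectionValue(demo_con)
close(demo_con)

# Goes to stderr, so it does not end up in `> output.csv`.
message("Wrote ", nrow(variants), " variants")
```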

---

## Summary

* Let's try it out

---
name: report

## Session

* This presentation was created in RStudio using the [`remarkjs`](https://github.com/gnab/remark) framework through the R package [`xaringan`](https://github.com/yihui/xaringan).
* For R Markdown, see <http://rmarkdown.rstudio.com>
* For R Markdown presentations, see <https://rmarkdown.rstudio.com/lesson-11.html>

```r
R.version
```

```
##                _                           
## platform       x86_64-apple-darwin16.7.0   
## arch           x86_64                      
## os             darwin16.7.0                
## system         x86_64, darwin16.7.0        
## status                                     
## major          3                           
## minor          5.0                         
## year           2018                        
## month          04                          
## day            23                          
## svn rev        74626                       
## language       R                           
## version.string R version 3.5.0 (2018-04-23)
## nickname       Joy in Playing
```

---
name: end-slide
class: end-slide

# Thank you