class: center, middle, inverse, title-slide

# Working with R scripts
## Advanced R for Bioinformatics. Visby, 2018.
### Markus Mayrhofer (Presenter: Sebastian DiLorenzo)
### 11 June, 2018

---
class: spaced

## R scripts as standalone tools

<img src="http://www.azquotes.com/picture-quotes/quote-this-is-the-unix-philosophy-write-programs-that-do-one-thing-and-do-it-well-write-programs-douglas-mcilroy-81-95-07.jpg">

???

In many ways this quote about the UNIX philosophy relates to the philosophy you should have for an R script.

--

* Data analysis with R is usually performed interactively using e.g. RStudio

???

Usually when you are **analyzing data** you will use the **interactive view** and try different things as you go. But say that you have figured out something that you want to **do** for **multiple** datasets?

--

* Routine tasks can be executed from the terminal using R scripts

???

In this case it might be **efficient** to use an **R script**.

--

* R scripts can form powerful standalone tools

???

And like the quote says, it should do **one** thing and do it well. Because an **R script** can contain **multiple functions**, or "programs", this one thing can be quite **simple** or quite **advanced**. And like the text streams mentioned in the quote, R scripts often take input, something we will look at more closely now.

---

## Executing an R script

* Easiest way: `source("myscript.R")` in R console (interactive session)

???

One way to execute an R script is to call `source("myscript.R")` from an interactive session, which **runs** whatever code is in the R script. So if it defines **functions**, or **reads** a separate file and creates some new **object**, these will be in your **R environment** after sourcing the script.

--

* From command line: `Rscript myscript.R` (no interactive session)

???

You can also run the R script from the command line, or terminal. Then we use the command **Rscript**.
It was not long ago that people used **R CMD BATCH**, but nowadays people usually use Rscript. Like `source()`, this will **execute** whichever code is in **myscript.R**, but there is **no interactive session** for the **objects or functions** to pop into, so the **code** in this R script is probably **different** from code that is intended for **source**.

--

* As executable file: `path/myscript.R` if:
  + Script is executable: `chmod +x myscript.R`
  + First line in script is a hashbang e.g. `#!/usr/bin/env Rscript`
  + Script's path is included in call or `$PATH`

???

You can also execute the R script **itself** from the terminal. To do so, it must *meet three requirements*. It must be **executable**. It must start with this **special line**, specifying how it is executed if run on its own. And if you want to run it without giving the path, its folder must be in your `$PATH` variable.

---

## Providing arguments to an R script

* Passing arguments to the script allows for flexibility in settings and input data

???

**Often** when we use an R script, like I mentioned in the **beginning**, we want to **pass multiple files/samples** through it for efficiency reasons. It **doesn't** just have to be **files**; as with **functions**, it can also be **settings**.

--

  + `./myscript.R inputfile.vcf outputfile.vcf`

???

Here for example we are using the R script as an **executable** file, giving it an **inputfile** and specifying what we want the **outputfile** to be named.

--

* Packages are available that support long and short flags

--

  + `./myscript.R -i inputfile.vcf -o outputfile.vcf`

???

**Short flags** are when you give a single dash and usually a shortened version of the keyword, here *i for input* and *o for output* for example.

--

  + `./myscript.R --input inputfile.vcf --output outputfile.vcf`

???

And here **long flags** with *two dashes*.

--

  + `./myscript.R --output outputfile.vcf --input inputfile.vcf`

???
Part of the **flexibility** of this is that you can give the flags in **any order**.

<!-- -- + `./myscript.R --output inputfile.vcf` ??? Not sure what Markus is trying to show in this slide. -->

--

  + `./myscript.R --output outputfile.vcf -i inputfile.vcf`

???

And you can also *mix* the *long and short flag styles*.

---

## Parsing arguments

The easy way

???

I am not sure I would call it the "easy" way, but rather the **built-in** or **standard** way to do this.

--

* Use `commandArgs()` to access the arguments passed to R at launch

```r
commandArgs()
```

```
## [1] "/usr/local/Cellar/r-x11/3.5.0/lib/R/bin/exec/R"
```

???

That way is to use **commandArgs()** to capture whatever was **passed** into R as it was **executed**. To be **clear**, this is a command that goes **within the R script file**.

--

  + Add `trailingOnly = TRUE` to suppress the first few items and get the arguments *you* passed to the script.

```r
commandArgs(trailingOnly = TRUE)
```

```
## character(0)
```

???

A **standard parameter**, but **not the default**, that you can use when invoking commandArgs() is **trailingOnly = TRUE**, which basically tells it to return only the arguments that come **after** the **Rscript arguments**. And you can see here that this removes the invocation we saw above. (Live demo?)

---

## Parsing arguments

The flexible way: short and long flags

???

So how do we do it with **flags**?

--

* Several packages are available: `getopt`, `optparse`, `argparser`, ...

--

* Define set of possible arguments at start of script:

```r
library(optparse)

# Each option gets a short flag, a long flag and a default value
my_options <- list(
  make_option(c("-i", "--inputfile"), default = "variants.vcf"),
  make_option(c("-o", "--outputfile"), default = "variants_filtered.vcf")
)
```

???

If we use **optparse** as an example, you **create** your options using the **make_option** function, and can set default values. We also see that you can give both the long and the short form here.
--

* Parse arguments using your definition:

```r
parse_args(OptionParser(option_list=my_options))
```

```
## $inputfile
## [1] "variants.vcf"
##
## $outputfile
## [1] "variants_filtered.vcf"
##
## $help
## [1] FALSE
```

???

And then you use the **my_options** object we defined, together with **parse_args and OptionParser**, to **check our input** for those **flags**. We also see an option, **help**, that we did not make. This is a **standard flag** that optparse always looks for, and it can print a description of the arguments the script accepts.

---

## Text streams

* Text streams allow for piping of data through a set of applications without writing intermediate files.

--

  + `samtools mpileup -uf ref.fa aln.bam | bcftools call -mv | myPythonscript.py | myRscript.R > variants.vcf`

--

* To define and open a connection, read one line, and close it:

```r
input_con <- file("stdin")              # connection to standard input
open(input_con)
oneline <- readLines(input_con, n = 1)  # read a single line
close(input_con)
```

--

* Or just read a `tibble` from text stream: `read_csv(file("stdin"))`

---

## Text streams

* Writing results to a stream:

--

  + Any output the code writes to `stdout` (`print()`, `cat()`, etc) can be piped to a new process: `...myRscript.R | myNewScript`

--

  + or written to a file: `...myRscript.R > output.csv`

--

* To write a `tibble` as a text stream: `cat(format_csv(my_tibble))`

---

## Summary

--

<img src="http://www.azquotes.com/picture-quotes/quote-this-is-the-unix-philosophy-write-programs-that-do-one-thing-and-do-it-well-write-programs-douglas-mcilroy-81-95-07.jpg">

--

* Let's try it out

---
name: report

## Session

* This presentation was created in RStudio using [`remarkjs`](https://github.com/gnab/remark) framework through R package [`xaringan`](https://github.com/yihui/xaringan).
* For R Markdown, see <http://rmarkdown.rstudio.com>
* For R Markdown presentations, see <https://rmarkdown.rstudio.com/lesson-11.html>

```r
R.version
```

```
##                _
## platform       x86_64-apple-darwin16.7.0
## arch           x86_64
## os             darwin16.7.0
## system         x86_64, darwin16.7.0
## status
## major          3
## minor          5.0
## year           2018
## month          04
## day            23
## svn rev        74626
## language       R
## version.string R version 3.5.0 (2018-04-23)
## nickname       Joy in Playing
```

---
name: end-slide
class: end-slide

# Thank you
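
---

## Appendix: positional arguments with defaults

The `commandArgs(trailingOnly = TRUE)` approach from the "Parsing arguments" slide can be wrapped in a small helper. This is only a minimal base-R sketch: the function name `parse_positional` is hypothetical (it does not come from any package), and the default file names are simply borrowed from the optparse slide.

```r
#!/usr/bin/env Rscript

# Hypothetical helper, not part of any package: maps the character vector
# returned by commandArgs(trailingOnly = TRUE) onto named settings.
parse_positional <- function(args,
                             defaults = c(input = "variants.vcf",
                                          output = "variants_filtered.vcf")) {
  if (length(args) > length(defaults)) {
    stop("Usage: myscript.R [inputfile] [outputfile]", call. = FALSE)
  }
  settings <- defaults
  # Overwrite the leading defaults with the arguments that were supplied
  settings[seq_along(args)] <- args
  as.list(settings)
}

# When run with Rscript, the arguments come from the command line
settings <- parse_positional(commandArgs(trailingOnly = TRUE))

# Status messages go to stderr so that stdout stays free for piped data
cat("Reading from:", settings$input, "\n", file = stderr())
cat("Writing to:", settings$output, "\n", file = stderr())
```

Saved as `myscript.R` and made executable, `./myscript.R inputfile.vcf` would use `inputfile.vcf` as input and keep the default output name. For anything beyond a couple of positional arguments, a flag-based package such as `optparse` is probably the more maintainable choice.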