class: center, middle, inverse, title-slide # Reproducible research in R ## RaukR 2021 • Advanced R for Bioinformatics ###
Roy Francis and Mun-Gwan Hong
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: topics

## Topics

* Reproducibility
* Environment
* RStudio
* Markdown/Rmarkdown
* Reports and presentations in R

---
name: fuss

## What's all the fuss about?

<img src="rr_presentation_assets/nature-reproducibility.jpg" class="fancyimage size-70">

.small[<https://www.nature.com/collections/prbfkwmwvz/>]

<img src="rr_presentation_assets/nature-reproducibility-2.jpg" class="fancyimage size-70">

.small[<https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970>]

???

A large percentage of research cannot be reproduced by other researchers, or even by the original researchers themselves. Several high-profile journals have recently addressed this concern.

---

<img src="rr_presentation_assets/nature-rr-pie.jpg" class="fancyimage size-45">

<img src="rr_presentation_assets/nature-rr-bar.jpg" class="fancyimage size-65">

---

## What is reproducibility?

> "reproducibility refers to the **ability** of a researcher **to duplicate the results** of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results…"

.small[K. Bollen, J. T. Cacioppo, R. Kaplan, J. Krosnick, J. L. Olds, Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science (National Science Foundation, Arlington, VA, 2015)]

.small[Goodman, S. N., Fanelli, D., & Ioannidis, J. P. A. (2016).
What does research reproducibility mean? Science Translational Medicine, 8(341), 341ps12–341ps12. http://doi.org/10.1126/scitranslmed.aaf5027]

---

## Reproducibility in R

<img src="rr_presentation_assets/turingway-reproduciblematrix.jpeg" class="size-85">

.small[from The Turing Way (<https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html>)]

---
class: spaced

## Typical workflow

1. Get data
2. Clean and transform data in a spreadsheet
3. Copy-paste-adjust, copy-paste-adjust, ...
4. Run analysis and export figures using A
5. Write up report using B
6. Import figures from A to B
7. Realize a sample was mislabelled
8. Go back to step 2, repeat
9. (After a couple of months) Need to fix the figures
10. Back to step 2, but forgot which version was the latest
11. Realize the number of samples doesn't match
12. Back to step 2, try to remember why some data was modified manually before

--

.pull-right-30[
<img src="rr_presentation_assets/picard.jpg" class="fancyimage">
]

???

A manually handled workflow is hard to reproduce because it is hard to know the exact steps that were carried out. A programmatic workflow makes the exact steps fully transparent.

---
class: spaced

## Benefits of reproducibility

- Get the same results as before
- Rerunning workflow
- Additional data/New data
- Returning to a project
- Transferring projects
- Collaborative work
- Easy to make changes
- Eliminate copy-paste errors

???

A reproducible workflow offers a lot of convenience.

- It's easy to automate re-running an analysis when earlier steps have changed, such as new input data, code or assumptions.
- Useful for an investigator returning to an analysis after a period of time.
- Useful when a project is transferred to a new investigator.
- Useful when working collaboratively.
- Useful when you are asked to modify or change a parameter.

---
class: spaced

## Solutions

![](rr_presentation_assets/rr-solutions.jpg)

* Containerised computing environment.
Eg: *Docker*
* Workflow manager. Eg: *Snakemake, Nextflow*
* Package and environment manager. Eg: *renv, Conda*
* Track edits and collaborate on code. Eg: *Git*
* Share and track code. Eg: *GitHub*
* Notebooks to document ongoing analyses. Eg: *Jupyter*
* Analyze and generate reports. Eg: *R Markdown*

???

Reproducibility in a project can be achieved at different levels. Reproducibility is the ability of a work to be reproduced by an independent third party.

---
class: spaced

## Steps to reproducibility

* Documents containing analysis, code and results
* Note the environment
* Self-contained portable project
* Avoid manual steps
* Results are directly linked to the code used to generate them
* Contextual narrative on why a certain step was performed
* Version control of documents
* Keep the original data intact (read-only), with descriptions including how the data was obtained

???

Reproducible programming is not an R-specific issue. R offers a set of tools and an environment that is conducive to reproducible research.

---

## Environment

* Not about `environment()`
* Environment around your code
* Operating system (Windows, Mac, Linux, ...)
* A particular version of R/Python
* Loaded package versions

---

## Software for environment management

* Operating system - *Docker*
* R/Python - *Conda*
* Loaded package versions - *`renv`* package

An NBIS course on **Tools for reproducible research** (<https://nbis-reproducible-research.readthedocs.io/en/latest/>)

---

## `renv` package

* **R env**ironment management package
* It keeps individual projects **isolated**, and therefore **portable** and **reproducible**
* Local library of R packages
* Install `renv` package from CRAN

```
install.packages("renv")
```

* Initialize local R environment using `renv`

```
renv::init()
```

* Save the local library state

```
renv::snapshot()
```

* Restore the local library

```
renv::restore()
```

.small[<https://kevinushey-2020-rstudio-conf.netlify.app/slides.html>]

---

## Install R packages

* Use `renv::install` as below
* From CRAN : `renv::install("`*package name*`")` (e.g. `renv::install("dplyr")`)
* From Bioconductor : `renv::install("bioc::`*package name*`")` (e.g. `renv::install("bioc::Biobase")`)
* From GitHub : `renv::install("`*user name*`/`*repository*`")` (e.g.
`renv::install("StoreyLab/qvalue")`)
* From GitLab/Bitbucket : `renv::install("[gitlab|bitbucket]::`*user name*`/`*repository*`")`

---

## RStudio • IDE

<img src="rr_presentation_assets/rstudio.jpg" class="fancyimage size-90">

* Code completion & Syntax highlighting (for many languages)
* R Notebook
* Debugging
* Useful GUI elements
* Multiple sessions can be opened in parallel

---

## RStudio • Project

.small[**Create a new project**]

<img src="rr_presentation_assets/new-project.gif" class="fancyimage size-90">

* Portable project (.Rproj)
* Dynamic reports
* Version control (git)
* Package management (`renv`)

---

## Project Structure

```
project_name/
+-- data/
|   +-- gene_counts.txt
|   +-- metadata.txt
+-- results/
|   +-- gene_filtered_counts.txt
|   +-- gene_vst_counts.txt
+-- images/
|   +-- exp-setup.jpg
+-- scripts/
|   +-- bash/
|   |   +-- fastqc.sh
|   |   +-- trim_adapters.sh
|   |   +-- mapping.sh
|   +-- r/
|       +-- qc.R
|       +-- functions.R
|       +-- dge.R
+-- report/
    +-- report.Rmd
```

* Organise data, scripts and results sensibly
* Keep projects self-contained
* Use relative links

???

Try to organize all material related to a project in a common directory. Organise the directory in a sensible manner. Use relative links to refer to files. Consider raw data as read-only content.
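
The advice on relative links and read-only raw data can be illustrated with a minimal sketch. It assumes the folder and file names from the example layout above; the temporary directory stands in for a real project root, and the toy count table is made up purely for illustration.

```r
# Stand-in for a project root (assumption). In a real RStudio project the
# working directory is already the folder that holds the .Rproj file.
project_root <- file.path(tempdir(), "project_name")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "results"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(project_root)

# Toy raw data (made-up values) placed in data/
counts <- data.frame(gene = c("g1", "g2", "g3"), count = c(0, 12, 7))
write.table(counts, file.path("data", "gene_counts.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the raw file through a path relative to the project root
raw <- read.table(file.path("data", "gene_counts.txt"), header = TRUE, sep = "\t")

# Derived output goes to results/; the raw file is never overwritten
filtered <- raw[raw$count > 0, ]
write.table(filtered, file.path("results", "gene_filtered_counts.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Restore the previous working directory
setwd(old_wd)
```

Because every path is built relative to the project root, the same script should run unchanged when the project folder is moved or shared.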
---

## Document converter

![](rr_presentation_assets/knit.png)

- Rmd > md > docx|HTML|PDF
- PDF needs LaTeX
- Handouts
- Scientific Articles
- Presentations
  - beamer
  - ioslides
  - slidy
  - xaringan

---

## Document formats

- [Summary](https://rmarkdown.rstudio.com/formats.html)
- Reports in [HTML](https://bookdown.org/yihui/rmarkdown/html-document.html), [PDF](https://bookdown.org/yihui/rmarkdown/pdf-document.html), [MS Word](https://bookdown.org/yihui/rmarkdown/word-document.html) etc
- Simple web pages and websites using [Rmarkdown](https://rmarkdown.rstudio.com/lesson-13.html)
- Complex websites using [blogdown](https://bookdown.org/yihui/blogdown/)
- Books using [bookdown](https://bookdown.org/yihui/bookdown/)
- Package documentation using [pkgdown](http://pkgdown.r-lib.org/)
- Web applications and interactive documents using [Shiny](https://www.rstudio.com/products/shiny/)
- Dashboards using [flexdashboard](https://rmarkdown.rstudio.com/flexdashboard/) or [shinydashboard](https://rstudio.github.io/shinydashboard/)

---

## RMarkdown • Intro

### Markdown

- Plain text format for readability
- Supports pure HTML for complex formatting
- Many variations
  - [John Gruber's original](https://daringfireball.net/projects/markdown/syntax)
  - [GitHub Flavored Markdown (GFM)](https://github.github.com/gfm/)
  - [Pandoc](https://pandoc.org/MANUAL.html#pandocs-markdown)
  - [MultiMarkdown](https://fletcherpenney.net/multimarkdown/)
  - [![](rr_presentation_assets/cm.png)](https://commonmark.org/)
- Pandoc supports conversion to multiple output formats
- To compare MD variants [![](rr_presentation_assets/bm.png)](https://babelmark.github.io)

--

### RMarkdown

- Markdown + embedded R chunks
- Combine text and code in one file
- RMarkdown mostly uses [Pandoc markdown](https://rmarkdown.rstudio.com/authoring_pandoc_markdown.html%23raw-tex#pandoc_markdown)

---

## RStudio • Notebook

.small[**Create a new .Rmd document**]

<img src="rr_presentation_assets/new-rmarkdown.gif" class="fancyimage">

* 
Text and code can be written together
* Inline R output (text and figures)

???

R Notebook demonstration.

---

## RMarkdown • Guide

* Create a file that ends in `.Rmd`
* Add YAML front matter at the top

```
---
title: "This is a title"
output: rmarkdown::html_document
---
```

* In RStudio `File > New File > R Markdown` opens up an Rmd template
* Render interactively using the **Knit** button .fancyimage[![](rr_presentation_assets/knit-button.png)]
* Render using command `rmarkdown::render("report.Rmd")`

---

## RMarkdown • Guide

.pull-left-50[
```
### Heading 3
#### Heading 4
_italic text_
*italic text*
__bold text__
**bold text**
`code text`
~~strikethrough~~
2^10^
2~10~
$2^{10}$
$2_{10}$
$\sum\limits_{n=1}^{10} \frac{3}{2}\cdot n$
- bullet point
Link to [this](somewhere.com)
![](https://www.r-project.org/Rlogo.png)
```
]

.pull-right-50[
### Heading 3
#### Heading 4

*italic text*

**bold text**

`code text`

~~strikethrough~~

2<sup>10</sup>

2<sub>10</sub>

`\(2^{10}\)`

`\(2_{10}\)`

`\(\sum\limits_{n=1}^{10} \frac{3}{2}\cdot n\)`

* bullet point

Link to [this](somewhere.com)

.size-60[![](https://www.r-project.org/Rlogo.png)]
]

---

## RMarkdown • Guide

* R code can be executed inline

Today's date is `` `r date()` ``

Today's date is Mon Jun 7 11:27:26 2021

* R code can be executed in code chunks

````{.r}
```{r}
date()
```
````

* By default, both the input code and the output are shown.
```r
date()
```

```
## [1] "Mon Jun 7 11:27:26 2021"
```

* Many arguments to [customise chunks](https://yihui.name/knitr/options/)
* Set `eval=FALSE` to not evaluate a code chunk
* Set `echo=FALSE` to hide input code
* Set `results="hide"` to hide output
* [R Markdown reference](https://rmarkdown.rstudio.com/lesson-1.html)

---

## A few packages useful for R Markdown

* `kableExtra` : Beautiful tables in html .small[(<https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html>)]

<table class=" lightable-classic table" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto; font-size: 10px; width: auto !important; margin-left: auto; margin-right: auto;'>
<thead>
<tr> <th style="empty-cells: hide;" colspan="1"></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #111111; margin-bottom: -1px; ">Group 1</div></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #111111; margin-bottom: -1px; ">Group 2</div></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #111111; margin-bottom: -1px; ">Group 3</div></th> </tr>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> disp </th> <th style="text-align:right;"> hp </th> <th style="text-align:right;"> drat </th> <th style="text-align:right;"> wt </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.90 </td> <td style="text-align:right;"> 2.620 </td> </tr>
<tr> <td 
style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.90 </td> <td style="text-align:right;"> 2.875 </td> </tr>
<tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 108 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 3.85 </td> <td style="text-align:right;"> 2.320 </td> </tr>
<tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 258 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.08 </td> <td style="text-align:right;"> 3.215 </td> </tr>
<tr> <td style="text-align:left;"> Hornet Sportabout </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 360 </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 3.15 </td> <td style="text-align:right;"> 3.440 </td> </tr>
</tbody>
</table>

* `english` : Integer in English (e.g.
`two hundred one` instead of `201`) .small[(<https://cran.r-project.org/web/packages/english/vignettes/the-english-patient.pdf>)]
* `janitor` : More than `table` .small[(<https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html>)]

```
##  cyl  n percent
##    4 11   34.4%
##    6  7   21.9%
##    8 14   43.8%
```

---

## RStudio • Project with Git

.small[**Create a new project with version control**]

<img src="rr_presentation_assets/rstudio_git_new.gif" class="fancyimage">

* Version control : keeps old versions and records who modified which files, when and why
* A repository in GitHub/Bitbucket

---

## RStudio • Git commit

.small[**Log a set of changes using Git**]

<img src="rr_presentation_assets/rstudio_git_push.gif" class="fancyimage">

???

How to Git commit using RStudio

---
name: help
class: spaced

## Acknowledgements

* [**Reproducible Research in R and RStudio**](https://www.slideshare.net/SusanJohnston3/reproducible-research-in-r-and-r-studio) - Susan Johnston
* [**New Tools for Reproducible Research with R**](https://slides.yihui.name/2012-knitr-RStudio.html) - JJ Allaire and Yihui Xie
* [**Reproducible research with R**](http://www.hafro.is/~einarhj/education/tcrenv2016/pre/r-markdown.pdf) - Bjarki Thor Elvarsson and Einar Hjorleifsson
* [**Reproducible Research Workshop**](http://www.geo.uzh.ch/microsite/reproducible_research/post/rr-r-publication/) - University of Zurich
* RStudio [learning](https://www.rstudio.com/online-learning/)

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?
<p>R version 4.0.3 (2020-10-10)<br><p>Platform: x86_64-apple-darwin13.4.0 (64-bit)</p><p>OS: macOS Big Sur 10.16</p><br> Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 07-Jun-2021 at <i class='fa fa-clock-o' aria-hidden='true'></i> 11:27:27 <b>2021</b> • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/) • [RaukR](https://nbisweden.github.io/workshop-RaukR-2106/)