
Reproducible research

RaukR 2019 • Advanced R for Bioinformatics

Roy Francis

RaukR 2019 • 1/22

Topics

  • Reproducibility
  • RStudio
  • Markdown/Rmarkdown
  • Reports and presentations in R

A large percentage of research cannot be reproduced by other researchers, or even by the original researchers themselves. Several high-profile journals have recently addressed this concern.



"The difference between a scientist and a crazy person is that a scientist takes notes."


Typical workflow

  1. Get data
  2. Clean, transform data in spreadsheet
  3. Copy-paste, copy-paste, copy-paste
  4. Run analysis & export figures using A
  5. Write up report using B
  6. Import figures from A to B
  7. Realise a sample was mislabelled
  8. Go back to step 2 and repeat

Problems with using Excel for data analyses.


A manually handled workflow is hard to reproduce because the exact steps carried out are rarely recorded. A programmatic workflow makes every step that was followed fully transparent.
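
As a minimal sketch of a programmatic step, the mislabelled sample from the typical workflow could be corrected in a script rather than by hand in a spreadsheet. The sample names and groups below are made up for illustration; they are not from the course material.

```r
# Hypothetical sample sheet; names and groups are illustrative only
metadata <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  group  = c("control", "control", "treated", "control"),
  stringsAsFactors = FALSE
)

# Assume s4 was mislabelled and should be "treated"; the fix is recorded in code
metadata$group[metadata$sample == "s4"] <- "treated"

# Downstream steps are rerun from the corrected table
group_sizes <- table(metadata$group)
print(group_sizes)
```

Because the correction lives in the script, rerunning the analysis repeats it exactly, and anyone reading the code can see that it was made.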

Benefits of reproducibility

  • Rerunning workflow
  • Additional data/New data
  • Returning to a project
  • Transferring projects
  • Collaborative work
  • Easy to make changes
  • Eliminate copy-paste errors

A reproducible workflow is convenient in several ways.

  • It's easy to automate re-running an analysis when earlier steps have changed, such as new input data, code or assumptions.
  • Useful for an investigator returning to an analysis after a period of time.
  • Useful when a project is transferred to a new investigator.
  • Useful when working collaboratively.
  • Useful when you are asked to modify or change a parameter.

Solutions

  • Containerised computing environment. Eg: Docker
  • Workflow manager. Eg: Snakemake, Nextflow
  • Package and environment manager. Eg: Packrat, Conda
  • Track edits and collaborate on code. Eg: Git
  • Share and track code. Eg: GitHub
  • Notebooks to document ongoing analyses. Eg: Jupyter
  • Analyse and generate reports. Eg: R Markdown

Reproducibility can be implemented at different levels. Reproducibility is the ability of a piece of work to be reproduced by a third party working independently.

Steps to reproducibility

  • Single document containing analysis, code and results
  • Self-contained portable project
  • Avoid manual steps
  • Results are directly linked to code used to generate them
  • Contextual narrative explaining why a certain step was performed
  • Version control of documents

Reproducible programming is not an R-specific issue. R offers a set of tools and an environment that is conducive to reproducible research.

Automate workflow

  • Install packages from repositories
install.packages(), devtools::install_github()
  • Read data and scripts
read.delim(), source(), readr::read_tsv()
  • Reorganise data
dplyr, tidyr
  • Create figures
ggplot2
  • Run statistics
lm(), wilcox.test()
  • Run external programs
system("./plink --file --flag1 --flag2 --out bla")
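
The steps above can be sketched as a single script. This is only an illustration using base R and the built-in `ToothGrowth` dataset (chosen here as a stand-in for real data); base `aggregate()` and `boxplot()` are swapped in for dplyr and ggplot2 so that the sketch runs without extra packages.

```r
# Write a small example table to a temporary file so the sketch is self-contained
infile <- tempfile(fileext = ".txt")
write.table(ToothGrowth, infile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read data
dat <- read.delim(infile, stringsAsFactors = TRUE)

# Reorganise data: mean tooth length per supplement and dose
means <- aggregate(len ~ supp + dose, data = dat, FUN = mean)
print(means)

# Run statistics
fit <- lm(len ~ dose, data = dat)
wt  <- wilcox.test(len ~ supp, data = dat, exact = FALSE)
print(coef(fit))
print(wt$p.value)

# Create a figure and export it to a file instead of copy-pasting it
outfile <- tempfile(fileext = ".pdf")
pdf(outfile)
boxplot(len ~ supp, data = dat, xlab = "Supplement", ylab = "Tooth length")
dev.off()
```

Every step from input to figure is in one file, so the whole analysis can be rerun when the input changes.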

R

  • Multiple R versions can be installed
  • Be explicit about R version
  • Set up R with write permission in libraries
  • Windows users: install to C:/R/ rather than C:/Program Files/R/
  • Windows users: install Rtools for compiling from source
  • Linux users will need additional system packages
  • Bioconductor packages are better managed with BiocManager to avoid conflicts
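
A minimal sketch of how a script could be explicit about the R version and check library write permission. The 3.5.0 threshold is an illustrative assumption, not a requirement from the slides.

```r
# Record the R version in use
r_version <- R.version.string
print(r_version)

# Stop early if the running R is older than the version the analysis assumes
# (3.5.0 is an illustrative threshold)
stopifnot(getRversion() >= "3.5.0")

# Check which library folders are writable (mode = 2 tests write permission)
lib_paths <- .libPaths()
writable  <- file.access(lib_paths, mode = 2) == 0
print(data.frame(library = lib_paths, writable = writable))

# Record attached packages and their versions alongside the results
print(sessionInfo())
```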

RStudio • IDE

  • Code completion & Syntax highlighting (for many languages)
  • R Notebook
  • Debugging
  • Useful GUI elements
  • Multiple sessions can be opened in parallel

RStudio • Project

Create a new project

  • Portable project (.Rproj)
  • Dynamic reports
  • Version control (git)
  • Package control (packrat)

Project Structure

project_name/
+-- raw/
|   +-- gene_counts.txt
|   +-- metadata.txt
+-- results/
|   +-- gene_filtered_counts.txt
|   +-- gene_vst_counts.txt
+-- images/
|   +-- exp-setup.jpg
+-- scripts/
|   +-- bash/
|   |   +-- fastqc.sh
|   |   +-- trim_adapters.sh
|   |   +-- mapping.sh
|   +-- r/
|       +-- qc.R
|       +-- functions.R
|       +-- dge.R
+-- report/
    +-- report.Rmd

  • Organise data, scripts and results sensibly
  • Keep projects self contained
  • Use relative links

Try to organise all material related to a project in a common directory. Organise the directory in a sensible manner. Use relative links to refer to files. Consider raw as read-only content.
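
As a rough sketch, the directory skeleton could be created from R and files referred to by relative paths. The skeleton is built under `tempdir()` here only so that the example does not touch real files; in practice the project root would be the RStudio project folder.

```r
# Build the skeleton in a temporary location
project <- file.path(tempdir(), "project_name")
dirs <- c("raw", "results", "images",
          file.path("scripts", "bash"), file.path("scripts", "r"), "report")
for (d in file.path(project, dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Refer to files relative to the project root, never by absolute path
counts_file   <- file.path("raw", "gene_counts.txt")
filtered_file <- file.path("results", "gene_filtered_counts.txt")

print(list.dirs(project, full.names = FALSE, recursive = TRUE))
```

Relative paths keep the project portable: the same script works after the folder is moved or copied to another machine.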

Document converter

  • Rmd > md > docx|HTML|PDF
  • PDF needs LaTeX
  • Handouts
  • Scientific Articles
  • Presentations
    • beamer
    • ioslides
    • slidy
    • xaringan

Document formats

RMarkdown • Intro

Markdown

RMarkdown

  • Markdown + embedded R chunks
  • Combine text and code in one file
  • RMarkdown mostly uses Pandoc markdown

RStudio • Notebook

Create a new .Rmd document

  • Text and code can be written together
  • Inline R output (text and figures)

R Notebook demonstration.

RMarkdown • Guide

  • Create a file that ends in .Rmd
  • Add YAML front matter at the top
---
title: "This is a title"
output:
  rmarkdown::html_document
---
  • In RStudio File > New File > R Markdown opens up an Rmd template
  • Render interactively using the Knit button
  • Render using command rmarkdown::render("report.Rmd")
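
The guide above can be sketched end to end from the R console. This example writes a tiny `report.Rmd` to a temporary folder (an assumption made so that the sketch is self-contained) and renders it only if the rmarkdown package and Pandoc are available.

```r
# Three backticks, built programmatically to keep this example readable
fence <- strrep("`", 3)

# Write a small R Markdown file to a temporary folder
rmd_file <- file.path(tempdir(), "report.Rmd")
writeLines(c(
  "---",
  "title: \"This is a title\"",
  "output: rmarkdown::html_document",
  "---",
  "",
  "Today's date is `r date()`",
  "",
  paste0(fence, "{r}"),
  "date()",
  fence
), rmd_file)

# Render only if rmarkdown and Pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)
}
```

Rendering from a command rather than the Knit button makes report generation scriptable, so it can be one step in an automated workflow.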

RMarkdown • Guide

### Heading 3
#### Heading 4
_italic text_
__bold text__
`code text`
~~strikethrough~~
2^10^
2~10~
- bullet point
Link to [this](somewhere.com)
![](https://www.r-project.org/Rlogo.png)

Heading 3

Heading 4

italic text
bold text
code text
strikethrough
2¹⁰
2₁₀

  • bullet point

Link to this


RMarkdown • Guide

  • R code can be executed inline like this

Today's date is `r date()`
Today's date is Fri Jun 7 12:43:36 2019

  • R code can be executed in code chunks
```{r}
date()
```
  • By default shows input code and output result.
date()
## [1] "Fri Jun 7 12:43:36 2019"
  • Many arguments to tweak chunks
    • Set eval=FALSE to not evaluate a code chunk
    • Set echo=FALSE to hide input code
    • Set results="hide" to hide output
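
Chunk arguments go in the chunk header, for example `{r, echo=FALSE}`. As a hedged sketch, the same options can also be set once for all chunks with `knitr::opts_chunk$set()`, typically from a setup chunk at the top of the document. The guard below assumes knitr may not be installed.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Hide input code in all chunks but keep the output visible
  knitr::opts_chunk$set(echo = FALSE, results = "markup")

  # Inspect the current default
  print(knitr::opts_chunk$get("echo"))
}
```

Options given in an individual chunk header override these document-wide defaults.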

R Markdown reference https://rmarkdown.rstudio.com/


Acknowledgements


Thank you. Questions?

R version 3.5.2 (2018-12-20)

Platform: x86_64-pc-linux-gnu (64-bit)

OS: Ubuntu 18.04.2 LTS


Built on : 07-Jun-2019 at 12:43:36

2019 • SciLifeLab • NBIS
