Tools for Reproducible Research

Introduction

29-Oct-2024

Course content

  • Good practices for working with data

  • How to use the version control system Git to track changes to code

  • How to use the package and environment manager Conda

  • How to use the workflow managers Snakemake and Nextflow

  • How to generate automated reports using Quarto and Jupyter

  • How to use Docker and Apptainer to distribute containerised computational environments

The Teachers

John Sundh

Erik Fasterius

Verena Kutschera

Tomas Larsson

Estelle Proux-Wera

Mahesh Binzer-Panchal

Cormac Kinsella

What is NBIS?

National Bioinformatics Infrastructure Sweden

  • A distributed national bioinformatics infrastructure supporting life sciences in Sweden
  • Provides hands-on bioinformatic support, training, infrastructure and a weekly drop-in
  • Situated throughout Sweden
  • Provides wide-spectrum support in the fields of bioinformatics, bioimage informatics, data management, imaging AI, development of systems and tools as well as national compute resources.
  • Read more at nbis.se

What is reproducibility?




Why all the talk about reproducibility?

The Reproducibility Project set out to replicate 100 experiments published in high-impact psychology journals. 1


About one-half to two-thirds of the original findings could not be observed in the replication study.

A survey in Nature revealed that irreproducible experiments are a problem across all domains of science. 1

Medicine is among the most affected research fields. A study in Nature found that the results of 47 out of 53 medical research papers on cancer could not be reproduced. 1

Replication of 18 articles on microarray-based experiments published in Nature Genetics in 2005 & 2006. 1

Reproducibility is rarer than you think

The results of only 26% out of 204 randomly selected papers in the journal Science could be reproduced. 1

“Many journals are revising author guidelines to include data and code availability.”

“(…) an improvement over no policy, but currently insufficient for reproducibility.”

Many excuses are given for not working reproducibly:


“Thank you for your interest in our paper. For the [redacted] calculations I used my own code, and there is no public version of this code, which could be downloaded. Since this code is not very user-friendly and is under constant development I prefer not to share this code.”

“We do not typically share our internal data or code with people outside our collaboration.”

“When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.”

“I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.”

What does reproducible research mean?




                  Data
                  Same           Different
Code  Same        Reproducible   Replicable
      Different   Robust         Generalisable

“Why call the course Reproducible Research, when it could just as well be called Research?”

- Niclas Jareborg, NBIS data management expert

How are you handling your data?

Decent:

  • Data available on request
  • All metadata required for generating the results available

Good:

  • Raw data deposited in public repositories
  • If the raw data needed preprocessing, scripts were used rather than modifying it manually

Great:

  • A section in the paper or an online repository (e.g. GitHub) that aids reproduction
  • Non-proprietary and machine-readable formats were used, e.g. .csv rather than .xls
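
The points above about scripted preprocessing and machine-readable formats can be illustrated with a minimal Python sketch. The column names (sample, value), the example data and the function names are hypothetical and only serve the illustration. The raw input is never modified; the cleaned data is written out as plain CSV.

```python
import csv
import io


def preprocess(raw_lines):
    """Return cleaned rows; rows without a measurement are dropped."""
    reader = csv.DictReader(raw_lines)
    cleaned = []
    for row in reader:
        if row["value"].strip() == "":
            continue  # skip rows with a missing measurement
        cleaned.append(
            {"sample": row["sample"].strip(), "value": float(row["value"])}
        )
    return cleaned


def to_csv(rows):
    """Serialise cleaned rows as CSV text (a non-proprietary format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["sample", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical raw data; in a real project this would be read from a file
# that is kept read-only.
raw = "sample,value\nA, 1.5\nB,\nC,2.0\n"
cleaned = preprocess(io.StringIO(raw))
print(to_csv(cleaned))
```

Because every cleaning step lives in the script, re-running it on the same raw data gives the same processed data, and the script itself documents what was done.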

How are you handling your code?

Decent:

  • All code for generating results from processed data available on request

Good:

  • All code for generating results from raw data is available
  • The code is publicly available with timestamps or tags

Great:

  • Code is documented and contains instructions for reproducing results
  • Random seeds were set and documented for heuristic (stochastic) methods
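
As a minimal sketch of the point about seeds, the Python example below runs a small bootstrap twice with the same seed. The seed value, the data and the function name bootstrap_mean are hypothetical; the idea is only that a documented seed makes a stochastic result repeatable.

```python
import random

SEED = 42  # arbitrary example value; record it alongside the results


def bootstrap_mean(values, n_resamples, seed):
    """Average of resampled means, using a seeded local generator."""
    rng = random.Random(seed)  # independent of the global random state
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


data = [2.1, 3.4, 1.9, 4.2, 2.8]  # hypothetical measurements
first = bootstrap_mean(data, 1000, SEED)
second = bootstrap_mean(data, 1000, SEED)
print(first == second)  # prints True: same seed, same result
```

Passing the seed explicitly, rather than relying on global state, keeps the function repeatable even when other code also draws random numbers.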

How are you handling your environment?

Decent:

  • Key programs used are mentioned in the materials and methods section

Good:

  • A list of all programs used and their respective versions is available

Great:

  • Instructions for reproducing the whole environment are publicly available
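
One way to start on a list of programs and versions is to let the analysis code report them itself. The sketch below uses only the Python standard library; the function name environment_report and the package names queried are hypothetical examples. It does not replace an environment manager such as Conda, which the course covers for reproducing the whole environment.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect the interpreter, platform and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Hypothetical package list; replace with the packages the analysis uses.
report = environment_report(["pip", "a-package-that-does-not-exist"])
for name, version in report.items():
    print(f"{name}\t{version}")
```

Saving this output next to the results gives a record of the versions that were actually in use when the analysis ran.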

“What’s in it for me?”

Before the project:

  • Improved structure and organisation
  • Forced to think about scope and limitations

During the project:

  • Easier to re-run analyses and generate results after updates and/or changes
  • Closer interaction between collaborators
  • Much of the manuscript “writes itself”

After the project:

  • Faster resumption of research by others (or, more likely, your future self)
  • Increased visibility in the scientific community

Questions?