Working reproducibly will make your research life a lot easier!
Take control of your research by making its different components reproducible
What have we learned?
How to use the version control system Git to track changes to code
How to use the package and environment manager Conda
How to use the workflow managers Snakemake and Nextflow
How to use Quarto and Jupyter to generate automated reports and to document your analyses
How to use Docker and Apptainer to distribute containerized computational environments
Divide your work into distinct projects
Keep all files needed to go from raw data to final results in a dedicated directory
Use relevant subdirectories
Use Git to version control your projects
Do not store data and results/output in your Git repository
When in doubt, commit more often rather than less
Find your own project structure
For example:
code/        Code needed to go from input files to final results
data/        Raw data - this should never be edited
doc/         Documentation of the project
env/         Environment-related files, e.g. Conda environments or Dockerfiles
results/     Output from workflows and analyses
README.md    Project description and instructions
Don’t create different versions of the input data - if you need to pre-process it, write a script, Quarto document, Jupyter notebook or a Snakemake / Nextflow workflow so that the steps can be recreated
Back up! Keep redundant copies in different physical locations
Upload your raw data as soon as possible to a public data repository
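The points above can be illustrated with a minimal Python sketch. It assumes the example directory layout shown earlier; the file names and the pre-processing rule (dropping blank and comment lines) are hypothetical and only there to show the pattern: read the raw data, never modify it, and write anything derived to a different location.

```python
from pathlib import Path
import tempfile

# Layout taken from the example structure above; adapt it to your own project.
PROJECT_DIRS = ["code", "data", "doc", "env", "results"]


def create_project(root: Path) -> None:
    """Create the project skeleton with a README placeholder."""
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Project description and instructions\n")


def preprocess(raw_file: Path, out_file: Path) -> int:
    """Hypothetical pre-processing step: drop blank lines and comment lines.

    Reads the raw file without modifying it and writes the cleaned copy
    elsewhere, so the step can be rerun from the raw data at any time.
    Returns the number of lines kept.
    """
    kept = [
        line
        for line in raw_file.read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(kept) + "\n")
    return len(kept)


# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_project"
    create_project(project)
    raw = project / "data" / "samples.tsv"  # hypothetical input file
    raw.write_text("# header comment\nA\t1\n\nB\t2\n")
    n = preprocess(raw, project / "results" / "samples.clean.tsv")
    print(f"kept {n} lines")
```

Because the cleaned file is produced by code, it can be deleted and regenerated at any time, and only the script needs to be tracked with Git.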
Organise your coding
Avoid generating files interactively or doing things by hand
Write scripts, Quarto documents, Jupyter notebooks or Snakemake / Nextflow workflows that reproducibly connect raw data to final results
Keep the parameters separate (e.g. at the top of the file or in a separate configuration file)
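One way to keep parameters separate from the analysis code is sketched below. The parameter names (`min_quality`, `threads`), their values and the JSON format are assumptions made for illustration; a YAML file or a workflow manager's own configuration mechanism would serve the same purpose.

```python
import json
from pathlib import Path

# Default parameters kept in one place at the top of the script.
# The names and values are hypothetical.
DEFAULTS = {"min_quality": 20, "threads": 1}


def load_params(config_file=None):
    """Return the defaults, overridden by values from a JSON config file if given."""
    params = dict(DEFAULTS)
    if config_file is not None:
        params.update(json.loads(Path(config_file).read_text()))
    return params


def filter_scores(scores, params):
    """Keep scores at or above the configured quality threshold."""
    return [s for s in scores if s >= params["min_quality"]]


params = load_params()
print(filter_scores([10, 20, 35], params))
```

Changing a threshold then means editing one line or one configuration file, which shows up as a clear, small change in the Git history.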
What is reasonable for your project?
Minimal
Write code in a reproducible way and track your environment
Track each project in its own Git repository; publish code with your results on e.g. GitHub
Use Conda to install software in environments that can be exported and installed on a different system
Publish your environment.yml file along with your code
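As a lightweight complement to an exported Conda environment, an analysis script can also record what it actually ran with. The sketch below is an illustration using only the Python standard library, not a replacement for the environment.yml file; it only sees Python packages, not other software installed through Conda.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect the Python version, platform and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.metadata.get("Version", "unknown")
        except Exception:
            continue  # skip installations with unreadable metadata
        if name:
            packages[name] = version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
summary = {"python": snapshot["python"], "n_packages": len(snapshot["packages"])}
print(json.dumps(summary))
```

Writing the full snapshot next to the results (for example as a JSON file) makes it easier to check later whether a result was produced with the environment you published.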
Good
Structure and document your code with notebooks
Use Quarto or Jupyter notebooks to better keep track of and document your code
Track your notebooks with Git
Great
Track the full environment and connect your code in a workflow
Go one step beyond in tracking your environment using Docker or Apptainer
Convert your code into a Snakemake / Nextflow workflow
Track both your image definitions (e.g. Dockerfiles) as well as your workflows with Git
Alternatives
Version control
Git – Widely used, with many tools available, plus hosting on GitHub/Bitbucket.
Mercurial – Distributed model just like Git, close to SourceForge.
Subversion – Centralized model unlike Git/Mercurial; no local repository on your computer and somewhat easier to use.
Alternatives
Environment / package managers
Conda – General purpose environment and package manager. Community-hosted collections of tools at Bioconda or Conda-forge.
Pixi – General purpose environment/package manager built on the Conda ecosystem, but much faster and works seamlessly with lock-files.
Pip – Package manager for Python, has a large repository at PyPI.
Apt/yum/brew – Native package managers for different OS. Integrated in OS and might deal with e.g. update notifications better.
Virtualenv – Environment manager used to set up semi-isolated Python environments.
Alternatives
Workflow managers
Snakemake – Based on Python, easily understandable format, relies on file names.
Nextflow – Based on Groovy, uses data pipes rather than file names to construct the workflow.
Make – Used in software development and has been around since the 1970s. Flexible but notoriously obscure syntax.
Galaxy – Attempts to make computational biology accessible to researchers without programming experience by using a GUI.
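File-based workflow managers such as Make and Snakemake decide what to run by looking at files. In its simplest form, the rule is that a step is rerun when its output is missing or older than any of its inputs. The Python sketch below illustrates only that basic rule; it is a simplification, and the real tools track more than timestamps. File names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_update(output: Path, inputs: list) -> bool:
    """Return True if the output is missing or older than any of its inputs."""
    if not output.exists():
        return True
    out_time = output.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_time for i in inputs)


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.txt"
    result = Path(tmp) / "result.txt"
    raw.write_text("data\n")
    print(needs_update(result, [raw]))   # output missing
    result.write_text("processed\n")
    os.utime(raw, (1_000, 1_000))        # set explicit timestamps for a
    os.utime(result, (2_000, 2_000))     # deterministic demonstration
    print(needs_update(result, [raw]))   # output newer than input
    os.utime(raw, (3_000, 3_000))
    print(needs_update(result, [raw]))   # input changed after output
```

The three calls print `True`, `False` and `True`: the step runs when there is no output yet, is skipped while the output is up to date, and runs again once the input has changed.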
Alternatives
Literate programming
Quarto – Developed by Posit (previously RStudio), command-line tool focused on generating high-quality documents in a language-agnostic way.
Jupyter – Create and share notebooks in a variety of languages and formats by using a web browser.
R Markdown – Developed by Posit (previously RStudio), focused on generating high-quality documents.
Zeppelin – Developed by Apache. Closely integrated with Spark for distributed computing and Big Data applications.
Beaker – Newcomer based on IPython, just like Jupyter. Has a focus on integrating multiple languages in the same notebook.
Alternatives
Containerization / virtualization
Docker – Used for packaging and isolating applications in containers. Docker Hub allows for convenient sharing. Requires root access by default.
Apptainer/Singularity – Simpler Docker alternative geared towards high performance computing. Does not require root.
Podman – Open-source, daemonless container tool similar to Docker in many regards.
Shifter – Similar ambition as Singularity, but less focus on mobility and more on resource management.
VirtualBox/VMWare – Virtualisation rather than containerization. Less lightweight, but no reliance on host kernel.
“What’s in it for me?”
NBIS Bioinformatics drop-in
Any questions related to reproducible research tools and concepts? Talk to an NBIS expert!
Online (Zoom)
Every Tuesday, 14.00-15.00 (except public holidays)
Check www.nbis.se/events for Zoom link and more info