Documentation and reproducibility

1 Introduction

Documentation is often overlooked, even when following other best practices for bioinformatics reproducibility. While tools like Git, Conda, workflow managers, and containers are essential for reproducibility, they are not sufficient on their own. Without clear documentation, even a perfectly versioned and containerized analysis can be difficult for others (or, more likely, your future self) to understand and reproduce. From a reproducibility perspective, documentation serves several purposes:

It explains what the project is about
It links the data, code and computational environment together
It records how analyses were performed and with which tools
It provides clear instructions for reproducing the results

In a published research project, the article is usually the central artefact. In it, readers should thus be able to find all the information listed above.

Think about one of your projects. Would an outside viewer be able to find all of the above information somewhere in your project? If not, what is missing?

Organising documentation is not a trivial matter: even though you may have all the information somewhere in the project, it may not be easily readable or understandable. The simplest and most straightforward way to organise the project is to store the data in whatever public repository is appropriate for your data, while storing the code, workflow and environment in a Git repository on e.g. GitHub. More detailed documentation regarding reproducibility should also be stored in the repository, which is usually done in a README.md file.

Note

Just like all the other things we’ve gone through in this course, good documentation is not something you add at the end of a project: it should be developed continuously as the project evolves.

2 The README file

In most computational projects, the README.md file is the most important piece of documentation. It is often the first thing a reader sees when visiting your repository, and it plays an important role in reproducibility. A good README should answer the following questions:

1. What is this project about?
2. Is there a publication related to this project?
3. What is contained in this repository?
4. What are the minimum hardware requirements to reproduce the analyses?
5. How do I reproduce the analyses?
6. What outputs should I expect?
7. How can a reader contact the author?

All projects are different, not only in what they are about but also in the tools they use. For example, let’s look at point 5: if you use a workflow manager (e.g. Nextflow or Snakemake), this question has a very short answer: simply show the command to run the workflow from start to finish. If you don’t use a workflow manager, however, you will need to describe in more detail the order in which scripts should be run, with what parameters, and so on. A workflow manager is, to a certain extent, self-documenting.

Think about one of your projects. Are all of the above points available in an organised way? If not, what is missing?

The first two points are generally best answered in free-flowing text at the top of the README itself. The sections below give examples of how you might address the remaining points.

Note

Whenever changes are made to the codebase, these instructions should be re-tested. A README that no longer works could be worse than no README at all.
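One way to make re-testing easier is to pull the documented commands out of the README so they can be reviewed, or run in a clean checkout, whenever the code changes. The following is a minimal sketch, assuming the commands live in fenced blocks tagged `bash` whose fences start at the beginning of a line; the function name `extract_bash_blocks` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the contents of every fenced block tagged "bash" in a Markdown file.
# $1: path to the Markdown file (e.g. README.md)
extract_bash_blocks() {
    awk '
        # [`][`][`] matches a literal fence without embedding one in this script
        /^[`][`][`]bash/ { in_block = 1; next }   # opening bash fence: start capturing
        /^[`][`][`]/     { in_block = 0; next }   # any other fence: stop capturing
        in_block         { print }                # emit lines inside a bash block
    ' "$1"
}

# Usage: list the documented commands for review
# extract_bash_blocks README.md
```

Commands that contain placeholders still need to be filled in by hand before they can be run, so treat the output as a checklist rather than a ready-made test script.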

2.1 Repository contents

It is highly useful to give an overview of the repository’s contents, which will help anybody coming across it to orient themselves. A Nextflow-based example might look like this:

## Project organisation

```
project/
 ├── bin/                   Scripts and executables
 ├── data/                  Data (not stored in this repository)
 ├── doc/                   Documents and other information
 ├── env/*/                 Environment-related files
 │   ├── Dockerfile           Docker image specification
 │   └── environment.yml      Conda environment file
 ├── results/               Workflow results
 ├── main.nf                Workflow definition
 ├── nextflow.config        Workflow configuration
 └── README.md              Project overview and documentation
```

You can get the nice lines between files and directories using the tree command, which can then be copied to the README. It should be explicitly stated where the data can be found and downloaded, as raw data should not be stored in Git repositories.
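As a rough sketch, a listing like the one above could be generated as follows. This assumes `tree` is installed and that Nextflow's default `work/` directory should be left out of the overview; the function name `print_layout` is hypothetical. The descriptions next to each entry are not produced by `tree` and have to be added by hand.

```bash
#!/usr/bin/env bash
# Print a directory layout two levels deep, with directories listed first.
# $1: directory to list (defaults to the current directory)
print_layout() {
    # -L 2         limit the depth to two levels
    # --dirsfirst  list directories before files
    # -I 'work'    skip entries named "work" (assumed to be Nextflow's working directory)
    tree -L 2 --dirsfirst -I 'work' "${1:-.}"
}

# Usage: run from the root of the repository
# print_layout
```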

2.2 Hardware requirements

Reproducibility is not only about software; hardware constraints can determine whether an analysis is feasible on any given system. Some analyses can be run locally on a small laptop, while others may require e.g. an HPC cluster or the cloud. It could also be that only part of the analysis can be run locally, that it cannot be run on Apple Silicon hardware, or that some step requires GPUs. If there are no specific hardware requirements, you should mention this as well; some analyses can run on almost any hardware, and the only limiting factor is time to completion.

## Hardware requirements

 - Minimum 64 GB RAM
 - Minimum 8 cores
 - x86_64 CPU architecture
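A reader can compare their own machine against such a list with a few standard commands. The sketch below is Linux-only (it relies on `nproc` and `/proc/meminfo`), the function names are hypothetical, and the thresholds simply mirror the example values above.

```bash
#!/usr/bin/env bash
# Report the hardware facts listed in the README (Linux-only sketch).

cpu_cores() { nproc; }      # number of available CPU cores
cpu_arch()  { uname -m; }   # e.g. x86_64 or aarch64
mem_gb() {
    # MemTotal in /proc/meminfo is given in kB; convert to whole GB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# meets_minimum ACTUAL REQUIRED: succeeds if ACTUAL >= REQUIRED
meets_minimum() { [ "$1" -ge "$2" ]; }

report() {
    echo "Architecture: $(cpu_arch)"
    echo "Cores:        $(cpu_cores)"
    echo "Memory (GB):  $(mem_gb)"
    meets_minimum "$(cpu_cores)" 8 || echo "Warning: fewer than 8 cores"
    meets_minimum "$(mem_gb)" 64   || echo "Warning: less than 64 GB RAM"
}

# Usage:
# report
```

Note that `MemTotal` is usually slightly lower than the installed RAM, so a machine with nominally 64 GB may still trigger the memory warning; treat the threshold as approximate.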

2.3 Reproducing the analyses

This is the most important section of the README from a reproducibility standpoint. The instructions should be explicit, complete and as simple as possible. Ideally, a complete reproduction of the project should require as few commands as possible. The instructions should also cover how the environment is reproduced, which is easiest when using e.g. Conda and/or containers.

## Reproducibility

 1. Clone this repository:

```bash
git clone https://github.com/my-account/my-project.git
cd my-project
```

 2. Run the Nextflow workflow using Docker or Singularity:

```bash
nextflow run main.nf -profile <docker/singularity>
```
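Instructions like these silently assume that certain tools are already installed. A small pre-flight check can make that assumption explicit. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and the tool list in the usage comment is only an assumption based on the example above.

```bash
#!/usr/bin/env bash
# missing_tools TOOL...: print each tool that is not found on PATH
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: tools assumed by the example instructions (adjust as needed)
# missing=$(missing_tools git nextflow docker)
# [ -z "$missing" ] || { echo "Please install: $missing" >&2; exit 1; }
```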

2.4 Expected outputs

Here you can describe what results should be expected after a successful reproduction, and which ones might be of special interest.

## Outputs

- `results/multiqc/` - Quality control summary reports
- `results/reports/` - Final reports with analyses in R/Python
- `results/tables/`  - Summary statistics and tables used in the manuscript
- `results/figures/` - Figures used in the manuscript
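A list of expected outputs also makes it possible to check a reproduction automatically. The sketch below reports which of the directories listed above are missing or empty; the function name `missing_outputs` is hypothetical, and it only checks that something was produced, not that the contents are correct.

```bash
#!/usr/bin/env bash
# missing_outputs BASE_DIR: print each expected output directory that is
# absent or empty under BASE_DIR. The paths mirror the example list above.
missing_outputs() {
    local base=$1 dir
    for dir in results/multiqc results/reports results/tables results/figures; do
        # A directory counts as present only if it contains at least one entry.
        if [ ! -d "$base/$dir" ] || [ -z "$(ls -A "$base/$dir")" ]; then
            echo "$dir"
        fi
    done
}

# Usage: run from the root of the repository after the workflow has finished
# missing_outputs .
```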

2.5 Discussion with your audience

Just because a project has been published doesn’t mean it has ended. Readers may want clarification on certain aspects, or further help. Describe how a reader should get in touch with you, whether it be e-mail, GitHub issues/discussions, or some other method.

Quick recap

In this section we’ve learnt:

The README.md file is the central hub for documentation
Documentation should link together code, data and the environment
The documentation should be continuously tested and kept up to date