1 Introduction
Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends on, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller than virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.
Containers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:
- When publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.
- Packaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.
- Say that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.
One of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be on the former (since that is the most widely used and runs on all three operating systems).
This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.
Docker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available.
2 The basics
We’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so a container is simply an instance of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.
If you don’t have root privileges you have to prepend all Docker commands with sudo.
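For example, the pull command in the next section would then become:

sudo docker pull ubuntu:latest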
2.1 Downloading images
Docker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.
docker pull ubuntu:latest

You will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:
Status: Downloaded newer image for ubuntu:latest
docker.io/library/ubuntu:latest
Let’s take a look at our new and growing collection of Docker images:
docker image ls

The Ubuntu image should show up in this list, with something looking like this:
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu latest d70eaf7277ea 3 weeks ago 72.9MB
2.2 Running containers
We can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]. To see the available options run docker run --help. The COMMAND part is any command that you want to run inside the container; it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG part is where you put optional arguments that the command will use.
Let’s run uname -a to get some info about the operating system. In this case, uname is the COMMAND and -a the ARG. This command will display some general info about your system, and the -a argument tells uname to display all possible information.
- Run uname -a on your own system (use systeminfo if you are on Windows).
This should print something like this to your command line (Darwin is the Unix-like OS that Macs use):
Darwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64
- Run the same command inside the Docker container, like so:
docker run ubuntu uname -a
You should then get something like this:
Linux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
What happens is that we use the downloaded ubuntu image to run a container that has Ubuntu as the operating system, and we instruct Docker to execute uname -a to print the system info within that container. The output from the command is printed to the terminal.
- Try the same thing with whoami instead of uname -a.
2.3 Running interactively
Docker allows us to execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the flags -it.
- Run a container interactively with
docker run -it ubuntu
Your prompt should now look similar to:
root@1f339e929fa9:/#
You are now using a terminal inside a container running Ubuntu. Here you can do whatever: install, run, remove stuff, and so on. Anything you do will be isolated within the container and never affect your host system.
- Now exit the container by typing exit or pressing CTRL + D.
2.4 Containers inside scripts
Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at a bioinformatic command with Bowtie2, used to build an index from a FASTA reference file:
bowtie2-build <name>.fa <name>

Let’s imagine we want to run the above command using containers. We need to find a container image that has bowtie2 installed, and then prepend the command with docker run <image>. First of all we need to download the genome to index, though.
- Run the following to download and prepare the input for Bowtie2:
curl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
gunzip -c NCTC8325.fa.gz > NCTC8325.fa

- Now try running the following command:
docker run quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build NCTC8325.fa NCTC8325

Docker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2 and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...] syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 is the IMAGE, but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly.
The bowtie2-build part is the COMMAND followed by the ARG (the input FASTA file and the output index). You’ll find, however, that the command fails with an error message including Error: could not open NCTC8325.fa. Why is that? Well, it has to do with a thing called bind mounts.
2.5 Bind mounts
Docker containers are, by default, isolated from the rest of your system. This is a really good thing, as it keeps everything separate and reproducible, but it requires some extra work in order to get your file system to talk to the container. The error message we got was because the container couldn’t find the FASTA file we just downloaded, even though we can see it right there in our directory.
This is where bind mounts come into play: we can explicitly give the container access to any files we need to run the analysis. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.
Docker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.
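As a minimal sketch (the volume name my_volume and the ls command are just examples, not part of this tutorial’s exercises), creating a named volume and mounting it into a container could look like this:

docker volume create my_volume
docker run -v my_volume:/data ubuntu ls /data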
The simplest way to use bind mounts is usually to bind the current directory to a directory inside the container. This can be done with the -v $(pwd):<MOUNT> syntax, where <MOUNT> is the directory inside the container we want the files to be available (the general form of this syntax is -v <LOCAL>:<MOUNT>).
- Run the same command again, but also mount the files into the /analysis directory inside the container:
docker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build /analysis/NCTC8325.fa /analysis/NCTC8325

Note that we not only added the -v flag, but also changed the locations of the input and output files: NCTC8325.fa and NCTC8325 from the previous command become /analysis/NCTC8325.fa and /analysis/NCTC8325, since that’s where the files will be mounted inside the container.
It is important to note that if you are running the above command on a Linux machine like Ubuntu, the output files will have root ownership, which is not ideal. This is because commands inside the container run as root (or as whichever user the image specifies) by default. To avoid this, you can add the following option to the above command:

-u $(id -u ${USER}):$(id -g ${USER})

This will make the output files have the same ownership as the user that is running the command.
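Putting the pieces together, the full command on a Linux machine could then look something like this:

docker run -u $(id -u ${USER}):$(id -g ${USER}) -v $(pwd):/analysis \
quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build /analysis/NCTC8325.fa /analysis/NCTC8325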
The operation should now successfully run and you’ll have a bunch of Bowtie2-related files in your directory; since we mounted the current working directory into the container to get the input files in, we also get the output files out of the container.
In this section we’ve learned:
- How to use docker pull for downloading remotely stored images
- How to use docker image ls for getting information about the images we have on our system
- How to use docker run for starting a container from an image
- How to use the -it flag for running in interactive mode
- How to use Docker inside scripts
- How to use bind mounts to share data between the container and the host system
3 Building images
In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
Docker provides a convenient way to describe how to go from a base image to the image we want by using a Dockerfile. This is a simple text file containing the instructions for how to generate each layer. Docker images can sometimes be quite large, upwards of several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.
We will be looking at a Dockerfile called Dockerfile that is located in your tutorials/containers directory (where you should hopefully already be). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.
The default name for a Dockerfile is just that, Dockerfile. In cases where you want to have more than one Dockerfile in a single directory you can use the format <name>.Dockerfile instead, as per the official documentation.
3.1 Understanding Dockerfiles
Here are the first few lines of Dockerfile. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found at the Docker website.
FROM condaforge/miniforge3
LABEL authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL description="Minimal image for the NBIS reproducible research course."Here we use the instructions FROM and LABEL. While LABEL is just key/value metadata pairs that can be used for organising your various Docker components, the important one is FROM, which specifies the base image we want to start from. Because we want to use conda to install packages we will start from an image from the conda-forge community that has conda pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Dockerhub. The conda-forge community keeps the Dockerfile in a git repository and you can view the file here. You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages including conda.
While you can use arbitrary key/value pairs for LABEL instructions however you like, there are best practices available that you might want to follow. These follow the format of org.opencontainers.image.<label>, where the namespace (the first part of the format) comes from the Open Container Initiative (OCI), an organisation aimed at creating industry standards for container formats. You can find a list of all the standard labels at the OCI GitHub.
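As an illustration (these exact labels are not part of the course Dockerfile), the metadata above could be expressed with the standard OCI keys like this:

LABEL org.opencontainers.image.authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL org.opencontainers.image.description="Minimal image for the NBIS reproducible research course."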
When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment together with a Jupyter notebook. You could then start from one of the Rocker images for R, a Condaforge image, or a Jupyter image. Or you just start from one of the low-level official images and set up everything from scratch.
Let’s take a look at the next section of Dockerfile.
WORKDIR /course

WORKDIR determines the directory the container should start in. By default it is set to /, i.e. the container root, but it can be useful to set it to something else so you don’t always see system-level files that are irrelevant for your analyses. While we call it course here you can call it whatever you like, e.g. work or analyses.
Next up is:
SHELL ["/bin/bash", "-c"]SHELL sets the default shell to use in the container. The SHELL instruction has to be written in the ["executable", "parameters"] syntax, which is referred to as “JSON form”. Here we set SHELL to the bash shell (the -c flag is used to pass a command to the shell).
The next few lines introduce the important RUN instruction, which is used for executing shell commands:
# Install `curl` for downloading of FASTQ data later in the tutorial
RUN apt-get update && \
apt-get install -y curl && \
apt-get clean
# Configure Conda
RUN conda config --set channel_priority strict && \
conda config --append channels bioconda

The first RUN command installs the curl command, which will be used to download some raw FASTQ data. As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds (more on this later).
We then configure Conda to only use strict mode for building the dependency tree, like we did in the pre-course-setup, as well as add bioconda to the channel list.
While installing things with package managers such as apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s still worth keeping in mind from a reproducibility perspective.
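If you do want to pin versions for apt-get as well, you can specify them explicitly; a sketch (the version string here is made up and will differ between Ubuntu releases) could look like this:

# Pin curl to a specific (hypothetical) package version
RUN apt-get update && \
apt-get install -y curl=7.81.0-1ubuntu1 && \
apt-get clean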
Next up is:
# Start Bash shell by default
CMD /bin/bash

CMD is an interesting instruction: it sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. It can be used for e.g. printing some information on how to use the image or, as here, starting a Bash shell for the user. If you want some other command to be run by default, for example a workflow, you can change CMD to whatever command would be appropriate instead.
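For example, if the image were meant to run a workflow by default rather than drop you into a shell, the last line could be replaced with something like this (the snakemake command is just a hypothetical example):

# Run a workflow by default instead of starting a Bash shell
CMD snakemake --cores 1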
3.2 Building from Dockerfiles
Now that we understand how a Dockerfile works, it’s time to actually build an image from one.
- Build an image using the following command:
docker build -f Dockerfile -t my_docker_image .

This should result in something similar to this:
[+] Building 52.0s (9/9) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 585B 0.0s
=> [internal] load metadata for docker.io/condaforge/miniforge3:latest 1.8s
=> [auth] condaforge/miniforge3:pull token for registry-1.docker.io 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [1/4] FROM docker.io/condaforge/miniforge3:latest@sha256:b176780143fe 35.2s
=> => resolve docker.io/condaforge/miniforge3:latest@sha256:b176780143fe0 0.0s
=> => sha256:97dd3f0ce510a30a2868ff104e9ff286ffc0ef0128 28.86MB / 28.86MB 9.0s
=> => sha256:cf80b7965770fb1031c5fe4ce4ca539d1930312 111.15MB / 111.61MB 50.1s
=> => extracting sha256:97dd3f0ce510a30a2868ff104e9ff286ffc0ef01284aebe38 0.4s
=> => extracting sha256:cf80b7965770fb1031c5fe4ce4ca539d1930312626d3ead3e 1.4s
=> [2/4] WORKDIR /course 0.4s
=> [3/4] RUN apt-get update && apt-get install -y curl && apt-ge 12.1s
=> [4/4] RUN conda config --set channel_priority strict && conda conf 0.4s
=> exporting to image 1.9s
=> => exporting layers 1.6s
=> => exporting manifest sha256:57132027a7303f6024ff7a2da78155fcc45563a76 0.0s
=> => exporting config sha256:7d25bab9c520dbc56b134d825279dd88c718bfeb89e 0.0s
=> => exporting attestation manifest sha256:a4d504627e50a93a33dd76838866d 0.0s
=> => exporting manifest list sha256:49504f68de7c0265181b7a322272272e43bb 0.0s
=> => naming to docker.io/library/my_docker_image:latest 0.0s
=> => unpacking to docker.io/library/my_docker_image:latest 0.3s
Exactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This had no real impact in this case, but matters if you want to copy files from the build context into the image.
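For example, if the Dockerfile had contained a COPY instruction (ours does not), the file would have been looked up relative to that build context path:

# Copy a file from the build context (the "." we passed to docker build) into the image
COPY environment.yml .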
- Validate with docker image ls that you can see your new image.
3.3 Creating your own Dockerfile
Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. The Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis.
Remember from the lecture that the best way to use containers is, generally, as an advanced environment manager together with a Git repository that tracks code, documentation and the environment specification. What we need to do is thus the following:
- Create a file called conda.Dockerfile
- Start the image from the my_docker_image we previously built
- Install the package fastqc which is required for the analysis
- Run the run_qc.sh script using the new image and the appropriate bind mounts
We’ll now go through these steps in more detail. Try to add the corresponding code to conda.Dockerfile on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.
Set image starting point
To set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.
Install packages
Use the RUN instruction to install the package fastqc=0.11.9 with conda. Here there are several options available. For instance we could add an environment file e.g. environment.yml from the Conda tutorial and use conda env update --name base to update the base environment from that file (the rule to keep the base Conda environment free of other packages does not apply when you’re building it inside a Docker image). Or we could install the package directly with conda install --name base. We’ll try the latter option here, so add a line that will install the fastqc package, and also clean up packages and cache after installation. Use the -y flag to conda install to avoid the prompt that expects an interaction from the user.
Since we used the excellent Miniforge as a base image for my_docker_image the base environment is always available and the conda-forge channel is already added as a default. As you saw above the Dockerfile contains a line where the bioconda channel is added to the Conda configuration so all we need to do is install the fastqc package.
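If you would rather go the environment-file route mentioned above, a minimal sketch (assuming an environment.yml from the Conda tutorial that lists fastqc=0.11.9 has been placed in the build context) could look like this:

FROM my_docker_image
COPY environment.yml .
RUN conda env update --name base --file environment.yml && \
conda clean -ay

The solution below instead installs the package directly with conda install: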
FROM my_docker_image
RUN conda install -y -n base fastqc=0.11.9 && \
conda clean -ay

- Build the image and tag it using:

docker build -t my_docker_conda -f conda.Dockerfile .

- Verify that the image was built using docker image ls.
- Execute the run_qc.sh script inside the container with the appropriate bind mounts using the following command:

docker run -v $(pwd):/analysis my_docker_conda /analysis/run_qc.sh

You should now see the script run to completion, but you won’t find the output files in your working directory. The script creates the data and results directories and puts its files there, but since we’re running from within the container in the root (/) directory, the directories created become /data and /results, which are not mounted (since they’re not inside the /analysis directory). What we can do is to add the -w (--workdir) flag, which tells Docker where execution should happen, i.e. what should be the working directory (you could instead add WORKDIR /analysis to the Dockerfile for the same effect).
- Run the same command again, but add -w /analysis to the command (the full command is shown below).
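In full, the command should then look something like this:

docker run -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh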
While the general strategy we’ve outlined here is sound, it can feel a bit cumbersome to keep track of all the various directories and bind mounts we’re using, but there is an even better strategy: using workflow managers! Both Snakemake and Nextflow can handle all of this for you, and you can focus purely on the command you want to execute in what container. More information about this can be found in their respective “extra materials” section.
Another way to build complicated images is by using Seqera Containers, which is outlined in the extra materials section. This is a very convenient way to get around local builds and hosting the resulting images yourself, and has become widely used in the bioinformatics community.
In this section we’ve learned:
- How the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile
- How to use docker build to construct and tag an image from a Dockerfile
- How to create your own Dockerfile
- How to use -w to set the working directory at runtime
4 Managing containers
When you start a container with docker run it is given a unique id that you can use for interacting with the container.
- Let’s try to run a container from the same image we previously used by running the same command again:
docker run -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh

If everything worked run_qc.sh is executed and will first download and then analyse the three samples.
- Once it’s finished you can list all containers, including those that have exited, using docker container ls --all.
This should show information about the container that we just ran, something similar to the following:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b6f7790462c4 my_docker_conda "tini -- /bin/bash -…" 3 minutes ago Up 24 seconds sad_maxwell
This shows useful information such as the unique container id, the name of the image used to start the container and the unique name given to the container.
By default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but it can also quickly consume a lot of disk space. It’s therefore usually a good idea to run with --rm, which will remove the container once it has exited.
With the way we’ve used docker run so far, your local terminal is attached to the container while it’s running. This enables you to see the output of run_qc.sh, but also disables you from doing anything else in the meantime. We can start a container in detached mode with the -d flag. This will print the container ID that you can use to interact with the running container. Alternatively, you can use the unique name assigned to the container. You can assign this name yourself using the flag --name <any name you want>.
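As a sketch combining these flags (the name my_run is arbitrary), a detached container that is removed once it exits could be started like this:

docker run -d --rm --name my_run -v $(pwd):/analysis -w /analysis \
my_docker_conda /analysis/run_qc.sh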
If we want to enter a running container, there are two commands we can use: docker attach and docker exec.
- docker attach will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a container in detached mode and changed your mind.
- docker exec can be used to execute any command in a running container. It’s typically used to peek in at what is happening by opening up a new shell.
Let’s try out the docker exec command.
- Start a new container in detached mode and name it my_container using:

docker run -d --name my_container -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh

- Start a new interactive Bash shell inside the newly created container using:

docker exec -it my_container /bin/bash

- Use e.g. ls to see how the script generated files in the data and results directories.
You will be thrown out when the container exits, so you have to be quick with performing these three steps, one after the other.
In this section we’ve learned:
- How the docker run flags -d and --rm work
- How to use docker container ls for displaying information about the containers
- How to use docker attach and docker exec to interact with running containers
6 Extra material
There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here are some tips and tricks that you can take inspiration from!
If you want to read more about containers in general you can check out these resources:
- A “Get started with Docker” guide at the Docker website.
- An early paper on the subject of using Docker for reproducible research.
6.1 Seqera containers
While building from your own Dockerfile gives you full control over exactly how you build your image, there is also another alternative for building images: Seqera Containers. This is a free service provided by Seqera, the company behind Nextflow (among other things), and it allows you to easily build images from any package available in PyPI, Bioconda or any Conda channel. This means that most (but not all) bioinformatic software packages can be bundled into a Seqera container, all without the need for a Dockerfile of your own. In fact, all you need to do is to search for the packages in the web interface and Seqera will both build and host your final image for you!
- Head over to https://seqera.io/containers/ and type fastqc in the search bar.
You should find a bunch of packages with fastqc in their names, but the one we’re looking for is the one from Bioconda, which you’ll find listed as bioconda::fastqc.
- Select the 0.11.9 version (the same as we previously used for our other Docker image), which you can do on the right-hand side of the list.
Once you’ve done that the text bioconda::fastqc=0.11.9 should appear in the search bar. We’ll also need curl, since that’s used by the run_qc.sh script.
- Add curl to the image in the same way.
Once you’ve added curl you can press the blue “Get Container” button, and the image will start building; you can also change from a Docker to an Apptainer image, or change the platform (AMD64 or ARM64), but going with the default of AMD64 Docker is usually fine. When you press the button you should get a quick response of “Fetching container” and a link to the image itself; you can also press “View build details” if you want to explore the details about the build and image. Once the build is ready you’ll instead see “Container is ready”. The image can now be used in the same way as any other image.
Since we previously ran the run_qc.sh script with our image, let’s do the same here. Remember that you’ll need to supply -v $(pwd):/tmp so that run_qc.sh will be available as /tmp/run_qc.sh inside the container, as well as -w /tmp to set the workdir:
- Execute docker run -v $(pwd):/tmp -w /tmp <CONTAINER_LINK> /tmp/run_qc.sh from the command line.
This will automatically download the Seqera image and start the script as soon as it’s ready.
The reason we used -v $(pwd):/tmp rather than -v $(pwd):/analysis as before is that Seqera containers default to the /tmp directory as their working directory when running them. There is nothing special about this name, it’s just the default location.
While this particular image was relatively simple, it’s quite common to create more complicated R and Python environments, which require many more packages and dependencies. If all of them are available in e.g. Bioconda (the most common place for bioinformatic software) then you can use a Seqera container instead of making your own Dockerfile. Making your own Dockerfile is, however, still the better alternative if you want complete control of the image (and it’s sometimes the only alternative if your package isn’t available in a Conda channel or PyPI and you have to install it manually).
You can also check out the underlying Wave CLI if you want even more options than the ones available in the Seqera Containers web version.
6.2 Apptainer
Apptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with a focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.
The open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps their commercial branch of the project under the Singularity name, and offer a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is little difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.
While it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs).
The reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even with this you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on an ARM64 Mac. You also need root (admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote flag, which runs the build on Apptainer’s own servers. This doesn’t work in practice a lot of the time, though, since most scientists work in private Git repositories so that their research and code is not available to anybody, and the --remote flag requires that any files needed for the build (e.g. the environment.yml file) are publicly available.
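For reference, a remote build would be invoked roughly like this (my_image.def is a hypothetical Apptainer definition file, and you need to be logged in to a remote build service):

apptainer build --remote my_image.sif my_image.def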
There are very good reasons to use Apptainer, however, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for the cases when you already built your Docker image or if you’re using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images (unless you use Seqera Containers). So, we now have another problem for building our own images:
- Only Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.
- You can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.
Seems like a “catch 22” problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but it works well in practice. What we are basically advocating is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users or those of you who are fine with working in VMs and managing any issues that arise from doing that.
Apptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.
6.2.1 Apptainer-in-Docker
By creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image set up for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:
docker run \
--rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd):/work \
kaczmarj/apptainer \
build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>

You already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens to by default, meaning that it is needed for us to be able to specify the location of the Docker container we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. my_docker_image.
- Replace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command (an example is shown below) - remember that you can use docker image ls to check what images you have available.
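With the my_docker_image image we built earlier, the command would look something like this:

docker run \
--rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd):/work \
kaczmarj/apptainer \
build my_docker_image.sif docker-daemon://my_docker_image:latest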
In the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out this script.
6.2.2 Running Apptainer
The following small exercise assumes that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster.
- Convert the Docker image for the lolcow software directly from the GitHub Container Registry using the following command:

apptainer pull lolcow.sif docker://ghcr.io/apptainer/lolcow

This should result in a SIF file called lolcow.sif.
- Run the lolcow software with Apptainer using apptainer run lolcow.sif.
You should now see a small ASCII art of a cow and today’s date.
6.3 Building for multiple platforms
With the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. This is sometimes okay since the Rosetta2 software can emulate the old AMD64 architecture on newer ARM64 computers, but results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:
- Start by checking the available builders using docker buildx ls.
You should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:
- Run the following command:

docker buildx create --name mybuilder --driver docker-container --bootstrap

- Switch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.
All that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .

- Execute the above command with your username and your image.
That’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture whether they are on AMD64 or ARM64!
buildx
You can run docker buildx install to make docker build an alias for docker buildx, allowing you to run multi-platform builds using docker build. Use docker buildx uninstall to remove this alias.
6.4 Multi-stage builds
Some build processes can be quite complicated, requiring a more diverse set of software packages in order to successfully build a Docker image. Not all of these packages are always required for running the final image, however, which results in a somewhat bloated image with a larger size footprint than what is strictly required. This is where multi-stage builds come in, which allow for optimisation where only the files and packages actually required for execution are included in the final image.
While this is mostly interesting for software developers and people working with binaries and other non-scripted code, it can be of interest in the context of bioinformatics when it comes to optimising e.g. Conda environments. Conda actually comes in two parts: the Conda installation itself (along with all the files required to build Conda environments) and the Conda environments that we’ve built. The former is actually not needed to be able to use the latter, but the technical details behind this are non-trivial and out of the scope of this course. Regardless, there’s a package that can help us with this: conda-pack.
Let’s take a very simple Dockerfile as an example of what we might have created before:
FROM condaforge/miniforge3:24.7.1-0
RUN conda install -y -n base python=3.10
CMD /bin/bash

Here we install Python in the Conda base environment, similar to how we have done it previously in the course.
- Copy the above code into e.g. base.Dockerfile and build it using docker build -f base.Dockerfile -t my_docker_base .
Let’s look at the Dockerfile for a multi-stage image that includes conda-pack:
#
# First stage: Conda environment
#
FROM condaforge/miniforge3:24.7.1-0 AS build
# Install conda-pack into the base environment
RUN conda install -y -n base conda-pack
# Create a new environment that just contains Python
RUN conda create -y -n env python=3.10
# Package the new environment into /env
RUN conda-pack -n env -o /tmp/env.tar && \
mkdir /env && \
tar -xf /tmp/env.tar -C /env && \
rm /tmp/env.tar && \
/env/bin/conda-unpack
#
# Second stage: final image
#
FROM ubuntu:20.04
# Copy Conda environment from previous stage
COPY --from=build /env /env
# Activate the environment when running the container
RUN echo "source /env/bin/activate" >> ~/.bashrc
CMD /bin/bash

The first thing to notice here is the first FROM statement, which now also includes AS build; this is specific to multi-stage builds. What we are doing is giving the stage a name, build in this case, so that we may refer back to it later.
The first RUN command installs conda-pack in the base environment, not the environment with Python that we’re actually interested in. The reason for this is that we only need conda-pack for making the Python environment independent of the Conda installation, but we won’t need it to actually activate the environment once this is done. The second and third RUN commands create the new environment (named env here) and package that environment into a separate directory, respectively.
The second FROM statement starts the second (and final) stage of the build, and thus does not need a name (hence no AS <name> statement). Notice that there’s a COPY statement just after: this is the directive that copies files from the build stage into the current stage. We only copy the Python environment itself, not the base environment nor the Conda installation.
- Copy the code above into a file called multi.Dockerfile and build it with docker build -f multi.Dockerfile -t my_docker_multi .
- List your docker images with docker image ls and compare the newly created my_docker_multi with my_docker_base.
Hopefully you should see that the sizes of the two images are different: the multi image should be about 500 MB smaller than the base image. This tells you something about how large the Conda installation is all by itself.
We’re still using Ubuntu as the base image for the final stage of our image, but we could go even further in our attempt to minimise the image size by using an even smaller base image, such as alpine. Doing this means that you lose out on a lot of basic Unix utilities, though, such as Bash.
There are more complex environment specifications than this, though, especially those using multiple R and/or Python packages. The Conda environments in those cases are much more complicated than the previous example, which means that the optimisation is proportionally smaller. Going from ~850 to ~250 MB is quite a big optimisation, while going from e.g. 2 or 3 GB to 1.5 or 2.5 GB is not quite as big. Regardless, that (albeit smaller) difference can add up over time when it comes to downloading the image, especially if it’s an image used by many people in a project, or if you’re actually paying for hosting of the image.