1 Introduction
Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends on, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller compared to virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.
Containers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:
- When publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.
- Packaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.
- Say that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.
One of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be on the former (since that is the most widely used and runs on all three operating systems).
This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers
.
Docker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available.
2 The basics
We’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so a container is simply an instance of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.
If you don’t have root privileges you have to prepend all Docker commands with sudo
.
2.1 Downloading images
Docker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.
docker pull ubuntu:latest
You will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:
Status: Downloaded newer image for ubuntu:latest
docker.io/library/ubuntu:latest
Let’s take a look at our new and growing collection of Docker images:
docker image ls
The Ubuntu image should show up in this list, with something looking like this:
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu latest d70eaf7277ea 3 weeks ago 72.9MB
2.2 Running containers
We can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run
is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
. To see the available options run docker run --help
. The COMMAND
part is any command that you want to run inside the container, it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG
part is where you put optional arguments that the command will use.
Let’s run uname -a
to get some info about the operating system. In this case, uname
is the COMMAND
and -a
the ARG
. This command will display some general info about your system, and the -a
argument tells uname
to display all possible information.
First run it on your own system (use systeminfo
if you are on Windows):
uname -a
This should print something like this to your command line:
Darwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64
Seems like I’m running macOS, whose underlying operating system is called Darwin. Then run it in the Ubuntu Docker container:
docker run ubuntu uname -a
Here I get the following result:
Linux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
And now I’m running on Linux! What happens is that we use the downloaded ubuntu image to run a container that has Ubuntu
as the operating system, and we instruct Docker to execute uname -a
to print the system info within that container. The output from the command is printed to the terminal.
Try the same thing with whoami
instead of uname -a
.
2.3 Running interactively
So, seems we can execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the flags -it
.
docker run -it ubuntu
Your prompt should now look similar to:
root@1f339e929fa9:/#
You are now using a terminal inside a container running Ubuntu. Here you can do whatever; install, run, remove stuff. Anything you do will be isolated within the container and never affect your host system.
Now exit the container with exit
.
2.4 Containers inside scripts
Okay, so Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at the shell
part of the index_genome
rule in the Snakemake workflow for the MRSA case study:
shell:"""
bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
"""
You may have seen that one can use containers through both Snakemake and Nextflow if you’ve gone through their tutorial’s extra material, but we can also use containers directly inside scripts in a very simple way. Let’s imagine we want to run the above command using containers instead. How would that look? It’s quite simple, really: first we find a container image that has bowtie2
installed, and then prepend the command with docker run <image>
.
First of all we need to download the genome to index though, so run:
curl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
gunzip -c NCTC8325.fa.gz > tempfile
to download and prepare the input for Bowtie2.
Now try running the following Bash code:
docker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 bowtie2-build /analysis/tempfile /analysis/NCTC8325
Docker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2
and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0
is the IMAGE but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly. The bowtie2-build
part is the COMMAND followed by the ARG (the input tempfile and the output index).
The -v $(pwd):/analysis
part is the OPTIONS which we use to mount the current directory inside the container in order to make the tempfile
input available to Bowtie2. More on these so-called “Bind mounts” in Section 4 of this tutorial.
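When a script calls several containerised tools, the repeated docker run boilerplate can be wrapped in a small shell function. The following is a minimal sketch under stated assumptions, not part of the course material: the function name run_in_container is hypothetical, and the /analysis mount point is simply reused from the command above.

```shell
# Hypothetical helper (not part of the course material): run a command in a
# container with the current directory mounted at /analysis, the same mount
# point used in the bowtie2-build example.
run_in_container() {
    image="$1"
    shift
    docker run -v "$(pwd)":/analysis "$image" "$@"
}

# Example call, equivalent to the command above (left commented out so that
# nothing is downloaded when this file is sourced):
# run_in_container quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
#     bowtie2-build /analysis/tempfile /analysis/NCTC8325
```

Quoting "$(pwd)" keeps the mount working when the current path contains spaces, and "$@" passes the command and its arguments through unchanged.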
In this section we’ve learned:
- How to use docker pull for downloading remotely stored images.
- How to use docker image ls for getting information about the images we have on our system.
- How to use docker run for starting a container from an image.
- How to use the -it flag for running in interactive mode.
- How to use Docker inside scripts.
3 Building images
In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
Docker provides a convenient way to describe how to go from a base image to the image we want by using a “Dockerfile”. This is a simple text file containing the instructions for how to generate each layer. Docker images are typically quite large, often several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.
We will be looking at a Dockerfile called slim.Dockerfile
that is located in your containers
directory (where you should hopefully be standing already). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.
The default name for a Dockerfile is just that, Dockerfile
. In cases where you want to have more than one Dockerfile in a single directory you can use the format <name>.Dockerfile
instead, as per the official documentation.
3.1 Understanding Dockerfiles
Here are the first few lines of slim.Dockerfile
. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments
. A full specification of the format, together with best practices, can be found at the Docker website.
FROM condaforge/miniforge3
LABEL authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL description="Minimal image for the NBIS reproducible research course."
Here we use the instructions FROM
and LABEL
. While LABEL
is just key/value metadata pairs that can be used for organising your various Docker components, the important one is FROM
, which specifies the base image we want to start from. Because we want to use conda
to install packages we will start from an image from the conda-forge community that has conda
pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Dockerhub. The conda-forge community keeps the Dockerfile in a git repository and you can view the file here. You will see that it starts from an official Ubuntu image (check the first line with the FROM
instruction), followed by code to install various packages including conda.
While you can use arbitrary key/value pairs for LABEL instructions however you like, there are best practices available that you might want to follow. These follow the format of org.opencontainers.image.<label>
, where the namespace (the first part of the format) comes from the Open Container Initiative (OCI), an organisation aimed at creating industry standards for container formats. You can find a list of all the standard labels at the OCI GitHub.
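As an illustration of the OCI naming scheme, the two labels from slim.Dockerfile could be written with the standard keys org.opencontainers.image.authors and org.opencontainers.image.description. This is only a sketch: the file name labels.Dockerfile is made up for the example, and the course's own Dockerfile keeps the shorter keys shown earlier.

```shell
# Write a small example Dockerfile that uses OCI-style label keys instead of
# free-form ones. The file name labels.Dockerfile is only an example.
cat > labels.Dockerfile <<'EOF'
FROM condaforge/miniforge3
LABEL org.opencontainers.image.authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL org.opencontainers.image.description="Minimal image for the NBIS reproducible research course."
EOF

# After building an image from this file, its labels can be inspected with
# (commented out, requires Docker):
# docker inspect --format '{{ json .Config.Labels }}' <image>
```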
When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment through a Jupyter notebook. You could then start from one of the rocker images for R, a Condaforge image, or a Jupyter image. Or you just start from one of the low-level official images and set up everything from scratch.
Let’s take a look at the next section of slim.Dockerfile
.
WORKDIR /course
WORKDIR
determines the directory the container should start in. By default it is set to /
, i.e. the container root, but it can be useful to set it to something else so you don’t always see system-level files that are irrelevant for your analyses. While we call it course
here you can call it whatever you like, e.g. work
or analyses
.
Next up is:
SHELL ["/bin/bash", "-c"]
SHELL
sets the default shell to use in the container. The SHELL
instruction has to be written in the ["executable", "parameters"]
syntax, which is referred to as “JSON form”. Here we set SHELL
to the bash
shell (the -c
flag is used to pass a command to the shell).
The next few lines introduce the important RUN
instruction, which is used for executing shell commands:
# Install `curl` for downloading of FASTQ data later in the tutorial
RUN apt-get update && \
apt-get install -y curl && \
apt-get clean
# Configure Conda
RUN conda config --set channel_priority strict
The first RUN
command installs the curl
command, which will be used to download some raw FASTQ data. As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN
command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds (more on this later).
We then configure Conda to use strict channel priority when resolving dependencies, like we did in the pre-course-setup.
While installing things with apt-get
inside Dockerfiles is relatively common practice, it may affect reproducibility, since exact versions are rarely specified. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself, so this is seldom critical in practice.
Next up is:
# Start Bash shell by default
CMD /bin/bash
CMD
is an interesting instruction. It sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE]
without the additional [COMMAND] [ARG]
. It can be used for example for printing some information on how to use the image or, as here, start a Bash shell for the user. If the purpose of your image is to accompany a publication then CMD
could be to run the workflow that generates the paper figures from raw data, e.g. CMD snakemake -s Snakefile -c 1 generate_figures
.
3.2 Building from Dockerfiles
Now we understand how a Dockerfile works. Constructing the image itself from the Dockerfile can be done as follows - try it out:
docker build -f slim.Dockerfile -t my_docker_image .
This should result in something similar to this:
[+] Building 2.2s (7/7) FINISHED
=> [internal] load build definition from slim.Dockerfile 0.0s
=> => transferring dockerfile: 667B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/condaforge/miniforge3:latest 0.0s
=> [1/3] FROM docker.io/condaforge/miniforge3 0.0s
=> CACHED [2/3] WORKDIR /course 0.0s
=> [3/3] RUN conda config --set channel_priority strict 0.4s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:53e6efeaa063eadf44c509c770d887af5e222151f08312e741aecc687e6e8981 0.0s
=> => naming to docker.io/library/my_docker_image
Exactly how the output looks depends on which version of Docker you are using. The -f
flag sets which Dockerfile to use and -t
tags the image with a name. This name is how you will refer to the image later. Lastly, the .
is the build context, i.e. the directory whose contents Docker can access while building the image (.
means the current directory). This had no real impact in this case, but matters if you want to import files. Validate with docker image ls
that you can see your new image.
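Inside a script, checking for an image by eye with docker image ls is not practical. A small sketch of a scripted check is shown below; the function name image_exists is hypothetical, and it relies on docker image inspect returning a non-zero exit status when the image is not found locally.

```shell
# Hypothetical helper: succeeds if an image with the given name exists locally.
# `docker image inspect` exits with a non-zero status for unknown images.
image_exists() {
    docker image inspect "$1" > /dev/null 2>&1
}

# Example use after a build (commented out, requires Docker):
# docker build -f slim.Dockerfile -t my_docker_image . && \
#     image_exists my_docker_image && echo "my_docker_image is available"
```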
3.3 Creating your own Dockerfile
Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. We will later package and run the whole RNA-seq workflow in a Docker container, but for now we keep it simple to reduce the size and time required.
The Conda tutorial uses a shell script, run_qc.sh
, for downloading and running the analysis. A copy of this file should also be available in your current directory. If we want to use the same script we need to include it in the image. A basic outline of what we need to do is:
- Create a file called conda.Dockerfile.
- Start the image from the my_docker_image we just built.
- Install the package fastqc, which is required for the analysis.
- Add the run_qc.sh script to the image.
- Set the default command of the image to run the run_qc.sh script.
We’ll now go through these steps in more detail. Try to add the corresponding code to conda.Dockerfile
on your own; if you get stuck, the complete solution is shown after the steps below.
Set image starting point
To set the starting point of the new image, use the FROM
instruction and point to my_docker_image
that we built in the previous Building from Dockerfiles step.
Install packages
Use the RUN
instruction to install the package fastqc=0.11.9
with conda. Here there are several options available. For instance we could add an environment file e.g. environment.yml
from the Conda tutorial and use conda env update --name base
to update the base environment from that file (the rule to keep the base Conda environment free of other packages does not apply to when you’re building it inside a Docker image). Or we could install the package directly with conda install --name base
. We’ll try this latter option here, so add a line that will install the fastqc
package, and also clean up packages and cache after installation. Use the -y
flag to conda install
to avoid the prompt that expects an interaction from the user.
Since we used the excellent Miniforge as a base image for my_docker_image
the base environment is always available and the conda-forge
channel is already added as a default. The only thing we need to do is to add the bioconda
channel to the configuration, but other than that there’s not much else needed for Conda to work when using the Miniforge base image.
Add the analysis script
Use the COPY
instruction to add run_qc.sh
to the image. The syntax is COPY SOURCE TARGET
. In this case SOURCE
is the run_qc.sh
script and TARGET
is a path inside the image, for simplicity it can be specified with .
.
Set default command
Use the CMD
instruction to set the default command for the image to bash run_qc.sh
.
Putting these steps together, the complete conda.Dockerfile looks like this:
FROM my_docker_image
RUN conda install -y -n base -c bioconda fastqc=0.11.9 && \
conda clean -a
COPY run_qc.sh .
CMD bash run_qc.sh
Build the image and tag it my_docker_conda
.
docker build -t my_docker_conda -f conda.Dockerfile .
Verify that the image was built using docker image ls
.
In this section we’ve learned:
- How the keywords FROM, LABEL, WORKDIR, SHELL, RUN, COPY, and CMD can be used when writing a Dockerfile.
- How to use docker build to construct and tag an image from a Dockerfile.
- How to create your own Dockerfile.
4 Managing containers
When you start a container with docker run
it is given a unique ID that you can use for interacting with the container. Let’s try to run a container from the image we just created:
docker run my_docker_conda
If everything worked run_qc.sh
is executed and will first download and then analyse the three samples. Once it’s finished you can list all containers, including those that have exited.
docker container ls --all
This should show information about the container that we just ran. Similar to:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b6f7790462c4 my_docker_conda "tini -- /bin/bash -…" 3 minutes ago Up 24 seconds sad_maxwell
If we run docker run
without any flags, your local terminal is attached to the container. This enables you to see the output of run_qc.sh
, but also prevents you from doing anything else in the meantime. We can start a container in detached mode with the -d
flag. Try this out and run docker container ls
to validate that the container is running.
By default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but stopped containers accumulate and can take up a lot of disk space over time. It’s therefore a good idea to run with --rm
, which will remove the container once it has exited.
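Containers that were started without --rm can still be found and removed afterwards. The sketch below wraps two Docker commands for this in helper functions; the function names are hypothetical and only serve to label the commands.

```shell
# Hypothetical helpers for inspecting and cleaning up stopped containers.

# List only containers that have exited but are still kept by Docker.
list_exited_containers() {
    docker container ls --all --filter status=exited
}

# Remove all stopped containers; -f skips the confirmation prompt.
remove_stopped_containers() {
    docker container prune -f
}
```

Note that docker container prune removes every stopped container, so check the list first if any of them might still be needed for debugging.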
If we want to enter a running container, there are two related commands we can use, docker attach
and docker exec
. docker attach
will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a container in detached mode and changed your mind. docker exec
can be used to execute any command in a running container. It’s typically used to peek in at what is happening by opening up a new shell. Here we start the container in detached mode and then start a new interactive shell so that we can see what happens. If you use ls
inside the container you can see how the script generates files in the data
and results
directories. Note that you will be thrown out when the container exits, so you have to be quick.
docker run -d --rm --name my_container my_docker_conda
docker exec -it my_container /bin/bash
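Since docker exec can run any command, it is also possible to look inside the container without opening an interactive shell. The following is a small sketch with a hypothetical function name; it assumes that the container's working directory is /course (set by WORKDIR in the base image), so that the relative paths data and results resolve there.

```shell
# Hypothetical helper: list the data and results directories inside a running
# container without opening an interactive shell. The relative paths assume
# the container's working directory is /course, as set by WORKDIR earlier.
peek_container() {
    docker exec "$1" ls data results
}

# Example (commented out, requires the detached container started above):
# peek_container my_container
```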
4.1 Bind mounts
There are obviously some advantages to isolating and running your data analysis in containers, but at some point you need to be able to interact with the rest of the host system (e.g. your laptop) to actually deliver the results. This is done via bind mounts. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.
Docker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.
Say that we are interested in getting the resulting html reports from FastQC in our container. We can do this by mounting a directory called, say, fastqc_results
in your current directory to the /course/results/fastqc
directory in the container. Try this out by running:
docker run --rm -v $(pwd)/fastqc_results:/course/results/fastqc my_docker_conda
Here the -v
flag to docker run specifies the bind mount in the form of directory/on/your/computer:/directory/inside/container
. $(pwd)
simply evaluates to the working directory on your computer.
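The host side of a bind mount should be an absolute path, which is why $(pwd) is used. As a sketch of how this could be handled in a script, the hypothetical helper below builds the host:container string from a path relative to the current directory. It also creates the host directory up front, on the assumption that a directory created by Docker itself may end up owned by root on Linux.

```shell
# Hypothetical helper: print a bind-mount specification for `docker run -v`.
# It turns a path relative to the current directory into an absolute host
# path and creates that directory if it does not exist yet.
bind_mount_spec() {
    host_dir="$(pwd)/$1"
    mkdir -p "$host_dir"
    printf '%s:%s\n' "$host_dir" "$2"
}

# Example (commented out, requires Docker and the my_docker_conda image):
# docker run --rm -v "$(bind_mount_spec fastqc_results /course/results/fastqc)" my_docker_conda
```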
Once the container finishes, validate that it worked by opening one of the html reports under fastqc_results/
.
We can also use bind mounts for getting files into the container rather than out. We’ve mainly been discussing Docker in the context of packaging an analysis pipeline to allow someone else to reproduce its outcome. Another application is as a kind of very powerful environment manager, similarly to how we’ve used Conda before. If you’ve organised your work into projects, then you can mount the whole project directory in a container and use the container as the terminal for running stuff while still using your normal OS for editing files and so on. Let’s try this out by mounting our current directory and start an interactive terminal. Note that this will override the CMD
command, so we won’t start the analysis automatically when we start the container.
docker run -it --rm -v $(pwd):/course/ my_docker_conda /bin/bash
If you run ls
you will see that all the files in the containers/
directory are there.
In this section we’ve learned:
- How to use docker run for starting a container and how the flags -d and --rm work.
- How to use docker container ls for displaying information about the containers.
- How to use docker attach and docker exec to interact with running containers.
- How to use bind mounts to share data between the container and the host system.
6 Packaging the case study
During these tutorials we have been working on a case study about the multi-resistant bacteria MRSA. Here we will build and run a Docker container that contains all the work we’ve done so far.
- We’ve set up a GitHub repository for version control and for hosting our project.
- We’ve defined a Conda environment that specifies the packages we’re depending on in the project.
- We’ve constructed a Snakemake workflow that performs the data analysis and keeps track of files and parameters.
- We’ve written a Quarto document that takes the results from the Snakemake workflow and summarizes them in a report.
The workshop-reproducible-research/tutorials/containers
directory contains the final versions of all the files we’ve generated in the other tutorials: environment.yml
, Snakefile
, config.yml
and code/supplementary_material.qmd
. The only difference compared to the other tutorials is that we have also included the rendering of the Supplementary Material HTML file into the Snakemake workflow as the rule make_supplementary
. Running all of these steps will take some time to execute (around 20 minutes or so), in particular if you’re on a slow internet connection.
Now take a look at Dockerfile
. Everything should look quite familiar to you, since it’s basically the same steps as in the image we constructed in the building images section, although with some small modifications. The main difference is that we add the project files needed for executing the workflow (mentioned in the previous paragraph), and install the conda packages using environment.yml
. If you look at the CMD
command you can see that it will run the whole Snakemake workflow by default.
Now run docker build
as before, tag the image with my_docker_project
:
This particular environment (which is quite complicated) has some packages that are not available on ARM64 systems. So if you’re using a new Mac computer you will additionally have to supply the --platform linux/amd64
flag in order for it to build successfully.
docker build -t my_docker_project -f Dockerfile .
Go get a coffee while the image builds (or you could use docker pull nbisweden/workshop-reproducible-research
which will download the same image).
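If the same build script is shared between collaborators on different hardware, the --platform flag can be added only where it is needed. The sketch below is one possible way to do that; the function name platform_flag is hypothetical, and it assumes that uname -m reports arm64 on Apple silicon Macs and aarch64 on ARM-based Linux systems.

```shell
# Hypothetical helper: print the extra build flag needed on ARM64 hosts and
# nothing on other architectures. An architecture can be passed explicitly;
# otherwise the output of `uname -m` is used.
platform_flag() {
    case "${1:-$(uname -m)}" in
        arm64|aarch64) echo "--platform linux/amd64" ;;
        *) : ;;
    esac
}

# Example (commented out, requires Docker). The expansion is deliberately left
# unquoted so that an empty result disappears from the command line:
# docker build $(platform_flag) -t my_docker_project -f Dockerfile .
```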
Validate with docker image ls
. Now all that remains is to run the whole thing with docker run
. We just want to get the results, so mount the directory /course/results/
to, say, results/
in your current directory. Click below to see how to write the command.
If building your own image:
docker run -v $(pwd)/results:/course/results my_docker_project
If you pulled the image from DockerHub:
docker run -v $(pwd)/results:/course/results nbisweden/workshop-reproducible-research
Well done! You now have an image that allows anyone to exactly reproduce your analysis workflow (if you first docker push
to Dockerhub that is).
7 Apptainer
Apptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.
The open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps its commercial branch of the project under the Singularity name, and offers a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is virtually no difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.
While it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs).
The reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even with this you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on one of the new ARM64-Macs. You also need root
(admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote
flag, which runs the build on Apptainer’s own servers. This doesn’t work in practice a lot of the time, though, since most scientists work in private Git repositories so that their research and code are not publicly available, and the --remote
flag requires that e.g. the environment.yml
file is publicly available.
There are very good reasons to use Apptainer, however, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for the cases when you already built your Docker image or if you’re using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images. So, we now have another problem for building our own images:
- Only Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.
- You can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.
Seems like a “catch 22”-problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but works very well in practice. What we are basically advocating for is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users or those of you who are fine with working through using VMs and managing any issues that arise from doing that.
Apptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.
7.1 Apptainer-in-Docker
By creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image set up for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:
docker run \
    --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd):/work \
    kaczmarj/apptainer \
    build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>
You already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens on by default, meaning that it is needed for us to be able to specify the location of the Docker image we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. my_docker_image.
- Replace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command - remember that you can use docker image ls to check what images you have available.
In the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out this script.
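To illustrate how such a wrapper might look, here is a minimal sketch. It is not the script linked above: the function name sif_build_command and the default tag latest are assumptions made for this example. It only prints the Apptainer-in-Docker command for a given image, so you can inspect it before running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the script referenced above): prints the
# Apptainer-in-Docker command for a local Docker image without running it.
sif_build_command() {
    local image="$1"          # local Docker image name, e.g. my_docker_image
    local tag="${2:-latest}"  # image tag; "latest" is an assumed default
    # Drop any "user/" prefix so the SIF file lands directly in /work
    local sif="${image##*/}.sif"
    printf 'docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v %s:/work kaczmarj/apptainer build %s docker-daemon://%s:%s\n' \
        "$PWD" "$sif" "$image" "$tag"
}

# Dry run: print the command so it can be checked before executing it
sif_build_command my_docker_image latest
```

The printed command can be copied into the terminal as-is; if your working directory contains spaces you would need to quote the path yourself.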
7.2 Running Apptainer
The following exercises assume that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster. Let’s try to convert the Docker image for this course directly from DockerHub:
apptainer pull mrsa_proj.sif docker://nbisweden/workshop-reproducible-research
This should result in a SIF file called mrsa_proj.sif.
In the Docker image we included the code needed for the workflow in the /course directory of the image. These files are of course also available in the Apptainer image. However, an Apptainer image is read-only. This will be a problem if we try to run the workflow within the /course directory, since the workflow will produce files and Snakemake will create a .snakemake directory. Instead, we need to provide the files externally from our host system and simply use the Apptainer image as the environment to execute the workflow in (i.e. all the software and dependencies).
In your current working directory (workshop-reproducible-research/tutorials/containers/) the vital MRSA project files are already available (Snakefile, config.yml and code/supplementary_material.qmd). Since Apptainer bind mounts the current working directory we can simply execute the workflow and generate the output files using:
apptainer run mrsa_proj.sif
This executes the default run command, which is snakemake -rp -c 1 --configfile config.yml (as defined in the original Dockerfile). Once completed you should see a bunch of directories and files generated in your current working directory, including the results/ directory containing the final HTML report.
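Since the workflow depends on the project files being present in the directory that Apptainer bind mounts, it can be handy to check for them before starting a run. The following is a minimal sketch of such a check; the function name missing_project_files is made up for this example, and the file list is simply the three files mentioned above.

```shell
#!/usr/bin/env bash
# Report which of the expected project files are missing from a directory.
# Prints one missing path per line and returns non-zero if any are missing.
# (Hypothetical helper; the file list is taken from the text above.)
missing_project_files() {
    local dir="${1:-.}"
    local status=0
    local f
    for f in Snakefile config.yml code/supplementary_material.qmd; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# Only start the workflow when nothing is missing and the image is present
if missing_project_files . > /dev/null && [ -f mrsa_proj.sif ]; then
    apptainer run mrsa_proj.sif
else
    echo "Project files or mrsa_proj.sif not found in $(pwd); not running." >&2
fi
```

Calling missing_project_files on its own lists exactly which files are absent, which makes it easier to see whether you are simply in the wrong directory.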
In this section we’ve learned:
- How to build an Apptainer image using Apptainer inside Docker.
- How to convert Docker images to Apptainer images.
- How to run Apptainer images.
8 Extra material
Containers can be large and complicated, but once you start using them regularly you’ll find that you start to understand these complexities. There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here are some small tips and tricks that you can draw inspiration from!
If you want to read more about containers in general you can check out these resources:
- A “Get started with Docker” at the Docker website.
- An early paper on the subject of using Docker for reproducible research.
8.1 Building for multiple platforms
With the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. This is sometimes okay since the Rosetta 2 software can emulate the old AMD64 architecture on newer ARM64 computers, but results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:
- Start by checking the available builders using docker buildx ls.
You should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:
- Run the following:
docker buildx create --name mybuilder --driver docker-container --bootstrap
- Switch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.
All that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .
- Execute the above command with your username and your image.
That’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture whether they are on AMD64 or ARM64!
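If you are unsure which of the two platforms applies to your own computer, the architecture name reported by uname -m can be translated into the platform strings used above. This is a small sketch for illustration only; the function name arch_to_platform is an assumption, and it only covers the two architectures discussed here.

```shell
#!/usr/bin/env bash
# Map a machine architecture name (as printed by `uname -m`) to the
# platform string used by Docker's --platform flag.
# (Hypothetical helper; only handles the two architectures discussed above.)
arch_to_platform() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        arm64|aarch64) echo "linux/arm64" ;;
        *)             echo "unknown architecture: $1" >&2; return 1 ;;
    esac
}

# Show the platform matching the current machine
arch_to_platform "$(uname -m)" || true
```

Apple Silicon Macs typically report arm64 and ARM-based Linux machines aarch64, which is why both names map to the same platform.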
buildx
You can type docker buildx install to make docker build an alias for docker buildx, allowing you to run multi-platform builds using docker build. Use docker buildx uninstall to remove this alias.
8.2 Multi-stage builds
Some build processes can be quite complicated, requiring a more diverse set of software packages in order to successfully build a Docker image. Not all of these packages are always required for running the final image, however, which results in a somewhat bloated image with a larger size footprint than what is strictly required. This is where multi-stage builds come in, which allow for optimisation where only the files and packages actually required for execution are included in the final image.
While this is mostly interesting for software developers and people working with binaries and other non-scripted code, it can be of interest in the context of bioinformatics when it comes to optimising Conda environments. Conda actually comes in two parts: the Conda installation itself (along with all the files required to build Conda environments) and the Conda environments that we’ve built. The former is actually not needed to be able to use the latter, but the technical details behind this are non-trivial and out of the scope of this course. Regardless, there’s a package that can help us with this: conda-pack.
Let’s take a very simple Dockerfile as an example of what we might have created before:
FROM condaforge/miniforge3:24.7.1-0
RUN conda install -y -n base python=3.10
CMD /bin/bash
Here we install Python in the Conda base environment, similar to how we have done it previously in the course.
- Copy the above code into e.g. base.Dockerfile and build it using docker build -f base.Dockerfile -t my_docker_base .
Let’s look at a multi-stage image that includes conda-pack:
#
# First stage: Conda environment
#
FROM condaforge/miniforge3:24.7.1-0 AS build
# Install conda-pack into the base environment
RUN conda install -y -n base conda-pack
# Create a new environment that just contains Python
RUN conda create -y -n env python=3.10
# Package the new environment into /env
RUN conda-pack -n env -o /tmp/env.tar && \
mkdir /env && \
tar -xf /tmp/env.tar -C /env && \
rm /tmp/env.tar && \
/env/bin/conda-unpack
#
# Second stage: final image
#
FROM ubuntu:20.04
# Copy Conda environment from previous stage
COPY --from=build /env /env
# Activate the environment when running the container
RUN echo "source /env/bin/activate" >> ~/.bashrc
CMD /bin/bash
The first thing to notice here is the first FROM statement, which also includes AS build, which is specific to multi-stage builds. What we are doing is giving the stage a name, build in this case, so that we may refer back to it later.
The first RUN command installs conda-pack in the base environment, not the environment with Python that we’re actually interested in. The reason for this is that we only need conda-pack for making the Python environment independent of the Conda installation, but we won’t need it to actually activate the environment once this is done. The second and third RUN commands create the new environment (named env here) and package that environment into a separate directory, respectively.
The second FROM statement starts the second (and final) stage of the build, and thus does not need a name (hence no AS <name> statement). Notice that there’s a COPY statement just after: this is the directive that copies files from the build stage into the current stage. We only copy the Python environment itself, not the base environment nor the Conda installation.
- Copy the code above into a file called multi.Dockerfile and build it with:
docker build -f multi.Dockerfile -t my_docker_multi .
- List your Docker images with docker image ls and compare the newly created my_docker_multi with my_docker_base.
Hopefully you should see that the sizes of the final images are different: the multi image should be about 500 MB smaller than the base image. This tells you something about how large the Conda installation is all by itself.
We’re still using Ubuntu as the base image for the final stage of our image, but we could go even further in our attempt to minimise the image size by using an even smaller base image, such as alpine. Doing this means that you lose out on a lot of basic Unix utilities, though, such as Bash.
- Try to optimise the last image we did for the MRSA project, my_docker_project! Click below if you want some help.
#
# First stage: Conda environment
#
FROM condaforge/miniforge3:24.7.1-0 AS conda
# Install conda-pack
RUN conda install -y -n base conda-pack
# Install environment
COPY environment.yml ./
RUN conda env create -f environment.yml -n env && \
conda clean -a
# Package the new environment into /env
RUN conda-pack -n env -o /tmp/env.tar && \
mkdir /env && \
tar -xf /tmp/env.tar -C /env && \
rm /tmp/env.tar && \
/env/bin/conda-unpack
#
# Second stage: final image
#
FROM ubuntu:20.04
COPY --from=conda /env /env
WORKDIR /course
# Install required packages
RUN apt-get update && \
apt-get install -y curl && \
apt-get clean
# Install Quarto
ARG QUARTO_VERSION="1.3.450"
RUN mkdir -p /opt/quarto && \
curl -o quarto.tar.gz -L "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-amd64.tar.gz" && \
tar -zxvf quarto.tar.gz -C /opt/quarto/ --strip-components=1 && \
rm quarto.tar.gz
ENV PATH="/opt/quarto/bin:${PATH}"
# Activate the environment when running the container
RUN echo "source /env/bin/activate" >> ~/.bashrc
# Add project files
COPY Snakefile config.yml ./
COPY code ./code/
SHELL ["/bin/bash", "-c"]
CMD source /env/bin/activate && \
snakemake -p -c 1 --configfile config.yml
- Now check the difference between the previous my_docker_project and your new image. How much of a size difference did you get?
You should hopefully find that you get around 500 MB difference here as well, but since the Conda environment for this particular image is much more complicated than the previous example, this difference is proportionally smaller: Going from ~850 to ~250 MB is quite a big optimisation, while going from 4 to 3.5 GB is not quite as big. Regardless, that (albeit smaller) difference can add up over time when it comes to downloading the image, especially if it’s an image used by many people in a project, or if you’re actually paying for hosting of the image.
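To make the proportions concrete, the relative savings can be worked out with a little shell arithmetic. This is just an illustrative sketch using the approximate sizes quoted above (with 4 GB and 3.5 GB taken as 4000 and 3500 MB); the function name percent_saved is made up for this example, and the sizes you see on your own system will differ.

```shell
#!/usr/bin/env bash
# Percentage saved when an image shrinks from $1 MB to $2 MB.
# Uses integer arithmetic, so the result is rounded down.
percent_saved() {
    local before="$1" after="$2"
    echo $(( (before - after) * 100 / before ))
}

percent_saved 850 250    # small example image: prints 70
percent_saved 4000 3500  # MRSA project image: prints 12
```

In other words, a similar absolute saving amounts to roughly 70% of the small image but only around 12% of the project image.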