1 Introduction
Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends on, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller compared to virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.
Containers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:
- When publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.
- Packaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.
- Say that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.
One of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be on the former (since it is the most widely used and runs on all three operating systems).
This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.
Docker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available.
2 The basics
We’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so containers are simply instances of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.
If you don’t have root privileges you have to prepend all Docker commands with sudo.
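For example, listing your images would then look like this (the rest of this tutorial omits sudo for brevity):
sudo docker image ls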
2.1 Downloading images
Docker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.
docker pull ubuntu:latest
You will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:
Status: Downloaded newer image for ubuntu:latest
docker.io/library/ubuntu:latest
Let’s take a look at our new and growing collection of Docker images:
docker image ls
The Ubuntu image should show up in this list, with something looking like this:
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu latest d70eaf7277ea 3 weeks ago 72.9MB
2.2 Running containers
We can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]. To see the available options run docker run --help. The COMMAND part is any command that you want to run inside the container; it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG part is where you put optional arguments that the command will use.
Let’s run uname -a to get some info about the operating system. In this case, uname is the COMMAND and -a the ARG. This command will display some general info about your system, and the -a argument tells uname to display all possible information.
First run it on your own system (use systeminfo if you are on Windows):
uname -a
This should print something like this to your command line:
Darwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64
It seems like I’m running the Darwin version of macOS. Then run it in the Ubuntu Docker container:
docker run ubuntu uname -a
Here I get the following result:
Linux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
And now I’m running on Linux! What happens is that we use the downloaded Ubuntu image to run a container with Ubuntu as the operating system, and we instruct Docker to execute uname -a to print the system info within that container. The output from the command is printed to the terminal.
Try the same thing with whoami instead of uname -a.
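In case you want to double-check what that command could look like, one way to write it is shown below (the output should say root, since that is the default user inside the container):
docker run ubuntu whoami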
2.3 Running interactively
So, it seems we can execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the -it flags.
docker run -it ubuntu
Your prompt should now look similar to:
root@1f339e929fa9:/#
You are now using a terminal inside a container running Ubuntu. Here you can do whatever you like: install, run and remove things. Anything you do will be isolated within the container and never affect your host system.
Now exit the container with exit.
2.4 Containers inside scripts
Okay, so Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at the shell part of the index_genome rule in the Snakemake workflow for the MRSA case study:
shell:"""
bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
"""
You may have seen that one can use containers through both Snakemake and Nextflow if you’ve gone through their tutorials’ extra material, but we can also use containers directly inside scripts in a very simple way. Let’s imagine we want to run the above command using containers instead. How would that look? It’s quite simple, really: first we find a container image that has bowtie2 installed, and then prepend the command with docker run <image>.
First of all we need to download the genome to index though, so run:
curl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
gunzip -c NCTC8325.fa.gz > tempfile
This downloads the genome and unpacks it into the tempfile input that Bowtie2 expects.
Now try running the following Bash code:
docker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 bowtie2-build /analysis/tempfile /analysis/NCTC8325
Docker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2 and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...] syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 is the IMAGE, but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly. The bowtie2-build part is the COMMAND, followed by the ARGs (the input tempfile and the output index). The -v $(pwd):/analysis part is the OPTIONS, which we use to mount the current directory inside the container in order to make the tempfile input available to Bowtie2. More on these so-called “bind mounts” in Section 4 of this tutorial.
In this section we’ve learned:
- How to use docker pull for downloading remotely stored images.
- How to use docker image ls for getting information about the images we have on our system.
- How to use docker run for starting a container from an image.
- How to use the -it flag for running in interactive mode.
- How to use Docker inside scripts.
3 Building images
In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
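If you want to see the layers of an image yourself, the docker history command lists them together with the instruction that created each one; a small sketch, assuming you pulled the Ubuntu image in the previous section:
docker history ubuntu:latest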
Docker provides a convenient way to describe how to go from a base image to the image we want by using a “Dockerfile”. This is a simple text file containing the instructions for how to generate each layer. Docker images are typically quite large, often several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.
We will be looking at a Dockerfile called Dockerfile_slim that is located in your containers directory (where you should hopefully be standing already). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.
3.1 Understanding Dockerfiles
Here are the first few lines of Dockerfile_slim. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found at the Docker website.
FROM condaforge/miniforge3
LABEL description = "Minimal image for the NBIS reproducible research course."
MAINTAINER "John Sundh" john.sundh@scilifelab.se
Here we use the instructions FROM, LABEL and MAINTAINER. While LABEL and MAINTAINER are just metadata that can be used for organising your various Docker components, the important one is FROM, which specifies the base image we want to start from. Because we want to use conda to install packages we will start from an image from the conda-forge community that has conda pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to DockerHub. The conda-forge community keeps the Dockerfile in a Git repository and you can view the file here. You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages, including conda.
When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment through a Jupyter notebook. You could then start from one of the rocker images for R, a conda-forge image, or a Jupyter image. Or you could just start from one of the low-level official images and set up everything from scratch.
Let’s take a look at the next section of Dockerfile_slim.
# Use bash as shell
SHELL ["/bin/bash", "--login", "-c"]
# Set workdir
WORKDIR /course
# Set time zone
ENV TZ="Europe/Stockholm"
ENV DEBIAN_FRONTEND=noninteractive
SHELL simply sets which shell to use and WORKDIR determines the directory the container should start in. The ENV instruction is used to set environment variables, and here we use it to set the time zone by declaring a TZ variable. The DEBIAN_FRONTEND=noninteractive line means that we force the subsequent installation to not prompt us to set the time zone manually.
The next few lines introduce the important RUN instruction, which is used for executing shell commands:
# Install package for setting time zone
RUN apt-get update && apt-get install -y tzdata && apt-get clean
# Configure Conda
RUN conda init bash && conda config --set channel_priority strict && \
conda config --append channels bioconda && \
conda config --append channels r && \
conda config --set subdir linux-64
The first RUN command installs the tzdata package for managing local time settings in the container. This may not always be required for your Dockerfile, but it’s added here because some R packages used in the course require it.
While installing things with apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s worth keeping this in mind from a reproducibility perspective.
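If an exact version of a system package does matter for your analysis, apt-get can pin it with the package=version syntax; a minimal sketch (the version string below is purely illustrative and may not exist in the repositories):
# Hypothetical version pin, shown only to illustrate the syntax
RUN apt-get update && apt-get install -y tzdata=2024a-0ubuntu1 && apt-get clean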
Next, we run conda init bash to initialize the bash shell inside the image, meaning we can use conda activate in containers that run from the image. In the same RUN statement we also configure the strict channel priority and add appropriate channels with conda config. You’ll probably recognize this from the pre-course setup. The last part sets the somewhat obscure subdir config parameter, pointing to the linux-64 architecture of conda channels.
As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean-up. This is due to how layers work and how Docker decides what needs to be rerun between builds. More on this later.
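As a sketch of what such a logical unit could look like for a hypothetical system package (curl is just an example of something you might want to install):
# Retrieve, install and clean up in a single RUN instruction, i.e. a single layer
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*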
Next up is:
# Open port for running Jupyter Notebook
EXPOSE 8888
# Start Bash shell by default
CMD /bin/bash
EXPOSE opens up port 8888, so that we can later run a Jupyter Notebook server on that port. CMD is an interesting instruction. It sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. It can be used for example for printing some information on how to use the image or, as here, to start a Bash shell for the user. If the purpose of your image is to accompany a publication then CMD could be to run the workflow that generates the paper figures from raw data, e.g. CMD snakemake -s Snakefile -c 1 generate_figures.
3.2 Building from Dockerfiles
Now we understand how a Dockerfile works. Constructing the image itself from the Dockerfile can be done as follows - try it out:
If your computer is a Mac with the M1 chip, you may have to add --platform linux/x86_64 to the docker build command.
docker build -f Dockerfile_slim -t my_docker_image .
This should result in something similar to this:
[+] Building 2.2s (7/7) FINISHED
=> [internal] load build definition from Dockerfile_slim 0.0s
=> => transferring dockerfile: 667B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/condaforge/miniforge3:latest 0.0s
=> [1/3] FROM docker.io/condaforge/miniforge3 0.0s
=> CACHED [2/3] WORKDIR /course 0.0s
=> [3/3] RUN conda init bash && conda config --set channel_priority strict && conda config --append channels bioconda && conda config --append channels r && conda config --set subdir 2.1s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:53e6efeaa063eadf44c509c770d887af5e222151f08312e741aecc687e6e8981 0.0s
=> => naming to docker.io/library/my_docker_image
Exactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This had no real impact in this case, but matters if you want to import files. Validate with docker image ls that you can see your new image.
3.3 Creating your own Dockerfile
Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. We will later package and run the whole RNA-seq workflow in a Docker container, but for now we keep it simple to reduce the size and time required.
The Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis. A copy of this file should also be available in your current directory. If we want to use the same script we need to include it in the image. A basic outline of what we need to do is:
- Create a file called Dockerfile_conda
- Start the image from the my_docker_image we just built
- Install the package fastqc, which is required for the analysis
- Add the run_qc.sh script to the image
- Set the default command of the image to run the run_qc.sh script
We’ll now go through these steps in more detail. Try to add the corresponding code to Dockerfile_conda on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.
Set image starting point
To set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.
Install packages
Use the RUN instruction to install the package fastqc=0.11.9 with conda. Here there are several options available. For instance we could add an environment file, e.g. environment.yml from the Conda tutorial, and use conda env create to create an environment from that file. Or we could create an environment directly with conda create. We’ll try the latter option here, so add a line that will create an environment named project_mrsa containing the fastqc package, and also clean up packages and cache after installation. Use the -y flag to conda create to avoid the prompt that expects an interaction from the user.
In order to have the project_mrsa environment activated upon start-up we need to add two more lines to the Dockerfile. First we need to use a RUN instruction to run echo "source activate project_mrsa" >> ~/.bashrc, and then we need to use the ENV instruction to set the $PATH variable inside the image to /opt/conda/envs/project_mrsa/bin:$PATH.
Add the analysis script
Use the COPY instruction to add run_qc.sh to the image. The syntax is COPY SOURCE TARGET. In this case SOURCE is the run_qc.sh script and TARGET is a path inside the image; for simplicity it can be specified with ./.
Set default command
Use the CMD instruction to set the default command for the image to bash run_qc.sh.
FROM my_docker_image
RUN conda create -y -n project_mrsa -c bioconda fastqc=0.11.9 && conda clean -a
RUN echo "source activate project_mrsa" >> ~/.bashrc
ENV PATH=/opt/conda/envs/project_mrsa/bin:$PATH
COPY run_qc.sh .
CMD bash run_qc.sh
Build the image and tag it my_docker_conda (remember to add --platform linux/x86_64 to the build command if you are using a Mac with the Apple chip).
docker build -t my_docker_conda -f Dockerfile_conda .
Verify that the image was built using docker image ls.
In this section we’ve learned:
- How the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile.
- How to use docker build to construct and tag an image from a Dockerfile.
- How to create your own Dockerfile.
4 Managing containers
When you start a container with docker run it is given a unique id that you can use for interacting with the container. Let’s try to run a container from the image we just created:
docker run my_docker_conda
If everything worked, run_qc.sh is executed and will first download and then analyse the three samples. Once it’s finished you can list all containers, including those that have exited.
docker container ls --all
This should show information about the container that we just ran. Similar to:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b6f7790462c4 my_docker_conda "tini -- /bin/bash -…" 3 minutes ago Up 24 seconds 8888/tcp sad_maxwell
If we run docker run without any flags, your local terminal is attached to the container. This enables you to see the output of run_qc.sh, but also prevents you from doing anything else in the meantime. We can start a container in detached mode with the -d flag. Try this out and run docker container ls to validate that the container is running.
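A minimal sketch of what that could look like, reusing the my_docker_conda image from before:
docker run -d my_docker_conda
docker container ls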
By default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but it also consumes huge amounts of disk space. It’s therefore a good idea to always run with --rm, which will remove the container once it has exited.
If we want to enter a running container, there are two related commands we can use: docker attach and docker exec. docker attach will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a terminal in detached mode and changed your mind. docker exec can be used to execute any command in a running container. It’s typically used to peek in at what is happening by opening up a new shell. Here we start the container in detached mode and then start a new interactive shell so that we can see what happens. If you use ls inside the container you can see how the script generates files in the data and results directories. Note that you will be thrown out when the container exits, so you have to be quick.
docker run -d --rm --name my_container my_docker_conda
docker exec -it my_container /bin/bash
4.1 Bind mounts
There are obviously some advantages to isolating and running your data analysis in containers, but at some point you need to be able to interact with the rest of the host system (e.g. your laptop) to actually deliver the results. This is done via bind mounts. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.
Docker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.
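As a small sketch of how a named volume could be used (the volume name my_volume is just an example):
docker volume create my_volume
docker run --rm -v my_volume:/data ubuntu bash -c "echo hello > /data/test.txt"
docker run --rm -v my_volume:/data ubuntu cat /data/test.txt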
Say that we are interested in getting the resulting HTML reports from FastQC out of the container. We can do this by mounting a directory called, say, fastqc_results in your current directory to the /course/results/fastqc directory in the container. Try this out by running:
docker run --rm -v $(pwd)/fastqc_results:/course/results/fastqc my_docker_conda
Here the -v flag to docker run specifies the bind mount in the form of directory/on/your/computer:/directory/inside/container. $(pwd) simply evaluates to the working directory on your computer.
Once the container finishes, validate that it worked by opening one of the HTML reports under fastqc_results/.
We can also use bind mounts for getting files into the container rather than out. We’ve mainly been discussing Docker in the context of packaging an analysis pipeline to allow someone else to reproduce its outcome. Another application is as a kind of very powerful environment manager, similarly to how we’ve used Conda before. If you’ve organised your work into projects, then you can mount the whole project directory in a container and use the container as the terminal for running stuff, while still using your normal OS for editing files and so on. Let’s try this out by mounting our current directory and starting an interactive terminal. Note that this will override the CMD command, so we won’t start the analysis automatically when we start the container.
docker run -it --rm -v $(pwd):/course/ my_docker_conda /bin/bash
If you run ls you will see that all the files in the containers/ directory are there.
In this section we’ve learned:
- How to use docker run for starting a container and how the flags -d and --rm work.
- How to use docker container ls for displaying information about the containers.
- How to use docker attach and docker exec to interact with running containers.
- How to use bind mounts to share data between the container and the host system.
6 Packaging the case study
During these tutorials we have been working on a case study about the multi-resistant bacteria MRSA. Here we will build and run a Docker container that contains all the work we’ve done so far.
- We’ve set up a GitHub repository for version control and for hosting our project.
- We’ve defined a Conda environment that specifies the packages we’re depending on in the project.
- We’ve constructed a Snakemake workflow that performs the data analysis and keeps track of files and parameters.
- We’ve written a Quarto document that takes the results from the Snakemake workflow and summarizes them in a report.
The workshop-reproducible-research/tutorials/containers directory contains the final versions of all the files we’ve generated in the other tutorials: environment.yml, Snakefile, config.yml and code/supplementary_material.qmd. The only difference compared to the other tutorials is that we have also included the rendering of the Supplementary Material HTML file into the Snakemake workflow as the rule make_supplementary. Running all of these steps will take some time to execute (around 20 minutes or so), in particular if you’re on a slow internet connection.
Now take a look at Dockerfile. Everything should look quite familiar to you, since it’s basically the same steps as in the image we constructed in the building images section, although with some small modifications. The main difference is that we add the project files needed for executing the workflow (mentioned in the previous paragraph), and install the Conda packages using environment.yml. If you look at the CMD command you can see that it will run the whole Snakemake workflow by default.
Now run docker build as before, tag the image with my_docker_project (remember the --platform linux/x86_64 flag if you’re on a new Mac with the Apple chip):
docker build -t my_docker_project -f Dockerfile .
Go get a coffee while the image builds (or you could use docker pull nbisweden/workshop-reproducible-research, which will download the same image).
Validate with docker image ls. Now all that remains is to run the whole thing with docker run. We just want to get the results, so mount the directory /course/results/ to, say, results/ in your current directory. Click below to see how to write the command.
If building your own image:
docker run -v $(pwd)/results:/course/results my_docker_project
If you pulled the image from DockerHub:
docker run -v $(pwd)/results:/course/results nbisweden/workshop-reproducible-research
Well done! You now have an image that allows anyone to exactly reproduce your analysis workflow (provided you first docker push it to DockerHub, that is).
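A sketch of what pushing could look like, assuming you have a DockerHub account and have logged in with docker login (replace <username> with your own account name):
docker tag my_docker_project <username>/my_docker_project:latest
docker push <username>/my_docker_project:latest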
If you’ve done the Jupyter tutorial, you know that Jupyter Notebook runs as a web server. This makes it very well suited for running in a Docker container, since we can just expose the port Jupyter Notebook uses and redirect it to one of our own. You can then work with the notebooks in your browser just as you’ve done before, while it’s actually running in the container. This means you could package your data, scripts and environment in a Docker image that also runs a Jupyter Notebook server. If you make this image available, say on Dockerhub, other researchers could then download it and interact with your data/code via the fancy interactive Jupyter notebooks that you have prepared for them. We haven’t made any fancy notebooks for you, but we have set up a Jupyter Notebook server. Try it out if you want to (replace the image name with your version if you’ve built it yourself):
docker run -it nbisweden/workshop-reproducible-research jupyter notebook --allow-root --no-browser
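To actually reach the notebook server from a browser on your host you would typically also publish port 8888; a sketch (the -p mapping and the --ip flag are additions not shown in the command above):
docker run -it -p 8888:8888 nbisweden/workshop-reproducible-research jupyter notebook --ip=0.0.0.0 --allow-root --no-browser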
7 Apptainer
Apptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.
The open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps their commercial branch of the project under the Singularity name, and offers a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is virtually no difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.
While it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs).
The reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even then you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on one of the new ARM64 Macs. You also need root (admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote flag, which runs the build on Apptainer’s own servers. This often doesn’t work in practice, though, since most scientists work in private Git repositories so that their research and code is not available to anybody, and the --remote flag requires that e.g. the environment.yml file is publicly available.
There are very good reasons to use Apptainer, however, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for the cases when you already built your Docker image or if you’re using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images. So, we now have another problem for building our own images:
- Only Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.
- You can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.
Seems like a “catch 22”-problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but works very well in practice. What we are basically advocating for is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users or those of you who are fine with working through using VMs and managing any issues that arise from doing that.
Apptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.
7.1 Apptainer-in-Docker
By creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image setup for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:
docker run \
    --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd):/work \
    kaczmarj/apptainer \
    build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>
You already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens to by default, meaning that it is needed for us to be able to specify the location of the Docker container we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. my_docker_image.
- Replace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command - remember that you can use docker image ls to check what images you have available. A filled-in example is shown below.
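For instance, converting the my_docker_conda image built earlier could look like this (a sketch assuming that image exists locally with the latest tag):
docker run \
    --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd):/work \
    kaczmarj/apptainer \
    build my_docker_conda.sif docker-daemon://my_docker_conda:latest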
In the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out this script.
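Transferring the SIF file is then just a regular file copy, for example (a sketch; the username and remote host are hypothetical placeholders):
scp my_docker_image.sif <username>@<hpc-login-node>:~/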
7.2 Running Apptainer
The following exercises assume that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster. Let’s try to convert the Docker image for this course directly from DockerHub:
apptainer pull mrsa_proj.sif docker://nbisweden/workshop-reproducible-research
This should result in a SIF file called mrsa_proj.sif.
In the Docker image we included the code needed for the workflow in the /course directory of the image. These files are of course also available in the Apptainer image. However, an Apptainer image is read-only. This will be a problem if we try to run the workflow within the /course directory, since the workflow will produce files and Snakemake will create a .snakemake directory. Instead, we need to provide the files externally from our host system and simply use the Apptainer image as the environment to execute the workflow in (i.e. for all the software and dependencies).
In your current working directory (workshop-reproducible-research/tutorials/containers/) the vital MRSA project files are already available (Snakefile, config.yml and code/supplementary_material.qmd). Since Apptainer bind mounts the current working directory we can simply execute the workflow and generate the output files using:
apptainer run mrsa_proj.sif
This executes the default run command, which is snakemake -rp -c 1 --configfile config.yml (as defined in the original Dockerfile). Once completed you should see a bunch of directories and files generated in your current working directory, including the results/ directory containing the final HTML report.
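If you instead want to run an arbitrary command inside the container, apptainer exec works in much the same way as passing a command to docker run; a small sketch:
apptainer exec mrsa_proj.sif snakemake --version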
In this section we’ve learned:
- How to build an Apptainer image using Apptainer inside Docker.
- How to convert Docker images to Apptainer images.
- How to run Apptainer images.
8 Extra material
Containers can be large and complicated, but once you start using them regularly you’ll find that you start to understand these complexities. There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here are some small tips and tricks that you can draw inspiration from!
If you want to read more about containers in general you can check out these resources:
- The “Get started with Docker” guide at the Docker website.
- An early paper on the subject of using Docker for reproducible research.
8.1 Building for multiple platforms
With the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. This is sometimes okay since the Rosetta2 software can emulate the old AMD64 architecture on newer ARM64 computers, but it results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub, and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:
- Start by checking the available builders using docker buildx ls.
You should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:
- Run the following: docker buildx create --name mybuilder --driver docker-container --bootstrap.
- Switch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.
All that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .
- Execute the above command with your username and your image.
That’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture, whether they are on AMD64 or ARM64!
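If you want to check which platforms a pushed image actually supports you can inspect its manifest, for example (a sketch using the same placeholders as above):
docker buildx imagetools inspect <username>/<image>:latest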
buildx
You can run docker buildx install to make docker build an alias for docker buildx, allowing you to run multi-platform builds using docker build. Use docker buildx uninstall to remove this alias.