1 Introduction
Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends on, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller than virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.
Containers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:
- When publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.
- Packaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.
- Say that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.
One of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be on the former (since that is the most widely used and runs on all three operating systems).
This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.
Docker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available.
2 The basics
We’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so a container is simply an instance of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.
If you don’t have root privileges you have to prepend all Docker commands with sudo.
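For example, the pull command in the next section would then become:

sudo docker pull ubuntu:latest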
2.1 Downloading images
Docker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.
docker pull ubuntu:latest

You will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:
Status: Downloaded newer image for ubuntu:latest
docker.io/library/ubuntu:latest
Let’s take a look at our new and growing collection of Docker images:
docker image ls

The Ubuntu image should show up in this list, with something looking like this:
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu latest d70eaf7277ea 3 weeks ago 72.9MB
2.2 Running containers
We can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]. To see the available options run docker run --help. The COMMAND part is any command that you want to run inside the container; it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG part is where you put optional arguments that the command will use.
Let’s run uname -a to get some info about the operating system. In this case, uname is the COMMAND and -a the ARG. This command will display some general info about your system, and the -a argument tells uname to display all possible information.
- Run uname -a on your own system (use systeminfo if you are on Windows).
This should print something like this to your command line (Darwin is the Unix-like OS that Macs use):
Darwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64
- Run the same command inside the Docker container, like so:
docker run ubuntu uname -a
You should then get something like this:
Linux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
What happens is that we use the downloaded ubuntu image to run a container that has Ubuntu as the operating system, and we instruct Docker to execute uname -a to print the system info within that container. The output from the command is printed to the terminal.
- Try the same thing with whoami instead of uname -a.
2.3 Running interactively
Docker allows us to execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the flags -it.
- Run a container interactively with
docker run -it ubuntu
Your prompt should now look similar to:
root@1f339e929fa9:/#
You are now using a terminal inside a container running Ubuntu. Here you can do whatever: install, run, remove stuff, and so on. Anything you do will be isolated within the container and never affect your host system.
- Now exit the container by typing exit or pressing CTRL + D.
2.4 Containers inside scripts
Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at a bioinformatic command with Bowtie2, used to build an index from a FASTA reference file:
bowtie2-build <name>.fa <name>

Let’s imagine we want to run the above command using containers. We need to find a container image that has bowtie2 installed, and then prepend the command with docker run <image>. First of all we need to download the genome to index, though.
- Run the following to download and prepare the input for Bowtie2:
curl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
gunzip -c NCTC8325.fa.gz > NCTC8325.fa

- Now try running the following command:
docker run quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build NCTC8325.fa NCTC8325

Docker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2 and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...] syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 is the IMAGE, but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly.
The bowtie2-build part is the COMMAND followed by the ARG (the input FASTA file and the output index). You’ll find, however, that the command fails with an error message including Error: could not open NCTC8325.fa. Why is that? Well, it has to do with a thing called bind mounts.
2.5 Bind mounts
Docker containers are, by default, isolated from the rest of your system. This is a really good thing, as it keeps everything separate and reproducible, but it requires some extra work in order to get your file system to talk to the container. The error message we got was because the container couldn’t find the FASTA file we just downloaded, even though we can see it right there in our directory.
This is where bind mounts come into play: we can explicitly give the container access to any files we need to run the analysis. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.
Docker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.
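As a minimal sketch (the volume name my_volume and the ls command are just examples, not part of this tutorial’s exercises), creating a named volume and mounting it into a container could look like this:

docker volume create my_volume
docker run -v my_volume:/data ubuntu ls /data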
The simplest way to use bind mounts is usually to bind the current directory to a directory inside the container. This can be done with the -v $(pwd):<MOUNT> syntax, where <MOUNT> is the directory inside the container we want the files to be available (the general form of this syntax is -v <LOCAL>:<MOUNT>).
- Run the same command again, but also mount the files into the /analysis directory inside the container:
docker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build /analysis/NCTC8325.fa /analysis/NCTC8325

Note that we not only added the -v flag, but also changed the locations of the input and output files: NCTC8325.fa and NCTC8325 from the previous command become /analysis/NCTC8325.fa and /analysis/NCTC8325, since that’s where the files will be mounted inside the container.
It is important to note that if you are running the above command on a Linux machine like Ubuntu, the output files will have root ownership, which is not ideal. This is because commands inside the container run as root (or as whichever user the image specifies) by default. To avoid this, you can add the following option to the above command:

-u $(id -u ${USER}):$(id -g ${USER})

This will make the output files have the same ownership as the user that is running the command.
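Putting the pieces together, the full command on a Linux machine could then look something like this:

docker run -u $(id -u ${USER}):$(id -g ${USER}) -v $(pwd):/analysis \
quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
bowtie2-build /analysis/NCTC8325.fa /analysis/NCTC8325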
The operation should now successfully run and you’ll have a bunch of Bowtie2-related files in your directory; since we mounted the current working directory into the container to get the input files in, we also get the output files out of the container.
In this section we’ve learned:
- How to use docker pull for downloading remotely stored images
- How to use docker image ls for getting information about the images we have on our system
- How to use docker run for starting a container from an image
- How to use the -it flag for running in interactive mode
- How to use Docker inside scripts
- How to use bind mounts to share data between the container and the host system
3 Building images
In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
Docker provides a convenient way to describe how to go from a base image to the image we want by using a Dockerfile. This is a simple text file containing the instructions for how to generate each layer. Docker images can sometimes be quite large, upwards of several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.
We will be looking at a Dockerfile called Dockerfile that is located in your tutorials/containers directory (where you should hopefully already be). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.
The default name for a Dockerfile is just that, Dockerfile. In cases where you want to have more than one Dockerfile in a single directory you can use the format <name>.Dockerfile instead, as per the official documentation.
3.1 Understanding Dockerfiles
Here are the first few lines of Dockerfile. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found at the Docker website.
FROM condaforge/miniforge3
LABEL authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL description="Minimal image for the NBIS reproducible research course."Here we use the instructions FROM and LABEL. While LABEL is just key/value metadata pairs that can be used for organising your various Docker components, the important one is FROM, which specifies the base image we want to start from. Because we want to use conda to install packages we will start from an image from the conda-forge community that has conda pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Dockerhub. The conda-forge community keeps the Dockerfile in a git repository and you can view the file here. You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages including conda.
While you can use arbitrary key/value pairs for LABEL instructions however you like, there are best practices available that you might want to follow. These follow the format of org.opencontainers.image.<label>, where the namespace (the first part of the format) comes from the Open Container Initiative (OCI), an organisation aimed at creating industry standards for container formats. You can find a list of all the standard labels at the OCI GitHub.
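As an illustration (these exact labels are not part of the course Dockerfile), the metadata above could be expressed with the standard OCI keys like this:

LABEL org.opencontainers.image.authors="John Sundh, john.sundh@scilifelab.se; Erik Fasterius, erik.fasterius@nbis.se"
LABEL org.opencontainers.image.description="Minimal image for the NBIS reproducible research course."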
When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment together with a Jupyter notebook. You could then start from one of the Rocker images for R, a Condaforge image, or a Jupyter image. Or you just start from one of the low-level official images and set up everything from scratch.
Let’s take a look at the next section of Dockerfile.
WORKDIR /course

WORKDIR determines the directory the container should start in. By default it is set to /, i.e. the container root, but it can be useful to set it to something else so you don’t always see system-level files that are irrelevant for your analyses. While we call it course here you can call it whatever you like, e.g. work or analyses.
Next up is:
SHELL ["/bin/bash", "-c"]SHELL sets the default shell to use in the container. The SHELL instruction has to be written in the ["executable", "parameters"] syntax, which is referred to as “JSON form”. Here we set SHELL to the bash shell (the -c flag is used to pass a command to the shell).
The next few lines introduce the important RUN instruction, which is used for executing shell commands:
# Install `curl` for downloading of FASTQ data later in the tutorial
RUN apt-get update && \
apt-get install -y curl && \
apt-get clean
# Configure Conda
RUN conda config --set channel_priority strict && \
conda config --append channels bioconda

The first RUN command installs the curl command, which will be used to download some raw FASTQ data. As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds (more on this later).
We then configure Conda to only use strict mode for building the dependency tree, like we did in the pre-course-setup, as well as add bioconda to the channel list.
While installing things with package managers such as apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s still worth keeping in mind from a reproducibility perspective.
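If you do want to pin versions for apt-get as well, you can specify them explicitly; a sketch (the version string here is made up and will differ between Ubuntu releases) could look like this:

# Pin curl to a specific (hypothetical) package version
RUN apt-get update && \
apt-get install -y curl=7.81.0-1ubuntu1 && \
apt-get clean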
Next up is:
# Start Bash shell by default
CMD /bin/bash

CMD is an interesting instruction: it sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. It can be used for e.g. printing some information on how to use the image or, as here, starting a Bash shell for the user. If you want some other command to be run by default, for example a workflow, you can change CMD to whatever command would be appropriate instead.
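For example, if the image were meant to run a workflow by default rather than drop you into a shell, the last line could be replaced with something like this (the snakemake command is just a hypothetical example):

# Run a workflow by default instead of starting a Bash shell
CMD snakemake --cores 1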
3.2 Building from Dockerfiles
Now that we understand how a Dockerfile works, it’s time to actually build an image from one.
- Build an image using the following command:
docker build -f Dockerfile -t my_docker_image .

This should result in something similar to this:
[+] Building 52.0s (9/9) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 585B 0.0s
=> [internal] load metadata for docker.io/condaforge/miniforge3:latest 1.8s
=> [auth] condaforge/miniforge3:pull token for registry-1.docker.io 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [1/4] FROM docker.io/condaforge/miniforge3:latest@sha256:b176780143fe 35.2s
=> => resolve docker.io/condaforge/miniforge3:latest@sha256:b176780143fe0 0.0s
=> => sha256:97dd3f0ce510a30a2868ff104e9ff286ffc0ef0128 28.86MB / 28.86MB 9.0s
=> => sha256:cf80b7965770fb1031c5fe4ce4ca539d1930312 111.15MB / 111.61MB 50.1s
=> => extracting sha256:97dd3f0ce510a30a2868ff104e9ff286ffc0ef01284aebe38 0.4s
=> => extracting sha256:cf80b7965770fb1031c5fe4ce4ca539d1930312626d3ead3e 1.4s
=> [2/4] WORKDIR /course 0.4s
=> [3/4] RUN apt-get update && apt-get install -y curl && apt-ge 12.1s
=> [4/4] RUN conda config --set channel_priority strict && conda conf 0.4s
=> exporting to image 1.9s
=> => exporting layers 1.6s
=> => exporting manifest sha256:57132027a7303f6024ff7a2da78155fcc45563a76 0.0s
=> => exporting config sha256:7d25bab9c520dbc56b134d825279dd88c718bfeb89e 0.0s
=> => exporting attestation manifest sha256:a4d504627e50a93a33dd76838866d 0.0s
=> => exporting manifest list sha256:49504f68de7c0265181b7a322272272e43bb 0.0s
=> => naming to docker.io/library/my_docker_image:latest 0.0s
=> => unpacking to docker.io/library/my_docker_image:latest 0.3s
Exactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This had no real impact in this case, but matters if you want to copy files from the build context into the image.
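For example, if the Dockerfile had contained a COPY instruction (ours does not), the file would have been looked up relative to that build context path:

# Copy a file from the build context (the "." we passed to docker build) into the image
COPY environment.yml .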
- Validate with docker image ls that you can see your new image.
3.3 Creating your own Dockerfile
Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. The Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis.
Remember from the lecture that the best way to use containers is, generally, as an advanced environment manager together with a Git repository that tracks code, documentation and the environment specification. What we need to do is thus the following:
- Create a file called conda.Dockerfile
- Start the image from the my_docker_image we previously built
- Install the package fastqc which is required for the analysis
- Run the run_qc.sh script using the new image and the appropriate bind mounts
We’ll now go through these steps in more detail. Try to add the corresponding code to conda.Dockerfile on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.
Set image starting point
To set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.
Install packages
Use the RUN instruction to install the package fastqc=0.11.9 with conda. Here there are several options available. For instance we could add an environment file e.g. environment.yml from the Conda tutorial and use conda env update --name base to update the base environment from that file (the rule to keep the base Conda environment free of other packages does not apply when you’re building it inside a Docker image). Or we could install the package directly with conda install --name base. We’ll try the latter option here, so add a line that will install the fastqc package, and also clean up packages and cache after installation. Use the -y flag to conda install to avoid the prompt that expects an interaction from the user.
Since we used the excellent Miniforge as a base image for my_docker_image the base environment is always available and the conda-forge channel is already added as a default. As you saw above the Dockerfile contains a line where the bioconda channel is added to the Conda configuration so all we need to do is install the fastqc package.
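If you would rather go the environment-file route mentioned above, a minimal sketch (assuming an environment.yml from the Conda tutorial that lists fastqc=0.11.9 has been placed in the build context) could look like this:

FROM my_docker_image
COPY environment.yml .
RUN conda env update --name base --file environment.yml && \
conda clean -ay

The solution below instead installs the package directly with conda install: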
FROM my_docker_image
RUN conda install -y -n base fastqc=0.11.9 && \
conda clean -ay

- Build the image and tag it using:

docker build -t my_docker_conda -f conda.Dockerfile .

- Verify that the image was built using docker image ls.
- Execute the run_qc.sh script inside the container with the appropriate bind mounts using the following command:

docker run -v $(pwd):/analysis my_docker_conda /analysis/run_qc.sh

You should now see the script run to completion, but you won’t find the output files in your working directory. The script creates the data and results directories and puts its files there, but since we’re running from within the container in the root (/) directory, the directories created become /data and /results, which are not mounted (since they’re not inside the /analysis directory). What we can do is to add the -w (--workdir) flag, which tells Docker where execution should happen, i.e. what should be the working directory (you could instead add WORKDIR /analysis to the Dockerfile for the same effect).
- Run the same command again, but add -w /analysis to the command (the full command is shown below).
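In full, the command should then look something like this:

docker run -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh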
While the general strategy we’ve outlined here is sound, it can feel a bit cumbersome to keep track of all the various directories and bind mounts we’re using, but there is an even better strategy: using workflow managers! Both Snakemake and Nextflow can handle all of this for you, and you can focus purely on the command you want to execute in what container. More information about this can be found in their respective “extra materials” section.
Another way to build complicated images is by using Seqera Containers, which is outlined in the extra materials section. This is a very convenient way to get around local builds and hosting the resulting images yourself, and has become widely used in the bioinformatics community.
In this section we’ve learned:
- How the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile
- How to use docker build to construct and tag an image from a Dockerfile
- How to create your own Dockerfile
- How to use -w to set the working directory at runtime
4 Managing containers
When you start a container with docker run it is given a unique id that you can use for interacting with the container.
- Let’s try to run a container from the same image we previously used by running the same command again:
docker run -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh

If everything worked run_qc.sh is executed and will first download and then analyse the three samples.
- Once it’s finished you can list all containers, including those that have exited, using docker container ls --all.
This should show information about the container that we just ran, something similar to the following:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b6f7790462c4 my_docker_conda "tini -- /bin/bash -…" 3 minutes ago Up 24 seconds sad_maxwell
This shows useful information such as the unique container id, the name of the image used to start the container and the unique name given to the container.
By default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but it can also quickly consume a lot of disk space. It’s therefore usually a good idea to run with --rm, which will remove the container once it has exited.
With the way we’ve used docker run so far, your local terminal is attached to the container while it’s running. This enables you to see the output of run_qc.sh, but also disables you from doing anything else in the meantime. We can start a container in detached mode with the -d flag. This will print the container ID that you can use to interact with the running container. Alternatively, you can use the unique name assigned to the container. You can assign this name yourself using the flag --name <any name you want>.
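As a sketch combining these flags (the name my_run is arbitrary), a detached container that is removed once it exits could be started like this:

docker run -d --rm --name my_run -v $(pwd):/analysis -w /analysis \
my_docker_conda /analysis/run_qc.sh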
If we want to enter a running container, there are two commands we can use: docker attach and docker exec.
- docker attach will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a container in detached mode and changed your mind.
- docker exec can be used to execute any command in a running container. It’s typically used to peek in at what is happening by opening up a new shell.
Let’s try out the docker exec command.
- Start a new container in detached mode and name it my_container using:

docker run -d --name my_container -v $(pwd):/analysis -w /analysis my_docker_conda /analysis/run_qc.sh

- Start a new interactive Bash shell inside the newly created container using:

docker exec -it my_container /bin/bash

- Use e.g. ls to see how the script generated files in the data and results directories.
You will be thrown out when the container exits, so you have to be quick with performing these three steps, one after the other.
In this section we’ve learned:
- How the docker run flags -d and --rm work
- How to use docker container ls for displaying information about the containers
- How to use docker attach and docker exec to interact with running containers
6 Extra material
There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here are some tips and tricks that you can take inspiration from!
If you want to read more about containers in general you can check out these resources:
- A “Get started with Docker” guide at the Docker website.
- An early paper on the subject of using Docker for reproducible research.
6.1 Seqera containers
While building from your own Dockerfile gives you full control over exactly how you build your image, there is also another alternative for building images: Seqera Containers. This is a free service provided by Seqera, the company behind Nextflow (among other things), and it allows you to easily build images from any package available in PyPI, Bioconda or any Conda channel. This means that most (but not all) bioinformatic software packages can be bundled into a Seqera container, all without the need for a Dockerfile of your own. In fact, all you need to do is to search for the packages in the web interface and Seqera will both build and host your final image for you!
- Head over to https://seqera.io/containers/ and type fastqc in the search bar.
You should find a bunch of packages with fastqc in their names, but the one we’re looking for is the one from Bioconda, which you’ll find listed as bioconda::fastqc.
- Select the 0.11.9 version (the same as we previously used for our other Docker image), which you can do on the right-hand side of the list.
Once you’ve done that the text bioconda::fastqc=0.11.9 should appear in the search bar. We’ll also need curl, since that’s used by the run_qc.sh script.
- Add curl to the image in the same way.
Once you’ve added curl you can press the blue “Get Container” button, and the image will start building; you can also change from a Docker to an Apptainer image, or change the platform (AMD64 or ARM64), but going with the default of AMD64 Docker is usually fine. When you press the button you should get a quick response of “Fetching container” and a link to the image itself; you can also press “View build details” if you want to explore the details about the build and image. Once the build is ready you’ll instead see “Container is ready”. The image can now be used in the same way as any other image.
Since we previously ran the run_qc.sh script with our image, let’s do the same here. Remember that you’ll need to supply -v $(pwd):/tmp so that run_qc.sh will be available as /tmp/run_qc.sh inside the container, as well as -w /tmp to set the workdir:
- Execute docker run -v $(pwd):/tmp -w /tmp <CONTAINER_LINK> /tmp/run_qc.sh from the command line.
This will automatically download the Seqera image and start the script as soon as it’s ready.
The reason we used -v $(pwd):/tmp rather than -v $(pwd):/analysis as before is that Seqera containers default to the /tmp directory as their working directory when running them. There is nothing special about this name, it’s just the default location.
While this particular image was relatively simple, it’s quite common to create more complicated R and Python environments, which require many more packages and dependencies. If all of them are available in e.g. Bioconda (the most common place for bioinformatic software) then you can use a Seqera container instead of making your own Dockerfile. Making your own Dockerfile is, however, still the better alternative if you want complete control of the image (and it’s sometimes the only alternative if your package isn’t available in a Conda channel or PyPI and you have to install it manually).
You can also check out the underlying Wave CLI if you want even more options than the ones available in the Seqera Containers web version.
6.2 Apptainer
Apptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with a focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.
The open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps their commercial branch of the project under the Singularity name, and offer a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is little difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.
While it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs).
The reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even with this you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on an ARM64 Mac. You also need root (admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote flag, which runs the build on Apptainer’s own servers. This doesn’t work in practice a lot of the time, though, since most scientists work in private Git repositories so that their research and code is not available to anybody, and the --remote flag requires that any files needed for the build (e.g. the environment.yml file) are publicly available.
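For reference, a remote build would be invoked roughly like this (my_image.def is a hypothetical Apptainer definition file, and you need to be logged in to a remote build service):

apptainer build --remote my_image.sif my_image.def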
There are very good reasons to use Apptainer, however, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for the cases when you already built your Docker image or if you’re using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images (unless you use Seqera Containers). So, we now have another problem for building our own images:
- Only Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.
- You can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.
Seems like a “catch 22” problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but it works well in practice. What we are basically advocating is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users or those of you who are fine with working in VMs and managing any issues that arise from doing that.
Apptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.
6.2.1 Apptainer-in-Docker
By creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image set up for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:
docker run \
--rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd):/work \
kaczmarj/apptainer \
build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>

You already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens to by default, meaning that it is needed for us to be able to specify the location of the Docker container we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. my_docker_image.
- Replace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command (an example is shown below) - remember that you can use docker image ls to check what images you have available.
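With the my_docker_image image we built earlier, the command would look something like this:

docker run \
--rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v $(pwd):/work \
kaczmarj/apptainer \
build my_docker_image.sif docker-daemon://my_docker_image:latest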
In the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out this script.
6.2.2 Running Apptainer
The following small exercise assumes that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster.
- Convert the Docker image for the lolcow software directly from the GitHub Container Registry using the following command:

apptainer pull lolcow.sif docker://ghcr.io/apptainer/lolcow

This should result in a SIF file called lolcow.sif.
- Run the lolcow software with Apptainer using apptainer run lolcow.sif.
You should now see a small ASCII art of a cow and today’s date.
6.3 Building for multiple platforms
With the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. This is sometimes okay since the Rosetta2 software can emulate the old AMD64 architecture on newer ARM64 computers, but results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:
- Start by checking the available builders using docker buildx ls.
You should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:
- Run the following command:

docker buildx create --name mybuilder --driver docker-container --bootstrap

- Switch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.
All that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .

- Execute the above command with your username and your image.
That’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture whether they are on AMD64 or ARM64!
buildx
You can run docker buildx install to make docker build an alias for docker buildx, allowing you to run multi-platform builds using docker build. Use docker buildx uninstall to remove this alias.
6.4 Multi-stage builds
Some build processes can be quite complicated, requiring a more diverse set of software packages in order to successfully build a Docker image. Not all of these packages are always required for running the final image, however, which results in a somewhat bloated image with a larger size footprint than what is strictly required. This is where multi-stage builds come in, which allow for optimisation where only the files and packages actually required for execution are included in the final image.
While this is mostly interesting for software developers and people working with binaries and other non-scripted code, it can be of interest in the context of bioinformatics when it comes to optimising e.g. Conda environments. Conda actually comes in two parts: the Conda installation itself (along with all the files required to build Conda environments) and the Conda environments that we’ve built. The former is actually not needed to be able to use the latter, but the technical details behind this are non-trivial and out of the scope of this course. Regardless, there’s a package that can help us with this: conda-pack.
Let’s take a very simple Dockerfile as an example of what we might have created before:
FROM condaforge/miniforge3:24.7.1-0
RUN conda install -y -n base python=3.10
CMD /bin/bash

Here we install Python in the Conda base environment, similar to how we have done it previously in the course.
- Copy the above code into e.g. base.Dockerfile and build it using docker build -f base.Dockerfile -t my_docker_base .
Let’s look at the Dockerfile for a multi-stage image that includes conda-pack:
#
# First stage: Conda environment
#
FROM condaforge/miniforge3:24.7.1-0 AS build
# Install conda-pack into the base environment
RUN conda install -y -n base conda-pack
# Create a new environment that just contains Python
RUN conda create -y -n env python=3.10
# Package the new environment into /env
RUN conda-pack -n env -o /tmp/env.tar && \
mkdir /env && \
tar -xf /tmp/env.tar -C /env && \
rm /tmp/env.tar && \
/env/bin/conda-unpack
#
# Second stage: final image
#
FROM ubuntu:20.04
# Copy Conda environment from previous stage
COPY --from=build /env /env
# Activate the environment when running the container
RUN echo "source /env/bin/activate" >> ~/.bashrc
CMD /bin/bash

The first thing to notice here is the first FROM statement, which now also includes AS build; this is specific to multi-stage builds. What we are doing is giving the stage a name, build in this case, so that we may refer back to it later.
The first RUN command installs conda-pack in the base environment, not the environment with Python that we’re actually interested in. The reason for this is that we only need conda-pack for making the Python environment independent of the Conda installation, but we won’t need it to actually activate the environment once this is done. The second and third RUN commands create the new environment (named env here) and package that environment into a separate directory, respectively.
The second FROM statement starts the second (and final) stage of the build, and thus does not need a name (hence no AS <name> statement). Notice that there’s a COPY statement just after: this is the directive that copies files from the build stage into the current stage. We only copy the Python environment itself, not the base environment nor the Conda installation.
- Copy the code above into a file called multi.Dockerfile and build it with docker build -f multi.Dockerfile -t my_docker_multi .
- List your docker images with docker image ls and compare the newly created my_docker_multi with my_docker_base.
Hopefully you should see that the sizes of the two images are different: the multi image should be about 500 MB smaller than the base image. This tells you something about how large the Conda installation is all by itself.
We’re still using Ubuntu as the base image for the final stage of our image, but we could go even further in our attempt to minimise the image size by using an even smaller base image, such as alpine. Doing this means that you lose out on a lot of basic Unix utilities, though, such as Bash.
There are more complex environment specifications than this, though, especially those using multiple R and/or Python packages. The Conda environments in those cases are much more complicated than the previous example, which means that the optimisation is proportionally smaller. Going from ~850 to ~250 MB is quite a big optimisation, while going from e.g. 2 or 3 GB to 1.5 or 2.5 GB is not quite as big. Regardless, that (albeit smaller) difference can add up over time when it comes to downloading the image, especially if it’s an image used by many people in a project, or if you’re actually paying for hosting of the image.