Controlling your environment with Conda

1 Introduction

Conda is a package and environment manager. As a package manager it enables you to install a wide range of software and tools using one simple command: conda install. As an environment manager it allows you to create and manage multiple different environments, each with their own set of packages.

What are the benefits of using an environment manager? Some examples include the ability to easily run different versions of the same package, have different cross-package dependencies that are otherwise incompatible with each other and, last but not least, easy installation of all the software needed for an analysis.

Environments are of particular relevance when making bioinformatics projects reproducible. Full reproducibility requires the ability to recreate the system that was originally used to generate the results. This can, to a large extent, be accomplished by using Conda to make a project environment with specific versions of the packages that are needed in the project. You can read more about Conda at the Conda website.

A Conda package is a compressed tarball (system-level libraries, Python or other modules, executable programs or other components). Conda keeps track of the dependencies between packages and platforms - this means that when installing a given package, all necessary dependencies will also be installed.

Conda packages are typically hosted and downloaded from remote so-called channels. Some widely used channels for general-purpose and bioinformatics packages are conda-forge and Bioconda, respectively. Both of these are community-driven projects, so if you’re missing some package you can contribute to the channel by adding the package to it. When installing a Conda package you specify the package name, version (optional) and channel to download from.

A Conda environment is essentially a directory that is added to your PATH and that contains a specific collection of packages that you have installed. Packages are symlinked between environments to avoid unnecessary duplication.

Different Conda flavours

You may come across several flavours of Conda. There’s Miniconda, which is the installer for Conda. The second is Anaconda, which is a distribution of not only Conda, but also over 150 scientific Python packages curated by the company by the same name (Anaconda). It’s generally better to stick with the Miniconda installation rather than installing 3 GB worth of packages you may not even use. Then, lastly, there’s the Miniforge flavour that we’re using here, which is a community-driven version of Conda that’s highly popular within the scientific community.

The difference between Miniconda and Miniforge is that the former points to the default channel by default (which requires an Anaconda license for commercial purposes), while the latter points to the community-maintained conda-forge channel by default. While Conda is created and owned by Anaconda the company, Conda itself is open source - it’s the default channel that is proprietary. The conda-forge and bioconda channels (two of the largest channels outside of default) are community-driven. Confusing? Yes. If you want this information more in-depth you can read this blog post by Anaconda.

2 The basics

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if, you haven’t done so already. Then open up a terminal and go to workshop-reproducible-research/tutorials/conda. Instructions below assume that you are standing in workshop-reproducible-research/tutorials/conda/ unless otherwise specified (e.g. if it says “create a file”, it means save it in workshop-reproducible-research/tutorials/conda/).

Let’s assume that you are just about to start a new exciting research project called Project A.

2.1 Creating Conda environments

Let’s make our first Conda environment:

conda create -n project_a fastqc

This will create an environment called project_a, containing FastQC.

You should see something like this printed to the terminal:

Channels:
 - conda-forge
 - bioconda
Platform: osx-arm64 # <- Your platform may differ

This is Conda telling you which channels it is looking in for the package you requested. If you followed the pre-course instructions and added the conda-forge and bioconda channels to your conda configuration, you should see them listed here. If you had not configured Conda to use these channels, you would have to specify them when installing packages, e.g. conda install -c bioconda fastqc.

Further down you will see something like this:

The following NEW packages will be INSTALLED:

  fastqc             bioconda/noarch::fastqc-0.12.1-hdfd78af_0
  <more packages>

This shows that the fastqc package will be installed from the bioconda channel (the noarch part shows that the package is not specific to any computer architecture). In the example above we see that version 0.12.1 will be installed. The hdfd78af_0 part is a unique build string that is used to differentiate between different builds of the same package version.

Once it is done, you can activate the environment:

conda activate project_a

By default, Conda will add information to your prompt telling you which environment that is active.

To see all your environments you can run:

conda info --envs

The active environment will be marked with an asterisk.

To see the installed packages and their versions in the active environment, run:

conda list

To save the installed packages to a file, run:

conda env export --from-history > environment.yml

Where --from-history only reports the packages requested to be installed and not additional dependencies. A caveat is that if no version was originally specified, then it is not included in the export file either.

Now, deactivate the environment by running conda deactivate.
List all environments again. Which environment is now marked as active?
Try to run FastQC:

fastqc --version

Did it work? Activate your project_a environment and run the fastqc --version command again. Does it work now?

Hopefully the FastQC software was not found in your base environment (unless you had installed it previously), but worked once your environment was activated.

2.2 Adding more packages

Now, let’s add another package (MultiQC) to our environment using conda install. Make sure that project_a is the active environment first.

conda install multiqc

If we don’t specify the package version, the latest available version will be installed. What version of MultiQC got installed?
Run the following to see what versions are available:

conda search multiqc

Now try to install a different version of MultiQC, e.g.:

conda install multiqc=1.13

Read the information that Conda displays in the terminal. It probably asks if you want to downgrade the initial MultiQC installation to the one specified here (1.13 in the example). You can only have one version of a given package in a given environment.

Let’s assume that you will have sequencing data in your Project A, and want to use the latest BBMap software to align your reads.

Find out what versions of BBMap are available in the Bioconda channel using conda search -c bioconda bbmap.
Now install the latest available version of BBMap in your project_a environment.

Let’s further assume that you have an old project (called Project Old) where you know you used BBMap 37.10. You just got back reviewer comments and they want you to include some alignment statistics. Unfortunately, you haven’t saved that information so you will have to rerun the alignment. Now, it is essential that you use the same version of BBMap that your results are based on, otherwise the alignment statistics will be misleading. Using Conda environments this becomes simple. You can just have a separate environment for your old project where you have an old version of BBMap without interfering with your new Project A where you want the latest version.

Make a new environment for your old project:

conda create -n project_old -c bioconda bbmap=37.10

List your environments (do you remember the command?).
Activate project_old and check the BBMap version (bbmap.sh --version).
Activate project_a again and check the BBMap version.

2.3 Removing packages

Now let’s try to remove an installed package from the active environment:

conda remove multiqc

Run conda deactivate to exit your active environment.
Now, let’s remove an environment:

conda env remove -n project_old

After making a few different environments and installing a bunch of packages, Conda can take up some disk space. You can remove unnecessary files with the command:

conda clean -a

This will remove package tar-balls that are left from package installations, unused packages (i.e. those not present in any environments), and cached data.

You can also specify the path to which the tar-balls are downloaded by setting the $CONDA_PKGS_DIRS environment variable. This can be useful especially on compute clusters where disk space in your home directory is limited. Compute clusters typically have a ‘scratch’ folder for storing temporary data which has a larger quota so setting $CONDA_PKGS_DIRS to point to this folder will save space in your home directory. For example, on the Dardel compute cluster the scratch folder is defined by the environment variable $PDC_TMP so for this cluster you can add the following line to your .bashrc file:

export CONDA_PKGS_DIRS=$PDC_TMP/conda_pkgs

From now on, Conda will download and store the package tar-balls in the $PDC_TMP/conda_pkgs folder.

You can also achieve the same effect by setting the pkgs_dirs config parameter with conda config (see configuration below).

Quick recap

In this section we’ve learned:

How to use conda install for installing packages on the fly.
How to create, activate and change between environments.
How to remove packages or environments and clean up.

3 Working with environments

We have up until now specified which Conda packages to install directly on the command line using the conda create and conda install commands. For working in projects this is not the recommended way. Instead, for increased control and reproducibility, it is better to use an environment file (in YAML format) that specifies the packages, versions and channels needed to create the environment for a project.

Throughout these tutorials we will use a case study where we analyse an RNA-seq experiment with the multi-resistant bacteria MRSA (please see the introduction). You will now start to make a Conda YAML file for this MRSA project. The file will contain a list of the software and versions needed to execute the analysis code.

In this Conda tutorial, all code for the analysis is available in the script code/run_qc.sh. This code will download the raw FASTQ-files and subsequently run quality control on these using the FastQC software.

3.1 Working with environments

We will start by making a Conda YAML-file that contains the required packages to perform these two steps. Later in the course, you will update the Conda YAML-file with more packages, as the analysis workflow is expanded.

Let’s get going! Make a YAML file called environment.yml looking like this, and save it in the current directory (which should be workshop-reproducible-research/tutorials/conda):

channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.12.1

Now, make a new Conda environment from the YAML file (note that here the command is conda env create as opposed to conda create that we used before):

conda env create -n project_mrsa -f environment.yml

Tip

You can also specify exactly which channel a package should come from inside the environment file, using the channel::package=version syntax.

Tip

Instead of the -n flag you can use the -p flag to set the full path to where the Conda environment should be installed. In that way you can contain the Conda environment inside the project directory, which does make sense from a reproducibility perspective, and makes it easier to keep track of what environment belongs to what project. If you don’t specify -p the environment will be installed in the envs/ directory inside your Conda installation path.

Activate the environment!
Now we can run the code for the MRSA project found in code/run_qc.sh by running bash code/run_qc.sh. Do this! (You can also open the file and run the commands manually if you prefer.)

This should download the project FASTQ files and run FastQC on them (as mentioned above).

Check your directory contents (ls -Rlh, or in your file browser). It should now have the following structure:

   conda/
    |
    |- code/
    |   |- run_qc.sh
    |
    |- data/
    |   |- SRR935090.fastq.gz
    |   |- SRR935091.fastq.gz
    |   |- SRR935092.fastq.gz
    |
    |- results/
    |   |- fastqc/
    |       |- SRR935090_fastqc.html
    |       |- SRR935090_fastqc.zip
    |       |- SRR935091_fastqc.html
    |       |- SRR935091_fastqc.zip
    |       |- SRR935092_fastqc.html
    |       |- SRR935092_fastqc.zip
    |
    |- environment.yml

Note that all that was needed to carry out the analysis and generate these files and results was environment.yml (that we used to create a Conda environment with the required packages) and the analysis code in code/run_qc.sh.

3.2 Keeping track of dependencies

Projects can often be quite large and require lots of dependencies; it can feel daunting to try to capture all of that in a single Conda environment, especially when you consider potential incompatibilities that may arise. It can therefore be a good idea to start new projects with an environment file with each package you know that you will need to use, but without specifying exact versions (except for those packages where you know you need a specific version). This will install the latest compatible versions of all the specified software, making the start-up and installation part of new projects easier. You can then add the versions that were installed to your environment file afterwards, ensuring future reproducibility.

There is one command that can make this easier: conda env export. This allows you to export a list of the packages you’ve already installed, including their specific versions, meaning you can easily add them after the fact to your environment file. If you use the --no-builds flag, you’ll get a list of the packages minus their OS-specific build specifications, which is more useful for making the environment portable across systems. This way, you can start with an environment file with just the packages you need (without version), which will install the most up-to-date version possible, and then add the resulting version back in to the environment file using the export command!

Quick recap

In this section we’ve learned:

How to define our Conda environment using a YAML-file.
How to use conda env create to make a new environment from a YAML-file.
How to use conda env export to get a list of installed packages.
How to work in a project-like setting.

4 Extra material

The following extra material contains some more advanced things you can do with Conda and the command line in general, which is not part of the main course materials. All the essential skills of are covered by the previous section: the material here should be considered tips and tricks from people who use Conda as part of their daily work. You thus don’t need to use these things unless you want to, and you can even skip this part of the lesson if you like!

4.1 Configuration

The behaviour of your Conda installation can be changed using an optional configuration file .condarc. On a fresh Conda install no such file is included but it’s created in your home directory as ~/.condarc the first time you run conda config.

You can edit the .condarc file either using a text editor or by way of the conda config command. To list all config parameters and their settings run:

conda config --show

Similar to Conda environment files, the configuration file is in YAML syntax. This means that the config file is structured in the form of key:value pairs where the key is the name of the config parameter (e.g. auto_update_conda) and the value is the parameter setting (e.g. True).

Adding the name of a config parameter to conda config --show will show only that parameter, e.g. conda config --show channels.

You can change parameters with the --set, --add, --append and --remove flags to conda config.

If you for example want to enable the ‘Always yes’ behaviour which makes Conda automatically choose the yes option, such as when installing, you can run:

conda config --set always_yes True

Above we mentioned that you can set the $CONDA_PKGS_DIRS environment variable to specify where Conda should store downloaded packages. This can also be done using the conda config command. Taking the example from above for the Dardel compute cluster, you can set the pkgs_dirs parameter with:

conda config --set pkgs_dirs $PDC_TMP/conda_pkgs

To see details about a config parameter you can run conda config --describe parameter. Try running it on the channels parameter:

conda config --describe channels

In the beginning of this tutorial we added Conda channels to the .condarc file using conda config --add channels. To remove one of the channels from the configuration file you can run:

conda config --remove channels conda-forge

Check your .condarc file to see the change. To add the conda-forge channel back to the top of the channels simply run:

conda config --add channels conda-forge

To completely remove a parameter and all its values run:

conda config --remove-key parameter

For a list of Conda configuration parameters see the Conda configuration page.

Multiple config files

You can have multiple .condarc in different locations, and when using the Miniforge installer (like in the pre-course setup) you automatically get an additional one inside your Miniforge installation directory; you can use conda config --show-sources to see exactly where it is. While the commands we ran only changes the one in your home directory, Conda will still pick up the configuration coming from the Miniforge installation as well (as well as any other config files you may have), resulting in a union of your multiple configs. You can remove the one added by Miniforge by default it you like.

4.2 Managing Python versions

With Conda environments it’s possible to keep several different versions of Python on your computer at the same time, and switching between these versions is very easy. However, a single Conda environment can only contain one version of Python.

4.2.1 Your current Python installation

The base environment has its own version of Python installed. When you open a terminal (after having installed Conda on your system) this base environment is activated by default (as evidenced by (base) prepended to your prompt). You can check what Python version is installed in this environment by running python --version. To see the exact path to the Python executable type which python.

In addition to this your computer may already have Python installed in a separate (system-wide) location outside of the Conda installation. To see if that is the case type conda deactivate until your prompt is not prepended with a Conda environment name. Then type which python. If a path was printed to the terminal (e.g. /usr/bin/python) that means some Python version is already installed in that location. Check what version it is by typing python --version.

Now activate the base environment again by typing conda activate (or the equivalent conda activate base) then check the Python installation path and version using which and python --version as above. See the difference? When you activate an environment your $PATH variable is updated so that when you call python (or any other program) the system first searches the directory of the currently active environment.

4.2.2 Different Python versions

When you create a new Conda environment you can choose to install a specific version of Python in that environment as well. As an example, create an environment containing Python version 3.5:

Note

Python versions older than v3.8.5 are not available for Macs with the M-series chips, so if you are using one of those you will need to add --subdir osx-64 to the command, e.g.:

conda create --subdir osx-64 -n py35 python=3.5

conda create -n py35 python=3.5

Here we name the environment py35 but you can choose whatever name you want.

To activate the environment run:

conda activate py35

You now have a completely separate environment with its own Python version.

Let’s say you instead want an environment with Python version 2.7 installed. You may for instance want to run scripts or packages that were written for Python 2.x and are thus incompatible with Python 3.x. Simply create the new Conda environment with:

conda create -n py27 python=2.7

Activate this environment with:

conda activate py27

Now, switching between Python versions is as easy as typing conda activate py35 / conda activate py27.

Note

If you create an environment where none of the packages require Python, and you don’t explicitly install the python package then that new environment will use the Python version installed in your base environment.

4.3 Decorating your prompt

By default, the name of the currently activated environment is added to your command line prompt. This is a good thing, as it makes it easier to keep track of what environment and packages you have access to. The way this is done in the default implementation becomes an issue when using absolute paths for environments (specifying conda env create -p path/to/environment, though, as the entire path will be added to the prompt. This can take up a lot of unnecessary space on your screen, but can be solved in a number of ways.

The most straightforward way to solve this is to change the Conda configuration file, specifically the settings of the env_prompt configuration value which determines how Conda modifies your command line prompt. For more information about this setting you can run conda config --describe env_prompt and to see your current setting you can run conda config --show env_prompt.

By default env_prompt is set to ({default_env}) which modifies your prompt with the active environment name if it was installed using the -n flag or if the environment folder has a parent folder named envs/. Otherwise the full environment path (i.e. the ‘prefix’) is displayed.

If you instead set env_prompt to ({name}) Conda will modify your prompt with the folder name of the active environment. You can change the setting by running conda config --set env_prompt '({name}) '

If you wish to keep the ({default_env}) behaviour, or just don’t want to change your Conda config, an alternative is to keep Conda environment folders within a parent folder called envs/. This will make Conda only add the folder name of the Conda environment to your prompt when you activate it.

As an example, say you have a project called project_a with the project path ~/myprojects/project_a. You could then install the environment for project_a into a folder ~/myprojects/project_a/envs/project_a_environment. Activating the environment by pointing Conda to it (e.g. conda activate ~/myprojects/project_a/envs/project_a_environment) will only cause your prompt to be modified with project_a_environment.

4.4 Bash aliases for conda

Some programmers like to have aliases (i.e. shortcuts) for common commands. Two aliases that might be useful for you are alias coac='conda activate' and alias code='conda deactivate'. Don’t forget to add them to your ~/.bash_profile if you want to use them!

4.5 Rolling back to an earlier version of the environment

The history of the changes to an environment are automatically tracked. You can see revisions to an environment by using:

conda list --revisions

Which shows each revision (numbered) and what’s installed.

You can revert back to particular revision using:

conda install --revision 5

4.6 Mamba, the drop-in Conda replacement

There is another piece of software that is built on top of Conda as a drop-in replacement for it: Mamba. The reason for Mamba’s existence is that it used to have a better solver algorithm for the dependency tree than Conda did. These days, however, this algorithm is included in Conda as the default. There is still some minor reasons you might want to use Mamba, however, the first of which being that Mamba re-implements Conda in C++, which runs slightly faster than the Python-based Conda. This only yields a minor speed increase compared to the dependency-tree algorithm, though, so don’t expect major differences in execution time between Conda and Mamba. Another reason is that Mamba colours its output, which is nice if you care about that sort of thing. If you installed Conda as described in the pre-course material you’ll, conveniently, already have installed Mamba as well!

4.7 Pixi, for a faster, project-centric approach

Pixi is another package management tool that builds on the Conda ecosystem but which has a more organised, project-centric approach. Pixi does not have a base environment either, which is a very nice feature, as you can completely focus on the environments that your projects needs. The project-centric approach also makes it easy to have multiple environments together. Importantly, Pixi is also significantly faster than both Conda and Mamba. Indeed, several of the teachers here at the course have moved away from using Conda/Mamba and now almost exclusively use Pixi instead.

Tip

If Pixi is so good, why is it in the “extra materials” section rather than as a replacement for Conda? Firstly, Conda is still widely adopted throughout the field, not only as a stand-alone tool, but it’s also widely integrated into other tools such as IDEs, workflow managers, etc.. Secondly, Pixi still uses the Conda ecosystem with channels, packages, etc., so learning Conda is still time well spent, as a lot of it transfers directly to Pixi.

Installing Pixi is simple. Just run:

curl -fsSL https://pixi.sh/install.sh | bash

Then restart your shell and make sure that Pixi is installed by running pixi --version.

Each Pixi project is tied to a specific folder that contains a pixi.toml manifest file describing the project; the TOML format is similar to the YAML format we’ve already used in Conda, but with some structural differences. An example of a minimal pixi.toml file could look like this:

pixi.toml

[project]
name = "project_mrsa"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64", "osx-64"]

Tip

If you are using VSCode you can install an extension to get syntax highlighting for TOML files. Search for “TOML Support” in the extensions marketplace.

The pixi.toml file is built up of several ‘tables’ which are collections of key/value pairs. In the example above we have the [project] table which defines general information about the project, for example the project name, the Conda channels to use and which platforms the project should be compatible with. These are the minimum requirements for a pixi.toml file.

But you can also add a lot more information to this table, such as authors, description, version, license, homepage etc.

The easiest way to start a new project with Pixi is to run pixi init in the project folder. This will create a pixi.toml file with some default values that you can then edit to fit your project. Try this out in the tutorials/conda/ folder!

Make sure you are in the tutorials/conda/ folder and run:

pixi init

This will create a pixi.toml file in the folder. Open it up and take a look.

You should see something similar to the example above, but likely Pixi also added the authors key (where the value is the name and email address taken from your global git configuration), a description and a version. You can edit these values as you see fit.

Let’s add the bioconda channel to the channels list. Edit the channels key/value pair in the pixi.toml file so it looks like this:

pixi.toml

channels = ["conda-forge", "bioconda"]

At the end of the file there should also be two empty placeholder tables: [tasks] and [dependencies]. The tasks table allows you to add custom shell commands to your project. You can leave the tasks table empty as we won’t go through it in this tutorial, but you can read more about this in the Pixi documentation.

Let’s instead focus on the [dependencies] table. This is where you specify the software dependencies for your project, similar to how you add package names and versions in a Conda environment file. These dependencies can be specified in several ways, the most simple being e.g.:

[dependencies]
python = "*"

This will install the most recent version (the * character is a wildcard which translates into ‘any version’) of Python available in the channels specified in the channels key. You can also specify an explicit version number, e.g. python = "3.8" or combine this with an operator, e.g. python = ">=3.8" to specify that you want Python version 3.8 or later.

Let’s add the fastqc package to the [dependencies] table. Edit this section in the pixi.toml file so it looks like this:

pixi.toml

[dependencies]
fastqc = "*"

Now let’s use the pixi.toml file to create the environment for the project. Run:

pixi install

Pixi will download and install the packages specified under the [dependencies] table (for now just fastqc) and then report:

✔ The default environment has been installed.

You should now see a new file pixi.lock inside your current folder. This file was generated by Pixi during the install step and contains information about the environment that was created and the packages it contains. Importantly, the lock file should be treated as read-only, never edit this file by hand.

If you list hidden files and folder inside the directory (ls -l) you will also see a .pixi/ folder. This, or more specifically the .pixi/envs/default/ folder, is where the environment is stored by default. You can define several environments in the pixi.toml file and each environment will get a sub folder in .pixi/envs/. Here we only have one environment, the default environment.

So how do you activate the environment? You can do this by running:

pixi shell

This will start a new shell in the project environment and as you can see the name of the project is now prepended to the prompt. Try to run fastqc --version to see that the software is installed and working. To exit the shell/environment simply type exit (or hit Ctrl+D).

Let’s add a few more packages to the [dependencies] table. Edit the pixi.toml file so that the [dependencies] table looks like this:

pixi.toml

[dependencies]
fastqc = "*"
multiqc = "*"
seqtk = "*"
snakemake = ">=8"
bowtie2 = "*"
samtools = "*"
subread = "*"

Now directly run pixi shell from the command line. Because you updated the pixi.toml file Pixi will automatically detect that the environment needs to be updated and will do so before starting the shell. Consequently, the pixi.lock file will be updated to reflect the new environment. Try running e.g. snakemake --version to see that the new software is installed and working.

Tip

To see which packages are installed in the environment you can run pixi list from inside the project directory.

This was a very short introduction to Pixi, focusing on how it can be used to achieve the same type of functionality as Conda. As you can see, in Pixi the emphasis is on projects rather than environments. Whether this makes your work more organised and easier to manage is for you to decide.

There is much more you can do with this tool though so if you want to learn more about it you can read the Pixi documentation.