HPC Pipelines

Building bioinformatic pipelines

Author

Martin Dahlö

Published

06-Jan-2025

0.1 Connect to PDC

The first step of this lab is to open a ssh connection to PDC. Please refer to Connecting to PDC for instructions. Once connected to PDC, return here and continue reading the instructions below.

0.2 Logon to a node

Check which node you got when you booked resources this morning (replace username with your username)

squeue -u username

should look something like this

user@login1 ~ $ squeue -u user
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5583899    shared interact    user  R       2:22      1 nid001009
user@login1 ~ $

where nid001009 is the name of the node I got (yours will probably be different). Note the numbers in the Time column. They show for how long the job has been running. When it reaches the time limit you requested the session will shut down, and you will lose all unsaved data. Connect to this node from the login node.

ssh -Y nid001009

If the list is empty you can run the allocation command again and it should be in the list:

salloc -A edu25.uppmax --reservation=edu25-03-24 -t 04:00:00 -p shared -c 1 

1 Copy files for lab

Now, you will need some files. To avoid all the course participants editing the same file all at once, undoing each other’s edits, each participant will get their own copy of the needed files. The files are located in the folder /sw/courses/ngsintro/hpc/pipeline/data.

Next, copy the lab files from this folder. -r means recursively, which means all the files including sub-folders of the source folder. Without it, only files directly in the source folder would be copied, NOT sub-folders and files in sub-folders.

Remember to use tab-complete to avoid typos and too much writing.

# syntax
cp -r <source> <destination>

cp -r /sw/courses/ngsintro/hpc/pipeline/data ~/ngsintro/hpc/pipeline

Have a look in ~/ngsintro/hpc/pipeline

cd ~/ngsintro/hpc/pipeline
ll

If you see files, the copying was successful.

2 Running dummy pipelines

Most of the work you will do in the future will be about running multiple programs after each other. This can be done manually, with you sitting by the computer and typing commands, waiting for them to finish, then start the next program. But what happens if the first program finished sometime during the night? You will not be there to start the next program, and the computer will stand idle until you have time to start the program.

To avoid this, scripts can be used. First, we’ll do an analysis manually without scripts, just to get the hang of it. Then we’ll start writing scripts for it!

2.1 Load the module

In this exercise, we’ll pretend that we are running analyses. This will give you a peek at what running programs in linux is like, and get you ready for the real stuff during the week!

The first thing you usually do is to load the modules for the programs you want to run. During this exercise we’ll only run my dummy scripts that don’t actually do any analysis, so they don’t have a module of their own. What we can do instead is to manually do what module loading usually does: to modify the $PATH variable.

The $PATH variable specifies directories where the computer should look for programs whenever you type a command. For instance, when you type

less

how does the computer know which program to start? You gave it the name less, but that could refer to any file named less in the computer, yet it starts the correct one every time. The answer is that it looks in the directories stored in the $PATH variable and start the first program it finds that is named less. You can check exactly which program it starts by typing which less. It will give you something like the following:

which less
/usr/bin/less

telling you that the program less is located in the directory /usr/bin.

To see which directories that are checked by default, type

echo $PATH

It should give you something like this, a long list of directories (abbreviated below), separated by colon signs:

echo $PATH
/pdc/software/modules/systemdefault/bin:/opt/cray/pe/mpich/8.1.28/of
i/crayclang/17.0/bin: ...
...

Try loading a module, and then look at the $PATH variable again. You’ll see that there are a few extra directories there now, after the module has been loaded.

module load bioinfo-tools samtools
echo $PATH
/pdc/software/eb/software/samtools/1.20/bin:/pdc/software/eb/softwar
e/htslib/1.20/bin:/pdc/software/eb/software/libdeflate/1.19/bin:/pdc
/software/eb/software/xz/5.4.5/bin:/pdc/software/eb/software/bzip2/1
.0.8/bin:/pdc/software/eb/software/ncurses/6.4/bin:/pdc/software/mod
ules/systemdefault/bin:/opt/cray/pe/mpich/8.1.28/ofi/crayclang/17.0/
bin: ...
...

To pretend that we are loading a module, instead of actually loading a module, we’ll manually do what the module system would have done. We will just add a the directory containing my dummy scripts to the $PATH variable, and it will be like we loaded the module for them. Now, when we type the name of one of my scripts, the computer will look in all the directories specified in the $PATH variable, which now includes the location where i keep my scripts. The computer will now find programs named as my scripts are and it will run them.

export PATH=$PATH:/sw/courses/ngsintro/hpc/pipeline/dummy_scripts

This will set the $PATH variable to whatever it is at the moment, and add a directory at the end of it. Note the lack of a dollar sign in front of the variable name directly after export. You don’t use dollar signs when assigning values to variables, and you always use dollar signs when getting values from variables.

Important

The export command affects only the terminal you type it in. If you have 2 terminals open, only the terminal you typed it in will have a modified path. If you close that terminal and open a new one, it will not have the modified path.

Enough with variables now. Let’s try the scripts out!

3 Running the programs

Let’s pretend that we want to run an exome analysis. You will learn how to do this for real later this week. This kind of analysis usually has the following steps:

  1. Filter out low quality reads.
  2. Align the reads to a reference genome.
  3. Find all the SNPs in the data.

To simulate this, I have written 3 programs:

  • filter_reads
  • align_reads
  • find_snps

To find out how to run the programs type

<program name> -h

# or
<program name> --help

This is useful to remember, since most programs has this function. If you do this for the filter program, you get

Usage: filter_reads -i <input file> -o <output file> [-c <cutoff>]

Example runs:

# Filter the reads in <input> using the default cutoff value. Save filtered reads to <output>
filter_reads -i <input> -o <output>
Ex.
filter_reads -i my_reads.rawdata.fastq -o my_reads.filtered.fastq

# Filter the reads in <input> using a more relaxed cutoff value. Save filtered reads to <output>
filter_reads --input <input> --output <output> --cutoff 30
Ex.
filter_reads --input ../../my_reads.rawdata.fastq --output /home/dahlo/results/my_reads.filtered.fastq --cutoff 30


Options:
  -h, --help            show this help message and exit
  -i INPUT, --input=INPUT
                        The path to your unfiltered reads file.
  -o OUTPUT, --output=OUTPUT
                        The path where to put the results of the filtering.
  -c CUTOFF, --cutoff=CUTOFF
                        The cutoff value for quality. Reads below this value
                        will be filtered (default: 35).
  -d, --devel           Development mode

This help text tells you that the program has to be run a certain way. The options -i and -o are mandatory, since they are explicitly written. The hard brackets [ ] around -c <cutoff> means that the cutoff value is not mandatory. They can be specified if the user wishes to change the cutoff from the default values.

Further down, in the Options: section, each of the options are explained more in detail. You can also see that each option can be specified in two way, a short and a long format. The cutoff value can be specified either by the short -c, or the long --cutoff. It doesn’t matter which format you choose, it’s completely up to you, which ever you feel more comfortable with.

Right, so now you know how to figure out how to run programs (just type the program name, followed by a -h or --help). Try doing a complete exome sequencing analysis, following the steps below.

First, go to the exome directory in the lab directory that you copied to your folder in step 2 in this lab:

cd ~/ngsintro/hpc/pipeline/exomeSeq

In there, you will find a folder called raw_data, containing a fastq file: my_reads.rawdata.fastq. This file contains the raw data that you will analyse.

  • Filter the raw data using the program filter_reads, to get rid of low quality reads.
  • Align the filtered reads with the program align_reads, to the human reference genome located here:
/sw/data/reference/Homo_sapiens/hg19/concat_rm/Homo_sapiens.GRCh37.57.dna_rm.concat.fa
  • Find SNPs in your aligned data with the program find_snps. To find SNPs we have to have a reference to compare our data with. The same reference genome as you aligned to is the one to use.

Do one step at a time, and check the --help of the programs to find out how they are to be run. Remember to name your files logically so that you don’t confuse them with each other.

Most pipelines work in a way where the output of the current program is the input of the next program in the pipeline. In this pipeline, raw data gets filtered, the filtered data gets aligned, and the aligned data gets searched for SNPs. The intermediate steps are usually not interesting after you have reached the end of the pipeline. Then, only the raw data and the final result is important.

4 Scripting a dummy pipeline

To run the pipeline in a script, just do exactly as you just did, but write the exact same commands to a file instead of directly to the terminal. When you run the script, the computer will run the script one line at a time, until it reaches the end of the file. Just like you did manually in the previous step.

The simplest way to work with scripts is to have 2 terminals open. One will have nano started where you write your script file, and the other will be on the command line where you can test your commands to make sure they work before you put them in the script. When you are sure a command works, you copy/paste it to the terminal with the script file in it.

Start writing you script with nano:

# if you have not already done so, load the nano module
module load nano

cd ~/ngsintro/hpc/pipeline/exomeSeq
nano exome_analysis_script.sh

The .sh ending is commonly used for shell scripts which is what we are creating. The default shell here is called bash, so whenever we write sh, the computer will use bash. If the default shell would change for some reason, maybe to zsh or any other type of shell, sh would point the the new shell instead.

Since our memory is far from perfect, try to always comment your scripts. The comments are rows that start with a hash sign #. These lines will not be interpreted as a command to be run, they will just be skipped. They are only meant for humans to read, and they are real lifesavers when you are reading old scripts you have forgotten what they do. Commands are hard for humans to read, so try to write a couple of words explaining what the command below does. You’ll be thankful later!

When you are finished with your script, you can test run it. To do that, use the program sh:

sh exome_analysis_script.sh

If you got everything right, you should see the whole pipeline being executed without you having to start each program manually. If something goes wrong, look at the output and try to figure out which step in the pipeline that get the error, and solve it.

A tip is to read the error list from the top-down. An error early in the pipeline will most likely cause a multitude of error further down the pipeline, so your best bet is to start from the top. Solve the problem, try the script again, until it works. The real beauty of scripts is that they can be re-run really easily. Maybe you have to change a variable or option in one of the early steps of the pipeline, just do it and re-run the whole thing.

5 Submitting a dummy pipeline

The whole point with large computer centers like this is that you can run multiple programs at the same time to speed things up. To do this efficiently you will have to submit jobs to the queue system. As you saw in yesterday’s exercise, it is ordinary shell scripts that you submit to the queue system, with a couple of extra options in the beginning. So to be able to submit our script to the queue system, the only thing we have to do is to add the queue system options in the beginning of the script.

The options needed by the queue are, as we learned previously:

  • Who is paying for the job?
  • How long time will the job need?
  • How many cores does the job need?

SLURM is also a bit strict when it comes formalities. It requires that you specify which program should be used to run the submitted script file. The standard program for this is bash, but we have to specify it on the first line of the script non the less. This is done by having the first line in the script looking link this:

#!/bin/bash -l

This is how Linux knows which program should open a file, since it does not care about the file ending like Windows commonly does (.ppt, .pdf, .html might be familiar file endings). The #! indicates that the next words will be the path to the program that should open the file. It could be #!/bin/bash, or #!/bin/python, or any other path to a executable program.

The -l after bash is a flag that tells bash that the script should be treated as a login shell, so everything behaves as when you are logged in. Without it commands like module and some other would not work. If it’s not a login shell, it’s running as a script, and then it does not need to bother making the interface human friendly so it will skip things to make it start faster and consume less resources.

The next couple of rows will contain all the options you want to give SLURM:

#!/bin/bash -l
#SBATCH -A edu25.uppmax
#SBATCH -t 00:05:00
#SBATCH -p shared
#SBATCH -J exome_analysis

SLURM options always start with #SBATCH followed by a flag (-A for account, -t for time, -p for partition, -J for job name) and the value for that flag. Your script should now look something like this (ignore the old project id and path to the scripts):

To submit this script to the queue:

sbatch exome_analysis_script.sh

6 RNAseq Analysis

The next step is to do a complete RNAseq analysis. The steps involved start off just like the exome analysis, but has a few extra steps. The goal of this part is to successfully run the pipeline using the queue system. To do this, you must construct the commands for each step, combine them in a script, include the SLURM options, and submit it. Much like what we did in the previous step, but with some complications.

Typical RNAseq analysis consists of multiple samples / time points:

  • Filter the reads for each sample.
  • Align the filtered reads for each sample to the same reference genome as before.
  • Do a differential expression analysis by comparing multiple samples.

The difficulty here is that you have not just 1 sample, you have 3 of them. And they all need to be filtered and aligned separately, and then compared to each other. The program that does the differential expression analysis in this exercise is called diff_exp and is located in the same directory as the previous scripts. The samples are filtered and aligned individually.