HPC Introduction

High-performance computing cluster: Dardel

Author

Martin Dahlö

Published

14-Nov-2024

0.1 Connect to PDC

The first step of this lab is to open a ssh connection to PDC. Please refer to Connecting to PDC for instructions. Once connected to PDC, return here and continue reading the instructions below.

0.2 Logon to a node

Usually you would do most of the work in this lab directly on one of the login nodes at PDC, but we have arranged for you to have one core each for better performance. This was covered briefly in the lecture notes.

Check which node you got when you booked resources this morning (replace username with your PDC username)

squeue -u username

should look something like this

user@login1 ~ $ squeue -u user
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5583899    shared interact    user  R       2:22      1 nid001009
user@login1 ~ $

where nid001009 is the name of the node I got (yours will probably be different). Note the numbers in the Time column. They show for how long the job has been running. When it reaches the time limit you requested (7 hours in this case) the session will shut down, and you will lose all unsaved data. Connect to this node from within PDC.

ssh -Y nid001009

If the list is empty you can run the allocation command again and it should be in the list:

salloc -A edu24.uppmax -t 04:00:00 -p shared -c 4

1 Copy files for lab

Now, you will need some files. To avoid all the course participants editing the same file all at once, undoing each other’s edits, each participant will get their own copy of the needed files. The files are located in the folder

/sw/courses/ngsintro/hpc/intro

Next, copy the lab files from this folder. -r means recursively, which means all the files including sub-folders of the source folder. Without it, only files directly in the source folder would be copied, NOT sub-folders and files in sub-folders.

Remember to use tab-complete to avoid typos and too much writing.

# syntax
cp -r <source> <destination>

cp -r /sw/courses/ngsintro/hpc/intro ~/ngsintro

Have a look in the folder you just copied

user@login1 ~ $ cd ~/ngsintro/hpc_tutorial

user@login1 hpc_tutorial $ ll
total 128K
drwxrwxr-x 2 user user 2,0K May 18 16:21 .
drwxrwxr-x 4 user user 2,0K May 18 15:34 ..
-rwxrwxr-x 1 user user 1,2K May 18 16:21 data.bam
-rw-rw-r-- 1 user user  232 May 18 16:21 job_template
user@login1 hpc_tutorial $ 

2 Run programs

Among the files that were copied is data.bam. BAM is a popular format to store aligned sequence data, but since it is a, so called, binary format it doesn’t look that good if you are human. Try it using less:

less data.bam

Not so pretty.. Luckily for us, there is a program called samtools that is made for reading BAM files. To use it on PDC we must first load the module for samtools. Try starting samtools before loading the module.

user@login1 hpc_tutorial $ samtools
-bash: samtools: command not found

That did not work, try it again after loading the module:

user@login1 hpc_tutorial $ module load bioinfo-tools samtools/1.20
Message: NOTE: The modules made available by loding this module are all PDC legacy, please consider loading PDC installed modules when available, as they are optimized for working on Dardel

user@login1 hpc_tutorial $ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.20 (using htslib 1.20)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates
     ampliconclip   clip oligos from the end of reads

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     consensus      produce a consensus Pileup/FASTA/FASTQ
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA
     import         Converts FASTA or FASTQ files to SAM/BAM/CRAM
     reference      Generates a reference from aligned data
     reset          Reverts aligner changes in reads

  -- Statistics
     bedcov         read depth per BED region
     coverage       alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     cram-size      list CRAM Content-ID and Data-Series sizes
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
     ampliconstats  generate amplicon specific stats

  -- Viewing
     flags          explain BAM flags
     head           header viewer
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM
     samples        list the samples in a set of SAM/BAM/CRAM files

  -- Misc
     help [cmd]     display this help message or help for [cmd]
     version        detailed version information

All modules are unloaded when you disconnect from PDC, so you will have to load the modules again every time you log in. If you load a module in a terminal window, it will not affect the modules you have loaded in another terminal window, even if both terminals are connected to PDC. Each terminal is independent of the others.

To use samtools to view a BAM file, use the following line.

user@login1 hpc_tutorial $ samtools view -h data.bam

@HD VN:1.0  SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10    LN:135534747
@SQ SN:chr11    LN:135006516
@SQ SN:chr12    LN:133851895
@SQ SN:chr13    LN:115169878
@SQ SN:chr14    LN:107349540
@SQ SN:chr15    LN:102531392
@SQ SN:chr16    LN:90354753

-h also print the BAM file’s header, which is the rows starting with @ signs in the beginning of the file. These lines contain so called metadata; information about the data stored in the file. It contain things like which program was used to generate the BAM file and which chromosomes are present in the file. Try running the command without the -h to see the difference.

The not-binary version (ASCII, or text version) of a BAM file is called a SAM file, which was just printed directly into the terminal window. The SAM file is not to much use for us printed in the terminal window, aesthetics aside. It is probably much better to have the SAM file saved as an actual file, something that is very easy to do. Any text that is printed to the terminal can be saved to a file instead of the terminal window using a ‘crocodile mouth’, >

# syntax
program arguments > outfile

which will launch a program named program, supply it with the argument arguments, and write any output that would have been printed to the screen to the file outfile instead.

To use this on samtools,

samtools view -h data.bam > data.sam

Look at the created file:

ll

The SAM file is now human readable. Try viewing it with less:

less data.sam

You can also edit the file with nano:

nano data.sam

Try deleting the whole last line in the file, save it, and exit nano.

3 Modules

To view which module you have loaded at the moment, type

user@login1 hpc_tutorial $ module list

Currently Loaded Modules:
  1) craype-x86-rome                       11) PrgEnv-cray/8.5.0
  2) libfabric/1.15.2.0                    12) snic-env/1.0.0
  3) craype-network-ofi                    13) systemdefault/1.0.0 (S)
  4) perftools-base/23.12.0                14) bioinfo-tools
  5) xpmem/2.8.2-1.0_3.9__g84a27a5.shasta  15) ncurses/6.4
  6) cce/17.0.0                            16) bzip2/1.0.8
  7) craype/2.7.30                         17) xz/5.4.5
  8) cray-dsmml/0.2.2                      18) libdeflate/1.19
  9) cray-mpich/8.1.28                     19) htslib/1.20
 10) cray-libsci/23.12.5                   20) samtools/1.20

  Where:
   S:  Module is Sticky, requires --force to unload or purge

Let’s say that you want to make sure you are using the latest version samtools. Look at which version you have loaded at the moment (samtools/1.20).

Now type

module avail

to see which programs are available at PDC. Can you find samtools in the list? Which is the latest version of samtools available at PDC?

To change which samtools module you have loaded, you have to unload the the module you have loaded and then load the other module. To unload a module, use

module unload <module name>

Look in the list from module list to see the name of the module you want to unload. When the old module is unloaded, load samtools/1.15.1 (or try with the latest samtools module!).

4 Submitting a job

Not all jobs are as small as converting this tiny BAM file to a SAM file. Usually the BAM files are several gigabytes, and can take hours to convert to SAM files. You will not have reserved nodes waiting for you to do something either, so running programs is done by submitting a job to the queue system. What you submit to the queue system is a script file that will be executed as soon as it reaches the front of the queue. The scripting language used in these scripts is bash, which is the same language as you usually use in a terminal i.e. everything so far in the lecture and lab has been in the bash language (cd, ls, cp, mv, etc.).

Have a look at job_template in your hpc_tutorial folder.

less job_template

#! /bin/bash -l
#SBATCH -A XXXXXXX
#SBATCH -p shared
#SBATCH -J Template_script
#SBATCH -t 01:00:00

# load some modules
module load bioinfo-tools

# go to some directory
cd /cfs/klemming/projects/supr/naiss2099-99-999

# do something
echo Hello world!

Edit this file to make the job convert data.bam to a SAM file named jobData.sam. Remember how the queue works? Try to approximate the runtime of the job (almost instant in this case) and increase it by ~50%, and use that time approximation when writing your script file. Longer jobs will wait longer in the queue because it is harder to fit them into gaps in the queue! Also remember to change the project ID to match this course occasion.

Remember, just write the command you would run if you were sitting by the computer, i.e. load the correct modules, go to the correct folder, and run samtools the right way.

Submit your job using sbatch:

sbatch job_template

5 Job queue

If you want to know how your jobs are doing in the queue, you can check their status with $ squeue -u username

Rewrite the previous sbatch file so that you book 3 days of time, and to use a node instead of a core. This will cause your job to stand in the queue for a bit longer, so that we can have a look at it while it is queuing. Submit it to the queue and run squeue.

squeue -u username

  JOBID PARTITION     NAME     USER   ST       TIME  NODES NODELIST(REASON)
5104668    shared Template   username PD       0:00      1 (Priority)

Here we can see in the status column (ST)that the job is pending (PD) and has not started yet. The job is waiting for a node to become available. When the job starts, the status will change to R (running). If you run squeue again, you will see that the job is running.

In our case, we are not really interested in running this job at all. Let’s cancel it instead. This can be done with the command scancel. Syntax:

scancel <job id>

You see the job id number in the output from squeue.

scancel 3134401

If you have a lot of jobs running, you can cancel all of them by using the command scancel -u username.

6 Interactive jobs

Sometimes it is more convenient to work interactively on a node instead of submitting your work as a job. You have tried this during these labs, but you will not have the reservations we have during the course so your jobs might take a little while to start. You will have to book your jobs using the salloc command. Syntax:

salloc -A <project id> -t <time> -c <number of cores> -p shared

Try closing down your current session on the reserved node you connected to in the beginning of the lab by typing exit. Then make a new booking using salloc but without the reservation id,

salloc -A edu24.uppmax -t 3:00:00 -c 1 -p shared

Congratulations, you are now ready to be let loose on the cluster!

7 Extra

Extra material if you finish too fast.

7.1 Time and space

Remember the command projinfo (shows you how much of your allocated resources you have used) from the lecture? Try running it and see how you are doing. Run projinfo -h to see options you can give it.

Optional

This optional material on HPC pipelines will teach you the basics in creating pipelines. Continue with this if you finish the current lab ahead of time. Navigate to the exercise HPC Pipelines lab.