In code blocks, the dollar sign ($
) is not to be printed. The dollar sign is usually an indicator that the text following it should be typed in a terminal window.
The first step of this lab is to open a ssh connection to UPPMAX. Please go to the Contents page and click “Connecting to UPPMAX” under the “Additional content” topic. Once connected to UPPMAX, return here and continue reading the instructions below.
Usually you would do most of the work in this lab directly on one of the login nodes at UPPMAX, but we have arranged for you to have one core each for better performance. This was covered briefly in the lecture notes.
$ salloc -A snic2022-22-1124 -t 07:00:00 -p core -n 1 --no-shell &
check which node you got (replace username with your UPPMAX username)
$ squeue -u username
should look something like this
dahlo@rackham2 work $ squeue -u dahlo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3132376 core sh dahlo R 0:04 1 r292
dahlo@rackham2 work $
where r292 is the name of the node I got (yours will probably be different). Note the numbers in the Time column. They show for how long the job has been running. When it reaches the time limit you requested (7 hours in this case) the session will shut down, and you will lose all unsaved data. Connect to this node from within UPPMAX.
$ ssh -Y r292
There is a UPPMAX specific tool called jobinfo
that supplies the same kind of information as squeue
that you can use as well ($ jobinfo -u username
).
Now you will need some files. To avoid all the course participants editing the same file all at once, undoing each other’s edits, each participant will get their own copy of the needed files. The files are located in the folder /sw/courses/ngsintro/linux/filetypes
Next, copy the lab files from this folder. -r
means recursively, which means all the files including sub-folders of the source folder. Without it, only files directly in the source folder would be copied, NOT sub-folders and files in sub-folders.
Remember to use tab-complete to avoid typos and too much writing.
$ cp -r <source> <destination>
$ cp -r /sw/courses/ngsintro/linux/filetypes /proj/snic2022-22-1124/nobackup/username/
Have a look in /proj/snic2022-22-1124/nobackup/username/.
$ cd /proj/snic2022-22-1124/nobackup/username/filetypes
$ tree
This will print a file tree, which gives you a nice overview of the folders where you are standing in. As you can see, you have a couple of files and a couple of empty folders. In the 0_ref folder you have a reference genome in fasta format and annotations for the genome in GTF format. In 0_seq you have a fastq file containing the reads we will align.
The best way to see all the different file formats is to run a small pipeline and see which files we encounter along the way. The pipeline is roughly the same steps you’ll do in the variant-calling part of the course, so for now we’ll stick with the dummy pipeline which some of you might have encoutered in the extra material for the UPPMAX exercise.
The programs in the dummy pipeline does not actually do any analysis but they work the same way as the real deal, although slightly simplified, to get you familiar with how to work with analysis programs. The data is from a sequencing of the adenovirus genome, which is tiny compared to the human genome (36kb vs 3gb).
The starting point of the pipeline are fresh reads from the sequencing machine in fastq format, and a reference genome in fasta format. The goal of the exercise is to look at our aligned reads in a genome viewer together with the annotations of the adenovirus genome.
First, let’s go through the steps of the pipeline:
The first thing you usually do is to load the modules for the programs you want to run. During this exercise we’ll only run my dummy scripts that don’t actually do any analysis, so they don’t have a module of their own. What we can do instead is to manually do what module loading usually does: to modify the $PATH
variable.
The $PATH
variable specifies directories where the computer should look for programs. For instance, when you type nano
, how does the computer know which program to start? You gave it the name nano
, but that could refer to any file named nano in the computer, yet it starts the correct one every time. The answer is that it looks in the directories stored in the $PATH
variable.
To see which directories that are available by default, type
$ echo $PATH
It should give you something like this, a list of directories, separated by colon signs:
dahlo@rackham2 work $ echo $PATH
/home/dahlo/perl//bin/:/home/dahlo/.pyenv/shims:/home/dahlo/.pyenv/bin:
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:
/sbin:/opt/thinlinc/bin:/sw/uppmax/bin:/home/dahlo/usr/bin
Try loading a module, and then look at the $PATH
variable again. You’ll see that there are a few extra directories there now, after the module has been loaded.
dahlo@rackham2 work $ module load bioinfo-tools samtools/1.10
dahlo@rackham2 work $ echo $PATH
/sw/apps/bioinfo/samtools/1.10/rackham/bin:/home/dahlo/perl/bin:/home/dahlo/.pyenv/shims:
/home/dahlo/.pyenv/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:
/usr/sbin:/sbin:/opt/thinlinc/bin:/sw/uppmax/bin:/home/dahlo/usr/bin
To pretend that we are loading a module, we will just add a the directory containing my dummy scripts to the $PATH
variable, and it will be like we loaded the module for them.
$ export PATH=$PATH:/sw/courses/ngsintro/linux/uppmax_pipeline_exercise/dummy_scripts
This will set the $PATH
variable to whatever it is at the moment, and add a directory at the end of it. Note the lack of a dollar sign in front of the variable name directly after export
. You don’t use dollar sign when assigning values to variables, and you always use dollar signs when getting values from variables.
IMPORTANT: The export command affects only the terminal you type it in. If you have 2 terminals open, only the terminal you typed it in will have a modified path. If you close that terminal and open a new one, it will not have the modified path.
All aligners will have to index the reference genome you are aligning your data against. This is only done once per reference genome, and then you reuse that index whenever you need it. All aligners have their own kind of index unfortunately, so you will have to build one index for each aligner you want to use. In this lab, we will use the dummy aligher called align_reads
, and we will build a index using it’s indexing progam, called reference_indexer
.
First, have a look in the 0_ref folder
$ ll 0_ref
You should see 2 files: the fasta file, the gtf file. Have a look at each of them with less
, just to see how they look inside. To do the actual indexing of the genome:
Run reference_indexer.
Syntax: reference_indexer -r <name of the fasta file you want to index>
$ reference_indexer -r 0_ref/ad2.fa
Since the genome is so small this should only take a second or so. The human genome will probably take a couple of hours. Look in the 0_ref folder again and see if anything has changed.
$ ll 0_ref
The new file you see is the index file created by reference_indexer. This index is in the same format as you would get from the real program samtools. Try viewing the index file with less
and see how it looks. The samtools type of index contains one row per fasta record in the reference file. In this case, there is only one record for the adenovirus genome, and it’s called ad2
in the fasta file. The human reference genome typically have one record per chromosome, so a index of the human genome would then have 24 rows.
The numbers after the record name specifies how many bases the record has, how far into the file (in bytes) the record starts, the number of bases on each line in the record, and how many bytes each line takes up in the file. Using this information the program can quickly jump to the start location of each record, without having to read the file from the first row every time.
Other aligners might use more complex indexing of the file to speed up the alignment process even further, e.g. creating an index over where it can find all possible “words” that you can form with 5 or so bases, making it easier to find possible matching sites for reads. If the read starts with ATGTT
you can quickly look in the index and see all places in the geonome that contains this word and start looking if the rest of the read matches the area around the word.
This greatly decreases the number of places you have to look when looking for a match. These types of indexes usually take a long time to create (5+ hours maybe), but since you only have to do it once per reference genome it’s easily worth it, seeing how the alignment process probably would take 100s of hours without the index, instead of 6-12 hours.
We are now ready to align our reads.
Align reads using align_reads, naming the output file ad2.sam, placed in the 1_alignment folder.
Syntax: align_reads -r <reference genome> -i <fastq file with reads> -o <name of the output file>
$ align_reads -r 0_ref/ad2.fa -i 0_seq/ad2.fq -o 1_alignment/ad2.sam
This will create a SAM file in 1_alignment called ad2.sam. Have a look at it with less. If you think the file looks messy, add a -S
after less to make it stop wrapping long lines, less -S 1_alignment/ad2.sam
and scroll sideways using the arrow keys. As you can see there is one row per aligned read in this file. Each row contains information about the read, like the name of the read, where in the reference genome it aligned, and also a copy of the reads sequence and quality score, among other things.
The next step is to convert the SAM file to a BAM file. This is more or less just compressing the file, like creating a zip file. To do that we will use the dummy program sambam_tools, telling it we want to convert a file to BAM (-f bam), which file we want to convert (-i), where it should save the resulting BAM file (-o). Save the BAM file in the 2_bam folder and name it ad2.bam.
Syntax: sambam_tool -f bam -i <sam file> -o <bam>
$ sambam_tool -f bam -i 1_alignment/ad2.sam -o 2_bam/ad2.bam
Have a look in the 2_bam folder.
$ ll 2_bam
The created BAM file is an exact copy of the SAM file, but stored in a much more efficient format. Aligners usually have an option to output BAM format directly, saving you the trouble to convert it yourself, but not all tools can do this (they really should though). Have a look at the difference in file size, though in this example it’s quite an extreme difference (2.9 MB vs 0.3 MB). The quality score of all reads is the same (BBBBBBBBB..), and files with less differences are easier to compress. Usually the BAM file is about 25% of the size of the SAM file.
Since the BAM format is a binary format we can’t look at it with less
. We would have to use a tool, like samtools which you will probably see later in the week, to first convert the file back to a SAM file before we can read it. In that case we can just look at the SAM file before converting it since they will be the same.
A BAM file is taking up much less space than the SAM file, but we can still improve performance. An indexed BAM file is infinitely faster for programs to work with, but before we can index it, we have to sort it since it’s not possible to index an unsorted file in any meaningful way.
To sort the BAM file we’ll use the sambam_tool again, but specifying a different function, -f sort instead. Tell it to store the sorted BAM file in the 3_sorted folder and name the file ad2.sorted.bam
Syntax: sambam_tool -f sort -i <unsorted bam file> -o <sorted bam file>
$ sambam_tool -f sort -i 2_bam/ad2.bam -o 3_sorted/ad2.sorted.bam
This will sort the ad2.bam file and create a new BAM file which is sorted, called ad2.sorted.bam.
Now when we have a sorted BAM file, we can index it. Use the command
Syntax: sambam_tool -f index -i <sorted bam file>
$ sambam_tool -f index -i 3_sorted/ad2.sorted.bam
This will create an index named ad2.sorted.bam.bai in the same folder as the ad2.sorted.bam file is located. It’s nicer to have the .bam and .bai named to the same “prefix”, so rename the .bai file to not have the .bam in its name.
$ mv 3_sorted/ad2.sorted.bam.bai 3_sorted/ad2.sorted.bai
Now that we have to data aligned and prepared for easy access, we will view it in a genome viewer together with the annotations for the genome. Have a look at the annotations file with less
.
$ less -S 0_ref/ad2.gtf
The -S will tell less to not wrap the lines, and instead show one line per line. If the line is longer than the window, you can user the left and right arrow to scroll to the left and right. Many tabular files are much more readable when using the -S
option. Try viewing the file without it and see the difference.
To view the file, we will use the program IGV (Integrated Genome Viewer). Before we can do this, we have to load the module for IGV.
If you are using a Mac you might have to install the program XQuartz, if you have not already installed that program. By using -Y
in your ssh command you enable graphical transfer over ssh, but you will also have to have a program able to receive the graphics in order to display it.
$ module load bioinfo-tools IGV/2.4.2
Start it by typing the following command (now we’ll find out if you used -Y
in all your ssh connections!):
$ igv.sh
Optional
If you notice that IGV over Xforwarding is excruciatingly slow, you can try to use the web based ThinLinc client instead. Unfortunately this requires you to have set up a two factor authentification (2FA) with UPPMAX, so it’s something you can try on your own. Instructions for setting up the 2FA at UPPMAX. When you are all set up, go to the address https://rackham-gui.uppmax.uu.se and login with your normal UPPMAX username and password together with your 2FA (described at the login screen). This will get you a remote desktop on one of the login nodes, and you can open a terminal and run IGV there instead. Once IGV is started, either using Xforwarding or the remote desktop in your web browser, we are ready to go.
There are 3 files we have to load in IGV.
The first is the reference genome. Press the menu button located at “Genomes - Load Genome from File…” and find your reference genome in 0_ref/ad2.fa. If you are having trouble finding your files, note that IGV always starts in your home directory. Use the dropdown menu at the top to navigate to /proj/snic2022-22-1124/nobackup/….
The second file you have to load is the reads. Press the menu button “File - Load from File…” and select your 3_sorted/ad2.sorted.bam.
The last file you have to load is the annotation data. Press “File - Load from File…” again and select you annotation file in 0_ref/ad2.gtf.
This will show you the reference genome, how all the reads are aligned to it, and all the annotation data. Try zooming in on an area and have a look at the reads and annotations. The figures you see in the picture are all derived from the data in the files you have given it.
At the top of the window you have the overview of the current chromosome you are looking at, which tells you the scale you are zoomed at for the moment. When you zoom in you will see a red rectangle apper which shows you which portion of the chromosome you are looking at. Just below the scale you’ll see the coverage graph, which tells you how many reads cover each position along the reference genome. The colored bands you see here and there are SNPs, i.e. positions where the reads of your sample does not match the reference genome.
All the reads, the larger area in the middle of the window, are drawn from the data in the BAM file using the chromosome name, the starting position and the ending position of each read. When you zoom in more you will be able to see individual reads and how they are aligned. The annotation in GTF format are all plotted using the data in the GTF file, visible just under all the reads, are shown as blue rectangles.
The reference genome, a fasta file containing the DNA sequence of the reference genome, is visible at the bottom of the window if you zoom to the smallest level so you can see the bases of the genome.
The CRAM format is even more efficient than the BAM format. To create a CRAM file we’ll have to use samtools, so we will load the module for it.
$ module load bioinfo-tools samtools/1.10
Tell samtools that you want CRAM output (-C
) and specify which reference genome it should use to do the CRAM conversion (-T
)
Syntax: samtools view -C -T <reference genome> -o <name of cram file> <bam file to convert>
$ samtools view -C -T 0_ref/ad2.fa -o 4_cram/ad2.cram 3_sorted/ad2.sorted.bam
Compare the sizes of the convered BAM file and the newly created CRAM file:
$ ll -h 3_sorted/ad2.sorted.bam 4_cram/ad2.cram
This will list both the files, and print the file size in a human readable format (-h
). The CRAM file is roughly 1/3 of the size of the BAM file. This is probably because all the reads in the simulated data has the same quality value (BBBBBBBBBB). Fewer types of quality values are easier to compress, hence this amazing compression ratio. Real data will have much more diverse quality scores, and the CRAM file would be pethaps 70-80% of the original BAM file.
Optional
If you have been fast to finish this lab and you still have time left (or just can’t get enough of linux stuff), please have a look at the advanced linux tutorial where you can learn the basics in bash programming using variables, loops and control statements. This is the lab that is scheduled for after lunch today.