1 Zoom link for help over zoom
2 Connect to PDC
The first step of this lab is to open a ssh connection to PDC. Please refer to Connecting to PDC for instructions. Once connected to PDC, return here and continue reading the instructions below.
3 Logon to a node
Check which node you got when you booked resources this morning (replace username
with your username)
squeue -u username
should look something like this
user@login1 ~ $ squeue -u user
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5583899 shared interact user R 2:22 1 nid001009
user@login1 ~ $
where nid001009
is the name of the node I got (yours will probably be different). Note the numbers in the Time column. They show for how long the job has been running. When it reaches the time limit you requested the session will shut down, and you will lose all unsaved data. Connect to this node from the login node.
ssh -Y nid001009
If the list is empty you can run the allocation command again and it should be in the list:
salloc -A edu24.uppmax --reservation=edu24-11-26 -t 04:00:00 -p shared -c 1
4 First things first
We will have to load the module for nano
before we can start editing files.
module load nano
Let’s make sure nano has syntax highlighting enabled. What that will do is to paint the boring code in pretty colors, making it much easier to read it. See the difference for yourself by first looking at this file before you enable it:
nano /sw/courses/ngsintro/linux/qol/generate_random_protein.py
Close down nano when you have seen how boing it looks without colors by pressing ctrl+x
. Now, let’s enable syntax highlighting. To do this, we will simply tell nano to include the syntax highlighting instructions from a bunch of files that are already installed on the computer. Run this command to do just that:
find /pdc/software/eb/software/nano/7.2/share/nano/ -iname "*.nanorc" -exec echo include {} \; >> ~/.nanorc
This command will put one line per language instructions (~30 of them, located in /pdc/software/eb/software/nano/7.2/share/nano/
), into the nano autostart file (~/.nanorc
) and put the word ‘include’ infront of each file name. That will make nano include the instructions from each of those files whenever it starts. Now have a look at the same file as before and enjoy the colors:
nano /sw/courses/ngsintro/linux/qol/generate_random_protein.py
Then close nano and continue with the lab.
5 Copy files for lab
Now, you will need some files. To avoid all the course participants editing the same file all at once, undoing each other’s edits, each participant will get their own copy of the needed files. The files are located in the folder /sw/courses/ngsintro/linux/linux_advanced/
Next, copy the lab files from this folder. -r
means recursively, which means all the files including sub-folders of the source folder. Without it, only files directly in the source folder would be copied, NOT sub-folders and files in sub-folders.
Remember to use tab-complete to avoid typos and too much writing.
cp -r <source> <destination>
cp -r /sw/courses/ngsintro/linux/linux_advanced ~/ngsintro/linux_advanced
Have a look in ~/ngsintro/linux_advanced
cd ~/ngsintro/linux_advanced
ll
If you see files, the copying was successful.
If you for some reason have problems copying the files, or if you are not on the cluster when running this lab, you can download these files here. You can download and unpack the file using the commands below:
wget https://nbisweden.github.io/workshop-ngsintro/2411/topics/linux/assets/linux_advanced.tar.gz
mkdir -p ~/ngsintro
tar -C ~/ngsintro -xzvf linux_advanced.tar.gz
The -C ~/ngsintro
in that command will make the files be unpacked in the folder ~/ngsintro
instead of whichever folder you are in when you run the command. This makes sure the instructions in the lab will work as intended.
After unpacking, continue the lab from the “Using variables” section.
6 Using variables
Variables are like small notes you write stuff on. If you want to save the value of something to be able to use it later, variables is the way to go. Let’s try assigning values to some variables and see how we use them.
a=5
Now the values 5 is stored in the variable named a
. To use the variable we have to put a $
sign in front of it so that bash knows we are referring to the variable a
and not just typing the letter a
. To see the result of using the variable, we can use the program echo
, which prints whatever you give it to the terminal.
echo print this text to the terminal
echo "you can use quotes if you want to"
As you see, all the words you give it are printed just the way they are. Try putting the variable somewhere in the text.
echo Most international flights leave from terminal $a at Arlanda airport
Bash will see that you have a variable there and will replace the variable name with the value the variable have before sending the text to echo. If you change the value of a
and run the exact command again you will see that it changes.
a="five"
echo Most international flights leave from terminal $a at Arlanda airport
So without changing anything in the echo statement, we can make it output different things, all depending on the value of the variable a
. This is one of the main points of using variables, that you don’t have to change the code or script but you can still make it behave differently depending on the values of the variables.
You can also do mathematics with variables, but we have to tell bash that we want to do calculations first. We do this by wrapping the calculations inside a dollar sign (telling bash it’s a variable) and double parentheses, i.e. $((5+5))
.
a=4
echo $a squared is $(($a*$a))
Write a echo command that will print out the volume of a rectangular cuboid, with the side lengths specified by variables named x
, y
, and z
. To see that it works correctly, the volume of a rectangular cuboid with sides 5,5,5 is 125, and for 4,5,10 is 200. Give it a couple of tries on your own first. If you get completely stuck you can see a suggested solution below.
Solution
x=4
y=5
z=10
echo The volume of the rectangular cuboid with the sides $x,$y,$z is $(($x*$y*$z)).
7 Exercises
First off, let’s open another terminal to the cluster so that you have 2 of them open. Scripting is a lot easier if you have one terminal on the command line ready to run commands and test things, and another one with a text editor where you write the actual code. That way you will never have to close down the text editor when you want to run the script you are writing on, and then open it up again when you want to continue editing the code.
So open a new terminal window, connect it to the cluster and then connect it to the node you have booked. Make sure both terminals are in the ~/ngsintro/linux_advanced
directory, and start editing a new file with gedit
or nano
where you write your script. Name the file whatever you want, but in the examples I will refer to it as loop_01.sh
. Write your loops to this file (or create a new file for each new example) and test run it in the other terminal.
If you get error messages like this when you run gedit
, (gedit:27463): dconf-WARNING **: 10:59:00.575: failed to commit changes to dconf: Failed to execute child process “dbus-launch” (No such file or directory)
, and if you can’t change any preferences, you can try starting gedit through the graphical menu in ThinLinc instead. If you are using the Xfce desktop environment you should have a start-menu-like button at the top-left of the screen named Applications
, or if you right-click somewhere on the desktop you should find it in the context menu that pops up. In the Applications
menu, look in the category Accessories
and you should find a program called Text editor
which will start gedit
.
The most simple loops are the ones that loop over a predefined list. You saw examples of this in the lecture slides, for example:
for i in "Print these words" one by one;
do
echo $i
done
which will print the value of $i
in each iteration of the loop. Write this loop in the file you are editing with gedit
/nano
, save the file, and then run it in the other terminal you have open.
bash loop_01.sh
As you see, the words inside the quotation marks are treated as a single unit, unlike the words after. You can also iterate over numbers, so erase the previous loop you wrote and try this instead:
for number in 1 2 3;
do
echo $number
done
If everything worked correctly you should have gotten the numbers 1 2 3
printed to the screen. As you might guess, this way of writing the list of numbers to iterate over will not be usable once you have more than 10 or so numbers you want to loop over. Fortunately, the creators of bash (and most other computer languages) saw this problem coming a mile away and did something about it. To quickly create a list of numbers in bash, you can use something called a sequence expression to create the list for you.
for whatevernameyouwant in {12..72};
do
echo $whatevernameyouwant
done
7.1 Exercise 1
Let’s say it’s New Year’s Eve and you want to impress your friends with a computerized countdown of the last 10 seconds of the year (don’t we all?).
Start off with getting a loop to count down from 10 to 0 first. Notice how fast the computer counts? That won’t do if it’s seconds we want to be counting down. Try looking the man page for the sleep
command (man sleep
) and figure out how to use it. The point of using sleep
is to tell the computer to wait for 1 second after printing the number, instead of rushing to the next iteration in the loop directly. Try to implement this on your own.
Solution
# declare the values the loop will loop over
for secondsToGo in {10..0};
do
# print out the current number
echo $secondsToGo
# sleep for 1 second
sleep 1
done
# Declare the start of a new year in a festive manner
echo Happy New Year everyone!!
7.2 Exercise 2
Let’s try to do something similar to the example in the lecture slides, to run the same commands on multiple files. In the Intro to high-performance computing lab, we learned how to use samtools
to convert BAM files to SAM files so that humans can read them. In real life you will never do this, instead you will most likely always do it the other way around. SAM files take up ~4x more space on the hard drive compared to the same file in BAM format, so as soon as you see a SAM file you should instinctively convert it to a BAM file instead to conserve hard drive space. If you have many SAM files that needs converting you don’t want to sit there and type all the commands by hand like some kind of animal.
Write a script that converts all the SAM files in a specified directory to BAM files. Incidentally, you can find 50 SAM files in need of conversion in the folder called sam
in the folder you copied to your folder earlier in this lab (~/ngsintro/linux_advanced/sam
). Bonus points if you make the program take the specified directory as an argument, and another bonus point if you get the program to name the resulting BAM file to the same name as the SAM file but with a .bam
ending instead of .sam
.
Remember that you have to load the samtools
module to be able to run it. The way you get samtools
to convert a SAM file to a BAM file is by typing the following command:
samtools view -bS sample_1.sam > sample_1.bam
The -b
option tells samtools to output BAM format, and the -S
option tells samtools that the input is in SAM format.
Remember, Google is a good place to get help. If you get stuck, google bash remove file ending
or bash argument to script
and look for hits from StackOverflow/StackExchange or similar pages. There are always many different way to solve a problem. Try finding one you understand what they do and test if you can get them to work the way you want. If not, look for another solution and try that one instead.
Basic, without bonus points:
Solution
# load the modules needed for samtools
module load bioinfo-tools samtools
# move to the SAM files directory to start with
cd sam
# use ls to get the list to iterate over
for file in *.sam;
do
# do the actual converting, just slapping on .bam at the end of the name
samtools view -bS $file > $file.bam
done
Advanced, with bonus points:
Solution
# load the modules needed for samtools
module load bioinfo-tools samtools
# move to the SAM files directory to start with.
# $1 contains the first argument given to the program
cd $1
# use ls to get the list to iterate over.
for file in *.sam;
do
# print a message to the screen so that the user knows what is happening.
# $(basename $file .sam) means that it will take the file name and remove .sam
# at the end of the name.
echo "Converting $file to $(basename $file .sam).bam"
# do the actual converting
samtools view -bS $file > $(basename $file .sam).bam
done
7.3 Exercise 3
Let’s add a small thing to the exercise we just did. If there already exists a BAM file with the same name as the SAM file it’s not necessary to convert it again. Let’s use an if
statement to check if the file already exists before we do the conversion.
The following if
statement will check if a given filename exists, and prints a message depending on if it exists or not.
FILE=$1
if [ -f $FILE ];
then
echo "File $FILE exists."
else
echo "File $FILE does not exist."
fi
What we want to do is to check if the file doesn’t exists. The way to do that is to invert the answer of the check if the file does exist. To do that in bash, and many other languages, is to use the exclamation mark, !
, which in these kinds of logical situations means NOT or the opposite of.
FILE=$1
if [ ! -f $FILE ];
then
echo "File $FILE does not exist."
fi
Now, modify the previous exercise to only do the conversion if a file with the intended name of the BAM file doesn’t already exists. i.e; if you have a.sam
and want to create a BAM file named a.bam
, first check if a.bam
already exists and only do the conversion if it does not exist.
Basic:
Solution
# load the modules needed for samtools
module load bioinfo-tools samtools
# move to the SAM files directory to start with.
cd sam
# use ls to get the list to iterate over.
for file in *.sam;
do
# check if the intended output file does not already exists
if [ ! -f $file.bam ];
then
# do the actual converting, just slapping on .bam at the end of the name
samtools view -bS $file > $file.bam
fi
done
Advanced:
Solution
# load the modules needed for samtools
module load bioinfo-tools samtools
cd $1
# use ls to get the list to iterate over.
# $1 contains the first argument given to the program
for file in *.sam;
do
# basename will remove the path information to the file, and will also remove the .sam ending
filename_bam=$(basename $file .sam)
# add the .bam file ending to the filename
filename_bam=$filename_bam.bam
# check if the intended output file does not already exists.
if [ ! -f $filename_bam ];
then
# print a message to the screen so that the user knows what is happening.
echo "Converting $file to $filename_bam"
# do the actual converting
samtools view -bS $file > $filename_bam
else
# inform the user that the conversion is skipped
echo "Skipping conversion of $file as $filename_bam already exist"
fi
done
7.4 Bonus exercise 1
Maths and programming are usually a very good combination, so many of the examples of programming you’ll see involve some kind of maths. Now we will write a loop that will calculate the factorial of a number. As wikipedia will tell you, “the factorial of a non-negative integer n, denoted by n!, is the product of all positive integers less than or equal to n”, i.e. multiply all the integers, starting from 1, leading up to and including a number with each other.
The factorial of 5, written 5!
, would be 1*2*3*4*5=120
. Doing this by hand would start taking its time even after a couple of steps, but since we know how to loop that should not be a problem anymore.
Write a loop that will calculate the factorial of a given number stored in the variable $n
.
A problem that you will encounter is that the sequence expression, {1..10}
, from the previous exercise doesn’t handle variables. This is because of the way bash is built. The sequence expressions are handled before handling the variables so when bash tries to generate the sequence, the variable names have not yet been replaced with the values they contain. This leads to bash trying to create a sequence from 1 to $n
, which of course doesn’t mean anything.
To get around this we can use a different way of generating sequences (there are always alternatives). There is a program called seq
that does pretty much the same thing as the sequence expression, and since it is a program it will be executed after the variables have been handled. It’s as easy to use as the sequence expressions; instead of writing {1..10}
just write $( seq 1 10 )
.
The $()
tells bash to run something in a subshell, which pretty much means it will run the command within the paratheses and then take whatever that command printed to the screen and replace the parantheses expression:
echo $(seq 1 5)
becomes
echo 1 2 3 4 5
Solution
# set the number you want to calculate the factorial of
n=10
# you have to initialize a variable before you can start using it.
# Leaving this empty would lead to the first iteration of the loop trying
# to use a variable that has no value, which would cause it to crash
factorial=1
# declare the values the loop will loop over (1 to whatever $n is)
for i in $( seq 1 $n );
do
# set factorial to whatever factorial is at the moment, multiplied with the variable $i
factorial=$(( $factorial * $i ))
# an alternative solution which gives exactly the same result, but makes it a bit more readable maybe
# temporary_sum=$(( $factorial * $i ))
# factorial=$temporary_sum
done
# print the result
echo The factorial of $n is $factorial
7.5 Bonus exercise 2
Now, let’s combine everything you’ve learned so far in this course.
Write a script that runs the pipeline from the filetypes lab for each fastq file in a specified directory, using the same reference genome as in the filetype lab.
If that sounds too easy, make the script submits a slurm job for each sample that will run the pipeline for that sample on a calculation node (1 core, 5 minutes each). When the analysis is done, only fastq files and sorted and indexed BAM files should be in your folder.
There is a bunch of fastq files in the directory ~/ngsintro/linux_advanced/fastq
that is to be used for this exercise.
Basic solution:
Solution
# make the dummy pipeline available
export PATH=$PATH:/sw/courses/ngsintro/hpc/pipeline/dummy_scripts
# index the reference genome
reference_indexer -r ~/ngsintro/filetypes/0_ref/ad2.fa
# go to the input files
cd $1
# loop over all the fastq files
for file in *.fastq;
do
# align the reads
align_reads -r ~/ngsintro/filetypes/0_ref/ad2.fa -i $file -o $file.sam
# convert the sam file to a bam file
sambam_tool -f bam -i $file.sam -o $file.bam
# sort the bam file
sambam_tool -f sort -i $file.bam -o $file.sorted.bam
# index the bam file
sambam_tool -f index -i $file.sorted.bam
done
Advanced solution:
Solution
# make the dummy pipeline available in this script
export PATH=$PATH:/sw/courses/ngsintro/hpc/pipeline/dummy_scripts
# index the reference genome once, only if needed
if [ ! -f ~/ngsintro/filetypes/0_ref/ad2.fa.idx ];
then
reference_indexer -r ~/ngsintro/filetypes/0_ref/ad2.fa
fi
# find out the absolute path to the input files
cd $1
input_absolute_path=$(pwd)
# go back to the previous directory now that the absolute path has been saved
cd -
# alternative way to get the absolute path to the input files
# input_absolute_path=$(realpath $1)
# loop over all the fastq files
for file in $input_absolute_path/*.fastq;
do
# print status report
echo Processing $file
# save the file name without the path information for convenience
file_basename=$(basename $file)
# save the file name without the file ending for convenience
file_prefix=$(basename $file .fastq)
# print a temporary script file that will be submitted to slurm
echo """#!/bin/bash -l
#SBATCH -A edu24.uppmax
#SBATCH -p shared
#SBATCH -c 1
#SBATCH -t 00:05:00
#SBATCH -J $file_basename
# make the dummy pipeline available on the calculation node
echo "Loading modules"
export PATH=\$PATH:/sw/courses/ngsintro/hpc/pipeline/dummy_scripts
# go to the input files
cd $input_absolute_path
# align the reads
echo "Aligning the reads"
align_reads -r ~/ngsintro/filetypes/0_ref/ad2.fa -i $file_basename -o $file_prefix.sam
# convert the SAM file to a BAM file
echo "Converting sam to bam"
sambam_tool -f bam -i $file_prefix.sam -o $file_prefix.bam
# sort the BAM file
echo "Sorting the bam file"
sambam_tool -f sort -i $file_prefix.bam -o $file_prefix.sorted.bam
# index the BAM file
echo "Indexing the sorted bam file"
sambam_tool -f index -i $file_prefix.sorted.bam
# rename the index file to the same as the .bam file but with different file ending
mv $file_prefix.sorted.bam.bai $file_prefix.sorted.bai
# delete the files you don't want to keep
rm $file_prefix.sam $file_prefix.bam
echo "Finished"
""" > tmp.sbatch
# submit the temporary script file
sbatch tmp.sbatch
done
# remove the temporary job file now that everything has been submitted
rm tmp.sbatch