1 Zoom link for help over zoom
2 Connect to PDC
The first step of this lab is to open a ssh connection to PDC. Please refer to Connecting to PDC for instructions. Once connected to PDC, return here and continue reading the instructions below.
3 Logon to a node
Usually you would do most of the work in this lab directly on one of the login nodes, but we have arranged for you to have one core each for better performance. This was covered briefly in the lecture notes.
salloc -A edu24.uppmax --reservation=edu24-11-25 -t 07:00:00 -p shared -n 1
check which node you got (replace username
with your username)
squeue -u username
should look something like this
user@login1 ~ $ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5583899 shared interact user R 2:22 1 nid001009
user@login1 ~ $
where nid001009
is the name of the node I got (yours will probably be different). Note the numbers in the Time column. They show for how long the job has been running. When it reaches the time limit you requested (7 hours in this case) the session will shut down, and you will lose all unsaved data. Connect to this node from the login node.
ssh -Y nid001009
5 Copy lab files
Now you will need some files. The files are located in the folder
/sw/courses/ngsintro/linux/linux_tutorial
or they can be downloaded if you are not logged on the cluster at the moment, files.tar.gz (instruction on how to download further down)
For structures sake, first create a folder called linux_tutorial
inside the workspace folder, where you can put all your files belonging to this lab.
mkdir ~/ngsintro/linux_tutorial
Next, copy the lab files to this folder.
# syntax
cp -r <source-folder> <destination-folder>
cp -r /sw/courses/ngsintro/linux/linux_tutorial/* ~/ngsintro/linux_tutorial
-r
denotes recursively, which means all the files including sub-folders of the source folder. Without it, only files directly in the source folder would be copied, not sub-folders and files in sub-folders.
Remember to tab-complete to avoid typos and too much writing.
If you are unable to copy the files on PDC, you can download the files from this link instead of copying them. This is done with the command wget
(web get). It works kind of the same way as the cp
command, but you give it a source URL instead of a source file, and you can specify the destination by giving it a prefix, a path that will be appended in front on the file name when it’s downloaded.
i.e; if you want to download the file http://somewhere.com/my.file
and you give it the prefix ~/analysis/
, the downloaded file will be saved as ~/analysis/my.file
.
# Ex:
wget -P ~/analysis/ http://somewhere.com/my.file
# or download the file to where you are standing by not specifying a prefix
wget https://nbisweden.github.io/workshop-ngsintro/2411/topics/linux/assets/files.tar.gz
6 Unpack files
Go to the folder you just copied and see what is in it.
Remember to tab-complete to avoid typos and too much writing.
cd ~/ngsintro/linux_tutorial
ll
tar.gz
is a file ending given to compressed files, something you will encounter quite often. Compression decreases the size of the files which is good when downloading, and it can take thousands of files and compress them all into a single compressed file. This is both convenient for the person downloading and speeds up the transfer more than you would think.
To unpack the files.tar.gz
file use the following line while standing in the newly copied linux_tutorial
folder.
tar -xzvf files.tar.gz
The command will always be the same for all tar.gz
files you want to unpack. -xzvf
means eXtract from a Zipped file, Verbose (prints the name of the file being unpacked), from the specified File (f
must always be the last of the letters).
Look in the folder again and see what we just unpacked:
[user@login1 linux_tutorial]$ ll
total 512
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 a_strange_name
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 backed_up_proj_folder
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 external_hdd
-rwxrwxr-x 1 user g20XXXXX 17198 Sep 24 13:19 files.tar.gz
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 important_results
drwxrwsr-x 2 user g20XXXXX 129024 Sep 19 2012 many_files
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 old_project
-rwxrwxr-x 1 user g20XXXXX 0 Sep 24 13:19 other_file.old
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_1
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_2
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_has_a_file
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_is_empty
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 useless_file
7 Copying and moving files
Let’s move some files. Moving files might be one of the more common things you do, after cd
and ls
. You might want to organize your files in a better way, or move important result files to the project folder, who knows?
We will start with moving our important result to a backed-up folder. When months of analysis is done, the last thing you want is to lose your files. Typically this would mean that you move the final results to your project folder.
In this example, we want to move the result files only, located in the folder important_results
, to our fake project folder, called backed_up_proj_folder
.
The syntax for the move command is:
mv <source> <destination>
First, take a look inside the important_results
folder:
[user@login1 linux_tutorial]$ ll important_results/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 dna_data_analysis_result_file_that_is_important-you_should_really_use_tab_completion_for_file_names.bam
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 temp_file-1
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 temp_file-2
You see that there are some unimportant temporary files that you have no interest in. Just to demonstrate the move command, I will show you how to move one of these temporary files to your backed-up project folder:
mv important_results/temp_file-1 backed_up_proj_folder/
Now do the same, but move the important DNA data file!
Look in the backed-up project folder to make sure you moved the file correctly.
[user@login1 linux_tutorial]$ ll backed_up_proj_folder/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 dna_data_analysis_result_file_that_is_important-you_should_really_use_tab_completion_for_file_names.bam
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 last_years_data
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 temp_file-1
Another use for the move
command is to rename things. When you think of it, renaming is just a special case of moving. You move the file to a location and give the file a new name in the process. The location you move the file to can very well be the same folder the file already is in. To give this a try, we will rename the folder a_strange_name
to a better name.
mv a_strange_name a_better_name
Look around to see that the name change worked.
[user@login1 linux_tutorial]$ mv a_strange_name a_better_name
[user@login1 linux_tutorial]$ ll
total 448
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 a_better_name
drwxrwsr-x 2 user g20XXXXX 2048 Sep 24 13:40 backed_up_proj_folder
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 external_hdd
-rwxrwxr-x 1 user g20XXXXX 17198 Sep 24 13:36 files.tar.gz
drwxrwsr-x 2 user g20XXXXX 2048 Sep 24 13:40 important_results
drwxrwsr-x 2 user g20XXXXX 129024 Sep 19 2012 many_files
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 old_project
-rwxrwxr-x 1 user g20XXXXX 0 Sep 24 13:36 other_file.old
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_1
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_2
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_has_a_file
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_is_empty
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 useless_file
Sometimes you don’t want to move things, you want to copy them. Moving a file will remove the original file, whereas copying the file will leave the original untouched. An example when you want to do this could be that you want to give a copy of a file to a friend. Imagine that you have a external hard drive that you want to place the file on. The file you want to give to your friend is data from last years project, which is located in your backed up project folder, backed_up_proj_folder/last_years_data
As with the move command, the syntax is
cp <source> <destination>
cp backed_up_proj_folder/last_years_data external_hdd/
Take a look in the external_hdd
to make sure the file got copied.
[user@login1 linux_tutorial]$ cp backed_up_proj_folder/last_years_data external_hdd/
[user@login1 linux_tutorial]$ ll external_hdd/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 24 13:46 last_years_data
8 Deleting files
Sometimes you will delete files. Usually this is when you know that the file or files are useless to you, and they only take up space on your hard drive or your project’s folder.
To delete a file, we use the ReMove command, rm
. Syntax:
# syntax
rm <file to remove>
If you want, you can also specify multiple files at once, as many as you want!
rm <file to remove> <file to remove> <file to remove> <file to remove> <file to remove>
Danger
There is no trash bin in Linux. If you delete a file, it is gone. So be careful when deleting stuff.
Try it out by deleting the useless file in the folder you are standing in. First, look around in the folder to see the file.
[user@login1 linux_tutorial]$ ll
total 448
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 a_better_name
drwxrwsr-x 2 user g20XXXXX 2048 Sep 24 13:40 backed_up_proj_folder
drwxrwsr-x 2 user g20XXXXX 2048 Sep 24 13:46 external_hdd
-rwxrwxr-x 1 user g20XXXXX 17198 Sep 24 13:36 files.tar.gz
drwxrwsr-x 2 user g20XXXXX 2048 Sep 24 13:40 important_results
drwxrwsr-x 2 user g20XXXXX 129024 Sep 19 2012 many_files
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 old_project
-rwxrwxr-x 1 user g20XXXXX 0 Sep 24 13:36 other_file.old
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_1
drwxrwsr-x 2 user g20XXXXX 2048 Sep 19 2012 part_2
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_has_a_file
drwxrwsr-x 2 user g20XXXXX 2048 Jan 28 2012 this_is_empty
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 useless_file
Now remove it.
rm useless_file
Similarly, folders can be removed too. There is even a special command for removing folders, rmdir
. They work similar to rm
, except that they can’t remove files. There are two folders, this_is_empty
and this_has_a_file
, that we now will delete.
rmdir this_is_empty
rmdir this_has_a_file
If you look inside this_has_a_file
,
[user@login1 linux_tutorial]$ ll this_has_a_file
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Jan 28 2012 file
you see that there is a file in there! Only directories that are completely empty can be deleted using rmdir
. To be able to delete this_has_a_file
, either delete the file manually and then remove the folder
rm this_has_a_file/file
rmdir this_has_a_file
or delete the directory recursively, which will remove this_has_a_file
and everything inside:
rm -r this_has_a_file
9 Open files
So what happens if you give your files bad names like file1
or results
? You take a break in a project and return to it 4 months later, and all those short names you gave your files doesn’t tell you at all what the files actually contain.
Of course, this never happens because you ALWAYS name your files so that you definitely know what they contain (right?). But let’s say it did happen. Then the only way out is to look at the contents of the files and try to figure out if it is the file you are looking for.
Now, we are looking for that really good script we wrote a couple of months ago in that other project. Look in the project’s folder, old_project
and find the script.
[user@login1 linux_tutorial]$ ll old_project/
total 96
-rwxrwxr-x 1 user g20XXXXX 39904 Sep 19 2012 a
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 stuff_1
-rwxrwxr-x 1 user g20XXXXX 1008 Sep 19 2012 the_best
Not so easy with those names.. We will have to use less
to look at the files and figure out which is which.
# syntax
less <filename>
Press q
to close it down, use arrows keys to scroll up/down.
Have a look at the_best
, that must be our script, right?
less old_project/the_best
I guess not. Carrot cakes might be the bomb, but they won’t solve bioinformatic problems. Have a look at the file a
instead.
That’s more like it!
Now imagine that you had hundreds of files with weird names, and you really needed to find it. Lesson learned: name your files so that you know what they are! And don’t be afraid to create folders to organise files.
Another thing to think about when opening files in Linux is which program should you open the file in? The programs we covered during the lectures are nano
and less
. The main difference between these programs in that less can’t edit files, only view them. Another difference is that less doesn’t load the whole file into the RAM memory when opening it.
So, why care about how the program works? I’ll show you why. This time we will be opening a larger file, located in the course’s project folder. It’s 65 megabytes, so it is a tiny file compared with bio-data. Normal sequencing files can easily be 100-1000 times larger than this.
First, open the file with nano
.
# load the nano module if you have not done so already
module load nano
# syntax
nano <filename>
nano /sw/courses/ngsintro/linux/linux_additional-files/large_file
Press Ctrl+X
to close it down, use arrows to scroll up/down).
Is the file loaded yet? Now take that waiting time and multiply it with 100-1000. Now open the file with less. Notice the difference?
head
and tail
works the same was as less
in this regard. They don’t load the whole file into RAM, they just take what they need.
To view the first rows of the large file, use head
.
# syntax
head <filename>
head /sw/courses/ngsintro/linux/linux_additional-files/large_file
Remember how to view an arbitrary number of first rows in a file?
# syntax
head -n <number of rows to view> <filename>
head -n 23 /sw/courses/ngsintro/linux/linux_additional-files/large_file
The same syntax for viewing the last rows of a file with tail:
# syntax
tail <filename>
tail /sw/courses/ngsintro/linux/linux_additional-files/large_file
# syntax
tail -n <number of rows to view> <filename>
tail -n 23 /sw/courses/ngsintro/linux/linux_additional-files/large_file
10 Wildcards
Sometimes (most of the time really) you have many files. So many that it would take you a day just to type all their names. This is where wildcards saves the day. The wildcard symbol in Linux is the star sign, *
, and it means literally anything. Say that you want to move all the files which has names starting with sample_1_
and the rest of the name doesn’t matter. You want all the files belonging to sample_1
. Then you could use the wildcard to represent the rest of the name.
DO NOT run this command, it’s just an example.
mv sample_1_* my_other_folder
We can try it out on the example files I have prepared. There are two folders called part_1
and part_2
. We want to collect all the .txt
files from both these folders in one of the folders. Look around in both the folders to see what they contain.
[user@login1 linux_tutorial]$ ll part_1/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 file_1.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 file_2.txt
[user@login1 linux_tutorial]$ ll part_2
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 file_3.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 file_4.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 garbage.tmp
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19 2012 incomplete_datasets.dat
We see that part_1
only contains .txt
files, and that part_2
contains some other files as well. The best option seems to be to move all .txt
files from part_2
info part_1
.
mv part_2/*.txt part_1/
The wildcard works with most, if not all, Linux commands. We can try using wildcards with ls
. Look in the folder many_files
. Yes, there are hundreds of .docx
files in there. But, there are a couple of .txt
files in there as well. Find out how many .docx
and .txt
files exist.
Try to figure out the solution on your own. Then check the answer below.
ll many_files/*.docx
ll many_files/*.txt
11 Utility commands
Ok, the last 2 commands for now are top
and man
.
top
can be useful when you want to look at which programs are being run on the computer, and how hard the computer is working. Type top
and have a look.
top
Press q
to exit.
Tasks: 376 total, 2 running, 290 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.7 us, 1.3 sy, 0.0 ni, 95.3 id, 0.1 wa, 0.0 hi, 0.6 si, 0.0 st
KiB Mem : 32590776 total, 16233548 free, 8394804 used, 7962424 buff/cache
KiB Swap: 99999744 total, 99999744 free, 0 used. 22658832 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3286 roy 20 0 4557248 522400 170808 R 12.3 1.6 62:49.20 gnome-shell
3113 roy 20 0 1282356 385012 290540 S 8.0 1.2 42:00.49 Xorg
22213 roy 20 0 5474576 544848 101592 S 5.6 1.7 102:55.33 zoom
6186 roy 20 0 710140 60504 35836 S 3.0 0.2 0:00.62 terminator
4248 roy 20 0 2737604 556212 140580 S 2.7 1.7 54:51.48 QtWebEngineProc
4632 roy 20 0 4866068 0.993g 281532 S 2.7 3.2 69:18.68 firefox
6548 roy 20 0 3703060 509340 189452 S 2.7 1.6 15:26.80 Web Content
9338 roy 20 0 4407324 846700 213324 S 2.7 2.6 15:53.71 Web Content
4776 roy 20 0 3310524 318364 102700 S 2.0 1.0 8:53.07 WebExtensions
6595 roy 20 0 4133152 992224 187540 S 1.3 3.0 18:51.05 Web Content
952 root -51 0 0 0 0 S 1.0 0.0 2:40.89 irq/51-SYNA2393
7800 roy 20 0 1213744 238536 129392 S 1.0 0.7 11:15.74 atom
1 root 20 0 226080 9836 6692 S 0.7 0.0 2:07.87 systemd
6690 roy 20 0 3492596 560304 166588 S 0.7 1.7 8:08.77 Web Content
12895 roy 20 0 3320332 294820 172212 S 0.7 0.9 7:05.93 Web Content
10 root 20 0 0 0 0 I 0.3 0.0 2:43.21 rcu_sched
1052 root 20 0 2505296 36228 22444 S 0.3 0.1 3:45.16 containerd
2631 gdm 20 0 4044492 198480 142328 S 0.3 0.6 1:55.32 gnome-shell
Each row in top corresponds to one program running on the computer, and the column describe various information about the program. The right-most column shows you which program the row is about.
There are mainly 2 things that are interesting when looking in top. The column %CPU
describes how much cpu is used by each program. If you are doing calculations, which is what bioinformatics is mostly about, the cpu usage should be high. The numbers in the column is how many percent of a core the program is running. If you have a computer with 8 cores, you can have 8 programs using 100% of a core each running at the same time without anything slowing down. As soon as you start a 9th program, it will have to share a core with another program and those 2 programs will run at half-speed since a core can only work that fast. In the example above, program gnome-shell
is using 12.3%
of a core.
The column %MEM
describes how much memory each program uses. The numbers mean how many percent of the total memory a program uses. In the example above, the program firefox is using 3.2%
of the total memory.
The area in the top describes the overall memory usage. Total
tells you how much memory the computer has, used
tells you how much of the memory is being used at the moment, and free
tells you how much memory is free at the moment.
Total = Used + Free
A warning sign you can look for in top is when you are running an analysis which seems to take forever to complete, and you see that there is almost no cpu usage on the computer. That means that the computer is not doing any calculation, which could be bad. If you look at the memory usage at the same time, and see that it’s maxed out (used 100% of total), you can just abort the analysis.
When the memory runs out, the computer more or less stops. Since it can’t fit everything into the RAM memory, it will start using the hard drive to store the things it can’t fit in the RAM. Since the hard drive is ~1000 times slower than the RAM, things will be going in slow-motion. The solution could be to either change the settings of the program you are running to decrease the memory usage (if the program has that functionality), or just get a computer with more memory.
You might wonder how the heck am I supposed to be able to remember all these commands, options and flags? The simple answer is that you won’t. Not all of them at least. You might remember ls
, but was it -l
or -a
you should use to see hidden files? You might wish that there was a manual for these things.
Good news everyone, there is a manual! To get all the nitty-gritty details about ls
, you use the man
command.
# syntax
man <command you want to look at>
man ls
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
fied.
Mandatory arguments to long options are mandatory for short options too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
--author
with -l, print the author of each file
-b, --escape
print C-style escapes for nongraphic characters
--block-size=SIZE
scale sizes by SIZE before printing them; e.g., '--block-size=M'
Manual page ls(1) line 1 (press h for help or q to quit)
This will open a less window (remember, q
to close it down, arrows to scroll) with the manual page about ls
. Here you will be able to read everything about ls
. You’ll see which flag does what (-a
is to show the hidden files, which in linux are files with a name starting with a dot .
), which syntax the program has, etc. If you are unsure about how to use a command, look it up using man
.
The man
pages can be a bit tricky to understand at first, but you get used to it with time. If it is still unclear, try searching for it on the internet. You are bound to find someone with the exact same question as you, that has already asked on a forum, and gotten a good answer. 5 years ago.
If you still have time left on the lab and you finished early, check out the Linux file permissions lab.