In code blocks, the dollar sign ($) is not to be printed. The dollar sign is usually an indicator that the text following it should be typed in a terminal window.

1 Connect to UPPMAX

The first step of this lab is to open an SSH connection to UPPMAX. Please go to the Contents page and click “Connecting to UPPMAX” under the “Additional content” topic. Unfortunately, we can’t link to it directly from this page due to technical limitations in the web framework we have to use for the course. Once connected to UPPMAX, return here and continue reading the instructions below.

2 Logon to a node

Usually you would do most of the work in this lab directly on one of the login nodes at UPPMAX, but we have arranged for you to have one core each for better performance. This was covered briefly in the lecture notes.

$ salloc -A snic2022-22-769 -t 07:00:00 -p core -n 1 --no-shell --reservation=snic2022-22-769_1 &

Check which node you got (replace username with your UPPMAX username):

$ squeue -u username

The output should look something like this:

dahlo@rackham2 work $ squeue -u dahlo
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3132376      core       sh    dahlo  R       0:04      1 r292
dahlo@rackham2 work $

where r292 is the name of the node I got (yours will probably be different). Note the numbers in the TIME column. They show how long the job has been running. When it reaches the time limit you requested (7 hours in this case) the session will shut down, and you will lose all unsaved data. Connect to this node from within UPPMAX.

$ ssh -Y r292

There is also a UPPMAX-specific tool called jobinfo that supplies the same kind of information as squeue ($ jobinfo -u username).

4 Copy lab files

Now you will need some files. To avoid all the course participants editing the same file all at once, undoing each other’s edits, each participant will get their own copy of the needed files.

The files are located in the folder /sw/courses/ngsintro/linux/linux_tutorial

Alternatively, if you are not on UPPMAX at the moment, they can be downloaded as files.tar.gz (instructions on how to download are further down).

For structure’s sake, first create a folder with your username in the nobackup folder, and a folder called linux_tutorial inside that folder, where you can put all your lab files.

This can be done in 2 ways:

$ mkdir /proj/snic2022-22-769/nobackup/username
$ mkdir /proj/snic2022-22-769/nobackup/username/linux_tutorial

or

$ mkdir -p /proj/snic2022-22-769/nobackup/username/linux_tutorial

The reason there are two ways is that Linux will not like it if you try to create the folder linux_tutorial inside a folder (the one named like your username) that does not exist yet. You then have the choice to either first create the one named like your username (the first way), or to tell Linux to create it for you by giving mkdir the -p option (the second way).
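If you want to see the difference for yourself without touching the project folder, the following is a minimal sketch. It assumes a throwaway folder created with mktemp, and the folder name username is just a placeholder; neither is part of the lab files.

```shell
# Work in a throwaway folder so nothing in the project folder is touched.
# (The scratch folder and the name "username" are assumptions for illustration.)
scratch=$(mktemp -d)

# Without -p, mkdir refuses because the parent folder "username" does not exist yet.
if mkdir "$scratch/username/linux_tutorial" 2>/dev/null; then
    echo "created without -p"
else
    echo "mkdir failed: parent folder is missing"
fi

# With -p, mkdir creates the missing parent folder as well.
mkdir -p "$scratch/username/linux_tutorial"
ls "$scratch/username"
```

The first attempt should fail and the second should succeed, leaving linux_tutorial inside the newly created username folder.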

Next, copy the lab files to this folder.

$ cp -r <source-folder> <destination-folder>
$ cp -r /sw/courses/ngsintro/linux/linux_tutorial/* /proj/snic2022-22-769/nobackup/username/linux_tutorial

-r stands for recursive, which means that all the files are copied, including sub-folders of the source folder and everything inside them. Without it, only files directly in the source folder would be copied, NOT sub-folders and files in sub-folders.
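The effect of -r can be tried out on a small made-up folder. This is only a sketch: the scratch folder and the names source, dest_plain and dest_recursive are invented for the example and are not part of the lab files.

```shell
# Build a tiny source folder with one file and one sub-folder (all names are made up).
scratch=$(mktemp -d)
mkdir -p "$scratch/source/sub" "$scratch/dest_plain" "$scratch/dest_recursive"
touch "$scratch/source/top.txt" "$scratch/source/sub/nested.txt"

# Without -r, cp copies the file but skips the sub-folder.
# cp warns about the skipped folder; the warning is hidden here.
cp "$scratch"/source/* "$scratch/dest_plain" 2>/dev/null || true
ls "$scratch/dest_plain"

# With -r, the sub-folder and the file inside it are copied too.
cp -r "$scratch"/source/* "$scratch/dest_recursive"
ls -R "$scratch/dest_recursive"
```

After running it, dest_plain should only contain top.txt, while dest_recursive should also contain sub and the file inside it.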

Remember to tab-complete to avoid typos and too much writing.

If you are unable to copy the files on UPPMAX, you can download the files from this link instead of copying them. This is done with the command wget (web get). It works much the same way as the cp command, but you give it a source URL instead of a source file, and you specify the destination by giving it a prefix, a path that will be put in front of the file name when it’s downloaded.

For example, if you want to download the file http://somewhere.com/my.file and you give it the prefix ~/analysis/, the downloaded file will be saved as ~/analysis/my.file.

$ wget -P <destination prefix> <source URL>

5 Unpack files

Go to the folder you just copied and see what is in it.

Remember to tab-complete to avoid typos and too much writing.

$ cd /proj/snic2022-22-769/nobackup/username/linux_tutorial
$ ll

tar.gz is a file ending given to compressed archives, something you will encounter quite often. An archive can bundle thousands of files into a single file, and compression decreases its size, which is good when downloading. This is both convenient for the person downloading and speeds up the transfer more than you would think.

To unpack the files.tar.gz file use the following line while standing in the newly copied linux_tutorial folder.

$ tar -xzvf files.tar.gz

The command will always be the same for all tar.gz files you want to unpack. -xzvf means eXtract from a Zipped (gzip-compressed) file, Verbose (prints the name of each file being unpacked), from the specified File (f must always be the last of the letters, since the file name has to follow directly after it).
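To get a feeling for what the letters do, you can pack and unpack a small archive of your own. The sketch below goes slightly beyond what the lab covers: it uses -c (create) and -t (list), which are standard tar options, and it works on a made-up folder called demo in a scratch folder rather than on files.tar.gz.

```shell
# Create a small made-up folder to pack (names are for illustration only).
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo
touch demo/file_1.txt demo/file_2.txt

# Pack: c = create, z = gzip-compress, f = the archive file name follows.
tar -czf demo.tar.gz demo
rm -r demo

# Peek inside without unpacking: t = list the contents.
tar -tzf demo.tar.gz

# Unpack with the same flags as in the lab.
tar -xzvf demo.tar.gz
```

Listing with -t before unpacking is a handy way to check what an archive will create in your folder.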

Look in the folder again and see what we just unpacked:

[user@milou2 linux_tutorial]$ ls -la
total 512
drwxrwsr-x 12 user g20XXXXX   2048 Sep 24 13:19 .
drwxrwsr-x  6 user g20XXXXX   2048 Sep 24 13:19 ..
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 a_strange_name
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 backed_up_proj_folder
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 external_hdd
-rwxrwxr-x  1 user g20XXXXX  17198 Sep 24 13:19 files.tar.gz
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 important_results
drwxrwsr-x  2 user g20XXXXX 129024 Sep 19  2012 many_files
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 old_project
-rwxrwxr-x  1 user g20XXXXX      0 Sep 24 13:19 other_file.old
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 part_1
drwxrwsr-x  2 user g20XXXXX   2048 Sep 19  2012 part_2
drwxrwsr-x  2 user g20XXXXX   2048 Jan 28  2012 this_has_a_file
drwxrwsr-x  2 user g20XXXXX   2048 Jan 28  2012 this_is_empty
-rwxrwxr-x  1 user g20XXXXX      0 Sep 19  2012 useless_file
[user@milou2 linux_tutorial]$

6 Copying and moving files

Let’s move some files. Moving files might be one of the more common things you do, after cd and ls. You might want to organize your files in a better way, or move important result files to the project folder, who knows?

We will start with moving our important result to a backed-up folder. When months of analysis is done, the last thing you want is to lose your files. Typically this would mean that you move the final results to your project folder.

In this example, we want to move the result files only, located in the folder important_results, to our fake project folder, called backed_up_proj_folder.

The syntax for the move command is:

$ mv <source> <destination>

First, take a look inside the important_results folder:

[user@milou2 linux_tutorial]$ ll important_results/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 dna_data_analysis_result_file_that_is_important-you_should_really_use_tab_completion_for_file_names.bam
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 temp_file-1
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 temp_file-2
[user@milou2 linux_tutorial]$

You see that there are some unimportant temporary files that you have no interest in. Just to demonstrate the move command, I will show you how to move one of these temporary files to your backed-up project folder:

$ mv important_results/temp_file-1 backed_up_proj_folder/

Now do the same, but move the important DNA data file!

Look in the backed-up project folder to make sure you moved the file correctly.

[user@milou2 linux_tutorial]$ ll backed_up_proj_folder/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 dna_data_analysis_result_file_that_is_important-you_should_really_use_tab_completion_for_file_names.bam
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 last_years_data
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 temp_file-1
[user@milou2 linux_tutorial]$

Another use for the move command is to rename things. When you think of it, renaming is just a special case of moving. You move the file to a location and give the file a new name in the process. The location you move the file to can very well be the same folder the file already is in. To give this a try, we will rename the folder a_strange_name to a better name.

$ mv a_strange_name a_better_name

Look around to see that the name change worked.

[user@milou2 linux_tutorial]$ mv a_strange_name a_better_name
[user@milou2 linux_tutorial]$ ll
total 448
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 a_better_name
drwxrwsr-x 2 user g20XXXXX   2048 Sep 24 13:40 backed_up_proj_folder
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 external_hdd
-rwxrwxr-x 1 user g20XXXXX  17198 Sep 24 13:36 files.tar.gz
drwxrwsr-x 2 user g20XXXXX   2048 Sep 24 13:40 important_results
drwxrwsr-x 2 user g20XXXXX 129024 Sep 19  2012 many_files
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 old_project
-rwxrwxr-x 1 user g20XXXXX      0 Sep 24 13:36 other_file.old
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 part_1
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 part_2
drwxrwsr-x 2 user g20XXXXX   2048 Jan 28  2012 this_has_a_file
drwxrwsr-x 2 user g20XXXXX   2048 Jan 28  2012 this_is_empty
-rwxrwxr-x 1 user g20XXXXX      0 Sep 19  2012 useless_file
[user@milou2 linux_tutorial]$
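Both uses of mv, moving and renaming, can be practised safely in a scratch folder. The following is a minimal sketch; the folder and file names (results, backup, report.txt, old_results) are invented for the example.

```shell
# Set up a scratch folder with made-up names.
scratch=$(mktemp -d)
cd "$scratch"
mkdir results backup
touch results/report.txt

# Move: the file disappears from results/ and appears in backup/.
mv results/report.txt backup/

# Rename: "move" the folder to a new name in the same location.
mv results old_results

ls backup
ls
```

The first ls should show report.txt inside backup, and the second should show backup and old_results, with results gone.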

Sometimes you don’t want to move things, you want to copy them. Moving a file will remove the original file, whereas copying the file will leave the original untouched. An example of when you want to do this could be that you want to give a copy of a file to a friend. Imagine that you have an external hard drive that you want to place the file on. The file you want to give to your friend is data from last year’s project, which is located in your backed-up project folder, backed_up_proj_folder/last_years_data

As with the move command, the syntax is

$ cp <source> <destination>
$ cp backed_up_proj_folder/last_years_data external_hdd/

Take a look in the external_hdd to make sure the file got copied.

[user@milou2 linux_tutorial]$ cp backed_up_proj_folder/last_years_data external_hdd/
[user@milou2 linux_tutorial]$ ll external_hdd/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 24 13:46 last_years_data
[user@milou2 linux_tutorial]$
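To convince yourself that copying really leaves the original untouched, you can compare the two files afterwards. This sketch uses made-up folders (project and hdd) in a scratch folder, and the cmp command, which is not covered in this lab; cmp prints nothing and succeeds when two files are identical.

```shell
# Set up a scratch folder with a small made-up data file.
scratch=$(mktemp -d)
cd "$scratch"
mkdir project hdd
echo "data from last year" > project/last_years_data

# Copy the file; the original stays where it is.
cp project/last_years_data hdd/

# Both folders now contain the file, and the contents are the same.
ls project hdd
cmp project/last_years_data hdd/last_years_data && echo "identical"
```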

7 Deleting files

Sometimes you will delete files. Usually this is when you know that the file or files are useless to you, and they only take up space on your hard drive or UPPMAX account.

To delete a file, we use the ReMove command, rm. Syntax:

$ rm <file to remove>

If you want, you can also specify multiple files at once, as many as you want!

$ rm <file to remove> <file to remove> <file to remove> <file to remove> <file to remove>

IMPORTANT: There is no trash bin in Linux. If you delete a file, it is gone. So be careful when deleting stuff.

Try it out by deleting the useless file in the folder you are standing in. First, look around in the folder to see the file.

[user@milou2 linux_tutorial]$ ll
total 448
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 a_better_name
drwxrwsr-x 2 user g20XXXXX   2048 Sep 24 13:40 backed_up_proj_folder
drwxrwsr-x 2 user g20XXXXX   2048 Sep 24 13:46 external_hdd
-rwxrwxr-x 1 user g20XXXXX  17198 Sep 24 13:36 files.tar.gz
drwxrwsr-x 2 user g20XXXXX   2048 Sep 24 13:40 important_results
drwxrwsr-x 2 user g20XXXXX 129024 Sep 19  2012 many_files
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 old_project
-rwxrwxr-x 1 user g20XXXXX      0 Sep 24 13:36 other_file.old
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 part_1
drwxrwsr-x 2 user g20XXXXX   2048 Sep 19  2012 part_2
drwxrwsr-x 2 user g20XXXXX   2048 Jan 28  2012 this_has_a_file
drwxrwsr-x 2 user g20XXXXX   2048 Jan 28  2012 this_is_empty
-rwxrwxr-x 1 user g20XXXXX      0 Sep 19  2012 useless_file
[user@milou2 linux_tutorial]$

Now remove it.

$ rm useless_file

Similarly, folders can be removed too. There is even a special command for removing folders, rmdir. It works similarly to rm, except that it can’t remove files. There are two folders, this_is_empty and this_has_a_file, that we now will delete.

$ rmdir this_is_empty
$ rmdir this_has_a_file

If you look inside this_has_a_file,

[user@milou2 linux_tutorial]$ ll this_has_a_file
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Jan 28  2012 file
[user@milou2 linux_tutorial]$

you see that there is a file in there! Only directories that are completely empty can be deleted using rmdir. To be able to delete this_has_a_file, either delete the file manually and then remove the folder

$ rm this_has_a_file/file
$ rmdir this_has_a_file

or delete the directory recursively, which will remove this_has_a_file and everything inside:

$ rm -r this_has_a_file
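The difference between rmdir and rm -r can be repeated as often as you like in a scratch folder. Here is a minimal sketch with made-up folder names (empty_dir and full_dir); it is not part of the lab files.

```shell
# Set up one empty folder and one folder containing a file (made-up names).
scratch=$(mktemp -d)
cd "$scratch"
mkdir empty_dir full_dir
touch full_dir/file

# rmdir works on the empty folder.
rmdir empty_dir

# rmdir refuses to remove a folder that still has something in it.
if rmdir full_dir 2>/dev/null; then
    echo "removed full_dir"
else
    echo "rmdir refused: full_dir is not empty"
fi

# rm -r removes the folder and everything inside it. There is no undo!
rm -r full_dir
ls -A
```

The final ls -A should print nothing, since both folders are gone.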

8 Open files

So what happens if you give your files bad names like file1 or results? You take a break in a project and return to it 4 months later, and all those short names you gave your files don’t tell you at all what the files actually contain.

Of course, this never happens because you ALWAYS name your files so that you definitely know what they contain. But let’s say it did happen. Then the only way out is to look at the contents of the files and try to figure out if it is the file you are looking for.

Now, we are looking for that really good script we wrote a couple of months ago in that other project. Look in the project’s folder, old_project and find the script.

[user@milou2 linux_tutorial]$ ll old_project/
total 96
-rwxrwxr-x 1 user g20XXXXX 39904 Sep 19  2012 a
-rwxrwxr-x 1 user g20XXXXX     0 Sep 19  2012 stuff_1
-rwxrwxr-x 1 user g20XXXXX  1008 Sep 19  2012 the_best
[user@milou2 linux_tutorial]$

Not so easy with those names... We will have to use less to look at the files and figure out which is which.

$ less <filename>

Press q to close it down, and use the arrow keys to scroll up/down.

Have a look at the_best, that must be our script, right?

$ less old_project/the_best

I guess not. Carrot cakes might be the bomb, but they won’t solve bioinformatic problems. Have a look at the file a instead.

That’s more like it!

Now imagine that you had hundreds of files with weird names, and you really needed to find it. Lesson learned: name your files so that you know what they are! And don’t be afraid to create folders to organise files.

Another thing to think about when opening files in Linux is which program you should open the file in. The programs we covered during the lectures are nano and less. The main difference between these programs is that less can’t edit files, only view them. Another difference is that less doesn’t load the whole file into RAM when opening it.

So, why care about how the program works? I’ll show you why. This time we will be opening a larger file, located in the course’s project folder. It’s 65 megabytes, so it is a tiny file compared with bio-data. Normal sequencing files can easily be 100-1000 times larger than this.

First, open the file with nano.

$ nano <filename>
$ nano /sw/courses/ngsintro/linux/linux_additional-files/large_file

Press Ctrl+X to close it down, and use the arrow keys to scroll up/down.

Is the file loaded yet? Now take that waiting time and multiply it by 100-1000. Now open the file with less. Notice the difference?

head and tail work the same way as less in this regard. They don’t load the whole file into RAM, they just take what they need.

To view the first rows of the large file, use head.

$ head <filename>

$ head /sw/courses/ngsintro/linux/linux_additional-files/large_file

Remember how to view an arbitrary number of first rows in a file?

$ head -n <number of rows to view> <filename>

$ head -n 23 /sw/courses/ngsintro/linux/linux_additional-files/large_file

The same syntax for viewing the last rows of a file with tail:

$ tail <filename>

$ tail /sw/courses/ngsintro/linux/linux_additional-files/large_file

$ tail -n <number of rows to view> <filename>

$ tail -n 23 /sw/courses/ngsintro/linux/linux_additional-files/large_file
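If you want a file where it is easy to see exactly which rows head and tail pick out, you can generate one yourself. The sketch below assumes the seq command (standard on Linux, but not covered in this lab) to write the numbers 1 to 100 into a made-up file called numbers.txt in a scratch folder.

```shell
# Create a file with the numbers 1 to 100, one per row (file name is made up).
scratch=$(mktemp -d)
seq 1 100 > "$scratch/numbers.txt"

# Without -n, head shows the first 10 rows.
head "$scratch/numbers.txt"

# The first 3 rows and the last 3 rows.
head -n 3 "$scratch/numbers.txt"
tail -n 3 "$scratch/numbers.txt"
```

Since every row contains its own row number, the output tells you directly which rows were printed.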

9 Wildcards

Sometimes (most of the time really) you have many files. So many that it would take you a day just to type all their names. This is where wildcards save the day. The wildcard symbol in Linux is the star sign, * , and it matches literally anything, that is, any sequence of characters in a file name. Say that you want to move all the files that have names starting with sample_1_ and the rest of the name doesn’t matter. You want all the files belonging to sample_1. Then you could use the wildcard to represent the rest of the name.

DO NOT run this command, it’s just an example.

$ mv  sample_1_*  my_other_folder

We can try it out on the example files I have prepared. There are two folders called part_1 and part_2. We want to collect all the .txt files from both these folders in one of the folders. Look around in both the folders to see what they contain.

[user@milou2 linux_tutorial]$ ll part_1/
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 file_1.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 file_2.txt
[user@milou2 linux_tutorial]$ ll part_2
total 0
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 file_3.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 file_4.txt
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 garbage.tmp
-rwxrwxr-x 1 user g20XXXXX 0 Sep 19  2012 incomplete_datasets.dat
[user@milou2 linux_tutorial]$

We see that part_1 only contains .txt files, and that part_2 contains some other files as well. The best option seems to be to move all .txt files from part_2 into part_1.

$ mv part_2/*.txt part_1/
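A useful habit before moving or deleting with a wildcard is to check what the wildcard will match. One way to do that, sketched below, is to put echo in front of the pattern, since the shell fills in the matching names before the command runs. The folders here are a made-up copy of the lab layout inside a scratch folder, so your real part_1 and part_2 are not touched.

```shell
# Recreate a small version of the lab layout in a scratch folder.
scratch=$(mktemp -d)
cd "$scratch"
mkdir part_1 part_2
touch part_1/file_1.txt part_2/file_3.txt part_2/file_4.txt part_2/garbage.tmp

# echo shows which names the wildcard turns into, without moving anything.
echo part_2/*.txt

# Now do the actual move and look at the result.
mv part_2/*.txt part_1/
ls part_1 part_2
```

Only the .txt files should end up in part_1, and garbage.tmp should be left behind in part_2.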

The wildcard works with most, if not all, Linux commands. We can try using wildcards with ls. Look in the folder many_files. Yes, there are hundreds of .docx files in there. But, there are a couple of .txt files in there as well. Find out how many .docx and .txt files exist.

Try to figure out the solution on your own. Then check the answer below.

$ ll many_files/*.docx
$ ll many_files/*.txt
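With hundreds of files, counting the rows by eye gets tedious. One possible shortcut, which goes beyond what this lab covers, is to send the list of names to wc -l, a command that counts rows. The sketch below assumes a small made-up many_files folder in a scratch folder, so the numbers will not match the real lab folder.

```shell
# Create a small made-up folder with a few files of each kind.
scratch=$(mktemp -d)
cd "$scratch"
mkdir many_files
touch many_files/a.docx many_files/b.docx many_files/c.docx many_files/notes.txt

# ls prints one name per row when its output is passed on, and wc -l counts the rows.
docx_count=$(ls many_files/*.docx | wc -l)
txt_count=$(ls many_files/*.txt | wc -l)

echo "docx files: $docx_count"
echo "txt files: $txt_count"
```

The same two ls lines, run in the linux_tutorial folder, should give the counts for the real many_files folder.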

10 Utility commands

Ok, the last 2 commands for now are top and man.

top can be useful when you want to look at which programs are being run on the computer, and how hard the computer is working. Type top and have a look.

$ top

Press q to exit.

Tasks: 376 total,   2 running, 290 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.7 us,  1.3 sy,  0.0 ni, 95.3 id,  0.1 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 32590776 total, 16233548 free,  8394804 used,  7962424 buff/cache
KiB Swap: 99999744 total, 99999744 free,        0 used. 22658832 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                       
 3286 roy       20   0 4557248 522400 170808 R  12.3  1.6  62:49.20 gnome-shell                   
 3113 roy       20   0 1282356 385012 290540 S   8.0  1.2  42:00.49 Xorg                          
22213 roy       20   0 5474576 544848 101592 S   5.6  1.7 102:55.33 zoom                          
 6186 roy       20   0  710140  60504  35836 S   3.0  0.2   0:00.62 terminator                    
 4248 roy       20   0 2737604 556212 140580 S   2.7  1.7  54:51.48 QtWebEngineProc               
 4632 roy       20   0 4866068 0.993g 281532 S   2.7  3.2  69:18.68 firefox                       
 6548 roy       20   0 3703060 509340 189452 S   2.7  1.6  15:26.80 Web Content                   
 9338 roy       20   0 4407324 846700 213324 S   2.7  2.6  15:53.71 Web Content                   
 4776 roy       20   0 3310524 318364 102700 S   2.0  1.0   8:53.07 WebExtensions                 
 6595 roy       20   0 4133152 992224 187540 S   1.3  3.0  18:51.05 Web Content                   
  952 root     -51   0       0      0      0 S   1.0  0.0   2:40.89 irq/51-SYNA2393               
 7800 roy       20   0 1213744 238536 129392 S   1.0  0.7  11:15.74 atom                          
    1 root      20   0  226080   9836   6692 S   0.7  0.0   2:07.87 systemd                       
 6690 roy       20   0 3492596 560304 166588 S   0.7  1.7   8:08.77 Web Content                   
12895 roy       20   0 3320332 294820 172212 S   0.7  0.9   7:05.93 Web Content                   
   10 root      20   0       0      0      0 I   0.3  0.0   2:43.21 rcu_sched                     
 1052 root      20   0 2505296  36228  22444 S   0.3  0.1   3:45.16 containerd                    
 2631 gdm       20   0 4044492 198480 142328 S   0.3  0.6   1:55.32 gnome-shell

Each row in top corresponds to one program running on the computer, and the columns describe various information about the program. The right-most column shows you which program the row is about.

There are mainly 2 things that are interesting when looking in top. The column %CPU describes how much CPU is used by each program. If you are doing calculations, which is what bioinformatics is mostly about, the CPU usage should be high. The numbers in the column show what percentage of a core the program is using. If you have a computer with 8 cores, like the UPPMAX computers, you can have 8 programs using 100% of a core each running at the same time without anything slowing down. As soon as you start a 9th program, it will have to share a core with another program, and those 2 programs will run at half speed since a core can only work so fast. In the example above, the program gnome-shell is using 12.3% of a core.

The column %MEM describes how much memory each program uses. The numbers mean how many percent of the total memory a program uses. In the example above, the program firefox is using 3.2% of the total memory.

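The area at the top describes the overall memory usage. Total tells you how much memory the computer has, used tells you how much of the memory is being used by programs at the moment, free tells you how much memory is completely unused, and buff/cache tells you how much memory Linux is using for buffers and disk cache, which it can give back when programs need it.

Total = Used + Free + Buff/cache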
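You can check how the memory figures fit together using the numbers from the KiB Mem line in the example output above. In that output, free, used and buff/cache together add up to total. The sketch below simply copies those four numbers into variables (the variable names are made up) and lets the shell do the arithmetic.

```shell
# Figures in KiB, copied from the "KiB Mem" line of the example top output.
total=32590776
free=16233548
used=8394804
buff_cache=7962424

# The three parts add up to the total.
echo "free + used + buff/cache = $((free + used + buff_cache))"
echo "total                    = $total"

# Roughly how many percent of the memory programs are using (rounded down).
echo "used: $((100 * used / total))% of total"
```

On this example computer about a quarter of the memory is in use, so it is nowhere near the maxed-out situation described in the warning sign.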

A warning sign you can look for in top is when you are running an analysis which seems to take forever to complete, and you see that there is almost no cpu usage on the computer. That means that the computer is not doing any calculation, which could be bad. If you look at the memory usage at the same time, and see that it’s maxed out (used 100% of total), you can more or less abort the analysis.

When the memory runs out, the computer more or less stops. Since it can’t fit everything into RAM, it will start using the hard drive to store the things it can’t fit there. Since the hard drive is ~1000 times slower than RAM, things will be going in slow motion. The solution could be to either change the settings of the program you are running to decrease the memory usage (if the program has that functionality), or just get a computer with more memory.

You might wonder how the heck am I supposed to be able to remember all these commands, options and flags? The simple answer is that you won’t. Not all of them at least. You might remember ls, but was it -l or -a you should use to see hidden files? You might wish that there was a manual for these things.

Good news everyone, there is a manual! To get all the nitty-gritty details about ls, you use the man command.

$ man <command you want to look at>

$ man ls
LS(1)                            User Commands                            LS(1)

NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the  FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor  --sort  is  speci‐
       fied.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..

       --author
              with -l, print the author of each file

       -b, --escape
              print C-style escapes for nongraphic characters

       --block-size=SIZE
              scale  sizes by SIZE before printing them; e.g., '--block-size=M'
 Manual page ls(1) line 1 (press h for help or q to quit)

This will open a less window (remember, q to close it down, arrows to scroll) with the manual page about ls. Here you will be able to read everything about ls. You’ll see which flag does what (-a is to show the hidden files, which in Linux are files with a name starting with a dot .), which syntax the program has, etc. If you are unsure about how to use a command, look it up using man.
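Once the manual has told you what a flag does, it is a good idea to try it on something harmless. As a small sketch, here is the -a flag from the ls manual page tested in a scratch folder with two made-up files, one of which is hidden because its name starts with a dot.

```shell
# Create one normal file and one hidden file (names are made up).
scratch=$(mktemp -d)
cd "$scratch"
touch visible_file .hidden_file

# Plain ls skips names starting with a dot.
ls

# ls -a shows them, together with . (this folder) and .. (the folder above).
ls -a
```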

The man pages can be a bit tricky to understand at first, but you get used to it with time. If it is still unclear, try searching for it on the internet. You are bound to find someone with the exact same question as you, that has already asked on a forum, and gotten a good answer. 5 years ago.

Optional

If you finished early and still have time left in the lab, check out the Linux file permissions lab on the Contents page.