In code blocks, the dollar sign ($) should not be typed. It usually indicates that the text following it should be typed in a terminal window.

1 Ownership & Permissions

As Linux can be a multi-user environment, it is important that files and directories can have different owners and permissions, to keep other people from editing or executing your files.

1.1 Owners

The permissions are defined separately for users, groups and others.

The user is the username of the person who owns the file. By default the user who creates a file becomes its owner. The group is a group of users that co-own the file. They all have the same permissions to the file, which is useful in any project where a group of people are working together. Others is quite simply everyone else.

1.2 Permissions

There are three permissions that a file or directory can have, each with a one-character flag: r, w and x. A fourth character, -, marks a permission that is not set.

In all cases, if the flag is shown for the file or directory, that permission is enabled.

Read: r

File: Whether the file can be opened and read.
Directory: Whether the contents of the directory can be listed.

Write: w

File: Whether the file can be modified. (Note that for renaming or deleting a file you need additional directory permissions.)
Directory: Whether files in the directory can be created, renamed or deleted.

Execute: x

File: Whether the file can be executed as a program or shell script.
Directory: Whether the directory can be entered using cd.

No permissions: -

2 Interpreting permissions

Start your terminal and log onto UPPMAX. Check with squeue which core you had and ssh onto it. If for some reason your core is unavailable, book a new one:

$ salloc -A g2021013 -t 04:30:00 -p core --no-shell --reservation=g2021013_18

Make an empty directory we can work in and make a file.

$ cd /proj/g2021013/nobackup/<username>
$ mkdir advlinux
$ cd advlinux
$ touch filename
$ ls -lh
total 0
-rw-r--r-- 1 S_D staff 0B Sep 21 13:54 filename

(-lh means long and human readable, displaying more information about the files or directories in a human understandable format)

The first segment, -rw-r--r--, describes the ownership and permissions of our newly created file. The very first character, in this case -, shows the file's type. It can be any of these:

d = directory
- = regular file
l = symbolic link
s = Unix domain socket
p = named pipe
c = character device file
b = block device file

As expected the file we have just created is a regular file. Ignore the types other than directory, regular and symbolic link as they are outside the scope of this course.

The next nine characters, in our case rw-r--r--, can be divided into three groups of three characters, read from left to right: in our case rw-, r-- and r--. The first group designates the user's permissions, the second the group's permissions and the third the others' permissions. As you may have guessed, the permissions within each group are ordered: the first always designates read, the second write and the third execute.

For our file, -rw-r--r-- therefore translates to this:

It is a regular file. The user has read and write permission, but not execute. The group has read permission, but not write or execute. Everyone else (other) has read permission, but not write or execute.
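The same reading can be sketched in the shell itself. This is only an illustration: the variable names are made up, and cut is used here simply to pick out character positions from the permission string.

```shell
# Split an ls-style permission string into its four parts.
perms="-rw-r--r--"   # the string from the ls -lh output above

file_type=$(printf '%s\n' "$perms" | cut -c1)       # first character: file type
user_perms=$(printf '%s\n' "$perms" | cut -c2-4)    # characters 2-4: user
group_perms=$(printf '%s\n' "$perms" | cut -c5-7)   # characters 5-7: group
other_perms=$(printf '%s\n' "$perms" | cut -c8-10)  # characters 8-10: others

echo "type:  $file_type"
echo "user:  $user_perms"
echo "group: $group_perms"
echo "other: $other_perms"
```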

As another example, let's create a directory.

$ mkdir directoryname
$ ls -lh
total 0
drwxr-xr-x  2 S_D  staff    68B Sep 21 14:41 directoryname
-rw-r--r--  1 S_D  staff     0B Sep 21 13:54 filename

As you can see, the first character correctly identifies it as d, a directory, and by default the user, the group and others all have x, execute permission, so they can enter the directory.

3 Editing Ownership & Permissions

The command to set file permissions is chmod, which means CHange MODe. Only the owner (or the root user) can set file permissions.

  1. First you decide which group you want to set permissions for. User, u, group, g, other, o, or all three, a.
  2. Next you either add, +, remove, -, or wipe out previous and add new, =, permissions.
  3. Then you specify the kind of permission: r,w,x, or -.

Let's revisit our example file and directory to test this.

$ ls -lh
total 0
drwxr-xr-x  2 S_D  staff    68B Sep 21 14:41 directoryname
-rw-r--r--  1 S_D  staff     0B Sep 21 13:54 filename
$ chmod a=x filename
$ ls -lh
total 0
drwxr-xr-x  2 S_D  staff    68B Sep 21 14:41 directoryname
---x--x--x  1 S_D  staff     0B Sep 21 13:54 filename

As you can see this affected all three, a, it wiped the previous permissions, =, and added an executable permission, x, to all three groups.

Try some others both on the file and directory to get the hang of it.

$ chmod g+r filename
$ chmod u-x filename
$ chmod ug=rx filename
$ chmod a=- filename
$ chmod a+w directoryname
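If you want to check your predictions, the sketch below runs several of these commands on a throwaway file and records the permission string after each step. The temporary directory, the file name demo and the helper function show are made up for this illustration. It assumes mktemp is available, which is the case on most Linux systems.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/demo"

# Hypothetical helper: print only the ten permission characters from ls.
show() { ls -ld "$1" | cut -c1-10; }

chmod u=rw,go=r "$workdir/demo"; start=$(show "$workdir/demo")
chmod a=x "$workdir/demo";       step1=$(show "$workdir/demo")
chmod g+r "$workdir/demo";       step2=$(show "$workdir/demo")
chmod u-x "$workdir/demo";       step3=$(show "$workdir/demo")
chmod ug=rx "$workdir/demo";     step4=$(show "$workdir/demo")

echo "start:       $start"
echo "chmod a=x:   $step1"
echo "chmod g+r:   $step2"
echo "chmod u-x:   $step3"
echo "chmod ug=rx: $step4"

# Remove the throwaway directory again.
rm -r "$workdir"
```

Each line of output shows how a single chmod call changed the previous state, which makes it easier to see the difference between +, - and =.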

In no more than two commands, change the file permissions from

----------

to

-rw-rw--wx

Note that this gives everyone write permission to the file, which means that ANYONE can write to it. Not very safe.

$ chmod ug+rw filename
$ chmod o+wx filename

5 Grep

Some files can be so large that opening them in a program would be very hard on your computer. It could be a file containing biological data, or it could be a log file of a transfer where we want to check for any errors. No matter the reason, a handy tool to know is the grep command.

grep searches for a specific string in one or more files. Searches can be case sensitive or case insensitive, and regular expressions work as well.

Let’s start, as always, by cleaning our directory.

$ rm -r *

Then let’s create a file with some text in it that we can work with. I have supplied some great text below.

$ nano textfile

Cats sleep anywhere, any table, any chair.
Top of piano, window-ledge, in the middle, on the edge.
Open draw, empty shoe, anybody's lap will do.
Fitted in a cardboard box, in the cupboard with your frocks.
Anywhere! They don't care! Cats sleep anywhere.

Now let’s see how the grep command works. The syntax is:

grep "string" filename/filepattern

Some examples for you to try and think about:

$ grep "Cat" textfile
$ grep "cat" textfile

As you can see the last one did not return any results. Add a -i for case insensitive search.

$ grep -i "cat" textfile

Now let’s copy the file and check both of them together by matching a pattern for the filenames.

$ cp textfile textcopy
$ grep "Cat" text*

The * ensures that any file whose name starts with text, followed by anything, will be searched. This example would perhaps be more realistic if we had several text files with different texts and we were looking for a specific string in any of them.
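To make the effect of case sensitivity easier to see, the sketch below counts matching lines instead of printing them. It uses the -c option of grep, which is not covered in this text, and writes the poem to a temporary file whose name is chosen by mktemp. Treat it as an illustration rather than part of the exercise.

```shell
# Write the poem to a temporary file.
poem=$(mktemp)
printf '%s\n' \
  "Cats sleep anywhere, any table, any chair." \
  "Top of piano, window-ledge, in the middle, on the edge." \
  "Open draw, empty shoe, anybody's lap will do." \
  "Fitted in a cardboard box, in the cupboard with your frocks." \
  "Anywhere! They don't care! Cats sleep anywhere." > "$poem"

# grep -c prints the number of matching lines instead of the lines.
upper=$(grep -c "Cat" "$poem")
# grep exits with a non-zero status when nothing matches, hence "|| true".
lower=$(grep -c "cat" "$poem" || true)
nocase=$(grep -c -i "cat" "$poem")

echo "Cat: $upper lines, cat: $lower lines, cat with -i: $nocase lines"
rm "$poem"
```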

Copy the file sample_1.sam to your folder using the command below

$ cp /sw/courses/ngsintro/linux/linux_additional-files/sample_1.sam .

Use grep to search in the file for a specific string of nucleotides, for example:

$ grep "TACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAG" sample_1.sam

Try with shorter sequences. When do you start getting lots of hits? This file is only a fraction of a genome; you would get many times more hits doing this on a complete sam file of many GB.

Use grep to find all lines with chr1 in them. This output is too much to be meaningful. Send it to a file (>) where you have now effectively stored all the chromosome 1 information.

6 Piping

A useful tool in a Linux environment is the pipe. What pipes essentially do is connect the first command to the second command, the second command to the third command, and so on, for as many commands as you want or need.

This is often used in UPPMAX jobs and other analyses, as there are three major benefits. The first is that you do not have to stand in line to get a core or a node twice. The second is that you do not generate intermediary data which will clog your storage; you go from start file to result. The third is that it may actually be faster.

The pipe command has this syntax

command 1 | command 2

The | is the pipe symbol (on a Mac keyboard alt+7), signifying that whatever output usually comes out of command 1 should instead be sent directly to command 2 and used as its input.

In a hypothetical situation you have a folder with hundreds of files and you know the file you are looking for is very large but you can’t remember its name.

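Let’s do an ls -l and pipe the output to be sorted by file size. -n means we are sorting numerically and not alphabetically, and -k 5 says to look at the fifth column of the output, which happens to be the file size in the ls -l listing. Note that we use -l rather than -lh here: human-readable sizes such as 4.0K or 2.1M do not sort correctly as plain numbers.

$ ls -l | sort -k 5 -n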
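To see why -n matters, here is a small sketch with three made-up "name size" records standing in for a file listing. The file names and sizes are invented, and tail -n 1 is used only to pick out the last line of the sorted output.

```shell
# Three made-up "name size" records.
listing=$(printf '%s\n' "a.txt 200" "b.txt 1000" "c.txt 30")

# LC_ALL=C keeps the text comparison byte-based and predictable.
# Without -n the second column is compared as text, so 30 sorts last.
text_last=$(printf '%s\n' "$listing" | LC_ALL=C sort -k 2 | tail -n 1)

# With -n the second column is compared as a number, so 1000 sorts last.
num_last=$(printf '%s\n' "$listing" | LC_ALL=C sort -k 2 -n | tail -n 1)

echo "last line, text sort:    $text_last"
echo "last line, numeric sort: $num_last"
```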

An example use would be to align a file and directly send the now aligned file to be converted into a different format that may be required for another part of the analysis.

The next step requires us to use a bioinformatics program called samtools. To be able to use this program we first have to load the module for it. We will cover this in the UPPMAX lectures, so if you are a bit too fast for your own good, you will just have to type this command:

$ module load bioinfo-tools samtools

Here is an example where we convert the samfile to a bamfile (-Sb literally means the input is Sam and to output in bam) and pipe it to immediately get sorted, not even creating the unsorted bamfile intermediary. Notice that samtools is made to take the single - after samtools sort as the position of the piped data from samtools view.

$ samtools view -bS sample_1.sam | samtools sort -o outbam.bam -

This should have generated a file called outbam.bam in your current folder. We will have some more examples of pipes in the next section.

7 Word Count

wc, for Word Count, is a useful command for counting the number of lines, words and characters in a file. This is most easily explained with an example.

Let’s return to our sample_1.sam.

$ wc sample_1.sam
233288 3666760 58105794 sample_1.sam

This can be interpreted like this:

Number of lines = 233288
Number of words = 3666760
Number of characters (counted as bytes by default) = 58105794

To make this more meaningful, let’s use the pipes and the grep command seen previously to see how many lines in this file contain the string of nucleotides CATCATCAT.

$ grep "CATCATCAT" sample_1.sam | wc
60  957 15074

To see only the line count you can add -l after wc and to count only characters -m.

Output only the number of lines that have chr1 in them from sample_1.sam.

$ grep "chr1" sample_1.sam | wc -l

Count the lines containing CATCATCAT in the outbam.bam file.

$ samtools view outbam.bam | grep "CATCATCAT" | wc -l
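Keep in mind that grep piped into wc -l counts matching lines, not individual matches. The sketch below shows the difference on three made-up lines. It assumes a grep that supports the -o option (print each match on its own line), which GNU grep does. The tr -d ' ' step only strips the padding that some versions of wc add.

```shell
# Made-up data: the first line contains CAT twice.
reads=$(printf '%s\n' "CATCAT" "CAT" "DOG")

# Number of lines that contain CAT at least once.
matching_lines=$(printf '%s\n' "$reads" | grep "CAT" | wc -l | tr -d ' ')

# Number of individual matches: -o prints every match on its own line.
occurrences=$(printf '%s\n' "$reads" | grep -o "CAT" | wc -l | tr -d ' ')

echo "lines with CAT: $matching_lines"
echo "matches of CAT: $occurrences"
```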

8 Bonus exercise 1

These are some harder assignments, so don’t worry if you don’t have time to do them.

Let’s look at grep and use some regular expressions: http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

From file sample_1.sam find all lines that start with @ and put them in a file called at.txt.

$ grep "^@" sample_1.sam > at.txt

Find all the lines in at.txt that end with at least 3 digits. (Sometimes you have to escape {} as \{\}.)

$ grep "[0-9]\{3\}$" at.txt
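The two anchors used here, ^ for the start of a line and $ for the end, can be tried out on a few made-up lines before running them on the real file. The lines below only imitate the general look of a sam file and are not taken from sample_1.sam.

```shell
# Made-up lines: two start with @, two do not.
data=$(printf '%s\n' "@HD VN:1.0" "@SQ SN:chr1 LN:248" "read1 chr1 12" "read2 chr1 1234")

# ^@ keeps only the lines that start with @.
at_lines=$(printf '%s\n' "$data" | grep '^@')

# [0-9]\{3\}$ keeps only the lines whose last three characters are digits.
three_digits=$(printf '%s\n' "$at_lines" | grep '[0-9]\{3\}$')

echo "$at_lines"
echo "---"
echo "$three_digits"
```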

9 Bonus exercise 2

[sed](http://www.grymoire.com/Unix/Sed.html) is a handy tool to replace strings in files.

You have realized that all the chromosomes have been misnamed as chr3 when they should be chr4. Use sed to replace chr3 with chr4 in sample_1.sam and output it to sample_2.sam.

The solution below replaces only the first instance of chr3 on each line. What if we have multiple instances? What if we had wanted to replace chr1? This would affect chr10-19 as well! There are many things to consider :).

$ sed "s/chr3/chr4/" sample_1.sam > sample_2.sam
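The questions about multiple instances and about chr1 versus chr10 can be explored on a single made-up line. The sketch below is one possible approach, not the only one: the g flag replaces every match on a line, and the last command only replaces chr1 when it is followed by a non-digit or by the end of the line.

```shell
line="chr1 chr10 chr1"   # made-up input line

# Without g only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/chr1/chr2/')

# With g every match is replaced, including the chr1 inside chr10.
every=$(printf '%s\n' "$line" | sed 's/chr1/chr2/g')

# Replace chr1 only when the next character is not a digit,
# or when chr1 is the last thing on the line.
careful=$(printf '%s\n' "$line" | sed -e 's/chr1\([^0-9]\)/chr2\1/g' -e 's/chr1$/chr2/')

echo "first only: $first_only"
echo "every:      $every"
echo "careful:    $careful"
```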

10 Bonus exercise 3

Bash loops are great for moving or renaming multiple files as well as many many other uses.

Create a couple of files as seen below

$ touch one.bam two.sam three.bam four.sam five.bam six.sam

All the files are actually in bam format. What a crazy mistake! Create a bash loop that changes all files ending in .sam to end with .bam instead.

The bash loop syntax is this:

$ for <variable> in <pattern>; do <command with $variable>; done

To rename file1 to file2 you write this:

$ mv file1 file2

which effectively is the same thing as

$ cp file1 file2
$ rm file1

Ponder how this can be used to your advantage:

$ i=filename
$ echo ${i/name}stuff
filestuff

$ for f in *.sam
do
  mv "$f" "${f/.sam}.bam"
done
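One thing to watch out for: ${f/.sam} removes the first .sam it finds anywhere in the name, so a file called my.sample.sam would lose the .sam inside "sample" instead of its ending. A possible alternative, sketched below, is ${f%.sam}, which only removes .sam from the end of the name. The temporary directory and the file names are made up for this illustration.

```shell
# Throwaway directory with made-up file names.
workdir=$(mktemp -d)
touch "$workdir/two.sam" "$workdir/my.sample.sam" "$workdir/one.bam"

for f in "$workdir"/*.sam
do
  # ${f%.sam} strips .sam only from the end of the name.
  mv "$f" "${f%.sam}.bam"
done

ls "$workdir"
```

When you are done, the temporary directory can be removed with rm -r "$workdir".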