In code blocks, the dollar sign ($
) is not to be printed. The dollar sign is usually an indicator that the text following it should be typed in a terminal window.
As Linux can be a multi-user environment it is important that files and directories can have different owners and permissions to keep people from editing or executing your files.
The permissions are defined separately for users, groups and others.
The user is the username of the person who owns the file. By default the user who creates a file will become its owner. The group is a group of users that co-own the file. They will all have the same permissions to the file. This is useful in any project where a group of people are working together. The others is quite simply everyone else’s permissions.
There are four permissions that a file or directory can have. Note the one character designations/flags, r
,w
,x
and -
.
In all cases, if the file or directory has the flag it means that it is enabled.
Read: r
File: Whether the file can be opened and read.
Directory: Whether the contents of the directory can be listed.
Write: w
File: Whether the file can be modified. (Note that for renaming or deleting a file you need additional directory permissions.)
Directory: Whether the files in the directory can be renamed or deleted.
Execute: x
File: Whether the file can be executed as a program or shell script.
Directory: Whether the directory can be entered using cd
.
No permissions: -
Make an empty directory we can work in and make a file.
$ cd /proj/snic2022-22-1124/nobackup/username
$ mkdir advlinux
$ cd advlinux
$ touch filename
$ ls -lh
total 0
-rw-r--r-- 1 S_D staff 0B Sep 21 13:54 filename
(-lh
means long and human readable, displaying more information about the files or directories in a human understandable format)
This shows us a cryptic line for each file/folder, where the columns are as following:
-rw-rw-r-- : permissions
1 : number of linked hard-links
S_D : owner of the file
staff : to which group this file belongs to
0 : file size
Sep 21 13:54 : modification/creation date and time
filename : file/directory name
The first segment, -rw-r--r--
, describes the ownerships and permissions of our newly created file. The very first character, in this case -
, shows the file’s type. It can be any of these:
d
= directory
-
= regular file
l
= symbolic link
s
= Unix domain socket
p
= named pipe
c
= character device file
b
= block device file
As expected the file we have just created is a regular file. Ignore the types other than directory, regular and symbolic link as they are outside the scope of this course.
The next nine characters, in our case rw-r--r--
, can be divided into three groups consisting of three characters in order from left to right. In our case rw-
, r--
and r--
. The first group designates the users permissions, the second the groups permissions and the third the others permissions. As you may have guessed the within group permissions are ordered, the first always designates read permissions, the second write and the third executability.
This translates our files permissions to say this -rw-r--r--
:
- It is a regular file.
- The user has read & write permission, but not execute.
- The group has read permission but not write and execute.
- Everyone else (other), have read permission but not write and execute.
As another example, lets create a directory.
$ mkdir directoryname
$ ls -lh
total 0
drwxr-xr-x 2 S_D staff 68B Sep 21 14:41 directoryname
-rw-r--r-- 1 S_D staff 0B Sep 21 13:54 filename
As you can see the first character correctly identifies it as d
, a directory, and all user groups have x
, execute permissions, to enter the directory by default.
The command to set file permission is chmod
which means CHange MODe. Only the owner can set file permissions.
u
, group, g
, other, o
, or all three, a
.+
, remove, -
, or wipe out previous and add new, =
, permissions.r
,w
,x
, or -
.Lets revisit our example file and directory to test this.
$ ls -lh
total 0
drwxr-xr-x 2 S_D staff 68B Sep 21 14:41 directoryname
-rw-r--r-- 1 S_D staff 0B Sep 21 13:54 filename
$ chmod a=x filename
$ ls -lh
total 0
drwxr-xr-x 2 S_D staff 68B Sep 21 14:41 directoryname
---x--x--x 1 S_D staff 0B Sep 21 13:54 filename
As you can see this affected all three, a
, it wiped the previous permissions, =
, and added an executable permission, x
, to all three groups.
Try some others both on the file and directory to get the hang of it.
$ chmod g+r filename
$ chmod u-x filename
$ chmod ug=rx filename
$ chmod a=- filename
$ chmod a+w directoryname
In no more than two commands, change the file permissions from
----------
to
-rw-rw--wx
Notice also that we here gave everyone writing permission to the file, that means that ANYONE can write to the file. Not very safe.
$ chmod ug+rw filename
$ chmod o=wx filename
Much like a windows user has a shortcut on his desktop to WorldOfWarcraft.exe, being able to create links to files or directories is good to know in Linux. An important thing to remember about symbolic links is that they are not updated, so if you or someone else moves or removes the original file/directory the link will stop working.
Make sure you are standing in the directory /proj/snic2022-22-1124/nobackup/username/advlinux. Then remove our old file and directory.
IMPORTANT: Using rm -r *
removes all folder and subfolders from where you are standing. Be extremely careful when using this. Always double check current working directory or path.
$ rm -r *
Now that the directory is empty, let’s make a new folder and a new file in that folder.
$ mkdir stuff
$ touch stuff/linkfile
Lets put some information into the file, just put some text, anything, like “slartibartfast” or something.
$ nano stuff/linkfile
Now let’s create a link to this file in our original folder. ln
stands for link and -s
makes it symbolic. The other options are not in the scope of this course, but feel free to read about them on your own, https://stackoverflow.com/a/29786294
$ ln -s stuff/linkfile
$ ls -l
total 8
lrwxr-xr-x 1 S_D staff 14B Sep 21 15:38 linkfile -> stuff/linkfile
drwxr-xr-x 3 S_D staff 102B Sep 21 15:36 stuff
Notice that we see the type of the file is l
, for symbolic link, and that we have a pointer after the links name for where the link goes, -> stuff/linkfile
.
If want you to change the information in the file using the link file, then you should change the information in the original file.
Change the information using the original file, then check the link. Has the information changed in the original file and the linked file?
Now move or delete the original file. What happens to the link? What information is there now if you open the link?
Now create a new file in stuff/ with exactly the same name that your link file is pointing too with new information in it. What happens now? Is the link still broken? What is the content of the linked file now?
Now let’s create a link to a directory. Lets clean our workspace.
$ rm -r *
And create a directory three, arbitrarily, two directories away. The -p
option to mkdir will make it create all 3 directories as needed. Without it, it would crash saying it can’t create three
because the directory two
does not exist.
$ mkdir -p one/two/three
Now let’s enter the directory and create some files there.
$ cd one/two/three
$ touch a b c d e
$ ls -lh
total 0
-rw-r--r-- 1 S_D staff 0B Sep 21 16:11 a
-rw-r--r-- 1 S_D staff 0B Sep 21 16:11 b
-rw-r--r-- 1 S_D staff 0B Sep 21 16:11 c
-rw-r--r-- 1 S_D staff 0B Sep 21 16:11 d
-rw-r--r-- 1 S_D staff 0B Sep 21 16:11 e
Return to our starting folder and create a symbolic link to folder three.
$ cd ../../..
$ ln -s one/two/three
$ ls -lh
total 8
drwxr-xr-x 3 S_D staff 102B Sep 21 16:11 one
lrwxr-xr-x 1 S_D staff 13B Sep 21 16:13 three -> one/two/three
Once again, we see that it is correctly identified as a symbolic link, l
, that it’s default name is the same as the directory it is pointing to, same as the files link had the same name as the file by default previously, and that we have the additional pointer after the links name showing us where it’s going.
Run ls
and cd
on the link. Does it act just as if you were standing in directory two/ performing the very same actions on three/?
After entering the link directory using cd
, go back one step using cd ..
, where do you end up?
Moving, deleting or renaming the directory would, just like in the case with the file, break the link.
Some files can be so large that opening it in a program would be very hard on your computer. It could be a file containing biological data, it could be a log file of a transfer where we want check for any errors. No matter the reason, a handy tool to know is the grep
command.
grep
searches for a specific string in one or more files. Case sensitive/insensitive or regular expressions work as well.
Let’s start, as always, by cleaning our directory.
$ rm -r *
Then let’s create a file with some text in it that we can work with. I have supplied some great text below.
$ nano textfile
Cats sleep anywhere, any table, any chair.
Top of piano, window-ledge, in the middle, on the edge.
Open draw, empty shoe, anybody's lap will do.
Fitted in a cardboard box, in the cupboard with your frocks.
Anywhere! They don't care! Cats sleep anywhere.
Now let’s see how the grep command works. The syntax is:
grep "string" filename/filepattern
Some examples for you to try and think about:
$ grep "Cat" textfile
$ grep "cat" textfile
As you can see the last one did not return any results. Add a -i
for case insensitive search.
$ grep -i "cat" textfile
Now let’s copy the file and check both of them together by matching a pattern for the filenames.
$ cp textfile textcopy
$ grep "Cat" text*
The *
will ensure that any file starting with text and then anything following will be searched. This example would perhaps be more real if we had several text files with different texts and we were looking for a specific string in any of them.
Copy the file sample_1.sam to your folder using the command below
$ cp /sw/courses/ngsintro/linux/linux_additional-files/sample_1.sam .
Use grep
to search in the file for a specific string of nucleotides, for example:
$ grep "TACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAG" sample_1.sam
Try with shorter sequences. When do you start getting lots of hits? This file is only a fraction of a genome, you would have gotten many times more hits doing this to a complete many GB large sam file.
Use grep
to find all lines with chr1 in them. This output is too much to be meaningful. Send it to a file (>
) where you have now effectively stored all the chromosome 1 information.
A useful tool in linux environment is the use of pipes. What they essentially do is connect the first command to the second command and the second command to the third command etc for as many commands as you want or need.
This is often used in UPPMAX jobs and other analysis as there are three major benefits. The first is that you do not have to stand in line to get a core or a node twice. The second is that you do not generate intermediary data which will clog your storage, you go from start file to result. The third is that it may actually be faster.
The pipe command has this syntax
command 1 | command 2
The |
is the pipe symbol (on mac keyboard alt+7), signifying that whatever output usually comes out of command 1 should instead be directly sent to command 2 and output in the manner that command 2 inputs.
In a hypothetical situation you have a folder with hundreds of files and you know the file you are looking for is very large but you can’t remember its name.
Let’s do a ls -l
in the /etc
directory and pipe the output to be sorted by file size. -n
means we are sorting numerically and not alphabetically, -k 5
says look at the fifth column of output, which happens to be the file size of ls
command.
$ ls -l /etc | sort -k 5 -n
An example use would be to align a file and directly send the now aligned file to be converted into a different format that may be required for another part of the analysis.
The next step requires us to use a bioinformatics software called samtools. To be able to use this program we first have to load the module for it. We will cover this in the UPPMAX lectures, so if you are a bit too fast for you own good, you will just have to type this command:
$ module load bioinfo-tools samtools
Here is an example where we convert the samfile to a bamfile (-Sb
literally means the input is Sam and to output in bam) and pipe it to immediately get sorted, not even creating the unsorted bamfile intermediary. Notice that samtools is made to take the single -
after samtools sort as the position of the piped data from samtools view.
$ samtools view -bS sample_1.sam | samtools sort - -o outbam
This should have generated a file called outbam.bam in your current folder. We will have some more examples of pipes in the next section.
wc
for Word Count is a useful command for counting the number of occurrences of a word in a file. This is easiest explained with an example.
Let’s return to our sample_1.sam.
$ wc sample_1.sam
233288 3666760 58105794 sample_1.sam
This can be interpreted like this:
Number of lines = 233288
Number of words = 3666760
Number of characters = 58105794
To make this more meaningful, let’s use the pipes and grep
command seen previously to see how many lines and how many times the string of nucleotides CATCATCAT
exist in this file.
$ grep "CATCATCAT" sample_1.sam | wc
60 957 15074
To see only the line count you can add -l
after wc
and to count only characters -m.
Output only the amount of lines that have chr1 in them from sample_1.sam.
$ grep "chr1" sample_1.sam | wc -l
Count the lines containing CATCATCAT
in the outbam.bam file.
$ samtools view outbam.bam | grep "CATCATCAT" | wc -l
These are some harder assignments, so don’t worry about it if you didn’t have time to do it.
Lets look at grep
and use some regular expressions http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
From file sample_1.sam find all lines that start with @
and put them in a file called at.txt.
$ grep "^@" sample_1.sam > at.txt
Find all the lines that end with at least 3 numbers from at.txt. Sometimes, you have to escape {} with \{\})
$ grep "[0-9]\{3\}$" sample_1.sam
[sed](http://www.grymoire.com/Unix/Sed.html)
is a handy tool to replace strings in files.
You have realized that all the chromosomes have been misnamed as chr3 when they should be chr4. Use sed
to replace chr3 with chr4 in sample_1.sam and output it to sample_2.sam.
The solution to this replaces the first instance on each line of chr3. What if we have multiple instances? What if we had wanted to replace chr1? This would effect chr10-19 as well! There are many things to consider :).
$ sed "s/chr1/chr2/" sample_1.sam > sample_2.sam
Bash loops are great for moving or renaming multiple files as well as many many other uses.
Create a couple of files as seen below
$ touch one.bam two.sam three.bam four.sam five.bam six.sam
All the files are actually in bam format. What a crazy mistake! Create a bash loop that changes all files ending in .sam to end with .bam instead.
The bash loop syntax is this:
$ for _variable_ in _pattern_; do _command with $variable_; done
To rename file1 to file2 you write this:
$ mv file1 file2
which effectively is the same thing as
$ cp file1 file2
$ rm file1
Ponder how this can be used to your advantage:
$ i=filename
$ echo ${i/name}stuff
filestuff
$ for f in *.sam
do
mv $f ${f/.sam}.bam;
done