3 Main exercise
The main exercise covers a Differential Gene Expression (DGE) workflow, from raw reads to a list of differentially expressed genes.
3.1 Using Uppmax
Connect to UPPMAX.
ssh -Y username@rackham.uppmax.uu.se
Book a node.
For the RNA-Seq part of the course, we will work on the Rackham cluster. A standard compute node on Rackham has 128 GB of RAM and 20 cores. We will use 1 core per person for this session, so each person gets 6.4 GB of RAM (128 GB divided by 20 cores). The code below assumes you run it at the start of the day. If you run it in the middle of the day, you need to decrease the time (-t
). Do not run this twice, and make sure you are not running computations on a login node.
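The memory-per-core figure quoted above is simply the node's RAM divided by its number of cores. A minimal sketch of that arithmetic follows; the function name `mem_per_core` is illustrative and not an UPPMAX tool.

```shell
#!/bin/bash
# Hypothetical helper: RAM available per core on a node.
# Arguments: total RAM in GB, number of cores.
mem_per_core() {
  awk -v ram="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", ram / cores }'
}

# A standard Rackham node as described above: 128 GB RAM, 20 cores
mem_per_core 128 20
```

Booking more cores with -n would scale the available memory accordingly.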
Book resources for RNA-Seq day 1.
salloc -A g2019031 -t 08:00:00 -p core -n 1 --reservation=g2019031_30
Book resources for RNA-Seq day 2.
salloc -A g2019031 -t 08:00:00 -p core -n 1 --reservation=g2019031_31
3.1.1 Set-up directory
Setting up the directory structure is an important step as it helps to keep our raw data, intermediate data and results organised. All work must be carried out at this location: /proj/g2019031/nobackup/<username>
where <username>
is your user name.
Create a directory named rnaseq
. All RNA-Seq related activities must be carried out in this sub-directory named rnaseq
.
mkdir rnaseq
Set up the directory structure below in your project directory.
<username>/
+-- rnaseq/
    +-- 1_raw/
    +-- 2_fastqc/
    +-- 3_mapping/
    +-- 4_qualimap/
    +-- 5_dge/
    +-- 6_multiqc/
    +-- reference/
    |   +-- mouse_chr19/
    +-- scripts/
    +-- funannot/
    +-- plots/
cd rnaseq
mkdir 1_raw 2_fastqc 3_mapping 4_qualimap 5_dge 6_multiqc reference scripts funannot plots
cd reference
mkdir mouse_chr19
cd ..
The 1_raw
directory will hold the raw fastq files (soft-links). 2_fastqc
will hold FastQC outputs. 3_mapping
will hold the STAR mapping output files. 4_qualimap
will hold the QualiMap output files. 5_dge
will hold the counts from featureCounts and all differential gene expression related files. 6_multiqc
will hold MultiQC outputs. reference
directory will hold the reference genome, annotations and STAR indices. The funannot
and plots
directories are optional and only needed for the bonus steps.
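The same layout can also be created in one step. The sketch below is an alternative to the individual commands, not the course's prescribed method; the function name `setup_rnaseq_dirs` is made up for illustration. It relies on `mkdir -p`, which creates missing parent directories and does not complain if a directory already exists.

```shell
#!/bin/bash
# Create the rnaseq directory tree under a given base directory.
# Safe to re-run: mkdir -p leaves existing directories untouched.
setup_rnaseq_dirs() {
  local base="$1" d
  for d in 1_raw 2_fastqc 3_mapping 4_qualimap 5_dge 6_multiqc \
           reference/mouse_chr19 scripts funannot plots; do
    mkdir -p "$base/rnaseq/$d"
  done
}
# Usage (from your <username> directory): setup_rnaseq_dirs .
```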
It might be a good idea to open an additional terminal window. One to navigate through directories and another for scripting in the scripts
directory.
3.1.2 Create symbolic links
We have the raw fastq files in this remote directory: /sw/courses/ngsintro/rnaseq/main/1_raw/
. We are going to create symbolic links (soft-links) for these files from our 1_raw
directory to the remote directory. We do this because fastq files tend to be large files and simply copying them would use up a lot of storage space. Soft-linking files and folders allows us to work with those files as if they were actually there. Use pwd
to check if you are standing in the correct directory. You should be standing here:
/proj/g2019031/nobackup/<username>/rnaseq/1_raw
Run the command below to create the soft-links. Note that the command ends in a space followed by a period.
ln -s /sw/courses/ngsintro/rnaseq/main/1_raw/*.gz .
Check if your files have linked correctly. You should see something like the listing below.
ls -l
SRR3222409-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222409-19_1.fq.gz
SRR3222409-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222409-19_2.fq.gz
SRR3222410-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222410-19_1.fq.gz
SRR3222410-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222410-19_2.fq.gz
SRR3222411-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222411-19_1.fq.gz
SRR3222411-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222411-19_2.fq.gz
SRR3222412-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222412-19_1.fq.gz
SRR3222412-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222412-19_2.fq.gz
SRR3222413-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222413-19_1.fq.gz
SRR3222413-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222413-19_2.fq.gz
SRR3222414-19_1.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222414-19_1.fq.gz
SRR3222414-19_2.fq.gz -> /sw/courses/ngsintro/rnaseq/main/1_raw/SRR3222414-19_2.fq.gz
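A soft-link whose target is missing still shows up in the listing, so it can be worth checking that every link resolves. The following is a small sketch of such a check; the function name `find_broken_links` is hypothetical.

```shell
#!/bin/bash
# List symbolic links in a directory whose targets do not exist.
# Prints nothing when every link resolves.
find_broken_links() {
  local dir="$1" f
  for f in "$dir"/*; do
    # -L: the entry is a symlink; ! -e: its target cannot be reached
    if [ -L "$f" ] && [ ! -e "$f" ]; then
      echo "$f"
    fi
  done
}
# Usage (from 1_raw): find_broken_links .
```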
3.2 FastQC
Quality check using FastQC
After receiving raw reads from a high throughput sequencing centre it is essential to check their quality. FastQC provides a simple way to do some quality control check on raw sequence data. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
Change into the 2_fastqc
directory. Use pwd
to check if you are standing in the correct directory. You should be standing here:
/proj/g2019031/nobackup/<username>/rnaseq/2_fastqc
Load Uppmax modules bioinfo-tools
and FastQC FastQC/0.11.8
.
module load bioinfo-tools
module load FastQC/0.11.8
Once the module is loaded, FastQC program is available through the command fastqc
. Use fastqc --help
to see the various parameters available to the program. We will use -o
to specify the output directory path and finally, the name of the input fastq file to analyse. The syntax will look like below.
fastqc -o . ../1_raw/filename.fq.gz
Based on the above command, we will write a bash loop to process all fastq files in the directory. Writing multi-line commands through the terminal can be a pain. Therefore, we will run larger scripts from a bash script file. Move to your scripts
directory and create a new file named fastqc.sh
.
You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/scripts
The command below creates a new file in the current directory.
touch fastqc.sh
Use a text editor (nano
, Emacs
, gedit
etc.) to edit fastqc.sh
.
gedit
behaves like a regular text editor with a standard graphical interface.
gedit fastqc.sh&
Adding &
at the end sends that process to the background, so that the console is free to accept new commands.
Then add the lines below and save the file.
#!/bin/bash
for i in ../1_raw/*.gz
do
echo "Running $i ..."
fastqc -o . "$i"
done
This script loops through all files ending in .gz. In each iteration of the loop, it executes fastqc on the file. The -o .
flag to fastqc indicates that the output must be exported in this current directory.
While standing in the 2_fastqc
directory, run the file fastqc.sh
. Use pwd
to check if you are standing in the correct directory.
You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/2_fastqc
bash ../scripts/fastqc.sh
After the fastqc run, there should be a .zip
file and a .html
file for every fastq file. The .html
file is the report that you need. Open the .html
in the browser and view it. You do not necessarily need to look at all the files now. We will compare all samples later using the MultiQC tool.
firefox file.html &
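To confirm that the loop produced a report and an archive for every input, something like the sketch below could be used. It assumes FastQC's usual output naming, where sample_1.fq.gz yields sample_1_fastqc.html and sample_1_fastqc.zip; the function name `check_fastqc_outputs` is illustrative.

```shell
#!/bin/bash
# Report fastq files that lack a FastQC report (.html) or archive (.zip).
# Assumed naming: sample_1.fq.gz -> sample_1_fastqc.html / sample_1_fastqc.zip
check_fastqc_outputs() {
  local raw_dir="$1" out_dir="$2" fq name ext missing=0
  for fq in "$raw_dir"/*.fq.gz; do
    [ -e "$fq" ] || continue            # skip if the glob matched nothing
    name=$(basename "$fq" .fq.gz)
    for ext in html zip; do
      if [ ! -f "$out_dir/${name}_fastqc.$ext" ]; then
        echo "Missing: ${name}_fastqc.$ext"
        missing=$((missing + 1))
      fi
    done
  done
  echo "$missing file(s) missing"
}
# Usage (from 2_fastqc): check_fastqc_outputs ../1_raw .
```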
Optional
Download the .html
file to your computer and view it.
All users can use an SFTP browser like Filezilla or Cyberduck for a GUI interface. Windows users can also use the MobaXterm SFTP file browser to drag and drop.
Linux and Mac users can use SFTP or SCP by running the below command in a LOCAL terminal and NOT on Uppmax. Open a terminal locally on your computer, move to a suitable download directory and run the command below.
scp username@rackham.uppmax.uu.se:/proj/g2019031/nobackup/<username>/rnaseq/2_fastqc/SRR3222409-19_1_fastqc.html ./
Go back to the FastQC website and compare your report with the sample report for Good Illumina data and Bad Illumina data.
Discuss based on your reports, whether your data is of good enough quality and/or what steps are needed to fix it.
3.3 STAR
Mapping reads using STAR
After verifying that the quality of the raw sequencing reads is acceptable, we will map the reads to the reference genome. There are many mappers/aligners available, so it is worth choosing one that suits your type of data. Here, we will use a tool called STAR (Spliced Transcripts Alignment to a Reference) as it is a good general-purpose aligner, fast, easy to use and has been shown to outperform many other tools when aligning 2x76bp paired-end data. Before we begin mapping, we need to obtain a genome reference sequence (.fasta
file) and a corresponding annotation file (.gtf
) and build a STAR index. Due to time constraints, we will build an index only on chromosome 19.
3.3.1 Get reference
It is best if the reference genome (.fasta
) and annotation (.gtf
) files come from the same source to avoid potential naming conventions problems. It is also good to check in the manual of the aligner you use for hints on what type of files are needed to do the mapping.
What is the idea behind building STAR index? What files are needed to build one? Where do we take them from? Could one use a STAR index that was generated before? Browse through Ensembl and try to find the files needed. Note that we are working with Mouse (Mus musculus).
Move into the reference
directory and download the Chr 19 genome (.fasta
) file and the genome-wide annotation file (.gtf
) from Ensembl.
You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/reference
When starting a new analysis, you would most likely use the latest Ensembl release of the genome and annotation. For this exercise, we will use Ensembl release 99.
wget ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.19.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/mus_musculus/Mus_musculus.GRCm38.99.gtf.gz
Decompress the files for use.
gunzip Mus_musculus.GRCm38.dna.chromosome.19.fa.gz
gunzip Mus_musculus.GRCm38.99.gtf.gz
From the full gtf file, we will also extract chr 19 alone to create a new gtf file for use later.
grep -E "^#|^19" Mus_musculus.GRCm38.99.gtf > Mus_musculus.GRCm38.99-19.gtf
Check what you have in your directory.
ls -l
drwxrwsr-x 2 user gXXXXXXX 4.0K Jan 22 21:59 mouse_chr19
-rw-rw-r-- 1 user gXXXXXXX 26M Jan 22 22:46 Mus_musculus.GRCm38.99-19.gtf
-rw-rw-r-- 1 user gXXXXXXX 771M Jan 22 22:10 Mus_musculus.GRCm38.99.gtf
-rw-rw-r-- 1 user gXXXXXXX 60M Jan 22 22:10 Mus_musculus.GRCm38.dna.chromosome.19.fa
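The pattern ^19 keeps any record whose sequence name merely starts with 19. A stricter variant, sketched below, compares the whole first column instead. The function name `filter_gtf_by_chr` is hypothetical, and whether the stricter match changes anything for this annotation has not been checked here.

```shell
#!/bin/bash
# Keep header lines (starting with #) and records whose first
# tab-separated column is exactly the requested sequence name.
filter_gtf_by_chr() {
  local chr="$1" gtf="$2"
  # appending "" forces a string comparison rather than a numeric one
  awk -F'\t' -v chr="$chr" '/^#/ || ($1 "") == (chr "")' "$gtf"
}
# Usage: filter_gtf_by_chr 19 Mus_musculus.GRCm38.99.gtf > Mus_musculus.GRCm38.99-19.gtf
```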
3.3.2 Build index
Move into the reference
directory if not already there. Load module STAR. Remember to load bioinfo-tools
if you haven’t done so already.
module load bioinfo-tools
module load star/2.7.2b
To search for other available versions of STAR, use module spider star
.
Create a new bash script in your scripts
directory named star_index.sh
and add the following lines:
#!/bin/bash
# load module
module load bioinfo-tools
module load star/2.7.2b
star \
--runMode genomeGenerate \
--runThreadN 1 \
--genomeSAindexNbases 11 \
--genomeDir ./mouse_chr19 \
--genomeFastaFiles ./Mus_musculus.GRCm38.dna.chromosome.19.fa \
--sjdbGTFfile ./Mus_musculus.GRCm38.99.gtf
The above script tells STAR to run in genomeGenerate
mode to build an index, using 1 core for computation. The --genomeSAindexNbases
argument needs to be adjusted to the genome size. Here we set it to a value suited to a single chromosome; it does not need to be adjusted for the full genome. In any case, if it needs adjusting, STAR will tell you so and suggest a value. The output files are written to the directory given by --genomeDir, and the paths to the .fasta
file and the annotation file (.gtf
) are also given. STAR arguments are described in the STAR manual.
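For small genomes, the STAR manual suggests scaling --genomeSAindexNbases as min(14, log2(GenomeLength)/2 - 1). The sketch below computes that value from a FASTA file; the function names `genome_length` and `sa_index_nbases` are made up for illustration, and rounding down is an assumption about how to turn the result into an integer.

```shell
#!/bin/bash
# Total number of bases in a FASTA file (header lines excluded).
genome_length() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Suggested --genomeSAindexNbases for a genome of the given length:
# min(14, log2(length)/2 - 1), rounded down.
sa_index_nbases() {
  awk -v len="$1" 'BEGIN {
    n = int(log(len) / log(2) / 2 - 1)
    if (n > 14) n = 14
    print n
  }'
}
# Usage: sa_index_nbases "$(genome_length Mus_musculus.GRCm38.dna.chromosome.19.fa)"
```

For a sequence of roughly 61 million bases, about the size of the chromosome 19 FASTA file, this gives 11, the value used in the script.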
Use pwd
to check if you are standing in the correct directory. Then, run the script from the reference
directory.
bash ../scripts/star_index.sh
Once the indexing is complete, move into the mouse_chr19
directory and make sure you have all the files.
ls -l
-rw-rw-r-- 1 user gXXXXXXX 9 Jan 22 22:16 chrLength.txt
-rw-rw-r-- 1 user gXXXXXXX 12 Jan 22 22:16 chrNameLength.txt
-rw-rw-r-- 1 user gXXXXXXX 3 Jan 22 22:16 chrName.txt
-rw-rw-r-- 1 user gXXXXXXX 11 Jan 22 22:16 chrStart.txt
-rw-rw-r-- 1 user gXXXXXXX 778K Jan 22 22:18 exonGeTrInfo.tab
-rw-rw-r-- 1 user gXXXXXXX 393K Jan 22 22:18 exonInfo.tab
-rw-rw-r-- 1 user gXXXXXXX 56K Jan 22 22:18 geneInfo.tab
-rw-rw-r-- 1 user gXXXXXXX 61M Jan 22 22:19 Genome
-rw-rw-r-- 1 user gXXXXXXX 613 Jan 22 22:19 genomeParameters.txt
-rw-rw-r-- 1 user gXXXXXXX 473M Jan 22 22:19 SA
-rw-rw-r-- 1 user gXXXXXXX 1.5G Jan 22 22:19 SAindex
-rw-rw-r-- 1 user gXXXXXXX 230K Jan 22 22:18 sjdbInfo.txt
-rw-rw-r-- 1 user gXXXXXXX 241K Jan 22 22:18 sjdbList.fromGTF.out.tab
-rw-rw-r-- 1 user gXXXXXXX 202K Jan 22 22:18 sjdbList.out.tab
-rw-rw-r-- 1 user gXXXXXXX 250K Jan 22 22:18 transcriptInfo.tab
The index for the whole genome would be created in a similar manner. It just requires more time (ca. 4h) to run.
3.3.3 Map reads
Now that we have the index ready, we are ready to map reads. Move to the directory 3_mapping
. Use pwd
to check if you are standing in the correct directory.
You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/3_mapping
We will create soft-links to the fastq files from here to make things easier. You are already standing in 3_mapping, so there is no need to change directory first.
ln -s ../1_raw/* .
These are the parameters that we want to specify for the STAR mapping run:
- Run mode is now alignReads
- Specify the full genome index path
- Specify the number of threads
- We must indicate the input is gzipped and must be uncompressed
- Indicate read1 and read2 since we have paired-end reads
- Specify the annotation (.gtf) file
- Specify an output file name
- Specify that the output must be BAM and the reads must be sorted by coordinate
STAR arguments are described in the STAR manual. Our mapping script will look like this:
star \
--runMode alignReads \
--genomeDir "../reference/mouse_chr19" \
--runThreadN 1 \
--readFilesCommand zcat \
--readFilesIn sample_1.fq.gz sample_2.fq.gz \
--sjdbGTFfile "../reference/Mus_musculus.GRCm38.99.gtf" \
--outFileNamePrefix "sample1" \
--outSAMtype BAM SortedByCoordinate
But, we will generalise the above script to be used as a bash script to read any two input files and to automatically create the output filename.
Now create a new bash script file named star_align.sh
in your scripts
directory and add the script below to it.
#!/bin/bash
module load bioinfo-tools
module load star/2.7.2b
# get output filename prefix
prefix=$( basename "$1" | sed -E 's/_.+$//' )
star \
--runMode alignReads \
--genomeDir "../reference/mouse_chr19" \
--runThreadN 1 \
--readFilesCommand zcat \
--readFilesIn $1 $2 \
--sjdbGTFfile "../reference/Mus_musculus.GRCm38.99.gtf" \
--outFileNamePrefix "$prefix" \
--outSAMtype BAM SortedByCoordinate
In the above script, the two input fastq files are passed in as parameters $1
and $2
. The output filename prefix is automatically created using this line prefix=$( basename "$1" | sed -E 's/_.+$//' )
from input filename of $1
. For example, a file with path /bla/bla/sample_1.fq.gz
will have the directory stripped off using the function basename
to get sample_1.fq.gz
. This is piped (|
) to sed
where all text starting from _
to end of string (specified by this regular expression _.+$
matching _1.fq.gz
) is removed and the prefix will be just sample
. This approach will work only if your filenames are labelled suitably.
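To see what "labelled suitably" means in practice, the prefix logic can be tried on a few names in isolation. The sketch below wraps the same basename and sed commands in a function (the name `get_prefix` is illustrative); the third file name is a made-up example of a name that does not fit the pattern.

```shell
#!/bin/bash
# Derive the output prefix the same way star_align.sh does:
# strip the directory, then remove everything from the first underscore.
get_prefix() {
  basename "$1" | sed -E 's/_.+$//'
}

get_prefix /bla/bla/sample_1.fq.gz          # sample
get_prefix ../1_raw/SRR3222409-19_1.fq.gz   # SRR3222409-19
get_prefix my_sample_1.fq.gz                # my (the sample name is cut short)
```

Because the expression removes everything from the first underscore onwards, sample names that themselves contain an underscore are truncated.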
Now we can run the bash script like below while standing in the 3_mapping
directory.
bash ../scripts/star_align.sh sample_1.fq.gz sample_2.fq.gz
Similarly run the other samples.
Optional
Try to create a new bash loop script (star_align_batch.sh
) to iterate over all fastq files in the directory and run the mapping using the star_align.sh
script. Note that there is a bit of a tricky issue here. You need to use two fastq files (_1
and _2
) per run rather than one file. There are many ways to do this and here is one.
## find only files for read 1 and extract the sample name
lines=$(find *_1.fq.gz | sed "s/_1.fq.gz//")
for i in ${lines}
do
## use the sample name and add suffix (_1.fq.gz or _2.fq.gz)
echo "Mapping ${i}_1.fq.gz and ${i}_2.fq.gz ..."
bash ../scripts/star_align.sh "${i}_1.fq.gz" "${i}_2.fq.gz"
done
Run the star_align_batch.sh
script in the 3_mapping
directory.
bash ../scripts/star_align_batch.sh
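Since each mapping run needs both mates, a quick check that every read 1 file has a read 2 partner can save a failed run. The following is a minimal sketch, assuming the _1.fq.gz and _2.fq.gz naming used in this exercise; the function name `find_unpaired` is hypothetical.

```shell
#!/bin/bash
# Print the sample names that have a read 1 file but no read 2 file.
find_unpaired() {
  local dir="$1" r1 sample
  for r1 in "$dir"/*_1.fq.gz; do
    [ -e "$r1" ] || continue             # skip if the glob matched nothing
    sample="${r1%_1.fq.gz}"              # strip the read 1 suffix
    [ -e "${sample}_2.fq.gz" ] || basename "$sample"
  done
}
# Usage (from 3_mapping): find_unpaired .
```

No output means every sample is complete.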
At the end of the mapping jobs, you should have the following list of output files for every sample:
ls -l
-rw-rw-r-- 1 user gXXXXXXX 18M Jan 22 22:30 SRR3222409-19Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 user gXXXXXXX 2.0K Jan 22 22:30 SRR3222409-19Log.final.out
-rw-rw-r-- 1 user gXXXXXXX 457M Jan 22 22:30 SRR3222409-19Log.out
-rw-rw-r-- 1 user gXXXXXXX 246 Jan 22 22:30 SRR3222409-19Log.progress.out
-rw-rw-r-- 1 user gXXXXXXX 127K Jan 22 22:29 SRR3222409-19SJ.out.tab
drwx--S--- 2 user gXXXXXXX 4.0K Jan 22 22:29 SRR3222409-19_STARgenome
The .bam
file contains the alignment of all reads to the reference genome in binary format. BAM files are not human readable directly. To view a BAM file in text format, you can use samtools view
functionality.
module load samtools/1.9
samtools view SRR3222409-19Aligned.sortedByCoord.out.bam | head
SRR3222409.13658290 163 19 3084385 255 98M = 3084404 120 CTTTAAGATAAGTGCCGGTTGCAGCCAGCTGTGAGAGCTGCACTCCCTTCTCTGCTCTAAAGTTCCCTCTTCTCAGAAGGTGGCACCACCCTGAGCTG DB@D@GCHHHEFHIIG<CHHHIHHIIIIHHHIIIIGHIIIIIFHIGHIHIHIIHIIHIHIIHHHHIIIIIIHFFHHIIIGCCCHHHH1GHHIIHHIII NH:i:1 HI:i:1 AS:i:193 nM:i:2
SRR3222409.13658290 83 19 3084404 255 101M = 3084385 -120 TGCAGCCAGCTGTGAGAGCTGCACTCCCTTCTCTGCTCTAAAGTTCCCTCTTCTCAGAAGGTGGCACCACCCTGAGCTGCTGGCAGTGAGTCTGTTCCAAG IIIIHECHHH?IHHHIIIHIHIIIHEHHHCHHHIHIIIHHIHIIIHHHHHHIHEHIIHIIHHIIHHIHHIGHIGIIIIIIIHHIIIHHIHEHCHHG@<<BD NH:i:1 HI:i:1 AS:i:193 nM:i:2
SRR3222409.13741570 163 19 3085066 255 25M2I74M = 3085166 201 ATAGTACCTGGCAACAAAAAAAAAAAAGCTTTTGGCTAAAGACCAATGTGTTTAAGAGATAAAAAAAGGGGTGCTAATACAGAAGCTGAGGCCTTAGAAGA 0B@DB@HCCH1<<CGECCCGCHHIDD?01<<G1</<1<FH1F11<1111<<<<11<CGC1<G1<F//DHHI0/01<<1FG11111111<111<1D1<1D1< NH:i:1 HI:i:1 AS:i:186 nM:i:3
Can you identify what some of these columns are? SAM format description is available here.
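The first eleven SAM columns are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. The FLAG column is the least self-explanatory, since it packs several yes/no properties into one number. The sketch below decodes it bit by bit, following the bit meanings in the SAM specification; the function name `decode_flag` and the short labels are illustrative.

```shell
#!/bin/bash
# Decode the bitwise FLAG field (column 2) of a SAM record.
decode_flag() {
  local flag="$1" bit=1 name
  for name in "paired" "proper pair" "unmapped" "mate unmapped" \
              "reverse strand" "mate reverse strand" "first in pair" \
              "second in pair" "secondary" "QC fail" "duplicate" \
              "supplementary"; do
    # a property applies when its bit is set in the flag
    if [ $(( flag & bit )) -ne 0 ]; then
      echo "$name"
    fi
    bit=$(( bit * 2 ))
  done
}

decode_flag 163
```

For the first record shown above (flag 163 = 1 + 2 + 32 + 128) this prints paired, proper pair, mate reverse strand and second in pair.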
The Log.final.out
file gives a summary of the mapping run. This file is used by MultiQC later to collect mapping statistics.
Inspect one of the mapping log files to identify the number of uniquely mapped reads and multi-mapped reads.
cat SRR3222409-19Log.final.out
Started job on | Jan 22 22:29:17
Started mapping on | Jan 22 22:29:35
Finished on | Jan 22 22:30:05
Mapping speed, Million of reads per hour | 18.84
Number of input reads | 156992
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 152768
Uniquely mapped reads % | 97.31%
Average mapped length | 199.74
Number of splices: Total | 82077
Number of splices: Annotated (sjdb) | 81433
Number of splices: GT/AG | 81397
Number of splices: GC/AG | 188
Number of splices: AT/AC | 228
Number of splices: Non-canonical | 264
Mismatch rate per base, % | 0.17%
Deletion rate per base | 0.02%
Deletion average length | 1.42
Insertion rate per base | 0.01%
Insertion average length | 1.25
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1431
% of reads mapped to multiple loci | 0.91%
Number of reads mapped to too many loci | 11
% of reads mapped to too many loci | 0.01%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 2782
% of reads unmapped: too short | 1.77%
Number of reads unmapped: other | 0
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
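Rather than reading each log by eye, the two percentages of interest can be pulled out of all logs at once. The sketch below assumes the key and value on each line are separated by a `|` character, as in the listing above; the function name `summarise_star_logs` is hypothetical.

```shell
#!/bin/bash
# Print one line per STAR log: file name, uniquely mapped %, multi-mapped %.
summarise_star_logs() {
  local log
  for log in "$@"; do
    awk -F'|' -v name="$(basename "$log")" '
      # strip surrounding whitespace from key and value
      { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/^[ \t]+|[ \t]+$/, "", $2) }
      $1 == "Uniquely mapped reads %"            { uniq = $2 }
      $1 == "% of reads mapped to multiple loci" { multi = $2 }
      END { printf "%s\t%s\t%s\n", name, uniq, multi }
    ' "$log"
  done
}
# Usage (from 3_mapping): summarise_star_logs *Log.final.out
```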
The BAM file names can be simplified by renaming them. This command renames all BAM files.
rename "Aligned.sortedByCoord.out" "" *.bam
Next, we need to index these BAM files. Indexing creates .bam.bai
files which are required by many downstream programs to quickly and efficiently locate reads anywhere in the BAM file.
Index all BAM files.
module load samtools/1.9
for i in *.bam
do
echo "Indexing $i ..."
samtools index "$i"
done
Finally, we should have .bai
index files for all BAM files.
ls -l
-rw-rw-r-- 1 user gXXXXXXX 18M Jan 22 22:30 SRR3222409-19.bam
-rw-rw-r-- 1 user gXXXXXXX 47K Jan 22 22:41 SRR3222409-19.bam.bai
If you are running short of time or unable to run the mapping, you can copy over results for all samples that have been prepared for you before class. They are available at this location: /sw/courses/ngsintro/rnaseq/main/3_mapping/
.
cp -r /sw/courses/ngsintro/rnaseq/main/3_mapping/* /proj/g2019031/nobackup/<username>/rnaseq/3_mapping/
3.4 QualiMap
Post-alignment QC using QualiMap
Some important quality aspects, such as saturation of sequencing depth, read distribution between different genomic features or coverage uniformity along transcripts, can be measured only after mapping reads to the reference genome. One of the tools to perform this post-alignment quality control is QualiMap. QualiMap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.
Read through the QualiMap documentation and see if you can figure out how to run it to assess post-alignment quality on the RNA-seq mapped samples. Here is the RNA-Seq specific tool explanation. The tool is already installed on Uppmax as a module.
Load the QualiMap module version 2.2.1 and create a bash script named qualimap.sh
in your scripts
directory.
Add the following script to it. Note that we are using the smaller GTF file with chr19 only.
#!/bin/bash
# load modules
module load bioinfo-tools
module load QualiMap/2.2.1
# get output filename prefix
prefix=$( basename "$1" .bam)
export DISPLAY=:0
qualimap rnaseq -pe \
-bam "$1" \
-gtf "../reference/Mus_musculus.GRCm38.99-19.gtf" \
-outdir "../4_qualimap/${prefix}/" \
-outfile "$prefix" \
-outformat "HTML" \
--java-mem-size=6G >& "${prefix}-qualimap.log"
The line prefix=$( basename "$1" .bam)
is used to remove directory path and .bam
from the input filename and create a prefix which will be used to label output. The export DISPLAY=:0
forces a ‘headless mode’ on the JAVA application, which would otherwise throw an error about X11 display. If that doesn’t work, one can also try unset DISPLAY
or export DISPLAY=""
. The last part >& "${prefix}-qualimap.log"
saves both standard output and standard error to a log file.
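The difference between redirecting only standard output and redirecting both streams can be seen with a toy command. In the sketch below, `noisy` is a made-up stand-in for a program such as QualiMap that writes to both streams; the long form `> file 2>&1` is what the bash shorthand `>& file` stands for.

```shell
#!/bin/bash
# A stand-in for a program that writes to both output streams.
noisy() {
  echo "progress message"        # standard output
  echo "warning message" >&2     # standard error
}

log_dir=$(mktemp -d)
noisy > "$log_dir/stdout_only.log" 2> /dev/null   # keeps standard output only
noisy > "$log_dir/both.log" 2>&1                  # same effect as: noisy >& both.log

wc -l < "$log_dir/stdout_only.log"   # 1 line
wc -l < "$log_dir/both.log"          # 2 lines
```

Capturing both streams matters here because error messages usually go to standard error and would otherwise be missing from the log.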
Create a new bash loop script named qualimap_batch.sh
with a bash loop to run the qualimap script over all BAM files. The loop should look like below.
for i in ../3_mapping/*.bam
do
echo "Running QualiMap on $i ..."
bash ../scripts/qualimap.sh "$i"
done
Run the loop script qualimap_batch.sh
in the directory 4_qualimap
.
bash ../scripts/qualimap_batch.sh
Qualimap should have created a directory for every BAM file.
drwxrwsr-x 5 user gXXXXXXX 4.0K Jan 22 22:53 SRR3222409-19
-rw-rw-r-- 1 user gXXXXXXX 669 Jan 22 22:53 SRR3222409-19-qualimap.log
Inside every directory, you should see:
ls -l
drwxrwsr-x 2 user gXXXXXXX 4.0K Jan 22 22:53 css
drwxrwsr-x 2 user gXXXXXXX 4.0K Jan 22 22:53 images_qualimapReport
-rw-rw-r-- 1 user gXXXXXXX 12K Jan 22 22:53 qualimapReport.html
drwxrwsr-x 2 user gXXXXXXX 4.0K Jan 22 22:53 raw_data_qualimapReport
-rw-rw-r-- 1 user gXXXXXXX 1.2K Jan 22 22:53 rnaseq_qc_results.txt
You can download the HTML files locally to your computer if you wish. If you do so, note that you MUST also download the dependency files (i.e. the css folder and the images_qualimapReport folder), otherwise the HTML file may not render correctly.
Inspect the HTML output file and try to make sense of it.
firefox qualimapReport.html &
If you are running out of time or were unable to run QualiMap, you can also copy pre-run QualiMap output from this location: /sw/courses/ngsintro/rnaseq/main/4_qualimap/
.
cp -r /sw/courses/ngsintro/rnaseq/main/4_qualimap/* /proj/g2019031/nobackup/<username>/rnaseq/4_qualimap/
Check the QualiMap report for one sample and discuss if the sample is of good quality. You only need to do this for one file now. We will do a comparison with all samples when using the MultiQC tool.
3.5 featureCounts
Counting mapped reads using featureCounts
After ensuring mapping quality, we can move on to enumerating reads mapping to genomic features of interest. Here we will use featureCounts, an ultrafast and accurate read summarization program that can count mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations.
Read the featureCounts documentation and see if you can figure out how to count fragments from a paired-end, unstranded library that overlap exonic regions and summarise them over genes.
Load the subread module on Uppmax. Create a bash script named featurecounts.sh
in the directory scripts
.
We could run featureCounts on each BAM file, produce a text output for each sample and combine the output. But the easier way is to provide a list of all BAM files and featureCounts will combine counts for all samples into one text file.
Below is the script that we will use:
#!/bin/bash
# load modules
module load bioinfo-tools
module load subread/2.0.0
featureCounts \
-a "../reference/Mus_musculus.GRCm38.99.gtf" \
-o "counts.txt" \
-F "GTF" \
-t "exon" \
-g "gene_id" \
-p \
-s 0 \
-T 1 \
../3_mapping/*.bam
In the above script, we indicate the path of the annotation file (-a "../reference/Mus_musculus.GRCm38.99.gtf"
), specify the output file name (-o "counts.txt"
), specify that the annotation file is in GTF format (-F "GTF"
), specify that reads are to be counted over exonic features (-t "exon"
) and summarised to the gene level (-g "gene_id"
). We also specify that the reads are paired-end (-p
), the library is unstranded (-s 0
) and the number of threads to use (-T 1
).
Run the featurecounts bash script in the directory 5_dge
. Use pwd
to check if you are standing in the correct directory.
You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/5_dge
bash ../scripts/featurecounts.sh
You should have two output files:
ls -l
-rw-rw-r-- 1 user gXXXXXXX 2.8M Sep 15 11:05 counts.txt
-rw-rw-r-- 1 user gXXXXXXX 658 Sep 15 11:05 counts.txt.summary
Inspect the files and try to make sense of them.
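The counts file starts with a comment line recording the command, followed by a header row with Geneid, Chr, Start, End, Strand and Length and then one column per BAM file. To look only at the counts, the annotation columns can be dropped, as in this sketch (the function name `counts_matrix` is hypothetical):

```shell
#!/bin/bash
# Reduce featureCounts output to a gene-by-sample count table:
# drop the leading comment line and the annotation columns
# (Chr, Start, End, Strand, Length), keeping Geneid and the counts.
counts_matrix() {
  grep -v '^#' "$1" | cut -f 1,7-
}
# Usage (from 5_dge): counts_matrix counts.txt | head
```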
Important
For downstream steps, we will NOT use this counts.txt file. Instead we will use counts_full.txt from the back-up folder. This contains counts across all chromosomes. This is located here: /sw/courses/ngsintro/rnaseq/main/5_dge/
.
cp /sw/courses/ngsintro/rnaseq/main/5_dge/counts_full.txt /proj/g2019031/nobackup/<username>/rnaseq/5_dge/
3.6 MultiQC
Combined QC report using MultiQC
We will use the tool MultiQC to crawl through the output, log files etc from FastQC, STAR, QualiMap and featureCounts to create a combined QC report.
Run MultiQC as shown below in the 6_multiqc
directory. You should be standing here to run this:
/proj/g2019031/nobackup/<username>/rnaseq/6_multiqc
module load bioinfo-tools
module load MultiQC/1.8
multiqc --interactive ../
The output should look like below:
ls -l
drwxrwsr-x 2 user gXXXXXXX 4.0K Sep 6 22:33 multiqc_data
-rw-rw-r-- 1 user gXXXXXXX 1.3M Sep 6 22:33 multiqc_report.html
Open the MultiQC HTML report using firefox
and inspect it. You can also download the file to your computer and open it locally.
firefox multiqc_report.html &
3.7 DESeq2
Differential gene expression using DESeq2
The easiest way to perform differential expression analysis is to use one of the statistical packages, within the R environment, that were specifically designed for analyses of read counts arising from RNA-seq, SAGE and similar technologies. Here, we will use one such package called DESeq2. Learning R is beyond the scope of this course, so we have prepared basic ready-to-run R scripts to find DE genes between the conditions KO and Wt.
Move to the 5_dge
directory and load R modules for use.
module load R/3.6.1
module load R_packages/3.6.1
Use pwd
to check if you are standing in the correct directory. Copy the following file to the 5_dge
directory: /sw/courses/ngsintro/rnaseq/main/5_dge/dge.R
Make sure you have the counts_full.txt
. If not, you can copy this file too: /sw/courses/ngsintro/rnaseq/main/5_dge/counts_full.txt
cp /sw/courses/ngsintro/rnaseq/main/5_dge/dge.R .
cp /sw/courses/ngsintro/rnaseq/main/5_dge/counts_full.txt .
Now, run the R script from the shell in the 5_dge
directory.
Rscript dge.R
If you are curious what’s inside dge.R, you are welcome to explore it using a text editor.
This should have produced the following output files:
ls -l
-rw-rw-r-- 1 user gXXXXXXX 282K Jan 22 23:16 counts_vst_full.Rds
-rw-rw-r-- 1 user gXXXXXXX 1.7M Jan 22 23:16 counts_vst_full.txt
-rw-rw-r-- 1 user gXXXXXXX 727K Jan 22 23:16 dge_results_full.Rds
-rw-rw-r-- 1 user gXXXXXXX 1.7M Jan 22 23:16 dge_results_full.txt
Essentially, we have two outputs: dge_results_full and counts_vst_full. dge_results_full is the list of differentially expressed genes. This is available in human readable tab-delimited .txt file and R readable binary .Rds file. The counts_vst_full is variance-stabilised normalised counts, useful for exploratory analyses.
Copy the results text file (dge_results_full.txt
) to your computer and inspect the results. What are the columns? How many genes are differentially expressed at an adjusted p-value cut-off of 0.05? How many genes are up-regulated and how many are down-regulated? How does this change if we also set a log2 fold-change cut-off of 1?
Open in a spreadsheet editor like Microsoft Excel or LibreOffice Calc.
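The same questions can also be answered on the command line. The sketch below is one possible approach, not part of the course scripts. It assumes a tab-delimited file whose header row lines up with the data columns and contains columns named padj and log2FoldChange (the standard DESeq2 names; the output of dge.R may differ). The function name `count_de_genes` is hypothetical.

```shell
#!/bin/bash
# Count genes passing an adjusted p-value threshold, split by direction.
# Arguments: results file, padj cut-off (default 0.05),
#            absolute log2 fold-change cut-off (default 0).
count_de_genes() {
  local file="$1" alpha="${2:-0.05}" lfc="${3:-0}"
  awk -F'\t' -v alpha="$alpha" -v lfc="$lfc" '
    NR == 1 {
      # locate the required columns by name
      for (i = 1; i <= NF; i++) {
        if ($i == "padj") p = i
        if ($i == "log2FoldChange") l = i
      }
      if (!p || !l) { bad = 1; exit 1 }
      next
    }
    $p == "NA" || $l == "NA" { next }     # skip genes without a test result
    $p + 0 < alpha && $l + 0 >  lfc { up++ }
    $p + 0 < alpha && $l + 0 < -lfc { down++ }
    END {
      if (bad) { print "required columns not found" > "/dev/stderr"; exit 1 }
      printf "up\t%d\ndown\t%d\n", up, down
    }
  ' "$file"
}
# Usage (from 5_dge): count_de_genes dge_results_full.txt 0.05 1
```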
If you do not have the results or were unable to run the DGE step, you can copy the two files below, which are required for the optional functional annotation step.
cp /sw/courses/ngsintro/rnaseq/main/5_dge/dge_results_full.txt .
cp /sw/courses/ngsintro/rnaseq/main/5_dge/dge_results_full.Rds .