These steps are to be carried out after the fastq files have been aligned to the reference.
The first step after you have finished your mapping is to get a general feel of how the mapping went. Most mapping programs produce some sort of summary output, either to a file or to standard out. For example, if using the mapper Bowtie you need to pipe that output to a file to see the summary statistics.
The most important parts to look at are the proportion of uniquely mapping, multi-mapping and unmapped reads. We ideally want the uniquely mapping reads to be as high as possible. Multi-mapping or unmapped reads could indicate poor quality of the reads, adapter contamination or other reasons for low quality scores.
After mapping with star of all samples, we ran MultiQC to summarize all the log-files. In this case we had a folder structure with sampleName.hisat2_summary.txt, so to make sure that MultiQC understands what is the sample name, we used the -dd 2
command (e.g. it splits up the path and names the samples after the second last part).
multiqc -d -dd 2 .
The RSeQC package is one of many tools to get basic mapping statistics from your BAM files. This package provides a number of useful modules that can comprehensively evaluate high throughput sequence data, especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, etc. You can read more about the package at the RSeQC website.
The RSeQC package contains many steps that are equivalent to FastQC analysis, e.g. read quality, sequence composition (NVC), GC-bias etc, but the results may be different since many of the low quality reads may not map to the genome and therefore will not be included in the BAM file.
Some steps of the RSeQC package require a file with gene annotations in BED format. These can be downloaded from sources such as UCSC, RefSeq and Ensembl. In this case, the RSeQC team have already created annotation files in some common formats that can be downloaded from their website, but if you have data for a less studied organism you may need to create a BED-file on your own.
Two annotation files have already been downloaded into subdirectory references/annotations
for you to use. These are: Mus_musculus.GRCm38.101.HouseKeepingGenes.bed and Mus_musculus.GRCm38.101.bed.
In this tutorial, we will not run all the different parts of the RSeQC package, only the most relevant ones for this experiment. The different scripts in the RSeQC package are well described on their website, so read the instructions there and specify the input/output files to fit your file names and folder structure.
The steps that we are going to run are:
The geneBody_coverage.py script takes a very long time to run, so we have created a subsection of annotations to run it on. Use the file Mus_musculus.GRCm38.101.HouseKeepingGenes.bed
. If you want to run multiple files this might take to long also. Then you can even further reduce the file by only using the first 500 house keeping genes. For more details look at code below how to do that and how to run all files in a loop.
When running read_distribution.py, an out file cannot be specified. Instead you need to pipe (>
) the output to a file, or look at the output in the terminal.
Run RSeQC for one sample and have a look at your output.
Optional
If you are working on a organism that does not have a annotation file in bed12 that is needed to use this RSeQC it is possible to transform a GTF file to Bed12 file. This is done by using tools from uscs and needs a extra step to first go from GTF to genePred and then from genePred to bed. gtfToGenePred and genePredToBed can both be downloaded from ucsc utilities homepage.
#To unzip all annotation files if you don't have
gunzip references/annotations/*.gz
gtfToGenePred \
references/annotations/Mus_musculus.GRCm38.101.gtf \
references/annotations/Mus_musculus.GRCm38.101.genePred
genePredToBed \
references/annotations/Mus_musculus.GRCm38.101.genePred\
references/annotations/Mus_musculus.GRCm38.101.bed
MutliQC can also include the information from RSeQC. Rerun MultiQC after you have done your RSeQC steps.
It was created using commands:
multiqc -f -d -dd 2 .
Have a look at the reports. What is your conclusion, do your samples look good? Is there anything that looks strange in any sample, or do you feel comfortable using all the samples in your analysis?
##########################################
### RUN RSEQC scripts on bamfiles ###
#########################################
cd ~/RNAseq/
mkdir results/03-postAlignmentQC/geneBody_coverage
gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed.gz
head -n 500 rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed > rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed
geneBody_coverage.py -r rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed \
-i results/02-mapping/ -o results/03-postAlignmentQC/geneBody_coverage/allSamples
mkdir results/03-postAlignmentQC/read_distribution
mkdir results/03-postAlignmentQC/infer_experiment
mkdir results/03-postAlignmentQC/junction_saturation
mkdir results/03-postAlignmentQC/junction_annotation
gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.bed
for i in $(ls results/02-mapping/*.sorted.bam); do
sample=${i%.sorted.bam}
sample=${sample#results/02-mapping/}
echo "running read distribution on $sample"
read_distribution.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
>results/03-postAlignmentQC/read_distribution/$sample.readCoverage.txt
infer_experiment.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
>results/03-postAlignmentQC/infer_experiment/$sample.infer_experiment.txt
junction_saturation.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
-o results/03-postAlignmentQC/junction_saturation/$sample
junction_annotation.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
-o results/03-postAlignmentQC/junction_annotation/$sample
done
cd ~/RNAseq/results/
multiqc -f -d -dd 2 .