Once all the read samples have been mapped to the reference we want to quantify how many of the reads that are mapping to different parts of the genome. In most cases this regions are the genes but you can ask to get counts for other parts of the genome as long as you can provide a annotation file with that information. In most cases the format GTF or GFF3 are the most common formats for genome annotation and the ones that most programs use.
In our case we are going to use a program called featureCounts that is fast and reliable.
Then by providing the mapped read files and the annotation you can quantify how many reads that mapped to your reference. There are many settings that you can use to decide how you want to do the counting. In our case we will count the reads that are in the annotated exons (-t exon
) and sum them up per gene (-g gene
) . In featureCounts, this is done by the following command.
# With only one bam file
featureCounts -T {threads} -s 2 -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/file1.bam > results/Counts.featureCount.log
# Or multiple bamfiles.
featureCounts -T {threads} -s 2 -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/*.bam > results/Counts.featureCount.log
This will create a file that contains a matrix were the rows represents the genes and the columns represent the samples. Feature count will also include some information per gene to see were it is annotated in the reference. The file is easy to open in R for further analysis.
##########################################
### RUN featureCounts ###
#########################################
cd ~/RNAseq/
gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.gtf.gz
featureCounts -T 4 -s 2 -t exon -g gene_id \
-a rawData/references/annotations/Mus_musculus.GRCm38.101.gtf \
-o results/04-quantificationFeatureCounts/allSamples.featureCounts \
results/02-mapping/*1M.sorted.bam \
> results/04-quantificationFeatureCounts/allSamples.featureCounts.log