Quantification

Once all the read samples have been mapped to the reference we want to quantify how many of the reads that are mapping to different parts of the genome. In most cases this regions are the genes but you can ask to get counts for other parts of the genome as long as you can provide a annotation file with that information. In most cases the format GTF or GFF3 are the most common formats for genome annotation and the ones that most programs use.

In our case we are going to use a program called featureCounts that is fast and reliable.

First load the modules.

module load bioinfo-tools
module load subread/1.5.2

Then by providing the mapped read files and the annotation you can quantify how many reads that mapped to your reference. There are many settings that you can use to decide how you want to do the counting. In our case we will count the reads that are in the annotated exons (-t exon) and sum them up per gene (-g gene) . In featureCounts, this is done by the following command.

# With only one bam file
featureCounts -T {threads} -p -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/file1.bam > results/Counts.featureCount.log

# Or multiple bamfiles.
featureCounts -T {threads} -p -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/*.bam > results/Counts.featureCount.log

This will create a file that contains a matrix were the rows represents the genes and the columns represent the samples. Feature count will also include some information per gene to see were it is annotated in the reference. The file is easy to open in R for further analysis.

counts <- read.delim("./data/counts_raw.txt",header=TRUE)
head(counts)

##                 Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
## ENSG00000000003      321      303      204      492      455      359
## ENSG00000000005        0        0        0        0        0        0
## ENSG00000000419      696      660      472      951      963      689
## ENSG00000000457       59       54       44      109       73       66
## ENSG00000000460      399      405      236      445      454      374
## ENSG00000000938        0        0        0        0        0        1
##                 Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12
## ENSG00000000003      376      523      450       950       760      1436
## ENSG00000000005        0        0        0         0         0         0
## ENSG00000000419      706     1041      796      1036       789      1413
## ENSG00000000457       60      125       74       108       115       174
## ENSG00000000460      316      505      398       141       168       259
## ENSG00000000938        0        0        0         1         0         0

End of document

Quantification

Workshop on RNA-Seq

Johan Reimegård • 03-Jun-2019