Once all the read samples have been mapped to the reference we want to quantify how many of the reads that are mapping to different parts of the genome. In most cases this regions are the genes but you can ask to get counts for other parts of the genome as long as you can provide a annotation file with that information. In most cases the format GTF or GFF3 are the most common formats for genome annotation and the ones that most programs use.
In our case we are going to use a program called featureCounts that is fast and reliable.
First load the modules.
module load bioinfo-tools
module load subread/1.5.2
Then by providing the mapped read files and the annotation you can quantify how many reads that mapped to your reference. There are many settings that you can use to decide how you want to do the counting. In our case we will count the reads that are in the annotated exons (-t exon
) and sum them up per gene (-g gene
) . In featureCounts, this is done by the following command.
# With only one bam file
featureCounts -T {threads} -p -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/file1.bam > results/Counts.featureCount.log
# Or multiple bamfiles.
featureCounts -T {threads} -p -t exon -g gene_id -a {input.gtf} -o {output.countTable} bamFiles/*.bam > results/Counts.featureCount.log
This will create a file that contains a matrix were the rows represents the genes and the columns represent the samples. Feature count will also include some information per gene to see were it is annotated in the reference. The file is easy to open in R for further analysis.
counts <- read.delim("./data/count_raw.txt",header=TRUE)
head(counts)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
## ENSG00000000003 321 303 204 492 455 359
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 696 660 472 951 963 689
## ENSG00000000457 59 54 44 109 73 66
## ENSG00000000460 399 405 236 445 454 374
## ENSG00000000938 0 0 0 0 0 1
## Sample_7 Sample_8 Sample_9 Sample_10 Sample_11 Sample_12
## ENSG00000000003 376 523 450 950 760 1436
## ENSG00000000005 0 0 0 0 0 0
## ENSG00000000419 706 1041 796 1036 789 1413
## ENSG00000000457 60 125 74 108 115 174
## ENSG00000000460 316 505 398 141 168 259
## ENSG00000000938 0 0 0 1 0 0
End of document