Bulk RNASeq Analysis

Roy Francis / Dag Ahrén

14-Nov-2024

What is RNA?

  • The transcriptome is spatially and temporally dynamic
  • Data comes from functional units (coding regions)
  • Only a tiny fraction of the genome

Applications

  • Identify gene sequences in genomes (annotation)
  • Learn about gene function
  • Differential gene expression
  • Explore isoform and allelic expression
  • Understand co-expression, pathways and networks
  • Gene fusion
  • RNA editing

Workflow

Conesa et al. (2016)

De-Novo assembly

  • When no reference genome available
  • To identify novel genes/transcripts/isoforms
  • Identify fusion genes
  • Assemble transcriptome from short reads
  • Access quality of assembly and refine
  • Map reads back to assembled transcriptome

Trinity, Hsieh et al. (2019), Wang & Gribskov (2017)

Experimental design

  • Biological replicates: 6 - 12 Schurch et al. (2016)
  • Sample size estimation Hart et al. (2013)
  • Power analysis rnaseq-power web app, Zhao et al. (2018)
  • Balanced design to avoid batch effects
  • RIN values have strong effect Gallego Romero et al. (2014)

Library & Sequencing

  • polyA selection / Ribosomal RNA depletion
  • Single-end / Paired-end

Library prep

  • 80% of the RNA in a cell is ribosomal RNA (rRNA)
  • rRNA can be eliminated using polyA selection or rRNA depletion
    • PolyA selection mostly captures only protein-coding genes / mRNA but gives cleaner results
    • Depletion of rRNA is the solution if it’s important to retain all RNA species
  • smallRNAs are purified through size selection
  • PCR amplification may be needed for low quantity input (See section PCR duplicates)
  • Use stranded (directional) libraries Zhao et al. (2015), Levin et al. (2010)
    • Accurately identify sense/antisense transcript
    • Resolve overlapping genes
  • Exome capture
  • Library normalisation to concentrate specific transcripts
  • Libraries should have high complexity / low duplication. Daley & Smith (2013)

Sequencing

  • Short reads vs long reads (Illumina/PacBio)
  • Read length Chhangawala et al. (2015)
    • Greater than 50bp does not improve DGE
    • Longer reads are better for isoforms
  • Pooling samples
  • Sequencing depth (Coverage / Reads per sample)
  • Single-end reads (Cheaper?)
  • Use paired-end reads
    • Increased mappable reads
    • Increased power in assemblies
    • Better for structural variation and isoforms
    • Decreased false-positives for DGE
  • More replicates are better than more depth Liu et al. (2014)

Corley et al. (2017)

Workflow • DGE

Read QC

  • Number of reads
  • Per base sequence quality
  • Per sequence quality score
  • Per base sequence content
  • Per sequence GC content
  • Per base N content
  • Sequence length distribution
  • Sequence duplication levels
  • Overrepresented sequences
  • Adapter content
  • Kmer content

FastQC, MultiQC, https://sequencing.qcfail.com/

FastQC

Good quality

Poor quality

Read QC • PBSQ, PSQS

Per base sequence quality

Per sequence quality scores

Trimming

  • Trimming reads to remove adapter/readthrough or low quality bases
  • Related options are hard clipping, filtering reads
  • Sliding window trimming
  • Filter by min/max read length
    • Remove reads less than ~18nt
  • Demultiplexing/Splitting

When to avoid trimming?

  • Read trimming may not always be necessary Liao & Shi (2020)
  • Fixed read length may sometimes be more important
  • Expected insert size distribution may be more important for assemblers

Cutadapt, fastp, Prinseq

Mapping

  • Aligning reads back to a reference sequence
  • Mapping to genome vs transcriptome
  • Splice-aware alignment (genome) (STAR, HISAT2 etc)

STAR, HiSat2, Baruzzo et al. (2017)

Aligners • Metrics

Baruzzo et al. (2017)

Aligners time and RAM

Program Time_Min Memory_GB
HISATx1 22.7 4.3
HISATx2 47.7 4.3
HISAT 26.7 4.3
STAR 25 28
STARx2 50.5 28
GSNAP 291.9 20.2
TopHat2 1170 4.3

  • Reads (FASTQ)
@ST-E00274:179:HHYMLALXX:8:1101:1641:1309 1:N:0:NGATGT
NCATCGTGGTATTTGCACATCTTTTCTTATCAAATAAAAAGTTTAACCTACTCAGTTATGCGCATACGTTTTTTGATGGCATTTCCATAAACCGATTTTTTTTTTATGCACGTACCCAAAACGTGCAGAAAAATACGCTGCTAGAAATGTA
+
#AAAFAFA<-AFFJJJAFA-FFJJJJFFFAJJJJ-<FFJJJ-A-F-7--FA7F7-----FFFJFA<FFFFJ<AJ--FF-A<A-<JJ-7-7-<FF-FFFJAFFAA--A--7FJ-7----77-A--7F7)---7F-A----7)7-----7<<-

@instrument:runid:flowcellid:lane:tile:xpos:ypos read:isfiltered:controlnumber:sampleid

  • Reference Genome/Transcriptome (FASTA)
>1 dna:chromosome chromosome:GRCz10:1:1:58871917:1 REF
GATCTTAAACATTTATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTCCCCTC
  • Annotation (GTF/GFF)
#!genome-build GRCz10
4       ensembl_havana  gene    6732    52059   .       -       .       gene_id "ENSDARG00000104632"; gene_version "2"; gene_name "rerg"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTDARG00000044080"; havana_gene_version "1";

seq source feature start end score strand frame attribute

Illumina FASTQ format, GTF format

Alignment

  • SAM/BAM (Sequence Alignment Map format)
ST-E00274:188:H3JWNCCXY:4:1102:32431:49900      163     1       1       60      8S139M4S      =       385     535     TATTTAGAGATCTTAAACATCCATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTTCCCTCCAAATTAAATTTAGCCAGAGGCGCACAACATACGACCTCTAAAAAAGGTGCTGGAACATGTACCTATATGCAGCACCACCATC     AAAFAFFAFFFFJ7FFFFJ<JAFA7F-<AJ7JJ<FFFJ--<FAJF<7<7FAFJ-<AFA<-JJJ-AF-AJ-FF<F--A<FF<-7777-7JA-77A---F-7AAFF-FJA--77FJ<--77)))7<JJA<J77<-------<7--))7)))7-    NM:i:4   MD:Z:12T0T40C58T25      AS:i:119        XS:i:102        XA:Z:17,-53287490,4S33M4D114M,11;     MQ:i:60 MC:Z:151M       RG:Z:ST-E00274_188_H3JWNCCXY_4

query flag ref pos mapq cigar mrnm mpos tlen seq qual opt

Never store alignment files in raw SAM format. Always compress it! SAM format

Format Size_GB
SAM 7.4
BAM 1.9
CRAM lossless Q 1.4
CRAM 8 bins Q 0.8
CRAM no Q 0.26

Visualisation • IGV

IGV, UCSC Genome Browser, SeqMonk, More

Visualisation • tview

samtools tview alignment.bam genome.fasta

Visualisation • SeqMonk

SeqMonk

Alignment QC

  • Number of reads mapped/unmapped/paired etc
  • Uniquely mapped
  • Insert size distribution
  • Coverage
  • Gene body coverage
  • Biotype counts / Chromosome counts
  • Counts by region: gene/intron/non-genic
  • Sequencing saturation
  • Strand specificity

STAR (final log file), samtools stats, bamtools stats, QoRTs, RSeQC, Qualimap

Alignment QC • STAR Log

MultiQC can be used to summarise and plot STAR log files.

Alignment QC • Features

QoRTs was run on all samples and summarised using MultiQC.

Alignment QC • QoRTs

Alignment QC • Examples

Read mapping profile

Gene body coverage
Sigurgeirsson et al. (2014)

Alignment QC • Examples

Insert size

Saturation curve

Francis et al. (2013)

Quantification • Counts

  • Read counts = gene expression
  • Intersection on gene models
  • Reads can be quantified on any feature (gene, transcript, exon etc)

featureCounts, HTSeq

Quantification

PCR duplicates

  • Computational deduplication not recommended Klepikova et al. (2017), Parekh et al. (2016)
  • Use PCR-free library-prep kits
  • Use UMIs during library-prep Fu et al. (2018)

Multi-mapping

  • Added (BEDTools multicov)
  • Discard (featureCounts, HTSeq)
  • Distribute counts (Cufflinks, featureCounts)
  • Rescue
    • Probabilistic assignment (Rcount, Cufflinks)
    • Prioritise features (Rcount)
    • Probabilistic assignment with EM (RSEM)

Quantification • Abundance

  • Count methods
    • Provide no inference on isoforms
    • Cannot accurately measure fold change
  • Probabilistic assignment
    • Deconvolute ambiguous mappings
    • Transcript-level
    • cDNA reference

Kallisto, Salmon

  • Ultra-fast & alignment-free
  • Bootstrapping & quantification confidence
  • Transcript-level counts
  • Transcript-level estimates improves gene-level estimates Soneson et al. (2015), tximport
  • Evaluation and comparison of isoform quantification tools Zhang et al. (2017)

RSEM, Kallisto, Salmon

Quantification QC

ENSG00000000003    140   242   188   143   287   344   438   280   253
ENSG00000000005    0     0     0     0     0     0     0     0     0
ENSG00000000419    69    98    77    55    52    94    116   79    69
ENSG00000000457    56    75    104   79    157   205   183   178   153
ENSG00000000460    33    27    23    19    27    42    69    44    40

Pairwise correlation between samples must be high (>0.9)

MultiQC

Normalization

  • Control for Sequencing depth, compositional bias and more
  • Median of Ratios (DESeq2) and TMM (edgeR) perform the best

  • For DGE using DGE packages, use raw counts
  • For clustering, heatmaps etc use VST, VOOM or RLOG
  • For own analysis, plots etc, use TPM
  • Other solutions: spike-ins/house-keeping genes

Dillies et al. (2013), Evans et al. (2018), Wagner et al. (2012)

Exploratory

  • Remove lowly expressed genes
  • Heatmaps, MDS, PCA etc.

pheatmap

  • Transform raw counts to VST, VOOM, RLOG, TPM etc

Batch correction

  • Estimate variation explained by variables (PVCA)
  • Find confounding effects as surrogate variables (SVA)
  • Model known batches in the LM/GLM model
  • Correct known batches (ComBat from SVA)(Can overcorrect! Zindler et al. (2020))
  • Interactively evaluate batch effects and correction (BatchQC) Manimaran et al. (2016)

Differential expression

  • Univariate testing gene-by-gene
  • More descriptive, less predictive

Differential expression

  • DESeq2, edgeR (Neg-binom > GLM > Test)
  • Limma-Voom (Neg-binom > Voom-transform > LM > Test)
  • DESeq2 ~age+condition
    • Estimate size factors estimateSizeFactors()
    • Estimate gene-wise dispersion estimateDispersions()
    • Fit curve to gene-wise dispersion estimates
    • Shrink gene-wise dispersion estimates
    • GLM fit for each gene
    • Wald test nbinomWaldTest()

DESeq2, edgeR, limma, Seyednasrollah et al. (2015)

DGE

  • Results results()
log2 fold change (MLE): type type2 vs control
Wald test p-value: type type2 vs control
DataFrame with 1 row and 6 columns
                        baseMean     log2FoldChange             lfcSE
                       <numeric>          <numeric>         <numeric>
ENSG00000000003 242.307796723287 -0.932926089608546 0.114285150312285
                             stat               pvalue                 padj
                        <numeric>            <numeric>            <numeric>
ENSG00000000003 -8.16314356729037 3.26416150242775e-16 1.36240609998527e-14
  • Summary summary()
out of 17889 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4526, 25%
LFC < 0 (down)     : 5062, 28%
outliers [1]       : 25, 0.14%
low counts [2]     : 0, 0%
(mean count < 3)

  • MA plot plotMA()

  • Volcano plot

  • Normalised counts plotCounts()

Functional analysis • Gene Ontology

  • Gene set analysis (GSA)
  • Gene set enrichment analysis (GSEA)
  • Gene ontology / Reactome databases

Functional analysis • Kegg

  • Pathway analysis (Kegg)

Webgestalt, EnrichR, clusterProfiler, ClueGO, pathview

Summary

  • Sound experimental design to avoid confounding
  • Plan carefully about lib prep, sequencing etc based on experimental objective
  • For DGE, biological replicates may be more important than other considerations (paired-end, sequencing depth, long reads etc)
  • Discard low quality bases, reads, genes and samples
  • Verify that tools and methods align with data assumptions
  • Experiment with multiple pipelines and tools
  • QC! QC everything at every step

Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., … & Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome biology, 17(1), 1-19.

Further learning

Thank you. Questions?

References

Baruzzo, G., Hayer, K. E., Kim, E. J., Di Camillo, B., FitzGerald, G. A., & Grant, G. R. (2017). Simulation-based comprehensive benchmarking of RNA-seq aligners. Nature Methods, 14(2), 135–139. https://www.nature.com/articles/nmeth.4106
Chhangawala, S., Rudy, G., Mason, C. E., & Rosenfeld, J. A. (2015). The impact of read length on quantification of differentially expressed genes and splice junction detection. Genome Biology, 16(1), 1–10. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4531809/
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 1–19.
Corley, S. M., MacKenzie, K. L., Beverdam, A., Roddam, L. F., & Wilkins, M. R. (2017). Differentially expressed genes from RNA-seq and functional enrichment results are affected by the choice of single-end versus paired-end reads and stranded versus non-stranded protocols. BMC Genomics, 18(1), 1–13. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5442695/
Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. https://www.nature.com/articles/nmeth.2375
Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., et al. (2013). A comprehensive evaluation of normalization methods for illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.
Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics, 19(5), 776–792.
Francis, W. R., Christianson, L. M., Kiko, R., Powers, M. L., Shaner, N. C., & D Haddock, S. H. (2013). A comparison across non-model animals suggests an optimal sequencing depth for de novotranscriptome assembly. BMC Genomics, 14(1), 1–12. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-167
Fu, Y., Wu, P.-H., Beane, T., Zamore, P. D., & Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. Bmc Genomics, 19, 1–14.
Gallego Romero, I., Pai, A. A., Tung, J., & Gilad, Y. (2014). RNA-seq: Impact of RNA degradation on transcript quantification. BMC Biology, 12(1), 1–13. https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42
Hart, S. N., Therneau, T. M., Zhang, Y., Poland, G. A., & Kocher, J.-P. (2013). Calculating sample size estimates for RNA sequencing data. Journal of Computational Biology, 20(12), 970–978. https://www.liebertpub.com/doi/10.1089/cmb.2012.0283
Hsieh, P.-H., Oyang, Y.-J., & Chen, C.-Y. (2019). Effect of de novo transcriptome assembly on transcript quantification. Scientific Reports, 9(1), 8304. https://www.nature.com/articles/s41598-019-44499-3
Klepikova, A. V., Kasianov, A. S., Chesnokov, M. S., Lazarevich, N. L., Penin, A. A., & Logacheva, M. (2017). Effect of method of deduplication on estimation of differential gene expression using RNA-seq. PeerJ, 5, e3091. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/
Levin, J. Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D. A., Friedman, N., Gnirke, A., & Regev, A. (2010). Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods, 7(9), 709–715. https://www.nature.com/articles/nmeth.1491
Liao, Y., & Shi, W. (2020). Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level. NAR Genomics and Bioinformatics, 2(3), lqaa068. https://pubmed.ncbi.nlm.nih.gov/33575617/
Liu, Y., Zhou, J., & White, K. P. (2014). RNA-seq differential expression studies: More sequence or more replication? Bioinformatics, 30(3), 301–304. https://academic.oup.com/bioinformatics/article/30/3/301/228651
Manimaran, S., Selby, H. M., Okrah, K., Ruberman, C., Leek, J. T., Quackenbush, J., Haibe-Kains, B., Bravo, H. C., & Johnson, W. E. (2016). BatchQC: Interactive software for evaluating sample and batch effects in genomic data. Bioinformatics, 32(24), 3836–3838.
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9), 1509–1517. https://genome.cshlp.org/content/18/9/1509.long
Parekh, S., Ziegenhain, C., Vieth, B., Enard, W., & Hellmann, I. (2016). The impact of amplification on differential expression analyses by RNA-seq. Scientific Reports, 6(1), 25533. https://www.nature.com/articles/srep25533
Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L., & Pachter, L. (2011). Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biology, 12(3), 1–14.
Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G. G., Owen-Hughes, T., et al. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? Rna, 22(6), 839–851. https://rnajournal.cshlp.org/content/early/2016/03/30/rna.053959.115.abstract
Seyednasrollah, F., Laiho, A., & Elo, L. L. (2015). Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in Bioinformatics, 16(1), 59–70.
Sigurgeirsson, B., Emanuelsson, O., & Lundeberg, J. (2014). Sequencing degraded RNA addressed by 3’tag counting. PloS One, 9(3), e91851. https://pubmed.ncbi.nlm.nih.gov/24632678/
Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: Transcript-level estimates improve gene-level inferences. F1000Research, 4.
Wagner, G. P., Kin, K., & Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131, 281–285.
Wang, S., & Gribskov, M. (2017). Comprehensive evaluation of de novo transcriptome assembly programs and their effects on differential gene expression analysis. Bioinformatics, 33(3), 327–333. https://academic.oup.com/bioinformatics/article/33/3/327/2580374
Zhang, C., Zhang, B., Lin, L.-L., & Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics, 18(1), 1–11.
Zhao, S., Li, C.-I., Guo, Y., Sheng, Q., & Shyr, Y. (2018). RnaSeqSampleSize: Real data based sample size estimation for RNA sequencing. BMC Bioinformatics, 19(1), 1–8. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2191-5
Zhao, S., Zhang, Y., Gordon, W., Quan, J., Xi, H., Du, S., Schack, D. von, & Zhang, B. (2015). Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics, 16(1), 1–14. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4559181/
Zindler, T., Frieling, H., Neyazi, A., Bleich, S., & Friedel, E. (2020). Simulating ComBat: How batch correction can lead to the systematic introduction of false positive results in DNA methylation microarray studies. BMC Bioinformatics, 21, 1–15.

Hands-On tutorial

Main exercise

  • 01 Check the quality of the raw reads with FastQC
  • 02 Map the reads to the reference genome using HISAT2
  • 03 Assess the post-alignment quality using QualiMap
  • 04 Count the reads overlapping with genes using featureCounts
  • 05 Find differentially expressed genes using DESeq2 in R

Bonus exercises

  • 01 Functional annotation of DE genes using GO/Reactome/Kegg databases
  • 02 RNA-Seq figures and plots using R
  • 03 Visualisation of RNA-seq BAM files using IGV genome browser

Data: /sw/courses/ngsintro/rnaseq/
Work: ~/ngsintro/rnaseq/

Hands-On tutorial

  • Course data directory

/sw/courses/ngsintro/rnaseq/

rnaseq/
+-- bonus/
|   +-- assembly/
|   +-- exon/
|   +-- funannot/
|   +-- plots/
+-- documents/
+-- main/
    +-- 1_raw/
    +-- 2_fastqc/
    +-- 3_mapping/
    +-- 4_qualimap/
    +-- 5_dge/
    +-- 6_multiqc/
    +-- reference/
    |   +-- mouse_chr19_hisat2/
    +-- scripts/
  • Your work directory

~/ngsintro/rnaseq/

[user]/
rnaseq/
  +-- 1_raw/
  +-- 2_fastqc/
  +-- 3_mapping/
  +-- 4_qualimap/
  +-- 5_dge/
  +-- 6_multiqc/
  +-- reference/
  |   +-- mouse_chr19_hisat2/
  +-- scripts/
  +-- funannot/
  +-- plots/