Bulk RNASeq Analysis

What is RNA?

The transcriptome is spatially and temporally dynamic
Data comes from functional units (coding regions)
Only a tiny fraction of the genome

Applications

Identify gene sequences in genomes (annotation)
Learn about gene function
Differential gene expression
Explore isoform and allelic expression
Understand co-expression, pathways and networks
Gene fusion
RNA editing

Workflow

Conesa et al. (2016)

De-Novo assembly

When no reference genome available
To identify novel genes/transcripts/isoforms
Identify fusion genes
Assemble transcriptome from short reads
Assess quality of assembly and refine
Map reads back to assembled transcriptome

Trinity, Hsieh et al. (2019), Wang & Gribskov (2017)

Experimental design

Biological replicates: 6 - 12 Schurch et al. (2016)
Sample size estimation Hart et al. (2013)
Power analysis rnaseq-power web app, Zhao et al. (2018)
Balanced design to avoid batch effects
RIN values have strong effect Gallego Romero et al. (2014)
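A balanced design can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function name `assign_batches`, the sample labels and the two-batch setup are hypothetical, and in practice the order of samples within each condition should be randomised before assignment.

```python
from collections import defaultdict
from itertools import cycle


def assign_batches(samples, n_batches):
    """Spread each condition as evenly as possible across batches.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id -> batch index (0-based).
    """
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    # One cycle shared by all conditions keeps batch sizes even as well.
    batch_cycle = cycle(range(n_batches))
    for condition in sorted(by_condition):
        for sample_id in by_condition[condition]:
            assignment[sample_id] = next(batch_cycle)
    return assignment


# Hypothetical experiment: 6 control and 6 treated samples, 2 library-prep batches.
samples = [(f"ctrl{i}", "control") for i in range(1, 7)]
samples += [(f"trt{i}", "treated") for i in range(1, 7)]
design = assign_batches(samples, n_batches=2)
```

With this layout every batch contains the same number of control and treated samples, so batch is not confounded with condition.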

Library & Sequencing

polyA selection / Ribosomal RNA depletion
Single-end / Paired-end

Library prep

80% of the RNA in a cell is ribosomal RNA (rRNA)
rRNA can be eliminated using polyA selection or rRNA depletion
- PolyA selection mostly captures only protein-coding genes / mRNA but gives cleaner results
- Depletion of rRNA is the solution if it’s important to retain all RNA species
smallRNAs are purified through size selection
PCR amplification may be needed for low quantity input (See section PCR duplicates)
Use stranded (directional) libraries Zhao et al. (2015), Levin et al. (2010)
- Accurately identify sense/antisense transcript
- Resolve overlapping genes
Exome capture
Library normalisation to concentrate specific transcripts
Libraries should have high complexity / low duplication. Daley & Smith (2013)

Sequencing

Short reads vs long reads (Illumina/PacBio)
Read length Chhangawala et al. (2015)
- Reads longer than 50 bp do not improve DGE
- Longer reads are better for isoforms
Pooling samples
Sequencing depth (Coverage / Reads per sample)
Single-end reads (Cheaper?)
Use paired-end reads
- Increased mappable reads
- Increased power in assemblies
- Better for structural variation and isoforms
- Decreased false-positives for DGE
More replicates are better than more depth Liu et al. (2014)

Corley et al. (2017)

Workflow • DGE

Read QC

Number of reads
Per base sequence quality
Per sequence quality score
Per base sequence content
Per sequence GC content
Per base N content
Sequence length distribution
Sequence duplication levels
Overrepresented sequences
Adapter content
Kmer content

FastQC, MultiQC, https://sequencing.qcfail.com/
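To make the "per base sequence quality" metric concrete, here is a minimal sketch of how quality strings can be turned into Phred scores and averaged per position. It assumes the Phred+33 encoding used by current Illumina FASTQ files; the function names are hypothetical and this is not how FastQC itself is implemented.

```python
def phred_scores(quality_line, offset=33):
    """Convert one FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_per_base(quality_lines):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings (not real data).
per_base = mean_quality_per_base(["#A", "AF", "J"])
```

A drop in these per-position means towards the 3' end of the reads is the pattern that the FastQC per base sequence quality plot shows for poor-quality runs.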

FastQC

Good quality

Poor quality

Read QC • PBSQ, PSQS

Per base sequence quality

Per sequence quality scores

Trimming

Trimming reads to remove adapter/readthrough or low quality bases
Related options are hard clipping, filtering reads
Sliding window trimming
Filter by min/max read length
- Remove reads less than ~18nt
Demultiplexing/Splitting
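Sliding window trimming and length filtering can be sketched as follows. This is a simplified illustration, not the exact algorithm of Cutadapt, fastp or Prinseq: it scans from the 5' end and cuts the read at the start of the first window whose mean quality falls below a threshold. The window size, quality threshold and function names are assumptions; the 18 nt minimum length comes from the slide.

```python
def sliding_window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Cut the read at the first window whose mean Phred score is too low."""
    scores = [ord(ch) - offset for ch in qual]
    if not scores:
        return seq, qual
    cut = len(seq)
    # Reads shorter than the window are evaluated as a single window.
    for start in range(max(len(scores) - window + 1, 1)):
        win = scores[start:start + window]
        if sum(win) / len(win) < min_mean_q:
            cut = start
            break
    return seq[:cut], qual[:cut]


def passes_length_filter(seq, min_len=18):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len


# Toy read: six high-quality bases followed by four low-quality bases.
trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "JJJJJJ####")
keep = passes_length_filter(trimmed_seq)
```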

When to avoid trimming?

Read trimming may not always be necessary Liao & Shi (2020)
Fixed read length may sometimes be more important
Expected insert size distribution may be more important for assemblers

Cutadapt, fastp, Prinseq

Mapping

Aligning reads back to a reference sequence
Mapping to genome vs transcriptome
Splice-aware alignment (genome) (STAR, HISAT2 etc)

STAR, HISAT2, Baruzzo et al. (2017)

Aligners • Metrics

Baruzzo et al. (2017)

Aligners time and RAM

Program	Time_Min	Memory_GB
HISATx1	22.7	4.3
HISATx2	47.7	4.3
HISAT	26.7	4.3
STAR	25	28
STARx2	50.5	28
GSNAP	291.9	20.2
TopHat2	1170	4.3

Reads (FASTQ)

@ST-E00274:179:HHYMLALXX:8:1101:1641:1309 1:N:0:NGATGT
NCATCGTGGTATTTGCACATCTTTTCTTATCAAATAAAAAGTTTAACCTACTCAGTTATGCGCATACGTTTTTTGATGGCATTTCCATAAACCGATTTTTTTTTTATGCACGTACCCAAAACGTGCAGAAAAATACGCTGCTAGAAATGTA
+
#AAAFAFA<-AFFJJJAFA-FFJJJJFFFAJJJJ-<FFJJJ-A-F-7--FA7F7-----FFFJFA<FFFFJ<AJ--FF-A<A-<JJ-7-7-<FF-FFFJAFFAA--A--7FJ-7----77-A--7F7)---7F-A----7)7-----7<<-

@instrument:runid:flowcellid:lane:tile:xpos:ypos read:isfiltered:controlnumber:sampleid
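The header layout above can be split into named fields with a short parser. This is a sketch that assumes the two-part, colon-separated layout shown on the slide; the function name and the dictionary keys are hypothetical and simply mirror the field names in the legend.

```python
def parse_illumina_header(header):
    """Split an Illumina FASTQ header line into named fields."""
    location, _, member = header.lstrip("@").partition(" ")
    instrument, run_id, flowcell_id, lane, tile, x_pos, y_pos = location.split(":")
    read, is_filtered, control_number, sample_id = member.split(":")
    return {
        "instrument": instrument,
        "run_id": run_id,
        "flowcell_id": flowcell_id,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control_number": int(control_number),
        "sample_id": sample_id,
    }


# Header taken from the FASTQ record shown on this slide.
fields = parse_illumina_header(
    "@ST-E00274:179:HHYMLALXX:8:1101:1641:1309 1:N:0:NGATGT"
)
```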

Reference Genome/Transcriptome (FASTA)

>1 dna:chromosome chromosome:GRCz10:1:1:58871917:1 REF
GATCTTAAACATTTATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTCCCCTC

Annotation (GTF/GFF)

#!genome-build GRCz10
4       ensembl_havana  gene    6732    52059   .       -       .       gene_id "ENSDARG00000104632"; gene_version "2"; gene_name "rerg"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTDARG00000044080"; havana_gene_version "1";

seq source feature start end score strand frame attribute
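The nine GTF columns listed above can be read with a small parser. The sketch below assumes tab-separated columns, as in a standard GTF file, and uses an abbreviated version of the record on this slide; the function name is hypothetical.

```python
GTF_COLUMNS = [
    "seq", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def parse_gtf_line(line):
    """Parse one tab-separated GTF record into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The attribute column holds 'key "value";' pairs.
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Abbreviated copy of the gene record shown on this slide.
line = (
    "4\tensembl_havana\tgene\t6732\t52059\t.\t-\t.\t"
    'gene_id "ENSDARG00000104632"; gene_name "rerg"; gene_biotype "protein_coding";'
)
gene = parse_gtf_line(line)
# GTF coordinates are 1-based and inclusive.
gene_length = gene["end"] - gene["start"] + 1
```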

Illumina FASTQ format, GTF format

Alignment

SAM/BAM (Sequence Alignment Map format)

ST-E00274:188:H3JWNCCXY:4:1102:32431:49900      163     1       1       60      8S139M4S      =       385     535     TATTTAGAGATCTTAAACATCCATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTTCCCTCCAAATTAAATTTAGCCAGAGGCGCACAACATACGACCTCTAAAAAAGGTGCTGGAACATGTACCTATATGCAGCACCACCATC     AAAFAFFAFFFFJ7FFFFJ<JAFA7F-<AJ7JJ<FFFJ--<FAJF<7<7FAFJ-<AFA<-JJJ-AF-AJ-FF<F--A<FF<-7777-7JA-77A---F-7AAFF-FJA--77FJ<--77)))7<JJA<J77<-------<7--))7)))7-    NM:i:4   MD:Z:12T0T40C58T25      AS:i:119        XS:i:102        XA:Z:17,-53287490,4S33M4D114M,11;     MQ:i:60 MC:Z:151M       RG:Z:ST-E00274_188_H3JWNCCXY_4

query flag ref pos mapq cigar mrnm mpos tlen seq qual opt
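Two of the SAM columns are encoded: the flag is a bit field and the CIGAR string is a run-length description of the alignment. The sketch below decodes both for the record shown above. The bit meanings follow the SAM specification; the function names and the short labels are hypothetical.

```python
import re

# Bit values as defined in the SAM specification; labels are shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


flags = decode_flag(163)          # flag of the record above
ops = parse_cigar("8S139M4S")     # CIGAR of the record above
```

For the example record, flag 163 describes the second read of a properly paired fragment whose mate maps to the reverse strand, and the CIGAR string describes a 151 bp read with 8 and 4 soft-clipped bases at the two ends.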

Never store alignment files in raw SAM format; always compress them to BAM or CRAM. SAM format

Format	Size_GB
SAM	7.4
BAM	1.9
CRAM lossless Q	1.4
CRAM 8 bins Q	0.8
CRAM no Q	0.26

Visualisation • IGV

IGV, UCSC Genome Browser, SeqMonk, More

Visualisation • `tview`

samtools tview alignment.bam genome.fasta

Visualisation • SeqMonk

SeqMonk

Alignment QC

Number of reads mapped/unmapped/paired etc
Uniquely mapped
Insert size distribution
Coverage
Gene body coverage
Biotype counts / Chromosome counts
Counts by region: gene/intron/non-genic
Sequencing saturation
Strand specificity

STAR (final log file), samtools stats, bamtools stats, QoRTs, RSeQC, Qualimap

Alignment QC • STAR Log

MultiQC can be used to summarise and plot STAR log files.

Alignment QC • Features

QoRTs was run on all samples and summarised using MultiQC.

Alignment QC • QoRTs

Alignment QC • Examples

Read mapping profile

Gene body coverage
Sigurgeirsson et al. (2014)

Alignment QC • Examples

Insert size

Saturation curve

Francis et al. (2013)

Quantification • Counts

Read counts are used as a proxy for gene expression
Intersection on gene models
Reads can be quantified on any feature (gene, transcript, exon etc)

featureCounts, HTSeq
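The idea of intersecting reads with gene models can be sketched as below. This is a deliberately simplified illustration and not the algorithm used by featureCounts or HTSeq: reads and genes are plain intervals, strand and splicing are ignored, and a read overlapping more than one gene is set aside as ambiguous. All names and coordinates are hypothetical.

```python
def count_reads(reads, genes):
    """Count reads per gene.

    reads: list of (chrom, start, end), 1-based inclusive.
    genes: dict gene_id -> (chrom, start, end), 1-based inclusive.
    """
    counts = {gene_id: 0 for gene_id in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) == 1:
            counts[hits[0]] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Two overlapping toy genes and five toy reads.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 150),  # only geneA
    ("chr1", 190, 195),  # geneA and geneB
    ("chr1", 250, 280),  # only geneB
    ("chr2", 100, 150),  # no gene
    ("chr1", 195, 240),  # geneA and geneB
]
gene_counts = count_reads(reads, genes)
```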

Quantification

PCR duplicates

Computational deduplication not recommended Klepikova et al. (2017), Parekh et al. (2016)
Use PCR-free library-prep kits
Use UMIs during library-prep Fu et al. (2018)

Multi-mapping

Added (BEDTools multicov)
Discard (featureCounts, HTSeq)
Distribute counts (Cufflinks, featureCounts)
Rescue
- Probabilistic assignment (Rcount, Cufflinks)
- Prioritise features (Rcount)
- Probabilistic assignment with EM (RSEM)
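Probabilistic assignment with EM can be illustrated with a toy model. The sketch below is not RSEM's actual model: it ignores transcript length, fragment bias and alignment quality, and only shows the principle that multi-mapping reads are shared out in proportion to the current abundance estimates, which are then re-estimated. The function name and the toy data are hypothetical.

```python
def em_assign(read_compat, transcripts, n_iter=50):
    """Toy EM: expected read count per transcript.

    read_compat: one list per read with the transcripts it maps to.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    expected = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read according to current abundances.
        expected = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                share = theta[t] / total if total > 0 else 1.0 / len(compat)
                expected[t] += share
        # M-step: abundances are the fractions of assigned reads.
        n_reads = len(read_compat)
        theta = {t: expected[t] / n_reads for t in transcripts}
    return expected


# 8 reads unique to T1, 2 unique to T2, 10 mapping to both.
read_compat = [["T1"]] * 8 + [["T2"]] * 2 + [["T1", "T2"]] * 10
estimates = em_assign(read_compat, ["T1", "T2"])
```

In this toy case the ten ambiguous reads end up split roughly 4:1 in favour of T1, matching the ratio of the uniquely mapped reads, whereas an even split would give each transcript five.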

Quantification • Abundance

Count methods
- Provide no inference on isoforms
- Cannot accurately measure fold change

Probabilistic assignment
- Deconvolute ambiguous mappings
- Transcript-level
- cDNA reference

Kallisto, Salmon

Ultra-fast & alignment-free
Bootstrapping & quantification confidence
Transcript-level counts
Transcript-level estimates improve gene-level estimates Soneson et al. (2015), tximport
Evaluation and comparison of isoform quantification tools Zhang et al. (2017)

RSEM, Kallisto, Salmon

Quantification QC

ENSG00000000003    140   242   188   143   287   344   438   280   253
ENSG00000000005    0     0     0     0     0     0     0     0     0
ENSG00000000419    69    98    77    55    52    94    116   79    69
ENSG00000000457    56    75    104   79    157   205   183   178   153
ENSG00000000460    33    27    23    19    27    42    69    44    40

Pairwise correlation between samples must be high (>0.9)
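The pairwise correlation check can be sketched in plain Python. The counts are copied from the table on this slide; the sample labels S1 to S9, the log2(count + 1) transform and the use of Pearson correlation are assumptions made for illustration, and five genes are far too few for a real QC.

```python
import math

# Counts from the table on this slide; sample labels are hypothetical.
counts = {
    "ENSG00000000003": [140, 242, 188, 143, 287, 344, 438, 280, 253],
    "ENSG00000000005": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    "ENSG00000000419": [69, 98, 77, 55, 52, 94, 116, 79, 69],
    "ENSG00000000457": [56, 75, 104, 79, 157, 205, 183, 178, 153],
    "ENSG00000000460": [33, 27, 23, 19, 27, 42, 69, 44, 40],
}
samples = [f"S{i}" for i in range(1, 10)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# One log-transformed expression profile per sample (column of the table).
profiles = {
    s: [math.log2(row[j] + 1) for row in counts.values()]
    for j, s in enumerate(samples)
}
corr = {a: {b: pearson(profiles[a], profiles[b]) for b in samples} for a in samples}

# Sample pairs that would need a closer look under the >0.9 rule of thumb.
low_pairs = [
    (a, b) for i, a in enumerate(samples) for b in samples[i + 1:]
    if corr[a][b] < 0.9
]
```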

MultiQC

Normalization

Control for Sequencing depth, compositional bias and more
Median of Ratios (DESeq2) and TMM (edgeR) perform the best

For DGE using DGE packages, use raw counts
For clustering, heatmaps etc use VST, VOOM or RLOG
For own analysis, plots etc, use TPM
Other solutions: spike-ins/house-keeping genes

Dillies et al. (2013), Evans et al. (2018), Wagner et al. (2012)
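The median of ratios idea can be sketched as follows: each gene's count is divided by that gene's geometric mean across samples, and the per-sample median of these ratios becomes the size factor. This is a minimal illustration of the principle and not DESeq2's code; the function name and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: dict gene -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # Genes with a zero count have a geometric mean of zero and are skipped.
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = {
    "gene1": [10, 20],
    "gene2": [100, 200],
    "gene3": [5, 10],
    "gene4": [0, 3],
}
factors = size_factors(toy_counts)
normalised = {
    g: [c / f for c, f in zip(row, factors)] for g, row in toy_counts.items()
}
```

Dividing raw counts by the size factors puts the samples on a common scale; the DGE packages do this internally, which is why they expect raw counts as input.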

Exploratory

Remove lowly expressed genes
Heatmaps, MDS, PCA etc.

pheatmap

Transform raw counts to VST, VOOM, RLOG, TPM etc
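Of the transformations listed, TPM is simple enough to sketch directly: counts are first divided by gene length, then scaled so that each sample sums to one million. The gene lengths and counts below are hypothetical, and a real analysis would use effective transcript lengths.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample.

    counts: dict gene -> raw count.
    lengths_bp: dict gene -> gene (or transcript) length in base pairs.
    """
    # Reads per kilobase corrects for gene length.
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(rate.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Scaling by the sample total corrects for sequencing depth.
    return {g: r / total * 1e6 for g, r in rate.items()}


# Two toy genes with equal counts but different lengths.
sample_tpm = tpm({"geneA": 500, "geneB": 500}, {"geneA": 1000, "geneB": 2000})
```

The shorter gene receives the higher TPM because the same number of reads comes from fewer bases.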

Batch correction

Estimate variation explained by variables (PVCA)

Find confounding effects as surrogate variables (SVA)
Model known batches in the LM/GLM model
Correct known batches (ComBat from SVA). Can overcorrect! Zindler et al. (2020)
Interactively evaluate batch effects and correction (BatchQC) Manimaran et al. (2016)

Differential expression

Univariate testing gene-by-gene
More descriptive, less predictive

Differential expression

DESeq2, edgeR (Neg-binom > GLM > Test)
Limma-Voom (Neg-binom > Voom-transform > LM > Test)
DESeq2 ~age+condition
- Estimate size factors estimateSizeFactors()
- Estimate gene-wise dispersion estimateDispersions()
- Fit curve to gene-wise dispersion estimates
- Shrink gene-wise dispersion estimates
- GLM fit for each gene
- Wald test nbinomWaldTest()

DESeq2, edgeR, limma, Seyednasrollah et al. (2015)

DGE

Results results()

log2 fold change (MLE): type type2 vs control
Wald test p-value: type type2 vs control
DataFrame with 1 row and 6 columns
                        baseMean     log2FoldChange             lfcSE
                       <numeric>          <numeric>         <numeric>
ENSG00000000003 242.307796723287 -0.932926089608546 0.114285150312285
                             stat               pvalue                 padj
                        <numeric>            <numeric>            <numeric>
ENSG00000000003 -8.16314356729037 3.26416150242775e-16 1.36240609998527e-14

Summary summary()

out of 17889 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 4526, 25%
LFC < 0 (down)     : 5062, 28%
outliers [1]       : 25, 0.14%
low counts [2]     : 0, 0%
(mean count < 3)
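The padj column and the "adjusted p-value < 0.1" threshold refer to p-values corrected for multiple testing; by default DESeq2 uses the Benjamini-Hochberg procedure for this. The sketch below shows that procedure on its own with made-up p-values. It does not reproduce DESeq2's independent filtering or outlier handling, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.1 for p in padj]
```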

MA plot plotMA()

Volcano plot

Normalised counts plotCounts()

Functional analysis • Gene Ontology

Gene set analysis (GSA)
Gene set enrichment analysis (GSEA)
Gene ontology / Reactome databases
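Gene set analysis in its simplest form is an over-representation test: given a list of DE genes, is a gene set hit more often than expected by chance? The sketch below uses the one-sided hypergeometric test for this. It is an illustration only; the function name and the numbers are hypothetical, and GSEA proper uses a different, rank-based statistic.

```python
from math import comb


def enrichment_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: number of genes tested
    n_in_set:   genes in the universe that belong to the gene set
    n_selected: number of DE genes
    n_overlap:  DE genes that belong to the gene set
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes tested, 5 in the set, 5 DE, all 5 DE in the set.
p_value = enrichment_pvalue(10, 5, 5, 5)
```

Because many gene sets are tested at once, the resulting p-values need multiple testing correction before interpretation.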

Functional analysis • Kegg

Pathway analysis (Kegg)

Webgestalt, EnrichR, clusterProfiler, ClueGO, pathview

Summary

Sound experimental design to avoid confounding
Plan library prep, sequencing etc. carefully, based on the experimental objective
For DGE, biological replicates may be more important than other considerations (paired-end, sequencing depth, long reads etc)
Discard low quality bases, reads, genes and samples
Verify that tools and methods align with data assumptions
Experiment with multiple pipelines and tools
QC! QC everything at every step

Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., … & Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome biology, 17(1), 1-19.

Further learning

Griffith lab RNA-Seq using HISAT & StringTie tutorial
HBC Training DGE using DESeq2 tutorial
RNA-Seq Blog
SciLifeLab courses

Thank you. Questions?

References

Baruzzo, G., Hayer, K. E., Kim, E. J., Di Camillo, B., FitzGerald, G. A., & Grant, G. R. (2017). Simulation-based comprehensive benchmarking of RNA-seq aligners. Nature Methods, 14(2), 135–139. https://www.nature.com/articles/nmeth.4106

Chhangawala, S., Rudy, G., Mason, C. E., & Rosenfeld, J. A. (2015). The impact of read length on quantification of differentially expressed genes and splice junction detection. Genome Biology, 16(1), 1–10. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4531809/

Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 1–19.

Corley, S. M., MacKenzie, K. L., Beverdam, A., Roddam, L. F., & Wilkins, M. R. (2017). Differentially expressed genes from RNA-seq and functional enrichment results are affected by the choice of single-end versus paired-end reads and stranded versus non-stranded protocols. BMC Genomics, 18(1), 1–13. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5442695/

Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325–327. https://www.nature.com/articles/nmeth.2375

Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.

Evans, C., Hardin, J., & Stoebel, D. M. (2018). Selecting between-sample RNA-seq normalization methods from the perspective of their assumptions. Briefings in Bioinformatics, 19(5), 776–792.

Francis, W. R., Christianson, L. M., Kiko, R., Powers, M. L., Shaner, N. C., & Haddock, S. H. D. (2013). A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly. BMC Genomics, 14(1), 1–12. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-167

Fu, Y., Wu, P.-H., Beane, T., Zamore, P. D., & Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics, 19, 1–14.

Gallego Romero, I., Pai, A. A., Tung, J., & Gilad, Y. (2014). RNA-seq: Impact of RNA degradation on transcript quantification. BMC Biology, 12(1), 1–13. https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42

Hart, S. N., Therneau, T. M., Zhang, Y., Poland, G. A., & Kocher, J.-P. (2013). Calculating sample size estimates for RNA sequencing data. Journal of Computational Biology, 20(12), 970–978. https://www.liebertpub.com/doi/10.1089/cmb.2012.0283

Hsieh, P.-H., Oyang, Y.-J., & Chen, C.-Y. (2019). Effect of de novo transcriptome assembly on transcript quantification. Scientific Reports, 9(1), 8304. https://www.nature.com/articles/s41598-019-44499-3

Klepikova, A. V., Kasianov, A. S., Chesnokov, M. S., Lazarevich, N. L., Penin, A. A., & Logacheva, M. (2017). Effect of method of deduplication on estimation of differential gene expression using RNA-seq. PeerJ, 5, e3091. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/

Levin, J. Z., Yassour, M., Adiconis, X., Nusbaum, C., Thompson, D. A., Friedman, N., Gnirke, A., & Regev, A. (2010). Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods, 7(9), 709–715. https://www.nature.com/articles/nmeth.1491

Liao, Y., & Shi, W. (2020). Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level. NAR Genomics and Bioinformatics, 2(3), lqaa068. https://pubmed.ncbi.nlm.nih.gov/33575617/

Liu, Y., Zhou, J., & White, K. P. (2014). RNA-seq differential expression studies: More sequence or more replication? Bioinformatics, 30(3), 301–304. https://academic.oup.com/bioinformatics/article/30/3/301/228651

Manimaran, S., Selby, H. M., Okrah, K., Ruberman, C., Leek, J. T., Quackenbush, J., Haibe-Kains, B., Bravo, H. C., & Johnson, W. E. (2016). BatchQC: Interactive software for evaluating sample and batch effects in genomic data. Bioinformatics, 32(24), 3836–3838.

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9), 1509–1517. https://genome.cshlp.org/content/18/9/1509.long

Parekh, S., Ziegenhain, C., Vieth, B., Enard, W., & Hellmann, I. (2016). The impact of amplification on differential expression analyses by RNA-seq. Scientific Reports, 6(1), 25533. https://www.nature.com/articles/srep25533

Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L., & Pachter, L. (2011). Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biology, 12(3), 1–14.

Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G. G., Owen-Hughes, T., et al. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 22(6), 839–851. https://rnajournal.cshlp.org/content/early/2016/03/30/rna.053959.115.abstract

Seyednasrollah, F., Laiho, A., & Elo, L. L. (2015). Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in Bioinformatics, 16(1), 59–70.

Sigurgeirsson, B., Emanuelsson, O., & Lundeberg, J. (2014). Sequencing degraded RNA addressed by 3’tag counting. PloS One, 9(3), e91851. https://pubmed.ncbi.nlm.nih.gov/24632678/

Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: Transcript-level estimates improve gene-level inferences. F1000Research, 4.

Wagner, G. P., Kin, K., & Lynch, V. J. (2012). Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131, 281–285.

Wang, S., & Gribskov, M. (2017). Comprehensive evaluation of de novo transcriptome assembly programs and their effects on differential gene expression analysis. Bioinformatics, 33(3), 327–333. https://academic.oup.com/bioinformatics/article/33/3/327/2580374

Zhang, C., Zhang, B., Lin, L.-L., & Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics, 18(1), 1–11.

Zhao, S., Li, C.-I., Guo, Y., Sheng, Q., & Shyr, Y. (2018). RnaSeqSampleSize: Real data based sample size estimation for RNA sequencing. BMC Bioinformatics, 19(1), 1–8. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2191-5

Zhao, S., Zhang, Y., Gordon, W., Quan, J., Xi, H., Du, S., Schack, D. von, & Zhang, B. (2015). Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics, 16(1), 1–14. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4559181/

Zindler, T., Frieling, H., Neyazi, A., Bleich, S., & Friedel, E. (2020). Simulating ComBat: How batch correction can lead to the systematic introduction of false positive results in DNA methylation microarray studies. BMC Bioinformatics, 21, 1–15.

Hands-On tutorial

Main exercise

01 Check the quality of the raw reads with FastQC
02 Map the reads to the reference genome using HISAT2
03 Assess the post-alignment quality using QualiMap
04 Count the reads overlapping with genes using featureCounts
05 Find differentially expressed genes using DESeq2 in R

Bonus exercises

01 Functional annotation of DE genes using GO/Reactome/Kegg databases
02 RNA-Seq figures and plots using R
03 Visualisation of RNA-seq BAM files using IGV genome browser

Data: /sw/courses/ngsintro/rnaseq/dardel
Work: ~/ngsintro/rnaseq/

Hands-On tutorial

Course data directory

/sw/courses/ngsintro/rnaseq/dardel

dardel/
├── bonus
│   ├── assembly
│   ├── exon
│   ├── funannot
│   └── plots
├── main
│   ├── 1_raw
│   ├── 2_fastqc
│   ├── 3_mapping
│   ├── 4_qorts
│   ├── 4_qualimap
│   ├── 5_dge
│   ├── 6_multiqc
│   ├── reference
│   │   └── mouse_chr19_hisat2
│   └── scripts
├── main_full
│   └── nextflow
├── r
└── README.md

Your work directory

~/ngsintro/rnaseq/

rnaseq/
├── 1_raw
├── 2_fastqc
├── 3_mapping
├── 4_picard
├── 4_qualimap
├── 5_dge
├── 6_multiqc
├── funannot
├── plots
├── reference
└── scripts
