class: center, middle, inverse, title-slide # RNA-Seq Quality Control ## RNA-Seq Analysis Workshop ###
Roy Francis
| 23-Oct-2018 --- layout: true <link href="https://fonts.googleapis.com/css?family=Lato:400,700|Roboto:400,700" rel="stylesheet"> <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous"> --- name: contents class: spaced ## Contents * [Workflow](#workflow) * [RNA extraction](#rna-extraction) * [Read QC](#read-qc) * [Alignment QC](#alignment-qc-overview) * [Quantification QC](#quantification-qc) * [Exploratory](#exploratory-heatmap) * [Batch correction](#batch-correction) * [Spike-Ins](#spike-in) --- name: workflow class: spaced ## Workflow .size-70[![](images/workflow.svg)] --- name: exp-design class: spaced ## Experimental design .pull-left-50[ - Balanced design - Technical replicates not necessary (.medium[.altcol[Marioni *et al.*, 2008]]) - Biological replicates: 6 - 12 (.medium[.altcol[Schurch *et al.*, 2016]]) - ENCODE consortium - Previous publications - Power analysis ] .pull-right-50[ .size-90[![](images/batch-effect.svg)] ] .citation[ <span style="display:block;"><i class="fas fa-link"></i> Busby, Michele A., *et al.* "Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression." [Bioinformatics 29.5 (2013): 656-657](https://academic.oup.com/bioinformatics/article/29/5/656/252753)</span> <span style="display:block;"><i class="fas fa-link"></i> Marioni, John C., *et al.* "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays." [Genome research (2008)](https://genome.cshlp.org/content/18/9/1509.long)</span> <span style="display:block;"><i class="fas fa-link"></i> Schurch, Nicholas J., *et al.* "How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?." [Rna (2016)](http://rnajournal.cshlp.org/content/early/2016/03/30/rna.053959.115.abstract)</span> <span style="display:block;"><i class="fas fa-link"></i> Zhao, Shilin, *et al.* "RnaSeqSampleSize: real data based sample size estimation for RNA sequencing." [BMC bioinformatics 19.1 (2018): 191](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2191-5)</span> ] --- name: rna-extraction class: spaced ## RNA extraction .pull-left-50[ - Sample processing and storage - RNA quality/quantity - RIN values (Strong effect) - DNAse treatment - RNA type - Contamination/Cross-contamination - Batch effect - Extraction method bias (GC bias) ] .pull-right-50[ .size-75[![](images/degradation.jpg)] ] .citation[ <span style="display:block;"><i class="fas fa-link"></i> Romero, Irene Gallego, *et al*. "RNA-seq: impact of RNA degradation on transcript quantification." [BMC biology 12.1 (2014): 42](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42)</span> <span style="display:block;"><i class="fas fa-link"></i> Kim, Young-Kook, *et al*. "Short structured RNAs with low GC content are selectively lost during extraction from a small number of cells." [Molecular cell 46.6 (2012): 893-895](https://www.cell.com/molecular-cell/fulltext/S1097-2765(12)00481-9).</span> ] --- name: library-prep class: spaced ## Library prep .pull-left-50[ - PolyA selection - rRNA depletion - Size selection - PCR amplification (.medium[See section PCR duplicates]) - Stranded (directional) libraries - Accurately identify sense/antisense transcript - Resolve overlapping genes - Exome capture - Library normalisation - Batch effect ] .pull-right-50[ ![](images/rnaseq_library_prep.svg) ] .citation[ <span style="display:block;"><i class="fas fa-link"></i> Zhao, Shanrong, et al. "Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap." [BMC genomics 16.1 (2015): 675](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4559181/)</span> <span style="display:block;"><i class="fas fa-link"></i> Levin, Joshua Z., et al. "Comprehensive comparative analysis of strand-specific RNA sequencing methods." [Nature methods 7.9 (2010): 709](https://www.nature.com/articles/nmeth.1491)</span> ] --- name: read-qc ## Read QC - Number of reads - Per base sequence quality - Per sequence quality score - Per base sequence content - Per sequence GC content - Per base N content - Sequence length distribution - Sequence duplication levels - Overrepresented sequences - Adapter content - Kmer content <img src="images/qc.jpg" style="height:250px; position:fixed; right:0px; bottom: 0px; margin-right: 100px; margin-bottom: 310px;"> <i class="fas fa-toolbox"></i> [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [MultiQC](http://multiqc.info/) https://sequencing.qcfail.com/ ![](images/qcfail.jpg) --- name: read-qc-2 ## Read QC | PBSQ, PSQS .size-90[.vsmall[**Per base sequence quality**] ![](images/pbsq.jpg)] .size-90[.vsmall[**Per sequence quality scores**] ![](images/psqs.jpg)] --- name: read-qc-3 ## Read QC | PBSC, PSGC .size-90[.vsmall[**Per base sequence content**] ![](images/pbsc.jpg)] .size-90[.vsmall[**Per sequence GC content**] ![](images/psgc.jpg)] --- name: read-qc-4 ## Read QC | SDL, AC .size-90[.vsmall[**Sequence duplication level**] ![](images/sdl.jpg)] .size-90[.vsmall[**Adapter content**] ![](images/ac.jpg)] --- name: fastqc ## FastQC .size-75[[![](images/fastqc_good.png)](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)] .size-75[[![](images/fastqc_bad.png)](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)] --- name: trimming ## Trimming .pull-left-50[ - Trim IF necessary - Synthetic bases can be an issue for SNP calling - Insert size distribution may be more important for assemblers - Trim/Clip/Filter reads - Remove adapter sequences - Trim reads by quality - Sliding window trimming - Filter by min/max read length - Remove reads less than ~22nt - Demultiplexing/Splitting <i class="fas fa-toolbox"></i> [Cutadapt](https://github.com/marcelm/cutadapt/), [fastp](https://github.com/OpenGene/fastp), [Skewer](https://github.com/relipmoc/skewer), [Prinseq](http://prinseq.sourceforge.net/) ] .pull-right-50[ ![](images/rnaseq_read_through.svg) ] --- name: alignment-qc-overview class: spaced ## Alignment QC - Number of reads mapped/unmapped/paired etc - Uniquely mapped - Insert size distribution - Coverage - Gene body coverage - Biotype counts / Chromosome counts - Counts by region: gene/intron/non-genic - Sequencing saturation - Strand specificity <i class="fas fa-toolbox"></i> STAR (final log file), samtools stats, bamtools stats, [QoRTs](https://hartleys.github.io/QoRTs/), [RSeQC](http://rseqc.sourceforge.net/), [Qualimap](http://qualimap.bioinfo.cipf.es/) --- name: alignment-qc-qorts ## Alignment QC | QoRTs ![](images/qorts.png) --- name: alignment-qc-star ## Alignment QC | STAR Log MultiQC can be used to summarise and plot STAR log files. ![](images/star_alignment_plot.svg) --- name: samtools-stats ## BAM QC | samtools `samtools stats file.bam` ``` SN raw total sequences: 522095280 SN filtered sequences: 0 SN sequences: 522095280 SN is sorted: 1 SN 1st fragments: 261047640 SN last fragments: 261047640 SN reads mapped: 514139025 SN reads mapped and paired: 510035006 SN reads unmapped: 7956255 SN reads properly paired: 460249078 SN reads paired: 522095280 SN reads duplicated: 60151694 SN reads MQ0: 54098384 SN reads QC failed: 0 SN non-primary alignments: 15023188 SN total length: 78437013272 SN bases mapped: 77238941462 SN bases mapped (cigar): 74139898333 SN bases trimmed: 0 SN bases duplicated: 9022025650 SN mismatches: 1695194781 SN error rate: 2.286481e-02 SN average length: 150 SN maximum length: 151 SN average quality: 37.6 ... ``` --- name: bamtools-stats ## BAM QC | bamtools `bamtools stats file.bam` ``` ********************************************** Stats for BAM file(s): ********************************************** Total reads: 537118468 Mapped reads: 529162213 (98.5187%) Forward strand: 270376825 (50.3384%) Reverse strand: 266741643 (49.6616%) Failed QC: 0 (0%) Duplicates: 61425418 (11.4361%) Paired-end reads: 537118468 (100%) 'Proper-pairs': 465991264 (86.7576%) Both pairs mapped: 524501668 (97.651%) Read 1: 268374707 Read 2: 268743761 Singletons: 4660545 (0.867694%) ``` --- name: alignment-qc-qorts-features ## Alignment QC | Features QoRTs was run on all samples and summarised using MultiQC. ![](images/qorts_alignments.svg) --- name: alignment-qc-1 ## Alignment QC .pull-left-50[ **Soft clipping** ![](images/clipping_good.png) ] .pull-right-50[ **Gene body coverage** ![](images/gene-body-coverage.png) ] --- name: alignment-qc-2 ## Alignment QC .pull-left-50[ **Insert size** ![](images/inner_distance.png) ] .pull-right-50[ **Saturation curve** ![](images/saturation.png) ] --- name: multiqc ## MultiQC [![](images/multiqc.png)](https://multiqc.info/examples/rna-seq/multiqc_report.html) --- name: pcr-duplicates ## Quantification | PCR duplicates - Ignore for RNA-Seq data - Computational deduplication (Don't!) - Use PCR-free library-prep kits - Use UMIs .size-70[![](images/pcr-duplicates.png)] .citation[ <span style="display:block;"><i class="fas fa-link"></i> Fu, Yu, *et al*. "Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers." [BMC genomics 19.1 (2018): 531](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4933-1)</span> <span style="display:block;"><i class="fas fa-link"></i> Parekh, Swati, *et al*. "The impact of amplification on differential expression analyses by RNA-seq." [Scientific reports 6 (2016): 25533](https://www.nature.com/articles/srep25533)</span> <span style="display:block;"><i class="fas fa-link"></i> Klepikova, Anna V., *et al*. "Effect of method of deduplication on estimation of differential gene expression using RNA-seq." [PeerJ 5 (2017): e3091](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/)</span> ] --- name: quantification-qc ## Quantification QC ``` ENSG00000000003 140 242 188 143 287 344 438 280 253 ENSG00000000005 0 0 0 0 0 0 0 0 0 ENSG00000000419 69 98 77 55 52 94 116 79 69 ENSG00000000457 56 75 104 79 157 205 183 178 153 ENSG00000000460 33 27 23 19 27 42 69 44 40 ENSG00000000938 7 38 13 17 35 76 53 37 24 ENSG00000000971 545 878 694 636 647 216 492 798 323 ENSG00000001036 79 154 74 80 128 167 220 147 72 ``` .pull-left-50[ - Pairwise correlation between samples must be high (>0.9) .size-60[![](images/correlation.png)] ] .pull-right-50[ - Count QC using RNASeqComp .size-80[![](images/rnaseqcomp.gif)] ] <i class="fas fa-toolbox"></i> [RNASeqComp](https://bioconductor.org/packages/release/bioc/html/rnaseqcomp.html) .citation[ <span style="display:block;"><i class="fas fa-link"></i> Teng, Mingxiang, *et al*. "A benchmark for RNA-seq quantification pipelines." [Genome biology 17.1 (2016): 74](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0940-1)</span> ] --- name: exploratory-heatmap ## Exploratory | Heatmap - Remove lowly expressed genes - Transform raw counts to VST, VOOM, RLOG, TPM etc - Sample-sample distance/correlation clustering heatmap <img src="rnaseq_qc_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto auto auto 0;" /> <i class="fas fa-toolbox"></i> [`pheatmap()`](https://github.com/raivokolde/pheatmap) --- name: exploratory-mds ## Exploratory | MDS
<i class="fas fa-toolbox"></i> [`cmdscale()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cmdscale.html), [plotly](https://plot.ly/r/) --- name: batch-correction ## Batch correction - Estimate variation explained by variables (.medium[PVCA]) .size-70[![](images/pvca.png)] - Find confounding effects as surrogate variables (.medium[SVA]) - Model known batches in the LM/GLM model - Correct known batches (.medium[ComBat])(Harsh!) - Interactively evaluate batch effects and correction (.medium[BatchQC]) <i class="fas fa-toolbox"></i> [SVA](http://bioconductor.org/packages/release/bioc/html/sva.html), [PVCA](https://bioconductor.org/packages/release/bioc/html/pvca.html), [BatchQC](http://bioconductor.org/packages/release/bioc/html/BatchQC.html) .citation[ <span style="display:block;"><i class="fas fa-link"></i> Liu, Qian, and Marianthi Markatou. "Evaluation of methods in removing batch effects on RNA-seq data." [Infectious Diseases and Translational Medicine 2.1 (2016): 3-9](http://www.tran-med.com/article/2016/2411-2917-2-1-3.html)</span> <span style="display:block;"><i class="fas fa-link"></i> Manimaran, Solaiappan, et al. "BatchQC: interactive software for evaluating sample and batch effects in genomic data." [Bioinformatics 32.24 (2016): 3836-3838](https://academic.oup.com/bioinformatics/article/32/24/3836/2525651)</span> ] --- name: spike-in ## Spike-In .pull-left-50[ * Add synthetic RNA into samples as control * Usually added before library prep * Useful for * Estimating sensitivity * Estimating accuracy * Detecting biases * Normalisation * Absolute quantification * Comparing datasets * [ERCC RNA Spike-In Mix](https://www.thermofisher.com/order/catalog/product/4456740)/[Exiqon Small RNA Spike-In](https://www.exiqon.com/mirna-NGS) ] .pull-right-50[ ![](images/ercc.png) ] --- name: summary class: spaced ## Summary - Sound experimental design to avoid confounding - Plan carefully about lib prep, sequencing etc based on experimental objective - Biological replicates may be more important than paired-end reads or long reads - Discard low quality bases, reads, genes and samples - Verify that tools and methods align with data assumptions - Experiment with multiple pipelines and tools - QC! QC everything at every step .large[<i class="fas fa-link"></i> Conesa, Ana, *et al.* "A survey of best practices for RNA-seq data analysis." [Genome biology 17.1 (2016): 13](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8)] --- name: end-slide class: end-slide # Thank you! Questions? <p style="text-align: left; font-size: small;"> Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 23-Oct-2018 at <i class='fa fa-clock-o' aria-hidden='true'></i> 18:50:26 </p> <hr> <b>2018</b> Roy Francis | [SciLifeLab](https://www.scilifelab.se/) | [NBIS](https://nbis.se/)