class: center, middle, inverse, title-slide # Differential Gene Expression ## Workshop on RNA-Seq ###
Roy Francis
| 18-Nov-2021 ### NBIS, SciLifeLab --- exclude: true count: false <link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet"> <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous"> <!-- ------------ Only edit title, subtitle & author above this ------------ --> --- name: dge ## Preparation - Create the DESeq2 object ```r library(DESeq2) mr$Group <- factor(mr$Group) d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group) d ``` ``` ## class: DESeqDataSet ## dim: 10573 6 ## metadata(1): version ## assays(1): counts ## rownames(10573): ENSMUSG00000098104 ENSMUSG00000033845 ... ## ENSMUSG00000063897 ENSMUSG00000095742 ## rowData names(0): ## colnames(6): DSSd00_1 DSSd00_2 ... DSSd07_2 DSSd07_3 ## colData names(7): SampleName SampleID ... Group Replicate ``` - Categorical variables must be factors - Building GLM models: `~var`, `~covar+var` ??? - The model `~var` asks DESeq2 to find DEGs between the levels under the variable *var*. - The model `~covar+var` asks DESeq2 to find DEGs between the levels under the variable *var* while controlling for the covariate *covar*. --- name: dge-sf ## Size factors - Normalisation factors are computed ```r d <- DESeq2::estimateSizeFactors(d,type="ratio") sizeFactors(d) ``` ``` ## DSSd00_1 DSSd00_2 DSSd00_3 DSSd07_1 DSSd07_2 DSSd07_3 ## 1.0136617 0.9570561 0.9965245 1.0354178 1.0780855 1.0017753 ``` --- name: dge-dispersion-1 ## Dispersion - Dispersion is a measure of variability in gene expression for a given mean - Dispersion is unreliable for low mean counts <img src="slide_dge_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto auto auto 0;" /> --- name: dge-dispersion-2 ## Dispersion - Genes with similar mean values must have similar dispersion - Estimate likely (ML) dispersion for each gene based on counts - Fit a curve through the gene-wise estimates - Shrink dispersion towards the curve ```r d <- DESeq2::estimateDispersions(d) ``` ![](data/deseq_dispersion.png) --- name: dge-test ## Testing - Log2 fold changes changes are computed after GLM fitting `FC = counts group B / counts group A` ```r dg <- nbinomWaldTest(d) resultsNames(dg) ``` ``` ## [1] "Intercept" "Group_day07_vs_day00" ``` -- - Use `results()` to customise/return results - Set coefficients using `contrast` or `name` - Filtering results by fold change using `lfcThreshold` - `cooksCutoff` removes outliers - `independentFiltering` removes low count genes - `pAdjustMethod` sets method for multiple testing correction - `alpha` set the significance threshold --- name: dge-test-2 ## Testing ```r res <- results(dg,name="Group_day07_vs_day00",alpha=0.05) summary(res) ``` ``` ## ## out of 10573 with nonzero total read count ## adjusted p-value < 0.05 ## LFC > 0 (up) : 193, 1.8% ## LFC < 0 (down) : 238, 2.3% ## outliers [1] : 1, 0.0095% ## low counts [2] : 4920, 47% ## (mean count < 21) ## [1] see 'cooksCutoff' argument of ?results ## [2] see 'independentFiltering' argument of ?results ``` - Alternative way to specify contrast ```r results(dg,contrast=c("Group","day07","day00"),alpha=0.05) ``` --- name: dge-test-3 ## Testing ```r head(res) ``` ``` ## log2 fold change (MLE): Group day07 vs day00 ## Wald test p-value: Group day07 vs day00 ## DataFrame with 6 rows and 6 columns ## baseMean log2FoldChange lfcSE stat pvalue ## <numeric> <numeric> <numeric> <numeric> <numeric> ## ENSMUSG00000098104 18.8505 0.205656 0.401543 0.512164 0.6085362 ## ENSMUSG00000033845 23.3333 0.653565 0.379627 1.721596 0.0851426 ## ENSMUSG00000025903 37.1016 0.672348 0.298923 2.249232 0.0244977 ## ENSMUSG00000033793 33.3673 0.144833 0.305139 0.474646 0.6350394 ## ENSMUSG00000025907 22.3875 0.821006 0.376414 2.181125 0.0291742 ## ENSMUSG00000051285 21.1485 0.452451 0.378725 1.194669 0.2322163 ## padj ## <numeric> ## ENSMUSG00000098104 NA ## ENSMUSG00000033845 0.377432 ## ENSMUSG00000025903 0.177491 ## ENSMUSG00000033793 0.886264 ## ENSMUSG00000025907 0.201741 ## ENSMUSG00000051285 NA ``` --- name: dge-test-4 ## Testing - Use `lfcShrink()` to correct fold changes for genes with high dispersion or low counts - Does not change number of DE genes ![](data/lfc_shrink.png) <!-- --------------------- Do not edit this and below --------------------- --> --- # Acknowledgements - RNA-seq analysis [Bioconductor vignette](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html) - [DGE Workshop](https://github.com/hbctraining/DGE_workshop/tree/master/lessons) by HBC training --- name: end_slide class: end-slide, middle count: false # Thank you. Questions? .end-text[ <p>R version 4.0.5 (2021-03-31)<br><p>Platform: x86_64-pc-linux-gnu (64-bit)</p><p>OS: Ubuntu 18.04.6 LTS</p><br> <span class="small">Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 18-Nov-2021 at <i class='fa fa-clock-o' aria-hidden='true'></i> 14:02:00</span> <b>2021</b> • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/) ]