class: center, middle, inverse, title-slide

# Mathematical Statistics in R
## RaukR 2019 • Advanced R for Bioinformatics
### Nikolay Oskolkov
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: Statistics

## What is Mathematical Statistics?

* Does Mathematical Statistics mean
  * a statistical test?
  * a probability distribution?
  * or maybe a p-value?
* We are not going to draw marbles or roll dice

.center[ <img src="Dice.jpg" style="width: 70%;" />]

.center[ <span style="color:red">**Classic statistics is not the only way to analyze your data**</span>]

---
name: Types of Analysis

## Different Types of Data Analysis

* The appropriate analysis depends on the amount of data we have
* What matters is the balance between the number of features and the number of observations
  + P is the number of features (genes, proteins, genetic variants etc.)
  + N is the number of observations (samples, cells, nucleotides etc.)

.center[ <img src="AmountOfData.png" style="width: 90%;" />]

---
name: Do we have Big Data in Life Sciences?

## Do we have Big Data in Life Sciences?

.center[ <img src="BigData.png" style="width: 100%;" />]

---
name: Precision Medicine: Why isn't it in the Clinics?

## Precision Medicine: Why isn't it in the Clinics?

.center[ <img src="PrecisionMedicine.png" style="width: 90%;" />]

---
name: The Curse of Dimensionality

## The Curse of Dimensionality

.pull-left-50[ <img src="DarkMagic.jpg" style="width: 90%;" />]
.pull-right-50[
`$$Y = \alpha + \beta X$$`
`$$\beta = \left(X^TX\right)^{-1}X^TY$$`
`$$\left(X^TX\right)^{-1} \sim \frac{1}{\rm{det}\left(X^TX\right)}\dots\,\rightarrow\,\infty\hbox{,}\,\,\,\,\,\,\,\,n\ll p$$`
<br/>
.center[ <span style="color:red">**The math blows up in high dimensions**</span>]
<br/>
.center[ <img src="ndim_ball.png" style="width: 110%;" />]
<br/>
`$$l_d = r\sqrt{d}\hbox{, where d is the dimensionality of the space}$$`
<br/>
.center[ <span style="color:red">**Euclidean distance fails: it becomes hard to define a meaningful metric**</span>]
]

---
name: Low Dimensional Space

## Low Dimensional Space

```r
set.seed(123)
n <- 20 # number of samples
p <- 2  # number of features / dimensions
Y <- rnorm(n)
X <- matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204, Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274
```

---
name: Going to Higher Dimensions

## Going to Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 10
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089
```
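Note how R-squared climbed from 0.02 to 0.66 on pure noise just by adding features. A minimal sketch of this effect (my addition, under the same simulation assumptions as above) tracks R-squared while growing p:

```r
# R-squared on pure-noise regression inflates as features are added,
# reaching a perfect fit at p = n - 1
set.seed(123)
n <- 20
r2 <- sapply(1:19, function(p) {
  Y <- rnorm(n); X <- matrix(rnorm(n * p), n, p)
  summary(lm(Y ~ X))$r.squared
})
plot(1:19, r2, type = "b", xlab = "number of features p", ylab = "R-squared")
```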
---
name: Even Higher Dimensions

## Even Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 20
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889         NA      NA       NA
## X1           0.66218         NA      NA       NA
## X2           0.76212         NA      NA       NA
## X3          -1.35033         NA      NA       NA
## X4          -0.57487         NA      NA       NA
## X5           0.02142         NA      NA       NA
## X6           0.40290         NA      NA       NA
## X7           0.03313         NA      NA       NA
## X8          -0.31983         NA      NA       NA
## X9          -0.92833         NA      NA       NA
## X10          0.18091         NA      NA       NA
## X11         -1.37618         NA      NA       NA
## X12          2.11438         NA      NA       NA
## X13         -1.75103         NA      NA       NA
## X14         -1.55073         NA      NA       NA
## X15          0.01112         NA      NA       NA
## X16         -0.50943         NA      NA       NA
## X17         -0.47576         NA      NA       NA
## X18          0.31793         NA      NA       NA
## X19          1.43615         NA      NA       NA
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1, Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA
```

---
name: Dimensionality Reduction

## Dimensionality Reduction

<img src="Presentation_GeneralStats_files/figure-html/PCA-1.svg" style="display: block; margin: auto;" />
<br/>
.center[ <span style="color:red">**Dimensionality Reduction is not just for visualization but for overcoming the Curse of Dimensionality**</span>]

---
name: Linear Dimensionality Reduction

## Linear Dimensionality Reduction

.center[ <img src="LinearDimReduct.png" style="width: 90%;" />]

---
name: Non-Linear Dimensionality Reduction

## Non-Linear Dimensionality Reduction

.center[ <img src="NonLinearDimReduct.png" style="width: 90%;" />]

---
name: Frequentist Statistics Failure

## Frequentist Statistics Failure

.center[ <img src="Anscombes_quartet.png" style="width: 90%;" />]
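The quartet above ships with base R, so the point is easy to verify yourself; a small sketch (my addition, not from the original deck) shows that all four panels share essentially the same means, correlations and regression slopes:

```r
# All four Anscombe panels have near-identical summary statistics,
# yet look completely different when plotted
data(anscombe)
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_y = mean(y), cor_xy = cor(x, y), slope = coef(lm(y ~ x))[2])
})
```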
---
name: Frequentist Statistics Failure (Cont.)

## Frequentist Statistics Failure (Cont.)

.center[ <img src="DataSaurus.gif.png" style="width: 90%;" />]
.center[ <img src="BoxViolinSmaller.gif.png" style="width: 90%;" />]

---
name: P-value is not good for ranking features

## P-value is not good for ranking features

.pull-left-50[ <img src="Pvalue.png" style="width: 100%;" />]
.pull-right-50[
```r
FC <- 1.02
x_mean <- 5; x_sd <- 1
N_vector <- seq(from=100,to=10000,by=100)
x1 <- rnorm(N_vector, x_mean, x_sd)
x2 <- rnorm(N_vector, x_mean*FC, x_sd)
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

---
name: Maximum Likelihood Principle

## Maximum Likelihood Principle

* We maximize the probability of observing the data `\(X_i\)`

`$$\rm{L}\,(\,\rm{X_i} \,|\, \mu,\sigma\,) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\\ \mu = \frac{1}{N}\sum_{i=1}^N \rm{X_i}\\ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (\rm{X_i}-\mu)^2$$`

--

* Maximum Likelihood relies on many assumptions:
  * Large sample size
  * Gaussian distribution
  * Homoscedasticity
  * Uncorrelated errors
  * Convergence of covariance
* These assumptions are rarely fulfilled in the real world

---
name: Statistical Test

## Two-Groups Statistical Test

```r
set.seed(12)
X<-c(rnorm(20,mean=5,sd=2),12,15,14,16)
Y<-c(rnorm(24,mean=7,sd=2))
boxplot(X,Y,ylab="DIFFERENCE",names=c("X","Y"))
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---
name: Parametric Statistical Test

## Parametric Statistical Test Fails

```r
t.test(X,Y)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -1.084, df = 31.678, p-value = 0.2865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.8819545  0.8804344
## sample estimates:
## mean of x mean of y 
##  5.989652  6.990412
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---
name: Resampling

## Resampling

```r
observed <- median(Y)-median(X)
print(paste0("Observed difference = ",observed))
res <- vector(length=1000)
for(i in 1:1000){
  # reshuffle the pooled values and recompute the median difference
  Z <- c(X,Y)
  Y_res <- sample(Z,length(Y),FALSE)
  X_res <- sample(Z,length(X),FALSE)
  res[i] <- median(Y_res)-median(X_res)
}
hist(abs(res), breaks=100, main="Resampled", xlab="Difference")
print(paste0("p-value = ", sum(abs(res) >= abs(observed))/1000))
```

```
## [1] "Observed difference = 2.33059085402647"
## [1] "p-value = 0.001"
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---
name: ML Cannot Handle Non-Independence

## ML Cannot Handle Non-Independence

```
##            n1         n2          n3           n4         n5
## p1 -0.6760258 -1.2307634  1.66039982  0.196033326 -0.2981471
## p2 -1.5834993  0.6494188 -0.01267663 -1.064763128 -0.1792141
## p3  0.3152418 -0.5791937 -1.79593465 -0.312303710  0.2671534
## p4 -0.9359010  0.1212546 -0.36279328 -0.553364109  1.0598898
## p5 -2.0411903  0.6899356 -1.03923098  0.008958754 -0.2249498
```

* Two types of non-independence in data
  * between samples
  * between features

.center[
.pull-left-50[
### Random Effects
<img src="Random_Effects.jpg" style="width: 60%;" />]
.pull-right-50[
### Lasso
<img src="lasso.jpg" style="width: 40%;" />]]

---
name: Linear Model with Non-Independence

## Linear Model with Non-Independence

```r
library("lme4")
library("ggplot2")
ggplot(sleepstudy,aes(x=Days,y=Reaction)) + geom_point() + geom_smooth(method="lm")
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />
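Before modelling the grouping explicitly, it is worth checking how much the subjects actually differ; a quick sketch (my addition, using lme4's `lmList`, which fits a separate regression per group):

```r
# One independent OLS fit per subject: intercepts and slopes vary widely,
# which the single pooled regression above completely ignores
fits <- lmList(Reaction ~ Days | Subject, data = sleepstudy)
head(coef(fits))
```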
---
name: Fit Linear Model for Each Individual

## Fit Linear Model for Each Individual

```r
ggplot(sleepstudy, aes(x = Days, y = Reaction)) + 
  geom_smooth(method = "lm", level = 0.95) + 
  geom_point() + 
  facet_wrap( ~ Subject, nrow = 3, ncol = 6)
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---
name: The Case of the Missing Heritability

## Random Effects and Missing Heritability

.center[ <img src="MissingHeritability.png" style="width: 80%;" />]

---
name: Random Effects Modelling

## Random Effects Modelling

* Allow individual-level slopes and intercepts
* This is nothing other than putting Bayesian priors on the coefficients

`$$\rm{Reaction} = \alpha_i + \beta_i \rm{Days}$$`

```r
lmerfit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
```

.pull-left-50[ <img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto auto auto 0;" /> ]
.pull-right-50[ <img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto auto auto 0;" /> ]

.center[ <span style="color:red">**Shrinkage: introduce a shared variance parameter**</span>]

---
name: Linear Mixed Models (LMM)

## Linear Mixed Models (LMM)

```r
summary(lmer(Reaction ~ Days + (Days | Subject), sleepstudy))
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4633  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 611.90   24.737       
##           Days         35.08    5.923   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.824  36.843
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
```

---
name: LMM Average Fit

## LMM Average Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---
name: LMM Individual Fit

## LMM Individual Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

---
name: What is Bayesian Statistics for You?

## What is Bayesian Statistics for You?

.center[ <img src="BayesianStatistics.png" style="width: 90%;" />]

.small[
* **Handling Missing Data**
* **Handling Non-Gaussian Data**
* **Cause or Consequence**
* **Lack of Statistical Power**
* **Overfitting and Correction for Multiple Testing (FDR)**
* **Testing for Significance and P-Value**
]
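As a concrete warm-up before comparing frequentist and Bayesian linear models, here is a tiny base-R sketch (my addition, a textbook beta-binomial example, not from the original deck) of how a prior and data combine into a posterior:

```r
# Conjugate Bayesian updating: Beta(2, 2) prior, 7 successes in 10 trials,
# posterior is Beta(2 + 7, 2 + 3); the prior pulls the estimate toward 0.5
a0 <- 2; b0 <- 2; k <- 7; n <- 10
theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, a0 + k, b0 + n - k), type = "l", ylab = "density")
lines(theta, dbeta(theta, a0, b0), lty = 2)  # dashed curve: the prior
```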
---
name: Frequentist vs. Bayesian Fitting

## Frequentist vs. Bayesian Linear Model

.center[
.pull-left-50[
### Maximum Likelihood
`$$y = \alpha+\beta x$$`
<br/>
`$$L(y) \sim e^{-\frac{(y-\alpha-\beta x)^2}{2\sigma^2}}$$`
<br/>
`$$\max_{\alpha,\beta,\sigma}L(y) \Longrightarrow \hat\alpha, \hat\beta, \hat\sigma$$`
]]

--

.center[
.pull-right-50[
### Bayesian Linear Fitting
`$$y \sim \it N(\mu,\sigma) \quad\textrm{- Likelihood L(y)}$$`
<br/>
`$$\mu = \alpha + \beta x$$`
<br/>
`$$\alpha \sim \it N(\mu_\alpha,\sigma_\alpha) \quad\textrm{- Prior on} \quad\alpha \\ \beta \sim \it N(\mu_\beta,\sigma_\beta) \quad\textrm{- Prior on} \quad\beta$$`
<br/>
`$$P(\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma) \propto L(y)N(\mu_\alpha,\sigma_\alpha)N(\mu_\beta,\sigma_\beta)$$`
<br/>
`$$\max_{\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma}P(\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma) \Longrightarrow \hat\mu_\alpha,\hat\sigma_\alpha,\hat\mu_\beta,\hat\sigma_\beta,\hat\sigma$$`
]]

---
name: Bayesian Linear Model

## Bayesian Linear Model

```r
library("brms")
options(mc.cores = parallel::detectCores())
brmfit <- brm(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(brmfit)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days + (Days | Subject) 
##    Data: sleepstudy (Number of observations: 180) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~Subject (Number of levels: 18) 
##                     Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)          26.81      6.91    15.47    42.69       1749 1.00
## sd(Days)                6.55      1.53     4.09    10.12       1413 1.00
## cor(Intercept,Days)     0.10      0.30    -0.49     0.67        999 1.00
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   251.15      7.36   236.19   265.63       1561 1.00
## Days         10.47      1.67     7.13    13.67       1429 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma    25.94      1.59    23.08    29.34       2808 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
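A short follow-up sketch (my addition; `fixef` and `VarCorr` are standard brms accessors) for pulling the pieces out of the `brmfit` object above:

```r
# Population-level effects with 95% credible intervals
fixef(brmfit)
# Group-level (per-subject) standard deviations and correlations
VarCorr(brmfit)
```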
---
name: Bayesian Population-Level Fit

## Bayesian Population-Level Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto auto auto 0;" />

---
name: Bayesian Group-Level Fit

## Bayesian Group-Level Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto auto auto 0;" />

---
name: Feature Non-Independence: LASSO

## Feature Non-Independence: LASSO

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon \\ \textrm{OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 \\ \textrm{Penalized OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

.center[ <img src="CV_lambda.png" style="width: 70%;" />]

---
name: Causality Inference: get rid of confounders

## Causality Inference: get rid of confounders

.pull-left-50[
`$$Y \sim X + C\hbox{, where C is a confounder}$$`
<img src="IV.png" style="width: 70%;" />]
.pull-right-50[
.center[ <span style="color:red">**Mendelian Randomization**</span>
<img src="InsEndCancer.png" style="width: 70%;" />
]]

.center[ <span style="color:red">**Method of Instrumental Variables**</span>]

```r
N <- 1e3 # Number of Observations
C <- rnorm(n = N); X <- 2 * C + rnorm(n = N)
Y <- 2 * X + 2 * C + rnorm(n = N)
summary(lm(Y ~ X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4905 -0.8991  0.0072  0.9051  4.8766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.04544    0.04245    1.07    0.285    
## X            2.79987    0.01876  149.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.339 on 998 degrees of freedom
## Multiple R-squared:  0.9571, Adjusted R-squared:  0.9571 
## F-statistic: 2.228e+04 on 1 and 998 DF,  p-value: < 2.2e-16
```

---
name: When confounders are known: Controlled Regression

## Known confounders: Controlled Regression

```r
summary(lm(Y ~ X + C))
```

```
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1141 -0.6955 -0.0089  0.7270  3.8172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.06693    0.03250    2.06   0.0397 *  
## X            2.00552    0.03314   60.52   <2e-16 ***
## C            1.97305    0.07419   26.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 997 degrees of freedom
## Multiple R-squared:  0.9749, Adjusted R-squared:  0.9749 
## F-statistic: 1.938e+04 on 2 and 997 DF,  p-value: < 2.2e-16
```

<br/>
.center[ <span style="color:red">**What if confounders are not known?**</span>]

---
name: Good Instrumental Variable

## Good Instrumental Variable

```r
set.seed(1)
library("AER")
N <- 1e3 # Number of Observations
weak <- rnorm(n = N)
good <- rnorm(n = N)
C <- rnorm(n = N)
X <- 0.000001 * weak + 2 * good + 2 * C + rnorm(n = N)
Y <- 2 * X + 2 * C + rnorm(n = N)
df <- cbind(Y, X, C, good, weak)
colnames(df) <- c('Y', 'X','C', 'good', 'weak')
df <- data.frame(df)
summary(ivreg(Y ~ X | good, data = df))
```

```
## 
## Call:
## ivreg(formula = Y ~ X | good, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7.78427 -1.48164  0.05357  1.37862  6.99009 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01017    0.06867   0.148    0.882    
## X            2.02609    0.03207  63.183   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.172 on 998 degrees of freedom
## Multiple R-Squared: 0.9237, Adjusted R-squared: 0.9236 
## Wald test:  3992 on 1 and 998 DF,  p-value: < 2.2e-16
```
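A standard relevance check (my addition, not from the original deck) is the first-stage regression: a usable instrument must actually predict X:

```r
# First-stage strength: 'good' explains much of X, 'weak' explains ~nothing
summary(lm(X ~ good, data = df))$r.squared
summary(lm(X ~ weak, data = df))$r.squared
```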
---
name: Weak Instrumental Variable

## Weak Instrumental Variable

```r
summary(ivreg(Y ~ X | weak, data = df))
```

```
## 
## Call:
## ivreg(formula = Y ~ X | weak, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.319 -1.157  0.027  1.218  5.711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.003767   0.056301   0.067    0.947    
## X           2.458141   0.401503   6.122 1.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.77 on 998 degrees of freedom
## Multiple R-Squared: 0.9493, Adjusted R-squared: 0.9492 
## Wald test: 37.48 on 1 and 998 DF,  p-value: 1.324e-09
```

---
name: What does ivreg do under the hood?

## What does ivreg do under the hood?

```r
summary(lm(Y~as.numeric(predict(lm(X~good)))))
```

```
## 
## Call:
## lm(formula = Y ~ as.numeric(predict(lm(X ~ good))))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8345  -4.2676  -0.0506   4.3873  20.1810 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        0.01017    0.20723   0.049    0.961    
## as.numeric(predict(lm(X ~ good)))  2.02609    0.09677  20.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.553 on 998 degrees of freedom
## Multiple R-squared:  0.3052, Adjusted R-squared:  0.3045 
## F-statistic: 438.4 on 1 and 998 DF,  p-value: < 2.2e-16
```

---
name: Lasso helps to build good Instrumental Variables

## Lasso helps to build good Instrumental Variables

.pull-left-50[ <img src="CausalXonY.png" style="width: 100%;" />]
.pull-right-50[ <img src="LassoIV.png" style="width: 100%;" />]

---
name: Simplex Space: a few words about normalization

## Simplex Space: a few words about normalization

We draw two vectors (genes) from normal distributions with different means and standard deviations

```r
set.seed(123)
gene1<-rnorm(100,mean=10,sd=1)
gene2<-rnorm(100,mean=100,sd=5)
cor.test(gene1,gene2,method="spearman")
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  gene1 and gene2
## S = 166000, p-value = 0.9693
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.00390039
```

---
name: Library Size Normalization

## Library Size Normalization

We create an expression data frame with 100 samples and 2 genes

```r
expr<-as.data.frame(t(data.frame(gene1=gene1,gene2=gene2)))
colnames(expr)<-paste0("sample",1:100)
expr[1:2,1:4]
```

```
##         sample1    sample2  sample3  sample4
## gene1  9.439524   9.769823 11.55871 10.07051
## gene2 96.447967 101.284419 98.76654 98.26229
```

We calculate the library size normalized expression data frame

```r
expr_n<-apply(expr,2,function(x) x/sum(x))
expr_n[1:2,1:4]
```

```
##          sample1    sample2   sample3  sample4
## gene1 0.08914674 0.08797343 0.1047694 0.092959
## gene2 0.91085326 0.91202657 0.8952306 0.907041
```

---
name: Spurious Correlation After Normalization

## Spurious Correlation After Normalization

.pull-left-50[
.center[ <span style="color:red">**Before Library Size Normalization**</span>]
```r
expr1 <- unlist(expr[1,]); expr2 <- unlist(expr[2,])
plot(expr1,expr2,xlab="gene1",ylab="gene2")
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-34-1.svg" style="display: block; margin: auto;" />
]
.pull-right-50[
.center[ <span style="color:red">**After Library Size Normalization**</span>]
```r
expr_n1 <- expr_n[1,]; expr_n2 <- expr_n[2,]
plot(expr_n1,expr_n2,xlab="gene1",ylab="gene2")
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-35-1.svg" style="display: block; margin: auto;" />
]

```r
cor.test(expr_n1,expr_n2,method="spearman")
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  expr_n1 and expr_n2
## S = 333300, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##  -1
```
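The perfect negative correlation is a pure artifact of the simplex constraint, which is easy to verify (my addition, not from the original deck):

```r
# After normalization every sample's two values sum to one, so gene2 is
# deterministically 1 - gene1: the rank correlation must be exactly -1
all.equal(unname(colSums(expr_n)), rep(1, ncol(expr_n)))
range(expr_n[2,] - (1 - expr_n[1,]))  # zero up to floating-point error
```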
---
name: OMICs Integration

## OMICs Integration

.center[ <img src="OMICsIntegr.png" style="width: 100%;" />]

---
name: Single Cell OMICs Integration

## Single Cell OMICs Integration

.center[ <img src="SingleCellMultiOmics.png" style="width: 100%;" />]

---
name: Single Cell OMICs Integration (Cont.)

## Single Cell OMICs Integration (Cont.)

.center[ <img src="scRNAseqIntegr.png" style="width: 100%;" />]

---
name: scNMT OMICs Integration

## scNMT OMICs Integration

.center[ <img src="scNMT.png" style="width: 100%;" />]

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

<p>R version 3.6.0 (2019-04-26)</p>
<p>Platform: x86_64-pc-linux-gnu (64-bit)</p>
<p>OS: Ubuntu 14.04.6 LTS</p>
<br>
Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 16-jun-2019 at <i class='fa fa-clock-o' aria-hidden='true'></i> 09:14:31

__2019__ • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/)