class: center, middle, inverse, title-slide

# Statistics / Machine Learning in R
## RaukR 2021 • R for Bioinformatics
### Nikolay Oskolkov
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: Types of Analysis

## Different Types of Data Analysis

* The choice of analysis depends on the amount of data we have
* More precisely, on the balance between the numbers of features and observations
  + P is the number of features (genes, proteins, genetic variants etc.)
  + N is the number of observations (samples, cells, nucleotides etc.)

<br/>
.center[
<img src="images/AmountOfData.png" style="width: 90%;" />]

---
name: The Curse of Dimensionality

## The Curse of Dimensionality

.pull-left-50[
<img src="images/DarkMagic.jpg" style="width: 90%;" />]
.pull-right-50[
.center[
<span style="color:black">**This is what data looks like in high dimensions:**</span>]
.center[
<img src="images/ndim_ball.png" style="width: 110%;" />]

`$$Y = \alpha + \beta X$$`
`$$\beta = \left(X^TX\right)^{-1}X^TY$$`
`$$\left(X^TX\right)^{-1} \sim \frac{1}{\rm{det}\left(X^TX\right)}\dots\,\rightarrow\,\infty\hbox{,}\,\,\,\,\,\,\,\,n\ll p$$`

.center[
<span style="color:red">**Inverted covariance matrix diverges in high dimensions**</span>]

<br/>
`$$E[\hat{\sigma}^2]=\frac{n-p}{n}\sigma^2$$`

.center[
<span style="color:red">**Maximum Likelihood variance estimator is severely biased in high dimensions**</span>]
]
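
---
name: Curse of Dimensionality in Numbers

## The Curse of Dimensionality in Numbers

A quick base-R sketch of the statement above (the values of n and p are arbitrary choices for illustration): as the number of features p approaches the number of samples n, the smallest eigenvalue of `t(X) %*% X` shrinks towards zero and its condition number explodes, so inverting it with `solve()` becomes numerically unstable.

```r
set.seed(1)
n <- 50                          # fixed number of samples
for(p in c(5, 25, 45, 49))       # growing number of features
{
  X <- matrix(rnorm(n * p), n, p)
  eig <- eigen(t(X) %*% X, symmetric = TRUE, only.values = TRUE)$values
  cat("p =", p,
      " smallest eigenvalue =", signif(min(eig), 3),
      " condition number =", signif(max(eig) / min(eig), 3), "\n")
}
# once p exceeds n, t(X) %*% X is rank-deficient and solve() fails outright,
# which is why lm() reports NA / NaN estimates when p reaches n (see the
# "Even Higher Dimensions" slide later on)
```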

---
name: Equidistant Points in High Dimensions

## Equidistant Points in High Dimensions

.panelset[
.panel[.panel-name[R Code]

```r
n <- 1000; pair_dist <- list(); p <- c(2, 32, 512)
for(i in 1:length(p))
{
  X <- matrix(rnorm(n * p[i]), n, p[i])
  pair_dist[[i]] <- as.vector(dist(X))
  pair_dist[[i]] <- pair_dist[[i]] / max(pair_dist[[i]])
}
```
]
.panel[.panel-name[Plot]

<img src="presentation_files/figure-html/Hist_Plot-1.png" style="display: block; margin: auto;" />
]]

---
name: Neighboring Points in High Dimensions

## Neighboring Points in High Dimensions

.panelset[
.panel[.panel-name[R Code]

```r
n <- 1000
first_dist <- vector()
mid_dist <- vector()
last_dist <- vector()
num_dim <- 2 ^ seq(1, 10, by=1)
for(p in num_dim)
{
  X <- matrix(rnorm(n*p), n, p)
  dist_X <- sort(as.vector(dist(X)), decreasing=FALSE)
  first_dist <- append(first_dist, dist_X[1])
  mid_dist <- append(mid_dist, dist_X[round(length(dist_X)/2, 0)])
  last_dist <- append(last_dist, dist_X[length(dist_X)])
}
```
]
.panel[.panel-name[Plot]

<img src="presentation_files/figure-html/Neighbors Plot-1.png" style="display: block; margin: auto;" />
]]

---
name: Low Dimensional Space

## Low Dimensional Space

```r
set.seed(123)
n <- 20 # number of samples
p <- 2  # number of features / dimensions
Y <- rnorm(n)
X <- matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204,  Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274
```

---
name: Going to Higher Dimensions

## Going to Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 10
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592,  Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089
```

---
name: Even Higher Dimensions

## Even Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 20
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889        NaN     NaN      NaN
## X1           0.66218        NaN     NaN      NaN
## X2           0.76212        NaN     NaN      NaN
## X3          -1.35033        NaN     NaN      NaN
## X4          -0.57487        NaN     NaN      NaN
## X5           0.02142        NaN     NaN      NaN
## X6           0.40290        NaN     NaN      NaN
## X7           0.03313        NaN     NaN      NaN
## X8          -0.31983        NaN     NaN      NaN
## X9          -0.92833        NaN     NaN      NaN
## X10          0.18091        NaN     NaN      NaN
## X11         -1.37618        NaN     NaN      NaN
## X12          2.11438        NaN     NaN      NaN
## X13         -1.75103        NaN     NaN      NaN
## X14         -1.55073        NaN     NaN      NaN
## X15          0.01112        NaN     NaN      NaN
## X16         -0.50943        NaN     NaN      NaN
## X17         -0.47576        NaN     NaN      NaN
## X18          0.31793        NaN     NaN      NaN
## X19          1.43615        NaN     NaN      NaN
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA
```

---
name: High dimensional data in Life Sciences

## High Dimensional Data in Life Sciences

Typical numbers of features (P) per data type:

* Proteomics, metabolomics: 50 - 100
* Metagenomics: 100 - 500
* Transcriptomics: 20 000
* Microscopy imaging: 1 000 000
* Genomics, epigenomics: 30 000 000

.center[
<img src="images/DataLifeSciences.png" style="width: 70%;" />]

---
name: Ways to Overcome The Curse of Dimensionality

## Ways to Overcome The Curse of Dimensionality

* Increasing the sample size - often not affordable
* Dimensionality reduction - PCA, UMAP etc.
* Regularizations - LASSO, Ridge, Elastic Net, Dropout etc.
* Bayesian statistics - the prior plays the role of a regularizer

<br/>
.center[
<img src="images/Overcome_CD.png" style="width: 100%;" />]

---
name: Dimensionality Reduction

## Dimensionality Reduction

.center[
<img src="images/PCA_tSNE_MNIST.png" style="width: 90%;" />]

---
name: Dimension Reduction: Matrix Factorization

## Dimension Reduction: Matrix Factorization

.center[
<img src="images/MatrixFactor.png" style="width: 100%;" />]

---
name: Dimension Reduction: Neighborhood Graph

## Dimension Reduction: Neighborhood Graph

.center[
<img src="images/tSNE_Scheme.png" style="width: 68%;" />]

---
name: Regularizations: LASSO

## Regularizations: LASSO

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon \\ \textrm{OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 \\ \textrm{Penalized OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

.center[
<img src="images/CV_lambda.png" style="width: 70%;" />]

---
name: K-Fold Cross-Validation

## Regularizations: K-Fold Cross-Validation

.center[
<img src="images/Kfold_CrossVal.png" style="width: 100%;" />]
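
---
name: LASSO with glmnet

## LASSO with Cross-Validation in R

A minimal sketch of how the two previous slides are typically put into practice, assuming the `glmnet` package (not used elsewhere in this deck) is installed: `cv.glmnet()` runs K-fold cross-validation over a grid of lambda values and reports the penalty with the lowest cross-validated error.

```r
library(glmnet)

# simulated data: only the first 5 of 50 features are informative
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- rowSums(X[, 1:5]) + rnorm(n)

cv_fit <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)  # alpha = 1 corresponds to LASSO
plot(cv_fit)                                       # CV error as a function of log(lambda)
coef(cv_fit, s = "lambda.min")                     # sparse coefficients at the optimal lambda
```

Setting `alpha` between 0 and 1 would give Elastic Net, and `alpha = 0` Ridge regression.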

---
name: Priors in Bayesian Statistics - Regularizations

## Priors in Bayesian Statistics - Regularizations

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon$$`

* **Maximum Likelihood** principle: maximize the probability of observing the data given the parameters:

`$$Y \sim N(\,\beta_1X_1+\beta_2X_2, \sigma^2\,) \\ \rm{L}\,(\,\rm{Y} \,|\, \beta_1,\beta_2\,) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(Y-\beta_1X_1-\beta_2X_2)^2}{2\sigma^2}}$$`

--

* **Bayes theorem**: maximize the posterior probability of the parameters given the data:

`$$P(\rm{parameters} \,|\, \rm{data})=\frac{L(\rm{data} \,|\, \rm{parameters})*P(\rm{parameters})}{\int{L(\rm{data} \,|\, \rm{parameters})*P(\rm{parameters}) \, d(\rm{parameters})}}$$`

--

`$$P(\, \beta_1,\beta_2 \,|\, \rm{Y}\,) \sim \rm{L}\,(\,\rm{Y} \,|\, \beta_1,\beta_2\,)*P(\beta_1,\beta_2) \sim e^{-\frac{(Y-\beta_1X_1-\beta_2X_2)^2}{2\sigma^2}}*e^{-\lambda(|\beta_1|+|\beta_2|)}$$`
`$$-\log{P(\, \beta_1,\beta_2 \,|\, \rm{Y}\,)} \sim (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

--

<br/>
* LASSO is thus equivalent to maximizing a Bayesian posterior with a Laplace prior on the coefficients
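
---
name: Laplace Prior in Practice

## Laplace Prior in Practice

A minimal base-R sketch of the derivation on the previous slide: minimizing the negative log-posterior, i.e. the squared error plus the Laplace-prior penalty, with `optim()` shrinks the coefficient of a pure-noise feature towards zero. The simulated data and the value of lambda are arbitrary choices for illustration.

```r
set.seed(1)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)                  # X2 has no effect on Y
Y  <- X1 + rnorm(n)
lambda <- 10                    # strength of the Laplace prior

# negative log-posterior ~ penalized sum of squares (see previous slide)
neg_log_posterior <- function(beta)
{
  sum((Y - beta[1] * X1 - beta[2] * X2)^2) + lambda * (abs(beta[1]) + abs(beta[2]))
}

optim(par = c(0, 0), fn = neg_log_posterior)$par   # beta2 is shrunk towards zero
```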

---
name: Moving from Statistics to Machine Learning

## Moving from Statistics to Machine Learning

A few peculiarities of data analysis that are typical of Machine Learning:

* A strong focus on **prediction** rather than on biomarker discovery
* Heavy use of **cross-validation** and regularization
* More **algorithmic**, fewer "pen-and-paper" derivations
* A community that speaks more **Python** than R

.center[
<img src="images/MachineLearning.jpg" style="width: 70%;" />]

---
name: Statistics vs. Machine Learning

## Statistics vs. Machine Learning

.center[
<img src="images/Stats_vs_ML.png" style="width: 100%;" />]

---
name: How does Machine Learning work?

## How does Machine Learning work?

Machine Learning typically involves five basic steps:

1. Split the data set into **train**, **validation** and **test** subsets.
2. Fit the model on the train subset.
3. Validate your model on the validation subset.
4. Repeat steps 1-3 a number of times and tune **hyperparameters**.
5. Test the accuracy of the optimized model on the test subset.

.center[
<img src="images/TrainTestSplit.png" style="width: 60%;" />]

---
name: Toy Example of Machine Learning

## Toy Example of Machine Learning

```r
set.seed(12345)
N<-100
x<-rnorm(N)
y<-2*x+rnorm(N)
df<-data.frame(x,y)
plot(y~x,data=df, col="blue")
```

<img src="presentation_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Train and Test Subsets

## Toy Example: Train and Test Subsets

We randomly assign 70% of the data to the training subset and 30% to the test subset:

```r
set.seed(123)
train<-df[sample(1:dim(df)[1],0.7*dim(df)[1]),]
test<-df[!rownames(df)%in%rownames(train),]
```

<img src="presentation_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Validation of Model

## Toy Example: Validation of Model

```r
test_predicted<-as.numeric(predict(lm(y~x,data=train),newdata=test))
plot(test$y~test_predicted,ylab="True y",xlab="Pred y",col="red")
abline(lm(test$y~test_predicted),col="darkgreen")
```

<img src="presentation_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Validation of Model (Cont.)

## Toy Example: Validation of Model (Cont.)

```r
summary(lm(test$y~test_predicted))
```

```
## 
## Call:
## lm(formula = test$y ~ test_predicted)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80597 -0.78005  0.07636  0.52330  2.61924 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.02058    0.21588   0.095    0.925    
## test_predicted  0.89953    0.08678  10.366 4.33e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.053 on 28 degrees of freedom
## Multiple R-squared:  0.7933,  Adjusted R-squared:  0.7859 
## F-statistic: 107.4 on 1 and 28 DF,  p-value: 4.329e-11
```

Thus the model explains about 79% of the variation in the test subset.

---
name: What is a Hyperparameter?

## What is a Hyperparameter?

* Hyperparameters are Machine Learning design parameters that are set before the learning process starts
* For the toy model above, a hyperparameter could be e.g. the number of covariates that the main variable of interest x is adjusted for

```r
set.seed(1)
for(i in 1:10)
{
  df[,paste0("PC",i)]<-1*(1-i/10)*y+rnorm(N)
}
head(df)
```

```
##            x          y color        PC1        PC2        PC3
## 1  0.5855288  1.3949830   red  0.6290309  0.4956198  1.3858900
## 2  0.7094660  0.2627087   red  0.4200812  0.2522828  1.8727694
## 3 -0.1093033  0.2038119   red -0.6521979 -0.7478721  1.7292568
## 4 -0.4534972 -2.2317496  blue -0.4132938 -1.6273709 -1.8931325
## 5  0.6058875  1.3528592  blue  1.5470811  0.4277027 -1.3382341
## 6 -1.8179560 -4.1719599  blue -4.5752323 -1.5702807 -0.4227104
##          PC4        PC5         PC6         PC7        PC8
## 1  1.7306635  1.7719325  0.63529634  0.07742793 -0.4285716
## 2 -0.8896729  2.0270091 -0.19178516  1.58123715  2.0241138
## 3  2.0936245 -0.5010914 -1.10171748  0.58945128 -0.0492363
## 4 -1.7226819 -1.5067426 -0.88140715 -0.12733353 -0.4603672
## 5  2.4658608  0.2602076  1.53274473  0.26918441 -0.8528851
## 6 -0.9909633 -2.4616374 -0.07481652 -2.38832183 -2.1785221
##          PC9       PC10
## 1 -0.9474105 -1.5414026
## 2 -1.7998121  0.1943211
## 3  1.0156630  0.2644225
## 4 -0.2350367 -1.1187352
## 5 -0.4643425  0.6509530
## 6 -0.5951440 -1.0329002
```

---
name: How does Cross-Validation work?

## How does Cross-Validation work?

* We should not include all of the PCs in the model - this would be prone to overfitting
* Cross-Validation is a way to combat overfitting

```r
train<-df[sample(1:dim(df)[1],0.6*dim(df)[1]),]
val_test<-df[!rownames(df)%in%rownames(train),]
validate<-val_test[sample(1:dim(val_test)[1],0.25*dim(val_test)[1]),]
test<-val_test[!rownames(val_test)%in%rownames(validate),]
```

<img src="presentation_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
name: How does Cross-Validation work? (Cont.)

## How does Cross-Validation work? (Cont.)

* Fit a linear regression model on the training set and assess the error on the validation set
* The error is the root mean squared difference between the y predicted by the trained model on the validation set and the true y of the validation set
* There is no dramatic decrease in RMSE after PC2

<img src="presentation_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
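
---
name: Computing the Validation RMSE

## Computing the Validation RMSE

The curve on the previous slide can be produced with a loop of roughly the following form (a minimal sketch, not necessarily the exact code behind the figure): for each number of included PCs, fit the model on the training subset and compute the RMSE of its predictions on the validation subset.

```r
rmse <- vector()
for(i in 1:10)
{
  # model with the main variable x plus the first i PCs as covariates
  formula_i <- as.formula(paste0("y ~ x + ", paste0("PC", 1:i, collapse = " + ")))
  fit_i <- lm(formula_i, data = train)
  pred_i <- predict(fit_i, newdata = validate)
  rmse <- append(rmse, sqrt(mean((validate$y - pred_i)^2)))
}
plot(1:10, rmse, type = "o", xlab = "Number of PCs included", ylab = "Validation RMSE")
```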

---
name: Ultimate Model Evaluation

## Ultimate Model Evaluation

* Thus the optimal model is `y ~ x + PC1 + PC2`
* We perform the final evaluation of the optimized / trained model on the test data set and report the final evaluation metric, adjusted R squared in our case
* The model explains over 90% of the variation in the unseen test data set

```r
summary(lm(predict(lm(y~x+PC1+PC2,data=train),newdata=test)~test$y))
```

```
## 
## Call:
## lm(formula = predict(lm(y ~ x + PC1 + PC2, data = train), newdata = test) ~ 
##     test$y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02409 -0.26361 -0.00061  0.19345  0.95801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22155    0.09485   2.336   0.0269 *  
## test$y       0.93912    0.03994  23.514   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5002 on 28 degrees of freedom
## Multiple R-squared:  0.9518,  Adjusted R-squared:  0.9501 
## F-statistic: 552.9 on 1 and 28 DF,  p-value: < 2.2e-16
```

---
name: From Linear Models to ANNs

## From Linear Models to ANNs

* ANN: a mathematical algorithm / function Y = f(X) with a special architecture
* Highly non-linear due to the activation functions
* Backpropagation is used to minimize the error

.center[
<img src="images/ANN.png" style="width: 90%;" />]

---
name: Universal Approximation Theorem

## Universal Approximation Theorem

.center[
<img src="images/UAT.png" style="width: 100%;" />]

---
name: Single Cells make Big Data

## Single Cells make Big Data

.center[
<img src="images/SingleCellBigData.png" style="width: 90%;" />]

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

<p>R version 4.1.0 (2021-05-18)<br><p>Platform: x86_64-conda-linux-gnu (64-bit)</p><p>OS: Ubuntu 20.04.2 LTS</p><br>

Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 22-Jun-2021 at <i class='fa fa-clock-o' aria-hidden='true'></i> 13:37:05 <b>2021</b> • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/) • [RaukR](https://nbisweden.github.io/workshop-RaukR-2106/)