class: center, middle, inverse, title-slide # Statistical Workshop ## Advanced R for Bioinformatics. Visby, 2018. ### Bengt Sennblad ### 20 Juni, 2018 --- class: spaced ## Statistical workshop * Aim + to make you aware of some of the problems in statistic analysis + to give a survey of selected methods, and their implementation in R, to solve these problems and provide an intuition of - how and why the methods work - the connection between methods + to familiarize you with math/stat notation so you can extend your knowledge by self-studies .pull-left-50[ * Problems covered + Overfitting + Ill-posed problems + Feature selection + Data integration ] .pull-right-50[ * Methods covered + Regularization + Model selection + Dimensionality reduction ] --- name: toydata ## Toy example data We will use a simulated toy data set for demonstration examples<sup>*</sup>. ```r N=100 # no samples P=10 # no variables # Draw variables, x_1,...,x_2, from a uniform distribution X=matrix(runif(N*(P+1)), nrow=N, ncol=P) # Draw effects b from a uniform distribution b=c(runif(P, min=1.5, max=3.5)) # intercept b0=3 # generate a y1 variable from alm of 3 first X variables only y1 <- b0 + X[,seq(1,3)] %*% b [seq(1,3)] + rnorm(N) # Store in the Y matrix (really just a vector here) Y = as.matrix(y1) ``` <br><br><br><br> *.small[In this r code, and also in some of the coming slides, we make use of matrix notation of linear models to simplify notation, see the associated "extended_reading" document]. --- name: overfit1 ## Overfitting | `Likelihood example` * Consider the two models `\begin{aligned} y & \sim \beta_0 + \beta_1 x_1 & (1) \\ y & \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 & (2) \end{aligned}` with log likelihoods `\begin{aligned} \log L[\beta_0, \beta_1 | x_1, y] & \propto \log Pr[y | x_1, \beta_0, \beta_1] & (1) \\ \log L[\beta_0, \beta_1, \beta_2 | x_1, x_2, y] & \propto \log Pr[y | x_1, x_2, \beta_0, \beta_1, \beta_2] & (2) \end{aligned}` * What are the maximum likelihood estimates (ML) of the two models? Let's plot them! -- 2 variables are clearly better than 1 variable <br><br><br> <img src="" style="display: block; margin: auto auto auto 0;" /> --- name: overfit2 ## Overfitting | `Likelihood example` * Consider the two models `\begin{aligned} y & \sim \beta_0 + \beta_1 x_1 & (1) \\ y & \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2 & (2) \end{aligned}` with log likelihoods `\begin{aligned} \log L[\beta_0, \beta_1 | x_1, y] & \propto \log Pr[y | x_1, \beta_0, \beta_1] & (1) \\ \log L[\beta_0, \beta_1, \beta_2 | x_1, x_2, y] & \propto \log Pr[y | x_1, x_2, \beta_0, \beta_1,\beta_2] & (2) \end{aligned}` * What are the maximum likelihood estimates (ML) of the two models? Let's plot them! -- 2 variables are clearly better than 1 variable * In fact, as we include more parameters, ML will increase monotonically * Why is this? Why could this be a problem? <img src="" style="display: block; margin: auto auto auto 0;" /> --- name: overfit3 ## Overfitting | `Likelihood example` ### Why is this? -- #### Nested models * Model (1) can be described as a special case of Model (2) with the constraints on `\(\beta_2=0\)` * Therefore Model (2) will always have equal or better ML than Model (1) -- ### Why could this be a problem? -- #### Overfitting * In the example, the simulated data was generated from the 3 first variables + thus, the subsequent variables increase ML by modeling noise in data * This is difficult to detect by just looking at the likelihoods -- ### Solution: -- * Seek the simplest model that is "good enough" + Regularization/Bayesian priors + Model comparison + Cross-validation --- name: regularization ## Regularization * Regularization is a frequentist concept that adds auxiliary criteria, so-called *regularization terms*, to probabilistic models. + We will examplify this with a *penalized likelihood*, which attempts to balance the model's log likelihood against the number of parameters in the model + We will later cover other regularization applications when we discuss feature selection --- name: pen1 ## Regularization | `Penalized likelihood` * A very simplified penalized likelihood of our full model is `\begin{aligned} \log pL[\boldsymbol{\beta} | X, Y] & = \log Pr[Y | X, \boldsymbol{\beta}] - \#(\boldsymbol{\beta}\neq 0), \\ \end{aligned}` where, in the regularization term, `\(\#(\boldsymbol{\beta}\neq 0)\)` denote the number of non-zero elements in `\(\boldsymbol\beta\)` - We then choose the model and parameters that maximize the penalized likelihood - Applying this pL to our example, solves the overfitting problem <img src="" style="display: block; margin: auto auto auto 0;" /> --- name: pen2 ## Regularization | `Penalized likelihood` * More generally, we might want to penalize each `\(\beta_i\)` to be close to some expected value `\(m_i\)`, i.e., `\begin{aligned} \log pL[\boldsymbol{\beta} | X, Y] & = \log Pr[Y | X, \boldsymbol{\beta}] - w||\beta-m||_2^2 \\ \end{aligned}` where + `\(w\)` is some weight factor for the regularization term + `\(||\beta-m||_2^2=(||\beta-m||_2)^2=\sum_{i=1}^n (\beta_i-m_i)^2\)` is the squared `\(L_2\)` *norm*<sup>*</sup>. * This is a more typical and formally correct pL model -- * As a parenthesis, we make a note that *norms* (typically the `\(L_2\)` and the `\(L_1\)` norm)<sup>*</sup> features frequently in statistics, and particularly in regularization models + For example, the least squares method (which, for linear models, estimates ML) can be written as a minimization of the squared `\(L_2\)` norm of residuals: `$$\min_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n (y_i-x_{i,\cdot}\boldsymbol{\beta})^2\right\} = \min_{\boldsymbol{\beta}}\left\{||Y-X\boldsymbol{\beta}||_2^2\right\}$$` *.small[See "extended_reading" document.] --- name: pen3 ## Regularization | `Bayesian interpretation` -- * Exponentiating (removing the logarithm) of the `\(\log pL\)` for the general penalized likelihood model yields `$$pL[\boldsymbol{\beta} | X, Y] = Pr[y | X, \boldsymbol{\beta}] e^{-w||\boldsymbol{\beta}-m||_2^2}$$` -- * The penalizing term can, then, be viewed as a Bayesian prior on the values of `\(\boldsymbol{\beta}\)` with `\(\Pr[\boldsymbol{\beta}] = e^{w||\boldsymbol{\beta}-m||_2^2}\)`, i.e., we obtain the Bayesian model `$$Pr[Y | X, \boldsymbol{\beta}] Pr[\boldsymbol{\beta}]$$` + In fact, setting `\(w=2\sigma^2\)` and adding a constant `\(-\log\sqrt{2\sigma^2\pi}\)` to the `\(\log pL\)` equation would make the prior on each `\(\beta_i \sim N(m,\sigma^2)\)`. -- * The corresponding penalized likelihood formula for our first very simple toy pL example is `\begin{aligned} pL[\boldsymbol{\beta} | X, Y] = Pr[Y | X, \boldsymbol{\beta}] e^{-\#(\boldsymbol{\beta}\neq 0)}, \end{aligned}` + Here, the prior on `\(\boldsymbol{\beta}\)` becomes `\(Pr[\boldsymbol{\beta}]=e^{-\#(\boldsymbol{\beta}\neq 0)}\)`. --- name: modcomp ## Model comparison Compare two models and decide which one is the best. Approaches include * Likelihood ratio test - based on likelihood only * AIC (Akaike information criterion) - Based on information theory/regularization * Bayesian information criterion - Bayesian version of AIC --- name: lrt ## Model comparison | `Likelihood ratio test (LRT)` * Method -- uses likelihoods, only + Define the likelihood ratio, `\(LR\)`, for our original models (1) and (2) as `$$LR = \frac{ML[\beta_0, \beta_1 | x_1, y]}{ML[\beta_0, \beta_1, \beta_2 | x_1, x_2, y]}$$` + Then, for nested models, the *deviance* `\(2 \log LR \sim \chi^2(\delta)\)`, (where `\(\delta\)` is the difference in dimensionality between models), which provides a significance test for rejecting the simpler model (1) * Applied to our example data set, LRT correctly discards the simpler model when sequentially adding 3 first variables, but subsequent addition of variables does not improve the model significantly. <table class="table table-striped" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Compared models </th> <th style="text-align:right;"> logL 1st model </th> <th style="text-align:right;"> logL 2nd model </th> <th style="text-align:left;"> logLR </th> <th style="text-align:left;"> P-value </th> <th style="text-align:left;"> Significance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 vs 2 variables </td> <td style="text-align:right;"> -172.30 </td> <td style="text-align:right;"> -161.85 </td> <td style="text-align:left;"> -10.44 </td> <td style="text-align:left;"> 4.87e-06 </td> <td style="text-align:left;"> *** </td> </tr> <tr> <td style="text-align:left;"> 2 vs 3 variables </td> <td style="text-align:right;"> -161.85 </td> <td style="text-align:right;"> -139.08 </td> <td style="text-align:left;"> -22.78 </td> <td style="text-align:left;"> 1.49e-11 </td> <td style="text-align:left;"> *** </td> </tr> <tr> <td style="text-align:left;"> 3 vs 4 variables </td> <td style="text-align:right;"> -139.08 </td> <td style="text-align:right;"> -138.97 </td> <td style="text-align:left;"> -0.1104 </td> <td style="text-align:left;"> 0.638 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 4 vs 5 variables </td> <td style="text-align:right;"> -138.97 </td> <td style="text-align:right;"> -138.30 </td> <td style="text-align:left;"> -0.6655 </td> <td style="text-align:left;"> 0.249 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 5 vs 6 variables </td> <td style="text-align:right;"> -138.30 </td> <td style="text-align:right;"> -138.26 </td> <td style="text-align:left;"> -0.04381 </td> <td style="text-align:left;"> 0.767 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 6 vs 7 variables </td> <td style="text-align:right;"> -138.26 </td> <td style="text-align:right;"> -137.84 </td> <td style="text-align:left;"> -0.4136 </td> <td style="text-align:left;"> 0.363 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 7 vs 8 variables </td> <td style="text-align:right;"> -137.84 </td> <td style="text-align:right;"> -137.82 </td> <td style="text-align:left;"> -0.02251 </td> <td style="text-align:left;"> 0.832 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 8 vs 9 variables </td> <td style="text-align:right;"> -137.82 </td> <td style="text-align:right;"> -137.82 </td> <td style="text-align:left;"> -0.0001522 </td> <td style="text-align:left;"> 0.986 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> 9 vs 10 variables </td> <td style="text-align:right;"> -137.82 </td> <td style="text-align:right;"> -137.62 </td> <td style="text-align:left;"> -0.1968 </td> <td style="text-align:left;"> 0.53 </td> <td style="text-align:left;"> - </td> </tr> </tbody> </table> --- name: aic1 ## Model comparison | `AIC` * Method -- uses likelihood and number of parameters to estimate the information lost by selecting a model + Define, for a model `\(m\)` with estimated parameters `\(\boldsymbol{\beta}\)`, `$$AIC_m = 2\#(\boldsymbol{\beta}\neq 0)-2\log ML[\boldsymbol{\beta}|X,Y]$$` + Let `\(AIC_{min}\)` be the minimum AIC among a set of compared models + Let the *relative likelihood* for model `\(m\)` is `$$rL = e^\frac{ AIC_{min} - AIC_{m} }{2}$$` - `\(rL\)` can be interpreted as proportional to the probability that the model `\(m\)` minimizes the information loss. <!-- and can be interpreted as --> <!-- `\(rL \propto Pr[m\textrm{ minimizes estimated information loss}]\)`. --> * Notice that `$$rL = \frac{ML[\boldsymbol{\beta}_{m}|X,Y]e^{-\#(\boldsymbol{\beta}_m\neq 0])}}{ML[\boldsymbol{\beta}_{min}|X,Y]e^{-\#(\boldsymbol{\beta}_{min}\neq 0))}}$$` we see that `\(rL\)` can be viewed as a penalized likelihood ratio under our very simple pL toy model. --- name: aic1 ## Model comparison | `AIC` * Evaluation + Typical strategy is to select the model, `\(m\)` with `\(AIC_m=AIC_{min}\)` + In theory, in analogy with LRT, significance test could possibly be devised, but this is not done in practice. * AIC applied to our example data results in <table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Compared models </th> <th style="text-align:right;"> AIC </th> <th style="text-align:left;"> Minimum AIC </th> <th style="text-align:left;"> rL </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 variable </td> <td style="text-align:right;"> 350.59 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 2.763e-14 </td> </tr> <tr> <td style="text-align:left;"> 2 variables </td> <td style="text-align:right;"> 331.71 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 3.476e-10 </td> </tr> <tr> <td style="text-align:left;"> 3 variables </td> <td style="text-align:right;"> 288.15 </td> <td style="text-align:left;"> Yes </td> <td style="text-align:left;"> 1.000e+00 </td> </tr> <tr> <td style="text-align:left;"> 4 variables </td> <td style="text-align:right;"> 289.93 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 4.107e-01 </td> </tr> <tr> <td style="text-align:left;"> 5 variables </td> <td style="text-align:right;"> 290.60 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 2.938e-01 </td> </tr> <tr> <td style="text-align:left;"> 6 variables </td> <td style="text-align:right;"> 292.52 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 1.125e-01 </td> </tr> <tr> <td style="text-align:left;"> 7 variables </td> <td style="text-align:right;"> 293.69 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 6.266e-02 </td> </tr> <tr> <td style="text-align:left;"> 8 variables </td> <td style="text-align:right;"> 295.64 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 2.364e-02 </td> </tr> <tr> <td style="text-align:left;"> 9 variables </td> <td style="text-align:right;"> 297.64 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 8.695e-03 </td> </tr> <tr> <td style="text-align:left;"> 10 variables </td> <td style="text-align:right;"> 299.25 </td> <td style="text-align:left;"> - </td> <td style="text-align:left;"> 3.887e-03 </td> </tr> </tbody> </table> --- name: illposed1 ## Ill-posed problems aka overparameterization * A *well-posed problem* fulfills requirements 1. solution exists 2. unique solution 3. stable solution (small changes in input does not give huge changes in result) * Otherwise the problem is *ill-posed* + specifically if it fails 3., it is *ill-conditioned* (multi-collinearity) * Examples + Least squares regression with more parameters `\(\beta_i\)` than observations has multiple solutions + SNPs in LD may give rise to multi-collinearity * How can we solve this? -- * Solution: Add additional assumptions (constraints), implemented through, e.g., + prior distributions (Bayesian), e.g. `\(\boldsymbol{\beta}\sim N(m,\sigma^2)\)` + add a regularization term (frequentists), e.g., `\(-w(\boldsymbol{\beta}-m)^2\)` --- name: featsel ## Feature selection * Aim: Simplifying the (full) model without compromising explanatory power - balance between "simplicity" (number of variables) and "bestness" (likelihood) * methods + model comparison - As described above + regularized methods - LASSO (Least Absolute Shrinkage and Selection Operator) - Ridge regression - Elastic-net + dimensionality reduction to obtain fewer summary variables - PLS - CCA --- name: lasso1 ## Feature selection | `LASSO` * Builds on a regression model `\(Y \sim X\boldsymbol{\beta}\)` with regularization + The variables Y and X must be centered and standardized to ensure that all variables are given equal weight in the model selection. * Most common notation for lasso is *Least Squares* with an auxiliary criterion/constraint: `$$min_{\boldsymbol{\beta}}\left\{\frac{1}{N} ||Y-X\boldsymbol{\beta}||_2^2\right\} \textrm{subject to} ||\boldsymbol{\beta}||_1\leq \lambda$$` where `\(||\boldsymbol{\beta}||_1=\sum_{i=1}^n \beta_i\)` is the `\(L_1\)` norm of `\(\boldsymbol{\beta}\)`. * Other ways of viewing LASSO: + pL with regularization term `\(-\lambda\boldsymbol{\beta}\)` + Bayesian prior on `\(\beta_j ∼ LaPlace(0, 1/\lambda)\)` * Alternatives to LASSO, differing mainly in the auxiliary criterion - *Elastic-net*, which uses a squared `\(L_1\)` norm - *Ridge regression* which uses a `\(L_2\)` norm --- name: lasso2 ## Feature selection | `LASSO` * There are two main R-packages for performing LASSO, *lars* and *glmnet* + The packages may produce somewhat different results + The difference between the packages lies in - the models supported * lars: LASSO on linear model and selected GLMs * glmnet: LASSO, ridge regression or Elastic net on a larger selection of GLMs - methods for optimization- and standardization * Lasso is often visualized with the following plot (lars on our toy data) <img src="" style="display: block; margin: auto auto auto 0;" /> --- name: lasso3 ## Feature selection | `LASSO` * A problem is to decide `\(\lambda\)` to use as the cutoff. The two methods provide different methods for this .pull-left-50[ * glmnet ues cross-validation to obtain an estimate of the minimum `\(\lambda\)` attainable, and then view the estimated `\(\beta_i\)` under this `\(\lambda\)`. <table> <thead> <tr> <th style="text-align:right;"> Variable </th> <th style="text-align:right;"> beta(lambda=0.11) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3.392205 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2.075677 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1.625815 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2.267262 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0.000000 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.000000 </td> </tr> </tbody> </table> ] .pull-right-50[ * There also exists a fairly new significance test on `\(\lambda\)`, implemented in the package covTest (works only with lars on linear models) <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> P-value </th> <th style="text-align:left;"> Significance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 0.042 </td> <td style="text-align:left;"> * </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 0.0026 </td> <td style="text-align:left;"> ** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> *** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 0.8026 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 7 </td> <td style="text-align:left;"> 0.8853 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 0.8268 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 0.9654 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 0.9752 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 0.9879 </td> <td style="text-align:left;"> - </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 0.9852 </td> <td style="text-align:left;"> - </td> </tr> </tbody> </table> ] --- name: dimred ## Feature selection | `Dimensionality reduction` * Recall how PCA projects `\(X\)` onto a new basis of lower dimensionality <br><br><br><br><br><br><br><br> <img src="" width="2667" style="display: block; margin: auto auto auto 0;" /> --- name: dimred ## Feature selection | `Dimensionality reduction` * Recall how PCA projects `\(X\)` onto a new basis of lower dimensionality + extend to project 2 matrices `\(X\)` and `\(Y\)` onto a *common* basis using projection matrices `\(V\)` and `\(W\)` - `\(T=XV\)` and `\(U=YW\)` are "components" that captures the *joint* variation in `\(X\)` and `\(Y\)` - `\(V\)` and `\(W\)` are "loadings" that describes the contribution from original variables and defines the common basis <img src="" width="2667" style="display: block; margin: auto auto auto 0;" /> --- name: dimred2 ## Feature selection | `Dimensionality reduction` * How do we best define an appropriate common basis? + select `\(V\)` and `\(W\)` so that - PLS: the *covariance* between `\(T=XV\)` and `\(U=YW\)` is maximized - CCA: the *correlation* between `\(T=XV\)` and `\(U=YW\)` is maximized -- * PLS (Partial least squares) vs CCA (Canonical correlation analysis) + Similarities - uses regularization to limit choices of `\(V\)` and `\(W\)` for high-dimensional data - solutions connected to singular value decomposition (SVD) theory<sup>*</sup> + Differences - relative constraints on `\(W\)` and `\(V\)` - motivation/use areas + In practice, quite equivalent behaviour when it comes to prediction, but differs in details .small[<sup>*</sup> see "extended_reading" document] --- name: pls1 ## Feature selection | `Dimensionality reduction` ### PLS (Partial least squares or Projection to latent structures) * Components are often called *latent variables* * 2 "modes": + Regression mode: - builds on a *'multivariate regression'*, `$$YW = U ≈ T = XV \Rightarrow XVW^{-1} ≈ Y \sim X\boldsymbol{\beta} \Rightarrow \boldsymbol{\beta}=VW^{-1}$$` - this regression relation is used in an iterative estimation of the column vectors `\((v_i,w_i)\)`, where `\(v_i\in V\)` and `\(w_i \in W\)` - Can use a multivariate `\(Y\)` and handles co-linear, missing, noisy, correlated data better than standard multivariate regression + Canonical mode -- similar to CCA - column vectors `\((v_i,w_i)\)` are estimated in a single SVD - identifying similarities between data sets * Variants + several variants(OPLS, O2PLS, OnPLS) + MixOmics sPLS combines PLS with lasso. --- name: pls2 ## Feature selection | `PLS Example` * Apply MixOmics PLS in regression mode to our toy data with an additional `\(y2\)` variable, generated using `\(x_8,x_9,x_{10}\)` ```r y2 = b0 + X[,seq(8,10)] %*% b[seq(8,10)] + rnorm(N) # Y2 ~ x_8, x_9 and x_10 Y = matrix(c(y1, y2), ncol=2) # save Y-vars in Y Z = cut( y1+y2, 2 ) # bogus classification of Y based on sum of y1 and y2 ``` -- .pull-left-50[ .small[ * Find significant components <img src="" style="display: block; margin: auto auto auto 0;" /> ] ] -- .pull-right-50[ .small[ * Check "loadings" <img src="" style="display: block; margin: auto auto auto 0;" /> ] ] -- .pull-left-50[ .small[ * Plot our samples on components from `\(X\)` and `\(Y\)` <img src="" style="display: block; margin: auto auto auto 0;" /> ] ] -- .pull-right-50[ .small[ * compare to PCA on `\(X\)` alone <img src="" style="display: block; margin: auto auto auto 0;" /> ] ] --- name: integration1 ## Integrative analysis * Simultaneous analysis of several data sets `\(\{X_1, \ldots, X_k\}, k\geq 2\)`, e.g., from different omics, possibly including response data `\(Y\)` ("*supervised*") or not ("*unsupervised*") * Approaches + Methods based on dimensionality reduction - Joint variation, only - Joint + unique variation + Machine learning - e.g., deep learning + Network-based methods - Multilevel network - often rather specialized, e.g., for cancer - different network learning methods, network fusion - Path modeling - for time series, causal relations, etc - variants: ML-based, PLS-based, Bayesian networks --- name: integration2 ## Integrative analysis | `Dimensionality reduction` #### Estimating joint variation only * Aims to identify common patterns of variation among `\(\{X_i\}\)` (and `\(Y\)`) * Approach: + simultaneous component analysis (SCA) = generalized PCA for 2 matrices + PLS + CCA - For all, generalization to `\(>2\)` matrices through averaging over pairwise comparisons of matrices * MixOmics package - MINT: when `\(\{X_i\}\)` share variables (e.g., integrating RNAseq data for different samples) - DIABLO: when `\(\{X_i\}\)` share samples (e.g., integrating RNAseq, metabolomics, epigenetic data for same sample) --- name: integration3 ## Integrative analysis | `Dimensionality reduction` #### Estimating joint + unique variation * Aims to identify joint variation among `\(\{X_i\}\)` (and `\(Y\)`), as well as unique variation in each `\(X_i\)` * PLS-based approach + identify `\(v_p,w_p\)` for the joint variation as described above, e.g., using SVD of `\(XY^T\)` - the unique variation is *orthogonal* to `\(Y\)`, so not included in `\(v_p,w_p\)` - remove joint variation from `\(X\)` and estimate unique variation `\(v_o,w_o\)`. + variants - OPLS: PLS-regression mode for two matrices - O2PLS: PLS canonical mode for two matrices. - OnPLS: PLS canonical mode for multiple matrices. * Other approaches + JIVE, DISCO uses SCA --- name: integration3 ## Integrative analysis | `MINT example` * MINT PLS applied to our toy data, with `\(X\)` split into 4 sub matrices `\(\{X_1, \ldots, X_4\}\)`, representing different studies, and using outcome `\(Y\)`. ```r # the vector indicating X membership in each independent study study = as.factor(c(rep(1,25), rep(2,25), rep(3,25), rep(4,25))) ## 1 2 3 4 ## 25 25 25 25 ``` -- .pull-left-50[ <img src="" style="display: block; margin: auto auto auto 0;" /> ] -- .pull-right-50[ <img src="" style="display: block; margin: auto auto auto 0;" /> ] -- .pull-left-50[ <img src="" style="display: block; margin: auto auto auto 0;" /> ] --- name: integration5 ### What can we use the results for? -- * Prediction + Apply the model to new data `\(X'\)` and predict the outcome `\(Y'\)` -- * Classification + Cluster on the latent variables `\(\{T,U\}\)` and identify, e.g., disease subtypes -- * Identification of "drivers" and "biomarkers" + Look at the loadings and see which original variables `\(X\)` and `\(Y\)` contribute most to the projected variabels in `\(T\)` and `\(U\)`. - However, remember that these drivers are not working alone -- it is a multivariate model we have estimated! --- name: end-slide class: end-slide # Thank you