class: center, middle, inverse, title-slide

# Statistics / Machine Learning in R
## RaukR 2021 • R for Bioinformatics
### Nikolay Oskolkov
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: Types of Analysis

## Different Types of Data Analysis

* The choice of analysis depends on the amount of data we have
* More precisely, on the balance between the numbers of features and observations
  + P is the number of features (genes, proteins, genetic variants etc.)
  + N is the number of observations (samples, cells, nucleotides etc.)

<br/>
.center[
<img src="images/AmountOfData.png" style="width: 90%;" />]

---
name: The Curse of Dimensionality

## The Curse of Dimensionality

.pull-left-50[
<img src="images/DarkMagic.jpg" style="width: 90%;" />]
.pull-right-50[
.center[
<span style="color:black">**This is what data looks like in high dimensions:**</span>]
.center[
<img src="images/ndim_ball.png" style="width: 110%;" />]

`$$Y = \alpha + \beta X$$`
`$$\beta = \left(X^TX\right)^{-1}X^TY$$`
`$$\left(X^TX\right)^{-1} \sim \frac{1}{\rm{det}\left(X^TX\right)}\dots\,\rightarrow\,\infty\hbox{,}\,\,\,\,\,\,\,\,n\ll p$$`

.center[
<span style="color:red">**Inverted covariance matrix diverges in high dimensions**</span>]

<br/>
`$$E[\hat{\sigma}^2]=\frac{n-p}{n}\sigma^2$$`

.center[
<span style="color:red">**Maximum Likelihood variance estimator is severely biased in high dimensions**</span>]
]
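
---
name: Curse of Dimensionality in Numbers

## The Curse of Dimensionality in Numbers

A quick base-R sketch of the statement above (the values of n and p are arbitrary choices for illustration): as the number of features p approaches the number of samples n, the smallest eigenvalue of `t(X) %*% X` shrinks towards zero and its condition number explodes, so inverting it with `solve()` becomes numerically unstable.

```r
set.seed(1)
n <- 50                          # fixed number of samples
for(p in c(5, 25, 45, 49))       # growing number of features
{
  X <- matrix(rnorm(n * p), n, p)
  eig <- eigen(t(X) %*% X, symmetric = TRUE, only.values = TRUE)$values
  cat("p =", p,
      " smallest eigenvalue =", signif(min(eig), 3),
      " condition number =", signif(max(eig) / min(eig), 3), "\n")
}
# once p exceeds n, t(X) %*% X is rank-deficient and solve() fails outright,
# which is why lm() reports NA / NaN estimates when p reaches n (see the
# "Even Higher Dimensions" slide later on)
```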

---
name: Equidistant Points in High Dimensions

## Equidistant Points in High Dimensions

.panelset[
.panel[.panel-name[R Code]

```r
n <- 1000; pair_dist <- list(); p <- c(2, 32, 512)
for(i in 1:length(p))
{
  X <- matrix(rnorm(n * p[i]), n, p[i])
  pair_dist[[i]] <- as.vector(dist(X))
  pair_dist[[i]] <- pair_dist[[i]] / max(pair_dist[[i]])
}
```
]
.panel[.panel-name[Plot]

<img src="presentation_files/figure-html/Hist_Plot-1.png" style="display: block; margin: auto;" />
]]

---
name: Neighboring Points in High Dimensions

## Neighboring Points in High Dimensions

.panelset[
.panel[.panel-name[R Code]

```r
n <- 1000
first_dist <- vector()
mid_dist <- vector()
last_dist <- vector()
num_dim <- 2 ^ seq(1, 10, by=1)
for(p in num_dim)
{
  X <- matrix(rnorm(n*p), n, p)
  dist_X <- sort(as.vector(dist(X)), decreasing=FALSE)
  first_dist <- append(first_dist, dist_X[1])
  mid_dist <- append(mid_dist, dist_X[round(length(dist_X)/2, 0)])
  last_dist <- append(last_dist, dist_X[length(dist_X)])
}
```
]
.panel[.panel-name[Plot]

<img src="presentation_files/figure-html/Neighbors Plot-1.png" style="display: block; margin: auto;" />
]]

---
name: Low Dimensional Space

## Low Dimensional Space

```r
set.seed(123)
n <- 20 # number of samples
p <- 2  # number of features / dimensions
Y <- rnorm(n)
X <- matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204,  Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274
```

---
name: Going to Higher Dimensions

## Going to Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 10
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592,  Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089
```

---
name: Even Higher Dimensions

## Even Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 20
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889        NaN     NaN      NaN
## X1           0.66218        NaN     NaN      NaN
## X2           0.76212        NaN     NaN      NaN
## X3          -1.35033        NaN     NaN      NaN
## X4          -0.57487        NaN     NaN      NaN
## X5           0.02142        NaN     NaN      NaN
## X6           0.40290        NaN     NaN      NaN
## X7           0.03313        NaN     NaN      NaN
## X8          -0.31983        NaN     NaN      NaN
## X9          -0.92833        NaN     NaN      NaN
## X10          0.18091        NaN     NaN      NaN
## X11         -1.37618        NaN     NaN      NaN
## X12          2.11438        NaN     NaN      NaN
## X13         -1.75103        NaN     NaN      NaN
## X14         -1.55073        NaN     NaN      NaN
## X15          0.01112        NaN     NaN      NaN
## X16         -0.50943        NaN     NaN      NaN
## X17         -0.47576        NaN     NaN      NaN
## X18          0.31793        NaN     NaN      NaN
## X19          1.43615        NaN     NaN      NaN
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA
```

---
name: High dimensional data in Life Sciences

## High Dimensional Data in Life Sciences

Typical numbers of features (P) per data type:

* Proteomics, metabolomics: 50 - 100
* Metagenomics: 100 - 500
* Transcriptomics: 20 000
* Microscopy imaging: 1 000 000
* Genomics, epigenomics: 30 000 000

.center[
<img src="images/DataLifeSciences.png" style="width: 70%;" />]

---
name: Ways to Overcome The Curse of Dimensionality

## Ways to Overcome The Curse of Dimensionality

* Increasing the sample size - often not affordable
* Dimensionality reduction - PCA, UMAP etc.
* Regularizations - LASSO, Ridge, Elastic Net, Dropout etc.
* Bayesian statistics - the prior plays the role of a regularizer

<br/>
.center[
<img src="images/Overcome_CD.png" style="width: 100%;" />]

---
name: Dimensionality Reduction

## Dimensionality Reduction

.center[
<img src="images/PCA_tSNE_MNIST.png" style="width: 90%;" />]

---
name: Dimension Reduction: Matrix Factorization

## Dimension Reduction: Matrix Factorization

.center[
<img src="images/MatrixFactor.png" style="width: 100%;" />]

---
name: Dimension Reduction: Neighborhood Graph

## Dimension Reduction: Neighborhood Graph

.center[
<img src="images/tSNE_Scheme.png" style="width: 68%;" />]

---
name: Regularizations: LASSO

## Regularizations: LASSO

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon \\ \textrm{OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 \\ \textrm{Penalized OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

.center[
<img src="images/CV_lambda.png" style="width: 70%;" />]

---
name: K-Fold Cross-Validation

## Regularizations: K-Fold Cross-Validation

.center[
<img src="images/Kfold_CrossVal.png" style="width: 100%;" />]
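
---
name: LASSO with glmnet

## LASSO with Cross-Validation in R

A minimal sketch of how the two previous slides are typically put into practice, assuming the `glmnet` package (not used elsewhere in this deck) is installed: `cv.glmnet()` runs K-fold cross-validation over a grid of lambda values and reports the penalty with the lowest cross-validated error.

```r
library(glmnet)

# simulated data: only the first 5 of 50 features are informative
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- rowSums(X[, 1:5]) + rnorm(n)

cv_fit <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)  # alpha = 1 corresponds to LASSO
plot(cv_fit)                                       # CV error as a function of log(lambda)
coef(cv_fit, s = "lambda.min")                     # sparse coefficients at the optimal lambda
```

Setting `alpha` between 0 and 1 would give Elastic Net, and `alpha = 0` Ridge regression.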

---
name: Priors in Bayesian Statistics - Regularizations

## Priors in Bayesian Statistics - Regularizations

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon$$`

* **Maximum Likelihood** principle: maximize the probability of observing the data given the parameters:

`$$Y \sim N(\,\beta_1X_1+\beta_2X_2, \sigma^2\,) \\ \rm{L}\,(\,\rm{Y} \,|\, \beta_1,\beta_2\,) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(Y-\beta_1X_1-\beta_2X_2)^2}{2\sigma^2}}$$`

--

* **Bayes theorem**: maximize the posterior probability of the parameters given the data:

`$$P(\rm{parameters} \,|\, \rm{data})=\frac{L(\rm{data} \,|\, \rm{parameters})*P(\rm{parameters})}{\int{L(\rm{data} \,|\, \rm{parameters})*P(\rm{parameters}) \, d(\rm{parameters})}}$$`

--

`$$P(\, \beta_1,\beta_2 \,|\, \rm{Y}\,) \sim \rm{L}\,(\,\rm{Y} \,|\, \beta_1,\beta_2\,)*P(\beta_1,\beta_2) \sim e^{-\frac{(Y-\beta_1X_1-\beta_2X_2)^2}{2\sigma^2}}*e^{-\lambda(|\beta_1|+|\beta_2|)}$$`
`$$-\log{P(\, \beta_1,\beta_2 \,|\, \rm{Y}\,)} \sim (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

--

<br/>
* LASSO is thus equivalent to maximizing a Bayesian posterior with a Laplace prior on the coefficients
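
---
name: Laplace Prior in Practice

## Laplace Prior in Practice

A minimal base-R sketch of the derivation on the previous slide: minimizing the negative log-posterior, i.e. the squared error plus the Laplace-prior penalty, with `optim()` shrinks the coefficient of a pure-noise feature towards zero. The simulated data and the value of lambda are arbitrary choices for illustration.

```r
set.seed(1)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)                  # X2 has no effect on Y
Y  <- X1 + rnorm(n)
lambda <- 10                    # strength of the Laplace prior

# negative log-posterior ~ penalized sum of squares (see previous slide)
neg_log_posterior <- function(beta)
{
  sum((Y - beta[1] * X1 - beta[2] * X2)^2) + lambda * (abs(beta[1]) + abs(beta[2]))
}

optim(par = c(0, 0), fn = neg_log_posterior)$par   # beta2 is shrunk towards zero
```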

---
name: Moving from Statistics to Machine Learning

## Moving from Statistics to Machine Learning

A few peculiarities of data analysis that are typical of Machine Learning:

* A strong focus on **prediction** rather than on biomarker discovery
* Heavy use of **cross-validation** and regularization
* More **algorithmic**, fewer "pen-and-paper" derivations
* A community that speaks more **Python** than R

.center[
<img src="images/MachineLearning.jpg" style="width: 70%;" />]

---
name: Statistics vs. Machine Learning

## Statistics vs. Machine Learning

.center[
<img src="images/Stats_vs_ML.png" style="width: 100%;" />]

---
name: How does Machine Learning work?

## How does Machine Learning work?

Machine Learning typically involves five basic steps:

1. Split the data set into **train**, **validation** and **test** subsets.
2. Fit the model on the train subset.
3. Validate your model on the validation subset.
4. Repeat steps 1-3 a number of times and tune **hyperparameters**.
5. Test the accuracy of the optimized model on the test subset.

.center[
<img src="images/TrainTestSplit.png" style="width: 60%;" />]

---
name: Toy Example of Machine Learning

## Toy Example of Machine Learning

```r
set.seed(12345)
N<-100
x<-rnorm(N)
y<-2*x+rnorm(N)
df<-data.frame(x,y)
plot(y~x,data=df, col="blue")
```

<img src="presentation_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Train and Test Subsets

## Toy Example: Train and Test Subsets

We randomly assign 70% of the data to the training subset and 30% to the test subset:

```r
set.seed(123)
train<-df[sample(1:dim(df)[1],0.7*dim(df)[1]),]
test<-df[!rownames(df)%in%rownames(train),]
```

<img src="presentation_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Validation of Model

## Toy Example: Validation of Model

```r
test_predicted<-as.numeric(predict(lm(y~x,data=train),newdata=test))
plot(test$y~test_predicted,ylab="True y",xlab="Pred y",col="red")
abline(lm(test$y~test_predicted),col="darkgreen")
```

<img src="presentation_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
name: Toy Example: Validation of Model (Cont.)

## Toy Example: Validation of Model (Cont.)

```r
summary(lm(test$y~test_predicted))
```

```
## 
## Call:
## lm(formula = test$y ~ test_predicted)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80597 -0.78005  0.07636  0.52330  2.61924 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.02058    0.21588   0.095    0.925    
## test_predicted  0.89953    0.08678  10.366 4.33e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.053 on 28 degrees of freedom
## Multiple R-squared:  0.7933,  Adjusted R-squared:  0.7859 
## F-statistic: 107.4 on 1 and 28 DF,  p-value: 4.329e-11
```

Thus the model explains about 79% of the variation in the test subset.

---
name: What is a Hyperparameter?

## What is a Hyperparameter?

* Hyperparameters are Machine Learning design parameters that are set before the learning process starts
* For the toy model above, a hyperparameter could be e.g. the number of covariates that the main variable of interest x is adjusted for

```r
set.seed(1)
for(i in 1:10)
{
  df[,paste0("PC",i)]<-1*(1-i/10)*y+rnorm(N)
}
head(df)
```

```
##            x          y color        PC1        PC2        PC3
## 1  0.5855288  1.3949830   red  0.6290309  0.4956198  1.3858900
## 2  0.7094660  0.2627087   red  0.4200812  0.2522828  1.8727694
## 3 -0.1093033  0.2038119   red -0.6521979 -0.7478721  1.7292568
## 4 -0.4534972 -2.2317496  blue -0.4132938 -1.6273709 -1.8931325
## 5  0.6058875  1.3528592  blue  1.5470811  0.4277027 -1.3382341
## 6 -1.8179560 -4.1719599  blue -4.5752323 -1.5702807 -0.4227104
##          PC4        PC5         PC6         PC7        PC8
## 1  1.7306635  1.7719325  0.63529634  0.07742793 -0.4285716
## 2 -0.8896729  2.0270091 -0.19178516  1.58123715  2.0241138
## 3  2.0936245 -0.5010914 -1.10171748  0.58945128 -0.0492363
## 4 -1.7226819 -1.5067426 -0.88140715 -0.12733353 -0.4603672
## 5  2.4658608  0.2602076  1.53274473  0.26918441 -0.8528851
## 6 -0.9909633 -2.4616374 -0.07481652 -2.38832183 -2.1785221
##          PC9       PC10
## 1 -0.9474105 -1.5414026
## 2 -1.7998121  0.1943211
## 3  1.0156630  0.2644225
## 4 -0.2350367 -1.1187352
## 5 -0.4643425  0.6509530
## 6 -0.5951440 -1.0329002
```

---
name: How does Cross-Validation work?

## How does Cross-Validation work?

* We should not include all of the PCs in the model - this would be prone to overfitting
* Cross-Validation is a way to combat overfitting

```r
train<-df[sample(1:dim(df)[1],0.6*dim(df)[1]),]
val_test<-df[!rownames(df)%in%rownames(train),]
validate<-val_test[sample(1:dim(val_test)[1],0.25*dim(val_test)[1]),]
test<-val_test[!rownames(val_test)%in%rownames(validate),]
```

<img src="presentation_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
name: How does Cross-Validation work? (Cont.)

## How does Cross-Validation work? (Cont.)

* Fit a linear regression model on the training set and assess the error on the validation set
* The error is the root mean squared difference between the y predicted by the trained model on the validation set and the true y of the validation set
* There is no dramatic decrease in RMSE after PC2

<img src="presentation_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
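
---
name: Computing the Validation RMSE

## Computing the Validation RMSE

The curve on the previous slide can be produced with a loop of roughly the following form (a minimal sketch, not necessarily the exact code behind the figure): for each number of included PCs, fit the model on the training subset and compute the RMSE of its predictions on the validation subset.

```r
rmse <- vector()
for(i in 1:10)
{
  # model with the main variable x plus the first i PCs as covariates
  formula_i <- as.formula(paste0("y ~ x + ", paste0("PC", 1:i, collapse = " + ")))
  fit_i <- lm(formula_i, data = train)
  pred_i <- predict(fit_i, newdata = validate)
  rmse <- append(rmse, sqrt(mean((validate$y - pred_i)^2)))
}
plot(1:10, rmse, type = "o", xlab = "Number of PCs included", ylab = "Validation RMSE")
```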

---
name: Ultimate Model Evaluation

## Ultimate Model Evaluation

* Thus the optimal model is `y ~ x + PC1 + PC2`
* We perform the final evaluation of the optimized / trained model on the test data set and report the final evaluation metric, adjusted R squared in our case
* The model explains over 90% of the variation in the unseen test data set

```r
summary(lm(predict(lm(y~x+PC1+PC2,data=train),newdata=test)~test$y))
```

```
## 
## Call:
## lm(formula = predict(lm(y ~ x + PC1 + PC2, data = train), newdata = test) ~ 
##     test$y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02409 -0.26361 -0.00061  0.19345  0.95801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22155    0.09485   2.336   0.0269 *  
## test$y       0.93912    0.03994  23.514   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5002 on 28 degrees of freedom
## Multiple R-squared:  0.9518,  Adjusted R-squared:  0.9501 
## F-statistic: 552.9 on 1 and 28 DF,  p-value: < 2.2e-16
```

---
name: From Linear Models to ANNs

## From Linear Models to ANNs

* ANN: a mathematical algorithm / function Y = f(X) with a special architecture
* Highly non-linear due to the activation functions
* Backpropagation is used to minimize the error

.center[
<img src="images/ANN.png" style="width: 90%;" />]

---
name: Universal Approximation Theorem

## Universal Approximation Theorem

.center[
<img src="images/UAT.png" style="width: 100%;" />]

---
name: Single Cells make Big Data

## Single Cells make Big Data

.center[
<img src="images/SingleCellBigData.png" style="width: 90%;" />]

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

<p>R version 4.1.0 (2021-05-18)<br><p>Platform: x86_64-conda-linux-gnu (64-bit)</p><p>OS: Ubuntu 20.04.2 LTS</p><br>

Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 22-Jun-2021 at <i class='fa fa-clock-o' aria-hidden='true'></i> 13:37:05 <b>2021</b> • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/) • [RaukR](https://nbisweden.github.io/workshop-RaukR-2106/)