class: center, middle, inverse, title-slide

# Machine Learning in R
## RaukR 2019 • Advanced R for Bioinformatics
### Nikolay Oskolkov
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: Machine Learning for Everyone

## Machine Learning for Everyone

.center[ <img src="MachineLearningEveryone.png" style="width: 100%;" />]

---
name: Stereotypes about Machine Learning

## Stereotypes about Machine Learning

.center[ <img src="MLStereotypes.png" style="width: 90%;" />]

---
name: Data Science and Artificial Intelligence

## Data Science and Artificial Intelligence

* The world’s most valuable resource is no longer oil, but Data
* Big Data is arriving, and Bioinformatics learns from Data Science
* Data Science speaks Python, Julia, JavaScript, Scala and R
* Apache Spark, Probabilistic Programming and AI are becoming common

.center[ <img src="Oil_vs_Data.jpg" style="width: 90%;" />]

---
name: What is Machine Learning?

## What is Machine Learning?

* Machine Learning maps input X to output Y as `$$Y = f(X)$$` without necessarily knowing the functional form of f
* Machine Learning provides two major things:
  * **Prediction**
  * **Feature Selection**
* Machine Learning can be categorized into:
  * **Parametric**: assumptions about f(X), often linear, easy to learn, fast, little data needed, poorer prediction (example: Linear and Logistic Regression)
  * **Non-Parametric**: assumption-free, difficult to train, slow, needs a lot of data, higher prediction power (example: Random Forest, LASSO, Neural Networks)

---
name: Diversity of Machine Learning

## Diversity of Machine Learning

![Diversity of Machine Learning](MachineLearningAlgorithms.png)

---
name: Supervised vs. Unsupervised

## Supervised vs. Unsupervised

.center[ <img src="SupervisedVsUnsupervisedAnalysis.png" style="width: 90%;" />]

---
name: Main Steps of Machine Learning

## Main Steps of Machine Learning

* To start Machine Learning one first needs to clean the data: impute, correct for batch effects, normalize, standardize etc.
* The next step is to extract features from the data. Note: Deep Learning skips this step and works directly on raw data
* The Machine Learning model is fitted on the training subset and evaluated on an independent subset

![Main Steps of Machine Learning](ML.png)

---
name: How does Machine Learning work?

## How does Machine Learning work?

Machine Learning typically involves five basic steps:

1. Split the data set into **train**, **validation** and **test** subsets (see the splitting sketch below).
2. Fit the model on the train subset.
3. Validate your model on the validation subset.
4. Repeat steps 1-3 a number of times and tune **hyperparameters**.
5. Test the accuracy of the optimized model on the test subset.
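Step 1 can be done with base R `sample()` (as on the following slides) or, for example, with `caret::createDataPartition()`. Below is a minimal sketch, assuming the `caret` package is installed; the toy data frame and the 60% / 10% / 30% proportions are illustrative and only mirror the cross-validation slides further below:

```r
library(caret)

# toy data to be split
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# 60% of the rows go to the train subset
in_train <- createDataPartition(df$y, p = 0.6, list = FALSE)
train    <- df[in_train, ]
holdout  <- df[-in_train, ]

# the remaining 40% are split into 10% validation and 30% test
in_val   <- createDataPartition(holdout$y, p = 0.25, list = FALSE)
validate <- holdout[in_val, ]
test     <- holdout[-in_val, ]
```

Unlike a plain `sample()`, `createDataPartition()` keeps the distribution of the outcome similar across the subsets.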
.center[ <img src="TrainTestSplit.png" style="width: 60%;" />] --- name: Toy Example of Machine Learning ## Toy Example of Machine Learning ```r set.seed(12345) N<-100 x<-rnorm(N) y<-2*x+rnorm(N) df<-data.frame(x,y) plot(y~x,data=df, col="blue") ``` <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> --- name: Train and Test Subsets ## Train and Test Subsets We randomly assign 70% of the data to training and 30% to test subsets: ```r set.seed(123) train<-df[sample(1:dim(df)[1],0.7*dim(df)[1]),] test<-df[!rownames(df)%in%rownames(train),] ``` <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> --- name: Validation of Model ## Validation of Model ```r test_predicted<-as.numeric(predict(lm(y~x,data=train),newdata=test)) plot(test$y~test_predicted,ylab="True y",xlab="Pred y",col="darkgreen") abline(lm(test$y~test_predicted),col="darkgreen") ``` <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> --- name: Validation of Model (Cont.) ## Validation of Model (Cont.) ```r summary(lm(test$y~test_predicted)) ``` ``` ## ## Call: ## lm(formula = test$y ~ test_predicted) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.80597 -0.78005 0.07636 0.52330 2.61924 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.02058 0.21588 0.095 0.925 ## test_predicted 0.89953 0.08678 10.366 4.33e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.053 on 28 degrees of freedom ## Multiple R-squared: 0.7933, Adjusted R-squared: 0.7859 ## F-statistic: 107.4 on 1 and 28 DF, p-value: 4.329e-11 ``` Thus the model explains 79% of variation on the test subset. --- name: What is a Hyperparameter? ## What is a Hyperparameter? * Hyperparameters are Machine Learning design parameters which are set before the learning process starts * For the toy model a hyperparameter can be e.g. the number of covariates to adjust the main variable x of interest for ```r set.seed(1) for(i in 1:10) { df[,paste0("PC",i)]<-1*(1-i/10)*y+rnorm(N) } head(df) ``` ``` ## x y color PC1 PC2 PC3 PC4 ## 1 0.5855288 1.3949830 red 0.6290309 0.4956198 1.3858900 1.7306635 ## 2 0.7094660 0.2627087 red 0.4200812 0.2522828 1.8727694 -0.8896729 ## 3 -0.1093033 0.2038119 red -0.6521979 -0.7478721 1.7292568 2.0936245 ## 4 -0.4534972 -2.2317496 blue -0.4132938 -1.6273709 -1.8931325 -1.7226819 ## 5 0.6058875 1.3528592 blue 1.5470811 0.4277027 -1.3382341 2.4658608 ## 6 -1.8179560 -4.1719599 blue -4.5752323 -1.5702807 -0.4227104 -0.9909633 ## PC5 PC6 PC7 PC8 PC9 PC10 ## 1 1.7719325 0.63529634 0.07742793 -0.4285716 -0.9474105 -1.5414026 ## 2 2.0270091 -0.19178516 1.58123715 2.0241138 -1.7998121 0.1943211 ## 3 -0.5010914 -1.10171748 0.58945128 -0.0492363 1.0156630 0.2644225 ## 4 -1.5067426 -0.88140715 -0.12733353 -0.4603672 -0.2350367 -1.1187352 ## 5 0.2602076 1.53274473 0.26918441 -0.8528851 -0.4643425 0.6509530 ## 6 -2.4616374 -0.07481652 -2.38832183 -2.1785221 -0.5951440 -1.0329002 ``` --- name: How does Cross-Validation work? ## How does Cross-Validation work? 
* We should not include all PCs in the model: this would lead to overfitting
* Cross-Validation is a way to combat overfitting

```r
# 60% train, 10% validation and 30% test subsets
train<-df[sample(1:dim(df)[1],0.6*dim(df)[1]),]
val_test<-df[!rownames(df)%in%rownames(train),]
validate<-val_test[sample(1:dim(val_test)[1],0.25*dim(val_test)[1]),]
test<-val_test[!rownames(val_test)%in%rownames(validate),]
```

<img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---
name: How does Cross-Validation work? (Cont.)

## How does Cross-Validation work? (Cont.)

* Let us fit the linear regression model on the training set and evaluate the error on the validation data set
* The error is the root mean squared difference (RMSE) between the y predicted by the trained model for the validation set and the true y in the validation set
* There is no dramatic decrease of RMSE after PC2

<img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---
name: Ultimate Model Evaluation

## Ultimate Model Evaluation

* Thus the optimal model is y ~ x + PC1 + PC2
* Perform the final evaluation of the optimized/trained model on the test data set and report the final accuracy (adjusted R squared)
* The model explains about 95% of the variation on the unseen test data set

```r
summary(lm(predict(lm(y~x+PC1+PC2,data=train),newdata=test)~test$y))
```

```
## 
## Call:
## lm(formula = predict(lm(y ~ x + PC1 + PC2, data = train), newdata = test) ~ 
##     test$y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02409 -0.26361 -0.00061  0.19345  0.95801 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22155    0.09485   2.336   0.0269 *  
## test$y       0.93912    0.03994  23.514   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5002 on 28 degrees of freedom
## Multiple R-squared:  0.9518, Adjusted R-squared:  0.9501 
## F-statistic: 552.9 on 1 and 28 DF,  p-value: < 2.2e-16
```

---
name: Underfitting vs. Overfitting

## Underfitting vs. Overfitting

![Underfitting vs. Overfitting](Underfitting_vs_Overfitting.png)

* Akaike Information Criterion (AIC): `$$\rm AIC = 2k - 2\ln(L)$$`
* Random Forest: each individual tree is overfitted, but the ensemble of trees performs very well

---
name: Bias-Variance

## Bias-Variance Tradeoff

.pull-left-50[ ![Bias Variance](biasvariance.png) ]
.pull-right-50[ ![Bias Variance Score](Bias_Variance_Score.png) ]

`$$Y = f(X) \Longrightarrow\rm{Reality} \\ Y = \hat{f}(X) + \rm{Error}\Longrightarrow\rm{Model} \\ Error^2 = (Y - \hat{f}(X))^2 = Bias^2 + Variance$$`

* It is mathematically proven that ensemble learning keeps the Bias the same but leads to a large decrease of the Variance

---
name: KNN and SVM

## KNN and SVM

.pull-left-50[
![kNN](kNN.png)

* Count how many of the K nearest neighbors belong to each class
* Majority voting
* Non-linear
]
.pull-right-50[
![SVM](SVM.png)

* Draws a hyperplane that separates the classes
* Maximizes the margin between the classes
* Can be linear or non-linear
]

---
name: Random Forest

## What is Random Forest?
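Random Forest is an ensemble method: many decision trees are grown, each on a bootstrap sample of the data and with a random subset of the features considered at every split, and the trees then vote on the prediction. The following slides compare a Random Forest (`rf`) with other classifiers on the Pima Indians Diabetes data using `caret`; the `results` and `best_model` objects shown there come from chunks that are not displayed. A hedged sketch of how they might have been produced (the handling of missing values, the exact resampling settings and the choice of `best_model` are assumptions):

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes2)

# drop rows with missing values (how NAs were handled originally is not shown)
pima <- na.omit(PimaIndiansDiabetes2)

# hold out a test set for the final evaluation
set.seed(123)
in_train <- createDataPartition(pima$diabetes, p = 0.7, list = FALSE)
train <- pima[in_train, ]
test  <- pima[-in_train, ]

# repeated 10-fold cross-validation, optimizing the area under the ROC curve
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

set.seed(123)
models <- list(
  lda  = train(diabetes ~ ., data = train, method = "lda",       metric = "ROC", trControl = ctrl),
  cart = train(diabetes ~ ., data = train, method = "rpart",     metric = "ROC", trControl = ctrl),
  knn  = train(diabetes ~ ., data = train, method = "knn",       metric = "ROC", trControl = ctrl),
  svm  = train(diabetes ~ ., data = train, method = "svmRadial", metric = "ROC", trControl = ctrl),
  rf   = train(diabetes ~ ., data = train, method = "rf",        metric = "ROC", trControl = ctrl)
)

# resampling distributions summarized and plotted on the next slides
results <- resamples(models)

# assumption: best_model is the method with the highest mean resampled ROC
best_model <- models$lda
```

In the resampling summary on the following slides, lda happens to have the highest mean ROC, which is why `best_model` is set to the lda fit in this sketch; with different settings another method could win.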
![Decision Tree](DecisionTree.png) --- name: Classification: Pima Indians Diabetes ## Classification: Pima Indians Diabetes ```r library("mlbench") data(PimaIndiansDiabetes2) head(PimaIndiansDiabetes2,4) ``` ``` ## pregnant glucose pressure triceps insulin mass pedigree age diabetes ## 1 6 148 72 35 NA 33.6 0.627 50 pos ## 2 1 85 66 29 NA 26.6 0.351 31 neg ## 3 8 183 64 NA NA 23.3 0.672 32 pos ## 4 1 89 66 23 94 28.1 0.167 21 neg ``` ```r summary(results) ``` ``` ## ## Call: ## summary.resamples(object = results) ## ## Models: lda, cart, knn, svm, rf ## Number of resamples: 50 ## ## ROC ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## lda 0.6240602 0.8075188 0.8496241 0.8387103 0.8780739 0.9488722 0 ## cart 0.5766917 0.6664197 0.7097966 0.7213344 0.7794339 0.9195046 0 ## knn 0.6408669 0.7710305 0.8120301 0.8108961 0.8377820 0.9699248 0 ## svm 0.7293233 0.8034056 0.8337240 0.8349049 0.8735294 0.9609023 0 ## rf 0.7187970 0.7934211 0.8417293 0.8316356 0.8721805 0.9195489 0 ## ## Sens ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## lda 0.7352941 0.8285714 0.8857143 0.8758655 0.9142857 1.0000000 0 ## cart 0.5588235 0.7771008 0.8285714 0.8296303 0.8857143 0.9714286 0 ## knn 0.6285714 0.7941176 0.8260504 0.8188571 0.8571429 1.0000000 0 ## svm 0.7428571 0.8285714 0.8571429 0.8634958 0.8857143 1.0000000 0 ## rf 0.7142857 0.8000000 0.8285714 0.8417143 0.8857143 1.0000000 0 ## ## Spec ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## lda 0.3157895 0.4868421 0.5789474 0.5652632 0.6315789 0.7894737 0 ## cart 0.3157895 0.4736842 0.5263158 0.5547368 0.6710526 0.7894737 0 ## knn 0.4210526 0.5789474 0.6315789 0.6252632 0.7236842 0.8421053 0 ## svm 0.3684211 0.4736842 0.5789474 0.5631579 0.6315789 0.7894737 0 ## rf 0.3684211 0.5263158 0.6315789 0.6136842 0.7236842 0.7894737 0 ``` --- name: Compare Machine Learning Methods ## Compare Machine Learning Methods ```r dotplot(results) ``` <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto auto auto 0;" /> --- name: Feature Selection ## Feature Selection ```r feat<-varImp(best_model)$importance$pos names(feat)<-rownames(varImp(best_model)$importance) barplot(sort(feat,decreasing=T),ylab="FEATURE IMPORTANCE",col="darkred") ``` <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto auto auto 0;" /> --- name: Make Predictions ## Make Predictions: Confusion Matrix ```r library("e1071") predictions <- predict(best_model, test) confusionMatrix(predictions, test$diabetes) ``` ``` ## Confusion Matrix and Statistics ## ## Reference ## Prediction neg pos ## neg 132 32 ## pos 21 46 ## ## Accuracy : 0.7706 ## 95% CI : (0.7109, 0.8232) ## No Information Rate : 0.6623 ## P-Value [Acc > NIR] : 0.0002237 ## ## Kappa : 0.4687 ## ## Mcnemar's Test P-Value : 0.1695641 ## ## Sensitivity : 0.8627 ## Specificity : 0.5897 ## Pos Pred Value : 0.8049 ## Neg Pred Value : 0.6866 ## Prevalence : 0.6623 ## Detection Rate : 0.5714 ## Detection Prevalence : 0.7100 ## Balanced Accuracy : 0.7262 ## ## 'Positive' Class : neg ## ``` --- name: ROC Curves ## ROC Curves on Test Data Set <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" /> --- name: Why such a hype about Deep Learning? ## Why such a hype about Deep Learning? 
.center[ <img src="DLHype.png" style="width: 100%;" />] .center[ <span style="color:red">**Because Deep Learning delivers state-of-the-art results**</span>] --- name: Promises of Deep Learning for Society ## Promises of Deep Learning for Society .center[ <img src="DLPromises.png" style="width: 90%;" />] --- name: Promises of Deep Learning for Life Sciences ## Promises of Deep Learning for Life Sciences .center[ <img src="DLLifeSciences.png" style="width: 80%;" />] --- name: What is Deep Learning? ## What is Deep Learning? * Artificial Neural Networks (ANN) with multiple layers * Main advantages of Deep Learning over Machine Learning: * Feature extraction * Scalability * Universal Approximation Theorem <img src="Presentation_MachineLearning_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" /> --- name: Artificial Neural Networks (ANN) ## Artificial Neural Networks (ANN) * Mathematical algorithm/function with special architecture * Highly non-linear dues to activation functions * Backward propagation for minimizing error .center[ <img src="ANN.png" style="width: 90%;" />] --- name: Convolutional Neural Networks (CNN) ## Convolutional Neural Networks (CNN) .center[ <img src="CNN.png" style="width: 100%;" />] --- name: Computer Vision ## Computer Vision .center[ <img src="NASnet.jpg" style="width: 100%;" />] --- name: Object Detection ## Object Detection .pull-left-50[ <img src="RPN.jpeg" style="width: 70%;" />] .pull-right-50[ <img src="SelectiveSearch.png" style="width: 80%;" /> <img src="Faster-rcnn.png" style="width: 80%;" /> ] --- name: Human Protein Atlas ## Human Protein Atlas .center[ <img src="RBC_WBC.png" style="width: 70%;" />] .center[ <img src="HPA.png" style="width: 70%;" />] --- name: Image Annotation ## Image Annotation .center[ <img src="LabelImg.png" style="width: 50%;" />] .center[ <img src="tSNE_cells.png" style="width: 40%;" />] --- name: Beauty of Neural Networks ## Beauty of Neural Networks * Maximum Likelihood * Cross-Validation * Bootstrapping * Regularization * Bayesian Inference * Multivariate Feature Selection * Monte Carlo Approximation * Bagging and Boosting ![Training](Training.png) --- name: Single Cells make Big Data ## Single Cells make Big Data .center[ <img src="SingleCellBigData.png" style="width: 90%;" />] --- name: Autoencoders ## Autoencoders .center[ <img src="Autoencoder.png" style="width: 90%;" />] --- name: Autoencoders for scRNAseq ## Autoencoders for scRNAseq .center[ <img src="AutoencoderscRNAseq.png" style="width: 90%;" />] --- name: Autoencoders for Large-Scale scRNAseq ## Autoencoders for Large-Scale scRNAseq .center[ <img src="Autoencoder10X.png" style="width: 90%;" />] --- name: Frequentist Image Recognition ## Frequentist Image Recognition .center[ <img src="FreqImgRecogn.png" style="width: 90%;" />] --- name: Bayesian Image Recognition ## Bayesian Image Recognition .center[ <img src="BayesImgRecogn.png" style="width: 90%;" />] --- name: Bayesian Deep Learning for Single Cell ## Bayesian Deep Learning for Single Cell .center[ <img src="BayesianDeepLearning.png" style="width: 90%;" />] --- name: Reinforcement Learning ## Reinforcement Learning .center[ <img src="EA_RL.gif.png" style="width: 100%;" />] --- name: Towards Self-Trainable Machines ## Towards Self-Trainable Machines .pull-left-50[ .center[ <img src="ReinforcementLearning.png" style="width: 100%;" /> <img src="QLearning.jpeg" style="width: 100%;" />] ] .pull-right-50[ .center[ <img src="RL_Loss.png" style="width: 100%;" /> <img src="OpenAIGym.jpeg" 
style="width: 100%;" /> ]] --- name: Bioinformatics Learns from Google ## Bioinformatics Learns from Google .center[ <img src="DeepMind.png" style="width: 90%;" />] --- name: Total Science: Pan-Science ## Total Science: Pan-Science .pull-left-50[ .center[ Total Football <img src="TotalFootball.jpg" style="width: 100%;" />]] .pull-right-50[ .center[ Total Hockey <img src="RedArmy.jpg" style="width: 80%;" />]] .center[ <img src="Leonardo.png" style="width: 60%;" />] <!-- --------------------- Do not edit this and below --------------------- --> --- name: end-slide class: end-slide, middle count: false # Thank you. Questions? <p>R version 3.6.0 (2019-04-26)<br><p>Platform: x86_64-pc-linux-gnu (64-bit)</p><p>OS: Ubuntu 14.04.6 LTS</p><br> Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 16-jun-2019 at <i class='fa fa-clock-o' aria-hidden='true'></i> 17:03:52 __2019__ • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/)