class: center, middle, inverse, title-slide

# Mathematical Statistics in R
## RaukR 2019 • Advanced R for Bioinformatics
### Nikolay Oskolkov
### NBIS, SciLifeLab

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: Statistics

## What is Mathematical Statistics?

* Does Mathematical Statistics mean
  * a statistical test?
  * a probability distribution?
  * or maybe a p-value?
* We are not going to draw marbles or roll dice

.center[ <img src="Dice.jpg" style="width: 70%;" />]

.center[ <span style="color:red">**Classic statistics is not the only way to analyze your data**</span>]

---
name: Types of Analysis

## Different Types of Data Analysis

* The appropriate analysis depends on the amount of data we have
* What matters is the balance between the number of features and the number of observations
  + P is the number of features (genes, proteins, genetic variants etc.)
  + N is the number of observations (samples, cells, nucleotides etc.)

.center[ <img src="AmountOfData.png" style="width: 90%;" />]

---
name: Do we have Big Data in Life Sciences?

## Do we have Big Data in Life Sciences?

.center[ <img src="BigData.png" style="width: 100%;" />]

---
name: Precision Medicine: Why isn't it in the Clinics?

## Precision Medicine: Why isn't it in the Clinics?

.center[ <img src="PrecisionMedicine.png" style="width: 90%;" />]

---
name: The Curse of Dimensionality

## The Curse of Dimensionality

.pull-left-50[ <img src="DarkMagic.jpg" style="width: 90%;" />]
.pull-right-50[
`$$Y = \alpha + \beta X$$`
`$$\beta = \left(X^TX\right)^{-1}X^TY$$`
`$$\left(X^TX\right)^{-1} \sim \frac{1}{\rm{det}\left(X^TX\right)}\dots\,\rightarrow\,\infty\hbox{,}\,\,\,\,\,\,\,\,n\ll p$$`
<br/>
.center[ <span style="color:red">**The math blows up in high dimensions**</span>]
<br/>
.center[ <img src="ndim_ball.png" style="width: 110%;" />]
<br/>
`$$l_d = r\sqrt{d}\hbox{, where d is the dimensionality of the space}$$`
<br/>
.center[ <span style="color:red">**Euclidean distance fails: it becomes hard to define a meaningful metric**</span>]
]

---
name: Low Dimensional Space

## Low Dimensional Space

```r
set.seed(123)
n <- 20 # number of samples
p <- 2  # number of features / dimensions
Y <- rnorm(n)
X <- matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204, Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274
```

---
name: Going to Higher Dimensions

## Going to Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 10
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089
```
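Note how R-squared climbed from 0.02 to 0.66 on pure noise just by adding features. A minimal sketch of this effect (my addition, under the same simulation assumptions as above) tracks R-squared while growing p:

```r
# R-squared on pure-noise regression inflates as features are added,
# reaching a perfect fit at p = n - 1
set.seed(123)
n <- 20
r2 <- sapply(1:19, function(p) {
  Y <- rnorm(n); X <- matrix(rnorm(n * p), n, p)
  summary(lm(Y ~ X))$r.squared
})
plot(1:19, r2, type = "b", xlab = "number of features p", ylab = "R-squared")
```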
---
name: Even Higher Dimensions

## Even Higher Dimensions

```r
set.seed(123456); n <- 20; p <- 20
Y <- rnorm(n); X <- matrix(rnorm(n*p),n,p); summary(lm(Y~X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889         NA      NA       NA
## X1           0.66218         NA      NA       NA
## X2           0.76212         NA      NA       NA
## X3          -1.35033         NA      NA       NA
## X4          -0.57487         NA      NA       NA
## X5           0.02142         NA      NA       NA
## X6           0.40290         NA      NA       NA
## X7           0.03313         NA      NA       NA
## X8          -0.31983         NA      NA       NA
## X9          -0.92833         NA      NA       NA
## X10          0.18091         NA      NA       NA
## X11         -1.37618         NA      NA       NA
## X12          2.11438         NA      NA       NA
## X13         -1.75103         NA      NA       NA
## X14         -1.55073         NA      NA       NA
## X15          0.01112         NA      NA       NA
## X16         -0.50943         NA      NA       NA
## X17         -0.47576         NA      NA       NA
## X18          0.31793         NA      NA       NA
## X19          1.43615         NA      NA       NA
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1, Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA
```

---
name: Dimensionality Reduction

## Dimensionality Reduction

<img src="Presentation_GeneralStats_files/figure-html/PCA-1.svg" style="display: block; margin: auto;" />
<br/>
.center[ <span style="color:red">**Dimensionality Reduction is not just for visualization but for overcoming the Curse of Dimensionality**</span>]

---
name: Linear Dimensionality Reduction

## Linear Dimensionality Reduction

.center[ <img src="LinearDimReduct.png" style="width: 90%;" />]

---
name: Non-Linear Dimensionality Reduction

## Non-Linear Dimensionality Reduction

.center[ <img src="NonLinearDimReduct.png" style="width: 90%;" />]

---
name: Frequentist Statistics Failure

## Frequentist Statistics Failure

.center[ <img src="Anscombes_quartet.png" style="width: 90%;" />]
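The quartet above ships with base R, so the point is easy to verify yourself; a small sketch (my addition, not from the original deck) shows that all four panels share essentially the same means, correlations and regression slopes:

```r
# All four Anscombe panels have near-identical summary statistics,
# yet look completely different when plotted
data(anscombe)
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_y = mean(y), cor_xy = cor(x, y), slope = coef(lm(y ~ x))[2])
})
```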
---
name: Frequentist Statistics Failure (Cont.)

## Frequentist Statistics Failure (Cont.)

.center[ <img src="DataSaurus.gif.png" style="width: 90%;" />]
.center[ <img src="BoxViolinSmaller.gif.png" style="width: 90%;" />]

---
name: P-value is not good for ranking features

## P-value is not good for ranking features

.pull-left-50[ <img src="Pvalue.png" style="width: 100%;" />]
.pull-right-50[
```r
FC <- 1.02
x_mean <- 5; x_sd <- 1
N_vector <- seq(from=100,to=10000,by=100)
x1 <- rnorm(N_vector, x_mean, x_sd)
x2 <- rnorm(N_vector, x_mean*FC, x_sd)
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

---
name: Maximum Likelihood Principle

## Maximum Likelihood Principle

* We maximize the probability of observing the data `\(X_i\)`

`$$\rm{L}\,(\,\rm{X_i} \,|\, \mu,\sigma\,) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)\\ \mu = \frac{1}{N}\sum_{i=1}^N \rm{X_i}\\ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (\rm{X_i}-\mu)^2$$`

--

* Maximum Likelihood relies on many assumptions:
  * Large sample size
  * Gaussian distribution
  * Homoscedasticity
  * Uncorrelated errors
  * Convergence of covariance
* These assumptions are rarely fulfilled in the real world

---
name: Statistical Test

## Two-Groups Statistical Test

```r
set.seed(12)
X<-c(rnorm(20,mean=5,sd=2),12,15,14,16)
Y<-c(rnorm(24,mean=7,sd=2))
boxplot(X,Y,ylab="DIFFERENCE",names=c("X","Y"))
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---
name: Parametric Statistical Test

## Parametric Statistical Test Fails

```r
t.test(X,Y)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -1.084, df = 31.678, p-value = 0.2865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.8819545  0.8804344
## sample estimates:
## mean of x mean of y 
##  5.989652  6.990412
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---
name: Resampling

## Resampling

```r
observed <- median(Y)-median(X)
print(paste0("Observed difference = ",observed))
res <- vector(length=1000)
for(i in 1:1000){
  # reshuffle the pooled values and recompute the median difference
  Z <- c(X,Y)
  Y_res <- sample(Z,length(Y),FALSE)
  X_res <- sample(Z,length(X),FALSE)
  res[i] <- median(Y_res)-median(X_res)
}
hist(abs(res), breaks=100, main="Resampled", xlab="Difference")
print(paste0("p-value = ", sum(abs(res) >= abs(observed))/1000))
```

```
## [1] "Observed difference = 2.33059085402647"
## [1] "p-value = 0.001"
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---
name: ML Cannot Handle Non-Independence

## ML Cannot Handle Non-Independence

```
##            n1         n2          n3           n4         n5
## p1 -0.6760258 -1.2307634  1.66039982  0.196033326 -0.2981471
## p2 -1.5834993  0.6494188 -0.01267663 -1.064763128 -0.1792141
## p3  0.3152418 -0.5791937 -1.79593465 -0.312303710  0.2671534
## p4 -0.9359010  0.1212546 -0.36279328 -0.553364109  1.0598898
## p5 -2.0411903  0.6899356 -1.03923098  0.008958754 -0.2249498
```

* Two types of non-independence in data
  * between samples
  * between features

.center[
.pull-left-50[
### Random Effects
<img src="Random_Effects.jpg" style="width: 60%;" />]
.pull-right-50[
### Lasso
<img src="lasso.jpg" style="width: 40%;" />]]

---
name: Linear Model with Non-Independence

## Linear Model with Non-Independence

```r
library("lme4")
library("ggplot2")
ggplot(sleepstudy,aes(x=Days,y=Reaction)) + geom_point() + geom_smooth(method="lm")
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />
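Before modelling the grouping explicitly, it is worth checking how much the subjects actually differ; a quick sketch (my addition, using lme4's `lmList`, which fits a separate regression per group):

```r
# One independent OLS fit per subject: intercepts and slopes vary widely,
# which the single pooled regression above completely ignores
fits <- lmList(Reaction ~ Days | Subject, data = sleepstudy)
head(coef(fits))
```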
---
name: Fit Linear Model for Each Individual

## Fit Linear Model for Each Individual

```r
ggplot(sleepstudy, aes(x = Days, y = Reaction)) + 
  geom_smooth(method = "lm", level = 0.95) + 
  geom_point() + 
  facet_wrap( ~ Subject, nrow = 3, ncol = 6)
```

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---
name: The Case of the Missing Heritability

## Random Effects and Missing Heritability

.center[ <img src="MissingHeritability.png" style="width: 80%;" />]

---
name: Random Effects Modelling

## Random Effects Modelling

* Allow individual-level slopes and intercepts
* This is nothing other than putting Bayesian priors on the coefficients

`$$\rm{Reaction} = \alpha_i + \beta_i \rm{Days}$$`

```r
lmerfit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
```

.pull-left-50[ <img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto auto auto 0;" /> ]
.pull-right-50[ <img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto auto auto 0;" /> ]

.center[ <span style="color:red">**Shrinkage: introduce a shared variance parameter**</span>]

---
name: Linear Mixed Models (LMM)

## Linear Mixed Models (LMM)

```r
summary(lmer(Reaction ~ Days + (Days | Subject), sleepstudy))
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4633  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 611.90   24.737       
##           Days         35.08    5.923   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.824  36.843
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
```

---
name: LMM Average Fit

## LMM Average Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---
name: LMM Individual Fit

## LMM Individual Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

---
name: What is Bayesian Statistics for You?

## What is Bayesian Statistics for You?

.center[ <img src="BayesianStatistics.png" style="width: 90%;" />]

.small[
* **Handling Missing Data**
* **Handling Non-Gaussian Data**
* **Cause or Consequence**
* **Lack of Statistical Power**
* **Overfitting and Correction for Multiple Testing (FDR)**
* **Testing for Significance and P-Value**
]
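As a concrete warm-up before comparing frequentist and Bayesian linear models, here is a tiny base-R sketch (my addition, a textbook beta-binomial example, not from the original deck) of how a prior and data combine into a posterior:

```r
# Conjugate Bayesian updating: Beta(2, 2) prior, 7 successes in 10 trials,
# posterior is Beta(2 + 7, 2 + 3); the prior pulls the estimate toward 0.5
a0 <- 2; b0 <- 2; k <- 7; n <- 10
theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, a0 + k, b0 + n - k), type = "l", ylab = "density")
lines(theta, dbeta(theta, a0, b0), lty = 2)  # dashed curve: the prior
```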
---
name: Frequentist vs. Bayesian Fitting

## Frequentist vs. Bayesian Linear Model

.center[
.pull-left-50[
### Maximum Likelihood
`$$y = \alpha+\beta x$$`
<br/>
`$$L(y) \sim e^{-\frac{(y-\alpha-\beta x)^2}{2\sigma^2}}$$`
<br/>
`$$\max_{\alpha,\beta,\sigma}L(y) \Longrightarrow \hat\alpha, \hat\beta, \hat\sigma$$`
]]

--

.center[
.pull-right-50[
### Bayesian Linear Fitting
`$$y \sim \it N(\mu,\sigma) \quad\textrm{- Likelihood L(y)}$$`
<br/>
`$$\mu = \alpha + \beta x$$`
<br/>
`$$\alpha \sim \it N(\mu_\alpha,\sigma_\alpha) \quad\textrm{- Prior on} \quad\alpha \\ \beta \sim \it N(\mu_\beta,\sigma_\beta) \quad\textrm{- Prior on} \quad\beta$$`
<br/>
`$$P(\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma) \propto L(y)N(\mu_\alpha,\sigma_\alpha)N(\mu_\beta,\sigma_\beta)$$`
<br/>
`$$\max_{\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma}P(\mu_\alpha,\sigma_\alpha,\mu_\beta,\sigma_\beta,\sigma) \Longrightarrow \hat\mu_\alpha,\hat\sigma_\alpha,\hat\mu_\beta,\hat\sigma_\beta,\hat\sigma$$`
]]

---
name: Bayesian Linear Model

## Bayesian Linear Model

```r
library("brms")
options(mc.cores = parallel::detectCores())
brmfit <- brm(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(brmfit)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: Reaction ~ Days + (Days | Subject) 
##    Data: sleepstudy (Number of observations: 180) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~Subject (Number of levels: 18) 
##                     Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)          26.81      6.91    15.47    42.69       1749 1.00
## sd(Days)                6.55      1.53     4.09    10.12       1413 1.00
## cor(Intercept,Days)     0.10      0.30    -0.49     0.67        999 1.00
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   251.15      7.36   236.19   265.63       1561 1.00
## Days         10.47      1.67     7.13    13.67       1429 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma    25.94      1.59    23.08    29.34       2808 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
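A short follow-up sketch (my addition; `fixef` and `VarCorr` are standard brms accessors) for pulling the pieces out of the `brmfit` object above:

```r
# Population-level effects with 95% credible intervals
fixef(brmfit)
# Group-level (per-subject) standard deviations and correlations
VarCorr(brmfit)
```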
---
name: Bayesian Population-Level Fit

## Bayesian Population-Level Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto auto auto 0;" />

---
name: Bayesian Group-Level Fit

## Bayesian Group-Level Fit

<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto auto auto 0;" />

---
name: Feature Non-Independence: LASSO

## Feature Non-Independence: LASSO

`$$Y = \beta_1X_1+\beta_2X_2+\epsilon \\ \textrm{OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 \\ \textrm{Penalized OLS} = (Y-\beta_1X_1-\beta_2X_2)^2 + \lambda(|\beta_1|+|\beta_2|)$$`

.center[ <img src="CV_lambda.png" style="width: 70%;" />]

---
name: Causality Inference: get rid of confounders

## Causality Inference: get rid of confounders

.pull-left-50[
`$$Y \sim X + C\hbox{, where C is a confounder}$$`
<img src="IV.png" style="width: 70%;" />]
.pull-right-50[
.center[ <span style="color:red">**Mendelian Randomization**</span>
<img src="InsEndCancer.png" style="width: 70%;" />
]]

.center[ <span style="color:red">**Method of Instrumental Variables**</span>]

```r
N <- 1e3 # Number of Observations
C <- rnorm(n = N); X <- 2 * C + rnorm(n = N)
Y <- 2 * X + 2 * C + rnorm(n = N)
summary(lm(Y ~ X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4905 -0.8991  0.0072  0.9051  4.8766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.04544    0.04245    1.07    0.285    
## X            2.79987    0.01876  149.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.339 on 998 degrees of freedom
## Multiple R-squared:  0.9571, Adjusted R-squared:  0.9571 
## F-statistic: 2.228e+04 on 1 and 998 DF,  p-value: < 2.2e-16
```

---
name: When confounders are known: Controlled Regression

## Known confounders: Controlled Regression

```r
summary(lm(Y ~ X + C))
```

```
## 
## Call:
## lm(formula = Y ~ X + C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1141 -0.6955 -0.0089  0.7270  3.8172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.06693    0.03250    2.06   0.0397 *  
## X            2.00552    0.03314   60.52   <2e-16 ***
## C            1.97305    0.07419   26.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 997 degrees of freedom
## Multiple R-squared:  0.9749, Adjusted R-squared:  0.9749 
## F-statistic: 1.938e+04 on 2 and 997 DF,  p-value: < 2.2e-16
```

<br/>
.center[ <span style="color:red">**What if confounders are not known?**</span>]

---
name: Good Instrumental Variable

## Good Instrumental Variable

```r
set.seed(1)
library("AER")
N <- 1e3 # Number of Observations
weak <- rnorm(n = N)
good <- rnorm(n = N)
C <- rnorm(n = N)
X <- 0.000001 * weak + 2 * good + 2 * C + rnorm(n = N)
Y <- 2 * X + 2 * C + rnorm(n = N)
df <- cbind(Y, X, C, good, weak)
colnames(df) <- c('Y', 'X','C', 'good', 'weak')
df <- data.frame(df)
summary(ivreg(Y ~ X | good, data = df))
```

```
## 
## Call:
## ivreg(formula = Y ~ X | good, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7.78427 -1.48164  0.05357  1.37862  6.99009 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01017    0.06867   0.148    0.882    
## X            2.02609    0.03207  63.183   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.172 on 998 degrees of freedom
## Multiple R-Squared: 0.9237, Adjusted R-squared: 0.9236 
## Wald test:  3992 on 1 and 998 DF,  p-value: < 2.2e-16
```
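A standard relevance check (my addition, not from the original deck) is the first-stage regression: a usable instrument must actually predict X:

```r
# First-stage strength: 'good' explains much of X, 'weak' explains ~nothing
summary(lm(X ~ good, data = df))$r.squared
summary(lm(X ~ weak, data = df))$r.squared
```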
---
name: Weak Instrumental Variable

## Weak Instrumental Variable

```r
summary(ivreg(Y ~ X | weak, data = df))
```

```
## 
## Call:
## ivreg(formula = Y ~ X | weak, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.319 -1.157  0.027  1.218  5.711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.003767   0.056301   0.067    0.947    
## X           2.458141   0.401503   6.122 1.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.77 on 998 degrees of freedom
## Multiple R-Squared: 0.9493, Adjusted R-squared: 0.9492 
## Wald test: 37.48 on 1 and 998 DF,  p-value: 1.324e-09
```

---
name: What does ivreg do under the hood?

## What does ivreg do under the hood?

```r
summary(lm(Y~as.numeric(predict(lm(X~good)))))
```

```
## 
## Call:
## lm(formula = Y ~ as.numeric(predict(lm(X ~ good))))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8345  -4.2676  -0.0506   4.3873  20.1810 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        0.01017    0.20723   0.049    0.961    
## as.numeric(predict(lm(X ~ good)))  2.02609    0.09677  20.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.553 on 998 degrees of freedom
## Multiple R-squared:  0.3052, Adjusted R-squared:  0.3045 
## F-statistic: 438.4 on 1 and 998 DF,  p-value: < 2.2e-16
```

---
name: Lasso helps to build good Instrumental Variables

## Lasso helps to build good Instrumental Variables

.pull-left-50[ <img src="CausalXonY.png" style="width: 100%;" />]
.pull-right-50[ <img src="LassoIV.png" style="width: 100%;" />]

---
name: Simplex Space: a few words about normalization

## Simplex Space: a few words about normalization

We draw two vectors (genes) from normal distributions with different means and standard deviations

```r
set.seed(123)
gene1<-rnorm(100,mean=10,sd=1)
gene2<-rnorm(100,mean=100,sd=5)
cor.test(gene1,gene2,method="spearman")
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  gene1 and gene2
## S = 166000, p-value = 0.9693
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.00390039
```

---
name: Library Size Normalization

## Library Size Normalization

We create an expression data frame with 100 samples and 2 genes

```r
expr<-as.data.frame(t(data.frame(gene1=gene1,gene2=gene2)))
colnames(expr)<-paste0("sample",1:100)
expr[1:2,1:4]
```

```
##         sample1    sample2  sample3  sample4
## gene1  9.439524   9.769823 11.55871 10.07051
## gene2 96.447967 101.284419 98.76654 98.26229
```

We calculate the library size normalized expression data frame

```r
expr_n<-apply(expr,2,function(x) x/sum(x))
expr_n[1:2,1:4]
```

```
##          sample1    sample2   sample3  sample4
## gene1 0.08914674 0.08797343 0.1047694 0.092959
## gene2 0.91085326 0.91202657 0.8952306 0.907041
```

---
name: Spurious Correlation After Normalization

## Spurious Correlation After Normalization

.pull-left-50[
.center[ <span style="color:red">**Before Library Size Normalization**</span>]
```r
expr1 <- unlist(expr[1,]); expr2 <- unlist(expr[2,])
plot(expr1,expr2,xlab="gene1",ylab="gene2")
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-34-1.svg" style="display: block; margin: auto;" />
]
.pull-right-50[
.center[ <span style="color:red">**After Library Size Normalization**</span>]
```r
expr_n1 <- expr_n[1,]; expr_n2 <- expr_n[2,]
plot(expr_n1,expr_n2,xlab="gene1",ylab="gene2")
```
<img src="Presentation_GeneralStats_files/figure-html/unnamed-chunk-35-1.svg" style="display: block; margin: auto;" />
]

```r
cor.test(expr_n1,expr_n2,method="spearman")
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  expr_n1 and expr_n2
## S = 333300, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho 
##  -1
```
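The perfect negative correlation is a pure artifact of the simplex constraint, which is easy to verify (my addition, not from the original deck):

```r
# After normalization every sample's two values sum to one, so gene2 is
# deterministically 1 - gene1: the rank correlation must be exactly -1
all.equal(unname(colSums(expr_n)), rep(1, ncol(expr_n)))
range(expr_n[2,] - (1 - expr_n[1,]))  # zero up to floating-point error
```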
---
name: OMICs Integration

## OMICs Integration

.center[ <img src="OMICsIntegr.png" style="width: 100%;" />]

---
name: Single Cell OMICs Integration

## Single Cell OMICs Integration

.center[ <img src="SingleCellMultiOmics.png" style="width: 100%;" />]

---
name: Single Cell OMICs Integration (Cont.)

## Single Cell OMICs Integration (Cont.)

.center[ <img src="scRNAseqIntegr.png" style="width: 100%;" />]

---
name: scNMT OMICs Integration

## scNMT OMICs Integration

.center[ <img src="scNMT.png" style="width: 100%;" />]

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

<p>R version 3.6.0 (2019-04-26)</p>
<p>Platform: x86_64-pc-linux-gnu (64-bit)</p>
<p>OS: Ubuntu 14.04.6 LTS</p>
<br>
Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 16-jun-2019 at <i class='fa fa-clock-o' aria-hidden='true'></i> 09:14:31

__2019__ • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/)