Basic Statistics

# Basic Statistics | <code>Models</code>
## RaukR 2019 • Advanced R for Bioinformatics
### Bengt Sennblad
### NBIS, SciLifeLab

---

---
name: models1
## Models

* What is a model?

--
  + simplification (abstraction) of reality that helps us address a specific problem
  + Aims 
      - enhance computation/estimation
      - reflect biology
  
---

## Models | `Example: R proficiency as a function of beer`
.pull-left-50[
We want to model how beer consumption affects the ability to program well in R

* `$x=$` Number of beers drunk
* `$y=$` Proficiency to program in R
]

.pull-right-50[
<img src="presentation1_files/figure-html/beer1-1.svg" style="display: block; margin: auto auto auto 0;" />

]

This is a *linear model*, `$$y = \beta_0 + \beta_1 x$$` where `$y$` changes in proportion to `$x$`, according to its parameters `$\beta_0=1.0, \beta_1 = -0.1$`

***
--

__Notice that a model is always restricted by its assumptions:__
* In the model plot above, I have used strong assumptions about `$\beta_1$` and `$\beta_0$` being exactly `$\beta_0=1.0$` and `$\beta_1=-0.1$`. 
* Therefore, people with a different susceptibility to alcohol `$(\beta_1 \neq -0.1)$` or a different initial proficiency in R `$(\beta_0 \neq 1.0)$` are *not* correctly modeled by this model.

---
## Models | `Example: R proficiency as a function of beer`
.center[ .large[How can we solve this? ] ]
--

Several possibilities:

* Include population variation in the model (more later)

* Create a hierarchical model comprising
  + a main linear model `$$y=\beta_0 + \beta_1 x,$$` whose parameters are modeled by
  + linear submodels, including body mass `$(m)$` and *hyperparameters*: 
      - alcohol susceptibility, `$\beta_1=am + b$` 
      - initial proficiency, `$\beta_0=cm+d$`
* Both these combined

]

--
 
 
.pull-right-50[
<img src="presentation1_files/figure-html/beer2-1.svg" style="display: block; margin: auto auto auto 0;" />
]
--

---

## Models | `Example`
.pull-left-50[ 
More realistic?
<img src="presentation1_files/figure-html/beer3-1.svg" style="display: block; margin: auto auto auto 0;" />

* This is a logistic model `$y = \frac{1}{1+e^{-(\beta_0 +\beta_1x)}}$`, where there are saturation towards the extremes
* By transforming `$y$` into log-odds, we get a linear expression in the r.h.s.  `$$\log\frac{y}{1-y} = \beta_0 + \beta_1 x$$`
]

* Models can also be described as a graph ('graph models'):
 + This is a graph model showing whether it is possible to fly to RaukR from a selection of cities of the world
<img src="presentation1_files/figure-html/models1-1.svg" style="display: block; margin: auto auto auto 0;" />

]

---

# But what can models be used for?
--

Perform the lab tasks:
## Task | `Simulation`
## Task | `Probability of data`
## Task | `Statistical test`

**Note**:
* Don't get stuck on *Challenges* or *Extra reading*

---
name: Simulation1

## Task | `Simulation`

Generate 100 samples from `$Y = \beta_0 + \beta_1* X,$` with parameters `$\beta_0=0.3$` and `$\beta_1=0.2$`

**Think about**
* Does the plotted results look biologically reasonable?
    + if not: what could be the reason?

```r
#parameters
b0 = 0.3
b1 = 0.2
N=100
sim1 = data.frame(genos=round(runif(N,min=0,max=2)))

sim1$phenos = b0 + b1 * sim1$genos

plot(x=sim1$genos, y=sim1$phenos)
```
]

.pull-right-50[
<img src="presentation1_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto auto auto 0;" />
]

---
name: Simulation2

## Task | `Simulation`

### Deterministic vs statistical models
* `$Y = \beta_1* X + \beta_0$` is a *deterministic* model
    + does not model any variation
    + Common, e.g., in classical physics (velocity `$v = \frac{\Delta s}{\Delta t}$`)
* `$Y = \beta_1* X + \beta_0 +\epsilon,$` where the *residuals* `$\epsilon ~\sim N(mean=0,sd=\sigma^2)$` is a *statistical* (equiv. *stochastic*, *random*) model
    + attempts to model variation around a population mean (the *residuals*) determined by the model
    + generates a stochastic variable, `$Y$`, and is used in statistical analysis

```r
#parameters
b0 = 0.3
b1 = 0.2
N=100
sim1 = data.frame(genos=round(runif(N,min=0,max=2)))

sim1$phenos = b0 + b1 * sim1$genos + rnorm(N, mean=0, sd=0.05)

plot(x=sim1$genos, y=sim1$phenos)
```
]

.pull-right-50[
<img src="presentation1_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto auto auto 0;" />
]

---
name: simulation3
## Task | `Simulation`

### Something about the `$\sim$` notation
* Statistical notation 
    + `$x\sim U(0,1)$` means that `$x$` is a stochastic variable with a uniform distribution in the interval [0,1]
* R formulas
    + `y~x` in `lm` function is a shorthand for `$y = \beta_0 + \beta_1 x$`.

---
name: Simulation3

## Task | `Simulation`

**Think about**
* When can simulated data be useful?

**Uses for simulated data**

* *Oracle knowledge* when evaluating performance of methods, e.g., Type I and II errors 
* Estimating probabilities and probability distributions of, e.g., data and summary statistics of data (next task)

---
name: Probability1

## Task | `Probability of observed data`

For the linear model `$Y=\beta_0+\beta_1 X,$` with parameters `$(\beta_0=0.3, \beta_1=0.2, \sigma^2=0.05$`), estimate the conditional probability `$$Pr[Y <= 0.65|x=2],$$` using a

#### Simulation solution

and a

#### Analytic solution

---

name: Probability2
## Task | `Probability of observed data`
#### Simulation solution

**Think about**
* What shape does the plotted histogram have? 
* Where approximately is the mean?
* Does this make sense in light of the generative model we used?

```r
#parameters
b0 = 0.3
b1 = 0.2
N = 1000
x = 2
y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05)

h=hist(y, frequency=TRUE, right=FALSE, 
       breaks=seq(0,1.0,0.05), plot=FALSE)
h$counts = h$counts/N
plot(h, xlim=c(0,1.0), ylim =c(0,0.6), labels=TRUE)

# compute requested probability from hist
paste("Pr[Y<=0.65|X=2] = ", sum(h$counts[1:13]))
```
]

```
## [1] "Pr[Y<=0.65|X=2] = 0.183"
```

<img src="presentation1_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" />
]

---
name: Probaility3
## Task | `Probability of observed data`
### Adding mean and multiplying sd

`$Y \sim N(mean=0,sd=1) \Leftrightarrow Y \sim \mu+\sigma * N(mean = 0, sd=1)$`
<img src="presentation1_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto auto auto 0;" />
--

* `$N(mean=0, sd=1)$` is called the * General Normal distribution*

???
`\begin{eqnarray*}
Y &\sim& \mu + N(mean=0,sd=1)\\
Y&\sim & N(mean = \mu, sd=\sigma)\\
Y&\sim & \sigma * N(mean = \mu, sd=1)
\end{eqnarray*}`

---
name: Probability4

## Task | `Probability of observed data`
**Think about**
* What have we plotted; can we plot `$Pr[Y<=y|X,\theta]$` more directly?

We estimated the interval probabilities of the model `$Y\sim N(mean=\beta_0+\beta_1 X, sd=\sigma)$`...

<img src="presentation1_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto auto auto 0;" />
]

.pull-right-50[
...but we could actually have plotted the *cumulative probability function* (CDF) to directly estimate `$Pr[Y<=0.65|X=x, \theta]$`

<img src="presentation1_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto auto auto 0;" />
]

---
name: Probability5

## Task | `Probability of observed data`

### Analytic solution

Use R's function `pnorm` for calculate the CDF

* Do the result fit that from the simulation?
 + If not, how can we improve the fit?
* What can this result be used for?
]
--
 
.pull-right[

```r
b0 = 0.3
b1 = 0.2
x = 2
y = 0.65
mu = b0 + b1 * x
paste("Pr[Y<=.6|X=2] = ",
 pnorm(y, mean = mu, sd=0.05))
```

```
## [1] "Pr[Y<=.6|X=2] = 0.158655253931457"
```
]
--
.pull-left-100[
**p-values**
* Probability fo finding a more extreme value or test statistic under the NULL model.
 + p-values can refer to different meanings of *extreme*. Using our model as an NULL model:
 - *left-tailed* p-value
 - *right-tailed* p-value
 - *double-tailed* p-value
 + If the p-value is *significant*, i.e., below some threshold `$\alpha$` (typically `$\alpha = 0.05$`) we can reject the NULL hypothesis that the data is generated from our model.
]
---
name: Probability6

## Task | `Probability of observed data`

** *Challenge*: Why look at intervals of `$Y$` rather than specific values?**

* **For continuous `$Y$`, probabilities only exist for intervals; for any specific `$y$`, `$Pr[Y=y] = 0$`**
--

* But it's handy to calculate probabilities for specific values?
--

* Let's look at the average probability over the interval of the histogram interval including 0.65 
<img src="presentation1_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto auto auto 0;" />
--

* Asympotically approaches a limit, which is defined as the *Probability density function* (PDF);  PDF for 0.65 = 4.83941
--

* PDF is the derivative of CDF
* PDF is not a proper probability!

---

### Student's t-test for `$y\sim N(mean=\mu,sd)$`
* The t-test uses normalized residuals as a test statistics: 
`\begin{eqnarray*}
t  
&= \sum_{i}\frac{y_i-\mu}{s/\sqrt{N}} 
&= \frac{\bar{y}-\mu}{s/\sqrt{N}}
\end{eqnarray*}`
where `$s/\sqrt{N}$` is an estimate of the standard deviation, `$\sigma$`, from the observed data. 
* Normalized residuals have a *General Normal distribution*, `$N(mean=0,sd=1)$`

**Think about**

* Does the t-test reject the NULL model or not?

```
## [1] "generate data with c0 = 0.1 , c1 = 0.5"
## 
## 	One Sample t-test
## 
## data: y
## t = 79.213, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.7
## 95 percent confidence interval:
## 1.072927 1.092090
## sample estimates:
## mean of x 
## 1.082509
```

---
name: tests
## Task | `Statistical tests`

* Test if two samples are generated by the same (unknown) model using the standardized difference in means, typically, using the NULL hypothesis that this difference is 0. Example code below

```r
# data 1
b0 = 0.3
b1 = 0.2
x = rep(x,N)
N=100
y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05)

#data 2
c0 = 0.4
c1 = 0.2
N=100
yp = c0 + c1 *  + rnorm(N, mean=0, sd=0.05)

t.test(y,yp)
```

```
## 
## 	Welch Two Sample t-test
## 
## data: y and yp
## t = 61.492, df = 107.92, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2963383 0.3160794
## sample estimates:
## mean of x mean of y 
## 0.7048302 0.3986213
```
]
.pull-right-50[
<img src="presentation1_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto auto auto 0;" />
]
---

* In linear models, e.g., in R's `lm`, a *t*-test is used to test if the estimated parameters are significantly different from 0.

```r
summary(lm(phenos~genos, data=sim1))
```

```
## 
## Call:
## lm(formula = phenos ~ genos, data = sim1)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -0.099606 -0.032441 -0.004768 0.028748 0.126614 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 0.296934 0.007460 39.81 <2e-16 ***
## genos 0.197828 0.005953 33.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04572 on 98 degrees of freedom
## Multiple R-squared: 0.9185,	Adjusted R-squared: 0.9177 
## F-statistic: 1104 on 1 and 98 DF, p-value: < 2.2e-16
```

---
name: nonparam
.pull-left-50[
## Types of models

* Deterministic models
    + structure
    + parameter values
* Statistical models
    + Additional *random distribution*
    + Parameteric models
        - structure fixed *a priori*
            * Example: Normal family of distribution (incl. linear model)
    + Non-parametric distribution
        - estimates structure from data
            * Example: histograms, Kernel density estimation
]
--

* Parametric tests
    + assumes statistical distribution *a priori*
        - Student's t-test
        - ANOVA
* Non-parametric tests
    + no assumption of statistical distribution of data
    + typically uses ranking of values
        - Example: Mann-Whitney U tests
        - Kruskal-Wallis test
]

---
name: report

## Session

* This presentation was created in RStudio using [`remarkjs`](https://github.com/gnab/remark) framework through R package [`xaringan`](https://github.com/yihui/xaringan).
* For R Markdown, see <http://rmarkdown.rstudio.com>
* For R Markdown presentations, see <https://rmarkdown.rstudio.com/lesson-11.html>

```r
R.version
```

```
##                _                           
## platform       x86_64-apple-darwin15.6.0   
## arch           x86_64                      
## os             darwin15.6.0                
## system         x86_64, darwin15.6.0        
## status                                     
## major          3                           
## minor          5.3                         
## year           2019                        
## month          03                          
## day            11                          
## svn rev        76217                       
## language       R                           
## version.string R version 3.5.3 (2019-03-11)
## nickname       Great Truth
```

---
name: end-slide
class: end-slide, middle
count: false

# Thank you. Questions?

R version 3.5.3 (2019-03-11) Platform: x86_64-apple-darwin15.6.0 (64-bit)OS: macOS High Sierra 10.13.6

Built on : 17-Jun-2019 at 08:51:16

__2019__ • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/)