class: center, middle, inverse, title-slide # Basic Statistics |
Models
## RaukR 2019 • Advanced R for Bioinformatics ###
Bengt Sennblad
### NBIS, SciLifeLab --- exclude: true count: false <link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet"> <!-- ----------------- Only edit title & author above this ----------------- --> --- name: models1 ## Models * What is a model? -- + simplification (abstraction) of reality that helps us address a specific problem + Aims - enhance computation/estimation - reflect biology --- ## Models | `Example: R proficiency as a function of beer` .pull-left-50[ We want to model how beer consumption affects the ability to program well in R * `\(x=\)` Number of beers drunk * `\(y=\)` Proficiency to program in R ] -- .pull-right-50[ <img src="presentation1_files/figure-html/beer1-1.svg" style="display: block; margin: auto auto auto 0;" /> ] -- This is a *linear model*, `$$y = \beta_0 + \beta_1 x$$` where `\(y\)` changes in proportion to `\(x\)`, according to its parameters `\(\beta_0=1.0, \beta_1 = -0.1\)` *** -- __Notice that a model is always restricted by its assumptions:__ * In the model plot above, I have used strong assumptions about `\(\beta_1\)` and `\(\beta_0\)` being exactly `\(\beta_0=1.0\)` and `\(\beta_1=-0.1\)`. * Therefore, people with a different susceptibility to alcohol `\((\beta_1 \neq -0.1)\)` or a different initial proficiency in R `\((\beta_0 \neq 1.0)\)` are *not* correctly modeled by this model. --- ## Models | `Example: R proficiency as a function of beer` .center[ .large[How can we solve this? ] ] -- Several possibilities: .pull-left-50[ * Include population variation in the model (more later) * Create a hierarchical model comprising + a main linear model `$$y=\beta_0 + \beta_1 x,$$` whose parameters are modeled by + linear submodels, including body mass `\((m)\)` and *hyperparameters*: - alcohol susceptibility, `\(\beta_1=am + b\)` - initial proficiency, `\(\beta_0=cm+d\)` * Both these combined ] -- <br> <br> .pull-right-50[ <img src="presentation1_files/figure-html/beer2-1.svg" style="display: block; margin: auto auto auto 0;" /> ] -- .center[Still not quite right...] --- ## Models | `Example` .pull-left-50[ More realistic? <img src="presentation1_files/figure-html/beer3-1.svg" style="display: block; margin: auto auto auto 0;" /> * This is a logistic model `\(y = \frac{1}{1+e^{-(\beta_0 +\beta_1x)}}\)`, where there are saturation towards the extremes * By transforming `\(y\)` into log-odds, we get a linear expression in the r.h.s. `$$\log\frac{y}{1-y} = \beta_0 + \beta_1 x$$` ] -- .pull-right-50[ * Models can also be described as a graph ('graph models'): + This is a graph model showing whether it is possible to fly to RaukR from a selection of cities of the world <img src="presentation1_files/figure-html/models1-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- # But what can models be used for? -- Perform the lab tasks: ## Task | `Simulation` ## Task | `Probability of data` ## Task | `Statistical test` **Note**: * Don't get stuck on *Challenges* or *Extra reading* --- name: Simulation1 ## Task | `Simulation` Generate 100 samples from `\(Y = \beta_0 + \beta_1* X,\)` with parameters `\(\beta_0=0.3\)` and `\(\beta_1=0.2\)` **Think about** * Does the plotted results look biologically reasonable? + if not: what could be the reason? -- .pull-left-50[ ```r #parameters b0 = 0.3 b1 = 0.2 N=100 sim1 = data.frame(genos=round(runif(N,min=0,max=2))) sim1$phenos = b0 + b1 * sim1$genos plot(x=sim1$genos, y=sim1$phenos) ``` ] .pull-right-50[ <img src="presentation1_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- name: Simulation2 ## Task | `Simulation` ### Deterministic vs statistical models * `\(Y = \beta_1* X + \beta_0\)` is a *deterministic* model + does not model any variation + Common, e.g., in classical physics (velocity `\(v = \frac{\Delta s}{\Delta t}\)`) * `\(Y = \beta_1* X + \beta_0 +\epsilon,\)` where the *residuals* `\(\epsilon ~\sim N(mean=0,sd=\sigma^2)\)` is a *statistical* (equiv. *stochastic*, *random*) model + attempts to model variation around a population mean (the *residuals*) determined by the model + generates a stochastic variable, `\(Y\)`, and is used in statistical analysis -- .pull-left-50[ ```r #parameters b0 = 0.3 b1 = 0.2 N=100 sim1 = data.frame(genos=round(runif(N,min=0,max=2))) sim1$phenos = b0 + b1 * sim1$genos + rnorm(N, mean=0, sd=0.05) plot(x=sim1$genos, y=sim1$phenos) ``` ] .pull-right-50[ <img src="presentation1_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- name: simulation3 ## Task | `Simulation` ### Something about the `\(\sim\)` notation * Statistical notation + `\(x\sim U(0,1)\)` means that `\(x\)` is a stochastic variable with a uniform distribution in the interval [0,1] * R formulas + `y~x` in `lm` function is a shorthand for `\(y = \beta_0 + \beta_1 x\)`. --- name: Simulation3 ## Task | `Simulation` **Think about** * When can simulated data be useful? -- **Uses for simulated data** * *Oracle knowledge* when evaluating performance of methods, e.g., Type I and II errors * Estimating probabilities and probability distributions of, e.g., data and summary statistics of data (next task) --- name: Probability1 ## Task | `Probability of observed data` For the linear model `\(Y=\beta_0+\beta_1 X,\)` with parameters `\((\beta_0=0.3, \beta_1=0.2, \sigma^2=0.05\)`), estimate the conditional probability `$$Pr[Y <= 0.65|x=2],$$` using a #### Simulation solution and a #### Analytic solution --- name: Probability2 ## Task | `Probability of observed data` #### Simulation solution **Think about** * What shape does the plotted histogram have? * Where approximately is the mean? * Does this make sense in light of the generative model we used? .pull-left-50[ ```r #parameters b0 = 0.3 b1 = 0.2 N = 1000 x = 2 y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05) h=hist(y, frequency=TRUE, right=FALSE, breaks=seq(0,1.0,0.05), plot=FALSE) h$counts = h$counts/N plot(h, xlim=c(0,1.0), ylim =c(0,0.6), labels=TRUE) # compute requested probability from hist paste("Pr[Y<=0.65|X=2] = ", sum(h$counts[1:13])) ``` ] .pull-right-50[ ``` ## [1] "Pr[Y<=0.65|X=2] = 0.183" ``` <img src="presentation1_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- name: Probaility3 ## Task | `Probability of observed data` ### Adding mean and multiplying sd `\(Y \sim N(mean=0,sd=1) \Leftrightarrow Y \sim \mu+\sigma * N(mean = 0, sd=1)\)` <img src="presentation1_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto auto auto 0;" /> -- * `\(N(mean=0, sd=1)\)` is called the * General Normal distribution* ??? `\begin{eqnarray*} Y &\sim& \mu + N(mean=0,sd=1)\\ Y&\sim & N(mean = \mu, sd=\sigma)\\ Y&\sim & \sigma * N(mean = \mu, sd=1) \end{eqnarray*}` --- name: Probability4 ## Task | `Probability of observed data` **Think about** * What have we plotted; can we plot `\(Pr[Y<=y|X,\theta]\)` more directly? .pull-left-50[ We estimated the interval probabilities of the model `\(Y\sim N(mean=\beta_0+\beta_1 X, sd=\sigma)\)`... <img src="presentation1_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto auto auto 0;" /> ] -- .pull-right-50[ ...but we could actually have plotted the *cumulative probability function* (CDF) to directly estimate `\(Pr[Y<=0.65|X=x, \theta]\)` <img src="presentation1_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- name: Probability5 ## Task | `Probability of observed data` ### Analytic solution .pull-left-50[ **Think about** Use R's function `pnorm` for calculate the CDF * Do the result fit that from the simulation? + If not, how can we improve the fit? * What can this result be used for? ] -- <br> .pull-right[ ```r b0 = 0.3 b1 = 0.2 x = 2 y = 0.65 mu = b0 + b1 * x paste("Pr[Y<=.6|X=2] = ", pnorm(y, mean = mu, sd=0.05)) ``` ``` ## [1] "Pr[Y<=.6|X=2] = 0.158655253931457" ``` ] -- .pull-left-100[ **p-values** * Probability fo finding a more extreme value or test statistic under the NULL model. + p-values can refer to different meanings of *extreme*. Using our model as an NULL model: - *left-tailed* p-value - *right-tailed* p-value - *double-tailed* p-value + If the p-value is *significant*, i.e., below some threshold `\(\alpha\)` (typically `\(\alpha = 0.05\)`) we can reject the NULL hypothesis that the data is generated from our model. ] --- name: Probability6 ## Task | `Probability of observed data` ** *Challenge*: Why look at intervals of `\(Y\)` rather than specific values?** -- * **For continuous `\(Y\)`, probabilities only exist for intervals; for any specific `\(y\)`, `\(Pr[Y=y] = 0\)`** -- * But it's handy to calculate probabilities for specific values? -- * Let's look at the average probability over the interval of the histogram interval including 0.65 <img src="presentation1_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto auto auto 0;" /> -- * Asympotically approaches a limit, which is defined as the *Probability density function* (PDF); PDF for 0.65 = 4.83941 -- * PDF is the derivative of CDF * PDF is not a proper probability! --- name: tests ## Task | `Statistical tests` ### Student's t-test for `\(y\sim N(mean=\mu,sd)\)` * The t-test uses normalized residuals as a test statistics: `\begin{eqnarray*} t &= \sum_{i}\frac{y_i-\mu}{s/\sqrt{N}} &= \frac{\bar{y}-\mu}{s/\sqrt{N}} \end{eqnarray*}` where `\(s/\sqrt{N}\)` is an estimate of the standard deviation, `\(\sigma\)`, from the observed data. * Normalized residuals have a *General Normal distribution*, `\(N(mean=0,sd=1)\)` **Think about** * Does the t-test reject the NULL model or not? ``` ## [1] "generate data with c0 = 0.1 , c1 = 0.5" ## ## One Sample t-test ## ## data: y ## t = 79.213, df = 99, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 0.7 ## 95 percent confidence interval: ## 1.072927 1.092090 ## sample estimates: ## mean of x ## 1.082509 ``` --- name: tests ## Task | `Statistical tests` * Test if two samples are generated by the same (unknown) model using the standardized difference in means, typically, using the NULL hypothesis that this difference is 0. Example code below .pull-left-50[ ```r # data 1 b0 = 0.3 b1 = 0.2 x = rep(x,N) N=100 y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05) #data 2 c0 = 0.4 c1 = 0.2 N=100 yp = c0 + c1 * + rnorm(N, mean=0, sd=0.05) t.test(y,yp) ``` ``` ## ## Welch Two Sample t-test ## ## data: y and yp ## t = 61.492, df = 107.92, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.2963383 0.3160794 ## sample estimates: ## mean of x mean of y ## 0.7048302 0.3986213 ``` ] .pull-right-50[ <img src="presentation1_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto auto auto 0;" /> ] --- name: tests ## Task | `Statistical tests` * In linear models, e.g., in R's `lm`, a *t*-test is used to test if the estimated parameters are significantly different from 0. ```r summary(lm(phenos~genos, data=sim1)) ``` ``` ## ## Call: ## lm(formula = phenos ~ genos, data = sim1) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.099606 -0.032441 -0.004768 0.028748 0.126614 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.296934 0.007460 39.81 <2e-16 *** ## genos 0.197828 0.005953 33.23 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.04572 on 98 degrees of freedom ## Multiple R-squared: 0.9185, Adjusted R-squared: 0.9177 ## F-statistic: 1104 on 1 and 98 DF, p-value: < 2.2e-16 ``` --- name: nonparam .pull-left-50[ ## Types of models * Deterministic models + structure + parameter values * Statistical models + Additional *random distribution* + Parameteric models - structure fixed *a priori* * Example: Normal family of distribution (incl. linear model) + Non-parametric distribution - estimates structure from data * Example: histograms, Kernel density estimation ] -- .pull-right-50[ ## Types of statistical tests * Parametric tests + assumes statistical distribution *a priori* - Student's t-test - ANOVA * Non-parametric tests + no assumption of statistical distribution of data + typically uses ranking of values - Example: Mann-Whitney U tests - Kruskal-Wallis test ] --- name: report ## Session * This presentation was created in RStudio using [`remarkjs`](https://github.com/gnab/remark) framework through R package [`xaringan`](https://github.com/yihui/xaringan). * For R Markdown, see <http://rmarkdown.rstudio.com> * For R Markdown presentations, see <https://rmarkdown.rstudio.com/lesson-11.html> ```r R.version ``` ``` ## _ ## platform x86_64-apple-darwin15.6.0 ## arch x86_64 ## os darwin15.6.0 ## system x86_64, darwin15.6.0 ## status ## major 3 ## minor 5.3 ## year 2019 ## month 03 ## day 11 ## svn rev 76217 ## language R ## version.string R version 3.5.3 (2019-03-11) ## nickname Great Truth ``` <!-- --------------------- Do not edit this and below --------------------- --> --- name: end-slide class: end-slide, middle count: false # Thank you. Questions? <p>R version 3.5.3 (2019-03-11)<br><p>Platform: x86_64-apple-darwin15.6.0 (64-bit)</p><p>OS: macOS High Sierra 10.13.6</p><br> Built on : <i class='fa fa-calendar' aria-hidden='true'></i> 17-Jun-2019 at <i class='fa fa-clock-o' aria-hidden='true'></i> 08:51:16 __2019__ • [SciLifeLab](https://www.scilifelab.se/) • [NBIS](https://nbis.se/)