Models
Simulation
Here, we will generate a synthetic data set by simulating data from a linear model.
\(Y = \beta_0 + \beta_1 X\) is a generating model – it describes how to generate \(Y\) from observed \(X\) and the parameters \(\beta_0, \beta_1\).
(Hint: the R function runif() can be used to simulate the \(x\)-values.)
#parameters
b0 = 0.3
b1 = 0.2
N=100
# simulate x values, rounded to 0, 1 or 2 (e.g., genotypes)
sim1 = data.frame(x=round(runif(N,min=0,max=2)))
sim1$y = b0 + b1 * sim1$x
plot(x=sim1$x, y=sim1$y)
Some possible answers
Does the plot look biologically reasonable?
Deterministic vs statistical models
Henceforth, we will only consider statistical models.
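To make the distinction concrete, here is a minimal, hedged sketch: a deterministic model always returns exactly the same \(y\) for a given \(x\), while a statistical model adds random variation around it (the noise level sd = 0.05 below is an illustrative assumption).
b0 = 0.3
b1 = 0.2
x = 1
# deterministic model: the same x always gives exactly the same y
y_det = b0 + b1 * x
y_det
# statistical model: repeated simulations for the same x scatter around b0 + b1 * x
y_stat = b0 + b1 * x + rnorm(5, mean=0, sd=0.05)
y_stat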
Task
(Hint: the R function rnorm() can be used to add Normally distributed noise to the model.)
#parameters
b0 = 0.3
b1 = 0.2
N=100
sim1 = data.frame(x=round(runif(N,min=0,max=2)))
# add Normally distributed noise with mean 0 and standard deviation 0.05
sim1$y = b0 + b1 * sim1$x + rnorm(N, mean=0, sd=0.05)
plot(x=sim1$x, y=sim1$y)
Some possible answers
Biologically plausible
Uses for simulated data
Probability of observed data
A (statistical) linear model, \(Y=\beta_0+\beta_1 X,\) generates \(Y\) given, or conditioned on, a specified value of \(X\). Since the model includes random variation, we cannot say exactly what value \(Y\) will take. Instead, we can compute or estimate the probability that \(Y\) takes a specific value, say \(Y=y\), if we know that \(X\) has a specific value, say \(X=x\) (we say that the probability is conditioned on this value of \(X\)). This is a conditional probability, \(Pr[Y=y|X=x]\), where the bar ('\(|\)') means that the probability is conditioned on \(X=x\). In fact, if \(X\neq x\), then this probability tells us nothing about \(Y\).
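To illustrate what conditioning means, the following hedged sketch (reusing the parameter values from the simulations above) compares simulated \(Y\) values for two different values of \(X\); the distribution of \(Y\) clearly depends on which value of \(X\) we condition on.
b0 = 0.3
b1 = 0.2
N = 1000
# simulate Y conditioned on X = 0 and on X = 2
y_given_x0 = b0 + b1 * 0 + rnorm(N, mean=0, sd=0.05)
y_given_x2 = b0 + b1 * 2 + rnorm(N, mean=0, sd=0.05)
summary(y_given_x0)  # centred around 0.3
summary(y_given_x2)  # centred around 0.7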
Our task here is, for the linear model \[Y=\beta_0+\beta_1 X\] with parameters \((\beta_0=0.3, \beta_1=0.2, \sigma=0.05)\), to estimate the conditional probability \[Pr[Y \leq 0.65|X=2].\]
Think About
#parameters
b0 = 0.3
b1 = 0.2
N = 1000
x = 2
# simulate N values of Y conditioned on X = 2
y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05)
# tabulate y into bins of width 0.05 (right=FALSE gives half-open bins [a, b))
h = hist(y, right=FALSE, breaks=seq(0, 1.0, 0.05), plot=FALSE)
# convert counts to proportions, i.e., estimated probabilities per bin
h$counts = h$counts/N
plot(h, xlim=c(0,1.0), ylim=c(0,1), labels=TRUE)
# compute requested probability from hist; bins 1:13 cover [0, 0.65)
paste("Pr[Y<=0.65|X=2] = ", sum(h$counts[1:13]))
Some possible answers
Shape
Mean
Makes sense?
Note that rnorm is called with mean = 0, and not 0.7 – how does this work?
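One way to see this, sketched below under the same parameter values: adding the constant \(\beta_0+\beta_1 x\) to zero-mean Normal noise simply shifts the mean of the noise, so the two formulations are equivalent.
b0 = 0.3
b1 = 0.2
x = 2
N = 1000
# noise with mean 0, shifted by the model mean b0 + b1 * x ...
y1 = b0 + b1 * x + rnorm(N, mean=0, sd=0.05)
# ... is equivalent to sampling directly around the model mean
y2 = rnorm(N, mean = b0 + b1 * x, sd=0.05)
mean(y1)  # both means are close to 0.7
mean(y2)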
What have we plotted? Can we plot \(Pr[Y\leq y|X,\theta]\) more directly?
#parameters
b0 = 0.3
b1 = 0.2
N = 1000
x = 2
y = b0 + b1 * x + rnorm(N, mean=0, sd=0.05)
h = hist(y, right=FALSE, breaks=seq(0, 1.0, 0.05), plot=FALSE)
h$counts = h$counts/N
# replace the histogram counts with the cumulative counts
h$counts = cumsum(h$counts)
# plot the empirical CDF
plot(h, labels=TRUE)
# compute requested probability from hist; bin 13 ends at 0.65
paste("Pr[Y<=0.65|X=2] = ", h$counts[13])
## [1] "Pr[Y<=0.65|X=2] = 0.151"
Extra reading: The General Normal distribution
Now compute \(Pr[Y\leq 0.65|X=2]\) directly using the cumulative distribution function of the Normal distribution, available in R as the function pnorm().
Think about
b0 = 0.3
b1 = 0.2
x = 2
y = 0.65
# the model mean for X = 2
mu = b0 + b1 * x
paste("Pr[Y<=0.65|X=2] = ", pnorm(y, mean=mu, sd=0.05))
Some possible answers
Extra Reading: Why the probability of an interval?
This is a bit tricky to explain. The key point is that for a continuous variable such as \(Y\), the probability of taking any single exact value is zero; only intervals, such as \(Y\leq 0.65\), have non-zero probability. This is why we work with \(Pr[Y\leq y|X=x]\) rather than \(Pr[Y=y|X=x]\).
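A hedged numerical illustration of this point, using the Normal model from above: the probability of a small interval around \(y\) shrinks towards zero as the interval shrinks, while an interval probability such as \(Pr[Y\leq y]\) stays a fixed, non-zero number.
mu = 0.7
s = 0.05
y = 0.65
# probability that Y falls within +/- eps of y, for shrinking eps
for (eps in c(0.05, 0.005, 0.0005)) {
  p = pnorm(y + eps, mean=mu, sd=s) - pnorm(y - eps, mean=mu, sd=s)
  print(paste("eps =", eps, ": Pr[|Y - y| <= eps] =", signif(p, 3)))
}
# in contrast, Pr[Y <= y] is a fixed, non-zero number
pnorm(y, mean=mu, sd=s)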
Statistical tests
We will now consider an extension of the previous task to several data points. As you can imagine, doing this with the simulation approach above could become cumbersome. Instead, specific statistical tests have been developed; here we will use the t-test as a very simple example.
We will use a single sample t-test, which tests how probable it is that some sampled data is drawn from a given Normal distribution; this distribution is called the NULL model of the test. The t-test is not designed to handle different genotypes, so for simplicity, we will, again, limit ourselves to the case \(X=2\).
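Before looking at the t-test itself, here is a hedged brute-force sketch of what such a test replaces (the 'observed' data below are hypothetical, simulated with a slightly shifted mean): simulate many replicate samples under the NULL model and ask how often their mean deviates from the model mean at least as much as the observed mean does.
b0 = 0.3
b1 = 0.2
x = 2
N = 100
mu = b0 + b1 * x                      # mean under the NULL model
y_obs = rnorm(N, mean=0.72, sd=0.05)  # hypothetical 'observed' data, slightly shifted
obs_dev = abs(mean(y_obs) - mu)
# simulate R replicate samples under the NULL model and record their mean deviations
R = 10000
null_dev = replicate(R, abs(mean(rnorm(N, mean=mu, sd=0.05)) - mu))
# Monte Carlo estimate of how often the NULL model produces a deviation this large
mean(null_dev >= obs_dev)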
The t-test uses normalized residuals as a test statistic:
\[\begin{eqnarray*} t &=& \sum_{i}\frac{y_i-\mu}{s\sqrt{N}} = \frac{\bar{y}-\mu}{s/\sqrt{N}} \end{eqnarray*}\] where \(s\) is the standard deviation estimated from the observed data, so that \(s/\sqrt{N}\) estimates the standard deviation (the standard error) of the mean \(\bar{y}\). As you can see, this boils down to comparing the (standardized) estimated mean from the data, \(\bar{y}\), with the model mean, \(\mu\).
These normalized residuals are approximately distributed as \(t \sim N(0,1)\), i.e., a standard Normal distribution, which is a special case of the General Normal distribution (see also 'Possible answers/Extra Reading' under 1.2.1); more exactly, \(t\) follows a t-distribution with \(N-1\) degrees of freedom, which approaches \(N(0,1)\) for large \(N\).
Conveniently, the t-test is implemented in the R function t.test().
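As a check on the formula above, here is a minimal, hedged sketch that computes the \(t\) statistic by hand for some data simulated under the NULL model and compares it with the statistic reported by t.test:
b0 = 0.3
b1 = 0.2
x = 2
N = 100
mu = b0 + b1 * x                 # NULL-model mean
y = rnorm(N, mean=mu, sd=0.05)   # data simulated under the NULL model
# t statistic computed directly from the formula above
t_manual = (mean(y) - mu) / (sd(y) / sqrt(N))
# t statistic reported by t.test
t_r = t.test(y, mu=mu)$statistic
c(manual=t_manual, t.test=unname(t_r))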
Think about
# compute the mean, mu, from our model
b0 = 0.3
b1 = 0.2
x = 2
mu = b0 + b1 * x
# Let's simulate the data from another model (note the different intercept, c0)
c0 = 0.4
c1 = 0.2
N = 100
yp = c0 + c1 * rep(x,N) + rnorm(N, mean=0, sd=0.05)
# single-sample t-test of yp against the NULL-model mean mu
t.test(yp, mu=mu)
Some possible answers
* The t-test appears quite sensitive, especially for changes of \(\gamma_1\) (c1 in the code).
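A hedged sketch of one way to explore this sensitivity: rerun the single-sample t-test for a few values of the simulation parameter c1 and inspect the p-values (the values of c1 below are illustrative choices, not part of the original exercise).
b0 = 0.3
b1 = 0.2
x = 2
N = 100
mu = b0 + b1 * x
c0 = 0.3
# with x = 2, a change in c1 shifts the simulated mean by twice that amount
for (c1 in c(0.20, 0.21, 0.22, 0.25)) {
  yp = c0 + c1 * rep(x, N) + rnorm(N, mean=0, sd=0.05)
  print(paste("c1 =", c1, ", p-value =", signif(t.test(yp, mu=mu)$p.value, 3)))
}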
The t-test can also be applied to other questions, for example testing whether two samples come from distributions with the same mean:
# data 1: simulated from the original model
b0 = 0.3
b1 = 0.2
N = 100
y = b0 + b1 * rep(x,N) + rnorm(N, mean=0, sd=0.05)
# data 2: simulated from the alternative model with a different intercept
c0 = 0.4
c1 = 0.2
N = 100
yp = c0 + c1 * rep(x,N) + rnorm(N, mean=0, sd=0.05)
# two-sample t-test: do y and yp have the same mean?
t.test(y,yp)
# re-create the simulated data sim1 from above (the statistical model, with noise)
b0 = 0.3
b1 = 0.2
N = 100
sim1 = data.frame(x=round(runif(N,min=0,max=2)))
sim1$y = b0 + b1 * sim1$x + rnorm(N, mean=0, sd=0.05)
# fit a linear model to the simulated data and inspect the parameter estimates
summary(lm(y~x, data=sim1))
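A hedged follow-up sketch: the estimated coefficients from the fit can be compared directly with the simulation parameters \(\beta_0=0.3\) and \(\beta_1=0.2\) (using the sim1 object created above).
fit = lm(y~x, data=sim1)
# point estimates of the intercept and slope; compare with b0 = 0.3 and b1 = 0.2
coef(fit)
# approximate 95% confidence intervals for the estimates
confint(fit)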