4  Sampling and experimental design

4.1 Random sampling

In many (most) experiments it is not feasible (or even possible) to examine the entire population. Instead a random sample is studied.

A random sample is a random subset of individuals from a population.

There are different techniques for performing random sampling, two common techniques are simple random sampling and stratified random sampling.

A simple random sample is a random subset of individuals from a population, where every individual has the same probability of being choosen. Simple random sampling can be performed using the urn model. If every individual in the population is represented by one ball, a simple random sample of size \(n\) can be achieved by drawing \(n\) balls from the urn, without replacement.

In stratified random sampling the population is first divided into subpopulations based on important attributes, e.g. sex (male/female), age (young/middle aged/old) or BMI (underweight/normal weight/overweight/obese). Simple random sampling is then performed within each subpopulation.

4.2 Principles of experimental design

When you plan your experiment and assign experimental units to treatment/control group (or similar), it is important to consider extraneous variables, i.e. variables that are not your main interest but that might affect the studied experimental outcome or the variable of interest. These variables can be properties of the experimental units such as age, sex etc. but also variables introduced in the experiment, like batch, experiment date, laboratory personell etc.

Fundamental to experimental design are the three principles; replication, randomization and blocking.

Replication. Replication is the repetition of the same experiment, with the same conditions. Biological replicates are measurements of different biological units under the same conditions, whereas technical replicates are repeated measurements of the same biological unit under the same conditions.

Randomization. Experimental units are not identical, hence by assigning experimental units to treatment/control at random we can avoid unnecessary bias. It is also important to perform the measurements in random order.

Blocking. Blocking is grouping experimental units into blocks consisting of units that are similar to one another and assigning units within a block to treatment/control at random. Blocking reduces known but irrelevant sources of variation between units and thus allows greater precision in the estimation of the source of variation under study.

The experimental units can be grouped into blocks according to their properies (e.g. age, sex, etc). Units within a block will be more similar than between blocks. By assigning units within a block to treatment/control at random, the variation due to differences between blocks (that are not relevant to the studied outcome) can be reduced.

In many retrospective studies it is not possible to assign patients into treatment/control or sick/healthy. Experimental design is still important for controling sources of variation introduced during the experiment.

Block what you can; randomize what you cannot.

4.3 Sample properties

Summary statistics can be computed for a sample, such as the sum, proportion, mean and variance.

Notation: A random sample \(x_1,x_2,\dots,x_n\) from a distribution, \(D\), consists of \(n\) observations of the independent random variables \(X_1, X_2,\dots,X_n\) all with the probability distribution \(D\).

4.3.1 Sample proportion

The proportion of a population with a particular property, the population proportion, is denoted \(\pi\).

The number of individuals with the property in a simple random sample of size \(n\) is a random variable \(X\). The proportion of individuals in a sample with the property is also a random variable;

\[P = \frac{X}{n}\] with expected value \[E[P] = \frac{E[X]}{n} = \frac{n\pi}{n} = \pi\].

The sample proportion, \(p\), is said to be an unbiased estimate of the population proportion, as it’s expected value is the population proportion \(\pi\).

4.3.2 Sample mean

For a particular sample of size \(n\); \(x_1, \dots, x_n\), the sample mean is denoted \(m = \bar x\). The sample mean is calculated as;

\[m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i.\]

Note that the mean of \(n\) independent identically distributed random variables, \(X_i\), is itself a random variable;

\[\bar X = \frac{1}{n}\sum_{i=1}^n X_i.\]

If \(X_i\) are normaly distributed \(X_i \sim N(\mu, \sigma)\), then \(\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\).

When we only have a sample of size \(n\), the sample mean \(m\) is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean, i.e. the average (over many size \(n\) samples) of the sample mean is \(\mu\).

\[E[\bar X] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} n \mu = E[X] = \mu\]

4.3.3 Sample variance

The sample variance is computed as;

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2.\]

The sample variance is an unbiased estimate of the population variance.

4.3.4 Sampling distribution

A sampling distribution is a probability distribution of a sample property. A sampling distribution is obtained by sampling many times from the studied population.

4.3.5 Standard error

Eventhough the sample can be used to calculate unbiased estimates of the population property, the sample estimate will not be perfect. The standard deviation of the sampling distribution is called the standard error, \(SE\).

For the sample mean, \(\bar X\), the variance is

\[E[(\bar X - \mu)] = var(\bar X) = var(\frac{1}{n}\sum_i X_i) = \frac{1}{n^2} \sum_i var(X_i) = \frac{1}{n^2} n var(X) = \frac{\sigma^2}{n}\]

The standard error of the mean is thus;

\[SEM = \frac{\sigma}{\sqrt{n}}\]

Replacing \(\sigma\) with the sample standard deviation, \(s\), we get an estimate of the standard error of the mean;

\[SEM \approx \frac{s}{\sqrt{n}}\]