The following exercises highlight parts of the wide field of experimental design. Specifically, they target power estimation and the evaluation of statistical methods. The two parts overlap but can be completed independently of each other. There is also an open-ended challenge at the end for those who want to work more freely.

Ideally, there would be software packages that accurately estimate the required number of samples for your planned experiment. However, such methods are hard to generalize: for a specific experiment one always has to take the characteristics of that particular data into account and look at what the practice is within the field. If you do have examples of packages that have helped you in your work, feel free to share them with the group.


set.seed(42)

1 Power estimation

Power estimation is a key concept in experimental design. In an ideal world, researchers would have a good grasp of the variability in their data, the expected sample sizes and a clear understanding of their planned statistical analysis. Estimating the number of samples required to capture the effect would then be straightforward. For many classical statistical methods this can easily be done in R (or by hand), both via native functions and via packages such as pwr. For more complex methods, for example those that use empirical Bayesian estimators that depend on all the data in a high-dimensional dataset, these calculations are not as simple and are primarily done via simulation; more on that in the second part. For this first tutorial we will stick with the basics. For each task, try to figure out your own solution, but if needed click the code button for a solution.

First, a short reminder on the statistical terminology. For null-hypothesis testing one typically considers two kinds of errors, Type-I and Type-II, formulated as:

Type-I: Rejecting the null-hypothesis when the null-hypothesis is true.

Type-II: Failing to reject the null-hypothesis when the null-hypothesis is false.

These are also commonly referred to as false positives and false negatives. Statistical power is defined as the probability of rejecting the null-hypothesis when the null-hypothesis is false, that is, the probability of avoiding a Type-II error given that an effect exists. Estimating the power of a test requires making assumptions about what the data look like (primarily the effect size and variance) and setting a rejection level for the test (often called \(\alpha\) or significance cutoff). Thus, while we would like to “know” the power of our planned experiment, it will always be conditioned on these assumptions.
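
In symbols, \(\text{power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta\), where \(\beta\) is the probability of a Type-II error and \(\alpha\) bounds the probability of a Type-I error.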

Task 1: First consider the power for a standard two-sample t-test. Assume you are going to run an RNAseq experiment where one group of mice has received a candidate drug while the second group received a placebo. Also assume that you have decided to use a t-test to test the difference for a specific gene (a single gene, to avoid multiple testing considerations for now). You believe that the difference should be at least 2 units and you want the power to detect this difference to be 0.9. Calculate the number of samples needed, assuming a standard deviation of 1.3 and an alpha of 0.05 (hint: power.t.test()).

power.t.test(delta=2, power=0.9, sd=1.3)
#Answer, at least 10 samples per group are needed to achieve the desired power.
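
The same calculation can also be done with the pwr package mentioned above. A minimal sketch, assuming pwr is installed; note that pwr.t.test works with the standardized effect size d = delta/sd rather than delta and sd separately:

library(pwr)
pwr.t.test(d = 2/1.3, power = 0.9, sig.level = 0.05, type = "two.sample")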

Task 2: Unfortunately, while you sequenced the required number of samples, other experiments have shown that the effect size could be as small as 0.25 units. What power would your study result in?

power.t.test(delta=0.25, n=10, sd=1.3, type="two.sample", strict=TRUE)
#Answer, power is ca 0.06

Task 3: Now, to get a deeper understanding of power we will do a small simulation study. Begin by setting up the simulation necessary to generate data with the same parameters as in task 2. Note, you will need to draw two vectors of random numbers from the normal distribution, one without an effect and one with.

Apply the two-sample t-test to repeated draws of your simulated data, does the number of significant outcomes match the estimated power?

nIterations <- 10000
result <- rep(FALSE, nIterations)
for (i in 1:nIterations) {
  g1 <- rnorm(10, mean = 0, sd = 1.3)
  g2 <- rnorm(10, mean = 0.25, sd = 1.3)
  result[i] <- t.test(g1, g2)$p.value < 0.05 # run the test, extract the p-value and
                                             # check if it is smaller than 0.05
}

mean(result) 
# The proportion of significant tests is slightly higher than 0.06 due to
# rejections in the “wrong” direction. More on that in the next task; also
# check the “strict” argument of the power.t.test function.
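
To see the distinction the comment above refers to, compare power.t.test with and without the strict argument (FALSE is the default); strict = TRUE also counts rejections in the direction opposite to the true effect, which is what the two-sided simulation above measures:

power.t.test(delta=0.25, n=10, sd=1.3)               # rejections in the correct direction only
power.t.test(delta=0.25, n=10, sd=1.3, strict=TRUE)  # also includes wrong-direction rejections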

Task 4 (optional): In classical statistics one generally speaks of Type-I and Type-II errors. However, these are not the only kinds of errors possible. Now we will consider Type-M and Type-S errors. M stands for magnitude, and the error captures overestimation of the effect size. S stands for sign and captures the error of getting an effect of the wrong sign. Note that these definitions are not as widespread, but they give useful insight into the errors that are likely to occur during testing.

Now, extend the simulation in task 3 to estimate the Type-M and Type-S errors. That is, for each set of data that results in a significant outcome, save the estimated difference between the groups. Then calculate how often the estimated difference had the wrong sign, and how many times larger the estimated effect was than the true effect.

nIterations <- 10000
result <- c() # As we have an unknown number of significant results we cannot
counter <- 0  # pre-allocate the whole vector and instead need a counter

for (i in 1:nIterations) {
  g1 <- rnorm(10, mean = 0, sd = 1.3)
  g2 <- rnorm(10, mean = 0.25, sd = 1.3)
  outcome <- t.test(g1, g2)
  if (outcome$p.value < 0.05) { # run the test, extract the p-value and check
                                # that it is smaller than 0.05
    counter <- counter + 1
    result[counter] <- mean(outcome$conf.int) # quick way of getting the estimated
                                              # effect size (the CI is centered on the
                                              # difference in means); could also be done
                                              # by simply taking the difference in
                                              # means between the groups.
  }
}

# Note that Type-M and Type-S errors are not strictly defined, but are rather
# useful concepts to keep in mind.
typeM <- mean(result[result < 0])/-0.25 # The average effect for the significant
                                        # results is ca 5 times as large as the 
                                        # true effect.
typeS <- sum(result > 0)/length(result) # ca 13% of the significant tests have 
                                        # effects with the wrong sign. 

paste0("Type M error = ", typeM)
paste0("Type S error = ", typeS)

Do these results seem surprising to you? The setup for this exercise is a bit extreme; a power of 0.06 is really quite low. However, with noisy data (as most omics data are) combined with small sample sizes, it is still a result to be concerned about.

This final exercise was inspired by the work of Andrew Gelman, specifically this blog post (bonus points to anyone who figures out why his theoretical results do not match my simulation example). His blog is highly recommended for an insight into recent discussions on sound statistics.


2 Evaluating the performance of statistical tests

Choosing an appropriate statistical method is an essential part of any experiment. There are classical methods that are robust and do the job but might lack power. One could turn to non-parametric methods, but these can also lack power, especially when few samples are available. A third option is to turn to more complex methods that attempt to model the characteristics of the data and thereby increase power, but that run the risk of also increasing the number of false positives (and other error rates) when the model assumptions are flawed. This exercise will show one way of evaluating statistical methods for use on omics data.

Preparation: Load the data file data/qin_data.tsv. The file contains gene counts from 10 metagenomes generated from the human gut (originally from Qin et al. 2012). The samples are divided into two groups and I have added effects to 10% of the genes, simply by removing (“downsampling”) 9/10 of their counts (these constitute the true positives). The final column, tpIndex, indicates which of the genes have had an effect added. Load this data into R and preferably separate the tpIndex from the raw data.

# Note! The data is available through the link in the course programme
rawData <- read.table("data/qin_data.tsv", header = TRUE, row.names = 1)
tpIndex <- as.logical(rawData[,11]) # coerce to logical so it can be used for indexing
counts <- as.matrix(rawData[,1:10])

Task 5: Now, to analyse this data, let’s start with the same t-test as above. First, however, the data needs to be normalized; for this example it is sufficient to divide by the total counts of each column (hint: colSums and sweep). Apply the test to each row of the normalized data and save the resulting p-values. As the data contains about 5000 genes you will need to apply multiple testing correction, for example estimating the false discovery rate (FDR) via the Benjamini-Hochberg method (hint: p.adjust(x, method="BH")).

How many of the genes with an effect was the t-test able to detect at an FDR of 0.05?

normalizedData <- sweep(counts, 2, colSums(counts), "/")

pvalues.ttest <- c()
for (iGene in 1:nrow(normalizedData)) {
  pvalues.ttest[iGene] <- t.test(normalizedData[iGene, 1:5],
                                 normalizedData[iGene, 6:10])$p.value
}
adjp.ttest <- p.adjust(pvalues.ttest, method = "BH")
sum(adjp.ttest[tpIndex] < 0.05)

Task 6: Next, we will apply a test that can capture the count-based nature of the data: a generalized linear model (GLM) based on the Poisson distribution. This has, for example, been applied to metagenomic data in Yatsunenko et al. 2012. To run this you will need to use the glm function, define a design vector corresponding to the two groups (a vector of 0s and 1s will suffice) and set family=poisson. Note that the data should not be normalized before applying the GLM; rather, the normalization factors (column sums) should be added as an offset in the model formulation (see ?offset). Apply the Poisson GLM to each gene, retrieve the p-values and adjust them. Note that the most stable way to get p-values from the glm function is to fit the model both with and without the group factor and do a likelihood ratio test via the anova function (hint: anova(model1, model2, test="Chisq")).

How many of the genes with an effect was the Poisson GLM able to detect at an FDR of 0.05?

normFactors <- colSums(counts)
group <- c(0,0,0,0,0,1,1,1,1,1) # design vector to capture the groups

pvalues.glm <- c()
for (iGene in 1:nrow(counts)) {
  glmRes <- glm(counts[iGene,] ~ group + offset(log(normFactors)), family = poisson)
  glmResNull <- glm(counts[iGene,] ~ offset(log(normFactors)), family = poisson)
  pvalues.glm[iGene] <- anova(glmResNull, glmRes, test = "Chisq")[2,5] # p-value of the likelihood ratio test
}
adjp.glm <- p.adjust(pvalues.glm, method = "BH")
sum(adjp.glm[tpIndex] < 0.05)

Task 7: The results seem to favor the Poisson GLM in this case but the number of true positives only shows part of the picture.

How many false positives does each test have (i.e. the number of genes without an effect that turned out significant)?

sum(adjp.ttest[!tpIndex] < 0.05)
sum(adjp.glm[!tpIndex] < 0.05)

Task 8: The Poisson GLM fails completely in this example, as it is based on the very strict assumption that the variance equals the mean. This means that it cannot capture the variability in the data; the excess variability instead leads to underestimated standard errors and inflated significance, resulting in skewed p-values despite the multiple testing correction.

Visualize the distribution of p-values for the genes with no effect, for both the t-test and the Poisson GLM, using a histogram. What should the p-value distribution look like if the null-hypothesis were true? Does this hold for these tests?

hist(pvalues.ttest[!tpIndex]) # reasonably flat
hist(pvalues.glm[!tpIndex])   # completely skewed towards zero

Task 9: However, using a count-based model is still appropriate for metagenomic data. Redo tasks 6 and 7, but set the GLM family to quasipoisson. Note, this requires setting the test argument in anova to "F" instead of "Chisq".

How does this overdispersed Poisson GLM perform compared to the t-test in terms of true and false positives at an FDR cut-off of 0.05?

normFactors <- colSums(counts)
group <- c(0,0,0,0,0,1,1,1,1,1) # design vector to capture the groups

pvalues.oglm <- c()
for (iGene in 1:nrow(counts)) {
  oglmRes <- glm(counts[iGene,] ~ group + offset(log(normFactors)), family = quasipoisson)
  oglmResNull <- glm(counts[iGene,] ~ offset(log(normFactors)), family = quasipoisson)
  pvalues.oglm[iGene] <- anova(oglmResNull, oglmRes, test = "F")[2,6] # p-value of the F-test
}
adjp.oglm <- p.adjust(pvalues.oglm, method = "BH")
sum(adjp.oglm[tpIndex] < 0.05)  # true positives
sum(adjp.oglm[!tpIndex] < 0.05) # false positives

Task 10: Finally, add a test of your own choice and evaluate how it performs on the data above, in terms of both the number of true and false positives. This could for example be a modern statistical package appropriate for count data, e.g. edgeR or DESeq2, or a non-parametric test such as the Wilcoxon-Mann-Whitney test (see wilcox.test). Another option could be to transform the data using the logarithm or square root and see if this improves the performance of the t-test.

Which test would you recommend other researchers to use?
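
As a starting point, here is a minimal sketch of one such option, the Wilcoxon-Mann-Whitney test, applied in the same way as the t-test above. It assumes the normalizedData and tpIndex objects from the earlier tasks; wilcox.test will warn about ties, which is expected for this kind of data.

pvalues.wilcox <- c()
for (iGene in 1:nrow(normalizedData)) {
  pvalues.wilcox[iGene] <- wilcox.test(normalizedData[iGene, 1:5],
                                       normalizedData[iGene, 6:10])$p.value
}
adjp.wilcox <- p.adjust(pvalues.wilcox, method = "BH")
sum(adjp.wilcox[tpIndex] < 0.05)  # true positives
sum(adjp.wilcox[!tpIndex] < 0.05) # false positives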

Open-ended challenge: The data from the tasks above comes from a real data set where the group labels have been permuted to remove any real effects. Effects were then added to a random set of genes by downsampling the counts of all samples in one of the new groups. This way of generating test data for evaluating statistical methods has the benefit of retaining the variability of real data while giving control over which genes have an effect. It also gives you control over effect sizes and group sizes, which makes it a good setup for evaluating experimental designs.

Now to the challenge: using the above strategy, set up your own method for generating test data. You will need a sufficiently large dataset (20+ samples) from the method of your choice, e.g. RNAseq. If you lack data, some count tables from metagenomics can be found here. Now create a function that

  1. loads the data,
  2. randomly selects a subset of samples to create two groups,
  3. adds an effect to a subset of the genes, and
  4. outputs the resulting table.

Given this, you can for example use this new data set for the tasks above. Repeat the procedure multiple times to estimate the average performance of the methods. The data generation can also be made more complex, creating several groups, adding effects across different groups, etc., to create arbitrarily complex designs. For a slightly more detailed view of the data generation procedure, see the methods section of Jonsson et al. 2016.
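
As a hedged starting point, here is one way such a generator could be sketched. The function name makeTestData, the file path, the default group size and effect fraction, and the use of binomial thinning to downsample counts are illustrative assumptions, not the exact procedure used to create qin_data.tsv.

# Sketch of a test-data generator: permuted groups from real count data plus
# downsampled counts for a random subset of genes (the true positives).
makeTestData <- function(file, groupSize = 10, effectFraction = 0.1, keepFraction = 0.1) {
  counts <- as.matrix(read.table(file, header = TRUE, row.names = 1))
  # 1. load the data and randomly select samples for two groups
  selected <- sample(ncol(counts), 2 * groupSize)
  group1 <- counts[, selected[1:groupSize]]
  group2 <- counts[, selected[(groupSize + 1):(2 * groupSize)]]
  # 2. pick a random subset of genes to receive an effect
  tpIndex <- rep(FALSE, nrow(counts))
  tpIndex[sample(nrow(counts), round(effectFraction * nrow(counts)))] <- TRUE
  # 3. add the effect by downsampling the counts in group 2 (binomial thinning)
  group2[tpIndex, ] <- rbinom(sum(tpIndex) * groupSize,
                              size = group2[tpIndex, ],
                              prob = keepFraction)
  # 4. return the combined count table together with the true-positive index
  list(counts = cbind(group1, group2), tpIndex = tpIndex)
}

A generated data set can then be plugged into the evaluation above, and the whole procedure repeated with different random draws to average the performance estimates.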


3 References

Jonsson, V., Österlund, T., Nerman, O. and Kristiansson, E., 2016. Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics. BMC Genomics, 17(1), p.78.

Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., Liang, S., Zhang, W., Guan, Y., Shen, D. and Peng, Y., 2012. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 490(7418), p.55.

Yatsunenko, T., Rey, F.E., Manary, M.J., Trehan, I., Dominguez-Bello, M.G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R.N., Anokhin, A.P. and Heath, A.C., 2012. Human gut microbiome viewed across age and geography. Nature, 486(7402), p.222.


4 Session Info

  • This document has been created in RStudio using R Markdown and related packages.
  • For R Markdown, see http://rmarkdown.rstudio.com
  • For details about the OS, packages and versions, see detailed information below:
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_0.9.0 forcats_0.3.0    stringr_1.3.1    dplyr_0.7.5     
##  [5] purrr_0.2.5      readr_1.1.1      tidyr_0.8.1      tibble_1.4.2    
##  [9] ggplot2_2.2.1    tidyverse_1.2.1  captioner_2.2.3  bookdown_0.7    
## [13] knitr_1.20      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4  xfun_0.1          reshape2_1.4.3   
##  [4] haven_1.1.1       lattice_0.20-35   colorspace_1.3-2 
##  [7] viridisLite_0.3.0 htmltools_0.3.6   yaml_2.1.19      
## [10] rlang_0.2.1       pillar_1.2.3      foreign_0.8-70   
## [13] glue_1.2.0        modelr_0.1.2      readxl_1.1.0     
## [16] bindrcpp_0.2.2    bindr_0.1.1       plyr_1.8.4       
## [19] munsell_0.5.0     gtable_0.2.0      cellranger_1.1.0 
## [22] rvest_0.3.2       psych_1.8.4       evaluate_0.10.1  
## [25] parallel_3.4.4    broom_0.4.4       Rcpp_0.12.17     
## [28] backports_1.1.2   scales_0.5.0      jsonlite_1.5     
## [31] mnormt_1.5-5      hms_0.4.2         digest_0.6.15    
## [34] stringi_1.2.3     grid_3.4.4        rprojroot_1.3-2  
## [37] cli_1.0.0         tools_3.4.4       magrittr_1.5     
## [40] lazyeval_0.2.1    crayon_1.3.4      pkgconfig_2.0.1  
## [43] xml2_1.2.0        lubridate_1.7.4   rstudioapi_0.7   
## [46] assertthat_0.2.0  rmarkdown_1.10    httr_1.3.1       
## [49] R6_2.2.2          nlme_3.1-137      compiler_3.4.4


