Statistics in the big data era
Tables are long and lean
Assumptions:
Frequentist inference dominates; hypothesis testing and confidence intervals
Tables are wide - e.g. scRNAseq with few observations (cells) and many variables (genes)
Problems:
Implications:
Alternatives to univariate analysis
Multivariate analysis
Machine learning
Data from (Hastie and Tibshirani 2009)
Pairwise scatter plots for exploratory data analysis
Perform multiple linear regression using a subset of variables
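As a minimal sketch of both steps in R, assuming the prostate cancer data from the ESL datasets page cited above (the file name prostate.data and the column names lcavol, lweight, age, lpsa are assumptions; the actual layout may differ):

# Sketch: exploratory scatter plots and regression on a subset of variables,
# assuming the prostate data set from the ESL datasets page.
url <- "https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data"
prostate <- read.table(url, header = TRUE)

# Pairwise scatter plots of a few predictors and the response (lpsa)
pairs(prostate[, c("lcavol", "lweight", "age", "lpsa")])

# Multiple linear regression using a subset of the variables
fit <- lm(lpsa ~ lcavol + lweight + age, data = prostate)
summary(fit)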
Output \(y\) is typically one of
Input \(x\) can be any of the above, or more complex data structures, such as images
(Murphy 2012)
Classification and regression are examples of supervised learning, where the data consist of input-output pairs; clustering and dimensionality reduction are examples of unsupervised learning, where the data consist of inputs only.
Example - classification of flowers
Goal: learn a mapping from input \(\mathbf{x}\) to output \(y \in \{1, \dots, C\}\), where \(C\) is the number of classes
Here, the scatter plots are based on features extracted from the iris data set. Setosa is easy to distinguish from the other two species.
Exploratory data analysis is useful before applying any modelling/machine learning.
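A one-liner in base R produces this kind of plot (a minimal sketch using the built-in iris data):

# Pairwise scatter plots of the four iris features, coloured by species;
# setosa separates clearly from the other two species.
pairs(iris[, 1:4], col = iris$Species, pch = 19)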
lm(y ~ x)
lm(y ~ poly(x, 2))
Linear regression minimizes the squared distance from the points to the regression line.
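As a minimal simulated sketch of the two fits above (the data-generating curve is made up purely for illustration):

# Simulate data from a gently curved relationship and fit both models by
# least squares (minimizing the squared vertical distances to the curve).
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x - 0.05 * x^2 + rnorm(100, sd = 0.5)
fit_lin  <- lm(y ~ x)            # straight line
fit_poly <- lm(y ~ poly(x, 2))   # degree-2 polynomial
plot(x, y)
lines(x, fitted(fit_lin), col = "blue")
lines(x, fitted(fit_poly), col = "red")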
Goals
Clustering is a harder problem - we don't know the answer beforehand. How do we choose the number of clusters \(K\) wisely? This is a model selection problem; see the sketch below.
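A common heuristic (assuming \(K\) here is the number of clusters, as in \(K\)-means) is to inspect how the total within-cluster sum of squares decreases with \(K\); a rough sketch on simulated data:

# Simulate two well-separated groups and compute the total within-cluster
# sum of squares for a range of K; look for an "elbow" in the curve.
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")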
Example - dimensionality reduction with PCA (data from (Murphy 2012))
Here project 3-dimensional point cloud to 2-dimensional subspace, which approximates data reasonably well. The motivation behind this technique is that although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors.
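The idea can be mimicked with simulated data (a sketch, not the data set shown in the figure): generate a 3-dimensional cloud driven by two latent factors plus noise, then project it with prcomp.

# Two latent factors generate a 3-dimensional point cloud; PCA recovers a
# 2-dimensional subspace that captures almost all of the variance.
set.seed(1)
n  <- 200
z1 <- rnorm(n); z2 <- rnorm(n)                            # latent factors
X  <- cbind(z1 + z2, z1 - z2, 2 * z1) + matrix(rnorm(3 * n, sd = 0.1), n, 3)
pc <- prcomp(X)
summary(pc)        # proportion of variance per component
plot(pc$x[, 1:2])  # the 2-dimensional projection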
Question: how do we interpret the so-called principal components?
Developed by Pearson (1901), later formalized by Hotelling (1933)
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set
(Jolliffe, Principal Component Analysis, Springer, 2002)
In other words, if we have a data table with \(n\) samples and \(p\) variables (typically \(p \gg n\) in scRNA-seq), we want to project the data to some lower dimension \(k \leq p\) while retaining as much as possible of the variation
PCA finds principal components that are a series of (orthogonal) projections of the data. The components are orthogonal and ordered in variance, so that here PC1 is aligned along the direction of highest variance, PC2 along the direction of second-highest variance subject to the condition that it is uncorrelated to PC1. PCA essentially rotates the set of points around their mean in order to align with the principal components. The principal components span a subspace and intersect at the centroid (green point).
Here, the 1-dimensional subspace is a line. The blue points are orthogonal projections of the data points onto the line. The projections minimize the squared lengths of the dashed lines, subject to the condition that the variance along the line is maximized.
An animation says more than a million words…
Incidentally, I haven't said anything about where the centroid is located - here it is at the origin. This can (and should) always be achieved by mean-centering the data prior to analysis. The R function prcomp does this behind the scenes.
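A quick check of this, using the built-in iris measurements as stand-in data:

# prcomp() centres each column by default (center = TRUE), so running it on
# the raw data or on explicitly mean-centred data gives the same scores.
X <- as.matrix(iris[, 1:4])
all.equal(prcomp(X)$x,
          prcomp(scale(X, center = TRUE, scale = FALSE))$x)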
Denote the data by the \(n \times p\) matrix \(\mathbf{X}\). There exists a factorization called singular value decomposition of the form
\[ \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T \] where \(\mathbf{U}\) and \(\mathbf{V}\) contain the left and right singular vectors (orthonormal columns) and \(\mathbf{D}\) is a diagonal matrix with the singular values \(d_1 \geq d_2 \geq \dots \geq 0\) on its diagonal.
The columns of \(\mathbf{U}\mathbf{D}\) are the score vectors of the principal components; the columns of \(\mathbf{V}\) are the loadings. The first principal component can also be written \(\mathbf{z}_1 = \mathbf{X}\mathbf{v}_1 = \mathbf{U}\mathbf{D}\mathbf{V}^T\mathbf{v}_1 = \mathbf{u}_1 d_1\), where \(\mathrm{Var}(\mathbf{z}_1) = d_1^2/n\), and so on.
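The correspondence is easy to verify in R (a sketch on built-in data; note that prcomp() divides variances by \(n-1\) rather than \(n\)):

# Scores from the SVD of the centred data matrix equal the prcomp() scores,
# up to the sign of each component.
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d)   # columns are the score vectors u_j d_j
scores_pca <- prcomp(X)$x
all.equal(abs(unname(scores_pca)), abs(scores_svd))
sv$d[1]^2 / nrow(X)                 # Var(z_1) = d_1^2 / n, as above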
Back to our PCA plot.
For a given point \(x_i\), the projection point on PC1 is given by \(u_{i1}d_1v_1\). Here, \(x_i\) has a higher PC1 score than \(x_j\).
One approach is to use the loadings, which represent the covariance of the original variables with the (unit-scaled) principal components. A biplot displays the scores together with the loadings of the original variables, showing how strongly each component loads on each original variable.
Coloring by metadata also facilitates interpretation.
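For a concrete biplot, base R's biplot() on a built-in data set (USArrests, chosen only for illustration, not the course data) shows scores and loadings together:

# Biplot: PC scores as points, loadings of the original variables as arrows.
pc <- prcomp(USArrests, scale. = TRUE)
biplot(pc)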
Since PCA aligns PCs along directions of maximum variance, it is sensitive to differences in scale, e.g. caused by different measurement units. Therefore it is standard practice to scale variables to unit variance when running PCA.
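The effect is easy to see on a built-in data set whose variables are measured in different units (mtcars, used here purely for illustration):

# Without scaling, the variable with the largest variance (displacement, in
# cubic inches) dominates PC1; scaling to unit variance evens this out.
summary(prcomp(mtcars))                  # unscaled
summary(prcomp(mtcars, scale. = TRUE))   # scaled to unit variance
round(prcomp(mtcars)$rotation[, 1], 2)   # PC1 loadings without scaling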
\(\mathbf{X}\) can be approximated by using only the first \(q\) principal components, i.e.:
\[ \mathbf{X} \approx \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^T \]
In essence, PCA can be seen as a compression algorithm in which we throw away the smallest terms (smallest variances) in the diagonal matrix and approximate \(\mathbf{X}\). Any given PC \(j\) explains a proportion \(\pi_j\) of the variance, defined as \(\pi_j = d_j^2/\sum_{i=1}^p d_i^2\).
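In R, \(\pi_j\) can be read off directly from prcomp(), since the squared sdev values are proportional to \(d_j^2\) (a minimal sketch on the iris measurements):

# Proportion of variance explained by each principal component.
pc <- prcomp(as.matrix(iris[, 1:4]))
round(pc$sdev^2 / sum(pc$sdev^2), 3)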
By introducing an error term, the PCA model can be written as
\[ \mathbf{X} = \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^T + \mathbf{E} \]
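A rank-\(q\) reconstruction and its residual can be computed directly from the SVD (a sketch, again using the centred iris measurements as stand-in data):

# Truncated SVD: keep the first q components and inspect the residual E.
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
sv <- svd(X)
q  <- 2
Xq <- sv$u[, 1:q] %*% diag(sv$d[1:q]) %*% t(sv$v[, 1:q])
E  <- X - Xq
sum(E^2) / sum(X^2)   # fraction of total variation left in the residual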
Unfortunately […] there is a lot of confusion about what PCA is really about
(Pachter 2014)
PCA is
Still, knowing what PCA is or how it is implemented is not enough: it is just as important to know how and when to use it. Think about what you are doing when you apply PCA.
In contrast to PCA, t-SNE (Maaten and Hinton 2008) is a non-linear dimensionality reduction technique
It models high-dimensional points in such a way that points close in high-dimensional space end up close in low-dimensional space
The algorithm is non-linear and adapts to the underlying data density, using different transformations for different regions
The perplexity input parameter roughly sets the number of neighbors each point is assumed to have. Typical values are between 5 and 50.
In addition, the algorithm is stochastic, meaning successive runs with identical parameters may produce different results
Too small a perplexity and local variation dominates; too large (more than the number of points) and the behaviour can be unexpected (Wattenberg, Viégas, and Johnson 2016).
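As an illustration outside the course data (a sketch assuming the Rtsne package; scater's plotTSNE used later exposes the same perplexity parameter), the effect of perplexity can be explored like this:

# t-SNE with two different perplexities on the same data; a fixed seed is
# needed for reproducibility since the algorithm is stochastic.
library(Rtsne)
X <- unique(as.matrix(iris[, 1:4]))   # Rtsne disallows duplicate rows by default
set.seed(100)
fit_small <- Rtsne(X, perplexity = 5)
set.seed(100)
fit_large <- Rtsne(X, perplexity = 40)
par(mfrow = c(1, 2))
plot(fit_small$Y, main = "perplexity = 5")
plot(fit_large$Y, main = "perplexity = 40")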
The following examples are based on a study by Zeisel et al. (2015). The data can be obtained from (Amit Zeisel and Linnarsson 2015). The analyses have been performed in R using utility functions from the scater package (“Scater,” n.d.), following the simpleSingleCell Bioconductor workflow (Aaron T. L. Lun and Marioni 2017).
The authors sequenced RNA from \(>\) 3000 single cells to “perform a molecular census of the primary somatosensory cortex (S1) and the hippocampal CA1 region”.
PCA can be used for quality control by basing it on the quality metrics (coldata) for each cell, e.g.:
or by using (filtered) expression values to get an overview of clustering patterns.
Outliers may be indicative of low-quality cells. Use quality metrics to aid interpretation of principal components.
# Data have been loaded, preprocessed, and stored in a SingleCellExperiment object `sce`
library(scater)
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
# PCA on the cell-level quality metrics (coldata), not on expression values
plotPCA(sce, pca_data_input="coldata")
Interpretation? No clear outliers or clusters.
Add metadata information (tissue and total.mRNA.mol) to facilitate interpretation.
plotPCA(sce, colour_by="tissue", size_by="total.mRNA.mol",
pca_data_input="coldata", legend="none")
The first PC seems to separate cells by the number of mRNA molecules per cell.
Size by the number of expressed genes (calculated by calculateQCMetrics).
plotPCA(sce, colour_by="tissue", size_by="total_features",
pca_data_input="coldata", legend="none")
An even clearer trend - cells with few expressed features end up in the right part of the plot.
Look at higher principal components.
plotPCA(sce, colour_by="tissue", size_by="total_features",
    pca_data_input="coldata", legend="none", ncomponents=3)
PCA done on expression values (rowdata).
plotPCA(sce, pca_data_input="rowdata", colour_by="tissue",
ncomponents=2, exprs_values="counts",
size_by=as.data.frame(rep(2, length(sce$tissue))))
Some separation by tissue, but more clusters are clearly present.
Size by library size and number of genes (calculated by calculateQCMetrics).
plotPCA(sce, colour_by="tissue", exprs_values="counts",
size_by=as.data.frame(sce$total_counts/1e6),
legend="none", ncomponents=2)
plotPCA(sce, colour_by="tissue",
exprs_values="counts",
size_by=as.data.frame(sce$total_features),
legend="none", ncomponents=2)
There are many more quality control steps to perform. Here, we have only discussed the use of PCA.
Before proceeding with data exploration (t-SNE/PCA) and clustering, we need to process and filter the data via a number of steps:
tsne1 <- plotTSNE(sce, exprs_values="norm_exprs", colour_by=top.hvg,
perplexity=10, rand_seed=100, feature_set=chosen) + fontsize
tsne2 <- plotTSNE(sce, exprs_values="norm_exprs", colour_by="Mog",
perplexity=10, rand_seed=100, feature_set=chosen) + fontsize
multiplot(tsne1, tsne2, cols=2)
Dimensionality reduction with t-SNE is applied to a subset of highly variable genes (HVGs), which have been obtained through multiple data processing and filtering steps.
pca1 <- plotPCA(sce, exprs_values="norm_exprs", colour_by=top.hvg) + fontsize
pca2 <- plotPCA(sce, exprs_values="norm_exprs", colour_by="Mog") + fontsize
multiplot(pca1, pca2, cols=2)
Here, PCA is less effective at separating cells into many different clusters.
Perform hierarchical clustering on Euclidean distances between cells and use a dynamic tree cut to define clusters. The block-like pattern is consistent with the dimensionality reduction results shown previously.
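A sketch of this step, following the general pattern of the simpleSingleCell workflow (the matrix norm_exprs below is simulated stand-in data; in the real analysis it would hold the normalized expression values for the chosen HVGs):

# Stand-in data: a small simulated genes-by-cells matrix with two cell groups.
set.seed(100)
norm_exprs <- cbind(matrix(rnorm(200 * 30), 200, 30),
                    matrix(rnorm(200 * 30, mean = 2), 200, 30))

# Hierarchical clustering of cells on Euclidean distances, followed by a
# dynamic tree cut (dynamicTreeCut package) to define clusters.
library(dynamicTreeCut)
cell.dist <- dist(t(norm_exprs))                    # cells in rows after t()
cell.tree <- hclust(cell.dist, method = "ward.D2")
clusters  <- unname(cutreeDynamic(cell.tree, distM = as.matrix(cell.dist),
                                  minClusterSize = 10, verbose = 0))
table(clusters)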
The goal: identify marker genes for each subpopulation. Many markers are not uniquely up- or down-regulated in a single cluster; rather, a combination of markers is used to characterize a subpopulation.
Dimensionality reduction techniques, with a focus on PCA (and t-SNE)
Using PCA:
t-SNE:
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2017. https://www.bioconductor.org/help/workflows/simpleSingleCell/.
Zeisel, Amit, Peter Lönnerberg, Ana B. Muñoz Manchado, and Sten Linnarsson. 2015. http://linnarssonlab.org/cortex/.
Hastie, Trevor, and Rob Tibshirani. 2009. https://web.stanford.edu/~hastie/ElemStatLearn/datasets.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of Machine Learning Research 9 (Nov): 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series. The MIT Press.
Pachter, Lior. 2014. https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/.
Jolliffe, I. T. 2002. Principal Component Analysis. Springer. http://www.springer.com/la/book/9780387954424.
“Scater.” n.d. Bioconductor. http://bioconductor.org/packages/scater/.
Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. 2016. “How to Use T-SNE Effectively.” Distill 1 (10): e2. doi:10.23915/distill.00002.
Zeisel, Amit, Ana B. Muñoz Manchado, Simone Codeluppi, Peter Lönnerberg, Gioele La Manno, Anna Juréus, Sueli Marques, et al. 2015. “Cell Types in the Mouse Cortex and Hippocampus Revealed by Single-Cell RNA-Seq.” Science, February, aaa1934. doi:10.1126/science.aaa1934.