13  Background

Studies in biology have become data-intensive as high-throughput omics experiments produce data sets of massive volume and complexity. One example is shown in the figure below: a part of the gene expression data for 6830 genes in 64 cell lines, obtained from https://web.stanford.edu/~hastie/ElemStatLearn/datasets/nci.data.csv. Many of the illustrations in this chapter are generated from this data set.

Figure 13.1: A part of gene expression data for 6830 genes for 64 cell lines

The sheer size of such data makes it difficult to get an idea of its contents. Dimensionality reduction is frequently used to address this challenge. It reduces the many dimensions of a data set to a manageable number and gives us a simplified overview of the data. One popular method for dimensionality reduction is principal component analysis (PCA).

PCA focuses on the dispersion, or variance, of the data. Let’s take one of the variables and see how its values are dispersed around the mean.

The red numbers show the index of each sample. The red line shows the mean, and the dashed lines show the distance between each observation and the mean.

The values of an individual variable are dispersed as above. Interestingly, multiple variables are often dispersed in a similar pattern.

The red numbers show the index of each sample. The red line shows the mean, and the dashed lines show the distance between each observation and the mean.

In the scatter plot of those two variables, the correlation is more obvious.

Because both variables carry almost the same information about the dispersion across objects (or observations), one of them is largely redundant, and we can attempt to reduce the number of variables, i.e. the dimension.
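The redundancy between two variables can be checked numerically. The following minimal Python sketch uses two short made-up variables (hypothetical values, not taken from the gene expression data; the names `gene_a` and `gene_b` are illustrative only) and computes each sample variance and the Pearson correlation.

```python
import math

# Hypothetical toy values for two variables measured on six samples
# (illustration only, not taken from the gene expression data).
gene_a = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
gene_b = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]


def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


# The variance of a variable is its covariance with itself.
var_a = sample_covariance(gene_a, gene_a)
var_b = sample_covariance(gene_b, gene_b)
r = pearson_correlation(gene_a, gene_b)
print(f"variance A = {var_a:.3f}, variance B = {var_b:.3f}, correlation = {r:.3f}")
```

A correlation close to 1 or -1 indicates that the two variables are dispersed in nearly the same pattern across the samples.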

14 Two-dimensional data

PCA finds a new variable that explains most of the dispersion, or variance, of the two variables of interest. The remaining, smaller portion of the variance is captured by a second new variable that is orthogonal to the first. Graphically, this is achieved by rotating the data onto new axes; mathematically, the new variables are linear combinations of the original variables.

Let \(x_{ij}\) denote the value of variable \(j\) for observation \(i\), and let \(y_{i1}\) be a linear combination of \(x_{i1}\) and \(x_{i2}\).

\[y_{i1} = a_{11} x_{i1} + a_{21} x_{i2} = \sum_{j=1}^{2} a_{j1} x_{ij} = \sum_{j=1}^{2} x_{ij} a_{j1}\]

By adjusting \(a_{11}\) and \(a_{21}\), subject to the constraint \(a_{11}^2 + a_{21}^2 = 1\) (without it the variance could be made arbitrarily large simply by scaling the coefficients), let \(\mathbf{y}_1^T = [ y_{11}, y_{21}, \dots, y_{n1} ]\) explain the largest possible amount of variance in the data. Then \(\mathbf{y}_1\) is called the first principal component (PC). In matrix form for all \(n\) observations, \[\mathbf{y}_1 = \mathbf{X} \mathbf{a}_1\] where \[ \underset{n\times 1}{\mathbf{y}_1} = \left[ {\begin{array}{c} y_{11} \\ y_{21} \\ \vdots \\ y_{n1} \end{array}} \right] , \quad \underset{n\times 2}{\mathbf{X}} = \left[ {\begin{array}{cc} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \\ \end{array}} \right] \quad \text{and} \quad \underset{2\times 1}{\mathbf{a}_1} = \left[ {\begin{array}{c} a_{11} \\ a_{21} \\ \end{array}} \right] \]

The values of the 1st PC, often called PC1 scores, are highlighted in color in the scatter plot of the two original variables.

The scores are presented in the coordinate system of the 1st PC (PC1). As shown below, it is graphically a rotation of the plot above.

If \(\mathbf{S}\) is the sample covariance matrix of \(\mathbf{X}\) with eigenvalue-eigenvector pairs (\(\hat{\lambda}_1\), \(\hat{\mathbf{e}}_1\)) and (\(\hat{\lambda}_2\), \(\hat{\mathbf{e}}_2\)), where \(\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge 0\) and the eigenvectors have unit length, it can be shown that the variance of \(\mathbf{y}_1\) is maximized when \[\mathbf{a}_1 = \hat{\mathbf{e}}_1\] Hence, the first principal component is \[\mathbf{y}_1 = \mathbf{X} \hat{\mathbf{e}}_1\] and its sample variance is \(\hat{\lambda}_1\). As the total sample variance is \(\sum_{k} \hat{\lambda}_k\), the proportion of variance explained by the first principal component is \[\frac{\hat{\lambda}_1}{\sum_{k} \hat{\lambda}_k}\]
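For two variables, the eigenvalues and eigenvectors of \(\mathbf{S}\) can be written in closed form, so the whole calculation fits in a few lines. The sketch below is only an illustration under stated assumptions: the data are made-up values (not from the gene expression data), and the variables are centered at their means before the scores are computed, which shifts the scores but does not change their variance.

```python
import math

# Hypothetical toy data: n = 5 observations of two variables
# (illustration only, not taken from the gene expression data).
X = [(2.0, 1.0), (3.0, 3.0), (5.0, 4.0), (7.0, 6.0), (8.0, 6.0)]
n = len(X)

# Center each variable at its mean (an assumption of this sketch).
m1 = sum(row[0] for row in X) / n
m2 = sum(row[1] for row in X) / n
Xc = [(row[0] - m1, row[1] - m2) for row in X]

# Sample covariance matrix S = [[s11, s12], [s12, s22]].
s11 = sum(a * a for a, _ in Xc) / (n - 1)
s22 = sum(b * b for _, b in Xc) / (n - 1)
s12 = sum(a * b for a, b in Xc) / (n - 1)

# Eigenvalues of a symmetric 2 x 2 matrix in closed form.
half_trace = (s11 + s22) / 2
radius = math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# Unit-length eigenvector for lam1 (this formula requires s12 != 0,
# which holds for the toy data above).
norm = math.hypot(s12, lam1 - s11)
e1 = (s12 / norm, (lam1 - s11) / norm)

# PC1 scores y_1 = X e_1, computed on the centered data.
pc1 = [a * e1[0] + b * e1[1] for a, b in Xc]

# The sample variance of the scores should equal lam1.
var_pc1 = sum(y * y for y in pc1) / (n - 1)
explained = lam1 / (lam1 + lam2)
print(f"lambda1 = {lam1:.3f}, var(PC1) = {var_pc1:.3f}, explained = {explained:.3f}")
```

The printed variance of the PC1 scores matches \(\hat{\lambda}_1\), which is the property used in the formula for the explained proportion.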

15 Multidimensional data

Extending PCA from two dimensions to \(p\) dimensions is straightforward. The matrices are extended as

\[ \underset{n\times p}{\mathbf{X}} = \left[ {\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{array}} \right] \quad \text{and} \quad \underset{p\times 1}{\mathbf{a}_1} = \left[ {\begin{array}{c} a_{11} \\ a_{21} \\ \vdots \\ a_{p1} \\ \end{array}} \right] \]

PCA finds an axis, i.e. a linear combination of the variables, that maximizes the explained variance. Because the process can be repeated, we can obtain the 2nd, 3rd, and further principal components. Like the 1st PC, the 2nd PC is chosen so that it explains the largest amount of the variance that remains after excluding the variance already explained by the 1st PC; it is therefore orthogonal to the 1st PC.

If \(\mathbf{S}\) is the covariance matrix of \(\mathbf{X}\) having eigenvalue-eigenvector pairs (\(\hat{\lambda}_1\), \(\hat{\mathbf{e}}_1\)), (\(\hat{\lambda}_2\), \(\hat{\mathbf{e}}_2\)), \(\cdots\), (\(\hat{\lambda}_p\), \(\hat{\mathbf{e}}_p\)) where \(\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0\), the \(k\)th principal component is \[\mathbf{y}_k = \mathbf{X} \hat{\mathbf{e}}_k\]

The proportion of variance explained by the \(k\)th principal component is \[\frac{\hat{\lambda}_k}{\sum_{l} \hat{\lambda}_l}\]
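The general case can be sketched with NumPy, whose `numpy.linalg.eigh` routine computes the eigenvalues and eigenvectors of a symmetric matrix. The data below are made-up values (not from the gene expression data), and centering the columns first is an assumption of this sketch rather than something the formulas above require.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations of p = 3 variables
# (illustration only, not taken from the gene expression data).
X = np.array([
    [2.0, 1.0, 5.0],
    [3.0, 3.0, 4.0],
    [5.0, 4.0, 6.0],
    [7.0, 6.0, 3.0],
    [8.0, 6.0, 7.0],
    [9.0, 9.0, 5.0],
])

# Center the columns and form the sample covariance matrix S (p x p).
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder them to get
# lambda_1 >= lambda_2 >= ... >= lambda_p with matching eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Column k of `scores` holds the k-th principal component y_k = X e_k.
scores = Xc @ eigenvectors

# Proportion of variance explained by each principal component.
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 3))
```

In this sketch the sample variance of each column of `scores` equals the corresponding eigenvalue, and different columns are uncorrelated with each other.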

16 Notes for PCA

16.1 Missing

Like many other multivariate analyses, PCA does not allow missing values. You may need to impute the missing values or remove the incomplete cases.
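Both options can be sketched in plain Python. The example assumes that missing values are encoded as NaN and uses made-up data; replacing a missing value with the column mean is only one very simple form of imputation, chosen here for brevity, and it tends to shrink the variance of the imputed variable.

```python
import math

# Hypothetical toy data with missing values encoded as NaN.
nan = float("nan")
rows = [
    [2.0, 1.0, 5.0],
    [3.0, nan, 4.0],
    [5.0, 4.0, 6.0],
    [nan, 6.0, 3.0],
    [8.0, 6.0, 7.0],
]


def is_complete(row):
    """True when no value in the row is NaN."""
    return not any(math.isnan(v) for v in row)


# Option 1: keep only the complete cases.
complete_rows = [row for row in rows if is_complete(row)]

# Option 2: replace each NaN with the mean of the observed values
# in the same column (simple mean imputation).
n_cols = len(rows[0])
col_means = []
for j in range(n_cols):
    observed = [row[j] for row in rows if not math.isnan(row[j])]
    col_means.append(sum(observed) / len(observed))

imputed_rows = [
    [col_means[j] if math.isnan(v) else v for j, v in enumerate(row)]
    for row in rows
]

print(len(complete_rows), "complete rows out of", len(rows))
```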

16.2 Scaling

Scaling the variables before PCA is often recommended. If one variable has a substantially larger variance than the others, the first principal component will be dominated by that variable. However, we usually do not want to give a higher weight to one or a few variables, especially when the units of the variables are different or arbitrary.
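The effect of scaling can be seen in a small numerical sketch. The data are made-up values in which the second variable is recorded in much larger units than the first; the helper `first_eigenvector` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy data: the second variable is recorded in much larger
# units than the first (illustration only).
X = np.array([
    [1.0, 2100.0],
    [2.0, 1900.0],
    [3.0, 3500.0],
    [4.0, 3000.0],
    [5.0, 4800.0],
])


def first_eigenvector(data):
    """Unit eigenvector of the sample covariance matrix with the largest eigenvalue."""
    S = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(S)  # ascending order
    return eigenvectors[:, -1]


# Without scaling, PC1 is almost entirely the large-variance variable.
e1_raw = first_eigenvector(X)

# Standardize each column to mean 0 and standard deviation 1, then repeat.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
e1_scaled = first_eigenvector(Z)

# The sign of an eigenvector is arbitrary, so compare absolute weights.
print("raw weights:   ", np.round(np.abs(e1_raw), 4))
print("scaled weights:", np.round(np.abs(e1_scaled), 4))
```

After standardization the two variables receive weights of equal magnitude in the first principal component, whereas on the raw data the weight of the first variable is close to zero.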

17 Output from PCA

17.1 Principal components

The PCA output that is presented and examined most often is the first two principal components (PCs). The values of those two PCs, the PC1 and PC2 scores, are often presented in a scatter plot. This plot is called a “PCA score plot” or simply a “PCA plot”.

The x and y axes of the plot are often modified. The axis labels include the proportion of the variance explained by each axis, and the axes are standardized to have a similar range.

17.2 Biplot (Scores + Loadings)

How much each variable contributes to each principal component is stored in vectors called loadings. The loadings are often presented together with the scores, and the resulting plot is called a “biplot”.

17.3 Scree plot

The proportion of variance explained by each principal component is presented in a bar or line plot. This plot is called a “scree plot”.
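The numbers behind a scree plot are simply the eigenvalues divided by their sum. The sketch below uses hypothetical eigenvalues (not computed from the gene expression data) and prints a text version of the plot, together with the running total of the proportions.

```python
# Hypothetical eigenvalues of a covariance matrix, sorted in decreasing order.
eigenvalues = [4.0, 2.5, 1.0, 0.5]

# Proportion of variance explained by each principal component.
total = sum(eigenvalues)
proportions = [lam / total for lam in eigenvalues]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Text version of a scree plot: one bar per principal component.
for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    bar = "#" * round(p * 40)
    print(f"PC{k}: {p:6.1%} (cumulative {c:6.1%}) {bar}")
```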