Abstract
In this tutorial, we talk about the Curse of Dimensionality and the ideas behind dimensionality reduction. We are going to cover a) linear dimensionality reduction techniques (PCA, metric MDS, ICA), and b) non-linear dimensionality reduction techniques (tSNE, UMAP, LLE, Isomaps).
Dimensionality Reduction is really not just about visualization, as many of us might think. It is a necessity in Data Science in order to overcome the Curse of Dimensionality, also known as Rao’s paradox. What is it about? When we work with data we have n observations (samples) for p variables (features). Very often (almost always, unless you are lucky) we have p >> n, i.e. we have a high-dimensional space. It turns out that classical Frequentist statistics blows up in a high-dimensional space, i.e. the conclusions of the models are no longer valid (robust). Let us simulate just a few (n = 20 or so) observations of a response variable Y and a few (e.g. p = 2) predictor variables encapsulated in a matrix X, and run a simple linear association between X and Y:
set.seed(123)
n<-20
p<-2
Y<-rnorm(n)
Y
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499
## [7] 0.46091621 -1.26506123 -0.68685285 -0.44566197 1.22408180 0.35981383
## [13] 0.40077145 0.11068272 -0.55584113 1.78691314 0.49785048 -1.96661716
## [19] 0.70135590 -0.47279141
X<-matrix(rnorm(n*p),n,p)
X
## [,1] [,2]
## [1,] -1.06782371 -0.69470698
## [2,] -0.21797491 -0.20791728
## [3,] -1.02600445 -1.26539635
## [4,] -0.72889123 2.16895597
## [5,] -0.62503927 1.20796200
## [6,] -1.68669331 -1.12310858
## [7,] 0.83778704 -0.40288484
## [8,] 0.15337312 -0.46665535
## [9,] -1.13813694 0.77996512
## [10,] 1.25381492 -0.08336907
## [11,] 0.42646422 0.25331851
## [12,] -0.29507148 -0.02854676
## [13,] 0.89512566 -0.04287046
## [14,] 0.87813349 1.36860228
## [15,] 0.82158108 -0.22577099
## [16,] 0.68864025 1.51647060
## [17,] 0.55391765 -1.54875280
## [18,] -0.06191171 0.58461375
## [19,] -0.30596266 0.12385424
## [20,] -0.38047100 0.21594157
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0522 -0.6380 0.1451 0.3911 1.8829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14950 0.22949 0.651 0.523
## X1 -0.09405 0.28245 -0.333 0.743
## X2 -0.11919 0.24486 -0.487 0.633
##
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared: 0.02204, Adjusted R-squared: -0.09301
## F-statistic: 0.1916 on 2 and 17 DF, p-value: 0.8274
Looks good: the variables are not related, as expected (since they were drawn independently from a Gaussian distribution), but the math works and there are no problems as long as n > p. Let us now increase the number of features p and see what happens.
set.seed(123456)
n<-20
p<-10
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0255 -0.4320 0.1056 0.4493 1.0617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54916 0.26472 2.075 0.0679 .
## X1 0.30013 0.21690 1.384 0.1998
## X2 0.68053 0.27693 2.457 0.0363 *
## X3 -0.10675 0.26010 -0.410 0.6911
## X4 -0.21367 0.33690 -0.634 0.5417
## X5 -0.19123 0.31881 -0.600 0.5634
## X6 0.81074 0.25221 3.214 0.0106 *
## X7 0.09634 0.24143 0.399 0.6992
## X8 -0.29864 0.19004 -1.571 0.1505
## X9 -0.78175 0.35408 -2.208 0.0546 .
## X10 0.83736 0.36936 2.267 0.0496 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared: 0.6592, Adjusted R-squared: 0.2805
## F-statistic: 1.741 on 10 and 9 DF, p-value: 0.2089
Oops! What happened? Some explanatory variables in X appear to be significantly associated with Y. How can that be, when we drew them from a Gaussian distribution? The reason is that we have a limited number of observations n, so any two variables with just a few observations can be correlated by chance alone. Roughly speaking, if you have 10 samples and 5 variables, the correlations you observe between the variables may well not be real: with so little variation in the data, any two variables can appear significantly correlated purely by chance. This violates the basic assumptions of the Maximum Likelihood (ML) principle, which lies behind the Ordinary Least Squares Linear Regression model we have been fitting. The short simulation below makes this concrete.
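As a quick illustration (my own sketch, not part of the original code), we can repeatedly draw two independent Gaussian variables with only n = 10 samples and look at their sample correlation:
#A minimal sketch: with only n = 10 observations, two independent Gaussian
#variables frequently show a "strong" sample correlation purely by chance
set.seed(1)
r<-replicate(1000, cor(rnorm(10), rnorm(10)))
mean(abs(r) > 0.5) #fraction of spuriously "strong" correlations
quantile(abs(r), c(0.5, 0.9, 0.99)) #typical and extreme absolute correlations
With that in mind, let us go further and hit the case n = p: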
set.seed(123456)
n<-20
p<-20
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.34889 NaN NaN NaN
## X1 0.66218 NaN NaN NaN
## X2 0.76212 NaN NaN NaN
## X3 -1.35033 NaN NaN NaN
## X4 -0.57487 NaN NaN NaN
## X5 0.02142 NaN NaN NaN
## X6 0.40290 NaN NaN NaN
## X7 0.03313 NaN NaN NaN
## X8 -0.31983 NaN NaN NaN
## X9 -0.92833 NaN NaN NaN
## X10 0.18091 NaN NaN NaN
## X11 -1.37618 NaN NaN NaN
## X12 2.11438 NaN NaN NaN
## X13 -1.75103 NaN NaN NaN
## X14 -1.55073 NaN NaN NaN
## X15 0.01112 NaN NaN NaN
## X16 -0.50943 NaN NaN NaN
## X17 -0.47576 NaN NaN NaN
## X18 0.31793 NaN NaN NaN
## X19 1.43615 NaN NaN NaN
## X20 NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 19 and 0 DF, p-value: NA
What happened? Why do we see lots of “NA” and “NaN” values? The Linear Regression model could not be fitted: the design matrix is singular. If we further increase p, so that p > n or p >> n, things will not get any better. We have hit the limitation of the Maximum Likelihood (ML) principle, which demands many things such as a large sample size, Normally distributed data, uncorrelated errors, homoscedasticity etc. Let us now take a closer look at why exactly the ML math blows up when n <= p. Consider a linear model:
\[ Y = X\beta \]
Let us make a few mathematical tricks in order to get a solution for the coefficients of the linear model:
\[ X^TY = X^TX\beta \]
\[ (X^TX)^{-1}X^TY = (X^TX)^{-1}X^TX\beta \]
\[ \beta = (X^TX)^{-1}X^TY \]
This is the solution of the linear model. We can see that it involves the inverse of the matrix X^TX. From Linear Algebra, an inverse matrix is inversely proportional to the determinant of that matrix. Again from Linear Algebra, the determinant of a matrix is equal to zero (or approaches zero) when the columns or rows of the matrix are collinear, i.e. can be expressed as linear combinations of each other, i.e. are correlated. This implies that if we have a limited number of observations n and a large p such that p >= n, then, as we saw, some variables (at least two) can become correlated just by chance alone (if X1 and X2 are each correlated with Y, they tend to be correlated with each other), the determinant of X^TX approaches zero, and one over the determinant leads to a singularity, i.e. it diverges. Therefore the solution of the linear model does not hold any more. This is what is meant by “the math blows up”.
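To see this numerically, here is a small sketch (my own illustration, not part of the original code): we build a design matrix with an intercept column plus p = 20 predictors for only n = 20 samples, so it has more columns than rows, and inspect its rank and conditioning.
#A minimal sketch: why (X^T X)^{-1} breaks down once the number of columns
#reaches the number of samples
set.seed(123)
n<-20
X_design<-cbind(1, matrix(rnorm(n*20),n,20)) #intercept + p = 20 predictors, n = 20 rows
qr(X_design)$rank #at most n = 20, i.e. less than ncol(X_design) = 21
kappa(t(X_design)%*%X_design) #condition number is huge: the matrix is effectively singular
#solve(t(X_design)%*%X_design) #would fail with "system is computationally singular"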
Now comes the question: how can we overcome the curse of dimensionality? Well, the easiest answer would be: increase n and/or decrease p. Increasing the sample size is usually very expensive and often not feasible. If increasing n is not an option, Dimensionality Reduction, i.e. a way of conceptualizing p variables via m (where p >> m) latent variables, can be very useful. Thus the two main motivations for doing Dimensionality Reduction are: a) overcoming the Curse of Dimensionality by compressing many correlated features into a few latent variables, and b) visualizing high-dimensional data in two or three dimensions.
Principal Component Analysis (PCA) is the simplest and most popular way to perform Dimensionality Reduction. There are numerous ways to think about PCA, i.e. it has an “infinite depth of understanding” with multiple layers. Despite its popularity and the inclination to view it as a “simple technique that everyone can do with just one line of code”, the method has many hidden pitfalls and can generate misleading results if applied without precautions. Below we describe possible ways to understand PCA in a bullet-point fashion:
The basic idea of PCA is to collapse p features (p >> n) down to just a few latent variables called principal components (a transformation to a space with at most min(n-1, p) directions) and keep as much of the variation in the low-dimensional space as there was in the p-dimensional space.
Geometrically, PCA can be seen as a linear transformation involving a rotation and a shift of the coordinate system in order to find the directions of most variation within the data. Hence PCA only makes sense to do if you suspect linear correlation between the variables in your data. For example, if two variables X1 and X2 are fairly correlated, one of them is redundant for the analysis and can be dropped. So if we put the origin of the coordinate system somewhere in the middle of the cloud of points, e.g. at (mean(X1), mean(X2)), and rotate the coordinate system so that the X1 axis coincides with the main direction of covariation between X1 and X2, we can conclude that the variation along X2 is negligible and can be ignored, and we keep only the variation with respect to X1. Thus we have done Dimensionality Reduction, i.e. replaced (X1, X2) by just X1 without losing too much variation in the data.
Often we hear that the PCA problem can be solved through Eigen Matrix Decomposition (the other, and faster, way is Singular Value Decomposition (SVD)). Let us show how finding the axes of maximal variation mathematically leads to the Eigen Matrix Decomposition problem. Let us define a projection (called a Principal Component) of a matrix X onto a basis (eigen) unit vector u as
\[ PC = u^T X = X^Tu \]
If X is a mean centered matrix, then the mean of PC is equal to zero
\[ <PC> = 0 \]
and the variance of PC is:
\[ <(PC-<PC>)^2> = <PC^2> = u^T X X^Tu \]
Here the matrix in the middle is called the variance-covariance matrix:
\[ X X^T=A \]
\[ <PC^2> = u^T Au \]
Let us now find the direction, i.e. the eigenvector u, that captures most of the variation in X, i.e. let us maximize the variance of PC, taking into account (via a Lagrange multiplier) that the vector u is a unit vector:
\[ \max_u\left(u^TAu + \lambda(1-u^Tu)\right) \]
Differentiating this function with respect to u, one arrives at the eigenvector-eigenvalue problem:
\[ Au = \lambda u \]
where A is the variance-covariance matrix of the initial data X.
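As a quick sanity check of this equivalence, here is a small sketch (my own illustration, not part of the original code): the variances and directions returned by prcomp() coincide, up to the sign of each axis, with the eigenvalues and eigenvectors of the variance-covariance matrix.
#A minimal sketch: PCA via prcomp() versus eigendecomposition of the covariance matrix
set.seed(1)
X<-matrix(rnorm(100*5),100,5)
pca<-prcomp(X, center=TRUE, scale.=FALSE)
ed<-eigen(cov(X))
round(pca$sdev^2 - ed$values, 10) #PC variances equal the eigenvalues
round(abs(pca$rotation) - abs(ed$vectors), 10) #loadings equal the eigenvectors (up to sign)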
Let us demonstrate how PCA works using the MNIST data set [1]. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
#Uncomment this in case you can not change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/DimReductSingleCell"))
mnist<-read.csv("2017-10-13-mnist_train.csv")
labels<-mnist$label
mnist$label<-NULL
mnist[1:10,1:10]
## pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0
dim(mnist)
## [1] 10000 784
We will use the base R function for PCA, which is “prcomp”. Here we perform PCA, look at the percentage of variation explained by the top principal components and finally plot the MNIST digits.
PC<-prcomp(log10(mnist + 1), center=TRUE, scale=FALSE)
vars<- PC$sdev^2
vars<- vars/sum(vars)
barplot(vars[1:10],names.arg=1:10,xlab="PCs",ylab="PERCENT OF VARIANCE EXPLAINED",main="PERCENT OF VARIANCE EXPLAINED BY PCs")
colors <- rainbow(length(unique(labels)))
names(colors) <- unique(labels)
plot(PC$x[,1:2], t='n',main="PCA PLOT WITH PRCOMP", xlab="PC1",ylab="PC2")
text(PC$x[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)
Obviously, replicas of the same digit tend to cluster together, i.e. zeros cluster together with other zeros etc. However, they are still quite mixed and do not form distinct clusters. This might be a result of non-linear relations between the variables, which cannot be captured in 2D by a linear transformation.
Next, we will consider another popular linear Dimensionality Reduction technique called Multi-Dimensional Scaling (MDS), sometimes also called Principal Coordinate Analysis (PCoA). The principle of Eigen Matrix Decomposition holds here as well; the only difference is that we decompose not the variance-covariance matrix of the initial data X, but a matrix of pairwise Euclidean distances between the data points in X.
For Multi-Dimensional Scaling plot we will use “cmdscale” R function.
d<-dist(log10(mnist + 1))
mds<-cmdscale(d, k = 2)
plot(mds[,1:2], t='n',main="MDS PLOT WITH CMDSCALE", xlab="DIM1",ylab="DIM2")
text(mds[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)
One can see that MDS gives a 2D representation quite similar to PCA, and this is not at all surprising if one thinks about the relation between the Euclidean distance and the variance-covariance matrix. Let us expand the squared Euclidean distance between two points x_i and x_j of the data X:
\[ (x_i-x_j)^2 = x_i^2 + x_j^2 - 2x_ix_j \]
The last term in the expansion is nothing other than the covariance term that builds the variance-covariance matrix. So the Euclidean distance and the variance-covariance matrix are linearly related, and therefore it is not surprising that they give us similar results.
Often PCA is performed on a correlation matrix (i.e. the matrix of pairwise correlations between the variables in X) instead of the variance-covariance matrix. Again, this amounts to essentially the same thing, since according to Pearson’s definition of the correlation coefficient:
\[ \rho_{xy} = \frac{cov(x,y)}{\sigma_x\sigma_y} \]
So the Euclidean distance, the variance-covariance matrix and the correlation coefficient are linearly related and should yield similar matrix decomposition results, i.e. eigenvectors and eigenvalues.
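To make this concrete, here is a quick numerical check (my own sketch, not part of the original code): classical MDS on Euclidean distances reproduces the PCA scores up to the sign of each axis.
#A minimal sketch: classical MDS (cmdscale) versus PCA (prcomp) on the same data
set.seed(1)
X<-matrix(rnorm(50*4),50,4)
pc<-prcomp(X, center=TRUE, scale.=FALSE)$x[,1:2]
mds<-cmdscale(dist(X), k=2)
max(abs(abs(pc) - abs(mds))) #essentially zero: the same embedding up to sign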
PCA or MDS makes sense to do when we suspect linear relations between the variables in X. Sometimes, however, the correlation between two variables can be zero; does that mean the two variables are not related? No, it does not: the relationship can be non-linear, e.g. quadratic, logarithmic, sinusoidal etc. (a small example follows below). To capture non-linear relations between observations there are non-linear Dimensionality Reduction techniques such as tSNE, Isomaps, LLE, Self-Organizing Maps etc. Among them, tSNE is especially popular in many Data Science areas due to its interesting visualization properties.
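Here is the small example mentioned above (my own sketch, not part of the original code): a variable and its square are perfectly related, yet their linear correlation is close to zero.
#A minimal sketch: zero linear correlation despite a perfect (quadratic) relationship
set.seed(1)
x<-rnorm(10000)
y<-x^2 #y is fully determined by x
cor(x, y) #nevertheless close to zero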
In a nutshell, tSNE projects high-dimensional data into a low-dimensional space in such a way that points which are close/far in the high-dimensional space are also close/far in the low-dimensional space. tSNE has its own special way of measuring similarity in the high- and low-dimensional spaces, namely the Gaussian law
\[ p_{ij} \sim \exp{(-||x_i-x_j||^2/2\sigma^2)} \]
is used for the high-dimensional space, and the heavy-tailed Student t-distribution is used for measuring similarities in the low-dimensional space:
\[ q_{ij} \sim (1+||y_i-y_j||^2)^{-1} \]
In order to make the distributions of points in the high- and low-dimensional spaces as similar as possible, they are compared via the Kullback-Leibler (KL) divergence, also known as the relative entropy in Information Theory:
\[ KL = \sum_{i \neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}} \]
The Kullback-Leibler divergence is minimized iteratively with the gradient descent method. The cost function is asymmetric: it assigns a lower cost to points that are far apart in the high-dimensional space (p=0) but close in the low-dimensional space (q=1) than to the opposite situation, when points are close in the high-dimensional space (p=1) but far apart in the low-dimensional space (q=0). This leads to a more “condensed” representation of the data in the low-dimensional space.
The bandwidth σ in the denominator of the exponent of the p matrix is controlled by a parameter called perplexity. It is responsible for finding a balance between the low- and high-dimensional representations, i.e. how close or far the points should be placed with respect to each other. Simply put, perplexity reflects the effective number of neighbors each point has in the high-dimensional space.
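Before running tSNE on MNIST, here is a toy sketch of the quantities defined above (my own illustration, not part of the original code; the bandwidth sigma is fixed by hand rather than tuned via perplexity): Gaussian similarities P in the original space, Student t similarities Q in a candidate 2D embedding, and the KL divergence between them.
#A minimal sketch of the tSNE cost for a handful of points
set.seed(1)
x<-matrix(rnorm(5*10),5,10) #5 points in 10 dimensions
y<-matrix(rnorm(5*2),5,2) #a candidate 2D embedding
sigma<-1 #fixed bandwidth (tSNE tunes it via perplexity)
P<-exp(-as.matrix(dist(x))^2/(2*sigma^2)); diag(P)<-0; P<-P/sum(P)
Q<-1/(1 + as.matrix(dist(y))^2); diag(Q)<-0; Q<-Q/sum(Q)
sum(P[P > 0]*log(P[P > 0]/Q[P > 0])) #the KL divergence that gradient descent minimizes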
Let us now use the MNIST data set and check how the tSNE plot looks:
library("Rtsne")
set.seed(12)
tsne.out<-Rtsne(log10(mnist + 1), initial_dims = 20, verbose = TRUE, perplexity = 30, max_iter = 1000)
## Performing PCA
## Read the 10000 x 20 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## - point 10000 of 10000
## Done in 5.06 seconds (sparsity = 0.011811)!
## Learning embedding...
## Iteration 50: error is 97.884243 (50 iterations in 1.85 seconds)
## Iteration 100: error is 88.610422 (50 iterations in 1.99 seconds)
## Iteration 150: error is 84.694992 (50 iterations in 1.84 seconds)
## Iteration 200: error is 84.151969 (50 iterations in 1.77 seconds)
## Iteration 250: error is 84.013338 (50 iterations in 1.73 seconds)
## Iteration 300: error is 3.086070 (50 iterations in 1.57 seconds)
## Iteration 350: error is 2.670700 (50 iterations in 1.47 seconds)
## Iteration 400: error is 2.447539 (50 iterations in 1.45 seconds)
## Iteration 450: error is 2.301062 (50 iterations in 1.40 seconds)
## Iteration 500: error is 2.196937 (50 iterations in 1.42 seconds)
## Iteration 550: error is 2.118819 (50 iterations in 1.39 seconds)
## Iteration 600: error is 2.057684 (50 iterations in 1.35 seconds)
## Iteration 650: error is 2.009215 (50 iterations in 1.35 seconds)
## Iteration 700: error is 1.971424 (50 iterations in 1.35 seconds)
## Iteration 750: error is 1.941662 (50 iterations in 1.32 seconds)
## Iteration 800: error is 1.919722 (50 iterations in 1.31 seconds)
## Iteration 850: error is 1.902596 (50 iterations in 1.30 seconds)
## Iteration 900: error is 1.889393 (50 iterations in 1.31 seconds)
## Iteration 950: error is 1.879076 (50 iterations in 1.32 seconds)
## Iteration 1000: error is 1.869883 (50 iterations in 1.32 seconds)
## Fitting performed in 29.80 seconds.
plot(tsne.out$Y, t = 'n', main = "tSNE MNIST", xlab="tSNE1",ylab="tSNE2")
text(tsne.out$Y, labels = labels, col = colors[as.character(labels)], cex = 0.5)
It is obvious that the clouds of different digits look more distinct now compared to the linear Dimensionality Reduction representations. Thus tSNE is handy when it comes to non-linear relations between data points that cannot be captured by PCA or MDS. One caution is important to remember: due to its highly non-linear nature, the visual distances in the tSNE plot do not necessarily reflect the true distances in the high-dimensional space. In other words, it is hard to say with certainty how far or how close two clusters on the tSNE plot really are, since tSNE distances do not have a trivial interpretation. Another consequence of the non-linear transformation is that the features driving the clustering on the tSNE plot are not easy to extract, since we are not performing any linear matrix decomposition as with e.g. PCA.
Finally, we will also compare tSNE embeddings with the UMAP visualization of the MNIST data set:
library("uwot")
## Loading required package: Matrix
mnist_umap <- umap(mnist, n_neighbors = 30, pca = 20, min_dist = 0.01, verbose = TRUE, n_threads = 4)
## 23:16:29 UMAP embedding parameters a = 1.896 b = 0.8006
## 23:16:29 Read 10000 rows and found 784 numeric columns
## 23:16:29 Reducing X column dimension to 20 via PCA
## 23:16:29 PCA: 20 components explained 64.47% variance
## 23:16:29 Using Annoy for neighbor search, n_neighbors = 30
## 23:16:30 Building Annoy index with metric = euclidean, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:16:31 Writing NN index file to temp file /var/folders/vh/twxllthj78j2bww9v4p_hjx1f3blpf/T//RtmptlAdKj/file1402f5d38a959
## 23:16:31 Searching Annoy index using 4 threads, search_k = 3000
## 23:16:31 Annoy recall = 100%
## 23:16:31 Commencing smooth kNN distance calibration using 4 threads
## 23:16:32 Initializing from normalized Laplacian + noise
## 23:16:32 Commencing optimization for 500 epochs, with 396780 positive edges
## 23:16:44 Optimization finished
head(mnist_umap)
## [,1] [,2]
## [1,] -8.61167565 3.32826446
## [2,] 5.40279827 9.14126514
## [3,] -7.20264425 0.04098247
## [4,] -0.08399781 -7.45212246
## [5,] 5.31740675 9.27951168
## [6,] 4.72207031 7.96125482
plot(mnist_umap, t = 'n', main = "UMAP MNIST", xlab="UMAP1",ylab="UMAP2")
text(mnist_umap, labels = labels, col = colors[as.character(labels)], cex = 0.5)
It looks like UMAP provides more condensed embeddings and the distances between clusters are more meaningful compared to tSNE.
[1] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Finally, here are the details of the system on which this document was compiled:
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS/LAPACK: /Users/rui.benfeitas/opt/miniconda3/envs/mofa2_nets/lib/libopenblasp-r0.3.15.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] uwot_0.1.10 Matrix_1.3-4 Rtsne_0.15
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 knitr_1.33 magrittr_2.0.1 lattice_0.20-44
## [5] R6_2.5.0 rlang_0.4.11 stringr_1.4.0 highr_0.9
## [9] tools_4.1.0 grid_4.1.0 xfun_0.24 RSpectra_0.16-0
## [13] irlba_2.3.3 jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [17] digest_0.6.27 RcppAnnoy_0.0.18 sass_0.4.0 codetools_0.2-18
## [21] evaluate_0.14 rmarkdown_2.9 stringi_1.6.2 compiler_4.1.0
## [25] bslib_0.2.5.1 jsonlite_1.7.2