Why Do Dimensionality Reduction?

Dimensionality Reduction is really not just about visualization, as many of us might think. It is a necessity in Data Science in order to overcome the Curse of Dimensionality, also known as Rao’s paradox. What is it about? When we work with data we have n observations (samples) of p variables (features). Very often (almost always, unless you are lucky) we have p >> n, i.e. a high-dimensional space. It turns out that classical Frequentist statistics blows up in a high-dimensional space, i.e. the conclusions of the models are no longer valid (robust). Let us simulate just a few (n = 20) observations of a response variable Y and a few (e.g. p = 2) predictor variables encapsulated in a matrix X, and run a simple linear association between X and Y:

set.seed(123)
n<-20
p<-2
Y<-rnorm(n)
Y
##  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
##  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197  1.22408180  0.35981383
## [13]  0.40077145  0.11068272 -0.55584113  1.78691314  0.49785048 -1.96661716
## [19]  0.70135590 -0.47279141
X<-matrix(rnorm(n*p),n,p)
X
##              [,1]        [,2]
##  [1,] -1.06782371 -0.69470698
##  [2,] -0.21797491 -0.20791728
##  [3,] -1.02600445 -1.26539635
##  [4,] -0.72889123  2.16895597
##  [5,] -0.62503927  1.20796200
##  [6,] -1.68669331 -1.12310858
##  [7,]  0.83778704 -0.40288484
##  [8,]  0.15337312 -0.46665535
##  [9,] -1.13813694  0.77996512
## [10,]  1.25381492 -0.08336907
## [11,]  0.42646422  0.25331851
## [12,] -0.29507148 -0.02854676
## [13,]  0.89512566 -0.04287046
## [14,]  0.87813349  1.36860228
## [15,]  0.82158108 -0.22577099
## [16,]  0.68864025  1.51647060
## [17,]  0.55391765 -1.54875280
## [18,] -0.06191171  0.58461375
## [19,] -0.30596266  0.12385424
## [20,] -0.38047100  0.21594157
summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204,    Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274

Looks good: the variables are not related, as expected (they were drawn independently from a Gaussian distribution), and the math works without problems as long as n > p. Let us now increase the number of features p and see what happens.

set.seed(123456)
n<-20
p<-10
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089

Oops! What happened? Some explanatory variables in X appear to be significantly associated with Y. How can that be, when we drew them from a Gaussian distribution? The reason is that we have a limited number of observations n, so any two variables can become correlated by chance alone. Roughly speaking, with, say, 10 samples and 5 variables, an observed correlation between two variables may well be spurious: with so little variation in the data, pairs of variables can appear significantly correlated purely by chance. This violates the basic assumptions of the Maximum Likelihood (ML) principle that lies behind the Ordinary Least Squares linear regression model we have been fitting; the short simulation sketch below illustrates how easily such chance correlations arise.
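
A minimal illustrative sketch (a simulation added here for illustration, not part of the original analysis; the object name pvals is just an illustrative choice): with only n = 10 observations, pairs of completely independent Gaussian variables are declared “significantly” correlated at the 5% level about 5% of the time, so among many candidate features a handful of spurious associations is essentially guaranteed.

set.seed(1)
n<-10
# p-values of correlation tests between 1000 pairs of independent Gaussian variables
pvals<-replicate(1000, cor.test(rnorm(n), rnorm(n))$p.value)
# fraction of pairs flagged as "significantly" correlated by chance alone
mean(pvals < 0.05)

With this in mind, let us go further and hit the case n = p: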

set.seed(123456)
n<-20
p<-20
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889        NaN     NaN      NaN
## X1           0.66218        NaN     NaN      NaN
## X2           0.76212        NaN     NaN      NaN
## X3          -1.35033        NaN     NaN      NaN
## X4          -0.57487        NaN     NaN      NaN
## X5           0.02142        NaN     NaN      NaN
## X6           0.40290        NaN     NaN      NaN
## X7           0.03313        NaN     NaN      NaN
## X8          -0.31983        NaN     NaN      NaN
## X9          -0.92833        NaN     NaN      NaN
## X10          0.18091        NaN     NaN      NaN
## X11         -1.37618        NaN     NaN      NaN
## X12          2.11438        NaN     NaN      NaN
## X13         -1.75103        NaN     NaN      NaN
## X14         -1.55073        NaN     NaN      NaN
## X15          0.01112        NaN     NaN      NaN
## X16         -0.50943        NaN     NaN      NaN
## X17         -0.47576        NaN     NaN      NaN
## X18          0.31793        NaN     NaN      NaN
## X19          1.43615        NaN     NaN      NaN
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA

What happened? Why do we see so many “NaN” and “NA” values? The linear regression model could not be fit: with p = n (plus the intercept) the design matrix is singular and there are no residual degrees of freedom left. If we further increase p, so that p > n or p >> n, things do not get any better. We have hit the limitations of the Maximum Likelihood (ML) principle, which demands many things such as a large sample size, normally distributed data, uncorrelated errors, homoscedasticity, etc. Let us now take a closer look at why exactly the ML math blows up when n <= p. Consider a linear model:

\[ Y = X\beta \]

Let us apply a few algebraic manipulations in order to obtain a solution for the coefficients of the linear model:

\[ X^TY = X^TX\beta \]

\[ (X^TX)^{-1}X^TY = (X^TX)^{-1}X^TX\beta \]

\[ (X^TX)^{-1}X^TY = \beta \]

This is the solution of the linear model. We can see that it involves the inverse of the matrix X^TX. From Linear Algebra, the inverse of a matrix is inversely proportional to its determinant, and, again from Linear Algebra, the determinant of a matrix is equal to zero (or approaches zero) when its columns or rows are collinear, i.e. can be expressed as linear combinations of each other, i.e. are correlated. Thus, if we have a limited number of observations n and a large number of predictors p such that p >= n, some columns of X become (nearly) collinear just by chance alone, just as we saw some variables become associated with Y by chance. Then the determinant of X^TX approaches zero, one over the determinant diverges, and the inverse becomes singular. Therefore the solution of the linear model no longer holds. This is what is meant by “the math blows up”.
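
As a minimal sketch (reusing the n = p = 20 simulation from above; the helper objects Xd and XtX are illustrative names), we can check directly that the cross-product matrix of the design matrix is rank-deficient and therefore cannot be inverted:

set.seed(123456)
n<-20
p<-20
X<-matrix(rnorm(n*p),n,p)
Xd<-cbind(1, X)                                           # design matrix with an intercept column: 20 x 21
XtX<-crossprod(Xd)                                        # t(Xd) %*% Xd, a 21 x 21 matrix
qr(XtX)$rank                                              # 20 < 21, i.e. rank-deficient
min(eigen(XtX, symmetric=TRUE, only.values=TRUE)$values)  # one eigenvalue is (numerically) zero, so the determinant vanishes
# solve(XtX) would fail here with a "system is computationally singular" error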

Now comes the question: how can we overcome the Curse of Dimensionality? The easiest answer would be: increase n and/or decrease p. Increasing the sample size is usually very expensive and often not feasible. If increasing n is not an option, Dimensionality Reduction, i.e. a way of summarizing the p original variables in m latent variables (where m << p), can be very useful. Thus the two main motivations for doing Dimensionality Reduction are: 1) to overcome the Curse of Dimensionality when p >> n, and 2) to visualize high-dimensional data in two or three dimensions.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is the simplest and most popular way to perform Dimensionality Reduction. There are numerous ways to think about PCA, i.e. it has an “infinite depth of understanding” with multiple layers. Despite its popularity, and the temptation to view it as a “simple technique that everyone can do with one line of code”, the method has many hidden pitfalls and can generate misleading results if applied without precautions. One way to understand PCA is as the projection of the data X onto a unit vector u that maximizes the variance of the projection. The principal component (PC) is this projection:

\[ PC = u^T X = X^Tu \]

If X is a mean-centered matrix, then the mean of PC is equal to zero:

\[ <PC> = 0 \]

and the variance of PC is:

\[ <(PC-<PC>)^2> = <PC^2> = u^T X X^Tu \]

Here the matrix in the middle is called the variance-covariance matrix:

\[ X X^T=A \]

\[ <PC^2> = u^T Au \]

Let us now find the direction, i.e. the eigenvector u, that captures most of the variation in X, i.e. let us maximize the variance of PC while taking into account (via a Lagrange multiplier) that u is a unit vector:

\[ \max_u \left( u^TAu + \lambda(1-u^Tu) \right) \]

Differentiating this function with respect to u and setting the derivative to zero, one arrives at the eigenvector-eigenvalue problem:

\[ Au = \lambda u \]

where A is the variance-covariance matrix of the initial data X.
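
As a quick sanity check of this result (a minimal sketch on a small random matrix, not on the MNIST data used below; all object names are illustrative), the principal axes returned by prcomp coincide, up to sign, with the eigenvectors of the variance-covariance matrix, and the variances of the principal components equal the corresponding eigenvalues:

set.seed(1)
X<-matrix(rnorm(100*5),100,5)                   # 100 samples, 5 variables
A<-cov(X)                                       # variance-covariance matrix of X
eig<-eigen(A)                                   # solve Au = lambda*u
pca<-prcomp(X, center=TRUE, scale.=FALSE)
max(abs(abs(eig$vectors)-abs(pca$rotation)))    # ~0: same axes up to sign
max(abs(eig$values-pca$sdev^2))                 # ~0: eigenvalues equal the PC variances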

Let us demonstrate how PCA works using the MNIST data set [1]. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

#Uncomment this in case you can not change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/DimReductSingleCell"))

mnist<-read.csv("2017-10-13-mnist_train.csv")
labels<-mnist$label
mnist$label<-NULL
mnist[1:10,1:10]
##    pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
## 1       0      0      0      0      0      0      0      0      0      0
## 2       0      0      0      0      0      0      0      0      0      0
## 3       0      0      0      0      0      0      0      0      0      0
## 4       0      0      0      0      0      0      0      0      0      0
## 5       0      0      0      0      0      0      0      0      0      0
## 6       0      0      0      0      0      0      0      0      0      0
## 7       0      0      0      0      0      0      0      0      0      0
## 8       0      0      0      0      0      0      0      0      0      0
## 9       0      0      0      0      0      0      0      0      0      0
## 10      0      0      0      0      0      0      0      0      0      0
dim(mnist)
## [1] 10000   784

We will use the base R function for PCA, “prcomp”. Here we perform PCA, look at the percentage of variance explained by the top principal components and finally plot the MNIST digits in the space of the first two principal components.

PC<-prcomp(log10(mnist + 1), center=TRUE, scale=FALSE)

vars<- PC$sdev^2
vars<- vars/sum(vars)
barplot(vars[1:10],names.arg=1:10,xlab="PCs",ylab="PERCENT OF VARIANCE EXPLAINED",main="PERCENT OF VARIANCE EXPLAINED BY PCs")

colors <- rainbow(length(unique(labels)))
names(colors) <- unique(labels)
plot(PC$x[,1:2], t='n',main="PCA PLOT WITH PRCOMP", xlab="PC1",ylab="PC2")
text(PC$x[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)

Obviously, replicas of the same digit tend to cluster together, i.e. zeros cluster with zeros, etc. However, the digits are still quite mixed and do not form distinct clusters. This might be a result of non-linear relations between the variables, which cannot be captured in 2D by a linear transformation.

Multi-Dimensional Scaling (MDS)

Next, we will consider another popular linear Dimensionality Reduction technique called Multi-Dimensional Scaling, sometimes also called Principal Coordinate Analysis (PCoA). The principle of eigen matrix decomposition holds here as well; the only difference is that we decompose not the variance-covariance matrix of the initial data X, but a matrix of pairwise Euclidean distances between the data points in X.

For the Multi-Dimensional Scaling plot we will use the “cmdscale” R function.

d<-dist(log10(mnist + 1))
mds<-cmdscale(d, k = 2)
plot(mds[,1:2], t='n',main="MDS PLOT WITH CMDSCALE", xlab="DIM1",ylab="DIM2")
text(mds[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)

One can see that MDS gives a 2D representation quite similar to PCA, which is not at all surprising if one considers how the Euclidean distance and the variance-covariance matrix are related. Let us expand the squared Euclidean distance between two points x_i and x_j of the data X:

\[ (x_i-x_j)^2 = x_i^2 + x_j^2 - 2x_ix_j \]

The last term in the expansion is nothing else than the variance-covariance matrix (for mean-centered data). So the Euclidean distance and the variance-covariance matrix are linearly related, and therefore it is not surprising that they give similar results.

Often PCA is performed on a correlation matrix (i.e. the matrix of pairwise correlations between the variables in X) instead of the variance-covariance matrix. Again, this amounts to much the same thing, since according to Pearson’s definition of the correlation coefficient:

\[ \rho_{xy} = \frac{cov(x,y)}{\sigma_x\sigma_y} \]

So the Euclidean distance, the variance-covariance matrix and the correlation coefficient are linearly related and should give similar matrix decomposition results, i.e. eigenvectors and eigenvalues.
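
As a quick illustration of this equivalence (again a minimal sketch on a small random matrix rather than on MNIST; object names are illustrative), classical MDS on Euclidean distances reproduces the PCA scores up to sign, because both rest on the same eigen decomposition:

set.seed(1)
X<-matrix(rnorm(50*4),50,4)               # 50 samples, 4 variables
pca<-prcomp(X, center=TRUE, scale.=FALSE)
mds<-cmdscale(dist(X), k=2)               # classical MDS on pairwise Euclidean distances
max(abs(abs(pca$x[,1:2])-abs(mds)))       # ~0: identical 2D coordinates up to sign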

t-distributed Stochastic Neighbor Embedding (tSNE)

PCA or MDS make sense when we suspect linear relations between the variables in X. Sometimes, however, the correlation between two variables can be zero; does this mean that the two variables are not related? No, it does not: the relationship can be non-linear, e.g. quadratic, logarithmic, sinusoidal, etc. To capture non-linear relations between data points there are non-linear Dimensionality Reduction techniques such as tSNE, Isomap, LLE, Self-Organizing Maps, etc. Among them, tSNE is especially popular in many Data Science areas due to its interesting visualization properties.

In a nutshell, tSNE projects high-dimensional data into a low-dimensional space in such a way that points which are close/far apart in the high-dimensional space remain close/far apart in the low-dimensional space. tSNE has its own special way of measuring similarity in the high- and low-dimensional spaces: the Gaussian law

\[ p_{ij} \sim \exp{(-||x_i-x_j||^2/2\sigma^2)} \]

is used for the high-dimensional space, and the heavy-tailed Student’s t-distribution is used for measuring similarities in the low-dimensional space:

\[ q_{ij} \sim (1+||y_i-y_j||^2)^{-1} \]

In order to make the distributions of points in the high- and low-dimensional spaces as similar as possible, the two sets of similarities are compared via the Kullback-Leibler (KL) divergence, also known as relative entropy in Information Theory:

\[ KL = \sum_{i \neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}} \]

The Kullback-Leibler divergence is minimized iteratively by gradient descent. The cost is asymmetric: there is a low cost for points that are far apart in the high-dimensional space (p = 0) but close in the low-dimensional space (q = 1), compared to the opposite situation where points are close in the high-dimensional space (p = 1) but far apart in the low-dimensional space (q = 0). This leads to a more “condensed” representation of the data in the low-dimensional space.
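
A minimal numerical sketch of this asymmetry (illustrative numbers only, not taken from the original text): comparing the per-pair contribution p*log(p/q) for the two kinds of mistakes shows that separating true neighbours is penalised heavily, while placing distant points close together is almost free (an individual term can even be negative, although the full KL sum is always non-negative):

p<-0.99; q<-0.01
p*log(p/q)    # ~4.55: large cost when close high-dimensional neighbours are placed far apart
p<-0.01; q<-0.99
p*log(p/q)    # ~-0.05: negligible cost when distant points are placed close together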

The bandwidth sigma appearing in the denominator of the exponent of the p matrix is set by a parameter called perplexity. It is responsible for finding a balance between the low- and high-dimensional representations, i.e. how close or far the points should be placed with respect to each other. Simply put, perplexity reflects the effective number of neighbors each point has in the high-dimensional space.
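
As a tiny numerical illustration of that interpretation (illustrative numbers, not part of the original text): perplexity is defined as 2 raised to the Shannon entropy of the conditional neighbour distribution, so a point whose similarity mass is spread evenly over 5 neighbours has perplexity 5:

p<-rep(1/5, 5)          # similarity mass spread evenly over 5 neighbours
2^(-sum(p*log2(p)))     # perplexity = 5, the effective number of neighbours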

Let us use the MNIST data set and check how the tSNE plot looks:

library("Rtsne")
set.seed(12)
tsne.out<-Rtsne(log10(mnist + 1), initial_dims = 20, verbose = TRUE, perplexity = 30, max_iter = 1000)
## Performing PCA
## Read the 10000 x 20 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
##  - point 10000 of 10000
## Done in 5.06 seconds (sparsity = 0.011811)!
## Learning embedding...
## Iteration 50: error is 97.884243 (50 iterations in 1.85 seconds)
## Iteration 100: error is 88.610422 (50 iterations in 1.99 seconds)
## Iteration 150: error is 84.694992 (50 iterations in 1.84 seconds)
## Iteration 200: error is 84.151969 (50 iterations in 1.77 seconds)
## Iteration 250: error is 84.013338 (50 iterations in 1.73 seconds)
## Iteration 300: error is 3.086070 (50 iterations in 1.57 seconds)
## Iteration 350: error is 2.670700 (50 iterations in 1.47 seconds)
## Iteration 400: error is 2.447539 (50 iterations in 1.45 seconds)
## Iteration 450: error is 2.301062 (50 iterations in 1.40 seconds)
## Iteration 500: error is 2.196937 (50 iterations in 1.42 seconds)
## Iteration 550: error is 2.118819 (50 iterations in 1.39 seconds)
## Iteration 600: error is 2.057684 (50 iterations in 1.35 seconds)
## Iteration 650: error is 2.009215 (50 iterations in 1.35 seconds)
## Iteration 700: error is 1.971424 (50 iterations in 1.35 seconds)
## Iteration 750: error is 1.941662 (50 iterations in 1.32 seconds)
## Iteration 800: error is 1.919722 (50 iterations in 1.31 seconds)
## Iteration 850: error is 1.902596 (50 iterations in 1.30 seconds)
## Iteration 900: error is 1.889393 (50 iterations in 1.31 seconds)
## Iteration 950: error is 1.879076 (50 iterations in 1.32 seconds)
## Iteration 1000: error is 1.869883 (50 iterations in 1.32 seconds)
## Fitting performed in 29.80 seconds.
plot(tsne.out$Y,  t = 'n', main = "tSNE MNIST", xlab="tSNE1",ylab="tSNE2")
text(tsne.out$Y, labels = labels, col = colors[as.character(labels)], cex = 0.5)

It is obvious that the clouds of different digits look more distinct now compared to the linear Dimensionality Reduction representations. Thus tSNE is handy when there are non-linear relations between data points that cannot be captured by PCA or MDS. One caution is important to keep in mind: due to its highly non-linear nature, the visual distances in the tSNE plot do not necessarily reflect the true distances in the high-dimensional space. In other words, it is hard to say with certainty how far or how close two clusters in the tSNE plot really are, since tSNE distances do not have a straightforward meaning. Another consequence of the non-linear transformation is that the features driving the clustering in the tSNE plot are not easy to extract, since we are not performing a linear matrix decomposition as with e.g. PCA.

Finally, we will also compare tSNE embeddings with the UMAP visualization of the MNIST data set:

library("uwot")
## Loading required package: Matrix
mnist_umap <- umap(mnist, n_neighbors = 30, pca = 20, min_dist = 0.01, verbose = TRUE, n_threads = 4)
## 23:16:29 UMAP embedding parameters a = 1.896 b = 0.8006
## 23:16:29 Read 10000 rows and found 784 numeric columns
## 23:16:29 Reducing X column dimension to 20 via PCA
## 23:16:29 PCA: 20 components explained 64.47% variance
## 23:16:29 Using Annoy for neighbor search, n_neighbors = 30
## 23:16:30 Building Annoy index with metric = euclidean, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 23:16:31 Writing NN index file to temp file /var/folders/vh/twxllthj78j2bww9v4p_hjx1f3blpf/T//RtmptlAdKj/file1402f5d38a959
## 23:16:31 Searching Annoy index using 4 threads, search_k = 3000
## 23:16:31 Annoy recall = 100%
## 23:16:31 Commencing smooth kNN distance calibration using 4 threads
## 23:16:32 Initializing from normalized Laplacian + noise
## 23:16:32 Commencing optimization for 500 epochs, with 396780 positive edges
## 23:16:44 Optimization finished
head(mnist_umap)
##             [,1]        [,2]
## [1,] -8.61167565  3.32826446
## [2,]  5.40279827  9.14126514
## [3,] -7.20264425  0.04098247
## [4,] -0.08399781 -7.45212246
## [5,]  5.31740675  9.27951168
## [6,]  4.72207031  7.96125482
plot(mnist_umap,  t = 'n', main = "UMAP MNIST", xlab="UMAP1",ylab="UMAP2")
text(mnist_umap, labels = labels, col = colors[as.character(labels)], cex = 0.5)

It looks like UMAP provides a more condensed embedding, and the distances between the clusters appear more meaningful compared to tSNE.

References

[1] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Finally, here are the details of the system on which this document was compiled:

Session Info

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rui.benfeitas/opt/miniconda3/envs/mofa2_nets/lib/libopenblasp-r0.3.15.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] uwot_0.1.10  Matrix_1.3-4 Rtsne_0.15  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        knitr_1.33        magrittr_2.0.1    lattice_0.20-44  
##  [5] R6_2.5.0          rlang_0.4.11      stringr_1.4.0     highr_0.9        
##  [9] tools_4.1.0       grid_4.1.0        xfun_0.24         RSpectra_0.16-0  
## [13] irlba_2.3.3       jquerylib_0.1.4   htmltools_0.5.1.1 yaml_2.2.1       
## [17] digest_0.6.27     RcppAnnoy_0.0.18  sass_0.4.0        codetools_0.2-18 
## [21] evaluate_0.14     rmarkdown_2.9     stringi_1.6.2     compiler_4.1.0   
## [25] bslib_0.2.5.1     jsonlite_1.7.2