Dimension Reduction

Abstract

In this tutorial, we talk about the Curse of Dimensionality and the ideas behind dimensionality reduction. We are going to cover a) linear dimensionality reduction techniques (PCA, metric MDS, ICA), and b) non-linear dimensionality reduction techniques (tSNE, UMAP, LLE, Isomap).

The concept of Dimensionality Reduction is really not just about visualization, as many of us might think. It is a necessity in Data Science for overcoming the Curse of Dimensionality, also known as Rao’s paradox. What is it about? When we work with data we have n observations (samples) of p variables (features). Very often (almost always, unless you are lucky) we have p >> n, i.e. a high-dimensional space. It turns out that classical Frequentist statistics blows up in a high-dimensional space, i.e. the conclusions of the models are no longer valid (robust). Let us simulate just a few (around n = 20) observations of a response variable Y and a few (e.g. p = 2) predictor variables encapsulated in a matrix X, and run a simple linear association analysis between X and Y:

```
set.seed(123)
n <- 20        # number of observations (samples)
p <- 2         # number of predictor variables (features)
Y <- rnorm(n)  # response variable drawn from a standard Gaussian
Y
```

```
##  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
##  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197  1.22408180  0.35981383
## [13]  0.40077145  0.11068272 -0.55584113  1.78691314  0.49785048 -1.96661716
## [19]  0.70135590 -0.47279141
```

```
X <- matrix(rnorm(n * p), n, p)  # n x p matrix of Gaussian predictors
X
```

```
##              [,1]        [,2]
##  [1,] -1.06782371 -0.69470698
##  [2,] -0.21797491 -0.20791728
##  [3,] -1.02600445 -1.26539635
##  [4,] -0.72889123  2.16895597
##  [5,] -0.62503927  1.20796200
##  [6,] -1.68669331 -1.12310858
##  [7,]  0.83778704 -0.40288484
##  [8,]  0.15337312 -0.46665535
##  [9,] -1.13813694  0.77996512
## [10,]  1.25381492 -0.08336907
## [11,]  0.42646422  0.25331851
## [12,] -0.29507148 -0.02854676
## [13,]  0.89512566 -0.04287046
## [14,]  0.87813349  1.36860228
## [15,]  0.82158108 -0.22577099
## [16,]  0.68864025  1.51647060
## [17,]  0.55391765 -1.54875280
## [18,] -0.06191171  0.58461375
## [19,] -0.30596266  0.12385424
## [20,] -0.38047100  0.21594157
```

`summary(lm(Y~X))`

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0522 -0.6380  0.1451  0.3911  1.8829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.14950    0.22949   0.651    0.523
## X1          -0.09405    0.28245  -0.333    0.743
## X2          -0.11919    0.24486  -0.487    0.633
## 
## Residual standard error: 1.017 on 17 degrees of freedom
## Multiple R-squared:  0.02204,  Adjusted R-squared:  -0.09301 
## F-statistic: 0.1916 on 2 and 17 DF,  p-value: 0.8274
```

This looks good: the variables are not related, as expected (they were drawn independently from a Gaussian distribution), but the math works and there are no problems as long as n > p. Let us now increase the number of features p and see what happens.

```
set.seed(123456)
n <- 20
p <- 10   # increase the number of features
Y <- rnorm(n)
X <- matrix(rnorm(n * p), n, p)
summary(lm(Y ~ X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0255 -0.4320  0.1056  0.4493  1.0617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.54916    0.26472   2.075   0.0679 .
## X1           0.30013    0.21690   1.384   0.1998  
## X2           0.68053    0.27693   2.457   0.0363 *
## X3          -0.10675    0.26010  -0.410   0.6911  
## X4          -0.21367    0.33690  -0.634   0.5417  
## X5          -0.19123    0.31881  -0.600   0.5634  
## X6           0.81074    0.25221   3.214   0.0106 *
## X7           0.09634    0.24143   0.399   0.6992  
## X8          -0.29864    0.19004  -1.571   0.1505  
## X9          -0.78175    0.35408  -2.208   0.0546 .
## X10          0.83736    0.36936   2.267   0.0496 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8692 on 9 degrees of freedom
## Multiple R-squared:  0.6592,  Adjusted R-squared:  0.2805 
## F-statistic: 1.741 on 10 and 9 DF,  p-value: 0.2089
```

Oops! What happened? Some explanatory variables in X appear to be significantly associated with Y. How can that be, when we drew them all from a Gaussian distribution? The reason is the limited number of observations n: any two variables with just a few observations can be correlated by chance alone. Roughly speaking, if you have 10 samples and 5 variables, the correlations you observe between the variables are likely spurious, because there is not enough variation in the data to tell real associations from chance ones. This violates the basic assumptions of the Maximum Likelihood (ML) principle, which lies behind the Ordinary Least Squares linear regression model we have been fitting. To see how easily such chance correlations arise, consider the sketch below.
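Here is a minimal demonstration (the choice of n = 10, p = 50 and the seed are ours, not part of the tutorial data): we draw many independent Gaussian variables for a handful of samples and report the largest pairwise correlation among them.

```
# sketch: spurious correlations appear when n is small and p is large
set.seed(1)
n <- 10                   # only a handful of observations
p <- 50                   # many independent features
X <- matrix(rnorm(n * p), n, p)
cors <- cor(X)            # p x p matrix of pairwise correlations
diag(cors) <- 0           # ignore the trivial self-correlations
max(abs(cors))            # the largest chance correlation
```

Even though every variable is independent by construction, the largest of the 1225 pairwise correlations typically comes out above 0.8. With this caveat in mind, let us go further and hit the case n = p: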

```
set.seed(123456)
n <- 20
p <- 20   # the number of features now equals the number of samples
Y <- rnorm(n)
X <- matrix(rnorm(n * p), n, p)
summary(lm(Y ~ X))
```

```
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34889        NaN     NaN      NaN
## X1           0.66218        NaN     NaN      NaN
## X2           0.76212        NaN     NaN      NaN
## X3          -1.35033        NaN     NaN      NaN
## X4          -0.57487        NaN     NaN      NaN
## X5           0.02142        NaN     NaN      NaN
## X6           0.40290        NaN     NaN      NaN
## X7           0.03313        NaN     NaN      NaN
## X8          -0.31983        NaN     NaN      NaN
## X9          -0.92833        NaN     NaN      NaN
## X10          0.18091        NaN     NaN      NaN
## X11         -1.37618        NaN     NaN      NaN
## X12          2.11438        NaN     NaN      NaN
## X13         -1.75103        NaN     NaN      NaN
## X14         -1.55073        NaN     NaN      NaN
## X15          0.01112        NaN     NaN      NaN
## X16         -0.50943        NaN     NaN      NaN
## X17         -0.47576        NaN     NaN      NaN
## X18          0.31793        NaN     NaN      NaN
## X19          1.43615        NaN     NaN      NaN
## X20               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA
```

What happened? Why do we see lots of “NA” and “NaN” values? The linear regression model could not be fit: with no residual degrees of freedom left, the standard errors and test statistics are undefined. If we increase p further, so that p > n or p >> n, things do not get any better. We have hit the limitations of the Maximum Likelihood (ML) principle, which demands many things such as a large sample size, normally distributed data, uncorrelated errors, homoscedasticity, etc. Let us now take a closer look at why exactly the ML math blows up when n <= p. Consider a linear model:

\[ Y = X\beta \]

Let us apply a few mathematical tricks in order to isolate the coefficients of the linear model:

\[ X^TY = X^TX\beta \]

\[ (X^TX)^{-1}X^TY = (X^TX)^{-1}X^TX\beta \]

\[ (X^TX)^{-1}X^TY = \beta \]

This is the solution of the linear model. We can see that it involves an inverse matrix. From Linear Algebra, the inverse of a matrix is inversely proportional to its determinant, and, again from Linear Algebra, the determinant of a matrix is equal to zero (or approaches zero) when its columns or rows are collinear, i.e. can be expressed as linear combinations of each other, i.e. are correlated. This implies that if we have a limited number of observations n and a large p such that p >= n, then, as we saw, some variables (at least two) become correlated just by chance alone (if X1 and X2 are each correlated with Y, they must be correlated with each other). In that case the determinant of \(X^TX\) approaches zero, the one-over-determinant factor leads to a singularity, and the solution diverges. Therefore the solution of the linear model no longer holds. This is what is meant by “the math blows up”.
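We can watch this happen numerically. Below is a minimal sketch reusing the simulation setup from above (the grid of p values is our choice): it tracks the reciprocal condition number of \(X^TX\), which drops towards zero, i.e. the matrix becomes effectively non-invertible, as p approaches and exceeds n.

```
# sketch: X^T X becomes ill-conditioned as p approaches n
set.seed(123)
n <- 20
for (p in c(2, 10, 19, 20, 30)) {
  X <- matrix(rnorm(n * p), n, p)
  XtX <- t(X) %*% X
  # rcond() returns the reciprocal condition number:
  # values near zero mean (X^T X)^{-1} is numerically unstable or undefined
  cat("p =", p, " rcond(X^T X) =", rcond(XtX), "\n")
}
```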

Now comes the question: how can we overcome the curse of dimensionality? The easiest answer would be: increase n and/or decrease p. Increasing the sample size is usually very expensive and often not feasible. If increasing n is not an option, Dimensionality Reduction, i.e. a way of conceptualizing p variables through m latent variables (where m << p), can be very useful. Thus the two main motivations for doing Dimensionality Reduction are the following:

- Dimensionality Reduction gives a handy way to visualize and cluster samples in 2D or 3D using all explanatory variables together
- Dimensionality Reduction is a good way to overcome the curse of dimensionality

Principal Component Analysis (PCA) is the simplest and most popular way to perform Dimensionality Reduction. There are numerous ways to think about PCA, i.e. it has an “infinite depth of understanding” with multiple layers. Despite its popularity and the inclination to view it as a “simple technique that everyone can do with just one line of code”, the method has many hidden pitfalls and can generate misleading results if applied without precautions. Below we describe several possible ways to understand PCA:

The basic idea of PCA is to collapse the p features (p >> n) down to just a few latent variables called principal components (a transformation to a space with at most min(n-1, p) directions), while keeping as much of the variation from the original p-dimensional space as possible in the low-dimensional representation.

Geometrically, PCA can be seen as a linear transformation involving a rotation and a shift of the coordinate system, performed in order to find the directions of most variation within the data. Hence PCA makes sense only if you suspect linear correlations between the variables in your data. For example, if two variables X1 and X2 are fairly correlated, one of them is redundant for the analysis and can be dropped. So if we put the origin of the coordinate system somewhere in the middle of the cloud of points, e.g. at (mean(X1), mean(X2)), and rotate the coordinate system so that the new first axis coincides with the main direction of covariation between X1 and X2, the variation along the new second axis becomes negligible and can be ignored, and we keep only the variation along the first axis. Thus we have performed Dimensionality Reduction, i.e. replaced (X1, X2) by just one coordinate without losing too much variation in the data.
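As a toy illustration of this geometric picture, here is a minimal sketch (the simulated X1, X2 and their noise level are our choice): two strongly correlated variables are rotated by PCA so that almost all the variation ends up along the first principal component.

```
# sketch: two correlated variables collapse onto one principal direction
set.seed(123)
X1 <- rnorm(100)
X2 <- 2 * X1 + rnorm(100, sd = 0.3)   # X2 is essentially X1 plus a little noise
PC <- prcomp(cbind(X1, X2), center = TRUE, scale. = FALSE)
summary(PC)      # PC1 should carry ~99% of the variance, PC2 is negligible
PC$rotation      # columns are the rotated axis directions (loadings)
```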

Often we hear that the PCA problem can be solved through Eigen Matrix Decomposition (another, faster way is Singular Value Decomposition (SVD)). Let us show how the search for the axes of maximal variation leads mathematically to the Eigen Matrix Decomposition problem. Let us define the projection (called a Principal Component) of a matrix X onto a basis (eigen) unit vector u as

\[ PC = u^T X = X^Tu \]

If X is a mean-centered matrix, then the mean of PC is equal to zero:

\[ <PC> = 0 \]

and the variance of PC is:

\[ <(PC-<PC>)^2> = <PC^2> = u^T X X^Tu \]

Here the matrix in the middle is called the variance-covariance matrix:

\[ X X^T=A \]

\[ <PC^2> = u^T Au \]

Let us now find the direction, i.e. the eigenvector u, that captures most of the variation in X, i.e. let us maximize the variance of PC, taking into account (via a Lagrange multiplier) the constraint that u is a unit vector:

\[ \max_u\left[u^T Au + \lambda(1-u^Tu)\right] \]

Differentiating this function with respect to u and setting the derivative to zero, one arrives at the eigenvector-eigenvalue problem:

\[ Au = \lambda u \]

where A is the variance-covariance matrix of the initial data X.
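We can verify this equivalence directly in R: the eigenvectors of the variance-covariance matrix coincide, up to an arbitrary sign flip, with the loadings returned by prcomp, and the eigenvalues equal the variances of the principal components. A minimal sketch on simulated data (the dimensions and seed are our choice):

```
# sketch: PCA via explicit eigendecomposition of the covariance matrix
set.seed(123)
X <- matrix(rnorm(100 * 5), 100, 5)   # 100 samples, 5 features
A <- cov(X)                           # variance-covariance matrix (mean-centering is implicit)
e <- eigen(A)                         # solves Au = lambda u
PC <- prcomp(X, center = TRUE, scale. = FALSE)
all.equal(e$values, PC$sdev^2)                        # eigenvalues = PC variances
all.equal(abs(e$vectors), abs(unname(PC$rotation)))   # eigenvectors = loadings (up to sign)
```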

Let us demonstrate how PCA works using the MNIST data set [1]. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

```
#Uncomment this in case you cannot change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/DimReductSingleCell"))
mnist <- read.csv("2017-10-13-mnist_train.csv")
labels <- mnist$label   # ground-truth digit (0-9) for each image
mnist$label <- NULL     # keep only the 784 pixel columns
mnist[1:10, 1:10]
```

```
##    pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
## 1       0      0      0      0      0      0      0      0      0      0
## 2       0      0      0      0      0      0      0      0      0      0
## 3       0      0      0      0      0      0      0      0      0      0
## 4       0      0      0      0      0      0      0      0      0      0
## 5       0      0      0      0      0      0      0      0      0      0
## 6       0      0      0      0      0      0      0      0      0      0
## 7       0      0      0      0      0      0      0      0      0      0
## 8       0      0      0      0      0      0      0      0      0      0
## 9       0      0      0      0      0      0      0      0      0      0
## 10      0      0      0      0      0      0      0      0      0      0
```

`dim(mnist)`

`## [1] 10000 784`

We will use the most basic native R function for PCA, prcomp. Here we perform PCA, look at the percentage of variation explained by the top principal components, and finally plot the MNIST digits in the space of the first two principal components.

```
# log-transform to soften the dominance of high-intensity pixels
PC <- prcomp(log10(mnist + 1), center = TRUE, scale. = FALSE)
vars <- PC$sdev^2          # variance captured by each principal component
vars <- vars / sum(vars)   # fraction of the total variance
barplot(100 * vars[1:10], names.arg = 1:10, xlab = "PCs",
        ylab = "PERCENT OF VARIANCE EXPLAINED",
        main = "PERCENT OF VARIANCE EXPLAINED BY PCs")
```

```
# assign one color per digit class (0-9)
colors <- rainbow(length(unique(labels)))
names(colors) <- unique(labels)
# draw an empty frame, then plot each sample as its digit label in PC1-PC2 space
plot(PC$x[, 1:2], t = 'n', main = "PCA PLOT WITH PRCOMP", xlab = "PC1", ylab = "PC2")
text(PC$x[, 1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)
```