```{r new setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir=getwd())
```

### Why Do Dimensionality Reduction?

Dimensionality Reduction is really not just about visualization, as many of us might think. It is a necessity in Data Science in order to overcome the Curse of Dimensionality, also known as Rao's paradox. What is it about? When we work with data we have n observations (samples) of p variables (features). Very often (almost always, unless you are lucky) we have p >> n, i.e. a high-dimensional space. It turns out that classical Frequentist statistics blows up in a high-dimensional space, i.e. the conclusions of the models are no longer valid (robust). Let us simulate just a few (n = 20) observations of a response variable Y and a few (p = 2) predictor variables encapsulated in a matrix X, and run a simple linear association between X and Y:

```{r}
set.seed(123)
n<-20
p<-2
Y<-rnorm(n)
Y
X<-matrix(rnorm(n*p),n,p)
X
summary(lm(Y~X))
```

Looks good: the variables are not related, as expected (since they are drawn from a Gaussian distribution), but the math works, no problems as long as n > p. Let us now increase the number of features p and see what happens.

```{r}
set.seed(123456)
n<-20
p<-10
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

Oops! What happened? Some explanatory variables in X appear to be significantly associated with Y. How come, when we drew them from a Gaussian distribution? The reason is that we have a limited number of observations n, so any two variables with just a few observations can be correlated by chance alone. Roughly speaking, if you have 10 samples and 5 variables, the correlations you observe between the variables may well be spurious: any two variables can appear significantly correlated by chance alone, simply because there is not enough variation in the data to tell the difference. This violates the basic assumptions of the Maximum Likelihood (ML) principle, which lies behind the Ordinary Least Squares Linear Regression Model we have been fitting. Let us go further and hit the case n = p:

```{r}
set.seed(123456)
n<-20
p<-20
Y<-rnorm(n)
X<-matrix(rnorm(n*p),n,p)
summary(lm(Y~X))
```

What happened, why do we see lots of "NA"? The Linear Regression Model could not converge. If we further increase p, so that p > n or p >> n, the convergence will not get any better. We have hit the limitations of the Maximum Likelihood (ML) principle, which demands many things such as a large sample size, Normally distributed data, uncorrelated errors, homoscedasticity etc. Let us now take a closer look at why exactly the ML math blows up when n <= p. Consider a linear model:

$$
Y = X\beta
$$

Let us make a few mathematical manipulations in order to get a solution for the coefficients of the linear model:

$$
X^TY = X^TX\beta
$$

$$
(X^TX)^{-1}X^TY = (X^TX)^{-1}X^TX\beta
$$

$$
(X^TX)^{-1}X^TY = \beta
$$

This is the solution of the linear model. We can see that it is proportional to the inverse of the matrix $X^TX$. From Linear Algebra, an inverse matrix is inversely proportional to the determinant of that matrix. Again from Linear Algebra, the determinant of a matrix is equal to zero (or approaches zero) when the columns or rows of the matrix are collinear, i.e. can be expressed as linear combinations of each other, i.e. are correlated.
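We can check this solution directly (a minimal sketch on the simulated data above, not part of the original analysis): the normal-equations formula reproduces the coefficients reported by lm(), and the determinant of $X^TX$ collapses towards zero as soon as the columns of X become collinear.

```{r}
set.seed(123)
n <- 20
p <- 2
Y <- rnorm(n)
X <- matrix(rnorm(n * p), n, p)

# Normal-equations solution beta = (X^T X)^{-1} X^T Y (with an intercept column)
Xi <- cbind(1, X)
beta_manual <- solve(t(Xi) %*% Xi) %*% t(Xi) %*% Y
cbind(manual = as.vector(beta_manual), lm = coef(lm(Y ~ X)))

# Make the second column of X an (almost) exact copy of the first one:
# the determinant of X^T X collapses and the inverse becomes numerically unstable
X_collinear <- cbind(X[, 1], X[, 1] + rnorm(n, sd = 1e-8))
det(t(X_collinear) %*% X_collinear)
```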
This implies that if we have a limited number of observations n and a large p such that p >= n, then, as we saw above, some variables (at least two) can become correlated just by chance alone (if X1 and X2 are each correlated with Y, they must also be correlated with each other). The determinant of $X^TX$ then approaches zero, so its inverse (one over the determinant) diverges, i.e. we hit a singularity, and the solution of the linear model no longer holds. This is what is meant by "the math blows up".

Now comes the question: how can we overcome the Curse of Dimensionality? The easiest answer would be: increase n and/or decrease p. Increasing the sample size is usually very expensive and often not feasible. If increasing n is not an option, Dimensionality Reduction, i.e. a way of summarizing the p variables in m latent variables (where m << p), can be very useful. Thus the two main motivations for doing Dimensionality Reduction are the following:

* Dimensionality Reduction gives a handy way to visualize and cluster samples in 2D or 3D using all explanatory variables together
* Dimensionality Reduction is a good way to overcome the Curse of Dimensionality

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is the simplest and most popular way to perform Dimensionality Reduction. There are numerous ways to think about PCA, i.e. it has an "infinite depth of understanding" with multiple layers. Despite its popularity and the inclination to view it as a "simple technique that everyone can do with just one line of code", the method has many hidden pitfalls and can generate misleading results if applied without precautions. Below we describe possible ways to understand PCA in bullet-point fashion:

* The basic idea of PCA is to collapse p features (p >> n) down to just a few latent variables called principal components (a transformation to a space with at most min(n-1, p) directions), while keeping as much of the variation of the original p-dimensional space as possible in the low-dimensional space.

* Geometrically, PCA can be seen as a linear transformation involving a rotation and a shift of the coordinate system that finds the directions of most variation within the data. Hence PCA makes sense only if you suspect linear correlation between the variables in your data. For example, if two variables X1 and X2 are fairly correlated, one of them is redundant for the analysis and can be dropped. So if we put the origin of the coordinate system somewhere in the middle of the cloud of points, e.g. at (mean(X1), mean(X2)), and rotate the coordinate system so that the new first axis coincides with the main direction of covariation between X1 and X2, the variation along the new second axis becomes negligible and can be ignored, so we keep only the variation along the first axis. Thus we have performed Dimensionality Reduction, i.e. replaced (X1, X2) by a single coordinate without losing too much variation in the data.

* Often we hear that the PCA problem can be solved through Eigen Matrix Decomposition (the other, and faster, way is Singular Value Decomposition (SVD)). Let us show how finding the axes of maximal variation mathematically leads to the Eigen Matrix Decomposition problem; a quick numerical sanity check of this equivalence is shown right below, and the derivation follows after it.
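As a quick check (a minimal sketch on simulated data, not part of the original analysis), the variances and rotation returned by prcomp coincide, up to an arbitrary sign flip of the eigenvectors, with the eigenvalues and eigenvectors of the variance-covariance matrix:

```{r}
set.seed(1)
# Simulate a small data matrix: 50 samples, 4 features, two of them correlated
Z <- matrix(rnorm(200), 50, 4)
Z[, 2] <- Z[, 1] + rnorm(50, sd = 0.3)

pca <- prcomp(Z, center = TRUE, scale. = FALSE)
eig <- eigen(cov(Z))

# Eigenvalues of the covariance matrix equal the squared sdev from prcomp
cbind(prcomp = pca$sdev^2, eigen = eig$values)

# Eigenvectors match the PCA rotation up to sign
round(abs(pca$rotation) - abs(eig$vectors), 6)
```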
Let us define a projection (called a Principal Component) of a matrix X onto a unit (basis, eigen) vector u as

$$
PC = u^TX
$$

If X is a mean-centered matrix, then the mean of PC is equal to zero,

$$
\langle PC \rangle = 0,
$$

and the variance of PC is:

$$
\langle (PC - \langle PC \rangle)^2 \rangle = \langle (u^TX)(X^Tu) \rangle = u^TXX^Tu
$$

Here the matrix in the middle is called the variance-covariance matrix:

$$
XX^T = A
$$

$$
\langle (PC - \langle PC \rangle)^2 \rangle = u^TAu
$$

Let us now find the direction, i.e. the eigenvector u, that captures most of the variation in X, i.e. let us maximize the variance of PC, taking into account (via a Lagrange multiplier) that u is a unit vector ($u^Tu = 1$):

$$
\max_u\left(u^TAu + \lambda(1 - u^Tu)\right)
$$

Differentiating this function with respect to u and setting the derivative to zero, one arrives at the eigenvector-eigenvalue problem:

$$
Au = \lambda u
$$

where A is the variance-covariance matrix of the initial data X.

Let us demonstrate how PCA works using the MNIST data set [1]. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

```{r Read MNIST}
#Uncomment this in case you can not change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/DimReductSingleCell"))
mnist<-read.csv("2017-10-13-mnist_train.csv")
labels<-mnist$label
mnist$label<-NULL
mnist[1:10,1:10]
dim(mnist)
```

We will use the base R function for PCA, "prcomp". Here we perform PCA, look at the percentage of variation explained by the top principal components and finally plot the MNIST digits.

```{r PCA,fig.width=10,fig.height=8}
PC<-prcomp(log10(mnist + 1), center=TRUE, scale=FALSE)
vars<- PC$sdev^2
vars<- vars/sum(vars)
barplot(vars[1:10],names.arg=1:10,xlab="PCs",ylab="PERCENT OF VARIANCE EXPLAINED",main="PERCENT OF VARIANCE EXPLAINED BY PCs")
colors <- rainbow(length(unique(labels)))
names(colors) <- unique(labels)
plot(PC$x[,1:2], t='n',main="PCA PLOT WITH PRCOMP", xlab="PC1",ylab="PC2")
text(PC$x[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)
```

Obviously, replicas of the same digit tend to cluster together, i.e. zeros cluster together with zeros etc. However, they are still quite mixed and do not form distinct clusters. This might be a result of non-linear relations between variables which cannot be captured in 2D by a linear transformation.

### Multi-Dimensional Scaling (MDS)

Next, we will consider another popular linear Dimensionality Reduction technique called Multi-Dimensional Scaling, sometimes also called Principal Coordinate Analysis (PCoA). The principle of Eigen Matrix Decomposition holds here as well; the only difference is that we decompose not the variance-covariance matrix of the initial data X, but a matrix of pairwise Euclidean distances between the samples in X. For the Multi-Dimensional Scaling plot we will use the "cmdscale" R function.

```{r MDS,fig.width=10,fig.height=8}
d<-dist(log10(mnist + 1))
mds<-cmdscale(d, k = 2)
plot(mds[,1:2], t='n',main="MDS PLOT WITH CMDSCALE", xlab="DIM1",ylab="DIM2")
text(mds[,1:2], labels = labels, col = colors[as.character(labels)], cex = 0.5)
```

One can see that MDS gives quite a similar 2D representation to PCA, and this is not at all surprising if one thinks about the relation between the Euclidean distance and the variance-covariance matrix. Let us expand the squared Euclidean distance between two points, i.e. two rows $x_i$ and $x_j$ of the data X:

$$
||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2x_i^Tx_j
$$

The last term in the expansion is nothing else than an element of the variance-covariance (Gram) matrix $XX^T$.
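We can make this relation concrete with a small sketch (simulated data, not part of the original analysis): classical MDS on the Euclidean distance matrix of mean-centered data reproduces the PCA scores up to an arbitrary sign flip.

```{r}
set.seed(2)
# Small simulated data set: 30 samples, 5 features, mean-centered columns
W <- scale(matrix(rnorm(150), 30, 5), center = TRUE, scale = FALSE)

pca_scores <- prcomp(W, center = FALSE, scale. = FALSE)$x[, 1:2]
mds_coords <- cmdscale(dist(W), k = 2)

# Classical MDS coordinates equal PCA scores up to sign
round(abs(pca_scores) - abs(mds_coords), 6)
```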
So the Euclidean distance and the variance-covariance matrix are linearly related, therefore it is not surprising that they give us similar results. Often PCA is performed on a correlation matrix (i.e. a matrix of pairwise correlations between the variables in X) instead of the variance-covariance matrix. Again, this is essentially the same thing, since according to Pearson's definition of the correlation coefficient:

$$
\rho_{xy} = \frac{cov(x,y)}{\sigma_x\sigma_y}
$$

So the Euclidean distance, the variance-covariance matrix and the correlation coefficient are linearly related and should bring similar matrix decomposition results, i.e. eigenvectors and eigenvalues.

### t-distributed Stochastic Neighbor Embedding (tSNE)

PCA or MDS make sense when we suspect linear relations between the variables in X. Sometimes, however, the correlation between two variables can be zero; does this mean that the two variables are not related? No, it does not: the relationship can be non-linear, e.g. quadratic, logarithmic, sinusoidal etc. To capture non-linear relationships between observations there are non-linear Dimensionality Reduction techniques such as tSNE, Isomap, LLE, Self-Organizing Maps etc. Among them tSNE is especially popular in many Data Science areas due to its interesting visualization properties. In a nutshell, tSNE projects high-dimensional data into a low-dimensional space in such a way that points which are close/far in the high-dimensional space are also close/far in the low-dimensional space. tSNE has its special way of measuring similarity in the high- and low-dimensional spaces: the Gaussian law

$$
p_{ij} \sim \exp{(-||x_i-x_j||^2/2\sigma^2)}
$$

is used in the high-dimensional space, and the heavy-tailed Student t-distribution is used for measuring similarities in the low-dimensional space:

$$
q_{ij} \sim (1+||y_i-y_j||^2)^{-1}
$$

In order to make the distributions of points in the high- and low-dimensional spaces as similar as possible, their mismatch is quantified with the Kullback-Leibler divergence, also known as relative entropy in Information Theory:

$$
KL = \sum_{i \neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}
$$

The Kullback-Leibler divergence is minimized iteratively with a gradient descent method. The cost function is asymmetric: it assigns a low cost to points that are far apart in the high-dimensional space (p close to 0) but close in the low-dimensional space (q close to 1), compared to the opposite situation, where points are close in the high-dimensional space (p close to 1) but far in the low-dimensional space (q close to 0). This leads to a more "condensed" representation of the data in the low-dimensional space. The bandwidth $\sigma$ in the exponent of the p matrix is not set directly but through a user-defined parameter called perplexity: for each point, $\sigma$ is tuned so that the effective number of neighbors matches the perplexity. In other words, perplexity balances local and global structure, i.e. how close or far the points should be placed with respect to each other; simply put, it reflects the number of neighbors each point has in the high-dimensional space. Let us use the MNIST data set and check how the tSNE plot looks:

```{r tSNE,fig.width=10,fig.height=8}
library("Rtsne")
set.seed(12)
tsne.out<-Rtsne(log10(mnist + 1), initial_dims = 20, verbose = TRUE, perplexity = 30, max_iter = 1000)
plot(tsne.out$Y, t = 'n', main = "tSNE MNIST", xlab="tSNE1",ylab="tSNE2")
text(tsne.out$Y, labels = labels, col = colors[as.character(labels)], cex = 0.5)
```

It is obvious that the clouds of different digits look more distinct now compared to the linear Dimensionality Reduction representations.
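To make the cost function above more concrete, here is a toy sketch (our own illustration, not part of the Rtsne pipeline) that computes the Gaussian affinities p, the Student-t affinities q and their Kullback-Leibler divergence for a handful of points. It assumes a single global $\sigma$, whereas the real algorithm tunes a separate $\sigma$ per point via the perplexity:

```{r}
set.seed(3)
# A handful of high-dimensional points and a random low-dimensional embedding
Xh <- matrix(rnorm(5 * 10), 5, 10)   # 5 points in 10 dimensions
Yl <- matrix(rnorm(5 * 2), 5, 2)     # 5 points in 2 dimensions

sigma <- 1  # single global bandwidth (illustrative simplification)

# Gaussian affinities in the high-dimensional space, normalized to sum to 1
Dh <- as.matrix(dist(Xh))^2
P  <- exp(-Dh / (2 * sigma^2)); diag(P) <- 0; P <- P / sum(P)

# Student-t affinities in the low-dimensional space, normalized to sum to 1
Dl <- as.matrix(dist(Yl))^2
Q  <- 1 / (1 + Dl); diag(Q) <- 0; Q <- Q / sum(Q)

# Kullback-Leibler divergence KL(P || Q): the quantity tSNE minimizes
sum(P[P > 0] * log(P[P > 0] / Q[P > 0]))
```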
Thus tSNE is handy when there are non-linear relations between data points which cannot be captured by PCA or MDS. One caution is important to keep in mind: due to its highly non-linear nature, the visual distances on a tSNE plot do not necessarily reflect the true distances in the high-dimensional space. In other words, it is hard to say with certainty how far or how close two clusters on the tSNE plot are, since tSNE distances do not have a trivial meaning. Another consequence of the non-linear transformation is that the features that drive the clustering on the tSNE plot are not easy to extract, since we are not doing any linear matrix decomposition as with e.g. PCA. Finally, we will also compare the tSNE embeddings with the UMAP visualization of the MNIST data set:

```{r UMAP,fig.width=10,fig.height=8}
library("uwot")
mnist_umap <- umap(mnist, n_neighbors = 30, pca = 20, min_dist = 0.01, verbose = TRUE, n_threads = 4)
head(mnist_umap)
plot(mnist_umap, t = 'n', main = "UMAP MNIST", xlab="UMAP1",ylab="UMAP2")
text(mnist_umap, labels = labels, col = colors[as.character(labels)], cex = 0.5)
```

It looks like UMAP provides more condensed embeddings, and the distances between clusters appear more meaningful compared to tSNE.

### References

[1] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Finally, here are the details of the system on which this document was compiled:

### Session Info

```{r}
sessionInfo()
```