```{r new setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#If you want to knit the whole notebook, please specify below correct path to the Rmd-file
knitr::opts_knit$set(root.dir=getwd())
```
### Introduction: Why to Select Good Features?
Imagine that we are interested in monitoring a variable Y (we will call it a Response), which can be e.g. a phenotype of interest (in biology), sell profit (in econometrics and business), reaction of a person on some treatment (in phycology) etc. We have collected 10 independent statistical observations / samples Y1, Y2,..., Y10 of the Response and we noticed some variation in the Response from sample to sample.
Now, we want to understand what this variation is due to. We need to know this in order to understand mechanisms (biological, economical etc.) behid this variation. Suppose that in addition to the Response Y, we also collected data related to potential drivers / causes of Y such as gene expression (in biology), customer's social status and income (in econometrics) etc. Suppose we collected 100 potential drivers / predictors / causes for each of Y1, Y2,..., Y10. We can represent those predictors as a matrix X with 100 columns (one for each predictor) and 10 rows (one for each observation Y1, Y2,..., Y10). We know that the variation in Y is probably due to some variables (columns / predictors) in X matrix, but do all of them equally explain the variation? Probably not, there might be some redundancy among the variables, so they are not all equally important / informative. Therefore, **it is reasonable to assume that only a fraction of the variables in X are causal for the variation in Y**, but which of them are causal? To answer this question we have to test the variables in X against Y, but how should we do it: test all of them together or one-by-one?
Here we have a typical biological case scanario where the number of drivers / causes / predictors (we will call them **features** in the future), p=100, is much greater than the number of samples / observations, n=10, **p>>n**. This case is called "the underdetermined system" in mathematics, it does not have a unique solution but infinitely many solutions. Therefore **if we want to select features explaining the variation in Response Y, we can not directly test all the features together without special precautions such as regularizations**. Therefore it makes sense to stick (at least in the beginning) to testing the features one-by-one. Testing features one-by-one is called **Univariate Feature Selection**, while testing all features together is refferred to as **Multivariate Feature Selection**.
### GTEX Skeletal Muscles Gene Expression Data Set
Here, we are going to go through methods for a) Univariate (one-by-one) Feature Selection, and b) Multivariate (all together) Feature Selection. For practicing the concept of Feature Selection, we will use the skeletal muscle gene expression subset from GTEX Human Tussue Gene Expression Consortium [1]. In order to speed up computations we randomly sampled 1000 genes / features from the data set. Here, we load the gene expression matrix X, remove lowly expressed genes and pre-view it:
```{r X,fig.width=10,fig.height=8}
#Uncomment this in case you can not change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/FeatureSelectionIntegrOMICs"))
X<-read.table("GTEX_SkeletalMuscles_157Samples_1000Genes.txt",header=TRUE,row.names=1,check.names=FALSE,sep="\t")
X<-X[,colMeans(X)>=1]
X[1:5,1:3]
dim(X)
```
We can see that the gene expression data set includes p = 546 expressed genes (features) and n = 157 samples, i.e. we work in the limit p >> n. The phenotype of interest we are going to address is Gender, i.e. we will figure out which of the 546 genes expressed in human skeletal muscles are associated with the phenotypic difference between Males and Females. Thus our response Y vector is the following:
```{r Y,fig.width=10,fig.height=8}
Y<-read.table("GTEX_SkeletalMuscles_157Samples_Gender.txt",header=TRUE,sep="\t")$GENDER
summary(Y)
length(Y)
```
The data set used here includes 99 Males and 58 Females. It is not perfectly balanced, ideally we would prefer roughly half of the samples to be assigned to each class. This is important for correct training in Machine Learning, otherwise a model tends to learn informative features from the majority class that can result in overfitting and problems with generalization. However, the Gender variable Y does not look too imbalanced and we continue working with it as is. To visualize the samples, let us display a PCA plot of the 157 samples.
```{r PCA,fig.width=10,fig.height=8}
library("mixOmics")
pca.gtex <- pca(X, ncomp=10)
pca.gtex
plot(pca.gtex)
plotIndiv(pca.gtex, group = Y, ind.names = FALSE, legend = TRUE, title = 'PCA on GTEX Skeletal Muscles')
```
The PCA plot demonstrates that there is a lot of variation between samples with respect to both PC1 and PC2, but there is no clear seggregation of Males and Females based on their skeletal muscle gene expression data. Now, we are going to start with a simple gene-by-gene Univariate Feature Selection, later and compare its results with the Multivariate Feature Selection.
### Univariate Feature Selection
One way to understand what genes stand behind the variation between (Males and Females) samples would be to test correlation of each individual feature (gene, X variable) against a phenotype of interest (Gender, Y variable), in our case this is equivalent to a simple Differential Gene Expression (DGE) analysis. Here, we will use a simple non-parametric Spearman correlation for inferring relation between X and Y, one can alternatively use other measures of relatedness like Mann-Whittney U test ("wilcox.test" function in base R), Linear Regression ("lm" function in base R), Distance Correlations ("dcor" function in "energy" R package), Maximal Information Coefficient (MIC) ("mine" function in "minerva" R package) etc.
```{r warning=FALSE}
rho<-vector()
p<-vector()
a<-seq(from=0,to=dim(X)[2],by=100)
for(i in 1:dim(X)[2])
{
corr_output<-cor.test(X[,i],as.numeric(as.factor(Y)),method="spearman")
rho<-append(rho,as.numeric(corr_output$estimate))
p<-append(p,as.numeric(corr_output$p.value))
if(isTRUE(i%in%a)==TRUE){print(paste("FINISHED ",i," FEATURES",sep=""))}
}
output<-data.frame(GENE=colnames(X), SPEARMAN_RHO=rho, PVALUE=p)
output$FDR<-p.adjust(output$PVALUE,method="BH")
output<-output[order(output$FDR,output$PVALUE,-output$SPEARMAN_RHO),]
head(output,10)
```
We have ranked all genes by their contribution to the variation in skeletal muscles gene expression between Males and Females. The ranking is based on Spearman correlation p-value which was adjusted (FDR column) to acount for the effect of multiple independent statistical tests.
Now there is a temptation to take the top differentially expressed genes with e.g. FDR < 0.05 and build a prediction score that can be used for descriminating between Males and Females based on skeletal muscle gene expression in any other cohort. Why do we need that kind of prediction score? Suppose the phenotype of interest is a disease status (Sick-Healthy), then this prediction is of a major need and importance for clinical diagnostics in e.g. cancer and diabetes. In this, one can potentially make a more informed decision about e.g. cancer or diabetes subtype by utilizing molecular level information such as e.g. gene expression from the tissue of interest.
However, in practice this type of prediction based on Univariate Feature Selection works quite poorly. The reason is that the Univariate Feature Selection has at least two severe problems that we have not been addressed here yet.
* Univariate Feature Selection does not fully overcome the p >> n obstacle (FDR correction turns out to be not good enough), i.e. it is prone to overfitting and has a poor generalization.
* Univariate Feature Selection does not account for multi-collinearity between features, i.e. when different features are correlated with each other.
For more information and comparison of predictions based on Univariate and Multivariate analyses, please check [this blog post](https://towardsdatascience.com/univariate-vs-multivariate-prediction-c1a6fb3e009?sk=c7af5a520fda83726367080f9d5df46e)
The shortcomings mentioned above can be addressed with Sparse Linear Models, i.e. models with regularization penalties like LASSO, Ridge and Elastic Net which are basic techniques for Multivariate Feature Selection.
### Multivariate Feature Selection: LASSO, Ridge, Elastic Net
The simplest way to account for all explanatory variables (genes) in X simultaneously would be to put them into the the multiple / multivariate linear regression model and perform Ordinary Least Squares minimization:
$$
Y = \beta_1X_1+\beta_2X_2+\epsilon
$$
$$
\textrm{OLS} = (y-\beta_1X_1-\beta_2X_2)^2
$$
Here, for simplicity we used only two predictors X1 and X2, but there can be thousands and millions of them. It implies that in order to minimize the OLS cost function we have to do it in a highly-dimensional space which is notoriously difficult because of the **Curse of Dimensionality**. This leads to a very unstable sulution of multiple linear reression. To vercome this obstacle we can add a penalty term to the OLS cost function:
$$
\textrm{Penalized OLS} = (y-\beta_1X_1-\beta_2X_2)^2 + \lambda[\alpha(|\beta_1|+|\beta_2|)+(1-\alpha)(\beta_1^2+\beta_2^2)]
$$
Here, $\lambda$ is called a Lagrange multiplier, and is a measure of how much penalty we would like to put on our Linear Regression Model, its optimal value is usually found through K-fold cross-validation. The parameter $\alpha$ is usually fixed (but in principle can also be found through cross-validation) and the regularization is called 1) LASSO if $\alpha=1$, 2) Ridge if $\alpha=0$, and 3) Elastic Net if $\alpha=0.5$. These penalty methods have a few differences that are good to remember when you select a method for your analysis. LASSO is the most strict penalty and works best at the data with lots of noise. A problem of LASSO is that it can not handle multi-collinearity among predictors in a good way. If two variables are strongly correlated, LASSO will select only one of them (by chance) and set the coefficient in front of the other one to zero. Sometimes this type of selection can be problematic if it happens that the feature that was ignored / omitted by LASSO has more physical / biological meaning than the one that was kept by LASSO. This situation can be avoided with Ridge penalty, in addition Ridge is much more stable for numerical minimization as it provides a fully convex manifold in a multi-dimensional space. However, in ultra-higly-dimensional spaces Ridge can be too allowing and deliver too many "noisy" features that might not be very interesting. Elastic Net penalty provides a good compromise between LASSO and Ridge and is generally prefered and recommended by Machine Learning practicioners.
In the example below we will run LASSO penalty on Y vs. X Linear Model and find an optimal value of $\lambda$ via 10-fold cross-validation:
```{r LASSO,fig.width=10,fig.height=8}
library("glmnet")
lasso_fit <- cv.glmnet(as.matrix(X), Y, family="binomial", alpha=1)
plot(lasso_fit)
lasso_fit$lambda.min
log(lasso_fit$lambda.min)
```
Once we know the optimal $\lambda$, we can display the names of the most informative features selected by LASSO for that optimal $\lambda$.
```{r}
coef<-predict(lasso_fit, s = "lambda.min", type = "nonzero")
colnames(X)[unlist(coef)]
```
We can see that LASSO selected `r dim(coef)[1]` most informative features and set the coefficients in front of the other features to zero. Finally, let us use LASSO scoring system for ranking selected features by their importance:
```{r}
result<-data.frame(GENE = names(as.matrix(coef(lasso_fit, s = "lambda.min"))[as.matrix(coef(lasso_fit, s = "lambda.min"))[,1]!=0, 1])[-1], SCORE = as.numeric(as.matrix(coef(lasso_fit, s = "lambda.min"))[as.matrix(coef(lasso_fit, s = "lambda.min"))[,1]!=0, 1])[-1])
result<-result[order(-abs(result$SCORE)),]
head(result,10)
```
We conclude that the features selected by Multivariate Feature Selection approach do not look quite similar to the ones selected by Univariate Feature Selection in the previous section. This is generally the case in practice and it is good to remember that the features selected in Multivariate fashion tend to have a higher predictive capacity.
### Multivariate Feature Selection: PLS
Another elegant Multivariate Feature Selection method is the Partial Least Squares (PLS) regression which is also called (by its author) Projection to Latent Structures (PLS). The idea behind PLS is very simple, it perfoms feature selection as a group via maximizing the covariance between X and Y:
$$
\max_{\beta}\textrm{cov}(X,Y) \Longrightarrow \hat\beta
$$
This algorithm can roughly be viewed as a process of collective selection of features that provide the largest separation in a lower dimensional latent space like PCA plot. The maximized covariance matrix (build on selected most informative features) can then be factorized (expanded into the series of orthogonal components) and visualized:
```{r PLS, warning=FALSE, fig.width=10,fig.height=8}
library("mixOmics")
gtex.plsda <- plsda(X, Y, ncomp = 2)
background = background.predict(gtex.plsda, comp.predicted = 2, dist = "max.dist")
plotIndiv(gtex.plsda , comp = 1:2, group = Y, ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'PLSDA on GTEX Skeletal Muscles', background = background)
```
Now we observe a much more clear separation between Males and Females compared to the PCA plot above. This separation is achied by selecting most informative features that correlate with Gender. The selected features can be visualized and ranked by their importances via the PLS loadings:
```{r PLS Loadings, fig.width=10,fig.height=8}
plotLoadings(gtex.plsda, comp = 1, title = 'Loadings on comp 1', contrib = 'max', method = 'median', ndisplay = 10, size.name = 0.6)
plotLoadings(gtex.plsda, comp = 2, title = 'Loadings on comp 2', contrib = 'max', method = 'median', ndisplay = 10, size.name = 0.6)
```
Here the x-axis represents the importance weight of each feature, and the color codes that the feature (gene) on the y-axis has a higher abundance (expression) in the Males or Females, please compare the colors in the legend with the colors of the bars.
Again, we conclude that the Multivariate Feature Selection via PLS provided a set of features that looks quite different from the one-by-one feature selection with Spearman correlation.
### Multivariate Feature Selection: Linear Discriminant Analysis (LDA)
Finally, we would like to mention that there is another interesting type of linear multivariate feature selection which is the Linear Discriminant Analysis (LDA). The idea of the method is to group samples according to their labeles (e.g. Males vs. Females) so that the variance is minimized within each group and maximized between the groups (Males and Females). Despite the idea of LDA seems different from PLS and LASSO at first glance, it can be shown that LDA conceptually falls within the same group of linear multivariate feature selection methods as PLS and LASSO. All three approaches, i.e. LDA, PLS and LASSO / Ridge / Elastic Net, are used interchangebly and often result in very similar sets of most informative features providing the sample size is large enough. In this tutorial, we unfortunately can not demonstrate LDA since it requires the Y response variable to have at lest three classes (and we have only two, Males and Females, for our test data set). However, we recommend checking and testing [scikit-learn implementation of LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) if you have a data set with >=3 classes.
### References
[1] The Genotype-Tissue Expression (GTEx) project. The GTEx Consortium.
Nature Genetics. 29 May 2013. 45(6):580-5.
### RSession
```{r}
sessionInfo()
```