Feature Selection

Abstract

Here we will cover basic concepts of univariate and multivariate feature selection using LASSO, Ridge, Elastic Net and Partial Least Squares (PLS) regression models.

Imagine that we are interested in monitoring a variable Y (we will call it the Response), which can be, e.g., a phenotype of interest (in biology), sales profit (in econometrics and business), a person's reaction to some treatment (in psychology), etc. We have collected 10 independent statistical observations / samples Y1, Y2, …, Y10 of the Response and noticed some variation in the Response from sample to sample.

Now we want to understand what this variation is due to. We need to know this in order to understand the mechanisms (biological, economic, etc.) behind it. Suppose that, in addition to the Response Y, we also collected data on potential drivers / causes of Y, such as gene expression (in biology), a customer's social status and income (in econometrics), etc. Suppose we collected 100 potential drivers / predictors / causes for each of Y1, Y2, …, Y10. We can represent those predictors as a matrix X with 100 columns (one per predictor) and 10 rows (one per observation Y1, Y2, …, Y10). We know that the variation in Y is probably due to some of the variables (columns / predictors) in the X matrix, but do all of them explain the variation equally well? Probably not: there may be some redundancy among the variables, so they are not all equally important / informative. Therefore, **it is reasonable to assume that only a fraction of the variables in X are causal for the variation in Y**, but which of them are causal? To answer this question we have to test the variables in X against Y, but how should we do it: test all of them together or one-by-one?
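The setup above can be sketched in R with simulated data (all names and numbers below are illustrative, not part of the GTEX data used later):

```
set.seed(1)
n <- 10    # number of observations / samples of the Response
p <- 100   # number of potential drivers / predictors
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(paste0("Sample", 1:n),
                            paste0("Predictor", 1:p)))
# assume, for illustration, that only the first 5 predictors are causal
Y <- as.vector(X[, 1:5] %*% rep(1, 5) + rnorm(n, sd = 0.5))
dim(X)       # 10 observations x 100 predictors
length(Y)    # 10 observations of the Response
```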

Here we have a typical biological scenario where the number of drivers / causes / predictors (we will call them **features** from now on), p = 100, is much greater than the number of samples / observations, n = 10, i.e. **p >> n**. In mathematics this is called an “underdetermined system”: it does not have a unique solution but infinitely many solutions. Therefore, **if we want to select features explaining the variation in the Response Y, we cannot directly test all the features together without special precautions such as regularization**. So it makes sense to stick (at least in the beginning) to testing the features one-by-one. Testing features one-by-one is called **Univariate Feature Selection**, while testing all features together is referred to as **Multivariate Feature Selection**.
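A quick simulated demonstration of why the p >> n case is problematic: ordinary least squares with all features at once cannot estimate the coefficients, while univariate tests still run (this is only a sketch of the underdetermination issue, not a recommended analysis):

```
set.seed(1)
n <- 10; p <- 100
X <- matrix(rnorm(n * p), nrow = n)
Y <- rnorm(n)

# Multivariate attempt: ordinary least squares with all 100 features at
# once is underdetermined (101 coefficients, 10 observations), so most
# coefficients cannot be estimated and come back as NA
fit <- lm(Y ~ X)
sum(is.na(coef(fit)))   # 91 of the 101 coefficients are NA

# Univariate alternative: test each feature against Y separately,
# collecting the p-value of the slope from a simple linear model
pvals <- apply(X, 2, function(x) summary(lm(Y ~ x))$coefficients[2, 4])
length(pvals)           # one p-value per feature
```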

Here, we are going to go through methods for a) Univariate (one-by-one) Feature Selection, and b) Multivariate (all-together) Feature Selection. To practice the concept of Feature Selection, we will use the skeletal muscle gene expression subset from the GTEX Human Tissue Gene Expression Consortium [1]. In order to speed up computations, we randomly sampled 1000 genes / features from the data set. Here, we load the gene expression matrix X, remove lowly expressed genes and preview it:

```
#Uncomment this in case you cannot change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/FeatureSelectionIntegrOMICs"))

# Load the gene expression matrix: samples in rows, genes in columns
X<-read.table("GTEX_SkeletalMuscles_157Samples_1000Genes.txt",header=TRUE,row.names=1,check.names=FALSE,sep="\t")

# Remove lowly expressed genes, i.e. genes with mean count below 1
X<-X[,colMeans(X)>=1]
X[1:5,1:3]
```

```
## ENSG00000243824.1_RP11-434O22.1
## GTEX-N7MS-0426-SM-2YUN6 2
## GTEX-NFK9-0626-SM-2HMIV 0
## GTEX-NPJ8-1626-SM-2HMIY 0
## GTEX-O5YT-1626-SM-32PK6 0
## GTEX-OHPM-1626-SM-2HMK4 0
## ENSG00000140527.10_WDR93 ENSG00000205352.6_PRR13
## GTEX-N7MS-0426-SM-2YUN6 2 543
## GTEX-NFK9-0626-SM-2HMIV 0 1482
## GTEX-NPJ8-1626-SM-2HMIY 3 1958
## GTEX-O5YT-1626-SM-32PK6 0 1174
## GTEX-OHPM-1626-SM-2HMK4 7 1092
```

`dim(X)`

`## [1] 157 546`

We can see that the gene expression data set includes p = 546 expressed genes (features) and n = 157 samples, i.e. we work in the p >> n limit. The phenotype of interest we are going to address is Gender, i.e. we will figure out which of the 546 genes expressed in human skeletal muscle are associated with the phenotypic difference between Males and Females. Thus our Response vector Y is the following:

```
Y<-read.table("GTEX_SkeletalMuscles_157Samples_Gender.txt",header=TRUE,sep="\t")$GENDER
summary(Y)
```

```
## Female Male
## 58 99
```

`length(Y)`

`## [1] 157`

The data set used here includes 99 Males and 58 Females, so it is not perfectly balanced; ideally we would prefer roughly half of the samples in each class. This matters for correct training in Machine Learning: otherwise a model tends to learn informative features from the majority class, which can result in overfitting and poor generalization. However, the Gender variable Y does not look too imbalanced, so we continue working with it as is. To visualize the samples, let us display a PCA plot of the 157 samples.
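As a quick sanity check, the class proportions can be computed directly; here we rebuild Y from the counts reported above so the snippet is self-contained:

```
# Quick check of class balance (counts taken from the summary above)
Y <- factor(c(rep("Female", 58), rep("Male", 99)))
round(prop.table(table(Y)), 2)
# roughly 37% Female vs 63% Male: imbalanced, but not severely so
```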

`library("mixOmics")`

`## Loading required package: MASS`

`## Loading required package: lattice`

`## Loading required package: ggplot2`

```
##
## Loaded mixOmics 6.10.1
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
```

```
pca.gtex <- pca(X, ncomp=10)
pca.gtex
```

```
## Eigenvalues for the first 10 principal components, see object$sdev^2:
## PC1 PC2 PC3 PC4 PC5 PC6
## 11979554198 1922793376 470907790 173035873 83960716 38937526
## PC7 PC8 PC9 PC10
## 29568540 24951919 19376723 17467325
##
## Proportion of explained variance for the first 10 principal components, see object$explained_variance:
## PC1 PC2 PC3 PC4 PC5 PC6
## 0.804731856 0.129164496 0.031633439 0.011623761 0.005640098 0.002615646
## PC7 PC8 PC9 PC10
## 0.001986280 0.001676156 0.001301640 0.001173375
##
## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.8047319 0.9338964 0.9655298 0.9771536 0.9827937 0.9854093 0.9873956 0.9890717
## PC9 PC10
## 0.9903734 0.9915467
##
## Other available components:
## --------------------
## loading vectors: see object$rotation
```
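As a cross-check of the quantities reported above, the eigenvalues and explained-variance proportions can be reproduced with base R's `prcomp` on the column-centered matrix; this sketch uses simulated data in case the GTEX file is not at hand:

```
# PCA via base R: eigenvalues are the squared singular values pc$sdev^2,
# and the explained-variance proportions are their normalized values
set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
eigenvalues <- pc$sdev^2                      # cf. object$sdev^2 above
explained   <- eigenvalues / sum(eigenvalues) # proportion of explained variance
cumsum(explained)[1:3]                        # cumulative proportion
```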

`plot(pca.gtex)`

`plotIndiv(pca.gtex, group = Y, ind.names = FALSE, legend = TRUE, title = 'PCA on GTEX Skeletal Muscles')`