Introduction: Why Select Good Features?

Imagine that we are interested in monitoring a variable Y (we will call it a Response), which can be e.g. a phenotype of interest (in biology), sales profit (in econometrics and business), a person's reaction to some treatment (in psychology) etc. We have collected 10 independent statistical observations / samples Y1, Y2,…, Y10 of the Response and noticed some variation in the Response from sample to sample.

Now we want to understand what this variation is due to, in order to uncover the mechanisms (biological, economical etc.) behind it. Suppose that in addition to the Response Y, we also collected data on potential drivers / causes of Y, such as gene expression (in biology), customer's social status and income (in econometrics) etc. Suppose we collected 100 potential drivers / predictors / causes for each of Y1, Y2,…, Y10. We can represent those predictors as a matrix X with 100 columns (one for each predictor) and 10 rows (one for each observation Y1, Y2,…, Y10). The variation in Y is probably due to some of the variables (columns / predictors) in the X matrix, but do all of them explain the variation equally well? Probably not: there may be redundancy among the variables, so they are not all equally important / informative. Therefore, it is reasonable to assume that only a fraction of the variables in X are causal for the variation in Y. But which of them are causal? To answer this question we have to test the variables in X against Y, but how should we do it: test all of them together or one-by-one?

Here we have a typical biological scenario where the number of drivers / causes / predictors (we will call them features from now on), p = 100, is much greater than the number of samples / observations, n = 10, i.e. p >> n. In mathematics this is called an underdetermined system: it does not have a unique solution but infinitely many solutions. Therefore, if we want to select features explaining the variation in the Response Y, we cannot directly test all the features together without special precautions such as regularization. It makes sense, at least in the beginning, to stick to testing the features one-by-one. Testing features one-by-one is called Univariate Feature Selection, while testing all features together is referred to as Multivariate Feature Selection.
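To make the p >> n problem concrete, here is a minimal sketch on simulated data (not the GTEX set used below): an ordinary least-squares fit of all 100 features together is rank-deficient and leaves most coefficients unestimated, while the univariate one-by-one tests run without problems.

#Simulate the p >> n situation described above: n = 10 samples, p = 100 features
set.seed(1)
n<-10; p<-100
X_sim<-matrix(rnorm(n*p),nrow=n,ncol=p)
Y_sim<-rnorm(n)

#Multivariate attempt: fitting all features together gives a rank-deficient
#model, most coefficients are not estimable (NA)
fit_all<-lm(Y_sim~.,data=data.frame(Y_sim,X_sim))
sum(is.na(coef(fit_all)))    #91 of the 101 coefficients come back as NA

#Univariate alternative: test each feature one-by-one against Y and record
#the p-value of its slope
pvals<-apply(X_sim,2,function(x) summary(lm(Y_sim~x))$coefficients[2,4])
head(sort(pvals))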

GTEX Skeletal Muscles Gene Expression Data Set

Here, we are going to go through methods for a) Univariate (one-by-one) Feature Selection, and b) Multivariate (all together) Feature Selection. To practice the concept of Feature Selection, we will use the skeletal muscle gene expression subset from the GTEX Human Tissue Gene Expression Consortium [1]. In order to speed up computations, we randomly sampled 1000 genes / features from the data set. Here we load the gene expression matrix X, remove lowly expressed genes and preview it:

#Uncomment this in case you cannot change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/FeatureSelectionIntegrOMICs"))

#Read the gene expression matrix: samples in rows, genes in columns
X<-read.table("GTEX_SkeletalMuscles_157Samples_1000Genes.txt",header=TRUE,row.names=1,check.names=FALSE,sep="\t")
#Remove lowly expressed genes: keep genes with a mean count of at least 1
X<-X[,colMeans(X)>=1]
X[1:5,1:3]
##                         ENSG00000243824.1_RP11-434O22.1
## GTEX-N7MS-0426-SM-2YUN6                               2
## GTEX-NFK9-0626-SM-2HMIV                               0
## GTEX-NPJ8-1626-SM-2HMIY                               0
## GTEX-O5YT-1626-SM-32PK6                               0
## GTEX-OHPM-1626-SM-2HMK4                               0
##                         ENSG00000140527.10_WDR93 ENSG00000205352.6_PRR13
## GTEX-N7MS-0426-SM-2YUN6                        2                     543
## GTEX-NFK9-0626-SM-2HMIV                        0                    1482
## GTEX-NPJ8-1626-SM-2HMIY                        3                    1958
## GTEX-O5YT-1626-SM-32PK6                        0                    1174
## GTEX-OHPM-1626-SM-2HMK4                        7                    1092
dim(X)
## [1] 157 546

We can see that the gene expression data set includes p = 546 expressed genes (features) and n = 157 samples, i.e. we work in the p >> n limit. The phenotype of interest we are going to address is Gender, i.e. we will figure out which of the 546 genes expressed in human skeletal muscles are associated with the phenotypic difference between Males and Females. Thus our response vector Y is the following:

#Read the Gender annotation for the 157 samples
Y<-read.table("GTEX_SkeletalMuscles_157Samples_Gender.txt",header=TRUE,sep="\t")$GENDER
summary(Y)
## Female   Male 
##     58     99
length(Y)
## [1] 157

The data set used here includes 99 Males and 58 Females. It is not perfectly balanced: ideally we would prefer roughly half of the samples in each class. This matters for correct training in Machine Learning, since otherwise a model tends to learn informative features predominantly from the majority class, which can result in overfitting and poor generalization. However, the Gender variable Y does not look too imbalanced, so we continue working with it as is; a sketch of one common mitigation follows below.
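If the imbalance were more severe, a simple mitigation is to down-sample the majority class so that both classes are equally represented. This is just an illustrative sketch; we do not apply it in this tutorial:

#Down-sampling sketch (not applied here): randomly keep as many Males as
#there are Females, so that both classes end up with 58 samples
set.seed(123)
balanced_idx<-c(which(Y=="Female"),sample(which(Y=="Male"),sum(Y=="Female")))
summary(Y[balanced_idx])

To visualize the samples, let us display a PCA plot of the 157 samples.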

library("mixOmics")
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.10.1
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
pca.gtex <- pca(X, ncomp=10)
pca.gtex
## Eigenvalues for the first 10 principal components, see object$sdev^2: 
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 11979554198  1922793376   470907790   173035873    83960716    38937526 
##         PC7         PC8         PC9        PC10 
##    29568540    24951919    19376723    17467325 
## 
## Proportion of explained variance for the first 10 principal components, see object$explained_variance: 
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.804731856 0.129164496 0.031633439 0.011623761 0.005640098 0.002615646 
##         PC7         PC8         PC9        PC10 
## 0.001986280 0.001676156 0.001301640 0.001173375 
## 
## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var: 
##       PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8 
## 0.8047319 0.9338964 0.9655298 0.9771536 0.9827937 0.9854093 0.9873956 0.9890717 
##       PC9      PC10 
## 0.9903734 0.9915467 
## 
##  Other available components: 
##  -------------------- 
##  loading vectors: see object$rotation
plot(pca.gtex)

plotIndiv(pca.gtex, group = Y, ind.names = FALSE, legend = TRUE, title = 'PCA on GTEX Skeletal Muscles')
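Beyond the summary printed above, the fitted pca object can be inspected directly. The loading vectors are stored in object$rotation (as the printout notes); the per-sample principal component scores that plotIndiv draws are assumed here to be accessible via object$variates$X, as in recent mixOmics versions:

#Gene loadings: contribution of each gene to PC1 and PC2
head(pca.gtex$rotation[,1:2])
#Per-sample PC coordinates (the points drawn by plotIndiv); $variates$X is
#assumed to be present in this mixOmics version
head(pca.gtex$variates$X[,1:2])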