Introduction: Why Select Good Features?

Imagine that we are interested in monitoring a variable Y (we will call it a Response), which can be e.g. a phenotype of interest (in biology), sales profit (in econometrics and business), a person's reaction to some treatment (in psychology) etc. We have collected 10 independent statistical observations / samples Y1, Y2,…, Y10 of the Response and noticed some variation in the Response from sample to sample.

Now, we want to understand what this variation is due to. We need to know this in order to understand the mechanisms (biological, economic etc.) behind this variation. Suppose that in addition to the Response Y, we also collected data related to potential drivers / causes of Y such as gene expression (in biology), customer's social status and income (in econometrics) etc. Suppose we collected 100 potential drivers / predictors / causes for each of Y1, Y2,…, Y10. We can represent those predictors as a matrix X with 100 columns (one for each predictor) and 10 rows (one for each observation Y1, Y2,…, Y10). We know that the variation in Y is probably due to some variables (columns / predictors) in the X matrix, but do all of them explain the variation equally well? Probably not: there may be redundancy among the variables, so they are not all equally important / informative. Therefore, it is reasonable to assume that only a fraction of the variables in X are causal for the variation in Y. But which of them are causal? To answer this question we have to test the variables in X against Y, but how should we do it: test all of them together or one-by-one?

Here we have a typical biological scenario where the number of drivers / causes / predictors (we will call them features from now on), p=100, is much greater than the number of samples / observations, n=10, i.e. p >> n. In mathematics this is called an underdetermined system: it does not have a unique solution but infinitely many solutions. Therefore, if we want to select features explaining the variation in the Response Y, we cannot directly test all the features together without special precautions such as regularization. It therefore makes sense to stick (at least in the beginning) to testing the features one-by-one. Testing features one-by-one is called Univariate Feature Selection, while testing all features together is referred to as Multivariate Feature Selection.
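
To see why this matters, here is a minimal simulated sketch (the data, object names and seed below are purely illustrative and not part of the GTEX analysis): with p > n, ordinary least squares has no unique solution, and R reports NA for the coefficients it cannot estimate.

#Illustrative simulation only: more predictors than observations
set.seed(1)
n<-10; p<-100
X_sim<-matrix(rnorm(n*p),nrow=n,ncol=p)
Y_sim<-rnorm(n)
fit_sim<-lm(Y_sim~X_sim)
#Only the intercept and the first n-1 = 9 slopes are estimable, the remaining 91 are NA
sum(is.na(coef(fit_sim)))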

GTEX Skeletal Muscles Gene Expression Data Set

Here, we are going to go through methods for a) Univariate (one-by-one) Feature Selection, and b) Multivariate (all together) Feature Selection. For practicing the concept of Feature Selection, we will use the skeletal muscle gene expression subset from the GTEX Human Tissue Gene Expression Consortium [1]. In order to speed up computations we randomly sampled 1000 genes / features from the data set. Here, we load the gene expression matrix X, remove lowly expressed genes and preview it:

#Uncomment this in case you need to change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/FeatureSelectionIntegrOMICs"))

X<-read.table("GTEX_SkeletalMuscles_157Samples_1000Genes.txt",header=TRUE,row.names=1,check.names=FALSE,sep="\t")
X<-X[,colMeans(X)>=1]
X[1:5,1:3]
##                         ENSG00000243824.1_RP11-434O22.1
## GTEX-N7MS-0426-SM-2YUN6                               2
## GTEX-NFK9-0626-SM-2HMIV                               0
## GTEX-NPJ8-1626-SM-2HMIY                               0
## GTEX-O5YT-1626-SM-32PK6                               0
## GTEX-OHPM-1626-SM-2HMK4                               0
##                         ENSG00000140527.10_WDR93 ENSG00000205352.6_PRR13
## GTEX-N7MS-0426-SM-2YUN6                        2                     543
## GTEX-NFK9-0626-SM-2HMIV                        0                    1482
## GTEX-NPJ8-1626-SM-2HMIY                        3                    1958
## GTEX-O5YT-1626-SM-32PK6                        0                    1174
## GTEX-OHPM-1626-SM-2HMK4                        7                    1092
dim(X)
## [1] 157 546

We can see that the gene expression data set includes p = 546 expressed genes (features) and n = 157 samples, i.e. we work in the p >> n limit. The phenotype of interest we are going to address is Gender, i.e. we will figure out which of the 546 genes expressed in human skeletal muscles are associated with the phenotypic difference between Males and Females. Thus, our response vector Y is the following:

Y<-read.table("GTEX_SkeletalMuscles_157Samples_Gender.txt",header=TRUE,sep="\t")$GENDER
summary(Y)
## Female   Male 
##     58     99
length(Y)
## [1] 157

The data set used here includes 99 Males and 58 Females. It is not perfectly balanced; ideally, we would prefer roughly half of the samples to be assigned to each class. This matters for correct training in Machine Learning: otherwise a model tends to learn informative features mostly from the majority class, which can result in overfitting and problems with generalization. However, the Gender variable Y does not look too imbalanced, so we continue working with it as is. To visualize the samples, let us display a PCA plot of the 157 samples.
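
As a quick sanity check (a minimal sketch; the down-sampling step is optional and commented out because we keep the full data set), one can inspect the class proportions and, if necessary, down-sample the majority class:

#Fraction of samples per class
prop.table(table(Y))
#Optional down-sampling of the majority class (not used further in this tutorial)
#set.seed(1)
#keep<-c(sample(which(Y=="Male"),sum(Y=="Female")),which(Y=="Female"))
#X_bal<-X[keep,]; Y_bal<-Y[keep]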

library("mixOmics")
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.10.1
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
pca.gtex <- pca(X, ncomp=10)
pca.gtex
## Eigenvalues for the first 10 principal components, see object$sdev^2: 
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 11979554198  1922793376   470907790   173035873    83960716    38937526 
##         PC7         PC8         PC9        PC10 
##    29568540    24951919    19376723    17467325 
## 
## Proportion of explained variance for the first 10 principal components, see object$explained_variance: 
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.804731856 0.129164496 0.031633439 0.011623761 0.005640098 0.002615646 
##         PC7         PC8         PC9        PC10 
## 0.001986280 0.001676156 0.001301640 0.001173375 
## 
## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var: 
##       PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8 
## 0.8047319 0.9338964 0.9655298 0.9771536 0.9827937 0.9854093 0.9873956 0.9890717 
##       PC9      PC10 
## 0.9903734 0.9915467 
## 
##  Other available components: 
##  -------------------- 
##  loading vectors: see object$rotation
plot(pca.gtex)

plotIndiv(pca.gtex, group = Y, ind.names = FALSE, legend = TRUE, title = 'PCA on GTEX Skeletal Muscles')

The PCA plot demonstrates that there is a lot of variation between samples with respect to both PC1 and PC2, but there is no clear segregation of Males and Females based on their skeletal muscle gene expression data. Now, we are going to start with a simple gene-by-gene Univariate Feature Selection and later compare its results with the Multivariate Feature Selection.

Univariate Feature Selection

One way to understand which genes stand behind the variation between the (Male and Female) samples would be to test the correlation of each individual feature (gene, X variable) against the phenotype of interest (Gender, Y variable); in our case this is equivalent to a simple Differential Gene Expression (DGE) analysis. Here, we will use the simple non-parametric Spearman correlation for inferring the relation between X and Y; alternatively, one can use other measures of relatedness such as the Mann-Whitney U test (“wilcox.test” function in base R), Linear Regression (“lm” function in base R), Distance Correlation (“dcor” function in the “energy” R package), the Maximal Information Coefficient (MIC) (“mine” function in the “minerva” R package) etc.
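
As a minimal sketch of one such alternative (the choice of the first column of X is arbitrary and purely illustrative), the Mann-Whitney U test for a single gene would look like this; the Spearman-based loop used for the actual analysis follows below.

#Mann-Whitney U test of a single (arbitrarily chosen) gene against Gender
wilcox.test(X[,1]~Y)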

#Spearman correlation of each gene against the Gender labels coded as numbers
rho<-vector()
p<-vector()
a<-seq(from=0,to=dim(X)[2],by=100) #progress checkpoints every 100 features
for(i in 1:dim(X)[2])
{
  corr_output<-cor.test(X[,i],as.numeric(as.factor(Y)),method="spearman")
  rho<-append(rho,as.numeric(corr_output$estimate))
  p<-append(p,as.numeric(corr_output$p.value))
  if(i%in%a){print(paste("FINISHED ",i," FEATURES",sep=""))}
}
## [1] "FINISHED 100 FEATURES"
## [1] "FINISHED 200 FEATURES"
## [1] "FINISHED 300 FEATURES"
## [1] "FINISHED 400 FEATURES"
## [1] "FINISHED 500 FEATURES"
output<-data.frame(GENE=colnames(X), SPEARMAN_RHO=rho, PVALUE=p)
output$FDR<-p.adjust(output$PVALUE,method="BH")
output<-output[order(output$FDR,output$PVALUE,-output$SPEARMAN_RHO),]
head(output,10)
##                             GENE SPEARMAN_RHO       PVALUE          FDR
## 256    ENSG00000184368.11_MAP7D2   -0.5730196 4.425151e-15 2.416132e-12
## 324       ENSG00000110013.8_SIAE    0.3403994 1.288217e-05 3.516833e-03
## 297    ENSG00000128487.12_SPECC1   -0.3003621 1.323259e-04 2.408332e-02
## 218      ENSG00000162512.11_SDC3    0.2945390 1.807649e-04 2.467441e-02
## 38     ENSG00000129007.10_CALML4    0.2879754 2.549127e-04 2.783647e-02
## 107   ENSG00000233429.5_HOTAIRM1   -0.2768054 4.489930e-04 4.085836e-02
## 278    ENSG00000185442.8_FAM174B   -0.2376098 2.731100e-03 2.130258e-01
## 421     ENSG00000234585.2_CCT6P3   -0.2322268 3.426233e-03 2.338404e-01
## 371       ENSG00000113312.6_TTC1    0.2284351 4.007655e-03 2.431310e-01
## 269 ENSG00000226329.2_AC005682.6   -0.2226587 5.064766e-03 2.523944e-01

We have ranked all genes by their contribution to the variation in skeletal muscle gene expression between Males and Females. The ranking is based on the Spearman correlation p-value, which was adjusted (FDR column) to account for the effect of multiple independent statistical tests.

Now there is a temptation to take the top differentially expressed genes with e.g. FDR < 0.05 and build a prediction score that can be used for discriminating between Males and Females based on skeletal muscle gene expression in any other cohort. Why do we need that kind of prediction score? Suppose the phenotype of interest is a disease status (Sick vs. Healthy); then such a prediction is of major need and importance for clinical diagnostics in e.g. cancer and diabetes. In this way, one can potentially make a more informed decision about e.g. a cancer or diabetes subtype by utilizing molecular-level information such as gene expression from the tissue of interest.
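
As a hedged sketch of what such a score could look like (illustrative only: the 0.5 probability cutoff is arbitrary and the model is evaluated on the same data it was trained on, so the apparent accuracy is optimistic), one could fit a logistic regression on the top univariate hits:

#Illustrative only: logistic regression score built from genes with FDR < 0.05
top_genes<-as.character(output$GENE[output$FDR<0.05])
train_df<-data.frame(Y=Y, X[,top_genes,drop=FALSE])
uni_score<-glm(Y~., data=train_df, family=binomial)
#In-sample accuracy (optimistic; a fair assessment needs an independent cohort or cross-validation)
mean(ifelse(predict(uni_score,type="response")>0.5,"Male","Female")==Y)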

However, in practice this type of prediction based on Univariate Feature Selection works quite poorly. The reason is that Univariate Feature Selection has at least two severe problems that have not been addressed here yet; for instance, each feature is tested in isolation, so any information carried jointly by groups of correlated features is missed.
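
To illustrate the latter point, here is a small simulated sketch (toy data, unrelated to the GTEX set): the response depends on the difference between two strongly correlated features, so each feature alone barely correlates with y, while a joint (multivariate) model recovers both effects.

set.seed(1)
x1<-rnorm(100)
x2<-x1+rnorm(100,sd=0.1)      #x2 is almost a copy of x1
y<-x1-x2+rnorm(100,sd=0.1)    #the signal lives in the small difference x1-x2
c(univariate_x1=cor(x1,y), univariate_x2=cor(x2,y))  #both close to zero
summary(lm(y~x1+x2))$coefficients  #the joint model recovers coefficients close to +1 and -1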

For more information and a comparison of predictions based on Univariate and Multivariate analyses, please check this blog post.

The shortcomings mentioned above can be addressed with Sparse Linear Models, i.e. models with regularization penalties such as LASSO, Ridge and Elastic Net, which are basic techniques for Multivariate Feature Selection.

Multivariate Feature Selection: LASSO, Ridge, Elastic Net

The simplest way to account for all explanatory variables (genes) in X simultaneously would be to put them into a multiple / multivariate linear regression model and perform Ordinary Least Squares (OLS) minimization:

\[ Y = \beta_1X_1+\beta_2X_2+\epsilon \] \[ \textrm{OLS} = (y-\beta_1X_1-\beta_2X_2)^2 \]

Here, for simplicity we used only two predictors, X1 and X2, but there can be thousands or millions of them. This implies that in order to minimize the OLS cost function we have to do it in a high-dimensional space, which is notoriously difficult because of the Curse of Dimensionality. This leads to a very unstable solution of multiple linear regression. To overcome this obstacle, we can add a penalty term to the OLS cost function:

\[ \textrm{Penalized OLS} = (y-\beta_1X_1-\beta_2X_2)^2 + \lambda[\alpha(|\beta_1|+|\beta_2|)+(1-\alpha)(\beta_1^2+\beta_2^2)] \]

Here, \(\lambda\) is called a Lagrange multiplier and is a measure of how much penalty we would like to put on our Linear Regression Model; its optimal value is usually found through K-fold cross-validation. The parameter \(\alpha\) is usually fixed (but in principle can also be found through cross-validation) and the regularization is called 1) LASSO if \(\alpha=1\), 2) Ridge if \(\alpha=0\), and 3) Elastic Net if \(\alpha=0.5\). These penalties have a few differences that are good to remember when you select a method for your analysis. LASSO is the strictest penalty and works best on data with lots of noise. A problem of LASSO is that it cannot handle multicollinearity among predictors in a good way. If two variables are strongly correlated, LASSO will select only one of them (by chance) and set the coefficient of the other one to zero. Sometimes this type of selection can be problematic if the feature that was ignored / omitted by LASSO has more physical / biological meaning than the one that was kept. This situation can be avoided with the Ridge penalty; in addition, Ridge is much more stable for numerical minimization as it provides a fully convex manifold in a multi-dimensional space. However, in ultra-high-dimensional spaces Ridge can be too permissive and deliver too many “noisy” features that might not be very interesting. The Elastic Net penalty provides a good compromise between LASSO and Ridge and is generally preferred and recommended by Machine Learning practitioners.

In the example below we will run LASSO penalty on Y vs. X Linear Model and find an optimal value of \(\lambda\) via 10-fold cross-validation:

library("glmnet")
## Loading required package: Matrix
## Loaded glmnet 4.1-1
lasso_fit <- cv.glmnet(as.matrix(X), Y, family="binomial", alpha=1)
plot(lasso_fit)

lasso_fit$lambda.min
## [1] 0.02741586
log(lasso_fit$lambda.min)
## [1] -3.596634
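
As a side note (a hedged sketch, not part of the main analysis), cv.glmnet also reports lambda.1se, a more conservative choice that typically selects fewer features, and the Ridge and Elastic Net penalties discussed above can be fitted with the same call simply by changing alpha:

#More conservative penalty within one standard error of the minimum
lasso_fit$lambda.1se
#Ridge (alpha=0) and Elastic Net (alpha=0.5) counterparts of the LASSO fit above
ridge_fit<-cv.glmnet(as.matrix(X), Y, family="binomial", alpha=0)
elastic_fit<-cv.glmnet(as.matrix(X), Y, family="binomial", alpha=0.5)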

Once we know the optimal \(\lambda\), we can display the names of the most informative features selected by LASSO for that optimal \(\lambda\).

coef<-predict(lasso_fit, s = "lambda.min", type = "nonzero")
colnames(X)[unlist(coef)]
##  [1] "ENSG00000183808.7_RBM12B"         "ENSG00000129007.10_CALML4"       
##  [3] "ENSG00000244306.5_CTD-2314B22.3"  "ENSG00000135541.16_AHI1"         
##  [5] "ENSG00000151023.12_ENKUR"         "ENSG00000257647.1_RP11-701H24.3" 
##  [7] "ENSG00000184949.10_FAM227A"       "ENSG00000261529.1_RP13-487P22.1" 
##  [9] "ENSG00000180817.6_PPA1"           "ENSG00000234336.2_JAZF1-AS1"     
## [11] "ENSG00000198954.4_KIAA1279"       "ENSG00000109943.4_CRTAM"         
## [13] "ENSG00000144677.10_CTDSPL"        "ENSG00000198729.4_PPP1R14C"      
## [15] "ENSG00000187239.12_FNBP1"         "ENSG00000203836.7_NBPF24"        
## [17] "ENSG00000250240.1_CTD-2154I11.2"  "ENSG00000233012.2_HDAC1P2"       
## [19] "ENSG00000016602.8_CLCA4"          "ENSG00000136279.14_DBNL"         
## [21] "ENSG00000162512.11_SDC3"          "ENSG00000124749.12_COL21A1"      
## [23] "ENSG00000254510.1_RP11-867G23.10" "ENSG00000155761.9_SPAG17"        
## [25] "ENSG00000130300.4_PLVAP"          "ENSG00000184368.11_MAP7D2"       
## [27] "ENSG00000267834.1_RP11-167N5.5"   "ENSG00000168566.11_SNRNP48"      
## [29] "ENSG00000128487.12_SPECC1"        "ENSG00000230267.2_HERC2P4"       
## [31] "ENSG00000110013.8_SIAE"           "ENSG00000113312.6_TTC1"          
## [33] "ENSG00000227407.1_AC008746.3"     "ENSG00000271964.1_RP11-415F23.2" 
## [35] "ENSG00000261064.1_RP11-1000B6.3"  "ENSG00000207697.1_MIR573"        
## [37] "ENSG00000182742.5_HOXB4"          "ENSG00000184304.10_PRKD1"        
## [39] "ENSG00000135127.7_CCDC64"         "ENSG00000140391.10_TSPAN3"       
## [41] "ENSG00000161847.9_RAVER1"         "ENSG00000172766.14_NAA16"        
## [43] "ENSG00000137168.7_PPIL1"          "ENSG00000152766.5_ANKRD22"

We can see that LASSO selected the 44 most informative features and set the coefficients of all other features to zero. Finally, let us use the LASSO coefficients for ranking the selected features by their importance:

lasso_coef<-as.matrix(coef(lasso_fit, s = "lambda.min"))[,1]
result<-data.frame(GENE = names(lasso_coef[lasso_coef!=0])[-1], SCORE = as.numeric(lasso_coef[lasso_coef!=0])[-1])
result<-result[order(-abs(result$SCORE)),]
head(result,10)
##                                GENE       SCORE
## 23 ENSG00000254510.1_RP11-867G23.10 -0.06619116
## 33     ENSG00000227407.1_AC008746.3 -0.04402248
## 5          ENSG00000151023.12_ENKUR  0.03725653
## 18        ENSG00000233012.2_HDAC1P2 -0.03554915
## 26        ENSG00000184368.11_MAP7D2 -0.03519248
## 44        ENSG00000152766.5_ANKRD22 -0.02841548
## 27   ENSG00000267834.1_RP11-167N5.5 -0.02797280
## 36         ENSG00000207697.1_MIR573 -0.02631968
## 2         ENSG00000129007.10_CALML4  0.02268164
## 1          ENSG00000183808.7_RBM12B -0.02145879

We conclude that the features selected by the Multivariate Feature Selection approach do not fully coincide with the ones selected by Univariate Feature Selection in the previous section. This is generally the case in practice, and it is good to remember that features selected in a Multivariate fashion tend to have a higher predictive capacity.
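
To make this comparison concrete (a minimal sketch using the objects created above), one can intersect the significant univariate hits with the LASSO-selected genes:

#Overlap between the univariate hits (FDR < 0.05) and the LASSO-selected genes
univariate_hits<-as.character(output$GENE[output$FDR<0.05])
lasso_hits<-as.character(result$GENE)
intersect(univariate_hits, lasso_hits)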

Multivariate Feature Selection: PLS

Another elegant Multivariate Feature Selection method is Partial Least Squares (PLS) regression, also called (by its author) Projection to Latent Structures (PLS). The idea behind PLS is very simple: it performs feature selection jointly, as a group, by maximizing the covariance between X and Y:

\[ \max_{\beta}\textrm{cov}(X\beta,Y) \Longrightarrow \hat\beta \]

This algorithm can roughly be viewed as a process of collective selection of features that provide the largest separation of samples in a lower-dimensional latent space, such as a PCA plot. The maximized covariance matrix (built on the selected most informative features) can then be factorized (expanded into a series of orthogonal components) and visualized:

library("mixOmics")
gtex.plsda <- plsda(X, Y, ncomp = 2)
background = background.predict(gtex.plsda, comp.predicted = 2, dist = "max.dist")
plotIndiv(gtex.plsda , comp = 1:2, group = Y, ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'PLSDA on GTEX Skeletal Muscles', background = background)

Now we observe a much clearer separation between Males and Females compared to the PCA plot above. This separation is achieved by selecting the most informative features that correlate with Gender. The selected features can be visualized and ranked by their importance via the PLS loadings:

plotLoadings(gtex.plsda, comp = 1, title = 'Loadings on comp 1', contrib = 'max', method = 'median', ndisplay = 10, size.name = 0.6)
## Some variable names are too long. Trimmed for visualisation purposes.

plotLoadings(gtex.plsda, comp = 2, title = 'Loadings on comp 2', contrib = 'max', method = 'median', ndisplay = 10, size.name = 0.6)
## Some variable names are too long. Trimmed for visualisation purposes.

Here, the x-axis represents the importance weight of each feature, and the color indicates whether the feature (gene) on the y-axis has a higher abundance (expression) in Males or in Females; please compare the colors in the legend with the colors of the bars.
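
If you prefer to extract the ranked loadings programmatically rather than reading them off the plot, the selectVar function from mixOmics can be used; a minimal sketch for the first component:

#Genes ranked by the absolute value of their loading on the first PLS-DA component
head(selectVar(gtex.plsda, comp=1)$value, 10)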

Again, we conclude that Multivariate Feature Selection via PLS provided a set of features that looks quite different from the one selected one-by-one with Spearman correlation.
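
Optionally, the predictive power of the PLS-DA model can itself be estimated by cross-validation with the perf function from mixOmics (a hedged sketch; the number of folds and repeats below are arbitrary illustrative choices and the computation can take a while):

#Cross-validated classification error rate of the PLS-DA model
set.seed(1)
plsda_perf<-perf(gtex.plsda, validation="Mfold", folds=5, nrepeat=10)
plsda_perf$error.rate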

Multivariate Feature Selection: Linear Discriminant Analysis (LDA)

Finally, we would like to mention another interesting type of linear multivariate feature selection: Linear Discriminant Analysis (LDA). The idea of the method is to group samples according to their labels (e.g. Males vs. Females) so that the variance is minimized within each group and maximized between the groups (Males and Females). Although the idea of LDA seems different from PLS and LASSO at first glance, it can be shown that LDA conceptually falls within the same group of linear multivariate feature selection methods. All three approaches, i.e. LDA, PLS and LASSO / Ridge / Elastic Net, are used interchangeably and often result in very similar sets of most informative features, provided the sample size is large enough. We do not demonstrate LDA on our test data set here: with K classes LDA yields at most K-1 discriminant components, so with only two classes (Males and Females) there is a single component and no informative two-dimensional LDA plot can be drawn. However, we recommend checking and testing the scikit-learn implementation of LDA if you have a data set with >= 3 classes.
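
For R users, here is a minimal sketch of LDA on the built-in iris data (three classes, hence two discriminant components) using the MASS package, which is already loaded as a dependency of mixOmics; this is purely illustrative and unrelated to the GTEX data:

#LDA on iris: three classes give two discriminant components LD1 and LD2
library("MASS")
lda_fit<-lda(Species~., data=iris)
lda_proj<-predict(lda_fit)$x   #samples projected onto the discriminant components
plot(lda_proj, col=iris$Species, pch=19, main="LDA on iris (two discriminant components)")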

References

[1] The GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nature Genetics 45(6):580-585, 29 May 2013.

RSession

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rui.benfeitas/opt/miniconda3/envs/ismb_si_fs/lib/libopenblasp-r0.3.15.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] glmnet_4.1-1    Matrix_1.2-17   mixOmics_6.10.1 ggplot2_3.2.1  
## [5] lattice_0.20-38 MASS_7.3-51.4  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         RSpectra_0.15-0    highr_0.9          plyr_1.8.4        
##  [5] RColorBrewer_1.1-2 pillar_1.4.6       compiler_3.6.3     iterators_1.0.12  
##  [9] tools_3.6.3        digest_0.6.20      evaluate_0.14      lifecycle_0.2.0   
## [13] tibble_3.0.3       gtable_0.3.0       pkgconfig_2.0.2    rlang_0.4.7       
## [17] foreach_1.4.7      igraph_1.2.6       yaml_2.2.1         parallel_3.6.3    
## [21] xfun_0.23          gridExtra_2.3      withr_2.1.2        stringr_1.4.0     
## [25] dplyr_1.0.2        knitr_1.33         generics_0.1.0     vctrs_0.3.2       
## [29] grid_3.6.3         tidyselect_1.1.0   ellipse_0.4.1      glue_1.4.1        
## [33] R6_2.4.0           rARPACK_0.11-0     survival_3.2-10    rmarkdown_2.8     
## [37] reshape2_1.4.3     tidyr_1.1.0        purrr_0.3.4        corpcor_1.6.9     
## [41] magrittr_1.5       splines_3.6.3      codetools_0.2-16   matrixStats_0.54.0
## [45] scales_1.0.0       ellipsis_0.2.0.1   htmltools_0.5.1.1  shape_1.4.6       
## [49] colorspace_1.4-1   labeling_0.3       stringi_1.6.2      lazyeval_0.2.2    
## [53] munsell_0.5.0      crayon_1.3.4