Multi-OMICs Factor Analysis (MOFA) and Unsupervised Data Learning

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

January 16, 2018 (last update 10/07/2021 Rui Benfeitas)

Abstract

Here we will talk about supervised vs. unsupervised manifold learning and introduce Multi-OMICs Factor Analysis (MOFA) as an example of unsupervised data exploration. Further, we will perform integrative OMICs analysis of chromatin accessibility, DNA methylation and transcriptome profiling data from the scNMT-seq (single-cell nucleosome, methylation and transcription sequencing) using MOFA.

Multi-OMICs Factor Analysis (MOFA)
Prepare scNMT Data Set for MOFA
Run MOFA on scNMT Data Set
Session Info

Multi-OMICs Factor Analysis (MOFA)

MOFA is a typical hypothesis-free data exploration framework, https://www.embopress.org/doi/10.15252/msb.20178124. It allows data integration by extracting common axes of variation from the multiple OMICs layers. Given several data matrices with measurements of multiple ’omics data types on the same or on overlapping sets of samples, MOFA infers a low-dimensional data representation in terms of (hidden) factors which can be then visualized and interpreted.

Importantly, MOFA disentangles whether the underlying axes of heterogeneity are unique to a single data modality or are manifested in multiple modalities, thereby identifying links between the different ’omics. Once trained, the model output can be used for a range of downstream analyses, including

Visualisation of samples in factor space
Annotation of factors using (gene set) enrichment analysis
Imputation of missing values
Support of sparse data
Support of OMICs with non-Gaussian distribution including binary and count data

How does a typical MOFA output look like? In the MOFA paper they applied this method to Chronic Lymphocytic Leukaemia (CLL) on 200 human patients that combined 1) drug response, 2) somatic mutations (targeted sequencing), 3) RNAseq data, and 4) Methylation array data.

Notably, nearly 40% of the 200 samples were profiled with some but not all OMICs types; such a missing value scenario is not uncommon in large cohort studies, and MOFA is designed to cope with it.

One can inspect loadings of MOFA hidden factors and observe known bio-markers fo CLL. Specifically, factor 1 was aligned with the somatic mutation status of the immunoglobulin heavy-chain variable region gene (IGHV), while Factor 2 aligned with trisomy of chromosome 12.

Loadings from each factor can be used for pathway enrichment analysis. Interestingly, markers from diffeerent OMICs co-varying in a certain MOFA factor can be shown to be coupled to the same biological pathway. For CLL data, factor 5 demonstrates co-variation between drug response and mRNA levels. This factor tagged a set of genes that were highly enriched for oxidative stress and senescence pathways. Consistent with the annotation based on the mRNA view, it is observed that the drugs with the strongest weights on Factor 5 were associated with response to oxidative stress, such as target reactive oxygen species, DNA damage response and apoptosis.

Next MOFA can perform efficient imputation of missing values including imputation of “missing patients”, i.e. when a patient is present in one OMICs layer but absent in another, this absent layer can be reconstructed for the patient, i.e. gray bars in the panel a) of the figure above can be filled.

Finally, latent factors inferred by MOFA are predictive of clinical outcomes. The authors of MOFa paper assessed the prediction performance when combining the 10 MOFA factors in a multivariate Cox regression model (assessed using cross-validation). Notably, this model yielded higher prediction accuracy than the 10 factors derived from conventional PCA or using the individual features.

Important to realize that despite the results of PCA and Factor Analysis visually look very similar, as they both provide linear variance-covariance matrix decomposition, they have quite a different math behind. While PCA is a pure matrix factorization technique which splits the total variance into orthogonal Principal Components (PCs), Factor Analysis seeks to construct hidden latent variables that generate the observed data, therefore Factor Analysis is a generative model.

Prepare scNMT Data Set for MOFA

In this section we will apply MOFA to the single cell multi-OMICs data set scNMT which profiles chromatine accessebility (scATACseq), DNA methylation (scBSseq) and gene expression (scRNAseq) information from the same biological cells belonging to two types: differentiating mouse embryonic stem cells (ESCs) and embrionic bodies (EBs). We will start with reading and having a look at the individual OMICs data sets:

#Uncomment this in case you can not change your working directory
#current_dir<-getwd()
#setwd(paste0(current_dir,"/session_ml/UnsupervisedOMICsIntegration"))

scRNAseq<-read.delim("scRNAseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
ens2genesymbol<-read.delim("ENSEMBLE_TO_GENE_SYMBOL_MOUSE.txt")
ens2genesymbol<-ens2genesymbol[match(colnames(scRNAseq),as.character(ens2genesymbol$ensembl_gene_id)),]
colnames(scRNAseq)<-ens2genesymbol$external_gene_name
scRNAseq<-as.data.frame(t(scRNAseq))
scRNAseq[1:5,1:5]

##          ESC_A02  ESC_A03  ESC_A04  ESC_A05  ESC_A06
## Mrpl15  557.4201 130.8536 305.7356 417.1656 573.6504
## Lypla1  410.5958 138.8325 186.3237 106.0895 662.7008
## Tcea1   437.9119 356.3899 276.9121 161.8315 213.3068
## Atp6v1h 345.7199 500.5417 195.5884 133.0614 325.1376
## Oprk1     0.0000   0.0000   0.0000   0.0000   0.0000

scBSseq<-read.delim("scBSseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scBSseq<-as.data.frame(t(scBSseq))
scBSseq[1:5,1:5]

##               ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101606896   0.00000       0       0       0       0
## 1_101935054  88.88889     100     100     100      30
## 1_101935061  88.88889     100     100     100      90
## 1_102002184 100.00000     100       0       0       0
## 1_102238204 100.00000     100     100     100     100

scATACseq<-read.delim("scATACseq.txt", header = TRUE, check.names = FALSE, row.names = 1, sep="\t")
scATACseq<-as.data.frame(t(scATACseq))
scATACseq[1:5,1:5]

##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139     100     100     100     100      67
## 1_100392151     100     100     100      50      67
## 1_100668590      50      86      44      45      28
## 1_100738967      83      88      83      83      88
## 1_100994324       0       0      11      75       0

For scRNAseq OMIC layer we will only select highly expressed genes in order to remove noisy features that might contaminate the further downstream analysis. We will also perform log-transform of the data which can be seen as a mild normalization:

scRNAseq<-scRNAseq[rowMeans(scRNAseq)>=1,]
scRNAseq<-log10(scRNAseq + 1)
scRNAseq[1:5,1:5]

##          ESC_A02  ESC_A03  ESC_A04  ESC_A05  ESC_A06
## Mrpl15  2.746961 2.120092 2.486764 2.621348 2.759404
## Lypla1  2.614471 2.145608 2.272593 2.029747 2.821972
## Tcea1   2.642377 2.553142 2.443907 2.211738 2.331036
## Atp6v1h 2.539979 2.700307 2.293558 2.127304 2.513401
## Oprk1   0.000000 0.000000 0.000000 0.000000 0.000000

dim(scRNAseq)

## [1] 12145   113

Since scBSseq and scATACseq OMIC layers should be almost binary (methylated vs unmethylated for scBSseq and open vs. close for scATACseq) data sets, a good way to get rid of redundancy in the scBSseq and scATACseq data (and thus perform feature pre-selection and reduce dimensions) is to select only features with high variability:

library("mixOmics")

## Loading required package: MASS

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Loaded mixOmics 6.16.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')

my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)

##             freqRatio percentUnique
## 1_101606896    112.00     1.7699115
## 1_105525627     35.00     5.3097345
## 1_105539120     26.25     5.3097345
## 1_10605572       0.00     0.8849558
## 1_107485472      0.00     0.8849558
## 1_109269874     20.00     7.9646018

dim(my_nearZeroVar$Metrics)

## [1] 3290    2

scBSseq <- scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),]
scBSseq[1:5,1:5]

##               ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054  88.88889     100     100     100      30
## 1_101935061  88.88889     100     100     100      90
## 1_102002184 100.00000     100       0       0       0
## 1_102238204 100.00000     100     100     100     100
## 1_102255678 100.00000     100     100     100       0

dim(scBSseq)

## [1] 5285  113

my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)), uniqueCut = 1)
head(my_nearZeroVar$Metrics)

##             freqRatio percentUnique
## 1_10401731          0     0.8849558
## 1_128314944         0     0.8849558
## 1_13628215          0     0.8849558
## 1_178804829         0     0.8849558
## 1_183332775         0     0.8849558
## 1_183944178         0     0.8849558

dim(my_nearZeroVar$Metrics)

## [1] 91  2

scATACseq <- scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),]
scATACseq[1:5,1:5]

##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100392139     100     100     100     100      67
## 1_100392151     100     100     100      50      67
## 1_100668590      50      86      44      45      28
## 1_100738967      83      88      83      83      88
## 1_100994324       0       0      11      75       0

dim(scATACseq)

## [1] 11709   113

Let us now have a look at the histograms of individual OMICs layers in order to decide what distribution they follow and how we should model these distributions with MOFA:

hist(rowMeans(scRNAseq),breaks=100, main = "scRNAseq")

hist(rowMeans(scBSseq), breaks = 100, main = "scBSseq")

hist(rowMeans(scATACseq), breaks = 100, main = "scATACseq")

We conclude that while scRNAseq data looks fairly Gaussian (or at least exponential), we should probably model the scBSseq and scATACseq data following Bernoulli distribution as they look quite bimodal indicating the binary nature of the data, i.e. methylated vs. unmethylated for scBSseq and open vs. close for scATACseq. To make the scBSseq and scATACseq data purely Bernoulli-like, we will further make the scBSseq and scATACseq data sets binary by encoding values below 50 as 0 and above 50 as 1. Since binary data typically have vey low variation compared to continuous data, we need to remove low-variance features in this case again:

scBSseq<-ifelse(scBSseq<50,0,1)
scBSseq[1:5,1:5]

##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054       1       1       1       1       0
## 1_101935061       1       1       1       1       1
## 1_102002184       1       1       0       0       0
## 1_102238204       1       1       1       1       1
## 1_102255678       1       1       1       1       0

my_nearZeroVar<-nearZeroVar(as.data.frame(t(scBSseq)))
head(my_nearZeroVar$Metrics)

##             freqRatio percentUnique
## 1_110547534     21.60     1.7699115
## 1_133172191     27.25     1.7699115
## 1_139117169     27.25     1.7699115
## 1_146093078     21.60     1.7699115
## 1_194260338    112.00     1.7699115
## 1_24611678       0.00     0.8849558

dim(my_nearZeroVar$Metrics)

## [1] 761   2

scBSseq <- as.data.frame(scBSseq[-which(rownames(scBSseq)%in%rownames(my_nearZeroVar$Metrics)),])
scBSseq[1:5,1:5]

##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_101935054       1       1       1       1       0
## 1_101935061       1       1       1       1       1
## 1_102002184       1       1       0       0       0
## 1_102238204       1       1       1       1       1
## 1_102255678       1       1       1       1       0

dim(scBSseq)

## [1] 4524  113

scATACseq<-ifelse(scATACseq<50,0,1)
my_nearZeroVar<-nearZeroVar(as.data.frame(t(scATACseq)))
head(my_nearZeroVar$Metrics)

##             freqRatio percentUnique
## 1_100392139    112.00      1.769912
## 1_100392151     27.25      1.769912
## 1_100738967     55.50      1.769912
## 1_101897864     55.50      1.769912
## 1_102283335     55.50      1.769912
## 1_102563492    112.00      1.769912

dim(my_nearZeroVar$Metrics)

## [1] 2238    2

scATACseq <- as.data.frame(scATACseq[-which(rownames(scATACseq)%in%rownames(my_nearZeroVar$Metrics)),])
scATACseq[1:5,1:5]

##             ESC_A02 ESC_A03 ESC_A04 ESC_A05 ESC_A06
## 1_100668590       1       1       0       0       0
## 1_100994324       0       0       0       1       0
## 1_100994328       0       0       0       0       0
## 1_100994336       0       0       0       0       0
## 1_100994343       0       0       0       0       1

dim(scATACseq)

## [1] 9471  113

Now data cleaning step is finished and the OMICs are prepared to be integrated in unsupervised way with MOFA.

Run MOFA on scNMT Data Set

Let us continue with creating MOFA object. MOFA allows for a handy overview of the data by displaying dimensions of each OMIC.

library("MOFA2")

## 
## Attaching package: 'MOFA2'

## The following object is masked from 'package:stats':
## 
##     predict

omics<-list(scRNAseq=as.matrix(scRNAseq),scBSseq=as.matrix(scBSseq),scATACseq=as.matrix(scATACseq))
lapply(omics,dim)

## $scRNAseq
## [1] 12145   113
## 
## $scBSseq
## [1] 4524  113
## 
## $scATACseq
## [1] 9471  113

MOFAobject <- create_mofa_from_matrix(omics)
plot_data_overview(MOFAobject, )

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Now we will use some model training parameters such as distribution types corresponding to each OMIC, number of iterations, number of factors to be computed etc.

#DEFINE DATA OPTIONS
DataOptions <- get_default_data_options(MOFAobject)
DataOptions

## $scale_views
## [1] FALSE
## 
## $scale_groups
## [1] FALSE
## 
## $center_groups
## [1] TRUE
## 
## $use_float32
## [1] FALSE
## 
## $views
## [1] "scRNAseq"  "scBSseq"   "scATACseq"
## 
## $groups
## [1] "group1"

#DEFINE MODEL OPTIONS
ModelOptions <- get_default_model_options(MOFAobject)
mydistr<-c("gaussian","bernoulli","bernoulli")
names(mydistr)<-c("scRNAseq","scBSseq","scATACseq")
ModelOptions$likelihoods<-mydistr
ModelOptions$num_factors<-20
ModelOptions

## $likelihoods
##    scRNAseq     scBSseq   scATACseq 
##  "gaussian" "bernoulli" "bernoulli" 
## 
## $num_factors
## [1] 20
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] TRUE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE

#DEFINE TRAIN OPTIONS
TrainOptions <- get_default_training_options(MOFAobject)
TrainOptions$seed <- 2018
# Automatically drop factors that explain less than 3% of variance in all omics
TrainOptions$drop_factor_threshold <- 0.03
# TrainOptions$tolerance <- 0.1
TrainOptions$maxiter<-1000
TrainOptions$verbose<-TRUE
TrainOptions

## $maxiter
## [1] 1000
## 
## $convergence_mode
## [1] "fast"
## 
## $drop_factor_threshold
## [1] 0.03
## 
## $verbose
## [1] TRUE
## 
## $startELBO
## [1] 1
## 
## $freqELBO
## [1] 5
## 
## $stochastic
## [1] FALSE
## 
## $gpu_mode
## [1] FALSE
## 
## $seed
## [1] 2018
## 
## $outfile
## NULL
## 
## $weight_views
## [1] FALSE
## 
## $save_interrupted
## [1] FALSE

Finally, we are ready to run MOFA:

MOFAobject <- prepare_mofa(MOFAobject, data_options = DataOptions, 
                          model_options = ModelOptions, training_options = TrainOptions)

## Warning in prepare_mofa(MOFAobject, data_options = DataOptions, model_options
## = ModelOptions, : Some view(s) have a lot of features, it is recommended to
## perform a more stringent feature selection before creating the MOFA object....

## Checking data options...

## Checking training options...

## Checking model options...

MOFAobject <- run_mofa(MOFAobject)

## Warning in run_mofa(MOFAobject): No output filename provided. Using /var/folders/vh/twxllthj78j2bww9v4p_hjx1f3blpf/T//Rtmp8NxABE/mofa_20210710-235217.hdf5 to store the trained model.

## Connecting to the mofapy2 python package using reticulate (use_basilisk = FALSE)... 
##     Please make sure to manually specify the right python binary when loading R with reticulate::use_python(..., force=TRUE) or the right conda environment with reticulate::use_condaenv(..., force=TRUE)
##     If you prefer to let us automatically install a conda environment with 'mofapy2' installed using the 'basilisk' package, please use the argument 'use_basilisk = TRUE'

## Warning in .quality_control(object, verbose = verbose): Factor(s) 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total 'levels' between your samples, *sometimes* because of poor normalisation in the preprocessing steps.

Let us check the covariance of the OMICs layers:

#ANALYZE RESULTS OF MOFA INTEGRATION
# r2=calculate_variance_explained(MOFAobject)
# r2$R2Total
# head(r2$R2PerFactor)
# MOFAobject
plot_list=plot_variance_explained(MOFAobject, x='view', y='factor', plot_total = T)
plot_list[[2]] #total variance

plot_list[[1]] #variance by factor

We can see that scRNAseq provides the largest variation (13%) in the total integrative OMICs scNMT data set, scBSseq and scATACseq contribute around 5% of variation. We also observe that MOFA selected 3 Latent Factors (LFs), scRNAseq contributes to all of them while scBSseq and scATACseq contributes only to the first LF. Now let us interpret the LFs by displaying the features associated with each LF:

NumFactors<-dim(get_factors(MOFAobject))[2]

plot_weights(MOFAobject, view = "scRNAseq", factor = 1)

plot_top_weights(object = MOFAobject, view = "scRNAseq", factor = 1, nfeatures = 10)

plot_data_heatmap(object = MOFAobject, view = "scRNAseq", factor = "Factor1", features = 10, transpose = F, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

plot_weights(MOFAobject, view = "scBSseq", factor = 2)

plot_top_weights(object = MOFAobject, view = "scBSseq", factor = 2, nfeatures = 10)

plot_data_heatmap(object = MOFAobject, view = "scBSseq", factor = "Factor1", features = 10, transpose = F, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

plot_weights(MOFAobject, view = "scATACseq", factor = 1)

plot_top_weights(object = MOFAobject, view = "scATACseq", factor = 1, nfeatures = 10)

plot_data_heatmap(object = MOFAobject, view = "scATACseq", factor = "Factor1", features = 10, transpose = F, show_colnames=TRUE, show_rownames=TRUE,cluster_cols = TRUE)

Let us also display the cells in the low-dimensional latent space:

controls<-c("EB_P1D12","EB_P1E12","EB_P1F12","EB_P1G12","EB_P2B12","EB_P2D12","EB_P2E12","EB_P2F12","EB_P2G12",
            "ESC_B12","ESC_C12","ESC_D12")
colnames(scRNAseq)<-ifelse(colnames(scRNAseq)%in%controls,"CONTROL_CELL",colnames(scRNAseq))
cell_types<-matrix(unlist(strsplit(colnames(scRNAseq),"_")),ncol=2,byrow=TRUE)[,1]

plot_factor(MOFAobject, factors = 1:2, color_by = as.factor(cell_types),dot_size = 2)

We conclude that the MOFA unsupervised integrative OMICs approach can distinguish between all three types of cells: ESCs, EBs and Controls.

Session Info

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS/LAPACK: /Users/rui.benfeitas/opt/miniconda3/envs/mofa2_nets/lib/libopenblasp-r0.3.15.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_1.2.0     mixOmics_6.16.0 ggplot2_3.3.5   lattice_0.20-44
## [5] MASS_7.3-54     reticulate_1.20
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.4.0 sass_0.4.0           tidyr_1.1.3         
##  [4] jsonlite_1.7.2       ellipse_0.4.2        bslib_0.2.5.1       
##  [7] assertthat_0.2.1     highr_0.9            stats4_4.1.0        
## [10] yaml_2.2.1           ggrepel_0.9.1        corrplot_0.90       
## [13] pillar_1.6.1         glue_1.4.2           digest_0.6.27       
## [16] RColorBrewer_1.1-2   colorspace_2.0-2     cowplot_1.1.1       
## [19] htmltools_0.5.1.1    Matrix_1.3-4         plyr_1.8.6          
## [22] pkgconfig_2.0.3      pheatmap_1.0.12      dir.expiry_1.0.0    
## [25] purrr_0.3.4          corpcor_1.6.9        scales_1.1.1        
## [28] HDF5Array_1.20.0     RSpectra_0.16-0      Rtsne_0.15          
## [31] BiocParallel_1.26.0  tibble_3.1.2         farver_2.1.0        
## [34] generics_0.1.0       IRanges_2.26.0       ellipsis_0.3.2      
## [37] withr_2.4.2          BiocGenerics_0.38.0  magrittr_2.0.1      
## [40] crayon_1.4.1         evaluate_0.14        fansi_0.5.0         
## [43] forcats_0.5.1        tools_4.1.0          lifecycle_1.0.0     
## [46] matrixStats_0.59.0   basilisk.utils_1.4.0 stringr_1.4.0       
## [49] Rhdf5lib_1.14.0      S4Vectors_0.30.0     munsell_0.5.0       
## [52] DelayedArray_0.18.0  compiler_4.1.0       jquerylib_0.1.4     
## [55] rlang_0.4.11         rhdf5_2.36.0         grid_4.1.0          
## [58] rhdf5filters_1.4.0   igraph_1.2.6         labeling_0.4.2      
## [61] rmarkdown_2.9        basilisk_1.4.0       gtable_0.3.0        
## [64] DBI_1.1.1            rARPACK_0.11-0       reshape2_1.4.4      
## [67] R6_2.5.0             gridExtra_2.3        knitr_1.33          
## [70] dplyr_1.0.7          uwot_0.1.10          utf8_1.2.1          
## [73] filelock_1.0.2       stringi_1.6.2        parallel_4.1.0      
## [76] Rcpp_1.0.7           vctrs_0.3.8          png_0.1-7           
## [79] tidyselect_1.1.1     xfun_0.24