1  Introduction

1.1 Introduction

High-throughput biological datasets, such as transcriptomic, proteomic, or single-cell data, are often high-dimensional, noisy, and complex. Dimensionality reduction techniques help simplify this complexity by projecting data into a lower-dimensional space that preserves meaningful structure. These representations support visualization, clustering, and trajectory inference.

Importantly, biological data often defies simple binary classification. Cell states may exist along a continuum, clusters may overlap, and some samples may lie in transitional states. Traditional clustering methods may miss these subtleties, motivating the use of more flexible, geometry-aware approaches.

This tutorial introduces a set of complementary methods:

  • ICA (Independent Component Analysis) isolates statistically independent signals (see https://payamemami.com/ica_basics/).
  • SOMs (Self-Organizing Maps) organize samples based on topological similarity (see https://payamemami.com/self_orginizating_maps_basics/).
  • t-SNE and UMAP focus on preserving local neighborhoods for visualizing fine structure and clustering.
  • Diffusion Maps model global transitions using random walks, capturing continuous biological trajectories.
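To make the random-walk idea behind Diffusion Maps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation used later in this tutorial: the function name `diffusion_map`, the Gaussian kernel, and the median-distance bandwidth are all choices made for this example, and the input is synthetic data lying along a single noisy trajectory.

```python
import numpy as np


def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Hypothetical helper: diffusion-map coordinates for the rows of X."""
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Assumption: use the median squared distance as the kernel bandwidth.
    if epsilon is None:
        epsilon = np.median(d2[d2 > 0])

    # Gaussian affinities; row-normalizing K would give the random-walk
    # transition matrix P = D^-1 K.
    K = np.exp(-d2 / epsilon)
    degree = K.sum(axis=1)

    # Work with the symmetric matrix S = D^-1/2 K D^-1/2, which has the
    # same eigenvalues as P but can be handled by a symmetric eigensolver.
    S = K / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Convert to right eigenvectors of P, then drop the first (constant)
    # one, which belongs to eigenvalue 1 and carries no structure.
    psi = eigvecs / np.sqrt(degree)[:, None]
    coords = psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t
    return coords, eigvals


# Synthetic example: 100 samples along one noisy trajectory in 10 dimensions.
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 100)  # hidden position along the trajectory
direction = rng.normal(size=10)
X = np.outer(s, direction) + 0.01 * rng.normal(size=(100, 10))

coords, eigvals = diffusion_map(X, n_components=2)
```

In this toy setting the first diffusion coordinate, `coords[:, 0]`, should order the samples along the hidden position `s`, which is the sense in which the method captures a continuous trajectory. The full pairwise kernel used here scales quadratically with the number of samples, so larger datasets would typically need a sparse nearest-neighbor kernel instead.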

Comparison of Dimensionality Reduction Methods

| Method | Key Concept | When to Use | Common Applications | Limitations |
| --- | --- | --- | --- | --- |
| ICA | Decomposes data into statistically independent components | To identify latent signals or sources driving variation | Gene module analysis, signal separation | Assumes independence; sensitive to noise and scaling |
| SOM | Maps high-dimensional data onto a topologically ordered grid | When you want structured clustering and visualization on a 2D map | Expression patterns, cohort stratification | Grid size needs tuning; interpretation can be subjective |
| t-SNE | Preserves local similarities via stochastic neighbor embedding | For visualizing clusters in 2D or 3D | Cell type discovery, quality control | Poor global structure; sensitive to seed/perplexity; not suitable for downstream modeling |
| UMAP | Graph-based manifold learning balancing local and global structure | When you want fast, structure-preserving embeddings | Single-cell analysis, cohort comparison | Sensitive to parameters; may distort distances |
| Diffusion Maps | Uses a random walk process to capture manifold geometry | To model continuous transitions, trajectories, or diffusion-like dynamics | Cell state transitions, lineage inference | Slower on large datasets; interpretation of components may not be intuitive |
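
The SOM entry in the comparison notes that the grid size needs tuning. The sketch below shows where that choice enters a basic online SOM written in NumPy. It is a simplified illustration, not a reference implementation: the helper names `train_som` and `map_samples`, the linear decay schedules, and the default values are assumptions made for this example, and the data are two synthetic, well-separated groups.

```python
import numpy as np


def train_som(X, grid_shape=(5, 5), n_iter=2000, lr0=0.5,
              sigma0=2.0, sigma_end=0.5, seed=0):
    """Hypothetical helper: train a small SOM with online updates."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Fixed 2D grid position of every unit.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    # Initialize unit weights from randomly chosen samples.
    W = X[rng.integers(0, len(X), size=rows * cols)].astype(float)

    for it in range(n_iter):
        frac = it / n_iter
        lr = lr0 * (1.0 - frac)                        # learning rate decays
        sigma = sigma0 + (sigma_end - sigma0) * frac   # neighborhood shrinks

        x = X[rng.integers(0, len(X))]
        # Best-matching unit: the unit whose weights are closest to x.
        bmu = np.argmin(np.sum((W - x) ** 2, axis=1))
        # Units close to the BMU on the grid are pulled toward x as well,
        # which is what produces the topological ordering.
        g2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-g2 / (2.0 * sigma ** 2))
        W += lr * h[:, None] * (x - W)
    return W, grid


def map_samples(X, W):
    """Index of the best-matching unit for every sample."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)


# Synthetic example: two well-separated groups of samples in 4 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 4)),
               rng.normal(5.0, 0.3, size=(100, 4))])

W, grid = train_som(X, grid_shape=(5, 5))
bmu = map_samples(X, W)
# Mean distance from each sample to its best-matching unit.
quantization_error = np.linalg.norm(X - W[bmu], axis=1).mean()
```

Rerunning with a different `grid_shape` and comparing `quantization_error` is one simple way to explore the grid-size trade-off: larger grids usually lower the error but spread samples over more units, which can make the map harder to interpret.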