
Multi-sample/batch harmonization
Aditya Singh
14-Apr-2026
The Challenge:
Batch Effects Include:

Key challenge: Distinguish biological variation from batch effects
Key Assumption: Batches are uncorrelated with (orthogonal to) the variable of interest.
%%{init: {'theme':'base', 'themeVariables': { 'fontSize': '14px', 'primaryTextColor':'#333'}}}%%
flowchart LR
B["Multiple<br/>Datasets?"]
B --> X["Same<br/>Experiment"]
X --> C["🔗<br/>INTEGRATION"]
B --> Y["Multiple<br/>Experiments"]
Y --> D["🧹<br/>BATCH<br/>CORRECTION"]
Y --> C
C --> C1["Align<br/>Biological<br/>Signals"]
C1 --> C3["Unified<br/>Cell Types"]
D --> D1["Remove<br/>Technical<br/>Noise"]
D1 --> D3["Clean<br/>Expression"]
C3 --> E["Proceed to<br/>Analysis"]
D3 --> E
classDef decision fill:#e1f5fe,color:#000
classDef action fill:#f3e5f5,color:#000
classDef result fill:#e8f5e8,color:#000
class B decision
class X,Y,D,C action
class C1,D1,E,C3,D3 result
linkStyle default stroke:#333,stroke-width:3px
Balanced Batches
| Sample | Condition | Batch |
|---|---|---|
| Patient A | Control | 1 |
| Patient B | Control | 2 |
| Patient C | Control | 3 |
| Patient D | Treated | 1 |
| Patient E | Treated | 2 |
| Patient F | Treated | 3 |
Confounded Batches
| Sample | Condition | Batch |
|---|---|---|
| Patient A | Control | 1 |
| Patient B | Control | 1 |
| Patient C | Control | 1 |
| Patient D | Treated | 2 |
| Patient E | Treated | 2 |
| Patient F | Treated | 2 |
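One way to tell the two designs apart is to cross-tabulate condition against batch. The sketch below is a minimal illustration using only the Python standard library. The helper names `batch_condition_table` and `is_fully_confounded` are hypothetical (not from any package), and the rows simply restate the two tables above.

```python
from collections import defaultdict

# Rows restate the two design tables: (sample, condition, batch).
balanced = [
    ("Patient A", "Control", 1), ("Patient B", "Control", 2), ("Patient C", "Control", 3),
    ("Patient D", "Treated", 1), ("Patient E", "Treated", 2), ("Patient F", "Treated", 3),
]
confounded = [
    ("Patient A", "Control", 1), ("Patient B", "Control", 1), ("Patient C", "Control", 1),
    ("Patient D", "Treated", 2), ("Patient E", "Treated", 2), ("Patient F", "Treated", 2),
]

def batch_condition_table(samples):
    """Count samples for every (condition, batch) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for _name, condition, batch in samples:
        table[condition][batch] += 1
    return {condition: dict(counts) for condition, counts in table.items()}

def is_fully_confounded(samples):
    """True when every batch contains a single condition (hypothetical helper)."""
    conditions_in_batch = defaultdict(set)
    for _name, condition, batch in samples:
        conditions_in_batch[batch].add(condition)
    n_conditions = len({condition for _name, condition, _batch in samples})
    return n_conditions > 1 and all(len(c) == 1 for c in conditions_in_batch.values())

print(batch_condition_table(balanced))
print(is_fully_confounded(balanced), is_fully_confounded(confounded))
```

In the confounded design every batch holds only one condition, so any batch adjustment would also remove the treatment effect; this is the situation the key assumption rules out.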
| Method | Algorithm | Language | Library | Ref |
|---|---|---|---|---|
| CCA | Canonical Correlation Analysis | R | Seurat | Cell 2019 |
| MNN | Mutual Nearest Neighbors correction | R / Python | batchelor / Scanpy (mnnpy) | Nat. Biotech 2018 |
| Conos | Graph-based joint kNN alignment | R | conos | Nat. Methods 2019 |
| Harmony | Iterative PC correction (soft-clustering) | R / Python | harmony / harmonypy | Nat. Methods 2019 |
| Scanorama | Manifold alignment + SVD-based merging | Python | scanorama | Nat. Biotech 2019 |
Note: This is not an exhaustive list, just a selection of popular methods that we’ll cover in the exercises.
| Method | Algorithm | Language | Library | Ref |
|---|---|---|---|---|
| ComBat | Empirical Bayes location/scale adjustment | R | sva | Biostatistics 2007 |
| ComBat-seq | Negative binomial ComBat for counts | R | sva (ComBat_seq) | NAR Genomics 2020 |
| limma | Linear models with empirical Bayes | R | limma | NAR 2015 |
| RUVSeq | Remove unwanted variation (factor-based) | R | RUVSeq | Nat. Biotech 2014 |
| SVA | Surrogate variable analysis | R | sva | PLoS Genet 2007 |
| ZINB-WaVE | Zero-inflated negative binomial | R | zinbwave | Nat. Commun. 2018 |
Note: These methods were developed for bulk data and usually do not perform well on scRNA-seq data because of sparsity and zero inflation. They are included here for context.
| Method | Algorithm | Language | Library | Ref |
|---|---|---|---|---|
| scVI | Variational autoencoder | Python | scvi-tools | Nat Biotech 2022 |
| scANVI | Conditional variational autoencoder | Python | scvi-tools | Mol Syst Biol 2021 |
| scGen | Causal VAE for perturbation modeling | Python | scgen | Nat Methods 2019 |
| SAUCIE | Self-supervised autoencoder | Python | SAUCIE | Nat Methods 2019 |
| DESC | Deep embedded single cell clustering | Python | DESC | Nat Commun 2020 |
Note: These methods are powerful but often require more computational resources and expertise to use effectively. Watch this space: the future might be here!
Core concept: Identifies pairs of cells that are each other’s nearest neighbors across batches in high-dimensional gene expression space
No predefined populations needed: Works without cell type annotations
How it works:
Minimal biological artifacts: Conservative approach, reducing risk of removing biological differences
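The core concept above can be sketched in a few lines. This is a toy illustration in plain Python, not the implementation used by any MNN package: the function names, the 2-D coordinates, and the choice of Euclidean distance with `k=1` are all assumptions made for the example. Batch B is batch A shifted by roughly (1, 1), plus one population at (10, 0) that exists only in B.

```python
import math

def k_nearest(query, reference, k):
    """Indices of the k cells in `reference` closest to `query` (Euclidean distance)."""
    order = sorted(range(len(reference)), key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of A and cell j of B are in each other's k nearest neighbours."""
    nn_of_a_in_b = [k_nearest(cell, batch_b, k) for cell in batch_a]
    nn_of_b_in_a = [k_nearest(cell, batch_a, k) for cell in batch_b]
    return [
        (i, j)
        for i, neighbours in enumerate(nn_of_a_in_b)
        for j in sorted(neighbours)
        if i in nn_of_b_in_a[j]  # keep the pair only if the relation is mutual
    ]

# Toy "expression space" with two shared populations and one B-only population.
batch_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0), (10.0, 0.0)]

pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)
```

The B-only cell at (10, 0) finds a nearest neighbour in A, but that neighbour prefers a different B cell, so no pair is formed. This mutual requirement is what lets the approach work without annotations and avoid forcing batch-specific populations together.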
Harmony iteratively adjusts cell embeddings so that each cluster contains a balanced mix of batches
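To make the idea concrete, here is a deliberately simplified, single-pass sketch in plain Python. It is not Harmony itself: it assumes hard, fixed cluster labels instead of soft clustering, performs no iteration, and the function names and toy coordinates are hypothetical. It only shows the "move each batch toward the shared cluster centre" step.

```python
from collections import defaultdict

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def harmony_like_step(embeddings, batches, clusters):
    """Shift every cell so its batch centroid within the cluster matches the cluster centroid."""
    by_cluster = defaultdict(list)
    by_cluster_batch = defaultdict(list)
    for x, b, c in zip(embeddings, batches, clusters):
        by_cluster[c].append(x)
        by_cluster_batch[(c, b)].append(x)
    cluster_centroid = {c: centroid(pts) for c, pts in by_cluster.items()}
    batch_centroid = {key: centroid(pts) for key, pts in by_cluster_batch.items()}

    corrected = []
    for x, b, c in zip(embeddings, batches, clusters):
        # Offset = where the whole cluster sits minus where this batch sits in it.
        offset = tuple(g - l for g, l in zip(cluster_centroid[c], batch_centroid[(c, b)]))
        corrected.append(tuple(xi + oi for xi, oi in zip(x, offset)))
    return corrected

# Toy 2-D "PC space": batch b2 is shifted by +1 on the first axis in both clusters.
embeddings = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.2, 0.0),
              (5.0, 5.0), (5.2, 5.0), (6.0, 5.0), (6.2, 5.0)]
batches = ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

corrected = harmony_like_step(embeddings, batches, clusters)
print(corrected)
```

After the step, the two batches share the same centroid inside each cluster while the distance between clusters is unchanged. The real method repeats clustering and correction until the assignments stabilise.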
| Aspect | CCA | MNN | Harmony | Scanorama | Conos |
|---|---|---|---|---|---|
| Speed | Fast | Slow | Fast | Medium | Slow |
| Scalability | ~10k | ~50k | 125k+ | ~50k | ~30k |
| Batch Correction | Good | Excellent | Good | Excellent | Very Good |
| Biology Preservation | Medium | Excellent | Excellent | Good | Excellent |
| Rare Type Detection | Poor | Excellent | Good | Medium | Good |
| Learning Curve | Low | Medium | Low | Medium | High |
| Language | R | R/Python | R/Python | Python | R |
| Graph-Based | No | No | No | No | Yes |
| Claim | Source | Key Finding |
|---|---|---|
| Speed: Harmony > CCA > Scanorama > MNN > Conos | Tran et al. (2020) | Harmony 30-200x faster than MNN at 500k cells |
| Scalability: Harmony(125k+) > MNN/Scanorama(50k) > CCA(10k) | Korsunsky et al. (2019) | Harmony scales to 1M+ cells, MNN fails >50k |
| Batch Correction: MNN/Scanorama > Conos > Harmony > CCA | Luecken et al. (2022) | MNN/Scanorama top kBET/LISI scores |
| Biology Preservation: MNN/Harmony/Conos > Scanorama > CCA | Luecken et al. (2022) | MNN best cell-cycle/trajectory conservation |
| Rare Type Detection: MNN superior | Distilled information | MNN being local correction, preserves rare populations |
| Learning Curve: CCA/Harmony < MNN/Scanorama < Conos | Personal opinion | Harmony/CCA: simple params; Conos: complex graphs |
Overcorrection
Batch structure ignored
Incompatible preprocessing
Confounded design
Key takeaway: Integration balances batch removal with biology preservation
Summary slides for revision follow
Q1. What is the main goal of scRNA-seq integration?
A: To align cells so they group by biological signal (for example, cell type) rather than technical or sample-specific differences.
Q2. Why can integration be needed even with one sequencing batch?
A: Different samples in the same run can still differ in preparation, dissociation, or depth, making cells cluster by sample instead of cell type.
Q3. What pattern in a UMAP suggests batch effect problems?
A: Clusters separate mainly by batch label even though the same cell type exists in all batches.
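The visual impression can be backed by a crude number: for each cell, the fraction of its nearest neighbours that come from the same batch. The sketch below is a plain-Python illustration under stated assumptions. The function name `same_batch_fraction`, the toy coordinates, and `k=3` are hypothetical, and the score is far simpler than established metrics such as kBET or LISI.

```python
import math

def same_batch_fraction(embedding, batches, k=3):
    """Mean fraction of each cell's k nearest neighbours that share its batch label."""
    fractions = []
    for i, cell in enumerate(embedding):
        others = [j for j in range(len(embedding)) if j != i]
        others.sort(key=lambda j: math.dist(cell, embedding[j]))
        same = sum(batches[j] == batches[i] for j in others[:k])
        fractions.append(same / k)
    return sum(fractions) / len(fractions)

# Two tight groups of four cells each in a toy 2-D embedding.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]

separated_by_batch = [1, 1, 1, 1, 2, 2, 2, 2]  # each group is a single batch
well_mixed = [1, 2, 1, 2, 1, 2, 1, 2]          # both batches present in each group

print(same_batch_fraction(embedding, separated_by_batch))
print(same_batch_fraction(embedding, well_mixed))
```

A value near 1 means neighbourhoods are dominated by a single batch. Lower values indicate mixing, although the score alone cannot tell whether the groups are truly the same cell type.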
Q4. Conceptually, what does CCA do for integration?
A: It finds shared low-dimensional “correlated directions” across datasets where similar cell states align.
Q5. What are “anchors” in Seurat’s CCA-based integration?
A: Pairs or small groups of cells from different datasets that represent the same biological state and are used as reference points for alignment.
Q6. Simple example
A: Take T cells from patient A and B that look similar, treat them as anchors, and warp both datasets so T cells from A and B overlap in the joint space.
Q7. What is the core idea behind Harmony?
A: Start in PCA space, then iteratively shift cell embeddings so each cluster has similar contributions from each batch, reducing batch-driven structure.
Q8. Why is Harmony often practical for large datasets?
A: It operates on PCs, is relatively fast, and fits into a simple PCA → Harmony → UMAP workflow.
Q9. Metaphor for Harmony’s adjustment
A: You have groups of similar cells on a map; Harmony gently nudges cells from overrepresented batches so each group has a fair mix of batches.
Q10. What is a mutual nearest neighbors (MNN) pair?
A: A pair of cells, one from each dataset, where each is among the nearest neighbors of the other in expression space.
Q11. How does MNN-based integration use these pairs?
A: It assumes MNN pairs represent the same cell state and computes local correction vectors to align the datasets around those pairs.
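A minimal sketch of the local correction step, assuming the MNN pairs are already known. Each pair yields a vector from the B cell to its A partner, and every B cell is shifted by a Gaussian-weighted average of those vectors, so nearby pairs dominate. The function name `correct_batch`, the kernel width `sigma`, and the toy coordinates are assumptions for illustration; real implementations differ in many details.

```python
import math

def correct_batch(batch_b, batch_a, pairs, sigma=1.0):
    """Shift each cell in batch_b by a distance-weighted average of MNN correction vectors."""
    # One correction vector per MNN pair, pointing from the B cell to its A partner.
    vectors = [tuple(a - b for a, b in zip(batch_a[i], batch_b[j])) for i, j in pairs]
    anchors = [batch_b[j] for _i, j in pairs]

    corrected = []
    for cell in batch_b:
        weights = [math.exp(-math.dist(cell, anchor) ** 2 / (2 * sigma ** 2)) for anchor in anchors]
        total = sum(weights)
        if total == 0.0:  # cell is far from every anchor: fall back to a plain average
            weights, total = [1.0] * len(anchors), float(len(anchors))
        shift = [sum(w * v[d] for w, v in zip(weights, vectors)) / total for d in range(len(cell))]
        corrected.append(tuple(c + s for c, s in zip(cell, shift)))
    return corrected

# Two populations; the batch effect differs slightly between them.
batch_a = [(0.0, 0.0), (5.0, 5.0)]
batch_b = [(1.0, 1.0), (5.5, 6.0)]
pairs = [(0, 0), (1, 1)]  # (index in A, index in B), assumed to be given

corrected_b = correct_batch(batch_b, batch_a, pairs, sigma=1.0)
print(corrected_b)
```

Because the weights fall off with distance, each population is moved by its own pair's vector rather than by one global shift, which is what "local" means here.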
Q12. Intuitive example for MNN in lecture
A: If a T cell in dataset A and a T cell in dataset B are each other’s closest match, link them and use that link to locally shift one dataset towards the other.
Q13. What is the main idea behind Scanorama?
A: It uses MNN-like matches across many datasets and stitches them together like a panorama to build a unified embedding.
Q14. Why is Scanorama useful for many-dataset studies?
A: It can integrate multiple experiments at once by finding shared cell populations and merging them into a common space.
Q15. What is Conos’ high-level approach to integration?
A: It builds a graph for each sample, then constructs a joint graph across samples using cross-sample neighbors, and analyzes this global graph for clustering and alignment.
Q16. When is Conos particularly attractive?
A: When you have many samples and want to keep per-sample identity while still obtaining joint clustering from the combined graph.
Q17. Simple metaphor for Conos
A: Each sample is its own social network; Conos adds edges between similar people across networks and then analyzes the big merged social graph.
Q18. How can you spot “over-integration”?
A: Marker genes lose specificity across clusters, known biology is obscured, and distinct cell types merge together.
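Loss of marker specificity can be quantified in a rough way. The sketch below computes, for one marker gene, the share of its summed per-cluster mean expression that falls in its top cluster. The function name `marker_specificity`, the score itself, and the expression values are hypothetical choices for illustration, not an established metric.

```python
from collections import defaultdict

def marker_specificity(expression, clusters):
    """Top cluster's share of the summed per-cluster mean expression (1.0 = fully specific)."""
    values = defaultdict(list)
    for value, cluster in zip(expression, clusters):
        values[cluster].append(value)
    means = [sum(v) / len(v) for v in values.values()]
    total = sum(means)
    return max(means) / total if total > 0 else 0.0

# Toy expression of a T-cell marker in six cells: three T cells, then three B cells.
marker_expression = [5, 6, 4, 0, 0, 0]

clusters_good = ["T", "T", "T", "B", "B", "B"]        # clusters follow cell type
clusters_merged = ["c1", "c2", "c1", "c2", "c1", "c2"]  # cell types mixed within clusters

print(marker_specificity(marker_expression, clusters_good))
print(marker_specificity(marker_expression, clusters_merged))
```

A clear drop in this score after integration, for markers that were specific before, is one hint that distinct cell types have been merged.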
Q19. How can you spot “under-integration”?
A: On the UMAP, cells still separate by technical group (e.g., sequencing run 1 vs. run 2) instead of mixing within shared cell types.
Q20. Simple rule-of-thumb when choosing an integration method
A: Use anchor/CCA or Harmony for typical multi-sample datasets; consider MNN/Scanorama for more local alignment or many datasets; use Conos for graph-based integration across many samples while preserving sample identity.