Paper companion site

Generalized Contrastive PCA

A contrastive dimensionality reduction method for identifying structure that differs between two high-dimensional datasets, without tuning an alpha parameter.

gcPCA is built for experiments where the main question is comparative, for example: what changes in the transcriptome between health and disease? Instead of finding the directions with the largest total variance in one dataset, gcPCA ranks directions by how strongly their variance is enriched in one dataset relative to the other. The symmetric v4 formulation treats the two conditions on equal footing and returns components whose signs and eigenvalues have a direct interpretation.

pip install generalized-contrastive-PCA

Published in PLOS Computational Biology on February 7, 2025. Python, MATLAB, and R implementations are available, including an R package on CRAN.

Why gcPCA

PCA explains variance within one dataset. gcPCA targets differences between datasets.

PCA is often useful, but it answers a single-dataset question: which directions explain the most variance? Many experiments are instead comparative: treated versus control, wake versus sleep, baseline versus stimulation, or one cell state versus another. In those cases, the largest sources of variance may be shared across conditions, while the experimentally relevant structure lies in directions that differ between them.

PCA does not model the contrast directly

PCA can summarize variation in one matrix, but it does not ask whether a direction is more variable in one condition than another. For paired datasets, that distinction is often the main scientific question.

Shared structure can mask condition-specific structure

Batch effects, global activity fluctuations, population-level covariance, or other shared sources of variation can dominate a PCA embedding even when they are not the features of interest.

cPCA introduced an explicit contrast

Contrastive PCA compares a foreground dataset against a background dataset using an alpha parameter. In practice, users often scan over alpha values and inspect several embeddings before choosing one.
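The alpha scan can be illustrated with a minimal NumPy sketch. It assumes the standard cPCA formulation, in which components are the top eigenvectors of the foreground covariance minus alpha times the background covariance. The function name `cpca_directions`, the toy data, and the alpha grid are hypothetical and are not taken from any package.

```python
import numpy as np

def cpca_directions(fg, bg, alpha, n_components=2):
    """Top eigenvectors of C_fg - alpha * C_bg (illustrative cPCA sketch)."""
    c_fg = np.cov(fg, rowvar=False)
    c_bg = np.cov(bg, rowvar=False)
    evals, evecs = np.linalg.eigh(c_fg - alpha * c_bg)
    order = np.argsort(evals)[::-1][:n_components]  # largest eigenvalues first
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
n = 5000
# Feature 0 carries large shared variance; feature 2 is enriched in the foreground.
bg = rng.normal(size=(n, 3)) * np.array([3.0, 1.0, 1.0])
fg = rng.normal(size=(n, 3)) * np.array([3.0, 1.0, 2.0])

# Record which feature dominates the first component for each alpha.
top_feature = {}
for alpha in (0.0, 1.0, 5.0):
    _, vecs = cpca_directions(fg, bg, alpha, n_components=1)
    top_feature[alpha] = int(np.argmax(np.abs(vecs[:, 0])))
    print(f"alpha={alpha}: first component dominated by feature {top_feature[alpha]}")
```

With alpha set to 0 the sketch reduces to PCA on the foreground, so the shared high-variance feature dominates. Larger alpha values shift the first component toward the foreground-enriched feature, which is why the choice of alpha matters.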

gcPCA uses a normalized contrast

gcPCA replaces the manual alpha scan with a normalized objective. This shifts the emphasis from absolute variance to relative enrichment, producing a more direct workflow for comparing two conditions.

Visual intuition

A low-variance direction can still be the most condition-specific direction.

In datasets with strong shared variance, the direction that best separates the conditions may not be the direction with the largest absolute variance. gcPCA is designed to recover directions whose variance changes most strongly between conditions.

Figure: Synthetic example. A condition-specific signal can be small in absolute variance but large in relative enrichment. gcPCA prioritizes that relative change, allowing it to recover structure that PCA can miss when shared variance dominates.
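The same intuition can be reproduced numerically. The sketch below builds hypothetical two-feature data in which feature 0 has large variance shared by both conditions and feature 1 has small variance that is larger in condition A. The variable names and scales are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Feature 0: large variance shared by both conditions.
# Feature 1: small variance, larger in condition A than in condition B.
X_A = rng.normal(size=(n, 2)) * np.array([10.0, 1.0])
X_B = rng.normal(size=(n, 2)) * np.array([10.0, 0.5])

var_A = X_A.var(axis=0, ddof=1)
var_B = X_B.var(axis=0, ddof=1)

# Relative enrichment per feature: (var_A - var_B) / (var_A + var_B).
relative_enrichment = (var_A - var_B) / (var_A + var_B)

print("variance in A:      ", var_A.round(2))
print("variance in B:      ", var_B.round(2))
print("relative enrichment:", relative_enrichment.round(2))
```

Feature 0 dominates the absolute variance in both conditions, yet its relative enrichment is close to zero. Feature 1 is small in absolute terms but clearly enriched in A.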

Method overview

From two covariance matrices to contrastive components.

Estimate covariance in both conditions

Start with two datasets measured in the same feature space, A and B, and estimate their covariance matrices \(C_A\) and \(C_B\).
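As a minimal sketch of this step, the covariance matrices can be estimated with NumPy. The example assumes rows are samples and columns are features, and it uses random placeholder data. It computes plain sample covariances only; any additional normalization applied by the package is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
X_A = rng.normal(size=(300, 4))  # condition A: 300 samples, 4 features
X_B = rng.normal(size=(450, 4))  # condition B: 450 samples, same 4 features

# Sample counts may differ, but the feature dimension must match.
if X_A.shape[1] != X_B.shape[1]:
    raise ValueError("A and B must be measured in the same feature space")

# rowvar=False treats columns as features; np.cov mean-centers each column.
C_A = np.cov(X_A, rowvar=False)
C_B = np.cov(X_B, rowvar=False)

print(C_A.shape, C_B.shape)
```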

Construct a contrastive objective

Define an objective over directions \(x\) that scores highly when the variance along \(x\) differs between the two datasets.

Normalize the contrast

Normalize the contrast so components reflect relative enrichment rather than simply favoring high-variance directions.

Solve the eigenproblem and project

Solve the resulting eigenvalue problem, order the components, and project each dataset into the contrastive component space.
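The four steps above can be sketched in NumPy for the symmetric contrast \((C_A - C_B)\) normalized by \((C_A + C_B)\). This is an illustrative solver, not the package's implementation: the function name `contrastive_components` is hypothetical, and the sketch assumes \(C_A + C_B\) is full rank so that its inverse square root exists.

```python
import numpy as np

def contrastive_components(X_A, X_B):
    """Illustrative v4-style solver; assumes C_A + C_B is full rank."""
    C_A = np.cov(X_A, rowvar=False)
    C_B = np.cov(X_B, rowvar=False)

    # Inverse square root of the normalizing matrix C_A + C_B.
    s, U = np.linalg.eigh(C_A + C_B)
    W = U @ np.diag(1.0 / np.sqrt(s)) @ U.T

    # Ordinary symmetric eigenproblem in the normalized space.
    evals, Y = np.linalg.eigh(W @ (C_A - C_B) @ W)
    order = np.argsort(evals)[::-1]  # most A-enriched component first
    evals = evals[order]

    # Map back to feature space and rescale each direction to unit norm.
    directions = W @ Y[:, order]
    directions = directions / np.linalg.norm(directions, axis=0)
    return evals, directions

rng = np.random.default_rng(3)
n = 5000
# Feature 2 is more variable in A; feature 3 is more variable in B.
X_A = rng.normal(size=(n, 4)) * np.array([3.0, 1.0, 2.0, 1.0])
X_B = rng.normal(size=(n, 4)) * np.array([3.0, 1.0, 1.0, 2.0])

values, components = contrastive_components(X_A, X_B)

# Project each mean-centered dataset onto the contrastive components.
scores_A = (X_A - X_A.mean(axis=0)) @ components
scores_B = (X_B - X_B.mean(axis=0)) @ components

print("ordered values:", values.round(2))
```

In this toy example the first component aligns with the A-enriched feature and the last with the B-enriched feature, while the shared high-variance feature ends up near zero.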

Mathematical formulation

The v4 objective measures normalized contrastive variance.

Main objective

\[ \max_{x} \frac{x^\mathsf{T}(C_A - C_B)x}{x^\mathsf{T}(C_A + C_B)x}, \quad \text{subject to } \lVert x \rVert_2 = 1 \]

In v4, the numerator captures the difference in variance between conditions, while the denominator normalizes by their total variance. This produces a symmetric objective: swapping A and B changes the sign of the result but not the underlying comparison.
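One way to see why this objective leads to an eigenvalue problem is the standard generalized Rayleigh quotient argument, sketched here under the assumption that \(C_A + C_B\) is positive definite. The symbols \(\lambda\), \(M\), and \(y\) are introduced only for this sketch.

```latex
\[
\lambda(x) = \frac{x^\mathsf{T}(C_A - C_B)\,x}{x^\mathsf{T}(C_A + C_B)\,x},
\qquad
\nabla_x \lambda(x) = 0
\;\Longrightarrow\;
(C_A - C_B)\,x = \lambda\,(C_A + C_B)\,x .
\]
\[
\text{With } M = (C_A + C_B)^{1/2} \text{ and } y = Mx:
\qquad
M^{-1}(C_A - C_B)\,M^{-1}\,y = \lambda\,y .
\]
```

Because the ratio is unchanged when \(x\) is rescaled, the unit-norm constraint only fixes the scale of each solution. Swapping A and B negates the numerator and leaves the denominator unchanged, so each \(\lambda\) flips sign while the directions stay the same.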

Interpretation

Symmetric comparison: neither condition is treated as the fixed background.
Bounded eigenvalues: values lie in \([-1, 1]\).
Positive values: variance is enriched in dataset A.
Negative values: variance is enriched in dataset B.
Values near zero: the component is not strongly enriched in either dataset.
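The bound follows because both \(x^\mathsf{T} C_A x\) and \(x^\mathsf{T} C_B x\) are non-negative, so their difference can never exceed their sum in magnitude. The reading rules above can be written as a small helper. The function name `describe_enrichment` and the threshold `tol` are hypothetical: the text does not define how close to zero counts as "near zero", so 0.1 is an arbitrary illustrative choice.

```python
def describe_enrichment(value, tol=0.1):
    """Label a v4-style value in [-1, 1]; `tol` is an arbitrary illustrative cutoff."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("v4 values are expected to lie in [-1, 1]")
    if value > tol:
        return "enriched in A"
    if value < -tol:
        return "enriched in B"
    return "not strongly enriched in either"

for v in (0.6, -0.45, 0.02):
    print(v, "->", describe_enrichment(v))
```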

Compact summary of gcPCA variants
Variant	Objective	Interpretation
v2	\(A / B\)	Asymmetric ratio useful when B is a background condition.
v3	\((A - B) / B\)	Asymmetric relative change with B as the reference.
v4	\((A - B) / (A + B)\)	Symmetric contrast and the recommended default for paired conditions.
v2.1 / v3.1 / v4.1	Orthogonal variants	Preserve orthogonality in the original feature space when that constraint matters.
Sparse options	L1 feature selection	Encourage interpretability by selecting compact sets of informative features.
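To compare how the three objectives score a single direction, the sketch below evaluates them for given variances. It assumes that A and B in the table are shorthand for the variance of each dataset along one direction, \(x^\mathsf{T} C_A x\) and \(x^\mathsf{T} C_B x\). The function name `variant_values` and the example numbers are hypothetical.

```python
def variant_values(var_a, var_b):
    """Evaluate the v2, v3, and v4 objectives for one direction (illustrative)."""
    if var_a < 0 or var_b <= 0:
        raise ValueError("expected var_a >= 0 and var_b > 0")
    return {
        "v2": var_a / var_b,
        "v3": (var_a - var_b) / var_b,
        "v4": (var_a - var_b) / (var_a + var_b),
    }

forward = variant_values(4.0, 1.0)  # direction with 4x more variance in A
swapped = variant_values(1.0, 4.0)  # same direction with A and B exchanged
print("A vs B:", forward)
print("B vs A:", swapped)
```

Exchanging the datasets only flips the sign of the v4 value, whereas the v2 and v3 values change asymmetrically because B serves as the reference.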

Example results

Examples where the contrastive structure is not the largest source of variance.

gcPCA is useful when the structure that differs between conditions is weaker than the variation shared by both conditions. In these cases, a variance-maximizing method can emphasize the wrong directions.

Synthetic data

In synthetic data, PCA follows the directions with the largest total variance, even when those directions are shared across conditions. cPCA can recover contrastive structure, but usually requires scanning over \(\alpha\). gcPCA orders components by normalized enrichment, making the condition-specific directions easier to identify without a manual parameter search.

Hippocampal replay example

During ripples, PCA is dominated by large shared modes of activity. gcPCA instead emphasizes dimensions whose variance is enriched during replay, producing a lower-dimensional view that is more directly tied to the replay-related structure. The method does this without labels and without an alpha sweep.

Comparison to related methods

gcPCA sits between classical PCA and supervised separation methods.

How gcPCA differs from common dimensionality reduction methods
Method	What it optimizes	Requires hyperparameter search?	Symmetric?	Best use case
PCA	Total variance in a single dataset.	No	Not applicable	Summarizing dominant variation within one condition.
LDA	Separation between labeled groups relative to within-group spread.	No	No	Supervised discrimination when class labels are available.
cPCA	\(\mathrm{Var}_A - \alpha \mathrm{Var}_B\)	Yes	No	Foreground-versus-background comparisons when manual \(\alpha\) scans are acceptable.
gcPCA	Normalized contrastive variance, including the symmetric v4 ratio.	No	Yes, in v4	Unsupervised comparison of two conditions with interpretable contrastive structure.

Python package / quick start

Install gcPCA and run the v4 method.

The Python package provides the workflow shown below. R and MATLAB implementations are also available from the project repository, with additional setup notes in the GitHub wiki.

Install from PyPI

pip install generalized-contrastive-PCA

The PyPI package and the source repository are both publicly available.

Also available

Use the Python package for the workflow shown here, install gcpca from CRAN for the R interface, or work from the MATLAB implementation when you need a native MATLAB pipeline.

Quick start in Python

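from generalized_contrastive_PCA import gcPCA

# X_A and X_B hold the two conditions as (samples x features) arrays
# with the same features in the same column order.
model = gcPCA(method="v4", normalize_flag=True)
model.fit(X_A, X_B)

scores_A = model.Ra_scores_        # dataset A projected onto the gcPCA components
scores_B = model.Rb_scores_        # dataset B projected onto the gcPCA components
loadings = model.loadings_         # feature loadings of each component
eigenvalues = model.gcPCA_values_  # objective value of each component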

The repository documents the v4 workflow using gcPCA(method="v4", normalize_flag=True), followed by fit(X_A, X_B). Common outputs include loadings_, Ra_scores_, Rb_scores_, and gcPCA_values_.

R and MATLAB implementations are available as well. The GitHub wiki includes additional guidance on package variants, language-specific workflows, and setup details.

Resources

Paper, code, tutorials, and implementation notes.

Paper

The peer-reviewed PLOS Computational Biology article introducing the method and its applications.

Main code repository

Python, MATLAB, and R implementations of gcPCA, plus repository-level installation and usage notes.

Python tutorial notebook for regular gcPCA

A notebook walkthrough using synthetic data, including expected inputs, common outputs, and interpretation of gcPCA variance structure.

Python tutorial notebook for sparse gcPCA

A second notebook covering sparse gcPCA, including the motivation for feature selection and an example using single-cell RNA sequencing data.

Documentation / wiki

Language-specific notes, usage guidance, and additional setup details collected in the project wiki.

Citation

Citing gcPCA

Plain citation

de Oliveira EF, Garg P, Hjerling-Leffler J, Batista-Brito R, Sjulson L (2025) Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA. PLOS Computational Biology 21(2): e1012747. https://doi.org/10.1371/journal.pcbi.1012747

BibTeX

@article{deOliveira2025gcPCA,
  author = {de Oliveira, Eliezyer F. and Garg, Pranjal and Hjerling-Leffler, Jens and Batista-Brito, Renata and Sjulson, Lucas},
  title = {Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA},
  journal = {PLOS Computational Biology},
  year = {2025},
  volume = {21},
  number = {2},
  pages = {e1012747},
  doi = {10.1371/journal.pcbi.1012747},
  url = {https://doi.org/10.1371/journal.pcbi.1012747}
}