Paper companion site
A contrastive dimensionality reduction method for identifying structure that differs between two high-dimensional datasets, without tuning an alpha parameter.
gcPCA is built for experiments where the main question is comparative, for example: what changes in the transcriptome between health and disease? Instead of finding the directions with the largest total variance in one dataset, gcPCA ranks directions by how strongly their variance is enriched in one dataset relative to the other. The symmetric v4 formulation treats the two conditions on equal footing and returns components whose signs and eigenvalues have a direct interpretation.
pip install generalized-contrastive-PCA
Why gcPCA
PCA is often useful, but it answers a single-dataset question: which directions explain the most variance? Many experiments are instead comparative: treated versus control, wake versus sleep, baseline versus stimulation, or one cell state versus another. In those cases, the largest sources of variance may be shared across conditions, while the experimentally relevant structure lies in directions that differ between them.
PCA can summarize variation in one matrix, but it does not ask whether a direction is more variable in one condition than another. For paired datasets, that distinction is often the main scientific question.
Batch effects, global activity fluctuations, population-level covariance, or other shared sources of variation can dominate a PCA embedding even when they are not the features of interest.
Contrastive PCA compares a foreground dataset against a background dataset using an alpha parameter. In practice, users often scan over alpha values and inspect several embeddings before choosing one.
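To make the alpha dependence concrete, here is a minimal NumPy sketch of the cPCA idea described above: it takes the top eigenvector of the foreground covariance minus alpha times the background covariance, for a few alpha values. The function name `cpca_top_direction`, the synthetic data, and the alpha grid are illustrative assumptions, not part of any published package.

```python
import numpy as np

def cpca_top_direction(X_fg, X_bg, alpha):
    """Top eigenvector of C_fg - alpha * C_bg (rows are samples)."""
    C_fg = np.cov(X_fg, rowvar=False)
    C_bg = np.cov(X_bg, rowvar=False)
    vals, vecs = np.linalg.eigh(C_fg - alpha * C_bg)
    return vecs[:, -1]  # eigh returns eigenvalues in ascending order

rng = np.random.default_rng(0)
# Hypothetical data: feature 0 has large shared variance,
# feature 1 is enriched in the foreground only.
X_fg = rng.normal(size=(2000, 3)) * np.array([5.0, 3.0, 1.0])
X_bg = rng.normal(size=(2000, 3)) * np.array([5.0, 1.0, 1.0])

dominant_by_alpha = {}
for alpha in (0.0, 0.5, 1.0, 2.0):
    direction = cpca_top_direction(X_fg, X_bg, alpha)
    dominant_by_alpha[alpha] = int(np.argmax(np.abs(direction)))
    print(f"alpha={alpha}: dominant feature = {dominant_by_alpha[alpha]}")
```

In this toy case the selected direction depends on alpha: with alpha = 0 the result is ordinary PCA on the foreground and follows the shared feature, while a large enough alpha moves it to the enriched feature. Choosing where that switch should happen is the manual step that the scan requires.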
gcPCA replaces the manual alpha scan with a normalized objective. This shifts the emphasis from absolute variance to relative enrichment, producing a more direct workflow for comparing two conditions.
Visual intuition
In datasets with strong shared variance, the direction that best separates the conditions may not be the direction with the largest absolute variance. gcPCA is designed to recover directions whose variance changes most strongly between conditions.
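A small synthetic check can illustrate this point. The sketch below uses hypothetical axis-aligned data, so it compares per-feature variances rather than running gcPCA itself: the top principal component of the pooled data follows the large shared feature, while the normalized variance change is largest in a different feature.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scales: feature 0 is large and shared,
# feature 1 changes between conditions, feature 2 is small and shared.
X_a = rng.normal(size=(2000, 3)) * np.array([5.0, 2.0, 1.0])
X_b = rng.normal(size=(2000, 3)) * np.array([5.0, 0.5, 1.0])

# PCA on the pooled, centered data: the top component follows total variance.
pooled = np.vstack([X_a, X_b])
pooled = pooled - pooled.mean(axis=0)
_, _, Vt = np.linalg.svd(pooled, full_matrices=False)
top_pc_feature = int(np.argmax(np.abs(Vt[0])))

# Per-feature normalized variance change, (a - b) / (a + b).
var_a = X_a.var(axis=0, ddof=1)
var_b = X_b.var(axis=0, ddof=1)
contrast = (var_a - var_b) / (var_a + var_b)
most_changed_feature = int(np.argmax(np.abs(contrast)))

print("top PC is dominated by feature", top_pc_feature)
print("largest normalized variance change is in feature", most_changed_feature)
```

Real data are rarely axis-aligned, which is why the method searches over directions \(x\) instead of individual features.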
Method overview
Start with two datasets measured in the same feature space, A and B, and estimate their covariance matrices \(C_A\) and \(C_B\).
Define directions \(x\) that score highly when variance differs between the two datasets.
Normalize the contrast so components reflect relative enrichment rather than simply favoring high-variance directions.
Solve the resulting eigenvalue problem, order the components, and project each dataset into the contrastive component space.
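The four steps above can be sketched in NumPy for the v4 contrast. This is an illustration under stated assumptions, not the package's implementation: the function name `gcpca_v4_sketch` is hypothetical, rows are taken to be samples, \(C_A + C_B\) is assumed to be positive definite (more samples than features), and scaling the loadings to unit length is one possible convention.

```python
import numpy as np

def gcpca_v4_sketch(X_a, X_b):
    """Illustrative v4 contrast: (C_a - C_b) normalized by (C_a + C_b)."""
    # Step 1: covariance of each dataset (rows = samples, columns = features).
    C_a = np.cov(X_a, rowvar=False)
    C_b = np.cov(X_b, rowvar=False)
    # Steps 2-3: contrast in the numerator, total variance in the denominator.
    num = C_a - C_b
    den = C_a + C_b
    # Whiten by den^(-1/2); assumes den is positive definite.
    w, V = np.linalg.eigh(den)
    den_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    M = den_inv_sqrt @ num @ den_inv_sqrt
    M = (M + M.T) / 2.0  # remove floating-point asymmetry
    # Step 4: solve the eigenproblem and order components, largest first.
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    vals = vals[order]
    # Map back to feature space and scale each loading to unit length.
    loadings = den_inv_sqrt @ vecs[:, order]
    loadings = loadings / np.linalg.norm(loadings, axis=0)
    # Project each centered dataset onto the components.
    scores_a = (X_a - X_a.mean(axis=0)) @ loadings
    scores_b = (X_b - X_b.mean(axis=0)) @ loadings
    return vals, loadings, scores_a, scores_b

rng = np.random.default_rng(0)
# Hypothetical data: feature 0 enriched in A, feature 1 enriched in B,
# feature 2 large but shared.
X_a = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 5.0])
X_b = rng.normal(size=(2000, 3)) * np.array([1.0, 3.0, 5.0])
values, loadings, scores_a, scores_b = gcpca_v4_sketch(X_a, X_b)
print(np.round(values, 2))
```

With this construction the eigenvalues lie between -1 and 1. Positive values mark directions with more variance in A, negative values mark directions with more variance in B, and the large shared feature ends up near zero.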
Mathematical formulation
Main objective
In v4, the numerator captures the difference in variance between conditions, while the denominator normalizes by their total variance. This produces a symmetric objective: swapping A and B changes the sign of the result but not the underlying comparison.
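Written out with the covariance matrices \(C_A\) and \(C_B\) from the method overview, the objective described here takes the following form. This is a reconstruction from the description above and the v4 row of the variant table, so the paper remains the authoritative statement.

```latex
\[
x^{*} \;=\; \arg\max_{x}\;
\frac{x^{\top}\,(C_A - C_B)\,x}{x^{\top}\,(C_A + C_B)\,x}
\]
```

Because \(x^{\top} C_A x\) and \(x^{\top} C_B x\) are both nonnegative, the ratio is bounded between \(-1\) and \(1\): a value of \(1\) means all variance along \(x\) comes from A, \(-1\) means it all comes from B, and \(0\) means the two conditions have equal variance along \(x\). Swapping \(C_A\) and \(C_B\) negates the numerator and leaves the denominator unchanged, which is the sign flip noted above.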
Interpretation
| Variant | Objective (A, B: variance of each dataset along \(x\)) | Interpretation |
|---|---|---|
| v2 | \(A / B\) | Asymmetric ratio useful when B is a background condition. |
| v3 | \((A - B) / B\) | Asymmetric relative change with B as the reference. |
| v4 | \((A - B) / (A + B)\) | Symmetric contrast and the recommended default for paired conditions. |
| v2.1 / v3.1 / v4.1 | Orthogonal variants | Preserve orthogonality in the original feature space when that constraint matters. |
| Sparse options | L1 feature selection | Encourage interpretability by selecting compact sets of informative features. |
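As a worked example of the first three rows, the sketch below evaluates each objective for a single direction, reading A and B as the variances \(x^{\top} C_A x\) and \(x^{\top} C_B x\). The helper `variant_objectives` and the toy covariance matrices are assumptions made for illustration; the orthogonal and sparse variants are not covered.

```python
import numpy as np

def variant_objectives(C_a, C_b, x):
    """Evaluate the v2, v3, and v4 objectives for one direction x."""
    a = float(x @ C_a @ x)  # variance of dataset A along x
    b = float(x @ C_b @ x)  # variance of dataset B along x
    return {"v2": a / b, "v3": (a - b) / b, "v4": (a - b) / (a + b)}

# Toy covariances: along the first axis, A has variance 9 and B has variance 1.
C_a = np.diag([9.0, 1.0])
C_b = np.diag([1.0, 1.0])
x = np.array([1.0, 0.0])

forward = variant_objectives(C_a, C_b, x)
swapped = variant_objectives(C_b, C_a, x)
print(forward)
print(swapped)
```

With a = 9 and b = 1 the three scores are 9, 8, and 0.8. After swapping the datasets they become 1/9, -8/9, and -0.8: only v4 changes by an exact sign flip, which is what makes it symmetric.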
Example results
gcPCA is useful when the structure that differs between conditions is weaker than the variation shared by both conditions. In these cases, a variance-maximizing method can emphasize the wrong directions.
In synthetic data, PCA follows the directions with the largest total variance, even when those directions are shared across conditions. cPCA can recover contrastive structure, but usually requires scanning over \(\alpha\). gcPCA orders components by normalized enrichment, making the condition-specific directions easier to identify without a manual parameter search.
During ripples, PCA is dominated by large shared modes of activity. gcPCA instead emphasizes dimensions whose variance is enriched during replay, producing a lower-dimensional view that is more directly tied to the replay-related structure. The method does this without labels and without an alpha sweep.
Comparison to related methods
| Method | What it optimizes | Requires hyperparameter search? | Symmetric? | Best use case |
|---|---|---|---|---|
| PCA | Total variance in a single dataset. | No | Not applicable | Summarizing dominant variation within one condition. |
| LDA | Separation between labeled groups relative to within-group spread. | No | No | Supervised discrimination when class labels are available. |
| cPCA | \(\mathrm{Var}_A - \alpha \mathrm{Var}_B\) | Yes | No | Foreground-versus-background comparisons when manual \(\alpha\) scans are acceptable. |
| gcPCA | Normalized contrastive variance, including the symmetric v4 ratio. | No | Yes, in v4 | Unsupervised comparison of two conditions with interpretable contrastive structure. |
Python package / quick start
The Python package provides the workflow shown below. R and MATLAB implementations are also available from the project repository, with additional setup notes in the GitHub wiki.
pip install generalized-contrastive-PCA
PyPI package and source repository are both publicly available.
Also available
Use the Python package for the workflow shown here, install gcpca from CRAN for the R interface, or work from the MATLAB implementation when you need a native MATLAB pipeline.
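import numpy as np
from generalized_contrastive_PCA import gcPCA
# Placeholder inputs for illustration (assumed layout: rows are samples, columns are shared features).
rng = np.random.default_rng(0)
X_A = rng.normal(size=(200, 10))
X_B = rng.normal(size=(150, 10))
# Fit the symmetric v4 variant on the two conditions.
model = gcPCA(method="v4", normalize_flag=True)
model.fit(X_A, X_B)
scores_A = model.Ra_scores_  # scores for condition A
scores_B = model.Rb_scores_  # scores for condition B
loadings = model.loadings_  # feature loadings of the components
eigenvalues = model.gcPCA_values_  # contrast value of each component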
The repository documents the v4 workflow using gcPCA(method="v4", normalize_flag=True), followed by fit(X_A, X_B). Common outputs include loadings_, Ra_scores_, Rb_scores_, and gcPCA_values_.
The GitHub wiki includes additional guidance on package variants, language-specific workflows, and setup details.
Resources
The peer-reviewed PLOS Computational Biology article introducing the method and its applications.
Open paper
Python, MATLAB, and R implementations of gcPCA, plus repository-level installation and usage notes.
View repository
A notebook walkthrough using synthetic data, including expected inputs, common outputs, and interpretation of gcPCA variance structure.
Open tutorial notebook 1
A second notebook covering sparse gcPCA, including the motivation for feature selection and an example using single-cell RNA sequencing data.
Open tutorial notebook 2
Language-specific notes, usage guidance, and additional setup details collected in the project wiki.
Open GitHub wiki
Citation
de Oliveira EF, Garg P, Hjerling-Leffler J, Batista-Brito R, Sjulson L (2025) Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA. PLOS Computational Biology 21(2): e1012747. https://doi.org/10.1371/journal.pcbi.1012747
@article{deOliveira2025gcPCA,
author = {de Oliveira, Eliezyer F. and Garg, Pranjal and Hjerling-Leffler, Jens and Batista-Brito, Renata and Sjulson, Lucas},
title = {Identifying patterns differing between high-dimensional datasets with generalized contrastive PCA},
journal = {PLOS Computational Biology},
year = {2025},
volume = {21},
number = {2},
pages = {e1012747},
doi = {10.1371/journal.pcbi.1012747},
url = {https://doi.org/10.1371/journal.pcbi.1012747}
}