Probabilistic PCA (PPCA) with missing-data support — fast C++ core, clean Python API.
ppca-cpp implements Probabilistic Principal Component Analysis (PPCA) as described by Tipping & Bishop (1999), with a focus on speed, usability, and robust handling of missing data. The core is written in C++ (Armadillo), exposed via a simple Python interface.
- Handles missing values natively: No need for manual imputation; just use np.nan for missing entries.
- Familiar API: Drop-in replacement for scikit-learn PCA with attributes like components_, explained_variance_, etc.
- Probabilistic modeling: Compute log-likelihoods, posterior latent variable distributions, multiple imputations, and more.
- Fast and scalable: Optimized C++ backend for large datasets.
- Flexible: Supports both batch and online (mini-batch) EM.
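To illustrate what native missing-value handling means for a Gaussian model such as PPCA, here is a minimal NumPy sketch. It evaluates each row's log-density using only its observed entries, by selecting the matching sub-vector of the mean and sub-block of the covariance. The function name gaussian_loglik_observed is hypothetical and the code is not taken from ppca-cpp; it only shows the marginalization idea, assuming the model covariance has the PPCA form W Wᵀ + σ²I.

```python
import numpy as np


def gaussian_loglik_observed(X, mean, cov):
    """Per-row log-density of the observed (non-NaN) entries under N(mean, cov).

    Hypothetical helper for illustration; not part of the ppca-cpp API.
    """
    out = np.empty(X.shape[0])
    for i, x in enumerate(X):
        obs = ~np.isnan(x)                  # mask of observed features in this row
        k = int(obs.sum())
        if k == 0:                          # nothing observed: empty product, log-density 0
            out[i] = 0.0
            continue
        diff = x[obs] - mean[obs]
        cov_oo = cov[np.ix_(obs, obs)]      # covariance block of the observed features
        _, logdet = np.linalg.slogdet(cov_oo)
        maha = diff @ np.linalg.solve(cov_oo, diff)
        out[i] = -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)
    return out


# Example with made-up PPCA parameters (W, mu, sigma2 are illustrative values).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))
mu = np.zeros(4)
sigma2 = 0.1
cov = W @ W.T + sigma2 * np.eye(4)
X_demo = rng.multivariate_normal(mu, cov, size=5)
X_demo[0, 1] = np.nan                       # mark one entry as missing
ll_demo = gaussian_loglik_observed(X_demo, mu, cov)
```

Because the marginal of a Gaussian over a subset of coordinates is again Gaussian, no imputation step is needed before scoring a row.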
pip install ppca-py
Note: pre-built wheels are produced only for Linux and macOS (CI builds target ubuntu-latest and macos-latest). On other platforms (e.g. Windows) you will need to build from source (see further below).
Usage example:
    import numpy as np
    from ppca import PPCA

    X_train = np.random.randn(600, 10) + 0.1  # (n_samples, n_features)
    X_train[::7, 3] = np.nan  # missing values
    X_test = np.random.randn(100, 10) + 0.1
    X_test[::7, 2] = np.nan  # missing values

    model = PPCA(n_components=3, batch_size=200)
    model.fit(X_train)

    mZ, covZ = model.posterior_latent(X_test)  # latent representation
    mX, covX = model.likelihood(mZ)  # reconstruction
    ll = model.score_samples(X_test)  # data log likelihood

    # multiple imputation (return shape: (n_draws, n_samples, n_features))
    X_imputed = model.sample_missing(X_test, n_draws=5)

    # estimate of components, mean and noise variance
    print("Components:", model.components_)
    print("Mean:", model.mean_)
    print("Noise variance:", model.noise_variance_)
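The multiple-imputation output above has shape (n_draws, n_samples, n_features). One possible way to summarize such draws is sketched below: a single filled-in matrix plus a per-entry spread. The helper pool_imputations is hypothetical and not part of ppca-cpp, and the draws array is a synthetic stand-in so that the snippet runs without the library installed. It needs at least two draws, since it uses the sample standard deviation.

```python
import numpy as np


def pool_imputations(X, draws):
    """Summarise imputation draws of shape (n_draws, n_samples, n_features).

    Hypothetical helper for illustration; not part of the ppca-cpp API.
    Returns (X_filled, spread): observed entries are copied from X, missing
    entries are replaced by the mean over draws; spread holds the standard
    deviation over draws at missing entries and 0 at observed ones.
    """
    missing = np.isnan(X)
    mean = draws.mean(axis=0)
    std = draws.std(axis=0, ddof=1)         # sample std over draws (needs >= 2 draws)
    X_filled = np.where(missing, mean, X)
    spread = np.where(missing, std, 0.0)
    return X_filled, spread


# Synthetic stand-in for the (n_draws, n_samples, n_features) array of draws.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((6, 3))
X_demo[::2, 1] = np.nan
draws_demo = rng.standard_normal((5, 6, 3))
X_filled_demo, spread_demo = pool_imputations(X_demo, draws_demo)
```

Keeping the spread alongside the pooled mean preserves some of the uncertainty information that a single imputed matrix would discard.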
For a short PPCA reference, see docs/ppca.md; usage examples are provided in examples/.
For development, install from source:
    git clone https://github.com/brdav/ppca-cpp.git
    cd ppca-cpp
    git submodule update --init --recursive
    python -m pip install -e '.[dev]'
    pre-commit install
Minimum build dependencies
- CMake >= 3.18
- Python >= 3.9 (development headers)
- C++17-capable compiler (clang, gcc, or MSVC)
- BLAS/LAPACK implementation (OpenBLAS, MKL, or Accelerate)
Note: Builds on Windows are untested in CI. You can attempt a Windows build but expect manual steps.
The PPCA C++ core can also be built independently:
    cmake -S src/cpp -B build/cpp -DCMAKE_BUILD_TYPE=Release
    cmake --build build/cpp --target ppca -j
PPCA uses an Expectation-Maximization (EM) algorithm to learn parameters through maximum likelihood estimation. For details see the reference paper listed below. The equations for the EM algorithm in the presence of missing values are shown in docs/equations.md.
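As a rough illustration of the EM updates, the following NumPy sketch follows the complete-data equations from Tipping & Bishop (1999). It is not the library's implementation: it assumes there are no missing values, runs in plain batch mode, and the function name ppca_em, the random initialization, and the fixed iteration count are choices made here for brevity.

```python
import numpy as np


def ppca_em(X, n_components, n_iter=200, seed=0):
    """EM for PPCA on complete data (no NaNs); returns (W, mu, sigma2).

    Illustrative sketch only; not the ppca-cpp implementation.
    Model: x = W z + mu + eps, z ~ N(0, I), eps ~ N(0, sigma2 * I).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)                      # ML estimate of the mean
    Xc = X - mu                              # centred data, shape (n, d)
    W = rng.standard_normal((d, n_components))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables.
        M = W.T @ W + sigma2 * np.eye(n_components)
        M_inv = np.linalg.inv(M)
        Ez = Xc @ W @ M_inv                  # row i is E[z_i], shape (n, q)
        sum_Ezz = n * sigma2 * M_inv + Ez.T @ Ez   # sum_i E[z_i z_i^T]
        # M-step: update the loadings, then the noise variance.
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (
            np.sum(Xc ** 2)
            - 2.0 * np.sum(Ez * (Xc @ W))
            + np.trace(sum_Ezz @ W.T @ W)
        ) / (n * d)
    return W, mu, sigma2


# Example on synthetic low-rank data plus isotropic noise.
rng = np.random.default_rng(1)
Z_true = rng.standard_normal((500, 2))
W_true = 3.0 * rng.standard_normal((6, 2))
X_demo = Z_true @ W_true.T + 0.5 * rng.standard_normal((500, 6)) + 1.0
W_hat, mu_hat, sigma2_hat = ppca_em(X_demo, n_components=2)
```

For complete data the maximum-likelihood noise variance is also available in closed form, as the average of the discarded eigenvalues of the sample covariance, which gives a convenient check on the EM result.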
If you use this code academically, cite the original PPCA paper:
- M. Tipping & C. Bishop. Probabilistic Principal Component Analysis. JRSS B, 1999.
You may also reference the library name or URL.
MIT License — see LICENSE.
Questions or requests? Open an issue.