This repository contains a comprehensive review and implementation of the paper "K-means Clustering via Principal Component Analysis" by Ding and He (2004), as part of the Geometric Data Analysis course at MVA 2024-2025.
We review the theoretical connection between K-means clustering and Principal Component Analysis (PCA) established by Ding and He. The project provides:
- A critical analysis of the original paper's methodology and findings
- Our own implementation of the PCA-based clustering algorithm
- Extensions using advanced PCA variants (Kernel PCA and Sparse PCA)
- Empirical evaluation on multiple datasets
├── Implementation_vs_Sklearn.ipynb # Notebook comparing our implementation to sklearn
├── gene_data.py # Utilities for loading gene expression dataset
├── kernel_pca_implementation.py # Kernel PCA implementation
├── newspaper_data.py # Utilities for loading newspaper dataset
├── sparse_pca_implementation.py # Sparse PCA implementation
├── standard_pca_implementation.py # Standard PCA implementation
└── README.md # This file
The paper establishes a formal link between PCA and K-means clustering by proving that:
- Principal components serve as continuous solutions to discrete cluster membership indicators
- The subspace spanned by cluster centroids corresponds to the truncated spectral expansion of the data covariance matrix
- For K clusters, the first K-1 principal components provide the optimal continuous solution to the K-means clustering problem
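As a quick sanity check of the K = 2 case above, the following sketch compares a sign split on the first principal component against sklearn's K-means. The synthetic blobs and the zero threshold are our illustrative choices, not the paper's experimental setup:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs (synthetic illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(100, 10)),
    rng.normal(loc=+5.0, scale=1.0, size=(100, 10)),
])

# First principal component of the centered data
pc1 = PCA(n_components=1).fit_transform(X - X.mean(axis=0)).ravel()
pca_labels = (pc1 > 0).astype(int)  # sign split = continuous indicator rounded

# K-means labels for K = 2
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement up to label permutation
agreement = max(np.mean(pca_labels == km_labels),
                np.mean(pca_labels != km_labels))
```

On data this well separated, the two labelings coincide exactly, which is the paper's claim in its simplest form.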
We provide implementations of three different PCA variants for clustering:
Standard PCA: Our base implementation follows the original paper's approach:
- Center the data
- Compute covariance matrix
- Extract top K-1 principal components
- Construct connectivity matrix
- Apply recursive spectral clustering to identify clusters
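The steps above can be sketched for a single two-way split. This is a simplified reading of the procedure, not the repository's code; the recursive version reapplies the split inside each subcluster until K clusters remain:

```python
import numpy as np

def pca_two_way_split(X):
    """One two-way split of the PCA-based clustering procedure
    (our reading of Ding & He, 2004)."""
    Xc = X - X.mean(axis=0)                    # 1. center the data
    cov = np.cov(Xc, rowvar=False)             # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    pc1 = eigvecs[:, -1]                       # 3. top K-1 = 1 principal component
    labels = (Xc @ pc1 > 0).astype(int)        # sign of projection -> indicator
    # 4. connectivity matrix: C[i, j] = 1 iff i and j land in the same block
    connectivity = (labels[:, None] == labels[None, :]).astype(int)
    return labels, connectivity

# Illustrative usage on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4.0, 1.0, (50, 5)),
               rng.normal(4.0, 1.0, (50, 5))])
labels, connectivity = pca_two_way_split(X)
```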
Kernel PCA: Extends the standard approach by:
- Using kernel functions (primarily RBF) to handle non-linear structures
- Computing kernel matrix instead of covariance matrix
- Following the same recursive clustering approach on the kernel-transformed data
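A minimal sketch of the kernel variant using sklearn's KernelPCA with an RBF kernel; the function name and the `gamma` parameter follow sklearn's API, not the repository's code:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kernel_pca_embedding(X, n_clusters, gamma=None):
    """Map the data into the leading K-1 kernel principal components
    (RBF kernel). The same recursive clustering step can then be run
    in the embedded space. `gamma` is sklearn's RBF bandwidth parameter."""
    kpca = KernelPCA(n_components=max(1, n_clusters - 1),
                     kernel="rbf", gamma=gamma)
    # The kernel matrix takes the place of the covariance matrix here
    return kpca.fit_transform(X)

# Illustrative usage: embed 60 points for a 3-cluster problem (K-1 = 2 dims)
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
Z = kernel_pca_embedding(X, n_clusters=3)
```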
Sparse PCA: Implements a sparsity-constrained version that:
- Extracts principal components with sparsity constraints
- Reduces feature dimensionality while preserving clustering information
- Improves interpretability at the cost of extra computation
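A small illustration of the sparse variant using sklearn's SparsePCA on synthetic data; `alpha` is sklearn's L1 penalty strength, not a parameter from the repository:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Synthetic data where a few features carry a shared strong signal
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 30))
X[:, :3] += 5 * rng.normal(size=(80, 1))  # first 3 features are informative

# L1-penalized loadings zero out most features, trading extra fit time
# for components that are easier to interpret
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
Z = spca.fit_transform(X - X.mean(axis=0))

sparsity = np.mean(spca.components_ == 0)  # fraction of zeroed loadings
```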
We evaluate our implementations on two primary datasets:
- Newspaper Dataset: text data from the 20 Newsgroups corpus, with multiple configurations:
  - A2, B2: 2-class problems
  - A5, B5: 5-class problems
- Gene Expression Dataset: high-dimensional RNA-Seq data with 801 observations and 20,531 features
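For reference, a 2-class subset of 20 Newsgroups can be assembled with sklearn. The paper's exact A2/B2/A5/B5 category compositions are not reproduced here, so the two groups below are an arbitrary illustrative pick:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Two arbitrary newsgroups standing in for one of the 2-class configurations
cats = ["rec.sport.baseball", "sci.space"]
raw = fetch_20newsgroups(subset="train", categories=cats,
                         remove=("headers", "footers", "quotes"))

# TF-IDF features, capped for tractability
X = TfidfVectorizer(max_features=2000).fit_transform(raw.data)
y = raw.target
```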
Our review and implementation revealed several insights:
- Theoretical Connection: We verified the mathematical link between PCA and K-means established in the paper, with some corrections to the original proofs.
- Performance: Our standard PCA implementation achieved performance comparable to or better than sklearn's K-means on the Newspaper dataset, confirming the paper's findings.
- Advanced PCA Variants:
  - Sparse PCA showed slightly lower performance on the Newspaper dataset
  - Kernel PCA performed well on simpler datasets but struggled with the high-dimensional gene expression data
- Limitations: We identified limitations in the original paper, including imprecise definitions and missing reproducibility details.
To run the implementation comparison notebook:
`jupyter notebook Implementation_vs_Sklearn.ipynb`

This project was developed by:
- Oliver JACK
- Eva ROBILLARD
- Paulo SILVA
Ding, C., & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML) (p. 29).