This repository contains a comprehensive review and implementation of the paper "K-means Clustering via Principal Component Analysis" by Ding and He (2004), as part of the Geometric Data Analysis course at MVA 2024-2025.
We review the theoretical connection between K-means clustering and Principal Component Analysis (PCA) established by Ding and He. The project provides:
- A critical analysis of the original paper's methodology and findings
- Our own implementation of the PCA-based clustering algorithm
- Extensions using advanced PCA variants (Kernel PCA and Sparse PCA)
- Empirical evaluation on multiple datasets
├── Implementation_vs_Sklearn.ipynb # Notebook comparing our implementation to sklearn
├── gene_data.py # Utilities for loading gene expression dataset
├── kernel_pca_implementation.py # Kernel PCA implementation
├── newspaper_data.py # Utilities for loading newspaper dataset
├── sparse_pca_implementation.py # Sparse PCA implementation
├── standard_pca_implementation.py # Standard PCA implementation
└── README.md # This file
The paper establishes a formal link between PCA and K-means clustering by proving that:
- Principal components serve as continuous solutions to discrete cluster membership indicators
- The subspace spanned by cluster centroids corresponds to the truncated spectral expansion of the data covariance matrix
- For K clusters, the first K-1 principal components provide the optimal continuous solution to the K-means clustering problem
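As a quick sanity check of the K = 2 case above, the following sketch compares a sign split on the first principal component against sklearn's K-means. The synthetic blobs and the zero threshold are our illustrative choices, not the paper's experimental setup:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs (synthetic illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(100, 10)),
    rng.normal(loc=+5.0, scale=1.0, size=(100, 10)),
])

# First principal component of the centered data
pc1 = PCA(n_components=1).fit_transform(X - X.mean(axis=0)).ravel()
pca_labels = (pc1 > 0).astype(int)  # sign split = continuous indicator rounded

# K-means labels for K = 2
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement up to label permutation
agreement = max(np.mean(pca_labels == km_labels),
                np.mean(pca_labels != km_labels))
```

On data this well separated, the two labelings coincide exactly, which is the paper's claim in its simplest form.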
We provide implementations of three different PCA variants for clustering:
Standard PCA: Our base implementation follows the original paper's approach:
- Center the data
- Compute covariance matrix
- Extract top K-1 principal components
- Construct connectivity matrix
- Apply recursive spectral clustering to identify clusters
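The steps above can be sketched for a single two-way split. This is a simplified reading of the procedure, not the repository's code; the recursive version reapplies the split inside each subcluster until K clusters remain:

```python
import numpy as np

def pca_two_way_split(X):
    """One two-way split of the PCA-based clustering procedure
    (our reading of Ding & He, 2004)."""
    Xc = X - X.mean(axis=0)                    # 1. center the data
    cov = np.cov(Xc, rowvar=False)             # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    pc1 = eigvecs[:, -1]                       # 3. top K-1 = 1 principal component
    labels = (Xc @ pc1 > 0).astype(int)        # sign of projection -> indicator
    # 4. connectivity matrix: C[i, j] = 1 iff i and j land in the same block
    connectivity = (labels[:, None] == labels[None, :]).astype(int)
    return labels, connectivity

# Illustrative usage on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4.0, 1.0, (50, 5)),
               rng.normal(4.0, 1.0, (50, 5))])
labels, connectivity = pca_two_way_split(X)
```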
Kernel PCA: Extends the standard approach by:
- Using kernel functions (primarily RBF) to handle non-linear structures
- Computing kernel matrix instead of covariance matrix
- Following the same recursive clustering approach on the kernel-transformed data
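A minimal sketch of the kernel variant using sklearn's KernelPCA with an RBF kernel; the function name and the `gamma` parameter follow sklearn's API, not the repository's code:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kernel_pca_embedding(X, n_clusters, gamma=None):
    """Map the data into the leading K-1 kernel principal components
    (RBF kernel). The same recursive clustering step can then be run
    in the embedded space. `gamma` is sklearn's RBF bandwidth parameter."""
    kpca = KernelPCA(n_components=max(1, n_clusters - 1),
                     kernel="rbf", gamma=gamma)
    # The kernel matrix takes the place of the covariance matrix here
    return kpca.fit_transform(X)

# Illustrative usage: embed 60 points for a 3-cluster problem (K-1 = 2 dims)
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
Z = kernel_pca_embedding(X, n_clusters=3)
```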
Sparse PCA: Implements a sparsity-constrained version that:
- Extracts principal components with sparsity constraints
- Reduces feature dimensionality while preserving clustering information
- Improves interpretability at the cost of extra computation
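A small illustration of the sparse variant using sklearn's SparsePCA on synthetic data; `alpha` is sklearn's L1 penalty strength, not a parameter from the repository:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Synthetic data where a few features carry a shared strong signal
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 30))
X[:, :3] += 5 * rng.normal(size=(80, 1))  # first 3 features are informative

# L1-penalized loadings zero out most features, trading extra fit time
# for components that are easier to interpret
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
Z = spca.fit_transform(X - X.mean(axis=0))

sparsity = np.mean(spca.components_ == 0)  # fraction of zeroed loadings
```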
We evaluate our implementations on two primary datasets:
- Newspaper Dataset: text data from the 20 Newsgroups corpus, with multiple configurations:
  - A2, B2: 2-class problems
  - A5, B5: 5-class problems
- Gene Expression Dataset: high-dimensional RNA-Seq data with 801 observations and 20,531 features
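For reference, a 2-class subset of 20 Newsgroups can be assembled with sklearn. The paper's exact A2/B2/A5/B5 category compositions are not reproduced here, so the two groups below are an arbitrary illustrative pick:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Two arbitrary newsgroups standing in for one of the 2-class configurations
cats = ["rec.sport.baseball", "sci.space"]
raw = fetch_20newsgroups(subset="train", categories=cats,
                         remove=("headers", "footers", "quotes"))

# TF-IDF features, capped for tractability
X = TfidfVectorizer(max_features=2000).fit_transform(raw.data)
y = raw.target
```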
Our review and implementation revealed several insights:
- Theoretical Connection: We verified the mathematical link between PCA and K-means established in the paper, with some corrections to the original proofs.
- Performance: Our standard PCA implementation achieved performance comparable to or better than sklearn's K-means on the Newspaper dataset, confirming the paper's findings.
- Advanced PCA Variants:
  - Sparse PCA showed slightly lower performance on the Newspaper dataset
  - Kernel PCA performed well on simpler datasets but struggled with the high-dimensional gene expression data
- Limitations: We identified limitations in the original paper, including imprecise definitions and missing reproducibility details.
To run the implementation comparison notebook:
`jupyter notebook Implementation_vs_Sklearn.ipynb`

This project was developed by:
- Oliver JACK
- Eva ROBILLARD
- Paulo SILVA
Ding, C., & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML) (p. 29).