olijacklu/K-MeansPCA-Project

PCA-Based K-Means Clustering: A Review and Implementation

This repository contains a comprehensive review and implementation of the paper "K-means Clustering via Principal Component Analysis" by Ding and He (2004), as part of the Geometric Data Analysis course at MVA 2024-2025.

Project Overview

We review the theoretical connection between K-means clustering and Principal Component Analysis (PCA) established by Ding and He. The project provides:

  1. A critical analysis of the original paper's methodology and findings
  2. Our own implementation of the PCA-based clustering algorithm
  3. Extensions using advanced PCA variants (Kernel PCA and Sparse PCA)
  4. Empirical evaluation on multiple datasets

Repository Structure

├── Implementation_vs_Sklearn.ipynb      # Notebook comparing our implementation to sklearn
├── gene_data.py                         # Utilities for loading gene expression dataset
├── kernel_pca_implementation.py         # Kernel PCA implementation
├── newspaper_data.py                    # Utilities for loading newspaper dataset
├── sparse_pca_implementation.py         # Sparse PCA implementation
├── standard_pca_implementation.py       # Standard PCA implementation
└── README.md                            # This file

Theoretical Background

The paper establishes a formal link between PCA and K-means clustering by proving that:

  • Principal components are the continuous (relaxed) solutions for the discrete cluster membership indicators of K-means
  • The subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms
  • For K clusters, the first K-1 principal components solve the continuous relaxation of the K-means problem; the discrete cluster assignment must still be recovered from them
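
The K = 2 case can be checked numerically. The following is a minimal sketch and not code from this repository: the helper names `kmeans_objective` and `pc1_split` are hypothetical, and the data is synthetic. It splits the points by the sign of the first principal component and computes the lower bound on the K-means objective implied by the relaxation (total sum of squares of the centered data minus its largest eigenvalue).

```python
import numpy as np


def kmeans_objective(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


def pc1_split(X):
    """Two-way split by the sign of the first principal component.

    Returns the labels and the relaxation lower bound on the K=2 objective.
    """
    Y = X - X.mean(axis=0)                       # center the data
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    labels = (U[:, 0] > 0).astype(int)           # sign of PC1 in sample space
    lower_bound = (Y ** 2).sum() - s[0] ** 2     # trace(Y Y^T) - lambda_1
    return labels, lower_bound


# Synthetic example: two well-separated Gaussian blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=+3.0, scale=0.5, size=(50, 4)),
    rng.normal(loc=-3.0, scale=0.5, size=(50, 4)),
])
labels, bound = pc1_split(X)
objective = kmeans_objective(X, labels)
```

On data like this, `objective` should sit at or above `bound`, and the two blobs should receive different labels. On less separated data the sign split is only an approximation to the best 2-means partition.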

Implementations

We provide implementations of three different PCA variants for clustering:

Standard PCA

Our base implementation follows the original paper's approach:

  1. Center the data
  2. Compute covariance matrix
  3. Extract top K-1 principal components
  4. Construct connectivity matrix
  5. Apply recursive spectral clustering to identify clusters
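
Steps 1 to 4 above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code in `standard_pca_implementation.py`: the function name `pca_connectivity` is hypothetical, and the normalization and the `beta = 0.5` threshold reflect our reading of the connectivity analysis in Ding and He (2004). Step 5 (the recursive clustering) is deliberately left out.

```python
import numpy as np


def pca_connectivity(X, n_clusters, beta=0.5):
    """Thresholded connectivity matrix from the top K-1 principal components."""
    Xc = X - X.mean(axis=0)                        # 1. center the data
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)            # 2. covariance (features x features)
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues come back ascending
    top = eigvecs[:, ::-1][:, : n_clusters - 1]    # 3. top K-1 principal directions
    scores = Xc @ top                              # samples projected onto them
    V = scores / np.linalg.norm(scores, axis=0)    # unit-norm components in sample space
    C = V @ V.T                                    # 4. connectivity C = sum_k v_k v_k^T
    d = np.sqrt(np.clip(np.diag(C), 1e-12, None))  # guard against division by zero
    P = C / np.outer(d, d)                         # normalized entries lie in [-1, 1]
    P[P < beta] = 0.0                              # assumed threshold: drop weak links
    return P


# Synthetic example: two tight blobs of 10 points each (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(scale=0.1, size=(10, 3)) + np.array([5.0, 0.0, 0.0]),
    rng.normal(scale=0.1, size=(10, 3)) - np.array([5.0, 0.0, 0.0]),
])
P = pca_connectivity(X, n_clusters=2)
```

For well-separated clusters the resulting matrix is close to block-diagonal: entries for pairs in the same blob are near 1, and entries for pairs in different blobs are zeroed by the threshold.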

Kernel PCA

Extends the standard approach by:

  1. Using kernel functions (primarily RBF) to handle non-linear structures
  2. Computing kernel matrix instead of covariance matrix
  3. Following the same recursive clustering approach on the kernel-transformed data
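
A minimal NumPy sketch of the first two steps, assuming an RBF kernel with a hypothetical bandwidth parameter `gamma`. The function name `rbf_kernel_pca` is illustrative and is not taken from `kernel_pca_implementation.py`. The kernel matrix is centered in feature space before the eigendecomposition, which is the kernel counterpart of centering the data.

```python
import numpy as np


def rbf_kernel_pca(X, n_components, gamma=1.0):
    """Project samples onto the leading kernel principal components (RBF kernel)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))   # kernel matrix (samples x samples)

    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones   # center in feature space

    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))  # embedded coordinates


# Synthetic example: two tight blobs of 10 points each (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(scale=0.1, size=(10, 2)) + np.array([5.0, 0.0]),
    rng.normal(scale=0.1, size=(10, 2)) - np.array([5.0, 0.0]),
])
Z = rbf_kernel_pca(X, n_components=1, gamma=1.0)
```

The returned coordinates can then be fed to the same clustering step as in the standard approach. The result is sensitive to `gamma`, so the value used here should not be read as a recommendation.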

Sparse PCA

Implements a sparsity-constrained version that:

  1. Extracts principal components with sparsity constraints
  2. Reduces feature dimensionality while preserving clustering information
  3. Improves interpretability but increases computational cost
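
To illustrate what a sparsity constraint does to a principal component, here is a minimal sketch using power iteration with soft-thresholding. This is one simple heuristic for obtaining sparse loadings, chosen for brevity; it is not necessarily the method used in `sparse_pca_implementation.py`, and the function name `sparse_first_component` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def sparse_first_component(X, threshold, n_iter=200):
    """Leading loading vector with small entries driven exactly to zero."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    w = np.linalg.eigh(cov)[1][:, -1]          # start from the dense leading eigenvector
    for _ in range(n_iter):
        u = cov @ w                            # power-iteration step
        u = np.sign(u) * np.maximum(np.abs(u) - threshold, 0.0)  # soft-threshold
        norm = np.linalg.norm(u)
        if norm == 0.0:                        # threshold too large: keep last iterate
            break
        w = u / norm
    return w


# Synthetic example: only the first 2 of 6 features carry cluster structure.
rng = np.random.default_rng(0)
signal = np.repeat([3.0, -3.0], 20)[:, None]   # 20 points per cluster
X = rng.normal(scale=0.1, size=(40, 6))
X[:, :2] += signal
w = sparse_first_component(X, threshold=1.0)
```

In this example the loadings on the four noise features are set exactly to zero, so the component can be read directly as "features 0 and 1 separate the clusters". The extra iterations and the need to tune `threshold` are where the additional cost comes from.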

Datasets

We evaluate our implementations on two primary datasets:

  1. Newspaper Dataset: Text data from the 20 Newsgroups corpus, with multiple configurations:

    • A2, B2: 2-class problems
    • A5, B5: 5-class problems
  2. Gene Expression Dataset: High-dimensional RNA-Seq data with 801 observations and 20,531 features

Key Findings

Our review and implementation revealed several insights:

  1. Theoretical Connection: We verified the mathematical link between PCA and K-means established in the paper, with some corrections to the original proofs.

  2. Performance: Our standard PCA implementation achieved comparable or better performance than sklearn's K-means on the Newspaper dataset, confirming the paper's findings.

  3. Advanced PCA Variants:

    • Sparse PCA showed slightly lower performance on the Newspaper dataset
    • Kernel PCA performed well on simple datasets but struggled with the high-dimensional gene expression data
  4. Limitations: We identified limitations in the original paper, including imprecise definitions and a lack of reproducibility details.

Usage

To run the implementation comparison notebook:

jupyter notebook Implementation_vs_Sklearn.ipynb

Credits

This project was developed by:

  • Oliver JACK
  • Eva ROBILLARD
  • Paulo SILVA

References

Ding, C., & He, X. (2004). K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 29).

About

Implementation and analysis of the PCA-based K-means clustering method proposed in the paper "K-means Clustering via Principal Component Analysis" by Ding and He (2004).
