One True Cluster

Introduction

This package provides a cluster function with the following signature:

def cluster(
    xs: npt.NDArray,
    k: int,
    metric: metrics.Metric = metrics.l2,
    method: Optional[Method] = None,
    maxiter=MAXITER,
) -> tuple[npt.NDArray, npt.NDArray]:

The returned tuple contains the centroids and the labels.

The Metric and Method types are protocols defined thus:

class Metric(Protocol):
    def __call__(self, x: npt.NDArray, y: npt.NDArray) -> float:
        ...

class Method(Protocol):
    def __call__(
        self, xs: npt.NDArray, ys: npt.NDArray, metric: metrics.Metric, maxiter: int
    ) -> tuple[npt.NDArray, npt.NDArray, bool]:
        ...

Metrics must be implemented in Numba. Methods are implemented by the package and there should generally be no need to specify one.

cluster uses Hamerly's algorithm to compute the Lloyd iteration when $n / k$ is large, and Philips's compare algorithm when $n / k$ is small. Initialization is performed using a version of K-Means++ wherein candidate clusters are sampled from a subset of the dataset. Relying on the law of large numbers.

Some observations

Hamerly isn't always the fastest algorithm ; it is advantaged as $n / k$ grows but Philips's algorithms (sort and compare) are better as $n / k$ decreases
Initialization methods developed since K-Means++ are slower, and do not seem more accurate, than accelerating K-Means++ by sampling from a random subset of the dataset ; I do not know whether some datasets can be problematic for this method

TODO

Generic version of Hamerly & LLoyd which do not use Numba to allow the clustering of arbitrary objects
- Must not only have a generic metric but also a generic prototype (which currently is np.mean) ; implementing this would mean different top level routines (lloyd, hamerly, etc...), which don't use the vector sum trick to update centroids and call prototype instead
- This will notably be useful for sparse data !
- Sparse data example
- Randomized Linear Algebra for very high dimensional data ?
Triangular & non triangular metrics, using a decorator ; different metrics use different algorithms
Implement some bregman divergences

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
notebooks		notebooks
src/cluster		src/cluster
.gitignore		.gitignore
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

One True Cluster

Introduction

Some observations

TODO

About

Uh oh!

Releases

Packages

Languages

ltricot/cluster

Folders and files

Latest commit

History

Repository files navigation

One True Cluster

Introduction

Some observations

TODO

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages