Skip to content

Performance issue: O(n³) complexity in cluster assignment #1

@gvonness-apolitical

Description

@gvonness-apolitical

Summary

I'm using hdbscan-ts for clustering 6,000+ data points and noticed significant performance issues compared to the Python HDBSCAN implementation.

Environment

  • hdbscan-ts version: 1.0.0
  • Node.js version: 20.x
  • Data: 6,000+ 512-dimensional embeddings

Performance Comparison

Implementation Time Speedup
Python (hdbscan + numba) ~17 seconds 220x faster
hdbscan-ts 65+ minutes baseline

Suspected Cause

The performance issue appears to be in the cluster assignment/lookup logic. Based on profiling, the bottleneck seems to be related to repeated linear searches that result in O(n³) complexity.

Specifically, operations that should use Set/Map lookups for O(1) membership checks may be using Array.includes() which is O(n).

Impact

This makes hdbscan-ts impractical for datasets above ~1,000 points in production scenarios where clustering needs to complete within reasonable timeframes.

Workaround

I currently use a Python bridge to call the Cython-accelerated Python HDBSCAN implementation, which works well but adds a Python dependency.

Potential Fix

If the issue is indeed Array.includes() vs Set.has():

// Before (O(n))
if (array.includes(item)) { ... }

// After (O(1))
const set = new Set(array);
if (set.has(item)) { ... }

Request

Would you be open to:

  1. Investigating the performance bottleneck?
  2. Accepting a PR if I identify and fix the issue?

I can provide more detailed profiling data if helpful.

Thank you for creating this library - HDBSCAN in TypeScript is valuable for the ecosystem!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions