-
Notifications
You must be signed in to change notification settings - Fork 1
Performance issue: O(n³) complexity in cluster assignment #1
Description
Summary
I'm using hdbscan-ts for clustering 6,000+ data points and noticed significant performance issues compared to the Python HDBSCAN implementation.
Environment
- hdbscan-ts version: 1.0.0
- Node.js version: 20.x
- Data: 6,000+ 512-dimensional embeddings
Performance Comparison
| Implementation | Time | Speedup |
|---|---|---|
| Python (hdbscan + numba) | ~17 seconds | 220x faster |
| hdbscan-ts | 65+ minutes | baseline |
Suspected Cause
The performance issue appears to be in the cluster assignment/lookup logic. Based on profiling, the bottleneck seems to be related to repeated linear searches that result in O(n³) complexity.
Specifically, operations that should use Set/Map lookups for O(1) membership checks may be using Array.includes() which is O(n).
Impact
This makes hdbscan-ts impractical for datasets above ~1,000 points in production scenarios where clustering needs to complete within reasonable timeframes.
Workaround
I currently use a Python bridge to call the Cython-accelerated Python HDBSCAN implementation, which works well but adds a Python dependency.
Potential Fix
If the issue is indeed Array.includes() vs Set.has():
// Before (O(n))
if (array.includes(item)) { ... }
// After (O(1))
const set = new Set(array);
if (set.has(item)) { ... }Request
Would you be open to:
- Investigating the performance bottleneck?
- Accepting a PR if I identify and fix the issue?
I can provide more detailed profiling data if helpful.
Thank you for creating this library - HDBSCAN in TypeScript is valuable for the ecosystem!