
Clustering speed is very slow #97

@sweco-sekrsv

Description


Input:
About a million tree points spread out over an area of roughly 400 km²

I run this command:
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering false
It runs for about 30 seconds and generates:
Tiles written: 1877
Subtree files written: 122
SubtreeLevels: 4

Running with clustering:
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering true

With clustering it has been running for more than two hours now, and judging by the .cmpt files created so far I estimate it will take another 8 hours to finish.

CPU usage is only 10–20% and RAM usage is as low as 120 MB, so I think there is room for performance improvements.
I asked GPT about it and it describes the problem like this:

Why MiniBatchKMeans is slow for this problem

Context: ~1 million dense tree points over a 400 km² area

Reasons for poor performance

  • High computational cost
    Distance calculations scale with
    batchSize × numberOfClusters × numberOfFeatures
  • Dense spatial data
    Every feature of every point is evaluated (no sparsity benefit)
  • No spatial awareness
    K-Means ignores geographic locality and recomputes global distances
  • MiniBatch reduces samples, not dimensions
    It lowers the sample count per iteration, but does not reduce the per-distance feature cost
  • CPU-only implementation
    No GPU, limited vectorization in Accord

➡️ Result: high runtime with limited clustering quality.
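To make the cost argument concrete, here is a minimal pure-Python sketch of the mini-batch assignment step (illustrative only; this is not the Accord.NET implementation that i3dm.export uses, and the point/cluster counts are placeholders scaled down from the real data):

```python
import random

def assign_batch(batch, centroids):
    """One assignment pass of mini-batch k-means.

    Distance work is O(len(batch) * len(centroids) * dims): every
    sampled point is compared against every centroid in every
    dimension -- the term that dominates runtime as cluster count grows.
    """
    labels = []
    for p in batch:
        best_i, best_d2 = 0, float("inf")
        for i, c in enumerate(centroids):
            d2 = sum((a - b) ** 2 for a, b in zip(p, c))
            if d2 < best_d2:
                best_i, best_d2 = i, d2
        labels.append(best_i)
    return labels

random.seed(42)
# stand-in for ~1M tree points over a 20 km x 20 km area
points = [(random.uniform(0, 20000), random.uniform(0, 20000))
          for _ in range(1000)]
centroids = points[:500]   # many clusters when tiles hold at most 2000 features
batch = random.sample(points, 256)
labels = assign_batch(batch, centroids)   # 256 * 500 * 2 distance terms per pass
```

With ~1M points and hundreds of clusters, this inner product of batch size, cluster count, and dimensions is repeated every iteration until convergence, which matches the multi-hour runtime observed above.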


Recommended approach instead

✅ Spatial aggregation (key step)

  • Aggregate points into a regular grid (e.g. 10–25 m cells)
  • Compute cell features (tree density, mean height, canopy cover)
  • Reduces data size by 10–100×
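A minimal sketch of the grid-aggregation step (coordinates in metres, 25 m cells; the cell size and the per-cell statistics are assumptions for illustration, not options that i3dm.export exposes today):

```python
from collections import defaultdict

def aggregate_to_grid(points, cell_size=25.0):
    """Bucket (x, y, height) points into square grid cells and
    compute per-cell features: tree count and mean height."""
    cells = defaultdict(list)
    for x, y, h in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(h)
    return {
        key: {"count": len(hs), "mean_height": sum(hs) / len(hs)}
        for key, hs in cells.items()
    }

points = [(3.0, 4.0, 10.0), (7.0, 9.0, 14.0), (60.0, 5.0, 8.0)]
grid = aggregate_to_grid(points)
# the first two points share cell (0, 0); the third lands in cell (2, 0)
```

Each subsequent clustering pass then runs on the (much smaller) set of cells rather than on the raw million points.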

✅ Use spatial / density-based clustering

  • DBSCAN / HDBSCAN
    • Exploits natural tree density
    • No need to choose number of clusters
    • Handles irregular forest shapes well

✅ If fixed number of regions is required

  • Grid aggregation → PCA (2–3 components) → K-Means

Summary

❌ Raw MiniBatchKMeans on dense spatial tree data is slow and poorly suited
✅ Spatial reduction + density-based clustering is fast, scalable, and meaningful
