
Clustering speed is very slow #97

@sweco-sekrsv

Description


Input:
About a million tree points spread out over an area of roughly 400 km²

I run this command:
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering false
It runs for about 30 seconds and generates:
Tiles written: 1877
Subtree files written: 122
SubtreeLevels: 4

Running with clustering:
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering true

With clustering it has been running for more than two hours now, and judging by the .cmpt files created so far I estimate it will take another 8 hours to finish.

CPU usage is only 10–20% and RAM usage is as low as 120 MB, so I think there is room for performance improvements.
I asked GPT about it and it describes the problem like this:

Why MiniBatchKMeans is slow for this problem

Context: ~1 million dense tree points over a 400 km² area

Reasons for poor performance

  • High computational cost
    Distance calculations scale with
    batchSize × numberOfClusters × numberOfFeatures
  • Dense spatial data
    Every feature of every point is evaluated (no sparsity benefit)
  • No spatial awareness
    K-Means ignores geographic locality and recomputes global distances
  • MiniBatch reduces samples, not dimensions
    It lowers the sample count per iteration, but does not reduce the per-distance feature cost
  • CPU-only implementation
    No GPU, limited vectorization in Accord

➡️ Result: high runtime with limited clustering quality.
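To make the cost argument concrete, here is a minimal pure-Python sketch of the mini-batch assignment step (illustrative only; this is not the Accord.NET implementation that i3dm.export uses, and the point/cluster counts are placeholders scaled down from the real data):

```python
import random

def assign_batch(batch, centroids):
    """One assignment pass of mini-batch k-means.

    Distance work is O(len(batch) * len(centroids) * dims): every
    sampled point is compared against every centroid in every
    dimension -- the term that dominates runtime as cluster count grows.
    """
    labels = []
    for p in batch:
        best_i, best_d2 = 0, float("inf")
        for i, c in enumerate(centroids):
            d2 = sum((a - b) ** 2 for a, b in zip(p, c))
            if d2 < best_d2:
                best_i, best_d2 = i, d2
        labels.append(best_i)
    return labels

random.seed(42)
# stand-in for ~1M tree points over a 20 km x 20 km area
points = [(random.uniform(0, 20000), random.uniform(0, 20000))
          for _ in range(1000)]
centroids = points[:500]   # many clusters when tiles hold at most 2000 features
batch = random.sample(points, 256)
labels = assign_batch(batch, centroids)   # 256 * 500 * 2 distance terms per pass
```

With ~1M points and hundreds of clusters, this inner product of batch size, cluster count, and dimensions is repeated every iteration until convergence, which matches the multi-hour runtime observed above.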


Recommended approach instead

✅ Spatial aggregation (key step)

  • Aggregate points into a regular grid (e.g. 10–25 m cells)
  • Compute cell features (tree density, mean height, canopy cover)
  • Reduces data size by 10–100×
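A minimal sketch of the grid-aggregation step (coordinates in metres, 25 m cells; the cell size and the per-cell statistics are assumptions for illustration, not options that i3dm.export exposes today):

```python
from collections import defaultdict

def aggregate_to_grid(points, cell_size=25.0):
    """Bucket (x, y, height) points into square grid cells and
    compute per-cell features: tree count and mean height."""
    cells = defaultdict(list)
    for x, y, h in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(h)
    return {
        key: {"count": len(hs), "mean_height": sum(hs) / len(hs)}
        for key, hs in cells.items()
    }

points = [(3.0, 4.0, 10.0), (7.0, 9.0, 14.0), (60.0, 5.0, 8.0)]
grid = aggregate_to_grid(points)
# the first two points share cell (0, 0); the third lands in cell (2, 0)
```

Each subsequent clustering pass then runs on the (much smaller) set of cells rather than on the raw million points.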

✅ Use spatial / density-based clustering

  • DBSCAN / HDBSCAN
    • Exploits natural tree density
    • No need to choose number of clusters
    • Handles irregular forest shapes well

✅ If fixed number of regions is required

  • Grid aggregation → PCA (2–3 components) → K-Means

Summary

❌ Raw MiniBatchKMeans on dense spatial tree data is slow and poorly suited
✅ Spatial reduction + density-based clustering is fast, scalable, and meaningful
