Description
Input:
About a million tree points spread out over an area of roughly 400 km²
I run this command:

```
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering false
```
It runs for about 30 seconds and generates:

```
Tiles written: 1877
Subtree files written: 122
SubtreeLevels: 4
```
Running with clustering:

```
i3dm.export.exe -c Host=127.0.0.1;Username=postgres;Password=password1234%;Database=fme_test;Port=5432;sslmode=prefer -t globe.tommelilla_trees_export_i3dm -o C:\work\tomelilla\testoutput_cross_cluster2 -g 700 --geometrycolumn geom --max_features_per_tile 2000 --use_external_model true --use_clustering true
```
With clustering it has now been running for more than two hours, and judging by the .cmpt files created so far I estimate it will take another 8 hours to finish.
CPU usage is only 10–20% and RAM usage is as low as 120 MB, so I think there is room for performance improvements.
I asked GPT about it and it describes the problem like this:
Why MiniBatchKMeans is slow for this problem
Context: ~1 million dense tree points over a 400 km² area
Reasons for poor performance
- High computational cost: distance calculations scale with batchSize × numberOfClusters × numberOfFeatures
- Dense spatial data: every feature of every point is evaluated (no sparsity benefit)
- No spatial awareness: K-Means ignores geographic locality and recomputes global distances
- MiniBatch helps samples, not dimensions: it reduces sample count per iteration, but does not reduce feature cost
- CPU-only implementation: no GPU, limited vectorization in Accord
➡️ Result: high runtime with limited clustering quality.
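As a rough back-of-the-envelope illustration of that cost term (the batch size and cluster count below are assumptions for illustration, not values from i3dm.export):

```python
# Rough per-epoch cost of MiniBatchKMeans under illustrative assumptions.
batch_size = 1024        # assumed mini-batch size
n_clusters = 500         # hypothetical: ~1M points / 2000 points per tile
n_features = 2           # x, y coordinates
points = 1_000_000

iterations = points // batch_size                  # batches per pass over the data
ops_per_iteration = batch_size * n_clusters * n_features
total_ops = iterations * ops_per_iteration
print(f"~{total_ops:,} distance-component operations per epoch")
```

Even with only two coordinate features, the cluster count multiplies every batch, which is why the run is CPU-bound for hours while barely touching RAM.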
Recommended approach instead
✅ Spatial aggregation (key step)
- Aggregate points into a regular grid (e.g. 10–25 m cells)
- Compute cell features (tree density, mean height, canopy cover)
- Reduces data size by 10–100×
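A minimal sketch of the grid-aggregation step with numpy (cell size, coordinate units, and the synthetic input are assumptions; real data would come from the PostGIS table):

```python
import numpy as np

def grid_aggregate(xy, cell_size=20.0):
    """Snap points to a regular grid; return one centroid and a count per occupied cell.

    xy: (N, 2) array of projected coordinates in metres (assumed).
    """
    cells = np.floor(xy / cell_size).astype(np.int64)   # integer cell index per point
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 2))
    np.add.at(sums, inverse, xy)                        # sum point coords per cell
    centroids = sums / counts[:, None]                  # mean position per cell
    return centroids, counts

# Illustration: 1,000,000 random points in a 20 km x 20 km area (~400 km²).
rng = np.random.default_rng(0)
pts = rng.uniform(0, 20_000, size=(1_000_000, 2))
centroids, counts = grid_aggregate(pts, cell_size=20.0)
print(len(centroids), "occupied cells from", counts.sum(), "points")
```

The downstream clustering then runs on the (much smaller) set of cell centroids weighted by their counts, instead of on every tree.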
✅ Use spatial / density-based clustering
- DBSCAN / HDBSCAN
- Exploits natural tree density
- No need to choose number of clusters
- Handles irregular forest shapes well
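A sketch of density-based clustering with scikit-learn's DBSCAN on synthetic "tree stand" data (the eps and min_samples values are assumptions to be tuned per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Three hypothetical tree stands plus scattered noise, in projected metres.
stands = [rng.normal(loc=c, scale=30.0, size=(500, 2))
          for c in ((0, 0), (400, 100), (150, 600))]
noise = rng.uniform(-200, 800, size=(100, 2))
points = np.vstack(stands + [noise])

# eps: points within 60 m are neighbours; min_samples: a core point needs 10 of them.
labels = DBSCAN(eps=60.0, min_samples=10).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", int((labels == -1).sum()), "noise points")
```

Note that DBSCAN labels sparse points as noise (-1) rather than forcing them into a cluster, which matches irregular forest shapes better than K-Means.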
✅ If fixed number of regions is required
- Grid aggregation → PCA (2–3 components) → K-Means
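The fixed-region pipeline above can be sketched with scikit-learn; the per-cell feature table here is synthetic and the feature names (density, mean height, canopy cover) are placeholders for whatever attributes the real data provides:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n_cells = 5_000  # hypothetical number of occupied grid cells after aggregation
features = np.column_stack([
    rng.uniform(0, 20_000, n_cells),   # cell centre x (m)
    rng.uniform(0, 20_000, n_cells),   # cell centre y (m)
    rng.poisson(8, n_cells),           # tree density (trees per cell)
    rng.normal(12, 3, n_cells),        # mean height (m)
    rng.uniform(0, 1, n_cells),        # canopy cover fraction
])

# Standardise, reduce to 3 principal components, then cluster into fixed regions.
reduced = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(features))
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(reduced)
print(len(np.unique(labels)), "regions")
```

Because K-Means now runs on a few thousand cells in 3 dimensions instead of a million points, the batchSize × numberOfClusters × numberOfFeatures cost shrinks by orders of magnitude.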
Summary
❌ Raw MiniBatchKMeans on dense spatial tree data is slow and poorly suited
✅ Spatial reduction + density-based clustering is fast, scalable, and meaningful