Conversation
Thank you for your contribution. Can you link to an implementation that uses this, or a paper that describes the approach and its advantages and disadvantages? I'm not familiar with it, and the advantages aren't clear to me at present. Understanding the motivation behind this change will make it easier to review.
The scikit-learn implementation does this, referencing the original k-means++ paper (specifically a remark in its conclusion):
https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf
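For context (this is not the PR's code), the remark in question describes "greedy" k-means++ seeding: for each new centre, draw `n_trials` candidate points with probability proportional to D² and keep the one that most reduces the total potential. A minimal self-contained sketch, using plain `Vec<f64>` points and a tiny hand-rolled LCG in place of the crate's `ndarray` types and a real RNG:

```rust
// Sketch of greedy k-means++ seeding (illustrative, not the crate's code).

fn dist2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Minimal deterministic LCG (Knuth's MMIX constants), only so the
/// example has no external dependencies.
struct Lcg(u64);
impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn greedy_plusplus(data: &[Vec<f64>], k: usize, n_trials: usize, rng: &mut Lcg) -> Vec<Vec<f64>> {
    assert!(k >= 1 && n_trials >= 1 && !data.is_empty());
    // First centre: a uniformly random point.
    let first = (rng.next_f64() * data.len() as f64) as usize % data.len();
    let mut centers = vec![data[first].clone()];
    // d2[i] = squared distance from point i to its nearest chosen centre.
    let mut d2: Vec<f64> = data.iter().map(|p| dist2(p, &centers[0])).collect();

    while centers.len() < k {
        let total: f64 = d2.iter().sum();
        // (candidate index, its potential, its updated d2 vector)
        let mut best: Option<(usize, f64, Vec<f64>)> = None;
        for _ in 0..n_trials {
            // Sample one candidate index with probability proportional to d2.
            let mut r = rng.next_f64() * total;
            let mut idx = data.len() - 1; // fallback for float round-off
            for (i, &d) in d2.iter().enumerate() {
                r -= d;
                if r <= 0.0 {
                    idx = i;
                    break;
                }
            }
            // Potential (sum of D^2) if this candidate became a centre.
            let cand_d2: Vec<f64> = data
                .iter()
                .zip(&d2)
                .map(|(p, &d)| d.min(dist2(p, &data[idx])))
                .collect();
            let pot: f64 = cand_d2.iter().sum();
            if best.as_ref().map_or(true, |t| pot < t.1) {
                best = Some((idx, pot, cand_d2));
            }
        }
        let (idx, _, new_d2) = best.expect("n_trials >= 1");
        centers.push(data[idx].clone());
        d2 = new_d2;
    }
    centers
}
```

Each extra candidate costs one O(n·d) pass over the data, but the seeding potential tends to improve; if I recall correctly, scikit-learn exposes the same knob as `n_local_trials` in its k-means++ initialisation.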
```rust
) -> Array2<V> {
    assert!(k > 1);
    assert!(data.dim().0 > 0);
    let n_trials = n_trials.unwrap_or(2 + (k as f64).ln().floor() as usize);
```
I have an issue with this default value: it doesn't preserve the original behavior of this implementation.
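For reference, the quoted default evaluates as follows (a standalone sketch of just the formula from the diff, nothing else from the crate). Because the code asserts `k > 1`, `ln(k) >= ln 2 > 0`, so the default is always at least 2: a caller passing `None` gets at least two candidates per centre and never the original single-candidate draw.

```rust
// The default n_trials formula from the quoted diff, in isolation.
fn default_n_trials(k: usize) -> usize {
    2 + (k as f64).ln().floor() as usize
}

// default_n_trials(2) == 2, default_n_trials(10) == 4, default_n_trials(100) == 6
```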
```rust
pub fn kmeans_lloyd<V: Value>(
    data: &ArrayView2<V>,
    k: usize,
    n_trials: Option<usize>,
```
This changes the behavior and structure of the algorithm enough that I think it's better kept as a separate implementation: create a new `kmeans_lloyd_with_n_trials` function that takes the `n_trials` parameter.
As long as the original behavior is preserved when this function is called with `n_trials` as `None`, I'm happy to drop this one. It might still be worth splitting the implementation of `initialize_plusplus` for the two different cases, though.
This is apparently what a lot of implementations actually use.
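A minimal way to express the behaviour-preservation constraint discussed above (a hypothetical helper, not the crate's code; the real split would live in `initialize_plusplus` and the proposed `kmeans_lloyd_with_n_trials`): map `n_trials: Option<usize>` to a seeding mode so that `None` keeps the original single-candidate draw.

```rust
// Hypothetical sketch: dispatch on n_trials so that the existing entry
// point (called with None) keeps the original single-candidate k-means++
// draw, while the proposed kmeans_lloyd_with_n_trials passes Some(n).
#[derive(Debug, PartialEq)]
enum Seeding {
    /// Original behaviour: one D^2-weighted candidate per new centre.
    Single,
    /// Greedy variant: best of `n_trials` D^2-weighted candidates.
    Greedy { n_trials: usize },
}

fn seeding_mode(n_trials: Option<usize>) -> Seeding {
    match n_trials {
        None => Seeding::Single, // preserves the pre-PR behaviour
        Some(n) => Seeding::Greedy { n_trials: n.max(1) },
    }
}
```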