implement n-trial greedy heuristic#4

Open
nlhepler wants to merge 2 commits into genbattle:master from nlhepler:master

Conversation

@nlhepler
Contributor

This is apparently what a lot of implementations actually use.

@genbattle
Owner

genbattle commented Sep 21, 2019

This is apparently what a lot of implementations actually use.

Thank you for your contribution.

Can you link to an implementation that uses this, or a paper that describes this approach and the advantages/disadvantages? I'm not familiar with it, and the advantages are not clear to me at present. If I can understand the motive behind this change it will make it easier to review.

@nlhepler
Contributor Author

The scikit-learn implementation does this, referencing the original k-means++ paper, notably a remark in the conclusion:

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible

(https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf)
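
For readers unfamiliar with the variant, the remark quoted above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the code in this PR and not scikit-learn's implementation. The names `greedy_plusplus`, `sq_dist`, and `XorShift` are hypothetical and introduced only for this example; points are plain `Vec<f64>` rows rather than `ndarray` arrays, and the chosen centers are returned as row indices.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// (Assumption for illustration only; real code would likely use `rand`.)
/// The seed must be nonzero.
struct XorShift(u64);

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let x = self.0.wrapping_mul(0x2545F4914F6CDD1D);
        // Map the top 53 bits into [0, 1).
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Squared Euclidean distance between two points.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedy k-means++ seeding (hypothetical helper). For each new center it
/// samples `n_trials` candidates with probability proportional to the squared
/// distance to the nearest chosen center, then keeps the candidate that gives
/// the lowest potential φ (sum of squared distances to the nearest center).
fn greedy_plusplus(
    data: &[Vec<f64>],
    k: usize,
    n_trials: usize,
    rng: &mut XorShift,
) -> Vec<usize> {
    assert!(k >= 1 && k <= data.len() && n_trials >= 1);
    let n = data.len();

    // The first center is drawn uniformly at random.
    let first = ((rng.next_f64() * n as f64) as usize).min(n - 1);
    let mut centers = vec![first];
    // closest[i]: squared distance from point i to its nearest chosen center.
    let mut closest: Vec<f64> = data.iter().map(|p| sq_dist(p, &data[first])).collect();

    while centers.len() < k {
        let total: f64 = closest.iter().sum();
        let mut best: Option<(usize, f64, Vec<f64>)> = None;

        for _ in 0..n_trials {
            // Sample one candidate index proportionally to `closest`.
            let target = rng.next_f64() * total;
            let mut acc = 0.0;
            let mut cand = n - 1;
            for (i, d) in closest.iter().enumerate() {
                acc += d;
                if acc > target {
                    cand = i;
                    break;
                }
            }
            // Potential φ if this candidate were added as a center.
            let updated: Vec<f64> = data
                .iter()
                .zip(&closest)
                .map(|(p, &d)| d.min(sq_dist(p, &data[cand])))
                .collect();
            let phi: f64 = updated.iter().sum();
            if best.as_ref().map_or(true, |(_, b, _)| phi < *b) {
                best = Some((cand, phi, updated));
            }
        }

        let (cand, _, updated) = best.expect("n_trials >= 1 guarantees a candidate");
        centers.push(cand);
        closest = updated;
    }
    centers
}
```

With `n_trials` set to 1 there is only one candidate per step, so the greedy choice is trivial and the procedure reduces to ordinary k-means++ sampling.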

) -> Array2<V> {
    assert!(k > 1);
    assert!(data.dim().0 > 0);
    let n_trials = n_trials.unwrap_or(2 + (k as f64).ln().floor() as usize);
Owner

genbattle commented Sep 29, 2019

I have issues with the default value as it doesn't preserve the original behavior of this algorithm implementation.
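
To make the concern concrete, the sketch below evaluates the default from the diff for a few values of `k` and contrasts it with a fallback of one trial. Both function names are hypothetical and exist only for this illustration. The assumption is that a single trial corresponds to the classic one-candidate k-means++ step; whether that reproduces this crate's previous output exactly depends on details of the implementation that are not shown here.

```rust
/// Default used in the diff under review: 2 + floor(ln k).
/// (Hypothetical helper name; the expression itself is from the PR.)
fn pr_default_trials(k: usize) -> usize {
    2 + (k as f64).ln().floor() as usize
}

/// Alternative fallback (an assumption, not part of the PR): treat `None`
/// as a single trial, i.e. one candidate per new center.
fn resolve_n_trials(n_trials: Option<usize>) -> usize {
    n_trials.unwrap_or(1)
}
```

Because `ln k` is positive for every `k > 1`, the PR's default is always at least 2, so a caller passing `None` never gets the single-candidate behavior.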

pub fn kmeans_lloyd<V: Value>(
    data: &ArrayView2<V>,
    k: usize,
    n_trials: Option<usize>,
Owner

This changes the behavior and structure of the algorithm sufficiently that I think it's better to have a separate implementation. Create a new kmeans_lloyd_with_n_trials function that takes the n_trials parameter.

As long as the original behavior is preserved when this function is called with n_trials set to None, I'm happy to drop this one. It might still be worth splitting the implementation of initialize_plusplus for the two different cases though.
