
Additional/new approach to estimating class overlap #8

@lindsaydbrin

Description

I'd like to add a new approach to estimating class overlap, but have questions about how best to implement this.

Currently, per-sample class overlap/probabilities are calculated (via compute_expectation_with_monte_carlo()) by:

  1. Tallying class count for the kNNs (into an array)
  2. Dividing the entire array by the Parzen-window volume

This is then added to expectation (to build the S-matrix), which is eventually normalized by total mass of the row.

This approach considers the total area covered by the kNNs (via Parzen-window normalization), but not the distance/similarity to each individual neighbor. One consequence (desired or not) is that the farthest neighbor can have a disproportionate impact on the measured overlap with another class. E.g., compare two cases where a sample in class A has all of its neighbors in class B: in (1) all neighbors are at a moderate distance, whereas in (2) all neighbors but one are very close, and the last is very far. If I'm understanding correctly (and correct me if I'm wrong), (1) would appear to have greater overlap because of its smaller Parzen window, whereas one might prefer (2) to be measured as having greater overlap, since most of its neighbors are much closer and in theory it should be harder to determine a class boundary there.
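To make the farthest-neighbor effect concrete, here's a toy numeric sketch (not the project's code; `parzen_style_overlap`, `similarity_sum_overlap`, and the 1/(1+d) similarity are all stand-ins I made up) comparing the two cases:

```python
import numpy as np

def parzen_style_overlap(dists):
    """Count-based overlap divided by a crude window 'volume' set by the
    farthest neighbor (1-D stand-in for the real volume computation)."""
    counts = len(dists)        # all k neighbors are in class B here
    volume = dists.max()       # window radius grows with the farthest neighbor
    return counts / volume

def similarity_sum_overlap(dists):
    """Proposed style: sum a similarity (here 1/(1+d)) over the neighbors."""
    return np.sum(1.0 / (1.0 + dists))

case1 = np.array([0.5, 0.5, 0.5, 0.5, 0.5])  # all neighbors at moderate distance
case2 = np.array([0.1, 0.1, 0.1, 0.1, 2.0])  # four very close, one very far

# Case (1) scores higher under the Parzen-style scheme (the lone far
# neighbor inflates case (2)'s window), while case (2) scores higher under
# the similarity-sum scheme (the four close neighbors dominate).
print(parzen_style_overlap(case1), parzen_style_overlap(case2))
print(similarity_sum_overlap(case1), similarity_sum_overlap(case2))
```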

For comparison, I'd like to try to calculate per-sample class overlap/probabilities by the following:

  3. Summing the kNNs' cosine similarities per class (into an array) (i.e., summing similarities rather than tallying counts)

This would then be added to something similar to expectation, but normalized by dividing by the sample count rather than by row mass, so that the relative similarity information is maintained and values can be compared across rows.
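To illustrate the difference between the two normalizations, a minimal sketch (the row values and class size are made-up stand-ins for one row of the S-matrix):

```python
import numpy as np

expectation_row = np.array([0.3, 2.4, 0.1])  # hypothetical accumulated row
n_samples_in_class = 4                       # hypothetical class size

# Current scheme: normalize by row mass, so each row sums to 1
# (relative magnitudes across rows are lost).
row_mass_norm = expectation_row / expectation_row.sum()

# Proposed scheme: divide by the class sample count, preserving the
# relative similarity magnitudes so rows remain comparable.
count_norm = expectation_row / n_samples_in_class

print(row_mass_norm)  # always sums to 1
print(count_norm)     # magnitude carries information
```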

I.e., the original code does this:

            probability = np.array(
                # kNN class proportions
                [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
            )
            # Get the Parzen-window probability
            probability_norm = probability / get_volume(data[indices_k])  # Parzen-window normalized

where probability is 1. above, and probability_norm is 2. above. This is then added to expectation, which is later normalized as expectation[class_ix] /= expectation[class_ix].sum().
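For reference, here is that computation in self-contained form for a single sample (the neighbor classes and distances are made-up stand-ins, and get_volume() is replaced by a crude farthest-neighbor stand-in, since the real function computes an actual volume):

```python
import numpy as np

n_class, k_nearest = 3, 5
target_k = np.array([1, 1, 2, 1, 0])               # classes of the k nearest neighbors
nghbr_dists = np.array([0.2, 0.3, 0.4, 0.5, 0.6])  # distances to those neighbors

# Step 1: kNN class proportions (tally per class, divide by k)
probability = np.array(
    [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
)

# Step 2: Parzen-window normalization; the window "volume" here is stood in
# for by the farthest-neighbor distance
volume = nghbr_dists.max()
probability_norm = probability / volume

print(probability)  # [0.2 0.6 0.2]
```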

My suggestion is to do this:

            sim_scores = np.array([similarities_k[target_k == nghbr_class].sum() for nghbr_class
                                  in range(n_class)])

where sim_scores is 3. above. This would then be added to expectation, which would be normalized as expectation[class_ix] /= len(class_indices[class_ix]).
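And the proposed variant in the same self-contained form (again with made-up stand-in data for target_k and similarities_k, which would really come from the kNN search):

```python
import numpy as np

n_class = 3
target_k = np.array([1, 1, 2, 1, 0])                  # neighbor classes
similarities_k = np.array([0.9, 0.8, 0.1, 0.7, 0.3])  # cosine similarities to neighbors

# Step 3: per-class sum of similarities instead of a per-class tally,
# so closer (more similar) neighbors contribute more weight
sim_scores = np.array(
    [similarities_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
)

print(sim_scores)  # class 1 dominates: 0.9 + 0.8 + 0.7 = 2.4
```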

Apologies for any confusion from variable names that come from my current PR rather than the original code! I have this working (seemingly correctly) within the function, although it breaks downstream code as described below.

I have a couple of implementation questions, basically:

  • Should this be an optional alternate calculation method?
  • If so, should this replace or add to the results already being passed by compute_expectation_with_monte_carlo()?
  • How should this be implemented in conjunction with the distance metric? I'm thinking about this in terms of cosine similarity rather than Euclidean distance (although I suppose it could work with distance, just with an inverted interpretation of high/low values...?)

Some thoughts:

  • With my current PR, the function returns, for each sample (by sample index), 1. and 2. above. With the new approach, you'd return 3., but I think you'd still want to return 1. (kNN counts - why not), and I don't think it matters much whether 2. is returned. So 3. could replace 2., or it could simply be another array that gets returned (i.e., within the SimilarityArrays dataclass).
  • You'd want to return a new S-matrix, but I'm pretty sure that simply replacing the current one would break downstream code related to the actual spectral clustering metric. So it might make sense to return it in addition rather than as a replacement.
  • I'm not sure, but I suspect that with larger datasets, summing the similarities by class would increase computation time annoyingly. So this should probably be optional.
  • If you don't need to check/change the distance metric, this fits into the logic of compute_expectation_with_monte_carlo() really easily. If you want it to only be used with cosine similarity, I'm not sure whether you'd want to throw an error when the wrong distance metric is selected, silently ignore it (which seems like a bad idea), or do something else?
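On that last point, one possible shape for the metric check (a hedged sketch; the `metric` and `use_similarity_sum` parameter names are hypothetical, not from the codebase) would be to fail fast rather than silently ignore a mismatch:

```python
def check_metric(metric: str, use_similarity_sum: bool) -> None:
    # Hypothetical guard: the similarity-sum overlap assumes cosine
    # similarity, so raise rather than silently ignoring an incompatible
    # metric choice.
    if use_similarity_sum and metric != "cosine":
        raise ValueError(
            f"similarity-sum overlap currently assumes metric='cosine', got {metric!r}"
        )

check_metric("cosine", True)      # fine
check_metric("euclidean", False)  # fine: the optional feature is off
```

Raising keeps the behavior explicit, which seems preferable to a silent fallback; emitting a warning would be a middle ground.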
