group shuffle helpers #52

@idavydov

Hi @julianesiebourg, this relates to your group shuffle use case and #51.

I wanted to capture the directions in which we could improve the library to accommodate this case.

  1. Do you think this is a common problem? Intuitively, keeping replicates distributed across batches is usually the better strategy.
  2. We could introduce a simple scoring function that penalizes separated samples and combine it with strict improvement. The downside is that this would make shuffling very difficult. For example, to move samples 1 and 2 from batch X to batch Y, a) we would need to shuffle at least two samples in one iteration (n_shuffle >= 2), b) samples 1 and 2 would have to be chosen together at random (quite unlikely), and c) both would have to land in the same destination batch (probability 1/n_batches). So shuffling would probably be very slow.
  3. Some kind of shuffle-with-constraints procedure could be a solution. We could try to generalize the example I shared with you, basically by letting the user specify which column defines the sample group. The difficulty is that we could run into a pathological configuration from which there is no way out without breaking the constraint.
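
To make options 2 and 3 concrete, here is a minimal Python sketch, not code from the library. The names `separation_penalty` and `swap_groups` are hypothetical, and the sketch assumes that the starting layout already keeps every group inside one batch. The first function is one possible penalty for separated samples; the second moves whole groups in a single step, so the constraint is never broken along the way.

```python
import random
from collections import defaultdict


def separation_penalty(batches, groups):
    """Count the extra batches each group is spread across (0 = no group is split)."""
    seen = defaultdict(set)
    for batch, group in zip(batches, groups):
        seen[group].add(batch)
    return sum(len(b) - 1 for b in seen.values())


def swap_groups(batches, groups, rng):
    """Swap two equally sized groups that sit in different batches, as one move.

    Assumes every group currently sits in a single batch.
    Returns a new assignment, or an unchanged copy if no valid pair exists.
    """
    members = defaultdict(list)
    for i, group in enumerate(groups):
        members[group].append(i)

    keys = list(members)
    # Candidate pairs: same size (keeps batch sizes fixed), different batches.
    pairs = [
        (a, b)
        for k, a in enumerate(keys)
        for b in keys[k + 1:]
        if len(members[a]) == len(members[b])
        and batches[members[a][0]] != batches[members[b][0]]
    ]
    new = list(batches)
    if not pairs:
        return new  # stuck: no move is possible without breaking the constraint

    a, b = rng.choice(pairs)
    batch_a, batch_b = batches[members[a][0]], batches[members[b][0]]
    for i in members[a]:
        new[i] = batch_b
    for i in members[b]:
        new[i] = batch_a
    return new


batches = ["X", "X", "Y", "Y", "Z", "Z"]
groups = [1, 1, 2, 2, 3, 3]
shuffled = swap_groups(batches, groups, random.Random(0))
```

Restricting swaps to equally sized groups keeps batch sizes constant, but it is also one way to end up stuck: if group sizes differ, there may be no valid pair at all, which is a simple instance of the pathological configuration mentioned in point 3.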

At this stage I think we should just capture this. If you already have an idea of how frequent this is and what other types of group shuffle we might need, that would be great to hear.
