renaming grouping functions? #51

@julianesiebourg

We have some functions whose names suggest that they shuffle "grouped" data, but I find them a bit confusing:

designit::shuffle_grouped_data(batch_container, allocate_var, keep_together_vars = c(), keep_separate_vars = c(),
                               n_min = NA, n_max = NA, n_ideal = NA, subgroup_var_name = NULL,
                               report_grouping_as_attribute = FALSE, prefer_big_groups = FALSE,
                               strict = TRUE, fullTree = FALSE, maxCalls = 1e+06)

designit::mk_subgroup_shuffling_function(subgroup_vars, restrain_on_subgroup_levels = c(), n_swaps = 1)

designit::shuffle_with_subgroup_formation(subgroup_object, subgroup_allocations, keep_separate_vars = c(),
                                          report_grouping_as_attribute = FALSE)

I guess they come from the invivo example and are tailored to that.

Maybe I didn't fully understand them, but none of them worked for a simple grouping problem I had:

  • I had ~50 patients, 25 with one measurement and 25 with two.
  • I wanted to put them in batches such that samples of the same patient were put in the same batch.

I expected a function named "shuffle_grouped_data" to do this, given the names of the variables that form the groups (in my case the Patient ID).
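To make the requirement concrete, here is a minimal base R sketch of the constraint on toy data. It does not use designit and is not how the package works: it simply places each patient's samples as a block into the first batch with enough room (largest groups first). The column names, the 5 batches of 15 slots, and the first-fit strategy are all assumptions made for illustration.

```r
# Toy data matching the description: 50 patients, 25 with one and 25 with two
# measurements (75 samples in total). All names here are hypothetical.
set.seed(1)
samples <- data.frame(
  patient_id = rep(sprintf("P%02d", 1:50), times = rep(c(1, 2), each = 25))
)
n_batches <- 5    # assumed number of batches
batch_size <- 15  # assumed slots per batch

# Number of samples per patient, as a named integer vector.
group_sizes <- lengths(split(samples$patient_id, samples$patient_id))

# Shuffle the patients, then handle the larger groups first so that they are
# placed while the batches still have contiguous room.
patients <- sample(names(group_sizes))
patients <- patients[order(-group_sizes[patients])]

free <- rep(batch_size, n_batches)
batch_of_patient <- integer(0)
for (p in patients) {
  size <- group_sizes[[p]]
  b <- which(free >= size)[1]  # first batch with enough free slots
  stopifnot(!is.na(b))         # fails if no batch can hold the whole group
  free[b] <- free[b] - size
  batch_of_patient[p] <- b
}
samples$batch <- unname(batch_of_patient[samples$patient_id])

# The constraint: every patient appears in exactly one batch.
batches_per_patient <- tapply(samples$batch, samples$patient_id,
                              function(x) length(unique(x)))
all(batches_per_patient == 1)
```

This only shows what "keeping a patient together" means as a check on the final assignment; it says nothing about balancing other variables across batches, which is what the optimization in the package is for.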

Iakov helped with a solution for that particular case, which, a bit more generalized, could be part of the package:

# not parametrized...
library(dplyr)  # mutate(), filter(), sample_n(), pull(), group_by(), ...
keep_groups_together <- function(bc, i) {
  d <- bc$get_samples(include_id = TRUE) |>
    mutate(location_id = row_number())
  # select random src location
  src_id <- d |>
    # exclude empty locations
    filter(!is.na(.sample_id)) |>
    sample_n(1) |>
    pull(location_id)
  stopifnot(length(src_id) == 1)

  # find all samples with the same `Subject ID` as the selected one
  all_src_id <- d |>
    filter(
      # exclude empty locations
      !is.na(.sample_id),
      # we are searching for matching samples
      `Subject ID` == d$`Subject ID`[src_id]
    ) |>
    pull(location_id)

  dst_id <- d |>
    filter(
      # we don't want source locations
      !location_id %in% all_src_id
    ) |>
    group_by(`Subject ID`) |>
    # we only choose empty or location of "lonely" samples
    filter(is.na(.sample_id) | n() == 1) |>
    # find suitable Run with enough space
    group_by(Run) |>
    filter(n_distinct(location_id) >= length(all_src_id)) |>
    ungroup() |>
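    # choose destination Run; index into unique(Run) because sample(x, 1)
    # draws from 1:x when x is a single number
    filter(Run == unique(Run)[sample.int(n_distinct(Run), 1)]) |>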
    sample_n(length(all_src_id)) |>
    pull(location_id)
  list(
    src = c(all_src_id, dst_id),
    dst = c(dst_id, all_src_id)
  )
}
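As a starting point for the "more generalized" version, here is a rough sketch of how the function above could be parametrized. The name mk_keep_groups_together and its arguments group_var and batch_var are hypothetical and not part of designit; the only things it assumes about the batch container are bc$get_samples(include_id = TRUE) and the .sample_id column already used above. It stops with an error when no batch has room, which is also just an assumption about the desired behaviour.

```r
library(dplyr)

# Hypothetical generator (not part of designit): returns a shuffle function
# with the same (bc, i) signature as keep_groups_together() above.
# `group_var` and `batch_var` are column names in bc$get_samples().
mk_keep_groups_together <- function(group_var, batch_var) {
  function(bc, i) {
    d <- bc$get_samples(include_id = TRUE) |>
      mutate(location_id = row_number())
    occupied <- filter(d, !is.na(.sample_id))

    # pick one random sample and collect every location holding its group
    src_group <- occupied[[group_var]][sample.int(nrow(occupied), 1)]
    all_src_id <- occupied$location_id[which(occupied[[group_var]] == src_group)]

    candidates <- d |>
      filter(!location_id %in% all_src_id) |>
      group_by(.data[[group_var]]) |>
      # only empty locations or "lonely" samples may be displaced
      filter(is.na(.sample_id) | n() == 1) |>
      # keep only batches with enough candidate locations for the whole group
      group_by(.data[[batch_var]]) |>
      filter(n() >= length(all_src_id)) |>
      ungroup()
    if (nrow(candidates) == 0) {
      stop("no batch has enough empty or swappable locations")
    }

    # choose one destination batch, then the locations inside it
    batches <- unique(candidates[[batch_var]])
    chosen <- batches[sample.int(length(batches), 1)]
    pool <- candidates$location_id[which(candidates[[batch_var]] == chosen)]
    dst_id <- pool[sample.int(length(pool), length(all_src_id))]

    list(src = c(all_src_id, dst_id), dst = c(dst_id, all_src_id))
  }
}
```

For my case this would be created as mk_keep_groups_together("Subject ID", "Run"). My assumption is that the result could then be used wherever keep_groups_together is used now, but I have not tested it against a real batch container.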

That said, I wonder whether we should discuss the naming of all these functions so that it's clearer what they do.

@ingitwetrust and @idavydov what are your thoughts?

Labels: help wanted, question
