Description
We have several functions whose names suggest that they shuffle "grouped" data, but I find them a bit confusing:
designit::shuffle_grouped_data(batch_container, allocate_var, keep_together_vars = c(), keep_separate_vars = c(), n_min = NA,
n_max = NA, n_ideal = NA, subgroup_var_name = NULL, report_grouping_as_attribute = FALSE,
prefer_big_groups = FALSE, strict = TRUE, fullTree = FALSE, maxCalls = 1e+06)
designit::mk_subgroup_shuffling_function(subgroup_vars, restrain_on_subgroup_levels = c(), n_swaps = 1)
designit::shuffle_with_subgroup_formation(subgroup_object, subgroup_allocations, keep_separate_vars = c(),
report_grouping_as_attribute = FALSE)
I guess they come from the invivo example and are tailored to that.
Maybe I didn't fully get them, but for a simple grouping problem I had, none of them were working.
- I had ~50 patients, 25 with one and 25 with two measurements.
- I wanted to put them in batches such that samples of the same patient were put in the same batch.
I thought a function "shuffle_grouped_data" would do this, given the variable names that form the groups (in my case Patient ID).
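To make the setup concrete, here is a minimal base-R sketch of the problem, not of any designit function. The data frame, the column names (`patient_id`, `batch_naive`, `batch_grouped`), the number of batches, and the helper `patients_kept_together()` are all made up for illustration. It contrasts a sample-level shuffle with a group-level one that draws a batch per patient.

```r
# Toy data mirroring the setup above: 50 patients, 25 with one
# measurement and 25 with two (75 samples in total). All names are hypothetical.
set.seed(1)
samples <- data.frame(
  patient_id = c(sprintf("P%02d", 1:25), rep(sprintf("P%02d", 26:50), each = 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if every patient occupies exactly one batch.
patients_kept_together <- function(patient_id, batch) {
  batches_per_patient <- tapply(batch, patient_id, function(b) length(unique(b)))
  all(batches_per_patient == 1)
}

n_batches <- 3  # assumed number of batches

# Sample-level shuffle: each sample gets a batch independently of its patient,
# so patients with two measurements can end up split across batches.
samples$batch_naive <- sample(rep_len(seq_len(n_batches), nrow(samples)))

# Group-level shuffle: draw one batch per patient, then copy that batch
# to all samples of the patient.
patients <- unique(samples$patient_id)
batch_of_patient <- setNames(
  sample(rep_len(seq_len(n_batches), length(patients))),
  patients
)
samples$batch_grouped <- unname(batch_of_patient[samples$patient_id])

patients_kept_together(samples$patient_id, samples$batch_naive)
patients_kept_together(samples$patient_id, samples$batch_grouped)  # TRUE by construction
```

Note that the group-level draw balances the number of patients per batch, not the number of samples, so batch sizes can differ; a real solution would also have to respect batch capacity.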
Iakov helped with a solution for that particular case, which, a bit more generalized, could be part of the package:
library(dplyr)

# not parametrized: the column names `Subject ID` and `Run` are hard-coded
# `i` (the iteration number) is not used here
keep_groups_together <- function(bc, i) {
  d <- bc$get_samples(include_id = TRUE) |>
    mutate(location_id = row_number())

  # select random src location
  src_id <- d |>
    # exclude empty locations
    filter(!is.na(.sample_id)) |>
    sample_n(1) |>
    pull(location_id)
  stopifnot(length(src_id) == 1)

  # find all samples with matching `Subject ID`
  all_src_id <- d |>
    filter(
      # exclude empty locations
      !is.na(.sample_id),
      # we are searching for matching samples
      `Subject ID` == d$`Subject ID`[src_id]
    ) |>
    pull(location_id)

  dst_id <- d |>
    filter(
      # we don't want source locations
      !location_id %in% all_src_id
    ) |>
    group_by(`Subject ID`) |>
    # we only choose empty or location of "lonely" samples
    filter(is.na(.sample_id) | n() == 1) |>
    # find suitable Run with enough space
    group_by(Run) |>
    filter(n_distinct(location_id) >= length(all_src_id)) |>
    ungroup() |>
    # choose destination Run
    filter(Run == sample(unique(Run), 1)) |>
    sample_n(length(all_src_id)) |>
    pull(location_id)

  # swap: source samples go to the destination locations and vice versa
  list(
    src = c(all_src_id, dst_id),
    dst = c(dst_id, all_src_id)
  )
}
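The returned `src`/`dst` pair describes a swap. The following base-R sketch shows the effect on a plain vector of locations. It assumes that the sample at each `src` location is moved to the matching `dst` location; `apply_proposal()` and the toy vector are hypothetical and are not part of designit.

```r
# Hypothetical helper, assuming samples at `src` are moved to the matching `dst`.
apply_proposal <- function(assignment, proposal) {
  out <- assignment
  out[proposal$dst] <- assignment[proposal$src]
  out
}

# Six locations; NA marks an empty location. Subject "A" sits at locations 1 and 2.
assignment <- c("A", "A", "B", NA, "C", NA)
all_src_id <- c(1, 2)  # every location holding subject "A"
dst_id <- c(4, 6)      # two free locations in the chosen destination

proposal <- list(
  src = c(all_src_id, dst_id),
  dst = c(dst_id, all_src_id)
)

apply_proposal(assignment, proposal)
# → NA NA "B" "A" "C" "A"
```

Because both directions are listed, whatever occupied the destination locations (empty slots or "lonely" samples) ends up at the source locations, so no sample is lost.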
But then I wonder: should we discuss the naming of all these functions so that it's clearer what they do?
@ingitwetrust and @idavydov what are your thoughts?