Frequent Itemset Clustering (Apriori and ECLAT) #210
Wander03 wants to merge 72 commits into tidymodels:main from
add predict to vignette
EmilHvitfeldt left a comment
Two more things:
- Add the following to `utils::globalVariables()` in `aaa.R`: `.pred_item item preds row_id setNames truth_value`
- Add exported functions to `_pkgdown.yml`
I think I would like to chat about these prediction types in #211 before going through with this PR.
R/extract_predictions.R
Outdated
    #' @return A data frame with items as columns and non-NA values as rows.
    #' @export
    extract_predictions <- function(pred_output) {
The idea here is essentially that we wanted to respect the tidyclust predict() output structure; namely, a one-column tibble. But the output of predictions in column-based clustering like association rules is not cluster assignments, but matrix completion.
What we arrived at was to return a list-col, where each element of the column represents the matrix completion result for that row of the test data.
However, in most use cases, the user wouldn't really need this list-col and would instead want the completed matrix. So, extract_predictions() was created to take the tidyclust output object and reconfigure it as the data matrix with predicted completions inserted.
We definitely have no issue with renaming it. But I believe a helper function like this is much needed for methods of this structure, unless we choose to expand the allowed structures that predict() itself returns.
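To make the list-col idea concrete, here is a minimal base R sketch of turning per-row prediction results into a completed matrix. It is not the PR's implementation of extract_predictions(): the helper name `complete_matrix_sketch()`, the stand-in object `list_col`, and the element columns `item` and `.pred_item` are all assumptions made for illustration.

```r
# Stand-in for the list-column returned by predict(): one data frame per
# test row. The column names `item` and `.pred_item` are assumed here.
list_col <- list(
  data.frame(item = c("beer", "milk", "eggs"), .pred_item = c(1, 0.75, 0)),
  data.frame(item = c("beer", "milk", "eggs"), .pred_item = c(0.5, 1, 0))
)

# Hypothetical helper: pivot each element to one wide row, then stack rows.
complete_matrix_sketch <- function(list_col) {
  rows <- lapply(list_col, function(p) {
    as.data.frame(stats::setNames(as.list(p$.pred_item), p$item))
  })
  do.call(rbind, rows)
}

completed <- complete_matrix_sketch(list_col)
completed
```

The result has one column per item and one row per test observation, which is the shape most users would expect from a matrix-completion method.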
R/extract_fit_summary.R
Outdated
    #' @export
    extract_fit_summary.itemsets <- function(object, ...,
                                             call = rlang::caller_env(n = 0)) {
      rlang::abort(
Please convert all rlang::abort() calls to use {cli}; see 13f30dd for inspiration, or tag me if you need help.
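As a rough illustration of the requested conversion, the sketch below replaces rlang::abort() with cli::cli_abort(). The function name and the message wording are placeholders, since the original error text is not shown in this thread; only the cli::cli_abort() call and its inline markup are real API.

```r
# Hypothetical stand-in for the method under review; the message text is a
# placeholder, not the wording used in the PR.
extract_fit_summary_sketch <- function(object, ...,
                                       call = rlang::caller_env(n = 0)) {
  cli::cli_abort(
    c(
      "Fit summaries are not available for {.cls itemsets} objects.",
      "i" = "Placeholder hint: point the user to a supported extractor here."
    ),
    call = call
  )
}

# Capture the condition instead of stopping, to inspect what was raised.
err <- tryCatch(extract_fit_summary_sketch(NULL), error = identity)
class(err)
```

Passing `call` through keeps the error attributed to the user-facing function rather than to an internal helper.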
    toy_df <- data.frame(
      'beer' = c(F, T, T, T, F),
      'milk' = c(T, F, T, T, T),
      'bread' = c(T, T, F, T, T),
      'diapers' = c(T, T, T, T, T),
      'eggs' = c(F, T, F, F, F)
    )
Suggested change:
    - toy_df <- data.frame(
    -   'beer' = c(F, T, T, T, F),
    -   'milk' = c(T, F, T, T, T),
    -   'bread' = c(T, T, F, T, T),
    -   'diapers' = c(T, T, T, T, T),
    -   'eggs' = c(F, T, F, F, F)
    - )
    + toy_df <- data.frame(
    +   "beer" = c(FALSE, TRUE, TRUE, TRUE, FALSE),
    +   "milk" = c(TRUE, FALSE, TRUE, TRUE, TRUE),
    +   "bread" = c(TRUE, TRUE, FALSE, TRUE, TRUE),
    +   "diapers" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
    +   "eggs" = c(FALSE, TRUE, FALSE, FALSE, FALSE)
    + )
This does two things: it uses " instead of ' for strings, and it spells out TRUE and FALSE in full.
This should be changed everywhere.
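One common reason for preferring the full names (the reviewer does not state a reason, so this is offered only as background) is that `T` and `F` are ordinary variables that can be reassigned, whereas `TRUE` and `FALSE` are reserved words. A small sketch:

```r
# `T` is just a variable bound to TRUE in base R, so it can be shadowed.
uses_short_alias <- function() {
  T <- FALSE   # legal, and silently changes what `T` means in this scope
  isTRUE(T)
}
uses_short_alias()

# `TRUE` is a reserved word: attempting to assign to it is an error.
reserved_is_protected <- inherits(
  try(eval(parse(text = "TRUE <- FALSE")), silent = TRUE),
  "try-error"
)
reserved_is_protected
```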
    #' @export
    augment_itemset_predict <- function(pred_output, truth_output) {
All exported functions need examples.
I would also like to see an example to help me understand how it is meant to be used.
    })

    test_that("extract_centroids errors for freq_itemsets", {
      set.seed(1234)
Please add skip_if_not_installed("arules") to all tests that use freq_itemsets().
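A minimal sketch of where the skip goes, assuming testthat is available. The expectation in the body is a placeholder, because the real test body from the PR is not shown in this thread.

```r
library(testthat)

test_that("freq_itemsets test is skipped when arules is missing", {
  # Skip (rather than fail) on machines where the suggested package
  # arules is not installed.
  skip_if_not_installed("arules")
  set.seed(1234)

  # Placeholder expectation standing in for the real freq_itemsets() checks.
  expect_true(requireNamespace("arules", quietly = TRUE))
})
```

Placing the skip on the first line of each test keeps the rest of the test body from running at all when the engine package is absent.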
R/extract_cluster_assignment.R
Outdated
    items <- attr(object, "item_names")
    itemsets <- arules::DATAFRAME(object)

    itemset_list <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
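For readers following along, the last line above strips the braces from each itemset label, splits on commas, and trims whitespace. The sketch below reproduces that step on a made-up character vector and swaps in base R's trimws() for stringr::str_trim(); that swap is an illustration only, not something requested in this thread, though it may be relevant to the dependency warning mentioned in the PR description.

```r
# Made-up stand-in for itemsets$items, which holds labels such as "{a,b}".
items_chr <- c("{beer,diapers}", "{milk, bread,diapers}", "{eggs}")

# Remove "{" and "}", split on commas, and trim whitespace with base trimws().
itemset_list <- lapply(strsplit(gsub("[{}]", "", items_chr), ","), trimws)
itemset_list
```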
R/predict_helpers.R
Outdated
    # Extract frequent itemsets and their supports
    items <- attr(object, "item_names")
    itemsets <- arules::DATAFRAME(object)
    frequent_itemsets <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
R/predict_helpers.R
Outdated
    # Create result data frame
    data.frame(
      item = stringr::str_remove_all(items, "`"), # Remove backticks from item names
R/extract_predictions.R
Outdated
    # Process each observation and combine results using reduce
    result_df <- data_frames %>%
      purrr::reduce(.f = ~ {
Please use the reduce() from compat-purrr.R.
R/extract_cluster_assignment.R
Outdated
    unique_non_zero_clusters <- unique(non_zero_clusters)

    # Map each unique non-zero cluster to a new cluster starting from Cluster_1
    cluster_map <- setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
Suggested change:
    - cluster_map <- setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
    + cluster_map <- stats::setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
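The qualifier matters in package code because setNames() lives in the stats package rather than base, so an unqualified, unimported call is typically flagged by R CMD check as having no visible global function definition. A small sketch of the mapping itself, using made-up cluster ids and assuming the prefix "Cluster_" from the code comment:

```r
# Made-up inputs; `prefix` follows the "Cluster_1" naming in the comment.
prefix <- "Cluster_"
non_zero_clusters <- c(7, 3, 7, 12, 3)
unique_non_zero_clusters <- unique(non_zero_clusters)

# Names are the original ids (as strings); values are the new labels.
cluster_map <- stats::setNames(
  paste0(prefix, seq_along(unique_non_zero_clusters)),
  unique_non_zero_clusters
)

# Relabel by looking up each original id by name.
relabelled <- unname(cluster_map[as.character(non_zero_clusters)])
relabelled
```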
.pred_item item preds row_id setNames truth_value to utils::globalVariables() in aaa.R
add example to `extract_itemset_predictions`
Hi Emil! I believe that I addressed all your comments, please let me know if I missed something or if there is something else I need to edit.
@kbodwin
Relates to other conversations about column-based clustering, e.g. Consider partition data reduction algorithm #66
- Adds a partition mode with engine arules to tidyclust (freq_itemsets)
- Adds custom cluster and predict functions for freq_itemsets()
- Adds extract_predictions(), which reformats predict() output into a more readable format
- Adds augment_itemset_predict(), which reformats predict() output for metric functions (e.g. in yardstick)

Note: devtools::check() resulted in a warning about code dependencies from purrr and stringr