Changes made as part of CCR tutorial paper revisions #27
Conversation
- rag: concise table outputs (labels + confidence), sorted by confidence; gate per-doc logs behind retriever_params; suppress device message unless progress is enabled
- bm25: use run_query in strict retry; add k1/b parameters; Unicode-aware tokenizer with override
- hf_auth: avoid huggingface_hub login() when a token is present (suppress note)
- demo: quieter (progress = FALSE); rely on concise tables
- build: ignore WARP.md in .Rbuildignore
- misc: guard undefined embedding extraction in rag()
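For the BM25 change, a Unicode-aware tokenizer with a caller override might look like the following sketch. The function and argument names here are illustrative assumptions, not the package's actual API; the key idea is matching runs of Unicode letters/digits instead of splitting on ASCII word boundaries.

```r
# Hypothetical sketch of a Unicode-aware tokenizer with an override hook;
# bm25_tokenize and its `tokenizer` argument are illustrative names only.
bm25_tokenize <- function(x, tokenizer = NULL) {
  # Caller-supplied override takes precedence
  if (!is.null(tokenizer)) return(tokenizer(x))
  # \p{L} / \p{N} match Unicode letters and digits, so accented and
  # non-Latin text is kept intact rather than split byte-by-byte
  tokens <- regmatches(x, gregexpr("[\\p{L}\\p{N}]+", x, perl = TRUE))[[1]]
  tolower(tokens)
}

bm25_tokenize("Hello, World 7!")
```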
Codex Review: Here are some suggestions.
```r
if (!is.null(annotations_df)) {
  # Ensure expected columns exist
  if (!"image_path" %in% names(annotations_df)) {
    # Try common alternatives
    alt <- intersect(c("filepath", "path", "image"), names(annotations_df))
    if (length(alt) > 0) {
      annotations_df$image_path <- annotations_df[[alt[1]]]
    }
  }
  if (!"index" %in% names(annotations_df)) {
    # Fallback to row index if no explicit id
    annotations_df$index <- seq_len(nrow(annotations_df))
  }
  if (!"emotion" %in% names(annotations_df)) {
    # Try to find an emotion/label column
    emo_col <- grep("emotion|label", names(annotations_df), value = TRUE, ignore.case = TRUE)
    if (length(emo_col) > 0) annotations_df$emotion <- annotations_df[[emo_col[1]]]
  }

  # Derive file name (guard against missing image_path)
  ipath <- annotations_df$image_path
  ipath[is.null(ipath)] <- ""
  ipath[is.na(ipath)] <- ""
  annotations_df$image_file <- basename(ipath)
```
[P1] Handle annotations without image_path before deriving filenames
The download helper unconditionally accesses annotations_df$image_path and subsets it (ipath <- annotations_df$image_path; ipath[is.null(ipath)] <- ""). When the annotations returned from load_findingemo_annotations() do not contain an image_path (the bundled Python loader only provides columns such as image_id, valence, arousal), annotations_df$image_path is NULL and the subsetting call throws object of type 'NULL' is not subsettable. This means a successful dataset download will still fail during post-processing on any schema that lacks an image_path column. Guard against the column being absent by creating a safe default (e.g. derive names from image_id or skip matching) before trying to index.
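A minimal sketch of the suggested guard is below. The helper name `derive_image_file` and the `.jpg` extension for the `image_id` fallback are assumptions for illustration; the column names (`image_path`, `image_id`) follow the review comment.

```r
# Hypothetical helper sketching the suggested fix: only index image_path
# when the column exists, otherwise fall back to image_id or skip matching.
derive_image_file <- function(annotations_df) {
  if (!is.null(annotations_df$image_path)) {
    ipath <- annotations_df$image_path
    ipath[is.na(ipath)] <- ""          # NA paths become empty strings
    annotations_df$image_file <- basename(ipath)
  } else if (!is.null(annotations_df$image_id)) {
    # Schema without paths (e.g. image_id/valence/arousal): derive a name
    # from the id; the ".jpg" extension is an assumption
    annotations_df$image_file <- paste0(annotations_df$image_id, ".jpg")
  } else {
    # No usable source column: skip filename matching instead of erroring
    annotations_df$image_file <- NA_character_
  }
  annotations_df
}
```

Because the column check happens before any subsetting, `object of type 'NULL' is not subsettable` can no longer be raised during post-processing.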
```r
# Optional label set enforcement (lowercase normalization)
if (!is.null(labels_set) && length(predictions$pred) > 0) {
  allowed <- tolower(labels_set)
  pred_low <- tolower(predictions$pred)
  keep <- pred_low %in% allowed
  if (any(!keep)) {
    # Keep rows matching allowed set; drop others with a warning
    n_drop <- sum(!keep)
    if (n_drop > 0) {
      warning(sprintf("Filtered %d predictions not in labels_set", n_drop), call. = FALSE)
    }
    predictions <- predictions[keep, , drop = FALSE]
  }
  predictions$pred <- pred_low[keep]
}

# Create evaluation dataset
eval_data <- create_evaluation_dataset(predictions, ground_truth, id_col)

# Forward to evaluate_emotions (probability columns not available here)
results <- evaluate_emotions(
  data = eval_data,
  truth_col = "truth",
  pred_col = "pred",
  probs_cols = NULL,
  classes = labels_set,
```
[P1] Normalise ground truth and class labels when enforcing labels_set
In validate_rag_predictions() predictions are forced to lowercase when labels_set is supplied, but labels_set itself and the ground_truth vector are left untouched while the lowercased predictions are evaluated against the original labels_set. As a result, if callers pass labels in their natural casing (e.g. c("Joy","Anger")) the predictions become "joy"/"anger" and evaluate_emotions() receives mismatched class levels (classes = labels_set). Those rows are dropped as NAs and metrics are computed on an empty or truncated dataset. Convert labels_set and the ground_truth column to the same normalised case as predictions (or avoid altering the case at all) before delegating to evaluate_emotions().
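The suggested fix can be sketched as a small helper that lowercases predictions, ground truth, and the class set together, so `evaluate_emotions()` receives consistent levels. The helper name `normalise_labels` is hypothetical; the column and variable names follow the snippet above.

```r
# Hypothetical helper: normalise all three label sources to one case
# before filtering and evaluation, per the review suggestion.
normalise_labels <- function(predictions, ground_truth, labels_set) {
  labels_set <- tolower(labels_set)
  predictions$pred <- tolower(predictions$pred)
  ground_truth <- tolower(ground_truth)
  keep <- predictions$pred %in% labels_set
  if (any(!keep)) {
    warning(sprintf("Filtered %d predictions not in labels_set", sum(!keep)),
            call. = FALSE)
    # Drop predictions and their paired truth values together so the
    # vectors stay aligned
    predictions <- predictions[keep, , drop = FALSE]
    ground_truth <- ground_truth[keep]
  }
  list(predictions = predictions,
       ground_truth = ground_truth,
       labels_set = labels_set)
}
```

With this in place, `c("Joy", "Anger")` passed as `labels_set` matches the lowercased `"joy"`/`"anger"` predictions instead of producing NA class levels.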
Quick test of uv install, text and image scores with: Rscript -e 'devtools::test()'