-
Notifications
You must be signed in to change notification settings - Fork 1
feat: include_query and name mapping in cellosaurus #66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds include_query to mapCell2Accession and documents it; tightens and validates request-builder inputs; centralizes parallel request and parse flows; standardizes helper signatures and parsing; adjusts output ordering/renaming; and expands tests to cover include_query and parsed = FALSE behaviors. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant mapCell2Accession
participant RequestBuilder
participant Requester
participant Parser
User->>mapCell2Accession: call(ids, include_query, parsed, query_only, raw, ...)
mapCell2Accession->>RequestBuilder: build validated request(s)
RequestBuilder-->>mapCell2Accession: return URL(s)
alt query_only = TRUE
mapCell2Accession-->>User: return URL(s)
else
mapCell2Accession->>Requester: perform requests (parallel, mc.cores-aware)
Requester-->>mapCell2Accession: raw responses
alt raw = TRUE
mapCell2Accession-->>User: return raw responses
else
mapCell2Accession->>Parser: parse responses (per-entry, named args)
Parser-->>mapCell2Accession: parsed data.table
alt include_query = FALSE
mapCell2Accession->>mapCell2Accession: drop columns matching /^query/
end
alt parsed = FALSE
mapCell2Accession->>mapCell2Accession: rename API fields to user-friendly names
end
mapCell2Accession-->>User: return final data.table
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
🔇 Additional comments (13)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #66 +/- ##
==========================================
+ Coverage 72.21% 77.76% +5.55%
==========================================
Files 20 20
Lines 1004 1309 +305
==========================================
+ Hits 725 1018 +293
- Misses 279 291 +12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
man/mapCell2Accession.Rd (1)
7-19: Docs: default for numResults inconsistent (10000 vs 1000)Usage shows
numResults = 10000, and tests expectrows=10000, but the Arguments section says “Default is 1000.”Please align the description to 10000 (or change the code), and mention any performance implications.
Apply one of:
-\item{numResults}{The maximum number of results to return for each query. Default is 1000.} +\item{numResults}{The maximum number of results to return for each query. Default is 10000.}or reduce the code default and update tests.
R/cellosaurus.R (1)
337-349: Code usesifelse()which corrupts list-column integrity for nested fieldsVerified: All three
ifelse()blocks exist at specified lines. The bug is real:
- Lines 344-348 (nestedKeys):
ifelse()coerceslist(.splitNestedCol(...))to atomic, breaking DI, DR, HI list-columns- Lines 351-355 (specialKeys):
ifelse()coercesx[key](list subsetting) to atomic, breaking CC list-column- Lines 337-341 (optionalKeys): Also uses incorrect pattern, though scalars are less vulnerable
The suggested patches are correct—replacing
ifelse()with scalarif/elsepreserves list-column structure for nested fields.
🧹 Nitpick comments (9)
R/cellosaurus_helpers.R (1)
154-262: Avoid hard-coding the external resources listThis static list will drift as Cellosaurus evolves, making tests brittle. Prefer deriving it from the OpenAPI schema (if exposed) or an authoritative endpoint, with a fallback snapshot.
Options:
- Query schema and extract enumerations; cache in package data for offline tests.
- In tests, assert that a core subset is present instead of full equality.
I can prototype a helper to pull and cache the list if useful.
tests/testthat/test_cellosaurus_helpers.R (1)
112-220: Loosen equality for external resourcesExact equality to a long, upstream-maintained list is fragile. Assert presence of a stable subset and uniqueness instead.
Example:
- expect_equal(resources, expected_resources) + expect_true(all(c("ATCC","DSMZ","DepMap","GDSC","Cosmic") %in% resources)) + expect_true(anyDuplicated(resources) == 0L)tests/testthat/test_cellosaurus.R (1)
122-128: Nice coverage for include_query; consider also asserting col orderYou’re removing all
^querycolumns. Add an assertion that core columns remain first andquery(when included) is last, to guard ordering regressions.Example follow-up:
res <- mapCell2Accession("Hela") expect_identical(colnames(res)[1:2], c("cellLineName","accession")) expect_identical(tail(colnames(res), 1), "query")R/cellosaurus_annotations.R (1)
50-54: Operational hardening: timeouts and rate‑limitsParallel requests to a public API can hang or trip rate limits.
- Ensure
.build_request()or.perform_request_parallel()sets per‑request timeouts and a sane concurrency.- Add bounded retries with jittered backoff on transient errors (HTTP 429/5xx).
If you want, I can sketch a small wrapper for httr2 with timeout/retry config.
Also applies to: 56-59, 61-67
R/cellosaurus.R (5)
19-43: Minor style nit: boolean checks
if (common == TRUE)andif (upper == TRUE)work, butisTRUE(common)/isTRUE(upper)is clearer and avoids truthy pitfalls.Optional tweak:
- if (common == TRUE) { + if (isTRUE(common)) { @@ - if (upper == TRUE) { + if (isTRUE(upper)) {Also applies to: 49-52
121-133: De‑duplicate field listsThe hard‑coded
tovector diverges fromcellosaurus_fields(common=TRUE). Consider sourcing from one place (with a curated subset) to avoid drift.Example:
to <- intersect(cellosaurus_fields(common = TRUE), c("ac","id","sy","misspelling","dr","cc","ca","di","ag","sx","hi"))
161-165: Parallel fetch: ensure bounded concurrency and backoffMake sure
.perform_request_parallel()caps concurrency and handles 429/5xx with retry + jitter. Otherwise, largeidsmay overwhelm the API.
233-247: Robust line slicingConsider guarding against malformed entries (missing delimiters) to avoid out‑of‑bounds indexing when splitting. Low risk, but easy to harden.
468-481: Avoid recomputingmatchNested(...)You call
matchNestedmultiple times with identical args inside conditionals. Cache results to reduce duplicate work and simplify flow.Sketch:
matches_query <- matchNested(query, responses_dt, keep_duplicates = keep_duplicates) if (any(responses_dt$cellLineName == query)) { ... } else if (length(matches_query) > 0) { ... }Also applies to: 486-492, 496-506
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
DESCRIPTION(1 hunks)R/cellosaurus.R(12 hunks)R/cellosaurus_annotations.R(3 hunks)R/cellosaurus_helpers.R(2 hunks)man/mapCell2Accession.Rd(2 hunks)tests/testthat/test_cellosaurus.R(2 hunks)tests/testthat/test_cellosaurus_helpers.R(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: macos-latest (release)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: pkgdown
- GitHub Check: Code-Coverage
🔇 Additional comments (8)
DESCRIPTION (1)
44-44: Roxygen bump looks fineNo functional changes here. Remember to re-generate Rd and commit if roxygenized locally.
tests/testthat/test_cellosaurus.R (1)
116-120: LGTM: validates user‑friendly column renamesThis correctly asserts
parsed = FALSEreturns renamed columns likesynonymsanddiseases.man/mapCell2Accession.Rd (1)
40-42: Docs read well; include_query and parsed=FALSE behavior clearly statedThe additions accurately describe column behavior and friendly renames.
Also applies to: 47-50
R/cellosaurus.R (5)
101-112: Public API change LGTM
include_querydefault TRUE is sensible and backwards‑friendly with explicit opt‑out.
155-157: Query‑only return shape is consistentReturning a named list of URLs keyed by input IDs is handy for debugging. LGTM.
193-217: Column removal and ordering logic looks good
^queryremoval is robust (handlesquery:acetc.).- Reordering to keep core first and
querylast (when included) improves UX.LGTM.
253-284: Parsing flow is clear and separated byparsedflagNice separation; try/catch around
rbindlistis appreciated. LGTM.
526-531: Formatting helpers LGTM
synonymssplit andcommentsstructuring look correct.Also applies to: 539-546
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (6)
R/cellosaurus_helpers.R (6)
101-105: Makeoutputcase‑insensitive; preferassert_flag()forquery_only.Small polish to reduce footguns and be idiomatic with checkmate.
Apply this diff:
checkmate::assert_character(output, len = 1, min.chars = 1) - checkmate::assert_character(sort, len = 1, null.ok = TRUE, min.chars = 1) - checkmate::assert_logical(query_only, len = 1) + output <- toupper(output) + checkmate::assert_character(sort, len = 1, null.ok = TRUE, min.chars = 1) + checkmate::assert_flag(query_only)
112-119: Compare against lower‑cased schema and dedupetoto avoid invalid/duplicate fields.Prevents false negatives if
cellosaurus_fields()returns mixed case and avoids duplicate query params.- to <- tolower(to) - invalid_fields <- setdiff(to, cellosaurus_fields()) + to <- unique(tolower(to)) + allowed_to <- tolower(cellosaurus_fields()) + invalid_fields <- setdiff(to, allowed_to)Please confirm
cellosaurus_fields()returns field IDs (not labels) and whether case can vary across versions.
148-155: Setsortandrowsonly for the search endpoint.Avoids sending irrelevant params for non-search resources.
- if (!is.null(sort)) { - opts$sort <- paste0(sort, " asc") - } + if (identical(apiResource, "search/cell-line") && !is.null(sort)) { + opts$sort <- paste0(sort, " asc") + } @@ - opts$rows <- numResults + if (identical(apiResource, "search/cell-line")) { + opts$rows <- numResults + }Do non-search endpoints accept/ignore
rowsandsort? If yes, feel free to keep as-is.
55-56: Fix @return to reflect dual return types.Doc currently says “character string” but the function returns a URL when
query_only = TRUE, otherwise anhttr2_request.-#' @return A character string representing the constructed URL for the Cellosaurus API request. +#' @return If `query_only = TRUE`, a character URL string; otherwise an `httr2_request` object ready to be performed.
137-139: Current code correctly implements AND logic per Cellosaurus API documentation.The Cellosaurus API treats spaces in the search parameter as an implicit AND operator—the behavior is well-specified, not ambiguous. Your current implementation
collapse = " "correctly follows this specification. If you need to support OR queries or other boolean operators in the future, exposing a configurablequery_operatorparameter would be a valuable optional enhancement.
192-299: AlphabetizeextResourcesfor stability and maintainability.Verification found no duplicates or typos (only "RSCB" exists at line 289, which is correct). However, the list is not currently alphabetized. Sorting would ensure consistent ordering and reduce maintenance drift. Current order differs from alphabetical order due to mixed case; consider standardizing capitalization alongside alphabetization.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
R/cellosaurus_helpers.R(3 hunks)tests/testthat/test_cellosaurus_helpers.R(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/testthat/test_cellosaurus_helpers.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: macos-latest (release)
- GitHub Check: pkgdown
🔇 Additional comments (1)
R/cellosaurus_helpers.R (1)
66-88: Nice cleanup on request builder signature and guardrails.Removing variadic
..., addingallowed_*sets, and explicit defaults tighten the API surface. Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
R/cellosaurus.R (1)
345-351: Bug: ifelse() with list outputs corrupts column typesifelse() returns atomic vectors; mixing list outputs (nested columns) with character NA will coerce or error. Use plain if/else assignments.
- for (key in nestedKeys) { - dt[[key]] <- ifelse( - is.null(x[[key]]), - NA_character_, - list(.splitNestedCol(x, key, "; ")[[key]]) - ) - } + for (key in nestedKeys) { + if (is.null(x[[key]])) { + dt[[key]] <- NA_character_ + } else { + dt[[key]] <- list(.splitNestedCol(x, key, "; ")[[key]]) + } + } - for (key in specialKeys) { - dt[[key]] <- ifelse( - is.null(x[[key]]), - NA_character_, - x[key] - ) - } + for (key in specialKeys) { + if (is.null(x[[key]])) { + dt[[key]] <- NA_character_ + } else { + dt[[key]] <- x[key] + } + }Also applies to: 377-422
🧹 Nitpick comments (6)
R/cellosaurus_helpers.R (2)
36-38: Avoid 1:length(ids) to prevent 1:0 bug; vectorize constructionUse seq_along(ids) (empty-safe) and paste for equal-length vectors.
- sapply(1:length(ids), function(i) { - paste(from[i], ids[i], sep = ":") - }) + paste(from, ids, sep = ":")Or, if you prefer keeping indices:
- sapply(1:length(ids), function(i) { + sapply(seq_along(ids), function(i) { paste(from[i], ids[i], sep = ":") })
188-197: numResults and sort validation: LGTM; minor enhancementValidation is solid. Consider allowing explicit descending order via a separate order arg or by recognizing "field desc" to match API flexibility.
tests/testthat/test_cellosaurus_helpers.R (1)
41-45: Relax brittle row-count assertions against live APIExact nrow equality will flap as upstream data changes. Assert minima or classes instead.
- expect_equal(nrow(response), 2) + expect_gte(nrow(response), 1) - expect_equal(nrow(response), 3) + expect_gte(nrow(response), 1)Also applies to: 76-79
R/cellosaurus.R (3)
143-154: Parallelism portability: set mc.cores via optionmclapply defaults can warn or behave differently on Windows. Respect an option and default to 1.
- requests <- parallel::mclapply( + requests <- parallel::mclapply( queries, function(query) { .build_cellosaurus_request( query = query, to = to, numResults = numResults, sort = sort, output = "TXT" ) - } - ) + }, + mc.cores = getOption("mc.cores", 1L) + )
194-203: Order all query columns together at the end (not justquery)*Currently only “query” is repositioned. Group every column starting with “query” to the tail for consistency with include_query semantics.
- if (!include_query) { - query_cols <- grep( - pattern = "^query", - x = names(responses_dt), - value = TRUE - ) - if (length(query_cols) > 0L) { - responses_dt[, (query_cols) := NULL] - } - } + if (!include_query) { + query_cols <- grep("^query", names(responses_dt), value = TRUE) + if (length(query_cols) > 0L) responses_dt[, (query_cols) := NULL] + } core_cols <- c("cellLineName", "accession") existing_core <- core_cols[core_cols %in% names(responses_dt)] if (length(existing_core) > 0L) { - extra_cols <- setdiff(names(responses_dt), existing_core) - if (include_query && "query" %in% extra_cols) { - extra_cols <- setdiff(extra_cols, "query") - data.table::setcolorder( - responses_dt, - c(existing_core, extra_cols, "query") - ) - } else { - data.table::setcolorder(responses_dt, c(existing_core, extra_cols)) - } + query_cols <- if (include_query) grep("^query", names(responses_dt), value = TRUE) else character(0) + extra_cols <- setdiff(names(responses_dt), c(existing_core, query_cols)) + data.table::setcolorder(responses_dt, c(existing_core, extra_cols, query_cols)) }Also applies to: 205-218
100-111: Documented...in mapCell2Accession is not forwardedThe
...argument is accepted but not used. Either forward to helpers (e.g., pass to .perform_request_parallel/.build_cellosaurus_request if supported) or update docs to avoid implying propagation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
R/cellosaurus.R(12 hunks)R/cellosaurus_helpers.R(3 hunks)tests/testthat/test_cellosaurus_helpers.R(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: macos-latest (release)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: pkgdown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
R/cellosaurus.R (2)
19-53: Consider idiomatic boolean checks in R.Using
== TRUEis less idiomatic in R. Prefer direct boolean evaluation orisTRUE()for safer checking.Apply this diff for more idiomatic style:
- if (common == TRUE) { + if (isTRUE(common)) { fields <- c( "id", ... ) } else { schema <- .cellosaurus_schema() fields <- schema$components$schemas$Fields$enum } - if (upper == TRUE) { + if (isTRUE(upper)) { fields <- toupper(fields) } else { fields <- tolower(fields) }
466-519: Consider refactoring repetitive conditional matching logic.The TODO at line 466 correctly identifies repetitive conditionals. Each branch calls
matchNestedwith similar patterns. Consider extracting a helper function that tries multiple match strategies in sequence.Would you like me to generate a refactored version that reduces duplication?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
R/cellosaurus.R(12 hunks)R/cellosaurus_helpers.R(4 hunks)man/mapCell2Accession.Rd(2 hunks)tests/testthat/test_cellosaurus_helpers.R(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/testthat/test_cellosaurus_helpers.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: Code-Coverage
- GitHub Check: macos-latest (release)
- GitHub Check: Test-Docker-Build
- GitHub Check: pkgdown
🔇 Additional comments (11)
R/cellosaurus.R (8)
100-112: LGTM: include_query parameter added with backward compatibility.The new
include_queryparameter with defaultTRUEmaintains backward compatibility while providing the requested functionality.
121-133: LGTM: Improved readability.Multi-line formatting makes the field list clearer and easier to maintain.
143-159: LGTM: Clean early return for query_only.The early return at lines 157-159 avoids unnecessary request execution when only URLs are needed.
174-195: LGTM: Named arguments improve clarity.Passing
parsed = parsedandkeep_duplicates = keep_duplicatesas named arguments enhances readability in the parallel processing flow.
199-204: LGTM: Safe column removal logic.The check for
length(query_cols) > 0Lprevents errors when no query columns exist, and the data.table syntax is idiomatic.
206-219: LGTM: Logical column ordering.The three-tier ordering (core → extra → query) provides a consistent, predictable structure. The conditional inclusion of query columns respects the
include_queryparameter.
265-272: LGTM: Enhanced error handling for rbindlist.The
tryCatchwrapper provides informative error messages when response parsing fails, aiding debugging.
378-423: LGTM: Quality-of-life column name mapping implemented.The renaming logic maps Cellosaurus API short codes to descriptive names (e.g.,
SY→synonyms,DI→diseases), directly addressing issue #33. The use ofskip_absent = TRUEensures robustness when optional fields are missing.man/mapCell2Accession.Rd (1)
17-17: LGTM: Documentation accurately reflects implementation.The documentation correctly describes the
include_queryparameter's behavior and clarifies the user-friendly column renaming whenparsed = FALSE.Also applies to: 40-43, 47-49
R/cellosaurus_helpers.R (2)
83-195: LGTM: Comprehensive validation addresses all past review concerns.The enhanced validation logic:
- Validates all parameters with clear error messages
- Implements field checking against available Cellosaurus fields with case normalization
- Provides fallback when API is unavailable
- Coerces and validates
numResultsas required- Expands
allowed_sortto include all valid fieldsThis directly addresses the past review comments and significantly improves input safety.
Based on past review comments.
257-367: LGTM: Improved readability and typo fix confirmed.Multi-line formatting makes the resource list easier to maintain. The "RSCB" → "RCSB" typo correction (line 355) from previous review is confirmed.
Based on past review comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
R/cellosaurus.R(13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: macos-latest (release)
- GitHub Check: Test-Docker-Build
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: pkgdown
🔇 Additional comments (7)
R/cellosaurus.R (7)
100-112: LGTM: Well-designed parameter addition.The
include_queryparameter is properly added with a sensible default (TRUE) that maintains backward compatibility while enabling the new feature.
143-159: LGTM: Parallel request building is appropriate.The parallel processing pattern is well-suited for independent HTTP requests. Note that
mclapplyfalls back to sequential execution on Windows, which is acceptable for this use case.
174-195: LGTM: Robust parallel parsing implementation.The error handling for empty results (lines 180-185) is appropriate, and passing
parsedandkeep_duplicatesas named arguments improves code clarity.
199-219: LGTM: Correct implementation of include_query feature.The logic properly handles both
parsed = TRUEandparsed = FALSEcases:
- Query columns are removed when
include_query = FALSE(lines 199-204)- Column reordering places core fields first, then metadata, then query fields (lines 206-219)
The use of
length(existing_core) > 0Lguards against edge cases where core columns might be absent.
255-286: LGTM: Clean separation of parsed vs. unparsed paths.The early return for
!parsed(line 276-279) clearly separates the two output modes, and the minimal column selection forparsed = TRUE(line 283) aligns with the documented behavior.
418-423: Good practice: Using skip_absent = TRUE.The
skip_absent = TRUEparameter is essential here since Cellosaurus entries may not contain all possible fields. This prevents errors when renaming columns that don't exist in a particular response.
428-496: LGTM: Excellent refactor using strategy pattern.Consolidating the multiple matching strategies into a single helper function improves maintainability and readability. The sequential fallback approach (trying each strategy until one succeeds) is clear and appropriate for this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
R/cellosaurus.R(13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: pkgdown
🔇 Additional comments (6)
R/cellosaurus.R (6)
100-112: LGTM:include_queryparameter added correctly.The new parameter aligns with the PR objectives and defaults to
TRUEto maintain backward compatibility.
143-159: Good use of parallel processing with proper fallbacks.The explicit parameter passing to
.build_cellosaurus_requestand themc.coresoption fallback chain are well-implemented.
199-221: Well-implemented query column handling and ordering.The logic correctly implements the PR objective:
- Removes query columns when
include_query = FALSE- Reorders columns to place core fields first, followed by metadata, then query columns
The edge case handling (missing columns) is appropriate.
265-272: Excellent addition of error handling.The
tryCatchblock aroundrbindlistprovides graceful handling of malformed responses with informative error messages.
378-423: Excellent implementation of descriptive column name mapping.The explicit mapping from API column IDs (e.g.,
AC,SY) to user-friendly names (e.g.,accession,synonyms) directly addresses the PR objectives. Usingskip_absent = TRUEensures robustness when columns are missing.
428-496: Well-refactored matching logic using strategy pattern.The extraction of matching strategies into
.match_cellosaurus_candidatesaddresses the past review comment and improves maintainability. The strategy pattern makes the matching logic clear and easy to extend.
8cf2c40 to
132f307
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
R/cellosaurus.R (1)
537-547: Critical: Remove duplicate function call.Lines 537-542 and 543-547 are identical calls to
.match_cellosaurus_candidates(). This duplication will cause the first call's result to be overwritten and wastes computation. This appears to be a copy-paste or merge conflict error.Apply this diff to remove the duplicate:
result <- .match_cellosaurus_candidates( responses_dt = responses_dt, query = query, name = name, keep_duplicates = keep_duplicates ) - responses_dt = responses_dt, - query = query, - name = name, - keep_duplicates = keep_duplicates - ) -Note: This issue was flagged in previous review comments and appears to still be present.
🧹 Nitpick comments (4)
R/cellosaurus.R (4)
143-155: Consider adding error handling for parallel request building.The
mclapplycall lacks explicit error handling. If.build_cellosaurus_request()fails for any query, the error may propagate silently or cause unexpected behavior.Consider wrapping the
.build_cellosaurus_request()call in atryCatchto handle errors gracefully:requests <- parallel::mclapply( queries, function(query) { - .build_cellosaurus_request( - query = query, - to = to, - numResults = numResults, - sort = sort, - output = "TXT" - ) + tryCatch( + .build_cellosaurus_request( + query = query, + to = to, + numResults = numResults, + sort = sort, + output = "TXT" + ), + error = function(e) { + .err(paste0("Failed to build request for query: ", query, " - ", e$message)) + } + ) }, mc.cores = getOption("annotationgx.mc.cores", getOption("mc.cores", 1L)) )
174-195: Consider adding error handling for parallel parsing.Similar to the request-building phase, the parsing
mclapplylacks explicit error handling. If.parse_cellosaurus_lines()or.parse_cellosaurus_text()fail, errors may propagate unexpectedly.Wrap the parsing logic in a
tryCatchto handle errors gracefully:responses_dt <- parallel::mclapply( ids, function(name) { - resp <- responses[[name]] - - resp <- .parse_cellosaurus_lines(resp) - if (length(resp) == 0L) { - .warn(paste0("No results found for ", name)) - result <- data.table::data.table() - result$query <- name - return(result) - } - response_dt <- .parse_cellosaurus_text( - resp, - name, - parsed = parsed, - keep_duplicates = keep_duplicates - ) - response_dt + tryCatch( + { + resp <- responses[[name]] + resp <- .parse_cellosaurus_lines(resp) + if (length(resp) == 0L) { + .warn(paste0("No results found for ", name)) + result <- data.table::data.table() + result$query <- name + return(result) + } + response_dt <- .parse_cellosaurus_text( + resp, + name, + parsed = parsed, + keep_duplicates = keep_duplicates + ) + response_dt + }, + error = function(e) { + .warn(paste0("Failed to parse response for ", name, ": ", e$message)) + result <- data.table::data.table() + result$query <- name + result + } + ) }, mc.cores = getOption("annotationgx.mc.cores", getOption("mc.cores", 1L)) )
206-219: Column reordering logic is correct.The logic correctly reorders columns to place core fields first, followed by extra fields, and then query fields (when included). The implementation handles edge cases appropriately.
For slightly improved readability, consider extracting the column collection into a helper:
.get_column_order <- function(dt_names, include_query) { core_cols <- c("cellLineName", "accession") existing_core <- core_cols[core_cols %in% dt_names] query_cols <- if (include_query) { grep("^query", dt_names, value = TRUE) } else { character(0) } extra_cols <- setdiff(dt_names, c(existing_core, query_cols)) c(existing_core, extra_cols, query_cols) }
428-497: Excellent refactoring - consolidates matching logic.The new
.match_cellosaurus_candidates()function effectively consolidates multiple matching strategies using the strategy pattern. The implementation is clean, maintainable, and allows for easy addition of new matching strategies.Consider adding brief inline comments documenting each strategy's purpose for future maintainers:
strategies <- list( # Strategy 1: Exact match on cellLineName function() { ... }, # Strategy 2: Nested match using original query function() { ... }, # Strategy 3: Nested match using cleaned name function() { ... }, # Strategy 4: Match on cleaned cellLineName function() { ... }, # Strategy 5: Match on cleaned synonyms function() { ... } )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
R/cellosaurus.R(13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: macos-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: Code-Coverage
- GitHub Check: pkgdown
🔇 Additional comments (6)
R/cellosaurus.R (6)
157-159: LGTM!The early return for
query_only = TRUEcorrectly avoids unnecessary network requests and returns the URLs as expected.
199-204: LGTM!The logic to remove query columns when
include_query = FALSEis correct and clear. The length check at line 201 is technically redundant (data.table handles NULL assignment to empty vectors gracefully), but it makes the intent explicit and prevents unnecessary operations.
317-425: LGTM! Robust handling of optional and nested fields.The updated
.processEntry()function now includes explicit NULL checks for optional, nested, and special keys, usingNA_character_as appropriate. The use ofskip_absent = TRUEin column renaming makes the code resilient to schema variations. This is excellent defensive programming.
255-286: LGTM! Improved error handling and cleaner flow.The updated function signature with explicit named parameters improves clarity. The
tryCatcharoundrbindlist(lines 265-272) provides robust error handling, and the early return forparsed = FALSE(line 276-278) improves efficiency.
87-111: LGTM! Well-designed parameter addition.The new
include_queryparameter is well-documented and defaults toTRUEfor backward compatibility. The parameter placement afterparsedis logical and maintains good API ergonomics.
19-47: Verify hardcoded common fields list is synchronized with current Cellosaurus API schema.The hardcoded field list (22 fields) used when
common = TRUElacks automated validation against the API schema. Test coverage exists but only validates the hardcoded behavior, not schema synchronization. No tests compare the hardcoded list againstschema$components$schemas$Fields$enumwhencommon = FALSE.Verify the list matches the current schema at https://api.cellosaurus.org/openapi.json, or add a test that validates equivalence between the two paths.
Closes #33
Summary by CodeRabbit
New Features
Behavior
Documentation
Tests