Skip to content

Conversation

@mtran-code
Copy link
Collaborator

@mtran-code mtran-code commented Oct 21, 2025

Closes #33

  • query columns are now optional (parameterized)
  • mapped api column ids to real names
  • added tests
  • formatted with air

Summary by CodeRabbit

  • New Features

    • Added include_query option to control presence of query-related columns (defaults to TRUE).
    • query_only path preserved to return request URLs without performing lookups.
  • Behavior

    • parsed = FALSE now returns user-friendly metadata column names and improved column ordering.
    • Stronger validation and more robust handling of edge cases for requests and parsing.
  • Documentation

    • Docs updated to describe include_query and parsed = FALSE output.
  • Tests

    • New tests covering include_query, parsed vs unparsed outputs, and request sorting/validation.

@coderabbitai
Copy link

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds include_query to mapCell2Accession and documents it; tightens and validates request-builder inputs; centralizes parallel request and parse flows; standardizes helper signatures and parsing; adjusts output ordering/renaming; and expands tests to cover include_query and parsed = FALSE behaviors.

Changes

Cohort / File(s) Summary
Metadata
DESCRIPTION
Bumped RoxygenNote from 7.3.2 to 7.3.3.
Core: mapCell2Accession & docs
R/cellosaurus.R, man/mapCell2Accession.Rd
Added include_query = TRUE parameter to mapCell2Accession; reordered signature; centralized mc.cores-aware parallel request/parse flow; preserved query_only early-return and raw/parsed paths; optionally drops/reorders query* columns and documents include_query and parsed = FALSE behavior.
Request builder & validation
R/cellosaurus_helpers.R
Replaced ... with explicit parameters in .build_cellosaurus_request; added validation for apiResource, output, sort, numResults, and to fields (normalized against cellosaurus_fields()); enforced allowed resources/outputs and strengthened input checks; adjusted URL assembly and query_only handling.
Annotations formatting
R/cellosaurus_annotations.R
Reformatted annotateCellAccession signature/call sites (multi-line); behavior unchanged.
Parsing & helpers refactor
R/cellosaurus.R (helpers: .parse_*, .processEntry, .match_cellosaurus_candidates, etc.)
Standardized helper signatures and spacing; introduced .match_cellosaurus_candidates; parallelized parsing per id; added explicit NA/NULL checks, safer rbindlist usage, consistent synonyms/comments formatting, and deterministic column renaming/ordering based on include_query.
Tests
tests/testthat/test_cellosaurus.R, tests/testthat/test_cellosaurus_helpers.R
Added tests asserting parsed = FALSE returns user-friendly names and include_query = FALSE removes query* columns; added skip_if_offline() / skip_on_cran() guards; added tests for allowed sort fields; adjusted some expectations to be >= checks.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant mapCell2Accession
    participant RequestBuilder
    participant Requester
    participant Parser

    User->>mapCell2Accession: call(ids, include_query, parsed, query_only, raw, ...)
    mapCell2Accession->>RequestBuilder: build validated request(s)
    RequestBuilder-->>mapCell2Accession: return URL(s)

    alt query_only = TRUE
        mapCell2Accession-->>User: return URL(s)
    else
        mapCell2Accession->>Requester: perform requests (parallel, mc.cores-aware)
        Requester-->>mapCell2Accession: raw responses

        alt raw = TRUE
            mapCell2Accession-->>User: return raw responses
        else
            mapCell2Accession->>Parser: parse responses (per-entry, named args)
            Parser-->>mapCell2Accession: parsed data.table

            alt include_query = FALSE
                mapCell2Accession->>mapCell2Accession: drop columns matching /^query/
            end

            alt parsed = FALSE
                mapCell2Accession->>mapCell2Accession: rename API fields to user-friendly names
            end

            mapCell2Accession-->>User: return final data.table
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through queries, tidy and spry,

nudged out the query* columns with a sigh.
Sy became "synonyms", neat and bright,
tests hopped in to check each light.
A rabbit nods — the changes sit just right.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: include_query and name mapping in cellosaurus" accurately reflects the two main objectives of the changeset: the addition of an include_query parameter to control whether query columns are included in results, and the implementation of name mapping from API column IDs to descriptive names. The title is concise, clear, and specific enough that a teammate reviewing the git history would understand the primary changes without requiring additional context.
Linked Issues Check ✅ Passed The PR successfully addresses both coding requirements from issue #33. First, the include_query parameter has been added to mapCell2Accession with a default value of TRUE, allowing users to optionally exclude query columns when set to FALSE, with corresponding test coverage verifying this behavior. Second, the name mapping feature has been implemented through updated old_names/new_names mappings in the data table operations, with tests confirming that parsed=FALSE returns user-friendly column names like synonyms and diseases. Helper functions have also been refactored to support streamlined data.table assignment operations as requested.
Out of Scope Changes Check ✅ Passed The PR changes are appropriately scoped to the objectives outlined in issue #33. The core functionality additions (include_query parameter and name mapping) directly address the requirements, while supporting changes to validation logic in .build_cellosaurus_request and helper function refactoring serve to enable or streamline these features. Formatting and cosmetic changes are minimal, explicitly mentioned in the PR description as "formatted with air," and a RoxygenNote version bump in DESCRIPTION is a minor metadata update. No significant out-of-scope functionality has been introduced beyond what is necessary to implement the stated objectives.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch mtran/cellosaurus-33

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cf2c40 and 132f307.

📒 Files selected for processing (1)
  • R/cellosaurus.R (13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: macos-latest (release)
  • GitHub Check: Code-Coverage
  • GitHub Check: pkgdown
🔇 Additional comments (13)
R/cellosaurus.R (13)

19-53: LGTM! Improved control flow clarity.

The explicit if-else structure and use of isTRUE() improve readability and defensive coding.


87-94: LGTM! Clear documentation of new parameter.

The include_query parameter is well-documented with its default value and behavior clearly explained.


100-112: LGTM! Backward-compatible API extension.

The include_query = TRUE default preserves existing behavior while enabling the new feature.


157-159: LGTM! Clean early return for query-only mode.

The implementation correctly returns URLs without performing requests when query_only = TRUE.


199-204: LGTM! Correct implementation of query column filtering.

The pattern matching and conditional removal logic correctly implements the include_query feature.


206-219: LGTM! Logical column ordering with appropriate priority.

The reordering logic correctly places core fields first, followed by metadata, and query columns last when included. The complexity is justified by the need to handle varying column sets dynamically.


235-249: LGTM! Consistent formatting.

The formatting improvements enhance readability without changing logic.


255-286: LGTM! Proper separation of parsed and unparsed paths.

The function correctly handles both modes: when parsed = FALSE, it returns all data with renamed columns; when parsed = TRUE, it filters to matched results with core columns.


378-424: LGTM! Clean implementation of API-to-descriptive name mapping.

The explicit name mapping from API short IDs (e.g., SY, DI, DR) to descriptive names (e.g., synonyms, diseases, crossReferences) directly addresses the PR objective. The use of skip_absent = TRUE ensures robustness when not all fields are present.


428-496: LGTM! Well-structured strategy pattern for matching.

The function consolidates matching logic with a clear fallback hierarchy, improving maintainability. The progression from exact matches to broader synonym-based searches is logical.


537-543: LGTM! Clean integration with refactored matching helper.

The call to .match_cellosaurus_candidates successfully consolidates the previously repeated conditional logic mentioned in past review comments.


562-583: LGTM! Formatting consistency.

Minor formatting improvements maintain code consistency throughout the file.


143-156: Remove unnecessary parallelization for lightweight request construction.

The .build_cellosaurus_request() function performs only parameter validation, list operations, and string manipulation—all lightweight operations that complete quickly. Parallelizing these via mclapply introduces process forking and serialization overhead that exceeds any computational benefit. Reserve parallelization for expensive operations (parsing in lines 174-195).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 92.60143% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.76%. Comparing base (2a77bef) to head (132f307).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
R/cellosaurus.R 90.24% 20 Missing ⚠️
R/cellosaurus_annotations.R 0.00% 7 Missing ⚠️
R/cellosaurus_helpers.R 98.06% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #66      +/-   ##
==========================================
+ Coverage   72.21%   77.76%   +5.55%     
==========================================
  Files          20       20              
  Lines        1004     1309     +305     
==========================================
+ Hits          725     1018     +293     
- Misses        279      291      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
man/mapCell2Accession.Rd (1)

7-19: Docs: default for numResults inconsistent (10000 vs 1000)

Usage shows numResults = 10000, and tests expect rows=10000, but the Arguments section says “Default is 1000.”

Please align the description to 10000 (or change the code), and mention any performance implications.

Apply one of:

-\item{numResults}{The maximum number of results to return for each query. Default is 1000.}
+\item{numResults}{The maximum number of results to return for each query. Default is 10000.}

or reduce the code default and update tests.

R/cellosaurus.R (1)

337-349: Code uses ifelse() which corrupts list-column integrity for nested fields

Verified: All three ifelse() blocks exist at specified lines. The bug is real:

  • Lines 344-348 (nestedKeys): ifelse() coerces list(.splitNestedCol(...)) to atomic, breaking DI, DR, HI list-columns
  • Lines 351-355 (specialKeys): ifelse() coerces x[key] (list subsetting) to atomic, breaking CC list-column
  • Lines 337-341 (optionalKeys): Also uses incorrect pattern, though scalars are less vulnerable

The suggested patches are correct—replacing ifelse() with scalar if/else preserves list-column structure for nested fields.

🧹 Nitpick comments (9)
R/cellosaurus_helpers.R (1)

154-262: Avoid hard-coding the external resources list

This static list will drift as Cellosaurus evolves, making tests brittle. Prefer deriving it from the OpenAPI schema (if exposed) or an authoritative endpoint, with a fallback snapshot.

Options:

  • Query schema and extract enumerations; cache in package data for offline tests.
  • In tests, assert that a core subset is present instead of full equality.

I can prototype a helper to pull and cache the list if useful.

tests/testthat/test_cellosaurus_helpers.R (1)

112-220: Loosen equality for external resources

Exact equality to a long, upstream-maintained list is fragile. Assert presence of a stable subset and uniqueness instead.

Example:

-  expect_equal(resources, expected_resources)
+  expect_true(all(c("ATCC","DSMZ","DepMap","GDSC","Cosmic") %in% resources))
+  expect_true(anyDuplicated(resources) == 0L)
tests/testthat/test_cellosaurus.R (1)

122-128: Nice coverage for include_query; consider also asserting col order

You’re removing all ^query columns. Add an assertion that core columns remain first and query (when included) is last, to guard ordering regressions.

Example follow-up:

res <- mapCell2Accession("Hela")
expect_identical(colnames(res)[1:2], c("cellLineName","accession"))
expect_identical(tail(colnames(res), 1), "query")
R/cellosaurus_annotations.R (1)

50-54: Operational hardening: timeouts and rate‑limits

Parallel requests to a public API can hang or trip rate limits.

  • Ensure .build_request() or .perform_request_parallel() sets per‑request timeouts and a sane concurrency.
  • Add bounded retries with jittered backoff on transient errors (HTTP 429/5xx).

If you want, I can sketch a small wrapper for httr2 with timeout/retry config.

Also applies to: 56-59, 61-67

R/cellosaurus.R (5)

19-43: Minor style nit: boolean checks

if (common == TRUE) and if (upper == TRUE) work, but isTRUE(common) / isTRUE(upper) is clearer and avoids truthy pitfalls.

Optional tweak:

- if (common == TRUE) {
+ if (isTRUE(common)) {
@@
- if (upper == TRUE) {
+ if (isTRUE(upper)) {

Also applies to: 49-52


121-133: De‑duplicate field lists

The hard‑coded to vector diverges from cellosaurus_fields(common=TRUE). Consider sourcing from one place (with a curated subset) to avoid drift.

Example:

to <- intersect(cellosaurus_fields(common = TRUE), c("ac","id","sy","misspelling","dr","cc","ca","di","ag","sx","hi"))

161-165: Parallel fetch: ensure bounded concurrency and backoff

Make sure .perform_request_parallel() caps concurrency and handles 429/5xx with retry + jitter. Otherwise, large ids may overwhelm the API.


233-247: Robust line slicing

Consider guarding against malformed entries (missing delimiters) to avoid out‑of‑bounds indexing when splitting. Low risk, but easy to harden.


468-481: Avoid recomputing matchNested(...)

You call matchNested multiple times with identical args inside conditionals. Cache results to reduce duplicate work and simplify flow.

Sketch:

matches_query <- matchNested(query, responses_dt, keep_duplicates = keep_duplicates)
if (any(responses_dt$cellLineName == query)) {
  ...
} else if (length(matches_query) > 0) {
  ...
}

Also applies to: 486-492, 496-506

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a77bef and 737331d.

📒 Files selected for processing (7)
  • DESCRIPTION (1 hunks)
  • R/cellosaurus.R (12 hunks)
  • R/cellosaurus_annotations.R (3 hunks)
  • R/cellosaurus_helpers.R (2 hunks)
  • man/mapCell2Accession.Rd (2 hunks)
  • tests/testthat/test_cellosaurus.R (2 hunks)
  • tests/testthat/test_cellosaurus_helpers.R (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: macos-latest (release)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: pkgdown
  • GitHub Check: Code-Coverage
🔇 Additional comments (8)
DESCRIPTION (1)

44-44: Roxygen bump looks fine

No functional changes here. Remember to re-generate Rd and commit if roxygenized locally.

tests/testthat/test_cellosaurus.R (1)

116-120: LGTM: validates user‑friendly column renames

This correctly asserts parsed = FALSE returns renamed columns like synonyms and diseases.

man/mapCell2Accession.Rd (1)

40-42: Docs read well; include_query and parsed=FALSE behavior clearly stated

The additions accurately describe column behavior and friendly renames.

Also applies to: 47-50

R/cellosaurus.R (5)

101-112: Public API change LGTM

include_query default TRUE is sensible and backwards‑friendly with explicit opt‑out.


155-157: Query‑only return shape is consistent

Returning a named list of URLs keyed by input IDs is handy for debugging. LGTM.


193-217: Column removal and ordering logic looks good

  • ^query removal is robust (handles query:ac etc.).
  • Reordering to keep core first and query last (when included) improves UX.

LGTM.


253-284: Parsing flow is clear and separated by parsed flag

Nice separation; try/catch around rbindlist is appreciated. LGTM.


526-531: Formatting helpers LGTM

synonyms split and comments structuring look correct.

Also applies to: 539-546

@mtran-code mtran-code changed the title feat: cellosaurus updates feat: include_query and name mapping in cellosaurus Oct 21, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (6)
R/cellosaurus_helpers.R (6)

101-105: Make output case‑insensitive; prefer assert_flag() for query_only.

Small polish to reduce footguns and be idiomatic with checkmate.

Apply this diff:

   checkmate::assert_character(output, len = 1, min.chars = 1)
-  checkmate::assert_character(sort, len = 1, null.ok = TRUE, min.chars = 1)
-  checkmate::assert_logical(query_only, len = 1)
+  output <- toupper(output)
+  checkmate::assert_character(sort, len = 1, null.ok = TRUE, min.chars = 1)
+  checkmate::assert_flag(query_only)

112-119: Compare against lower‑cased schema and dedupe to to avoid invalid/duplicate fields.

Prevents false negatives if cellosaurus_fields() returns mixed case and avoids duplicate query params.

-  to <- tolower(to)
-  invalid_fields <- setdiff(to, cellosaurus_fields())
+  to <- unique(tolower(to))
+  allowed_to <- tolower(cellosaurus_fields())
+  invalid_fields <- setdiff(to, allowed_to)

Please confirm cellosaurus_fields() returns field IDs (not labels) and whether case can vary across versions.


148-155: Set sort and rows only for the search endpoint.

Avoids sending irrelevant params for non-search resources.

-  if (!is.null(sort)) {
-    opts$sort <- paste0(sort, " asc")
-  }
+  if (identical(apiResource, "search/cell-line") && !is.null(sort)) {
+    opts$sort <- paste0(sort, " asc")
+  }
@@
-  opts$rows <- numResults
+  if (identical(apiResource, "search/cell-line")) {
+    opts$rows <- numResults
+  }

Do non-search endpoints accept/ignore rows and sort? If yes, feel free to keep as-is.


55-56: Fix @return to reflect dual return types.

Doc currently says “character string” but the function returns a URL when query_only = TRUE, otherwise an httr2_request.

-#' @return A character string representing the constructed URL for the Cellosaurus API request.
+#' @return If `query_only = TRUE`, a character URL string; otherwise an `httr2_request` object ready to be performed.

137-139: Current code correctly implements AND logic per Cellosaurus API documentation.

The Cellosaurus API treats spaces in the search parameter as an implicit AND operator—the behavior is well-specified, not ambiguous. Your current implementation collapse = " " correctly follows this specification. If you need to support OR queries or other boolean operators in the future, exposing a configurable query_operator parameter would be a valuable optional enhancement.


192-299: Alphabetize extResources for stability and maintainability.

Verification found no duplicates or typos (only "RSCB" exists at line 289, which is correct). However, the list is not currently alphabetized. Sorting would ensure consistent ordering and reduce maintenance drift. Current order differs from alphabetical order due to mixed case; consider standardizing capitalization alongside alphabetization.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 737331d and 9257f6c.

📒 Files selected for processing (2)
  • R/cellosaurus_helpers.R (3 hunks)
  • tests/testthat/test_cellosaurus_helpers.R (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/testthat/test_cellosaurus_helpers.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Code-Coverage
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: macos-latest (release)
  • GitHub Check: pkgdown
🔇 Additional comments (1)
R/cellosaurus_helpers.R (1)

66-88: Nice cleanup on request builder signature and guardrails.

Removing variadic ..., adding allowed_* sets, and explicit defaults tighten the API surface. Looks good.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
R/cellosaurus.R (1)

345-351: Bug: ifelse() with list outputs corrupts column types

ifelse() returns atomic vectors; mixing list outputs (nested columns) with character NA will coerce or error. Use plain if/else assignments.

-  for (key in nestedKeys) {
-    dt[[key]] <- ifelse(
-      is.null(x[[key]]),
-      NA_character_,
-      list(.splitNestedCol(x, key, "; ")[[key]])
-    )
-  }
+  for (key in nestedKeys) {
+    if (is.null(x[[key]])) {
+      dt[[key]] <- NA_character_
+    } else {
+      dt[[key]] <- list(.splitNestedCol(x, key, "; ")[[key]])
+    }
+  }

-  for (key in specialKeys) {
-    dt[[key]] <- ifelse(
-      is.null(x[[key]]),
-      NA_character_,
-      x[key]
-    )
-  }
+  for (key in specialKeys) {
+    if (is.null(x[[key]])) {
+      dt[[key]] <- NA_character_
+    } else {
+      dt[[key]] <- x[key]
+    }
+  }

Also applies to: 377-422

🧹 Nitpick comments (6)
R/cellosaurus_helpers.R (2)

36-38: Avoid 1:length(ids) to prevent 1:0 bug; vectorize construction

Use seq_along(ids) (empty-safe) and paste for equal-length vectors.

-  sapply(1:length(ids), function(i) {
-    paste(from[i], ids[i], sep = ":")
-  })
+  paste(from, ids, sep = ":")

Or, if you prefer keeping indices:

-  sapply(1:length(ids), function(i) {
+  sapply(seq_along(ids), function(i) {
     paste(from[i], ids[i], sep = ":")
   })

188-197: numResults and sort validation: LGTM; minor enhancement

Validation is solid. Consider allowing explicit descending order via a separate order arg or by recognizing "field desc" to match API flexibility.

tests/testthat/test_cellosaurus_helpers.R (1)

41-45: Relax brittle row-count assertions against live API

Exact nrow equality will flap as upstream data changes. Assert minima or classes instead.

-  expect_equal(nrow(response), 2)
+  expect_gte(nrow(response), 1)

-  expect_equal(nrow(response), 3)
+  expect_gte(nrow(response), 1)

Also applies to: 76-79

R/cellosaurus.R (3)

143-154: Parallelism portability: set mc.cores via option

mclapply defaults can warn or behave differently on Windows. Respect an option and default to 1.

-  requests <- parallel::mclapply(
+  requests <- parallel::mclapply(
     queries,
     function(query) {
       .build_cellosaurus_request(
         query = query,
         to = to,
         numResults = numResults,
         sort = sort,
         output = "TXT"
       )
-    }
-  )
+    },
+    mc.cores = getOption("mc.cores", 1L)
+  )

194-203: Order all query columns together at the end (not just query)*

Currently only “query” is repositioned. Group every column starting with “query” to the tail for consistency with include_query semantics.

-  if (!include_query) {
-    query_cols <- grep(
-      pattern = "^query",
-      x = names(responses_dt),
-      value = TRUE
-    )
-    if (length(query_cols) > 0L) {
-      responses_dt[, (query_cols) := NULL]
-    }
-  }
+  if (!include_query) {
+    query_cols <- grep("^query", names(responses_dt), value = TRUE)
+    if (length(query_cols) > 0L) responses_dt[, (query_cols) := NULL]
+  }

   core_cols <- c("cellLineName", "accession")
   existing_core <- core_cols[core_cols %in% names(responses_dt)]
   if (length(existing_core) > 0L) {
-    extra_cols <- setdiff(names(responses_dt), existing_core)
-    if (include_query && "query" %in% extra_cols) {
-      extra_cols <- setdiff(extra_cols, "query")
-      data.table::setcolorder(
-        responses_dt,
-        c(existing_core, extra_cols, "query")
-      )
-    } else {
-      data.table::setcolorder(responses_dt, c(existing_core, extra_cols))
-    }
+    query_cols <- if (include_query) grep("^query", names(responses_dt), value = TRUE) else character(0)
+    extra_cols <- setdiff(names(responses_dt), c(existing_core, query_cols))
+    data.table::setcolorder(responses_dt, c(existing_core, extra_cols, query_cols))
   }

Also applies to: 205-218


100-111: Documented ... in mapCell2Accession is not forwarded

The ... argument is accepted but not used. Either forward to helpers (e.g., pass to .perform_request_parallel/.build_cellosaurus_request if supported) or update docs to avoid implying propagation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9257f6c and 2510ca0.

📒 Files selected for processing (3)
  • R/cellosaurus.R (12 hunks)
  • R/cellosaurus_helpers.R (3 hunks)
  • tests/testthat/test_cellosaurus_helpers.R (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: macos-latest (release)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: Code-Coverage
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: pkgdown

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
R/cellosaurus.R (2)

19-53: Consider idiomatic boolean checks in R.

Using == TRUE is less idiomatic in R. Prefer direct boolean evaluation or isTRUE() for safer checking.

Apply this diff for more idiomatic style:

-  if (common == TRUE) {
+  if (isTRUE(common)) {
     fields <- c(
       "id",
       ...
     )
   } else {
     schema <- .cellosaurus_schema()
     fields <- schema$components$schemas$Fields$enum
   }
 
-  if (upper == TRUE) {
+  if (isTRUE(upper)) {
     fields <- toupper(fields)
   } else {
     fields <- tolower(fields)
   }

466-519: Consider refactoring repetitive conditional matching logic.

The TODO at line 466 correctly identifies repetitive conditionals. Each branch calls matchNested with similar patterns. Consider extracting a helper function that tries multiple match strategies in sequence.

Would you like me to generate a refactored version that reduces duplication?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2510ca0 and 5bee4e9.

📒 Files selected for processing (4)
  • R/cellosaurus.R (12 hunks)
  • R/cellosaurus_helpers.R (4 hunks)
  • man/mapCell2Accession.Rd (2 hunks)
  • tests/testthat/test_cellosaurus_helpers.R (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/testthat/test_cellosaurus_helpers.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: Code-Coverage
  • GitHub Check: macos-latest (release)
  • GitHub Check: Test-Docker-Build
  • GitHub Check: pkgdown
🔇 Additional comments (11)
R/cellosaurus.R (8)

100-112: LGTM: include_query parameter added with backward compatibility.

The new include_query parameter with default TRUE maintains backward compatibility while providing the requested functionality.


121-133: LGTM: Improved readability.

Multi-line formatting makes the field list clearer and easier to maintain.


143-159: LGTM: Clean early return for query_only.

The early return at lines 157-159 avoids unnecessary request execution when only URLs are needed.


174-195: LGTM: Named arguments improve clarity.

Passing parsed = parsed and keep_duplicates = keep_duplicates as named arguments enhances readability in the parallel processing flow.


199-204: LGTM: Safe column removal logic.

The check for length(query_cols) > 0L prevents errors when no query columns exist, and the data.table syntax is idiomatic.


206-219: LGTM: Logical column ordering.

The three-tier ordering (core → extra → query) provides a consistent, predictable structure. The conditional inclusion of query columns respects the include_query parameter.


265-272: LGTM: Enhanced error handling for rbindlist.

The tryCatch wrapper provides informative error messages when response parsing fails, aiding debugging.


378-423: LGTM: Quality-of-life column name mapping implemented.

The renaming logic maps Cellosaurus API short codes to descriptive names (e.g., SYsynonyms, DIdiseases), directly addressing issue #33. The use of skip_absent = TRUE ensures robustness when optional fields are missing.

man/mapCell2Accession.Rd (1)

17-17: LGTM: Documentation accurately reflects implementation.

The documentation correctly describes the include_query parameter's behavior and clarifies the user-friendly column renaming when parsed = FALSE.

Also applies to: 40-43, 47-49

R/cellosaurus_helpers.R (2)

83-195: LGTM: Comprehensive validation addresses all past review concerns.

The enhanced validation logic:

  • Validates all parameters with clear error messages
  • Implements field checking against available Cellosaurus fields with case normalization
  • Provides fallback when API is unavailable
  • Coerces and validates numResults as required
  • Expands allowed_sort to include all valid fields

This directly addresses the past review comments and significantly improves input safety.

Based on past review comments.


257-367: LGTM: Improved readability and typo fix confirmed.

Multi-line formatting makes the resource list easier to maintain. The "RSCB" → "RCSB" typo correction (line 355) from previous review is confirmed.

Based on past review comments.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bee4e9 and 1db38f0.

📒 Files selected for processing (1)
  • R/cellosaurus.R (13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Code-Coverage
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: macos-latest (release)
  • GitHub Check: Test-Docker-Build
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: pkgdown
🔇 Additional comments (7)
R/cellosaurus.R (7)

100-112: LGTM: Well-designed parameter addition.

The include_query parameter is properly added with a sensible default (TRUE) that maintains backward compatibility while enabling the new feature.


143-159: LGTM: Parallel request building is appropriate.

The parallel processing pattern is well-suited for independent HTTP requests. Note that mclapply falls back to sequential execution on Windows, which is acceptable for this use case.


174-195: LGTM: Robust parallel parsing implementation.

The error handling for empty results (lines 180-185) is appropriate, and passing parsed and keep_duplicates as named arguments improves code clarity.


199-219: LGTM: Correct implementation of include_query feature.

The logic properly handles both parsed = TRUE and parsed = FALSE cases:

  • Query columns are removed when include_query = FALSE (lines 199-204)
  • Column reordering places core fields first, then metadata, then query fields (lines 206-219)

The use of length(existing_core) > 0L guards against edge cases where core columns might be absent.


255-286: LGTM: Clean separation of parsed vs. unparsed paths.

The early return for !parsed (line 276-279) clearly separates the two output modes, and the minimal column selection for parsed = TRUE (line 283) aligns with the documented behavior.


418-423: Good practice: Using skip_absent = TRUE.

The skip_absent = TRUE parameter is essential here since Cellosaurus entries may not contain all possible fields. This prevents errors when renaming columns that don't exist in a particular response.


428-496: LGTM: Excellent refactor using strategy pattern.

Consolidating the multiple matching strategies into a single helper function improves maintainability and readability. The sequential fallback approach (trying each strategy until one succeeds) is clear and appropriate for this use case.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1db38f0 and 1979d10.

📒 Files selected for processing (1)
  • R/cellosaurus.R (13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Code-Coverage
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: pkgdown
🔇 Additional comments (6)
R/cellosaurus.R (6)

100-112: LGTM: include_query parameter added correctly.

The new parameter aligns with the PR objectives and defaults to TRUE to maintain backward compatibility.


143-159: Good use of parallel processing with proper fallbacks.

The explicit parameter passing to .build_cellosaurus_request and the mc.cores option fallback chain are well-implemented.


199-221: Well-implemented query column handling and ordering.

The logic correctly implements the PR objective:

  • Removes query columns when include_query = FALSE
  • Reorders columns to place core fields first, followed by metadata, then query columns

The edge case handling (missing columns) is appropriate.


265-272: Excellent addition of error handling.

The tryCatch block around rbindlist provides graceful handling of malformed responses with informative error messages.


378-423: Excellent implementation of descriptive column name mapping.

The explicit mapping from API column IDs (e.g., AC, SY) to user-friendly names (e.g., accession, synonyms) directly addresses the PR objectives. Using skip_absent = TRUE ensures robustness when columns are missing.


428-496: Well-refactored matching logic using strategy pattern.

The extraction of matching strategies into .match_cellosaurus_candidates addresses the past review comment and improves maintainability. The strategy pattern makes the matching logic clear and easy to extend.

@mtran-code mtran-code force-pushed the mtran/cellosaurus-33 branch from 8cf2c40 to 132f307 Compare October 21, 2025 18:51
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
R/cellosaurus.R (1)

537-547: Critical: Remove duplicate function call.

Lines 537-542 and 543-547 are identical calls to .match_cellosaurus_candidates(). This duplication will cause the first call's result to be overwritten and wastes computation. This appears to be a copy-paste or merge conflict error.

Apply this diff to remove the duplicate:

   result <- .match_cellosaurus_candidates(
     responses_dt = responses_dt,
     query = query,
     name = name,
     keep_duplicates = keep_duplicates
   )
-    responses_dt = responses_dt,
-    query = query,
-    name = name,
-    keep_duplicates = keep_duplicates
-  )
-

Note: This issue was flagged in previous review comments and appears to still be present.

🧹 Nitpick comments (4)
R/cellosaurus.R (4)

143-155: Consider adding error handling for parallel request building.

The mclapply call lacks explicit error handling. If .build_cellosaurus_request() fails for any query, the error may propagate silently or cause unexpected behavior.

Consider wrapping the .build_cellosaurus_request() call in a tryCatch to handle errors gracefully:

   requests <- parallel::mclapply(
     queries,
     function(query) {
-      .build_cellosaurus_request(
-        query = query,
-        to = to,
-        numResults = numResults,
-        sort = sort,
-        output = "TXT"
-      )
+      tryCatch(
+        .build_cellosaurus_request(
+          query = query,
+          to = to,
+          numResults = numResults,
+          sort = sort,
+          output = "TXT"
+        ),
+        error = function(e) {
+          .err(paste0("Failed to build request for query: ", query, " - ", e$message))
+        }
+      )
     },
     mc.cores = getOption("annotationgx.mc.cores", getOption("mc.cores", 1L))
   )

174-195: Consider adding error handling for parallel parsing.

Similar to the request-building phase, the parsing mclapply lacks explicit error handling. If .parse_cellosaurus_lines() or .parse_cellosaurus_text() fail, errors may propagate unexpectedly.

Wrap the parsing logic in a tryCatch to handle errors gracefully:

   responses_dt <- parallel::mclapply(
     ids,
     function(name) {
-      resp <- responses[[name]]
-
-      resp <- .parse_cellosaurus_lines(resp)
-      if (length(resp) == 0L) {
-        .warn(paste0("No results found for ", name))
-        result <- data.table::data.table()
-        result$query <- name
-        return(result)
-      }
-      response_dt <- .parse_cellosaurus_text(
-        resp,
-        name,
-        parsed = parsed,
-        keep_duplicates = keep_duplicates
-      )
-      response_dt
+      tryCatch(
+        {
+          resp <- responses[[name]]
+          resp <- .parse_cellosaurus_lines(resp)
+          if (length(resp) == 0L) {
+            .warn(paste0("No results found for ", name))
+            result <- data.table::data.table()
+            result$query <- name
+            return(result)
+          }
+          response_dt <- .parse_cellosaurus_text(
+            resp,
+            name,
+            parsed = parsed,
+            keep_duplicates = keep_duplicates
+          )
+          response_dt
+        },
+        error = function(e) {
+          .warn(paste0("Failed to parse response for ", name, ": ", e$message))
+          result <- data.table::data.table()
+          result$query <- name
+          result
+        }
+      )
     },
     mc.cores = getOption("annotationgx.mc.cores", getOption("mc.cores", 1L))
   )

206-219: Column reordering logic is correct.

The logic correctly reorders columns to place core fields first, followed by extra fields, and then query fields (when included). The implementation handles edge cases appropriately.

For slightly improved readability, consider extracting the column collection into a helper:

.get_column_order <- function(dt_names, include_query) {
  core_cols <- c("cellLineName", "accession")
  existing_core <- core_cols[core_cols %in% dt_names]
  
  query_cols <- if (include_query) {
    grep("^query", dt_names, value = TRUE)
  } else {
    character(0)
  }
  
  extra_cols <- setdiff(dt_names, c(existing_core, query_cols))
  c(existing_core, extra_cols, query_cols)
}

428-497: Excellent refactoring - consolidates matching logic.

The new .match_cellosaurus_candidates() function effectively consolidates multiple matching strategies using the strategy pattern. The implementation is clean, maintainable, and allows for easy addition of new matching strategies.

Consider adding brief inline comments documenting each strategy's purpose for future maintainers:

strategies <- list(
  # Strategy 1: Exact match on cellLineName
  function() { ... },
  # Strategy 2: Nested match using original query
  function() { ... },
  # Strategy 3: Nested match using cleaned name
  function() { ... },
  # Strategy 4: Match on cleaned cellLineName
  function() { ... },
  # Strategy 5: Match on cleaned synonyms
  function() { ... }
)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1979d10 and 8cf2c40.

📒 Files selected for processing (1)
  • R/cellosaurus.R (13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: ubuntu-latest (devel)
  • GitHub Check: macos-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: Code-Coverage
  • GitHub Check: pkgdown
🔇 Additional comments (6)
R/cellosaurus.R (6)

157-159: LGTM!

The early return for query_only = TRUE correctly avoids unnecessary network requests and returns the URLs as expected.


199-204: LGTM!

The logic to remove query columns when include_query = FALSE is correct and clear. The length check at line 201 is technically redundant (data.table handles NULL assignment to empty vectors gracefully), but it makes the intent explicit and prevents unnecessary operations.


317-425: LGTM! Robust handling of optional and nested fields.

The updated .processEntry() function now includes explicit NULL checks for optional, nested, and special keys, using NA_character_ as appropriate. The use of skip_absent = TRUE in column renaming makes the code resilient to schema variations. This is excellent defensive programming.


255-286: LGTM! Improved error handling and cleaner flow.

The updated function signature with explicit named parameters improves clarity. The tryCatch around rbindlist (lines 265-272) provides robust error handling, and the early return for parsed = FALSE (line 276-278) improves efficiency.


87-111: LGTM! Well-designed parameter addition.

The new include_query parameter is well-documented and defaults to TRUE for backward compatibility. The parameter placement after parsed is logical and maintains good API ergonomics.


19-47: Verify hardcoded common fields list is synchronized with current Cellosaurus API schema.

The hardcoded field list (22 fields) used when common = TRUE lacks automated validation against the API schema. Test coverage exists but only validates the hardcoded behavior, not schema synchronization. No tests compare the hardcoded list against schema$components$schemas$Fields$enum when common = FALSE.

Verify the list matches the current schema at https://api.cellosaurus.org/openapi.json, or add a test that validates equivalence between the two paths.

@mtran-code mtran-code merged commit e1086fe into main Oct 21, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature: mapCellosaurus functions

2 participants