-
Notifications
You must be signed in to change notification settings - Fork 1
feat: enable parallel queries for unichem #68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe PR refactors Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant queryUnichemCompound
participant Validator
participant RequestBuilder
participant ParallelExec
participant API
participant ResponseParser
rect rgb(200, 230, 255)
Note over User,ResponseParser: Single Compound (Existing Path)
User->>queryUnichemCompound: compound="CHEMBL123"
queryUnichemCompound->>Validator: validate sourceID
Validator-->>queryUnichemCompound: ✓
queryUnichemCompound->>RequestBuilder: build request
RequestBuilder-->>queryUnichemCompound: request object
queryUnichemCompound->>API: execute query
API-->>queryUnichemCompound: response
queryUnichemCompound->>ResponseParser: parse response
ResponseParser-->>queryUnichemCompound: data frame
queryUnichemCompound-->>User: return result
end
rect rgb(230, 255, 230)
Note over User,ResponseParser: Multiple Compounds (New Parallel Path)
User->>queryUnichemCompound: compound=c("CHEMBL123", "CHEMBL456"), progress="Querying..."
queryUnichemCompound->>Validator: validate per-compound sourceID
Validator-->>queryUnichemCompound: sourceID vector recycled/validated
queryUnichemCompound->>RequestBuilder: build requests (per compound)
RequestBuilder-->>queryUnichemCompound: request list
queryUnichemCompound->>ParallelExec: .perform_request_parallel(progress="Querying...")
par Parallel Execution
ParallelExec->>API: compound 1 query
ParallelExec->>API: compound 2 query
and
API-->>ParallelExec: response 1
API-->>ParallelExec: response 2
end
ParallelExec->>ResponseParser: parse all responses
ResponseParser-->>ParallelExec: list of data frames
ParallelExec-->>queryUnichemCompound: named list {CHEMBL123=..., CHEMBL456=...}
queryUnichemCompound-->>User: return named list
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes The refactor introduces a parallel execution framework with per-compound validation logic, modularizes request/response handling, and significantly expands API surface (batch queries, progress feedback). While changes are focused within UniChem-related files, the density of new logic (compound batching, sourceID recycling, parallel dispatch) and interconnected updates across function signatures, documentation, and test expansion warrant careful review across logic pathways and edge cases. Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #68 +/- ##
==========================================
+ Coverage 77.76% 78.20% +0.43%
==========================================
Files 20 20
Lines 1309 1413 +104
==========================================
+ Hits 1018 1105 +87
- Misses 291 308 +17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (6)
man/queryUnichemCompound.Rd (2)
18-20: Clarify vector/list handling and “raw” return semantics.
- Specify whether nested lists are flattened or rejected; current code flattens, which can affect naming. Consider documenting “list of scalars (no nesting)” or that names come from the flattened vector.
- “raw = TRUE” returns the parsed JSON list, not the HTTP response; say “parsed API JSON” to avoid confusion.
Also applies to: 40-43
32-35: Document progress usage scope.Note that progress is only used when length(compound) > 1 and is passed to .perform_request_parallel(). Mention how to disable (progress = FALSE) and that single-query path ignores it.
tests/testthat/test_unichem.R (1)
55-71: Stabilize network tests and boolean style.
- Guard external API tests to avoid CI flakes: add skip_if_offline("www.ebi.ac.uk") and skip_on_cran() at the start of each networked test.
- Prefer TRUE/FALSE over T/F.
test_that("queryUnichemCompound handles vector inputs", { + skip_if_offline(host = "www.ebi.ac.uk") + skip_on_cran() fallback <- c( "444795", "444796" ) @@ results <- queryUnichemCompound( type = "sourceID", compound = fallback, sourceID = 22, - progress = FALSE + progress = FALSE ) @@ test_that("queryUnichemCompound handles non source id lists", { + skip_if_offline(host = "www.ebi.ac.uk") + skip_on_cran() compounds <- c("538323", "538324") @@ }) test_that("queryUnichemCompound returns the expected results 2", { + skip_if_offline(host = "www.ebi.ac.uk") + skip_on_cran() # Test case 1 result1 <- queryUnichemCompound( type = "inchikey", compound = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", - raw = T + raw = TRUE ) @@ result2 <- queryUnichemCompound( type = "inchikey", compound = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", - raw = F + raw = FALSE )Also applies to: 73-82, 85-107
R/unichem.R (3)
142-148: Validate allowed identifier types early.Add an explicit choice assertion to fail fast on invalid types.
- checkmate::assert_string(type) + checkmate::assert_string(type) + checkmate::assert_choice(type, c("uci", "inchi", "inchikey", "sourceID")) checkmate::assert_flag(request_only) checkmate::assert_flag(raw)
190-233: Harden parsing against missing fields; keep shape stable.The API may omit compounds$inchi or sources for some identifiers. Guard for NULL and return empty data.tables/lists with expected names to avoid subsetting or $ access errors. Also, consider returning NA for missing inchi-derived fields.
165-166: Avoid repeated source list fetches within hot paths.Memoize getUnichemSources() (already importing memoise) to reduce repeated API calls when validating source IDs across runs/sessions.
Also applies to: 179-180
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
DESCRIPTION(1 hunks)R/unichem.R(2 hunks)man/queryUnichemCompound.Rd(1 hunks)tests/testthat/test_unichem.R(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: macos-latest (release)
- GitHub Check: Test-Docker-Build
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: Code-Coverage
- GitHub Check: pkgdown
🔇 Additional comments (2)
DESCRIPTION (1)
44-44: RoxygenNote bump looks fine; align CI/tooling.Ensure roxygen2 7.3.3 is used in dev/CI to avoid churn in generated Rd files.
R/unichem.R (1)
31-35: Switch to .parse_resp_json looks good.No concerns here.
R/unichem.R
Outdated
| responses <- .perform_request_parallel(requests, progress = progress) | ||
| names(responses) <- names(requests) | ||
|
|
||
| parsed_responses <- Map(.parse_resp_json, responses) | ||
|
|
||
| } | ||
| if (raw) { | ||
| names(parsed_responses) <- names(responses) | ||
| return(parsed_responses) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Bound parallelism, set timeouts/backoff, and isolate per-request errors.
Ensure .perform_request_parallel:
- caps concurrency (e.g., via a worker pool),
- sets per-request timeouts and retry with backoff for 5xx/429,
- collects per-request errors without failing the whole batch unless requested (e.g., stop_on_error flag).
If already implemented, ignore; otherwise consider adding.
🏁 Script executed:
#!/bin/bash
# Inspect implementation details of .perform_request_parallel and related request defaults.
rg -nC3 --type=r '\.perform_request_parallel|\.(build|perform)_request|httr2::req_timeout|req_retry'Length of output: 11237
Based on my verification of the implementation against httr2 documentation:
Add per-request timeouts and consider retrying all transient 5xx errors.
The current implementation already addresses some concerns:
- Concurrency is capped with a default maximum of 10 concurrent requests
- The
on_error = "continue"setting collects per-request errors without failing the entire batch
However, two gaps remain:
-
Timeouts are missing:
req_timeout()can be applied to set a maximum number of seconds to wait for a request, but is not used in.build_request(). -
Limited retry coverage: By default, retry only covers 429 and 503 responses. Consider extending
.build_request()to also retry other transient 5xx errors (e.g., 500, 502, 504) via theis_transientparameter if your target APIs return them.
🤖 Prompt for AI Agents
In R/unichem.R around lines 247-255, the code returns parsed responses but the
underlying .build_request() lacks per-request timeouts and broader 5xx retry
handling; update .build_request() to attach a req_timeout(timeout_seconds) to
each httr2 request (expose timeout as a parameter or use a sensible default) and
expand the retry policy to include other transient 5xx errors (e.g., 500, 502,
504) by using the is_transient parameter or equivalent so retries cover all
transient server errors in addition to 429/503; ensure these settings are
applied per-request so the parallel runner benefits from timeouts and extended
retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
R/unichem.R (1)
164-175: Validate sourceID by membership, not range.The range check
upper = total_sourcesaccepts non-existent sourceID values that fall within the range. For example, if valid IDs are{1, 2, 5, 10}, the current code would incorrectly accept3,4,6,7,8, or9.Replace the range assertion with membership validation:
- checkmate::assert_integerish(src_ids, any.missing = FALSE) - total_sources <- max(getUnichemSources()$SourceID) + checkmate::assert_integerish(src_ids, any.missing = FALSE) + valid_sources <- getUnichemSources()$SourceID if (length(src_ids) == 1L) { src_ids <- rep(src_ids, length(compounds)) } else if (length(src_ids) != length(compounds)) { stop( "`sourceID` must be length 1 or match the number of compounds ", "when type = 'sourceID'" ) } - checkmate::assert_integerish(src_ids, lower = 1, upper = total_sources) - src_ids + + if (!all(src_ids %in% valid_sources)) { + invalid_ids <- unique(src_ids[!src_ids %in% valid_sources]) + stop( + sprintf( + "Invalid sourceID(s): %s. Use getUnichemSources() to see valid IDs.", + paste(invalid_ids, collapse = ", ") + ) + ) + } + src_ids
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
R/unichem.R(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: macos-latest (release)
- GitHub Check: pkgdown
🔇 Additional comments (3)
R/unichem.R (3)
28-104: LGTM! Formatting improvements with no functional changes.The refactoring improves readability through consistent formatting and alignment while preserving the original logic.
142-156: LGTM! Parameter validation and input flattening are solid.The validation correctly handles the new
progressparameter (flag or string), anduse.names = TRUEpreserves names when flattening lists.
237-244: LGTM! Naming logic and single-query path are well-implemented.The naming strategy correctly validates name availability and length before applying names to requests, with a sensible fallback. The single-query path maintains consistency by reusing the same helper functions.
Also applies to: 270-283
c0f8a3f to
9af9a80
Compare
9af9a80 to
aaa1c43
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
R/unichem.R (1)
9-23: Clarify return docs for all_columns default.Docs imply all columns are always returned. Note the default 2‑column path.
-#' Returns a `data.table` with the following columns: +#' When `all_columns = FALSE` (default), returns a two‑column `data.table` +#' with `Name` and `SourceID`. When `all_columns = TRUE`, returns a +#' `data.table` with the following columns:Also applies to: 25-26
♻️ Duplicate comments (1)
R/unichem.R (1)
260-264: Verify per-request timeouts and broader transient 5xx retries are configured.Ensure
.build_request()(or helpers) set an explicitreq_timeout()and retry on common transient 5xx (500/502/504) in addition to 429/503. If already present elsewhere, ignore.#!/bin/bash # Search for timeout and retry usage across the package. rg -nP --type=r 'req_timeout|req_retry|is_transient|\.(build|perform)_request|\.perform_request_parallel|\.build_unichem_'
🧹 Nitpick comments (8)
tests/testthat/test_unichem.R (4)
5-36: Stabilize networked tests: skip on CRAN/offline to avoid flakiness.These tests hit UniChem. Add standard skips so CI/CRAN/offline runs don’t fail due to network/service variance.
test_that("getUnichemSources returns a data.table with the correct columns", { + skip_on_cran() + skip_if_offline() sources <- getUnichemSources(all_columns = TRUE)test_that("queryUnichemCompound returns the expected results", { + skip_on_cran() + skip_if_offline()test_that("queryUnichemCompound handles vector inputs", { + skip_on_cran() + skip_if_offline()test_that("queryUnichemCompound handles non source id lists", { + skip_on_cran() + skip_if_offline()test_that("queryUnichemCompound returns the expected results 2", { + skip_on_cran() + skip_if_offline()Also applies to: 39-55, 57-73, 75-84, 85-129
57-73: Assert vector-path returns have no per-compound errors.Strengthen to fail if any element is a captured error from the parallel path.
expect_length(results, length(fallback)) expect_named(results, fallback) expect_true(all(vapply(results, is.list, logical(1)))) + # Ensure no per-compound errors were returned + expect_true( + all(!vapply(results, inherits, logical(1), "unichem_error")), + info = "No per-compound errors expected" + )
75-84: Broaden checks for non-sourceID list case.Also assert length and absence of error wrappers for each element.
test_that("queryUnichemCompound handles non source id lists", { compounds <- c("538323", "538324") results <- queryUnichemCompound( compound = compounds, type = "uci", progress = FALSE ) expect_named(results, compounds) + expect_length(results, length(compounds)) + expect_true(all(vapply(results, is.list, logical(1)))) + expect_true(all(!vapply(results, inherits, logical(1), "unichem_error"))) })
87-91: Use TRUE/FALSE instead of T/F in tests.R CMD check flags T/F usage; prefer explicit logicals.
- result1 <- queryUnichemCompound( + result1 <- queryUnichemCompound( type = "inchikey", compound = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", - raw = T + raw = TRUE )- result2 <- queryUnichemCompound( + result2 <- queryUnichemCompound( type = "inchikey", compound = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", - raw = F + raw = FALSE )Also applies to: 105-109
R/unichem.R (4)
79-79: Defensive rename against UniChem schema drift; fail fast with a clear error.Current
setnameswill error or silently misalign if fields change. Guard for missing fields and then rename.- data.table::setnames(sources_dt, old_names, new_names) + missing <- setdiff(old_names, names(sources_dt)) + if (length(missing)) { + .err( + funContext, + sprintf( + "UniChem 'sources' schema changed; missing fields: %s", + paste(missing, collapse = ", ") + ) + ) + } + data.table::setnames(sources_dt, old = old_names, new = new_names)
191-198: Coerce compound IDs to character when building requests.Avoids surprises with factors/integers in API calls.
build_request <- function(cmp, src) { .build_unichem_compound_req( type = type, - compound = cmp, + compound = as.character(cmp), sourceID = if (is.na(src)) NULL else src, ... ) }
213-229: Handle compounds with no external mappings (empty sources) without error.If
parsed$compounds$sourcesis empty,setnames/subset will fail. Initialize a typed empty table.- mapped_sources_dt <- .asDT(parsed$compounds$sources) - old_names <- c("compoundId", "shortName", "longName", "id", "url") - new_names <- c( - "compoundID", - "Name", - "NameLong", - "sourceID", - "sourceURL" - ) - data.table::setnames( - mapped_sources_dt, - old = old_names, - new = new_names - ) + old_names <- c("compoundId", "shortName", "longName", "id", "url") + new_names <- c("compoundID", "Name", "NameLong", "sourceID", "sourceURL") + if (is.null(parsed$compounds$sources) || length(parsed$compounds$sources) == 0L) { + mapped_sources_dt <- data.table::as.data.table(list( + compoundID = integer(), + Name = character(), + NameLong = character(), + sourceID = integer(), + sourceURL = character() + )) + } else { + mapped_sources_dt <- .asDT(parsed$compounds$sources) + data.table::setnames(mapped_sources_dt, old = old_names, new = new_names) + }
125-128: Document multi-query error semantics in @return.Make it clear that per‑compound errors are wrapped (class
unichem_error) instead of stopping.-#' compound). If `raw = TRUE`, raw responses are returned instead. +#' compound). If `raw = TRUE`, raw responses are returned instead. +#' For multiple queries, per‑compound failures are returned as elements +#' of class `"unichem_error"` with an `error` message, without stopping.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
R/unichem.R(2 hunks)man/queryUnichemCompound.Rd(1 hunks)tests/testthat/test_unichem.R(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- man/queryUnichemCompound.Rd
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: pkgdown
- GitHub Check: macos-latest (release)
- GitHub Check: Code-Coverage
- GitHub Check: ubuntu-latest (release)
- GitHub Check: ubuntu-latest (devel)
- GitHub Check: ubuntu-latest (oldrel-1)
| checkmate::assert_string(type) | ||
| checkmate::assert_flag(request_only) | ||
| checkmate::assert_flag(raw) | ||
| checkmate::assert( | ||
| checkmate::check_flag(progress), | ||
| checkmate::check_string(progress, min.chars = 1) | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate type against allowed choices early.
Prevents ambiguous errors downstream.
checkmate::assert_string(type)
+ checkmate::assert_choice(type, c("uci", "inchi", "inchikey", "sourceID"))🤖 Prompt for AI Agents
In R/unichem.R around lines 142 to 149, the parameter `type` is asserted as a
string but not checked against the allowed set; add an explicit choice
validation early (e.g. define an `allowed_types` vector and call
checkmate::assert_choice(type, choices = allowed_types)) so invalid values fail
fast with a clear message; place this check immediately after the existing
checkmate assertions for `type` and before any downstream logic that depends on
specific `type` values.
Closes #50
Summary by CodeRabbit
Release Notes
New Features
Enhancements