Labels
enhancement: New feature or request
Description
Summary
extract_mixture() currently only detects ratio-based mixture patterns (e.g., 60:40, 3/1 w/w). It misses keyword-based mixtures like "mixture of acids", "blend of oils", or "proprietary blend". In real-world uncurated chemical data (~12,000 rows), approximately 200 rows contain keyword-only mixture indicators that ratio detection misses.
Current Behavior
extract_mixture("Ethanol, water (1:1)")
#> TRUE (ratio detected)
extract_mixture("mixture of acids")
#> FALSE (keyword not detected)
extract_mixture("proprietary blend")
#> FALSE (keyword not detected)
Proposed Change
Add an optional include_keywords parameter (default FALSE for backward compatibility):
extract_mixture <- function(name_vector, include_keywords = FALSE) {
  # ... existing ratio detection ...
  ratio_hit <- stringr::str_detect(name_vector, pattern)
  if (include_keywords) {
    # Backslashes must be doubled in an R string ("\b" alone is a backspace).
    # "compounded" only: "compound(?:ed)?" would also match "organic compound".
    kw <- "\\b(mixture|blend|combination|formulation|compounded|composition)\\b"
    keyword_hit <- stringr::str_detect(name_vector, stringr::regex(kw, ignore_case = TRUE))
    return(ratio_hit | keyword_hit)
  }
  ratio_hit
}
Expected Behavior
| Input | include_keywords = FALSE | include_keywords = TRUE |
|---|---|---|
| "Ethanol, water (1:1)" | TRUE (ratio) | TRUE |
| "mixture of acids" | FALSE | TRUE |
| "proprietary blend" | FALSE | TRUE |
| "sodium chloride" | FALSE | FALSE |
| "compounded rubber" | FALSE | TRUE |
| "organic compound" | FALSE | FALSE (bare "compound" must not match) |
Tests
- Existing ratio-pattern tests still pass with default include_keywords = FALSE
- "Ethanol, water (1:1)" → TRUE (ratio, same as before)
- "mixture of acids" → FALSE with default, TRUE with include_keywords = TRUE
- "sodium chloride" → FALSE in both modes
- "compounded rubber" → TRUE with include_keywords = TRUE
- "organic compound" → FALSE in both modes (bare "compound" must not match)
- "blend of oils" → FALSE with default, TRUE with include_keywords = TRUE
- NA input → NA output
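The checklist above could be expressed as executable checks along these lines. This is only a sketch: `extract_mixture_sketch()` is a hypothetical base-R stand-in, and its ratio pattern is a simplified placeholder, not the package's actual one. It lists "compounded" rather than "compound(?:ed)?" as a keyword, since the latter would also match "organic compound".

```r
# Hypothetical stand-in for extract_mixture(); NOT the package implementation.
extract_mixture_sketch <- function(name_vector, include_keywords = FALSE) {
  # Assumed, simplified ratio pattern (matches "1:1", "60:40", "3/1").
  ratio_pattern <- "\\d+\\s*[:/]\\s*\\d+"
  hit <- grepl(ratio_pattern, name_vector, perl = TRUE)
  if (include_keywords) {
    kw <- "\\b(mixture|blend|combination|formulation|compounded|composition)\\b"
    hit <- hit | grepl(kw, name_vector, ignore.case = TRUE, perl = TRUE)
  }
  # grepl() returns FALSE for NA input, so restore NA explicitly.
  hit[is.na(name_vector)] <- NA
  hit
}

# Default mode: only ratio patterns are detected.
stopifnot(
  isTRUE(extract_mixture_sketch("Ethanol, water (1:1)")),
  isFALSE(extract_mixture_sketch("mixture of acids")),
  isFALSE(extract_mixture_sketch("blend of oils")),
  isFALSE(extract_mixture_sketch("sodium chloride")),
  isFALSE(extract_mixture_sketch("organic compound"))
)

# Keyword mode: ratio hits are kept and keyword hits are added.
stopifnot(
  isTRUE(extract_mixture_sketch("Ethanol, water (1:1)", include_keywords = TRUE)),
  isTRUE(extract_mixture_sketch("mixture of acids", include_keywords = TRUE)),
  isTRUE(extract_mixture_sketch("blend of oils", include_keywords = TRUE)),
  isTRUE(extract_mixture_sketch("compounded rubber", include_keywords = TRUE)),
  isFALSE(extract_mixture_sketch("sodium chloride", include_keywords = TRUE)),
  isFALSE(extract_mixture_sketch("organic compound", include_keywords = TRUE))
)

# NA input yields NA output in both modes.
stopifnot(
  is.na(extract_mixture_sketch(NA_character_)),
  is.na(extract_mixture_sketch(NA_character_, include_keywords = TRUE))
)
```

In the package itself these checks would presumably go into the existing test suite and call `extract_mixture()` directly.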
Context
This enhancement supports a downstream chemical name pre-curation pipeline (ChemReg) that needs to flag mixture-type entries before CompTox curation. The keyword list is intentionally conservative — downstream applications can extend it.
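One way a downstream application might extend the keyword list is sketched below. `build_keyword_pattern()` and the extra keywords are hypothetical illustrations, not part of the package or of ChemReg, and the helper assumes the extras are plain words without regex metacharacters.

```r
# Hypothetical helper: combine the issue's base keywords with caller-supplied
# plain-word keywords into a single word-bounded alternation.
build_keyword_pattern <- function(extra = character(0)) {
  # "compounded" only, so that a bare "compound" does not match.
  base <- c("mixture", "blend", "combination", "formulation",
            "compounded", "composition")
  words <- unique(c(base, tolower(extra)))
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

# Example: a downstream pipeline adds two keywords of its own.
kw <- build_keyword_pattern(extra = c("admixture", "premix"))
result <- grepl(kw, c("premix of resins", "sodium chloride"),
                ignore.case = TRUE, perl = TRUE)
```

Keeping the base list inside the helper leaves the conservative default in one place, while each caller decides which extra terms it needs.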
Source: PRE_POST_CURATION_PLAN.md section 12.1