Enhance extract_mixture() with keyword-based mixture detection #123

@seanthimons

Summary

extract_mixture() currently only detects ratio-based mixture patterns (e.g., 60:40, 3/1 w/w). It misses keyword-based mixtures like "mixture of acids", "blend of oils", or "proprietary blend". In real-world uncurated chemical data (~12,000 rows), approximately 200 rows contain keyword-only mixture indicators that ratio detection misses.

Current Behavior

extract_mixture("Ethanol, water (1:1)")
#> TRUE (ratio detected)

extract_mixture("mixture of acids")
#> FALSE (keyword not detected)

extract_mixture("proprietary blend")
#> FALSE (keyword not detected)

Proposed Change

Add an optional include_keywords parameter (default FALSE for backward compatibility):

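extract_mixture <- function(name_vector, include_keywords = FALSE) {
  # ... existing ratio detection ...
  ratio_hit <- stringr::str_detect(name_vector, pattern)

  if (include_keywords) {
    # Backslashes are doubled so R passes \b (word boundary) to the regex engine.
    # Only "compounded" is listed: a bare "compound" alternative would also flag
    # "organic compound", which is expected to stay FALSE.
    kw <- "\\b(mixture|blend|combination|formulation|compounded|composition)\\b"
    keyword_hit <- stringr::str_detect(name_vector, stringr::regex(kw, ignore_case = TRUE))
    return(ratio_hit | keyword_hit)
  }
  ratio_hit
}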
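
The proposal above leaves out the existing ratio pattern, so it cannot be run on its own. The following is a minimal, self-contained sketch that reproduces the rows listed under Expected Behavior. It makes several assumptions: `ratio_pattern` is a hypothetical stand-in for the package's real pattern, `detect()` and `extract_mixture_sketch()` are illustrative names, and base R `grepl()` is used in place of `stringr::str_detect()` so that no packages are needed.

```r
# Hypothetical stand-in for the package's ratio pattern (elided in the issue):
# matches forms such as "1:1", "60:40", or "3/1".
ratio_pattern <- "\\b\\d+(?:\\.\\d+)?\\s*[:/]\\s*\\d+(?:\\.\\d+)?\\b"

# Keyword pattern. "compounded" is listed without a bare "compound" so that
# "organic compound" is not flagged.
keyword_pattern <- "\\b(mixture|blend|combination|formulation|compounded|composition)\\b"

detect <- function(x, pattern) {
  hit <- grepl(pattern, x, perl = TRUE, ignore.case = TRUE)
  hit[is.na(x)] <- NA  # grepl() gives FALSE for NA; return NA as str_detect() does
  hit
}

extract_mixture_sketch <- function(name_vector, include_keywords = FALSE) {
  ratio_hit <- detect(name_vector, ratio_pattern)
  if (!include_keywords) return(ratio_hit)
  ratio_hit | detect(name_vector, keyword_pattern)
}

examples <- c("Ethanol, water (1:1)", "mixture of acids", "proprietary blend",
              "sodium chloride", "compounded rubber", "organic compound", NA)

# Side-by-side view of both modes for each example input.
data.frame(
  input = examples,
  default = extract_mixture_sketch(examples),
  with_keywords = extract_mixture_sketch(examples, include_keywords = TRUE)
)
```

Because both hits are `NA` for an `NA` input, `NA | NA` stays `NA`, which keeps the "NA input → NA output" expectation in either mode.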

Expected Behavior

| Input | include_keywords = FALSE | include_keywords = TRUE |
|---|---|---|
| "Ethanol, water (1:1)" | TRUE (ratio) | TRUE |
| "mixture of acids" | FALSE | TRUE |
| "proprietary blend" | FALSE | TRUE |
| "sodium chloride" | FALSE | FALSE |
| "compounded rubber" | FALSE | TRUE |
| "organic compound" | FALSE | FALSE (only "compounded" is meant to match, not a bare "compound") |

Tests

  • Existing ratio-pattern tests still pass with default include_keywords = FALSE
  • "Ethanol, water (1:1)" → TRUE (ratio, same as before)
  • "mixture of acids" → FALSE with default, TRUE with include_keywords = TRUE
  • "sodium chloride" → FALSE in both modes
  • "compounded rubber" → TRUE with include_keywords = TRUE
  • "organic compound" → FALSE in both modes (a bare "compound" must not count as a keyword)
  • "blend of oils" → FALSE with default, TRUE with include_keywords = TRUE
  • NA input → NA output
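
The checklist above could be encoded roughly as follows. This is a sketch, not the package's test suite: `check_mixture_contract()` is a hypothetical helper, and it uses base R `stopifnot()` rather than testthat so that it runs without extra packages. It takes the implementation as an argument, so it makes no assumption about how `extract_mixture()` is written beyond its two-argument signature.

```r
# Checks any implementation f(name_vector, include_keywords) against the
# checklist; stops with an error on the first expectation that fails.
check_mixture_contract <- function(f) {
  # Result with the default first, then with include_keywords = TRUE.
  both_modes <- function(x) c(f(x), f(x, include_keywords = TRUE))

  stopifnot(
    identical(both_modes("Ethanol, water (1:1)"), c(TRUE, TRUE)),
    identical(both_modes("mixture of acids"), c(FALSE, TRUE)),
    identical(both_modes("blend of oils"), c(FALSE, TRUE)),
    identical(both_modes("sodium chloride"), c(FALSE, FALSE)),
    identical(both_modes("organic compound"), c(FALSE, FALSE)),
    isTRUE(f("compounded rubber", include_keywords = TRUE)),
    is.na(f(NA_character_)),
    is.na(f(NA_character_, include_keywords = TRUE))
  )
  invisible(TRUE)
}
```

With the package loaded, the call would be `check_mixture_contract(extract_mixture)`. In the package itself the same expectations would more naturally be written as testthat cases next to the existing ratio-pattern tests.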

Context

This enhancement supports a downstream chemical name pre-curation pipeline (ChemReg) that needs to flag mixture-type entries before CompTox curation. The keyword list is intentionally conservative; downstream applications can extend it.
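
The issue does not say how a downstream application would extend the list. One possible approach, sketched under assumptions, is to build the pattern from a character vector. `default_keywords` and `build_keyword_pattern()` are hypothetical names, not part of the proposal, and the extra keywords "admixture" and "premix" are illustrative only. The default list uses "compounded" alone so that "organic compound" stays unflagged.

```r
# Hypothetical default list mirroring the proposed keywords.
default_keywords <- c("mixture", "blend", "combination", "formulation",
                      "compounded", "composition")

build_keyword_pattern <- function(keywords) {
  # Assumption: keywords are plain alphabetic words, so no regex escaping is needed.
  stopifnot(is.character(keywords), all(grepl("^[A-Za-z]+$", keywords)))
  paste0("\\b(?:", paste(keywords, collapse = "|"), ")\\b")
}

# A downstream application appends its own terms to the defaults.
extended <- build_keyword_pattern(c(default_keywords, "admixture", "premix"))

grepl(extended, c("vitamin premix", "sodium chloride", "organic compound"),
      perl = TRUE, ignore.case = TRUE)
```

Keeping the list as data rather than a hard-coded regex string would let callers add terms without editing the function, at the cost of one more argument or option to document.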

Source: PRE_POST_CURATION_PLAN.md section 12.1

Labels: enhancement