New function: extract_hazard_warnings() — extract hazard parentheticals from chemical names #125

@seanthimons

Summary

Add a new exported function extract_hazard_warnings() that finds and extracts parenthetical phrases containing hazard classification language (carcinogen, mutagen, teratogen, irritant, etc.) from chemical name strings.

Motivation

Uncurated chemical datasets frequently contain hazard warnings embedded in chemical name fields, e.g.:

  • "ethylene oxide (suspected 2a human carcinogen by iarc)"
  • "lampblack (suspected human carcinogen by ACGIH)"
  • "dichloromethane (methylene chloride) (suspected human carcinogen by ACGIH, NTP)"
  • "Fragrance (Irritating to eyes)"

These warnings break downstream operations (synonym splitting, CompTox curation lookups) and need to be extracted before cleaning. The function fits ComptoxR thematically, since it deals with chemical safety classification.

In real-world data (~12,000 uncurated chemical records), approximately 35 rows contain hazard warnings embedded in names.

Proposed API

#' Extract hazard warning parentheticals from chemical names
#'
#' Finds and extracts parenthetical phrases containing hazard classification
#' language (carcinogen, mutagen, teratogen, irritant, hazard, toxic) from
#' chemical name strings.
#'
#' @param name_vector Character vector of chemical names
#' @return A list of character vectors (one per input). Each contains the
#'   warning text found, or character(0) if none. NA inputs yield NA.
#' @export
extract_hazard_warnings <- function(name_vector) { ... }

Implementation Approach

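Pattern (case-insensitive, PCRE syntax; each backslash must be doubled when the pattern is written as an R string literal):

"\(([^()]*(?:carcinogen|mutagen|teratogen|irritant|irritating|hazard|toxic)[^()]*)\)"

Extract every matching parenthetical and return its inner text (without the parentheses). Both "irritant" and "irritating" are listed because the test input "Fragrance (Irritating to eyes)" does not contain the substring "irritant". Using [^()] rather than [^)] keeps a match from starting at an earlier, unrelated opening parenthesis.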
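
The approach above could be sketched in base R roughly as follows. This is an illustrative draft, not the package's implementation: the helper name hazard_pattern is hypothetical, the use of gregexpr()/regmatches() with perl = TRUE is an assumption (stringr would work equally well), and the pattern adds "irritating" and uses [^()] so that the "Irritating to eyes" test case and inputs with several parentheticals behave as described in this issue.

```r
# Hypothetical pattern constant (name not from the issue). Backslashes are
# doubled because this is an R string literal.
hazard_pattern <- paste0(
  "\\(([^()]*(?:carcinogen|mutagen|teratogen|irritant|irritating|",
  "hazard|toxic)[^()]*)\\)"
)

# Sketch of the proposed extractor: one list element per input name.
extract_hazard_warnings <- function(name_vector) {
  lapply(name_vector, function(name) {
    # Pass missing values through as NA, matching the proposed tests.
    if (is.na(name)) return(NA_character_)
    # Find every parenthetical that contains a hazard keyword.
    hits <- regmatches(
      name,
      gregexpr(hazard_pattern, name, perl = TRUE, ignore.case = TRUE)
    )[[1]]
    # Drop the enclosing parentheses from each full match.
    gsub("^\\(|\\)$", "", hits)
  })
}

extract_hazard_warnings(c(
  "Fragrance (Irritating to eyes)",
  "Iron(III) chloride",
  NA
))
```

One open design point: because matching is by substring, a parenthetical such as "(non-toxic)" would also be extracted. Whether that is acceptable probably depends on the data being curated.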

Tests

| Input | Expected Output | Note |
| --- | --- | --- |
| "ethylene oxide (suspected 2a human carcinogen by iarc)" | "suspected 2a human carcinogen by iarc" | |
| "lampblack (suspected human carcinogen by ACGIH)" | "suspected human carcinogen by ACGIH" | |
| "Fragrance (Irritating to eyes)" | "Irritating to eyes" | |
| "compound (toxic to aquatic life)" | "toxic to aquatic life" | |
| "Iron(III) chloride" | character(0) | oxidation state, not a warning |
| "acetone" | character(0) | no parentheticals |
| "dichloromethane (methylene chloride) (suspected human carcinogen by ACGIH, NTP)" | "suspected human carcinogen by ACGIH, NTP" | only the warning, not the synonym parenthetical |
| NA | NA | |

Additional Considerations

  • A companion helper strip_hazard_warnings() that removes the matched parentheticals from the name string (returning the cleaned name) may also be useful, but could live in the downstream ChemReg package instead. Consider whether to include both extract + strip in ComptoxR or just the extractor.
  • The function should NOT match oxidation states like (III), (IV), (2+) — these are chemical notation, not warnings.
  • The function should NOT match synonym parentheticals like (methylene chloride) — only phrases containing hazard keywords.
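
For discussion, the companion strip_hazard_warnings() mentioned in the first bullet might look something like the sketch below. It is only a draft under stated assumptions: hazard_pattern is a hypothetical constant name, the pattern adds "irritating" and uses [^()] (both small departures from the pattern as originally written), and collapsing leftover whitespace is a design choice this issue does not specify.

```r
# Hypothetical pattern constant (name not from the issue).
hazard_pattern <- paste0(
  "\\(([^()]*(?:carcinogen|mutagen|teratogen|irritant|irritating|",
  "hazard|toxic)[^()]*)\\)"
)

# Sketch of the companion helper: remove hazard parentheticals and
# return the cleaned names as a character vector.
strip_hazard_warnings <- function(name_vector) {
  stripped <- gsub(
    hazard_pattern, "", name_vector,
    perl = TRUE, ignore.case = TRUE
  )
  # Collapse the whitespace left behind and trim both ends.
  trimws(gsub("[[:space:]]+", " ", stripped))
}

strip_hazard_warnings(c(
  "dichloromethane (methylene chloride) (suspected human carcinogen by ACGIH, NTP)",
  "Iron(III) chloride"
))
```

Oxidation states such as (III) and synonym parentheticals such as (methylene chloride) contain no hazard keyword, so this sketch leaves them in place.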

Context

Source: PRE_POST_CURATION_PLAN.md section 14

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
