New function: is_formula_name() — detect bare molecular formula strings #124

@seanthimons

Summary

Add a new exported function is_formula_name() that detects whether a string is entirely a molecular formula (e.g., "H2O", "NaCl", "C2H5OH"). This is distinct from extract_formulas(), which extracts formulas embedded inside parentheses/brackets within larger text.

Motivation

In uncurated chemical datasets, some entries have molecular formulas as their chemical name (e.g., "C9H20" instead of "nonane"). These need to be flagged so curators can resolve them to proper names. extract_formulas() doesn't handle this case — it only finds formulas inside () or [] by design.

Real-world data from ~12,000 uncurated chemical records contains entries like:

  • C9H20 (CAS: 111-84-2)
  • C10H22 (CAS: 124-18-5)
  • NaCl (CAS: 7647-14-5)
  • CaCl2 (CAS: 10043-52-4)

Proposed API

#' Test whether a string is a bare molecular formula
#'
#' Uses the periodic table to validate that a string consists only of
#' element symbols and stoichiometric numbers. Does not match formulas
#' embedded in larger text — use extract_formulas() for that.
#'
#' @param x Character vector of strings to test
#' @return Logical vector: TRUE if the string is a bare formula, FALSE otherwise, NA for NA input
#' @export
is_formula_name <- function(x) { ... }

Implementation Approach

  1. Load element symbols from the internal periodic table data (already available in package, used by extract_formulas())
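  2. Build regex: ^((Element)\d*)+$ where Element is the alternation of all symbols, ordered longest-first to avoid partial matches (Na before N). The repetition wraps the symbol and its optional count together, so a formula is one or more symbol-plus-count units.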
  3. Also allow: hydrate notation (·, .), charge notation (+, -), grouped substructures with parentheses like Ca(OH)2
  4. Reject: strings shorter than 2 chars, strings with spaces, strings that are just numbers
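
The steps above could be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the hard-coded symbol vector stands in for the internal periodic table data (whose object name is not given here), and `is_formula_name_sketch()`, `formula_pattern`, and the `min_chars` argument are hypothetical names. It covers plain formulas and one level of parenthesised groups such as Ca(OH)2; hydrate and charge notation from step 3 are deliberately left out.

```r
# Assumption: the 118 standard element symbols stand in for the package's
# internal periodic table data.
element_symbols <- strsplit(paste(
  "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni",
  "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I",
  "Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt",
  "Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr",
  "Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og"
), " ", fixed = TRUE)[[1]]

# Order two-letter symbols before one-letter symbols (Na before N).
ordered_symbols <- element_symbols[order(-nchar(element_symbols))]
element_alt <- paste(ordered_symbols, collapse = "|")

# One unit = an element symbol followed by an optional count, e.g. "H2" or "O".
unit <- paste0("(?:", element_alt, ")\\d*")

# One group = parenthesised units followed by an optional count, e.g. "(OH)2".
group <- paste0("\\((?:", unit, ")+\\)\\d*")

# The whole string must consist of one or more units or groups.
formula_pattern <- paste0("^(?:", unit, "|", group, ")+$")

# Hypothetical sketch of the proposed function.
is_formula_name_sketch <- function(x, min_chars = 2L) {
  x <- as.character(x)
  out <- grepl(formula_pattern, x, perl = TRUE) & nchar(x) >= min_chars
  out[is.na(x)] <- NA  # NA passthrough; grepl() alone would return FALSE
  out
}

is_formula_name_sketch(c("H2O", "Ca(OH)2", "Acetone", "S", NA))
# TRUE TRUE FALSE FALSE NA
```

Spaces and digit-only strings are rejected by the pattern itself, so only the minimum-length rule needs a separate check.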

Tests

| Input | Expected | Reason |
|---|---|---|
| "H2O" | TRUE | Simple formula |
| "NaCl" | TRUE | Ionic compound, no digits needed |
| "C2H5OH" | TRUE | Organic formula |
| "CuSO4" | TRUE | Inorganic formula |
| "Ca(OH)2" | TRUE | Grouped substructure |
| "CO" | TRUE | Carbon monoxide |
| "water" | FALSE | English word |
| "Acetone" | FALSE | Has lowercase letters not matching elements |
| "H2O and more" | FALSE | Contains spaces/extra text |
| "Iron" | FALSE | Element name, not symbol |
| NA | NA | NA passthrough |
| "" | FALSE | Empty string |
| "123" | FALSE | Just digits |

Edge Cases to Document

  • "CO" vs "Co": CO = carbon + oxygen (formula); Co = cobalt (element symbol). Both are technically valid formulas. This is acceptable.
  • Single-element symbols like "S", "P", "I" are technically valid formulas but may cause false positives. Consider requiring nchar(x) >= 2.
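  • "DEHP", "PFOA": abbreviations that happen to contain element symbols. These should NOT match because they include uppercase letters that are not element symbols ("D" and "E" in DEHP, "A" in PFOA), assuming the periodic table data lists only the standard symbols (no "D" for deuterium). Note that an abbreviation made up entirely of valid symbols, such as "PFOS" (P, F, O, S), would still match.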
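
To make these edge cases easier to reason about, here is a small exploratory sketch that splits a string into the element symbols it contains and reports whatever characters are left over. `explain_symbols()` is a hypothetical helper, not part of the package, and the symbol vector is again an assumed stand-in for the internal periodic table data.

```r
# Assumption: the 118 standard element symbols stand in for the package's
# internal periodic table data.
symbols <- strsplit(paste(
  "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni",
  "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I",
  "Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt",
  "Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr",
  "Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og"
), " ", fixed = TRUE)[[1]]

# Case-sensitive alternation, two-letter symbols first.
alt <- paste(symbols[order(-nchar(symbols))], collapse = "|")

# Hypothetical helper: list the symbols found in a single string and the
# characters that no symbol accounts for.
explain_symbols <- function(x) {
  found <- regmatches(x, gregexpr(alt, x, perl = TRUE))[[1]]
  leftover <- gsub(alt, "", x, perl = TRUE)
  list(symbols = found, leftover = leftover)
}

explain_symbols("CO")$symbols    # "C" "O"  (carbon + oxygen)
explain_symbols("Co")$symbols    # "Co"     (cobalt)
explain_symbols("PFOA")$leftover # "A"      (not an element symbol)
explain_symbols("DEHP")$leftover # "DE"     (not element symbols)
explain_symbols("PFOS")$leftover # ""       (every letter is a valid symbol)
```

An empty `leftover` means the string is built entirely from element symbols, which is why an abbreviation like "PFOS" cannot be told apart from a formula by symbol matching alone.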

Context

Source: PRE_POST_CURATION_PLAN.md sections 12.2 and 13


Labels: enhancement
