Open
Labels: enhancement (New feature or request)
Description
Summary
Add a new exported function is_formula_name() that detects whether a string is entirely a molecular formula (e.g., "H2O", "NaCl", "C2H5OH"). This is distinct from extract_formulas(), which extracts formulas embedded inside parentheses/brackets within larger text.
Motivation
In uncurated chemical datasets, some entries have molecular formulas as their chemical name (e.g., "C9H20" instead of "nonane"). These need to be flagged so curators can resolve them to proper names. extract_formulas() doesn't handle this case — it only finds formulas inside () or [] by design.
Real-world data from ~12,000 uncurated chemical records contains entries like:
- C9H20 (CAS: 111-84-2)
- C10H22 (CAS: 124-18-5)
- NaCl (CAS: 7647-14-5)
- CaCl2 (CAS: 10043-52-3)
Proposed API
#' Test whether a string is a bare molecular formula
#'
#' Uses the periodic table to validate that a string consists only of
#' element symbols and stoichiometric numbers. Does not match formulas
#' embedded in larger text — use extract_formulas() for that.
#'
#' @param x Character vector of strings to test
#' @return Logical vector: TRUE if the string is a bare formula, FALSE otherwise, NA for NA input
#' @export
is_formula_name <- function(x) { ... }

Implementation Approach
- Load element symbols from the internal periodic table data (already available in the package, used by `extract_formulas()`)
- Build regex: `^((Element)\d*)+$`, where `Element` is the alternation of all symbols, ordered longest-first to avoid partial matches (`Na` before `N`). The optional digits sit inside the repeated group so that each element can carry its own count.
- Also allow: hydrate notation (`·`, `.`), charge notation (`+`, `-`), and grouped substructures with parentheses like `Ca(OH)2`
- Reject: strings shorter than 2 characters, strings with spaces, strings that are just numbers
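The approach above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the hard-coded `element_symbols` vector stands in for the internal periodic table data, `build_formula_regex()` is a hypothetical helper name, and the handling of hydrates (a separator followed by an optional multiplier) and charges (a single trailing sign) is one possible reading of the "also allow" rules. The digit quantifier is placed inside the repeated group so that multi-element formulas match.

```r
# Stand-in for the package's internal periodic table data (assumption:
# the real function would read the symbols from that data instead).
element_symbols <- c(
  "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne", "Na", "Mg", "Al",
  "Si", "P", "S", "Cl", "Ar", "K", "Ca", "Sc", "Ti", "V", "Cr", "Mn", "Fe",
  "Co", "Ni", "Cu", "Zn", "Ga", "Ge", "As", "Se", "Br", "Kr", "Rb", "Sr",
  "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd", "In", "Sn",
  "Sb", "Te", "I", "Xe", "Cs", "Ba", "La", "Ce", "Pr", "Nd", "Pm", "Sm",
  "Eu", "Gd", "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu", "Hf", "Ta", "W",
  "Re", "Os", "Ir", "Pt", "Au", "Hg", "Tl", "Pb", "Bi", "Po", "At", "Rn",
  "Fr", "Ra", "Ac", "Th", "Pa", "U", "Np", "Pu", "Am", "Cm", "Bk", "Cf",
  "Es", "Fm", "Md", "No", "Lr", "Rf", "Db", "Sg", "Bh", "Hs", "Mt", "Ds",
  "Rg", "Cn", "Nh", "Fl", "Mc", "Lv", "Ts", "Og"
)

# Hypothetical helper: assemble the anchored pattern from the symbols.
build_formula_regex <- function(symbols) {
  # Longest symbols first, so "Na" is tried before "N".
  symbols <- symbols[order(-nchar(symbols), symbols)]
  element <- paste0("(?:", paste(symbols, collapse = "|"), ")")
  unit    <- paste0(element, "\\d*")                   # e.g. "H2", "Cl"
  group   <- paste0("\\((?:", unit, ")+\\)\\d*")       # e.g. "(OH)2"
  part    <- paste0("(?:", unit, "|", group, ")+")     # e.g. "Ca(OH)2"
  hydrate <- paste0("(?:[\u00b7.]\\d*", part, ")*")    # e.g. ".5H2O"
  paste0("^", part, hydrate, "[+-]?$")                 # optional charge sign
}

is_formula_name <- function(x) {
  x <- as.character(x)
  pattern <- build_formula_regex(element_symbols)
  out <- grepl(pattern, x, perl = TRUE)
  out[!is.na(x) & nchar(x) < 2] <- FALSE  # reject single-character strings
  out[is.na(x)] <- NA                     # NA passthrough
  out
}

is_formula_name(c("H2O", "Ca(OH)2", "water", NA))
# TRUE TRUE FALSE NA
```

Spaces, digit-only strings, and the empty string are rejected by the anchored pattern itself, since every match must start with an element symbol and may contain only the listed characters.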
Tests
| Input | Expected | Reason |
|---|---|---|
| `"H2O"` | TRUE | Simple formula |
| `"NaCl"` | TRUE | Ionic compound, no digits needed |
| `"C2H5OH"` | TRUE | Organic formula |
| `"CuSO4"` | TRUE | Inorganic formula |
| `"Ca(OH)2"` | TRUE | Grouped substructure |
| `"CO"` | TRUE | Carbon monoxide |
| `"water"` | FALSE | English word |
| `"Acetone"` | FALSE | Has lowercase letters not matching elements |
| `"H2O and more"` | FALSE | Contains spaces/extra text |
| `"Iron"` | FALSE | Element name, not symbol |
| `NA` | NA | NA passthrough |
| `""` | FALSE | Empty string |
| `"123"` | FALSE | Just digits |
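The table above can be turned into a small, implementation-agnostic check. The sketch below uses base R only; in the package these cases would presumably live in the regular test suite instead. `formula_cases`, `check_formula_cases()`, and `naive_candidate()` are hypothetical names introduced for illustration. The checker returns the rows where a candidate function disagrees with the expected column, treating `NA` versus `FALSE` as a mismatch.

```r
# Expected results copied from the test table above.
formula_cases <- data.frame(
  input = c("H2O", "NaCl", "C2H5OH", "CuSO4", "Ca(OH)2", "CO", "water",
            "Acetone", "H2O and more", "Iron", NA, "", "123"),
  expected = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
               FALSE, FALSE, FALSE, NA, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Return the rows where `candidate` disagrees with the table.
check_formula_cases <- function(candidate, cases = formula_cases) {
  actual <- candidate(cases$input)
  agrees <- (is.na(actual) & is.na(cases$expected)) |
    (!is.na(actual) & !is.na(cases$expected) & actual == cases$expected)
  cbind(cases, actual = actual)[!agrees, , drop = FALSE]
}

# A deliberately naive candidate that ignores the periodic table:
# anything starting with a capital letter counts as a formula.
naive_candidate <- function(x) {
  ifelse(is.na(x), NA, grepl("^[A-Z][A-Za-z0-9()]*$", x))
}

check_formula_cases(naive_candidate)$input
# "Acetone" "Iron"
```

The two failures show why validation against actual element symbols is needed: a purely character-class pattern accepts capitalized English words.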
Edge Cases to Document
- `"CO"` vs `"Co"`: `CO` = carbon + oxygen (formula); `Co` = cobalt (element symbol). Both are technically valid formulas. This is acceptable.
- Single-element symbols like `"S"`, `"P"`, `"I"` are technically valid formulas but may cause false positives. Consider requiring `nchar(x) >= 2`.
- `"DEHP"`, `"PFOA"`: abbreviations that happen to contain element symbols. These should NOT match because they also contain letters that are not element symbols (e.g., `D` and `E` in `DEHP`, `A` in `PFOA`), assuming the periodic table data does not list deuterium as `D`.
Context
Source: PRE_POST_CURATION_PLAN.md sections 12.2 and 13