Evaluate clean_unicode() against real chemical data for mapping gaps #126

@seanthimons

Summary

clean_unicode() applies 100+ Unicode-to-ASCII mappings from the internal unicode_map, but it has not been tested against a real, uncurated chemical dataset. Its check_unhandled() mechanism warns about unmapped characters, so running it against production data should reveal any gaps.

Motivation

PRE_POST_CURATION_PLAN.md identifies three specific Unicode-related issues found in real chemical name data (two characters and one whitespace side effect):

| Character | Unicode | Example | Expected mapping |
| --- | --- | --- | --- |
| µ (micro sign) | U+00B5 | "Palygorskite fibers (> 5µm in length)" | u or micro |
| ® (registered trademark) | U+00AE | "TRIM® VX" | Strip or (R) |
| Double spaces after ® | | "Vertasil® Trisiloxanyl-cannabidiol" | Collapse to single space |
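
To make the code points above concrete, here is a small standalone Python sketch (standard library only; it does not call clean_unicode() and is not part of the package) that reports each non-ASCII character in the example strings. The helper name `describe` is hypothetical. One thing that may be worth checking in the real data: the micro sign (U+00B5) and the Greek small letter mu (U+03BC) look nearly identical, so a mapping for one does not necessarily cover the other.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return a code point label such as 'U+00B5 MICRO SIGN' for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

# Example strings from the table, written with escapes so the code points are unambiguous.
examples = [
    "Palygorskite fibers (> 5\u00b5m in length)",
    "TRIM\u00ae VX",
]

for text in examples:
    for ch in text:
        if ord(ch) > 127:  # anything outside 7-bit ASCII
            print(f"{text!r}: {describe(ch)}")

# Compatibility normalization folds the micro sign into Greek mu,
# so both code points may need an entry in the mapping.
folded = unicodedata.normalize("NFKC", "\u00b5")
print(describe(folded))
```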

Action Items

  • Test clean_unicode() against the 172-row test dataset (chemical_validation_test.csv) — specifically the 3 Unicode test records
  • If available, test against the full 12,144-row uncurated dataset (uncurated_chemicals_2023-05-16_12-43-41.csv)
  • Check whether µ (U+00B5) is in the current unicode_map — if not, add it
  • Check whether ® (U+00AE) is in the current unicode_map — if not, add it
  • Run check_unhandled() and review its output to identify any additional unmapped characters
  • Add any missing mappings to the internal unicode_map dataset
  • Verify that double spaces are collapsed (may need a post-processing step)
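
As a rough cross-check independent of check_unhandled(), the gap scan described in the action items could be sketched in Python as below. Everything here is an assumption for illustration: `KNOWN` is a tiny hypothetical stand-in for the real unicode_map, `find_unmapped` and `read_names` are made-up helper names, and the CSV column name `chemical_name` is a guess that would need to match the actual file.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, Mapping

# Hypothetical stand-in for the package's internal unicode_map (assumption).
KNOWN: Mapping[str, str] = {"\u2013": "-", "\u2019": "'"}

def find_unmapped(names: Iterable[str], known: Mapping[str, str] = KNOWN) -> Counter:
    """Count non-ASCII characters that have no entry in the mapping."""
    counts: Counter = Counter()
    for name in names:
        for ch in name:
            if ord(ch) > 127 and ch not in known:
                counts[ch] += 1
    return counts

def read_names(path: str, column: str = "chemical_name") -> Iterator[str]:
    """Yield one column from a CSV file; the column name is an assumption."""
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield row[column]

# In-memory demo so the sketch runs without the dataset on disk.
sample = ["Palygorskite fibers (> 5\u00b5m in length)", "TRIM\u00ae VX", "a\u2013b"]
for ch, n in find_unmapped(sample).most_common():
    print(f"U+{ord(ch):04X} {ch!r}: {n} occurrence(s)")
```

Against the real data, `find_unmapped(read_names("chemical_validation_test.csv"))` would give a frequency-ranked list of candidate characters to add, assuming the column name is adjusted to the file's actual header.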

Tests

After any additions:

  • clean_unicode("5µm") produces ASCII output (e.g., "5um" or "5microm")
  • clean_unicode("TRIM® VX") produces ASCII output (e.g., "TRIM(R) VX" or "TRIM VX")
  • clean_unicode("Vertasil® Trisiloxanyl-cannabidiol") produces clean ASCII with no double spaces
  • Existing clean_unicode() tests still pass
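
The expected behaviour in the checks above can be illustrated with a minimal Python stand-in. This is a sketch, not the package's clean_unicode(): `SKETCH_MAP` and `clean_sketch` are hypothetical names, and the choices of "u" for µ and "(R)" for ® are just one of the two options listed for each. The third input is written with two spaces after ®, which is what the double-space case presumably looks like in the raw data.

```python
import re

# Hypothetical two-entry mapping; the real unicode_map is much larger.
SKETCH_MAP = {"\u00b5": "u", "\u00ae": "(R)"}

def clean_sketch(text: str) -> str:
    """Replace mapped characters, then collapse runs of spaces to one."""
    replaced = "".join(SKETCH_MAP.get(ch, ch) for ch in text)
    return re.sub(r" {2,}", " ", replaced)

cases = {
    "5\u00b5m": "5um",
    "TRIM\u00ae VX": "TRIM(R) VX",
    "Vertasil\u00ae  Trisiloxanyl-cannabidiol": "Vertasil(R) Trisiloxanyl-cannabidiol",
}

for raw, expected in cases.items():
    cleaned = clean_sketch(raw)
    assert cleaned == expected, (raw, cleaned)
    assert cleaned.isascii()      # output must be pure ASCII
    assert "  " not in cleaned    # no double spaces remain
```

Collapsing whitespace after the character replacement, rather than before, matters if ® ends up being stripped instead of mapped, since removing a character that sits between two spaces would itself create a double space.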

Context

clean_unicode() is already far more comprehensive than the Python equivalent (100+ vs 8 mappings). This issue is about verifying coverage against real data rather than a major rewrite.

Source: PRE_POST_CURATION_PLAN.md section 12.3

Labels: enhancement