Status: Open
Labels: enhancement (New feature or request)
Description
Summary
clean_unicode() has 100+ Unicode-to-ASCII mappings via the internal unicode_map, but it hasn't been tested against a real uncurated chemical dataset. The function's check_unhandled() mechanism warns about unmapped characters — we need to run it against production data to identify any gaps.
Motivation
PRE_POST_CURATION_PLAN.md identifies three Unicode-related issues found in real chemical name data (two characters and one spacing artifact):
| Character | Unicode | Example | Expected Mapping |
|---|---|---|---|
| µ (micro sign) | U+00B5 | "Palygorskite fibers (> 5µm in length)" | u or micro |
| ® (registered trademark) | U+00AE | "TRIM® VX" | Strip or (R) |
| Double spaces after ® | n/a | "Vertasil® Trisiloxanyl-cannabidiol" | Collapse to single space |
Action Items
- Test clean_unicode() against the 172-row test dataset (chemical_validation_test.csv), specifically the 3 Unicode test records
- If available, test against the full 12,144-row uncurated dataset (uncurated_chemicals_2023-05-16_12-43-41.csv)
- Check whether µ (U+00B5) is in the current unicode_map; if not, add it
- Check whether ® (U+00AE) is in the current unicode_map; if not, add it
- Run check_unhandled() output to identify any additional unmapped characters
- Add any missing mappings to the internal unicode_map dataset
- Verify double-space collapsing happens (may need post-processing step)
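The coverage check in the action items can be sketched roughly as follows. This is an illustration in Python, not the package's code: `UNICODE_MAP`, `find_unhandled()`, and `describe()` are hypothetical stand-ins for the internal unicode_map and for whatever check_unhandled() reports, and the three map entries are placeholders.

```python
import unicodedata
from collections import Counter

# Hypothetical stand-in for the internal unicode_map; the real table has
# 100+ entries and may use different replacements.
UNICODE_MAP = {
    "\u03b1": "alpha",
    "\u03b2": "beta",
    "\u2013": "-",
}

def find_unhandled(names, mapping=UNICODE_MAP):
    """Count non-ASCII characters in `names` that have no entry in `mapping`."""
    counts = Counter()
    for name in names:
        for ch in name:
            if ord(ch) > 127 and ch not in mapping:
                counts[ch] += 1
    return counts

def describe(counts):
    """Return (code point, Unicode name, occurrences), most frequent first."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]

# Example names taken from the Motivation table, plus one mapped character.
sample = [
    "Palygorskite fibers (> 5\u00b5m in length)",
    "TRIM\u00ae VX",
    "Vertasil\u00ae Trisiloxanyl-cannabidiol",
    "\u03b1-Pinene",
]
report = describe(find_unhandled(sample))
for row in report:
    print(row)
```

For the CSV files, the names could be read with `csv.DictReader` and passed to the same function; the column holding the chemical names is not given in this issue, so it is left out here.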
Tests
After any additions:
- clean_unicode("5µm") produces ASCII output (e.g., "5um" or "5microm")
- clean_unicode("TRIM® VX") produces ASCII output (e.g., "TRIM(R) VX" or "TRIM VX")
- clean_unicode("Vertasil® Trisiloxanyl-cannabidiol") produces clean ASCII with no double spaces
- Existing clean_unicode() tests still pass
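As a rough illustration of what these checks assert, here is a minimal Python sketch. `clean_unicode_sketch()` and `EXTRA_MAP` are hypothetical and are not the package's clean_unicode(); the replacements ("u" for µ, stripping ®) are just one of the options listed under Motivation, and the whitespace collapse is the assumed post-processing step.

```python
import re

# Assumed replacements, picked from the options listed under Motivation.
EXTRA_MAP = {
    "\u00b5": "u",  # micro sign
    "\u00ae": "",   # registered trademark, stripped here; "(R)" is the alternative
}

def clean_unicode_sketch(text, mapping=EXTRA_MAP):
    """Replace mapped characters, then collapse whitespace runs to one space."""
    replaced = "".join(mapping.get(ch, ch) for ch in text)
    return re.sub(r"\s+", " ", replaced).strip()

assert clean_unicode_sketch("5\u00b5m") == "5um"
assert clean_unicode_sketch("TRIM\u00ae VX") == "TRIM VX"

# Two spaces after the trademark sign, as in the double-space case.
out = clean_unicode_sketch("Vertasil\u00ae  Trisiloxanyl-cannabidiol")
assert out == "Vertasil Trisiloxanyl-cannabidiol"
assert out.isascii() and "  " not in out
```

Collapsing whitespace after substitution, rather than before, also catches double spaces that appear only because a character was stripped.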
Context
clean_unicode() is already far more comprehensive than the Python equivalent (100+ vs 8 mappings). This issue is about verifying coverage against real data rather than a major rewrite.
Source: PRE_POST_CURATION_PLAN.md section 12.3