Summary
A comprehensive audit of the ndd_entity table reveals data integrity issues that predate the Phase 76 ontology safeguard work. The same issues appear identically in both the Jan 31 2026 database snapshot and the current database, confirming that recent changes did not introduce them.
Database snapshot used for analysis
.plan/data/202601311251.sysndd_db.sql.gz (Jan 31 2026 dump, 4,471 entities)
Overview
| Metric | Value |
|---|---|
| Total entities | 4,471 |
| Active | 4,203 |
| Inactive | 268 |
| Suffixed disease versions | 188 entities (130 unique base IDs) |

Issues found
| Issue | Severity | Count |
|---|---|---|
| Suffix-gene misalignment (active) | Critical | 13 |
| Suffix-gene misalignment (inactive) | Low | 3 |
| Orphaned replaced_by pointer | High | 1 |
| Broken FK to disease_ontology_set (inactive) | Moderate | 5 |
| Inactive entities without replaced_by or active counterpart | Low | 2 |
| Active entities without status record | Low | 3 |
Structural checks that passed
- Suffix sequence gaps: 0 (all suffix sequences are contiguous)
- Circular replacement chains: 0
- Cross-gene replacements: 0 (all replaced_by links maintain the same gene)
- Multi-hop chains: 3 (all correctly resolve to an active terminal entity)
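The chain checks above can be illustrated with a minimal sketch. It assumes the replaced_by links and activity flags have been loaded into plain dictionaries; the function name `resolve_chain` and the toy data are hypothetical and not part of the SysNDD codebase.

```python
from typing import Dict, List, Optional


def resolve_chain(entity_id: int,
                  replaced_by: Dict[int, Optional[int]]) -> List[int]:
    """Follow replaced_by links from entity_id and return the visited path.

    Raises ValueError on a cycle or on a pointer to a missing entity.
    """
    path = [entity_id]
    seen = {entity_id}
    current = entity_id
    while replaced_by[current] is not None:
        nxt = replaced_by[current]
        if nxt not in replaced_by:
            raise ValueError(f"entity {current} points to missing entity {nxt}")
        if nxt in seen:
            raise ValueError(f"circular chain detected at entity {nxt}")
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


# Toy data (not taken from the database): 1 -> 2 -> 3, where 3 is terminal.
toy_links = {1: 2, 2: 3, 3: None}
toy_active = {1: False, 2: False, 3: True}

chain = resolve_chain(1, toy_links)
hops = len(chain) - 1                      # more than one hop means "multi-hop"
terminal_is_active = toy_active[chain[-1]]
```

A chain passes the check when `resolve_chain` returns without raising and the last entity in the path is active.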
Issue 1 — CRITICAL: 13 Active Entities with Suffix-Gene Misalignment
Root cause
The suffix assignment algorithm in build_omim_from_genemap2() (api/functions/omim-functions.R) assigns _1, _2, etc. by sorting (disease_id, hgnc_id, name, inheritance) and numbering sequentially with cumsum(). When OMIM adds or removes a gene-disease association, all suffix numbers for that MIM can shift. Before Phase 76's auto-fix safeguard, these shifts were applied silently to disease_ontology_set without updating the corresponding ndd_entity.disease_ontology_id_version FK.
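To make the shift concrete, here is a rough Python approximation of the sort-and-number scheme described above. It is not the R implementation: the function name `assign_versions`, the made-up IDs, and the assumption that an ID occurring only once keeps its bare form are all illustrative, and R may order strings differently from Python.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, str]  # (disease_id, hgnc_id, name, inheritance)


def assign_versions(rows: List[Row]) -> Dict[Row, str]:
    """Map each row to a versioned ID such as 'OMIM:100000_2'.

    Assumption: an ID that occurs once keeps its bare form; IDs that occur
    several times get _1, _2, ... in sorted order.
    """
    ordered = sorted(rows)
    totals = Counter(row[0] for row in ordered)
    running: Counter = Counter()
    versions = {}
    for row in ordered:
        running[row[0]] += 1  # per-ID running count, the role cumsum() plays
        if totals[row[0]] == 1:
            versions[row] = row[0]
        else:
            versions[row] = f"{row[0]}_{running[row[0]]}"
    return versions


# Made-up IDs: two genes share one MIM number, then a third gene is added
# whose hgnc_id sorts first.
gene_b = ("OMIM:100000", "HGNC:200", "Disease X", "AD")
gene_c = ("OMIM:100000", "HGNC:300", "Disease X", "AD")
new_gene = ("OMIM:100000", "HGNC:100", "Disease X", "AD")

before = assign_versions([gene_b, gene_c])
after = assign_versions([gene_b, gene_c, new_gene])
```

In this toy run `gene_b` moves from `OMIM:100000_1` to `OMIM:100000_2`, and `OMIM:100000_1` now describes `new_gene`. An entity that stored `OMIM:100000_1` for `gene_b` would then point at another gene's row, which is the misalignment pattern reported here.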
Impact
The entity's curated gene association is correct, but its disease_ontology_id_version FK now points to a disease_ontology_set row describing a different gene's disease. This means:
- Disease name metadata shown in the UI may be wrong for these entities
- Ontology cross-references (MONDO, Orphanet, DOID) are for the wrong gene-disease pairing
- Any downstream analysis using the disease name is affected
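A misalignment of this kind can be detected by joining each entity to its ontology row and comparing genes. The sketch below uses an in-memory SQLite database purely to stay self-contained. The reduced schema is an assumption (in particular, that both tables carry an hgnc_id column), and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease_ontology_set (
    disease_ontology_id_version TEXT PRIMARY KEY,
    hgnc_id TEXT                      -- assumed column; NULL for gene-less terms
);
CREATE TABLE ndd_entity (
    entity_id INTEGER PRIMARY KEY,
    hgnc_id TEXT NOT NULL,            -- assumed column
    disease_ontology_id_version TEXT NOT NULL,
    is_active INTEGER NOT NULL
);
-- Made-up rows: entity 2 was curated for HGNC:300, but its FK now lands
-- on a row that describes HGNC:200.
INSERT INTO disease_ontology_set VALUES
    ('OMIM:100000_1', 'HGNC:100'),
    ('OMIM:100000_2', 'HGNC:200');
INSERT INTO ndd_entity VALUES
    (1, 'HGNC:100', 'OMIM:100000_1', 1),
    (2, 'HGNC:300', 'OMIM:100000_2', 1);
""")

MISALIGNED_SQL = """
SELECT e.entity_id, e.hgnc_id, e.disease_ontology_id_version, o.hgnc_id
FROM ndd_entity AS e
JOIN disease_ontology_set AS o
  ON o.disease_ontology_id_version = e.disease_ontology_id_version
WHERE e.is_active = 1
  AND o.hgnc_id IS NOT NULL
  AND o.hgnc_id <> e.hgnc_id
ORDER BY e.entity_id
"""

# Each result row: (entity_id, curated gene, FK value, gene the ontology row describes)
misaligned = conn.execute(MISALIGNED_SQL).fetchall()
```

Switching the filter to `e.is_active = 0` would list the inactive cases separately.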
Complete entity list
| entity_id | Gene | HGNC | disease_version | Ontology now says gene | Ontology disease name | Inheritance | Category | Source | Created |
|---|---|---|---|---|---|---|---|---|---|
| 59 | ARL6 | HGNC:13210 | OMIM:605231 | MKKS (HGNC:7108) | Bardet-Biedl syndrome 6 | HP:0000007 (AR) | Definitive | SysID | 2010-11-23 |
| 111 | BSCL2 | HGNC:15832 | OMIM:600794 | GARS1 (HGNC:4162) | Neuronopathy, distal hereditary motor, AD 5 | HP:0000006 (AD) | N/A | SysID | 2010-11-23 |
| 393 | MAP2K2 | HGNC:6842 | OMIM:115150 | BRAF (HGNC:1097) | Cardiofaciocutaneous syndrome | HP:0000006 (AD) | Definitive | SysID | 2010-11-23 |
| 444 | MT-TV | HGNC:7500 | OMIM:300438 | HSD17B10 (HGNC:4800) | HSD10 mitochondrial disease | HP:0001427 (mito) | Definitive | SysID | 2010-11-23 |
| 645 | SDHB | HGNC:10681 | OMIM:171300_1 | VHL (HGNC:12687) | Pheochromocytoma | HP:0000006 (AD) | N/A | SysID | 2012-02-20 |
| 655 | SHH | HGNC:10848 | OMIM:269160_1 | SIX3 (HGNC:10889) | Schizencephaly | HP:0000006 (AD) | Limited | SysID | 2010-11-23 |
| 662 | SIX3 | HGNC:10889 | OMIM:269160_2 | EMX2 (HGNC:3341) | Schizencephaly | HP:0000006 (AD) | Limited | SysID | 2010-11-23 |
| 798 | ASCL1 | HGNC:738 | OMIM:209880 | PHOX2B (HGNC:9143) | Central hypoventilation syndrome, congenital, 1 | HP:0000006 (AD) | N/A | SysID | 2013-02-01 |
| 1023 | NDUFA2 | HGNC:7685 | OMIM:618244 | NDUFA12 (HGNC:23987) | Mitochondrial complex I deficiency, nuclear type 23 | HP:0000007 (AR) | Definitive | SysID | 2014-03-04 |
| 1922 | MAP3K20 | HGNC:17797 | OMIM:617308 | ACOX2 (HGNC:120) | Bile acid synthesis defect, congenital, 6 | HP:0000007 (AR) | Limited | SysID | 2016-11-25 |
| 3625 | SDHD | HGNC:10683 | OMIM:171300_2 | TMEM127 (HGNC:26038) | {Pheochromocytoma, susceptibility to} | HP:0000006 (AD) | N/A | SysID | 2021-11-14 |
| 3635 | GOSR2 | HGNC:4431 | OMIM:300438 | HSD17B10 (HGNC:4800) | HSD10 mitochondrial disease | HP:0000007 (AR) | Limited | SysID | 2021-11-14 |
| 4513 | ABCC9 | HGNC:60 | OMIM:613443_2 | MEF2C (HGNC:6996) | NDD with hypotonia, stereotypic hand movements, impaired language | HP:0000006 (AD) | Limited | sysndd | 2025-02-26 |
Fixability analysis
| entity_id | Gene | Fixability | Notes |
|---|---|---|---|
| 59 | ARL6 | Needs curator | Gene has other OMIM entries: 209900_1, 209900_2, 600151, 613575 |
| 111 | BSCL2 | Needs curator | Gene has other OMIM entries: 269700, 270685, 615924, 619112 |
| 393 | MAP2K2 | Needs curator | Gene has other OMIM entry: 615280 |
| 444 | MT-TV | Needs curator | Gene has NO OMIM entries in current ontology set |
| 645 | SDHB | Needs curator | Gene has other OMIM entries: 115310, 606764_1, 606764_2, 606864_1, 619224 |
| 655 | SHH | Needs curator | Gene has other OMIM entries: 142945, 147250, 611638 |
| 662 | SIX3 | Auto-fixable | Should be OMIM:269160_1 (suffix swap with entity 655) |
| 798 | ASCL1 | Needs curator | Gene has NO OMIM entries in current ontology set |
| 1023 | NDUFA2 | Needs curator | Gene has other OMIM entry: 618235 |
| 1922 | MAP3K20 | Needs curator | Gene has other OMIM entries: 616890, 617760 |
| 3625 | SDHD | Needs curator | Gene has other OMIM entries: 168000, 606864_3, 619167 |
| 3635 | GOSR2 | Needs curator | Gene has other OMIM entries: 614018, 620166 |
| 4513 | ABCC9 | Needs curator | Gene has other OMIM entries: 239850, 608569, 614050, 619719 |
Issue 2 — HIGH: Orphaned replaced_by Pointer
Entity 4269 has is_active=1 but also replaced_by=4271. No entity with ID 4271 exists in the database, even though that ID lies within the assigned range (max entity_id is 4516).
| Field | Value |
|---|---|
| entity_id | 4269 |
| Gene | FAM222B (HGNC:25563) |
| Disease | MONDO:0001071 |
| Inheritance | HP:0000006 (AD) |
| Status | Limited (category_id=3), approved |
| Created | 2024-05-27 |
| replaced_by | 4271 (does not exist) |
Fix: Set replaced_by = NULL for entity 4269.
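One way to find dangling pointers like this, and to apply the fix, is a self-join on ndd_entity. The following is a sketch against an in-memory SQLite database with an assumed three-column schema. Rows 1 and 2 are toy data, and only the 4269 row mirrors the case above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ndd_entity (
    entity_id INTEGER PRIMARY KEY,
    is_active INTEGER NOT NULL,
    replaced_by INTEGER
);
INSERT INTO ndd_entity VALUES
    (1, 0, 2),         -- toy row: pointer to an existing entity
    (2, 1, NULL),      -- toy row: terminal entity
    (4269, 1, 4271);   -- the case from this issue: 4271 does not exist
""")

ORPHAN_SQL = """
SELECT e.entity_id, e.replaced_by
FROM ndd_entity AS e
LEFT JOIN ndd_entity AS target
  ON target.entity_id = e.replaced_by
WHERE e.replaced_by IS NOT NULL
  AND target.entity_id IS NULL
ORDER BY e.entity_id
"""

orphans_before = conn.execute(ORPHAN_SQL).fetchall()

# Apply the proposed fix inside a transaction, then re-run the check.
with conn:
    conn.execute(
        "UPDATE ndd_entity SET replaced_by = NULL WHERE entity_id = ?",
        (4269,),
    )

orphans_after = conn.execute(ORPHAN_SQL).fetchall()
```

Re-running the detection query after the update is a cheap way to confirm the fix took effect before committing it to a production database.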
Issue 3 — MODERATE: 5 Inactive Entities with Broken FK to disease_ontology_set
These inactive entities reference disease_ontology_id_version values that no longer exist in disease_ontology_set. Old suffix versions were removed during ontology TRUNCATE/rewrite operations.
| entity_id | Gene | disease_version | Base exists? | Replacement | Created |
|---|---|---|---|---|---|
| 481 | NLGN3 (HGNC:14289) | OMIM:300494_1 | No (base gone entirely) | None | 2010-11-23 |
| 1058 | SEMA3E (HGNC:10727) | OMIM:214800_1 | Yes (unsuffixed only) | replaced_by=3897 | 2014-03-04 |
| 3217 | SLC6A19 (HGNC:27960) | OMIM:138500_2 | Yes (unsuffixed only) | None | 2020-07-22 |
| 3291 | SELENON (HGNC:15999) | OMIM:255310_5 | Yes (unsuffixed only) | None | 2020-11-13 |
| 3467 | SELENON (HGNC:15999) | OMIM:255310_6 | Yes (unsuffixed only) | None | 2021-04-11 |
Fix: Add compatibility rows to disease_ontology_set with is_active=FALSE for these 5 versions, or accept as historical artifacts (all entities are already inactive).
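The "Base exists?" column can be reproduced with an anti-join plus a check on the unsuffixed ID. This is a sketch under assumptions: a reduced schema, an in-memory SQLite database, and only two of the five rows from the table above loaded as sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease_ontology_set (
    disease_ontology_id_version TEXT PRIMARY KEY,
    is_active INTEGER NOT NULL
);
CREATE TABLE ndd_entity (
    entity_id INTEGER PRIMARY KEY,
    disease_ontology_id_version TEXT NOT NULL,
    is_active INTEGER NOT NULL
);
-- Sample rows taken from the table above; everything else is omitted.
INSERT INTO disease_ontology_set VALUES ('OMIM:214800', 1);
INSERT INTO ndd_entity VALUES
    (481, 'OMIM:300494_1', 0),
    (1058, 'OMIM:214800_1', 0);
""")

BROKEN_FK_SQL = """
SELECT e.entity_id, e.disease_ontology_id_version
FROM ndd_entity AS e
LEFT JOIN disease_ontology_set AS o
  ON o.disease_ontology_id_version = e.disease_ontology_id_version
WHERE o.disease_ontology_id_version IS NULL
ORDER BY e.entity_id
"""

known_versions = {
    row[0]
    for row in conn.execute(
        "SELECT disease_ontology_id_version FROM disease_ontology_set"
    )
}

# For each broken reference, report whether the unsuffixed base ID survives.
broken = []
for entity_id, version in conn.execute(BROKEN_FK_SQL).fetchall():
    base_id = version.split("_")[0]
    broken.append((entity_id, version, base_id in known_versions))
```

With the sample rows, entity 481 is reported with no surviving base ID and entity 1058 with its base ID still present, matching the table.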
Issue 4 — LOW: 3 Inactive Entities with Suffix-Gene Misalignment
Same root cause as Issue 1 but these entities are already inactive with valid replacement chains.
| entity_id | Gene | disease_version | Ontology gene | replaced_by |
|---|---|---|---|---|
| 992 | MT-TE (HGNC:7479) | OMIM:300438 | HSD17B10 (HGNC:4800) | 4319 |
| 3554 | NFS1 (HGNC:15910) | OMIM:300438 | HSD17B10 (HGNC:4800) | 4373 |
| 4026 | H4C5 (HGNC:4790) | OMIM:619951 | H4C9 (HGNC:4793) | 4027 |
No action needed — these are correctly deactivated with replacement links.
Issue 5 — LOW: 2 Inactive Entities Without Active Counterpart
These entities are is_active=0 with no replaced_by link, and no active entity exists for the same gene.
| entity_id | Gene | Disease | Inheritance | Status | Created |
|---|---|---|---|---|---|
| 4417 | TRR-CCT1-1 (HGNC:34638) | MONDO:0001071 | HP:0000006 (AD) | Definitive (category_id=1) | 2024-11-30 |
| 4419 | CCT2 (HGNC:1615) | MONDO:0001071 | HP:0000006 (AD) | Moderate (category_id=2) | 2024-11-30 |
Action: Curator should verify whether these were intentionally deactivated or should be reactivated.
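The distinction between "inactive without replaced_by" and "inactive with no active counterpart" can be expressed as two queries. The sketch below assumes a reduced schema with an hgnc_id column on ndd_entity and uses made-up rows in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ndd_entity (
    entity_id INTEGER PRIMARY KEY,
    hgnc_id TEXT NOT NULL,            -- assumed column
    is_active INTEGER NOT NULL,
    replaced_by INTEGER
);
-- Made-up rows.
INSERT INTO ndd_entity VALUES
    (1, 'HGNC:100', 0, NULL),   -- inactive, no link, but entity 2 is active
    (2, 'HGNC:100', 1, NULL),
    (3, 'HGNC:200', 0, NULL),   -- inactive, no link, no active counterpart
    (4, 'HGNC:300', 0, 5),      -- inactive with a replacement link
    (5, 'HGNC:300', 1, NULL);
""")

# Broad set: every inactive entity that lacks a replaced_by link.
without_link = [
    row[0]
    for row in conn.execute(
        "SELECT entity_id FROM ndd_entity "
        "WHERE is_active = 0 AND replaced_by IS NULL ORDER BY entity_id"
    )
]

# Narrow set: the same, restricted to genes with no active entity at all.
NO_COUNTERPART_SQL = """
SELECT e.entity_id, e.hgnc_id
FROM ndd_entity AS e
WHERE e.is_active = 0
  AND e.replaced_by IS NULL
  AND NOT EXISTS (
      SELECT 1 FROM ndd_entity AS a
      WHERE a.hgnc_id = e.hgnc_id AND a.is_active = 1
  )
ORDER BY e.entity_id
"""
no_counterpart = conn.execute(NO_COUNTERPART_SQL).fetchall()
```

The broad query corresponds to the appendix list, and the narrow query to the entities that need curator review.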
Issue 6 — LOW: 3 Active Entities Without Status Record
| entity_id | Gene | Disease | Inheritance | Created |
|---|---|---|---|---|
| 4188 | VCP (HGNC:12666) | OMIM:167320 | HP:0000006 (AD) | 2024-04-01 |
| 4469 | GAP43 (HGNC:4140) | MONDO:0001071 | HP:0000006 (AD) | 2025-02-04 |
| 4474 | FGF14 (HGNC:3671) | MONDO:0005071 | HP:0000007 (AR) | 2025-02-04 |
Action: Assign initial status to these entities.
Appendix: 45 Inactive Entities Without replaced_by
43 of these have active counterparts for the same gene (expected for entities that transitioned from generic MONDO to specific OMIM). 2 have no counterpart (see Issue 5).
| entity_id | Gene | Disease Version | Inheritance | Created | Source |
|---|---|---|---|---|---|
| 99 | BCOR (HGNC:20893) | OMIM:300166 | HP:0001423 (XL) | 2010-11-23 | SysID |
| 266 | GAD1 (HGNC:4092) | MONDO:0005071 | HP:0000007 (AR) | 2010-11-23 | SysID |
| 314 | HDAC4 (HGNC:14063) | MONDO:0001071 | HP:0000006 (AD) | 2011-01-30 | SysID |
| 481 | NLGN3 (HGNC:14289) | OMIM:300494_1 | HP:0001417 (XLR) | 2010-11-23 | SysID |
| 493 | NRXN1 (HGNC:8008) | OMIM:614332 | HP:0000006 (AD) | 2010-11-23 | SysID |
| 756 | TSC1 (HGNC:12362) | OMIM:606690_1 | HP:0001428 | 2010-11-23 | SysID |
| 788 | ADK (HGNC:257) | MONDO:0001071 | HP:0000007 (AR) | 2013-02-01 | SysID |
| 823 | EEF1B2 (HGNC:3208) | MONDO:0001071 | HP:0000007 (AR) | 2013-02-01 | SysID |
| 1071 | GJA1 (HGNC:4274) | MONDO:0002254 | HP:0000006 (AD) | 2014-03-04 | SysID |
| 1073 | GJA1 (HGNC:4274) | MONDO:0002254 | HP:0000007 (AR) | 2014-03-04 | SysID |
| 1182 | DRP2 (HGNC:3032) | MONDO:0005258 | HP:0001423 (XL) | 2014-03-04 | SysID |
| 1318 | DYNC1H1 (HGNC:2961) | MONDO:0020022 | HP:0000006 (AD) | 2015-01-02 | SysID |
| 1744 | ADAM22 (HGNC:201) | MONDO:0100062 | HP:0000007 (AR) | 2016-08-04 | SysID |
| 1818 | UBE4A (HGNC:12499) | MONDO:0001071 | HP:0000007 (AR) | 2016-08-05 | SysID |
| 1826 | TAOK1 (HGNC:29259) | MONDO:0001071 | HP:0000006 (AD) | 2016-08-05 | SysID |
| 1975 | OGDHL (HGNC:25590) | MONDO:0001071 | HP:0000007 (AR) | 2017-03-08 | SysID |
| 2079 | ACTL6A (HGNC:24124) | MONDO:0001071 | HP:0000006 (AD) | 2017-08-07 | SysID |
| 2215 | H4C3 (HGNC:4787) | MONDO:0001071 | HP:0000006 (AD) | 2017-12-01 | SysID |
| 2334 | BCAS3 (HGNC:14347) | MONDO:0001071 | HP:0000007 (AR) | 2018-02-26 | SysID |
| 2471 | FBXO28 (HGNC:29046) | MONDO:0001071 | HP:0000006 (AD) | 2018-09-25 | SysID |
| 2671 | POLRMT (HGNC:9200) | MONDO:0001071 | HP:0000007 (AR) | 2018-10-24 | SysID |
| 2723 | H3-3A (HGNC:4764) | MONDO:0001071 | HP:0000006 (AD) | 2018-10-25 | SysID |
| 2818 | ZBTB7A (HGNC:18078) | MONDO:0001071 | HP:0000006 (AD) | 2019-02-19 | SysID |
| 3032 | TRPM3 (HGNC:17992) | MONDO:0001071 | HP:0000006 (AD) | 2019-08-01 | SysID |
| 3202 | KAT8 (HGNC:17933) | MONDO:0001071 | HP:0000007 (AR) | 2020-07-22 | SysID |
| 3217 | SLC6A19 (HGNC:27960) | OMIM:138500_2 | HP:0000006 (AD) | 2020-07-22 | SysID |
| 3278 | LMBRD2 (HGNC:25287) | MONDO:0001071 | HP:0000006 (AD) | 2020-11-11 | SysID |
| 3291 | SELENON (HGNC:15999) | OMIM:255310_5 | HP:0000006 (AD) | 2020-11-13 | SysID |
| 3365 | POLR3B (HGNC:30348) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-05 | SysID |
| 3388 | UFSP2 (HGNC:25640) | MONDO:0001071 | HP:0000007 (AR) | 2021-04-06 | SysID |
| 3393 | KCNN2 (HGNC:6291) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-07 | SysID |
| 3405 | POLRMT (HGNC:9200) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-08 | SysID |
| 3467 | SELENON (HGNC:15999) | OMIM:255310_6 | HP:0000007 (AR) | 2021-04-11 | SysID |
| 3473 | H3-3B (HGNC:4765) | MONDO:0001071 | HP:0000006 (AD) | 2021-07-21 | SysID |
| 3490 | NAA20 (HGNC:15908) | MONDO:0001071 | HP:0000007 (AR) | 2021-07-21 | SysID |
| 3505 | ATP1A2 (HGNC:800) | MONDO:0001071 | HP:0000006 (AD) | 2021-07-22 | SysID |
| 3520 | RNF220 (HGNC:25552) | MONDO:0001071 | HP:0000007 (AR) | 2021-07-22 | SysID |
| 3582 | PI4KA (HGNC:8983) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3584 | AFG2B (HGNC:28762) | MONDO:0005071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3585 | ABHD16A (HGNC:13921) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3586 | SPRED2 (HGNC:17722) | MONDO:0002254 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3612 | RBL2 (HGNC:9894) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-13 | SysID |
| 3642 | PRORP (HGNC:19958) | MONDO:0002254 | HP:0000007 (AR) | 2021-11-18 | SysID |
| 4417 | TRR-CCT1-1 (HGNC:34638) | MONDO:0001071 | HP:0000006 (AD) | 2024-11-30 | sysndd |
| 4419 | CCT2 (HGNC:1615) | MONDO:0001071 | HP:0000006 (AD) | 2024-11-30 | sysndd |
Prevention
The Phase 76 ontology update safeguard (identify_critical_ontology_changes() + auto-fix + force-apply) now prevents new suffix-gene misalignments by:
- Detecting suffix shifts via fingerprint matching (gene + disease_id + inheritance)
- Auto-fixing safe shifts (same fingerprint, different suffix number)
- Blocking truly critical changes (no fingerprint match) for curator review
- Adding compatibility rows (is_active=FALSE) for force-applied changes
These 13 misalignments predate the safeguard and need one-time manual resolution.
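The fingerprint idea behind the safeguard can be sketched as follows. This is not the code of identify_critical_ontology_changes(). The function `classify_changes`, the snapshot layout (versioned ID mapped to a fingerprint tuple), and the IDs are hypothetical, and the sketch assumes each fingerprint occurs at most once per snapshot.

```python
from typing import Dict, List, Tuple

Fingerprint = Tuple[str, str, str]  # (hgnc_id, base disease_id, inheritance)


def classify_changes(
    old: Dict[str, Fingerprint],
    new: Dict[str, Fingerprint],
) -> Tuple[Dict[str, str], List[str]]:
    """Compare two ontology snapshots keyed by versioned ID.

    Returns (auto_fix, critical): auto_fix maps an old version to its new
    version when the fingerprint only moved to a different suffix; critical
    lists old versions whose fingerprint is absent from the new snapshot.
    """
    new_by_fingerprint = {fp: version for version, fp in new.items()}
    auto_fix: Dict[str, str] = {}
    critical: List[str] = []
    for version, fingerprint in old.items():
        target = new_by_fingerprint.get(fingerprint)
        if target is None:
            critical.append(version)      # no match: hold for curator review
        elif target != version:
            auto_fix[version] = target    # same fingerprint, shifted suffix
    return auto_fix, critical


# Made-up snapshots: a new gene shifts two suffixes, and one term disappears.
old_set = {
    "OMIM:100000_1": ("HGNC:200", "OMIM:100000", "AD"),
    "OMIM:100000_2": ("HGNC:300", "OMIM:100000", "AD"),
    "OMIM:200000": ("HGNC:400", "OMIM:200000", "AR"),
}
new_set = {
    "OMIM:100000_1": ("HGNC:100", "OMIM:100000", "AD"),
    "OMIM:100000_2": ("HGNC:200", "OMIM:100000", "AD"),
    "OMIM:100000_3": ("HGNC:300", "OMIM:100000", "AD"),
}

auto_fix, critical = classify_changes(old_set, new_set)
```

In this toy run the two shifted versions are remapped and `OMIM:200000` is flagged as critical. A real update would presumably restrict the comparison to versions that entities actually reference.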