Entity data integrity audit: 13 suffix-gene misalignments and other pre-existing issues #167

@berntpopp

Summary

A comprehensive audit of the ndd_entity table reveals data integrity issues that predate the Phase 76 ontology safeguard work. The same issues appear identically in both the Jan 31 2026 database snapshot and the current database, which confirms that recent changes did not introduce them.

Database snapshot used for analysis

.plan/data/202601311251.sysndd_db.sql.gz (Jan 31 2026 dump, 4,471 entities)

Overview

| Metric | Value |
| --- | --- |
| Total entities | 4,471 |
| Active | 4,203 |
| Inactive | 268 |
| Suffixed disease versions | 188 entities (130 unique base IDs) |

| Issue | Severity | Count |
| --- | --- | --- |
| Suffix-gene misalignment (active) | Critical | 13 |
| Suffix-gene misalignment (inactive) | Low | 3 |
| Orphaned replaced_by pointer | High | 1 |
| Broken FK to disease_ontology_set (inactive) | Moderate | 5 |
| Inactive entities without replaced_by or active counterpart | Low | 2 |
| Active entities without status record | Low | 3 |

Structural checks that passed

  • Suffix sequence gaps: 0 (all suffix sequences are contiguous)
  • Circular replacement chains: 0
  • Cross-gene replacements: 0 (all replaced_by links maintain the same gene)
  • Multi-hop chains: 3 — all correctly resolve to an active terminal entity
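
The chain checks above can be sketched roughly as follows. This is a minimal illustration in Python, not the audit's actual code: the function name `resolve_chain`, the dict-based inputs, and the toy entity IDs are all hypothetical.

```python
def resolve_chain(entity_id, replaced_by, is_active):
    """Follow replaced_by links from entity_id.

    replaced_by: dict entity_id -> next entity_id or None (hypothetical input shape)
    is_active:   dict entity_id -> bool, one key per existing entity
    Returns (terminal_id, hops, status).
    """
    seen = {entity_id}
    current = entity_id
    hops = 0
    while replaced_by.get(current) is not None:
        nxt = replaced_by[current]
        if nxt in seen:
            # The link points back into the chain already walked.
            return current, hops, "circular"
        if nxt not in is_active:
            # The link points at an entity_id that does not exist.
            return current, hops, "orphaned"
        seen.add(nxt)
        current = nxt
        hops += 1
    status = "ok" if is_active[current] else "inactive_terminal"
    return current, hops, status


# Toy data with made-up IDs: 1 -> 2 -> 3 is a multi-hop chain ending in an
# active entity, 4 <-> 5 is circular, and 6 points at a missing entity.
replaced_by = {1: 2, 2: 3, 3: None, 4: 5, 5: 4, 6: 99}
is_active = {1: False, 2: False, 3: True, 4: False, 5: False, 6: True}

chain_result = resolve_chain(1, replaced_by, is_active)
circular_result = resolve_chain(4, replaced_by, is_active)
orphan_result = resolve_chain(6, replaced_by, is_active)
```

A chain counts as multi-hop when `hops` is greater than 1; a cross-gene check would additionally compare the gene at each hop.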

Issue 1 — CRITICAL: 13 Active Entities with Suffix-Gene Misalignment

Root cause

The suffix assignment algorithm in build_omim_from_genemap2() (api/functions/omim-functions.R) assigns _1, _2, etc. by sorting on (disease_id, hgnc_id, name, inheritance) and numbering the rows sequentially with cumsum(). When OMIM adds or removes a gene-disease association, every suffix number for that MIM number can shift. Before Phase 76's auto-fix safeguard, these shifts were applied silently to disease_ontology_set without updating the corresponding ndd_entity.disease_ontology_id_version FK.
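
To make the shift concrete, here is a small Python sketch of sort-then-number suffixing. It is not the R implementation in build_omim_from_genemap2(): the function `assign_versions`, the rule that a single-row disease keeps its bare ID, and the example IDs are assumptions made for illustration.

```python
from itertools import groupby


def assign_versions(rows):
    """rows: list of (disease_id, hgnc_id, name, inheritance) tuples.

    Returns a dict mapping each row to its versioned disease ID.
    """
    ordered = sorted(rows)  # sort on all four fields, disease_id first
    versions = {}
    for disease_id, group in groupby(ordered, key=lambda r: r[0]):
        members = list(group)
        for n, row in enumerate(members, start=1):
            # Assumption: a disease_id with a single row keeps the bare ID.
            if len(members) == 1:
                versions[row] = disease_id
            else:
                versions[row] = f"{disease_id}_{n}"
    return versions


# Hypothetical IDs: two genes share one disease, then a third gene that
# sorts first is added upstream.
gene_b = ("OMIM:000001", "HGNC:200", "Example disease", "AD")
gene_c = ("OMIM:000001", "HGNC:300", "Example disease", "AD")
gene_a = ("OMIM:000001", "HGNC:100", "Example disease", "AD")

before = assign_versions([gene_b, gene_c])
after = assign_versions([gene_b, gene_c, gene_a])

# gene_b moves from _1 to _2 and gene_c from _2 to _3, so an entity whose FK
# still says OMIM:000001_1 now points at gene_a's row.
```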

Impact

The entity's curated gene association is correct, but its disease_ontology_id_version FK now points to a disease_ontology_set row describing a different gene's disease. This means:

  • Disease name metadata shown in the UI may be wrong for these entities
  • Ontology cross-references (MONDO, Orphanet, DOID) are for the wrong gene-disease pairing
  • Any downstream analysis using the disease name is affected
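
A misalignment of this kind can be found by joining each entity to its ontology row and comparing genes. The sketch below uses an in-memory SQLite database from Python; the column names `hgnc_id` and `disease_ontology_name` on disease_ontology_set are assumptions about the schema, entities 9001 and 9002 are made up, and rows 393 and 655 reuse values from the list in this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE disease_ontology_set (
    disease_ontology_id_version TEXT PRIMARY KEY,
    hgnc_id TEXT,                 -- assumed column: gene the ontology row describes
    disease_ontology_name TEXT    -- assumed column
);
CREATE TABLE ndd_entity (
    entity_id INTEGER PRIMARY KEY,
    hgnc_id TEXT,
    disease_ontology_id_version TEXT,
    is_active INTEGER
);
INSERT INTO disease_ontology_set VALUES
    ('OMIM:115150',   'HGNC:1097',  'Cardiofaciocutaneous syndrome'),
    ('OMIM:269160_1', 'HGNC:10889', 'Schizencephaly'),
    ('MONDO:0001071', NULL,         'placeholder name');
INSERT INTO ndd_entity VALUES
    (393,  'HGNC:6842',  'OMIM:115150',   1),  -- MAP2K2, ontology row is BRAF
    (655,  'HGNC:10848', 'OMIM:269160_1', 1),  -- SHH, ontology row is SIX3
    (9001, 'HGNC:1097',  'OMIM:115150',   1),  -- hypothetical aligned entity
    (9002, 'HGNC:25563', 'MONDO:0001071', 1);  -- hypothetical, no gene on ontology row
""")

# Rows where the ontology gene is NULL drop out because NULL <> x is not true.
MISALIGNED_SQL = """
SELECT e.entity_id, e.hgnc_id, d.hgnc_id, d.disease_ontology_name
FROM ndd_entity AS e
JOIN disease_ontology_set AS d
  ON d.disease_ontology_id_version = e.disease_ontology_id_version
WHERE e.is_active = 1
  AND e.hgnc_id <> d.hgnc_id
ORDER BY e.entity_id
"""

misaligned = conn.execute(MISALIGNED_SQL).fetchall()
```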

Complete entity list

| entity_id | Gene | HGNC | disease_version | Ontology now says gene | Ontology disease name | Inheritance | Category | Source | Created |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 59 | ARL6 | HGNC:13210 | OMIM:605231 | MKKS (HGNC:7108) | Bardet-Biedl syndrome 6 | HP:0000007 (AR) | Definitive | SysID | 2010-11-23 |
| 111 | BSCL2 | HGNC:15832 | OMIM:600794 | GARS1 (HGNC:4162) | Neuronopathy, distal hereditary motor, AD 5 | HP:0000006 (AD) | N/A | SysID | 2010-11-23 |
| 393 | MAP2K2 | HGNC:6842 | OMIM:115150 | BRAF (HGNC:1097) | Cardiofaciocutaneous syndrome | HP:0000006 (AD) | Definitive | SysID | 2010-11-23 |
| 444 | MT-TV | HGNC:7500 | OMIM:300438 | HSD17B10 (HGNC:4800) | HSD10 mitochondrial disease | HP:0001427 (mito) | Definitive | SysID | 2010-11-23 |
| 645 | SDHB | HGNC:10681 | OMIM:171300_1 | VHL (HGNC:12687) | Pheochromocytoma | HP:0000006 (AD) | N/A | SysID | 2012-02-20 |
| 655 | SHH | HGNC:10848 | OMIM:269160_1 | SIX3 (HGNC:10889) | Schizencephaly | HP:0000006 (AD) | Limited | SysID | 2010-11-23 |
| 662 | SIX3 | HGNC:10889 | OMIM:269160_2 | EMX2 (HGNC:3341) | Schizencephaly | HP:0000006 (AD) | Limited | SysID | 2010-11-23 |
| 798 | ASCL1 | HGNC:738 | OMIM:209880 | PHOX2B (HGNC:9143) | Central hypoventilation syndrome, congenital, 1 | HP:0000006 (AD) | N/A | SysID | 2013-02-01 |
| 1023 | NDUFA2 | HGNC:7685 | OMIM:618244 | NDUFA12 (HGNC:23987) | Mitochondrial complex I deficiency, nuclear type 23 | HP:0000007 (AR) | Definitive | SysID | 2014-03-04 |
| 1922 | MAP3K20 | HGNC:17797 | OMIM:617308 | ACOX2 (HGNC:120) | Bile acid synthesis defect, congenital, 6 | HP:0000007 (AR) | Limited | SysID | 2016-11-25 |
| 3625 | SDHD | HGNC:10683 | OMIM:171300_2 | TMEM127 (HGNC:26038) | {Pheochromocytoma, susceptibility to} | HP:0000006 (AD) | N/A | SysID | 2021-11-14 |
| 3635 | GOSR2 | HGNC:4431 | OMIM:300438 | HSD17B10 (HGNC:4800) | HSD10 mitochondrial disease | HP:0000007 (AR) | Limited | SysID | 2021-11-14 |
| 4513 | ABCC9 | HGNC:60 | OMIM:613443_2 | MEF2C (HGNC:6996) | NDD with hypotonia, stereotypic hand movements, impaired language | HP:0000006 (AD) | Limited | sysndd | 2025-02-26 |

Fixability analysis

| entity_id | Gene | Fixability | Notes |
| --- | --- | --- | --- |
| 59 | ARL6 | Needs curator | Gene has other OMIM entries: 209900_1, 209900_2, 600151, 613575 |
| 111 | BSCL2 | Needs curator | Gene has other OMIM entries: 269700, 270685, 615924, 619112 |
| 393 | MAP2K2 | Needs curator | Gene has other OMIM entry: 615280 |
| 444 | MT-TV | Needs curator | Gene has NO OMIM entries in current ontology set |
| 645 | SDHB | Needs curator | Gene has other OMIM entries: 115310, 606764_1, 606764_2, 606864_1, 619224 |
| 655 | SHH | Needs curator | Gene has other OMIM entries: 142945, 147250, 611638 |
| 662 | SIX3 | Auto-fixable | Should be OMIM:269160_1 (suffix swap with entity 655) |
| 798 | ASCL1 | Needs curator | Gene has NO OMIM entries in current ontology set |
| 1023 | NDUFA2 | Needs curator | Gene has other OMIM entry: 618235 |
| 1922 | MAP3K20 | Needs curator | Gene has other OMIM entries: 616890, 617760 |
| 3625 | SDHD | Needs curator | Gene has other OMIM entries: 168000, 606864_3, 619167 |
| 3635 | GOSR2 | Needs curator | Gene has other OMIM entries: 614018, 620166 |
| 4513 | ABCC9 | Needs curator | Gene has other OMIM entries: 239850, 608569, 614050, 619719 |

Issue 2 — HIGH: Orphaned replaced_by Pointer

Entity 4269 is active (is_active=1) but has replaced_by=4271, an entity_id that does not exist in the database even though it lies below the current maximum entity_id of 4516.

| Field | Value |
| --- | --- |
| entity_id | 4269 |
| Gene | FAM222B (HGNC:25563) |
| Disease | MONDO:0001071 |
| Inheritance | HP:0000006 (AD) |
| Status | Limited (category_id=3), approved |
| Created | 2024-05-27 |
| replaced_by | 4271 (does not exist) |

Fix: Set replaced_by = NULL for entity 4269.
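
A self-join can both find such orphaned pointers and confirm the fix. The following is a sketch against an in-memory SQLite table from Python, with only the three columns needed here. The rows for 4026 and 4027 are included as a valid link for contrast, and the is_active value for 4027 is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ndd_entity ("
    "entity_id INTEGER PRIMARY KEY, is_active INTEGER, replaced_by INTEGER)"
)
conn.executemany(
    "INSERT INTO ndd_entity VALUES (?, ?, ?)",
    [
        (4269, 1, 4271),  # pointer to a missing entity
        (4026, 0, 4027),  # valid replacement link
        (4027, 1, None),  # is_active assumed for illustration
    ],
)

# An orphan is a non-NULL replaced_by with no matching entity_id.
ORPHAN_SQL = """
SELECT e.entity_id, e.replaced_by
FROM ndd_entity AS e
LEFT JOIN ndd_entity AS target ON target.entity_id = e.replaced_by
WHERE e.replaced_by IS NOT NULL
  AND target.entity_id IS NULL
ORDER BY e.entity_id
"""

orphans = conn.execute(ORPHAN_SQL).fetchall()

# Clear the dangling pointers inside one transaction.
with conn:
    conn.executemany(
        "UPDATE ndd_entity SET replaced_by = NULL WHERE entity_id = ?",
        [(entity_id,) for entity_id, _ in orphans],
    )

remaining = conn.execute(ORPHAN_SQL).fetchall()
```

Running the same query before and after the update gives a cheap regression check for future audits.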


Issue 3 — MODERATE: 5 Inactive Entities with Broken FK to disease_ontology_set

These inactive entities reference disease_ontology_id_version values that no longer exist in disease_ontology_set. Old suffix versions were removed during ontology TRUNCATE/rewrite operations.

| entity_id | Gene | disease_version | Base exists? | Replacement | Created |
| --- | --- | --- | --- | --- | --- |
| 481 | NLGN3 (HGNC:14289) | OMIM:300494_1 | No (base gone entirely) | None | 2010-11-23 |
| 1058 | SEMA3E (HGNC:10727) | OMIM:214800_1 | Yes (unsuffixed only) | replaced_by=3897 | 2014-03-04 |
| 3217 | SLC6A19 (HGNC:27960) | OMIM:138500_2 | Yes (unsuffixed only) | None | 2020-07-22 |
| 3291 | SELENON (HGNC:15999) | OMIM:255310_5 | Yes (unsuffixed only) | None | 2020-11-13 |
| 3467 | SELENON (HGNC:15999) | OMIM:255310_6 | Yes (unsuffixed only) | None | 2021-04-11 |

Fix: Add compatibility rows to disease_ontology_set with is_active=FALSE for these 5 versions, or accept as historical artifacts (all entities are already inactive).
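
The "Base exists?" column can be derived by stripping the numeric suffix and looking the base ID up in the ontology set. The Python sketch below shows one way to do that; the helper names `base_id` and `classify_broken_fks` are hypothetical, and the ontology set is a two-row stand-in rather than the real table.

```python
def base_id(version):
    """Strip a trailing _N suffix, e.g. 'OMIM:214800_1' -> 'OMIM:214800'."""
    head, sep, tail = version.rpartition("_")
    return head if sep and tail.isdigit() else version


def classify_broken_fks(entity_versions, ontology_versions):
    """Report entities whose version is missing from the ontology set.

    entity_versions:   dict entity_id -> disease_ontology_id_version
    ontology_versions: set of versions present in disease_ontology_set
    """
    ontology_bases = {base_id(v) for v in ontology_versions}
    report = {}
    for entity_id, version in entity_versions.items():
        if version in ontology_versions:
            continue  # FK resolves, nothing to report
        if base_id(version) in ontology_bases:
            report[entity_id] = "base exists"
        else:
            report[entity_id] = "base gone"
    return report


# Stand-in ontology set: OMIM:214800 survives unsuffixed, OMIM:300494 is gone.
ontology_versions = {"OMIM:214800", "OMIM:115150"}
entity_versions = {
    481: "OMIM:300494_1",
    1058: "OMIM:214800_1",
    393: "OMIM:115150",
}

broken = classify_broken_fks(entity_versions, ontology_versions)
```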


Issue 4 — LOW: 3 Inactive Entities with Suffix-Gene Misalignment

Same root cause as Issue 1 but these entities are already inactive with valid replacement chains.

| entity_id | Gene | disease_version | Ontology gene | replaced_by |
| --- | --- | --- | --- | --- |
| 992 | MT-TE (HGNC:7479) | OMIM:300438 | HSD17B10 (HGNC:4800) | 4319 |
| 3554 | NFS1 (HGNC:15910) | OMIM:300438 | HSD17B10 (HGNC:4800) | 4373 |
| 4026 | H4C5 (HGNC:4790) | OMIM:619951 | H4C9 (HGNC:4793) | 4027 |

No action needed — these are correctly deactivated with replacement links.


Issue 5 — LOW: 2 Inactive Entities Without Active Counterpart

These entities are is_active=0 with no replaced_by link, and no active entity exists for the same gene.

| entity_id | Gene | Disease | Inheritance | Status | Created |
| --- | --- | --- | --- | --- | --- |
| 4417 | TRR-CCT1-1 (HGNC:34638) | MONDO:0001071 | HP:0000006 (AD) | Definitive (category_id=1) | 2024-11-30 |
| 4419 | CCT2 (HGNC:1615) | MONDO:0001071 | HP:0000006 (AD) | Moderate (category_id=2) | 2024-11-30 |

Action: Curator should verify whether these were intentionally deactivated or should be reactivated.


Issue 6 — LOW: 3 Active Entities Without Status Record

| entity_id | Gene | Disease | Inheritance | Created |
| --- | --- | --- | --- | --- |
| 4188 | VCP (HGNC:12666) | OMIM:167320 | HP:0000006 (AD) | 2024-04-01 |
| 4469 | GAP43 (HGNC:4140) | MONDO:0001071 | HP:0000006 (AD) | 2025-02-04 |
| 4474 | FGF14 (HGNC:3671) | MONDO:0005071 | HP:0000007 (AR) | 2025-02-04 |

Action: Assign initial status to these entities.


Appendix: 45 Inactive Entities Without replaced_by

43 of these have active counterparts for the same gene (expected for entities that transitioned from generic MONDO to specific OMIM). 2 have no counterpart (see Issue 5).

| entity_id | Gene | Disease version | Inheritance | Created | Source |
| --- | --- | --- | --- | --- | --- |
| 99 | BCOR (HGNC:20893) | OMIM:300166 | HP:0001423 (XL) | 2010-11-23 | SysID |
| 266 | GAD1 (HGNC:4092) | MONDO:0005071 | HP:0000007 (AR) | 2010-11-23 | SysID |
| 314 | HDAC4 (HGNC:14063) | MONDO:0001071 | HP:0000006 (AD) | 2011-01-30 | SysID |
| 481 | NLGN3 (HGNC:14289) | OMIM:300494_1 | HP:0001417 (XLR) | 2010-11-23 | SysID |
| 493 | NRXN1 (HGNC:8008) | OMIM:614332 | HP:0000006 (AD) | 2010-11-23 | SysID |
| 756 | TSC1 (HGNC:12362) | OMIM:606690_1 | HP:0001428 | 2010-11-23 | SysID |
| 788 | ADK (HGNC:257) | MONDO:0001071 | HP:0000007 (AR) | 2013-02-01 | SysID |
| 823 | EEF1B2 (HGNC:3208) | MONDO:0001071 | HP:0000007 (AR) | 2013-02-01 | SysID |
| 1071 | GJA1 (HGNC:4274) | MONDO:0002254 | HP:0000006 (AD) | 2014-03-04 | SysID |
| 1073 | GJA1 (HGNC:4274) | MONDO:0002254 | HP:0000007 (AR) | 2014-03-04 | SysID |
| 1182 | DRP2 (HGNC:3032) | MONDO:0005258 | HP:0001423 (XL) | 2014-03-04 | SysID |
| 1318 | DYNC1H1 (HGNC:2961) | MONDO:0020022 | HP:0000006 (AD) | 2015-01-02 | SysID |
| 1744 | ADAM22 (HGNC:201) | MONDO:0100062 | HP:0000007 (AR) | 2016-08-04 | SysID |
| 1818 | UBE4A (HGNC:12499) | MONDO:0001071 | HP:0000007 (AR) | 2016-08-05 | SysID |
| 1826 | TAOK1 (HGNC:29259) | MONDO:0001071 | HP:0000006 (AD) | 2016-08-05 | SysID |
| 1975 | OGDHL (HGNC:25590) | MONDO:0001071 | HP:0000007 (AR) | 2017-03-08 | SysID |
| 2079 | ACTL6A (HGNC:24124) | MONDO:0001071 | HP:0000006 (AD) | 2017-08-07 | SysID |
| 2215 | H4C3 (HGNC:4787) | MONDO:0001071 | HP:0000006 (AD) | 2017-12-01 | SysID |
| 2334 | BCAS3 (HGNC:14347) | MONDO:0001071 | HP:0000007 (AR) | 2018-02-26 | SysID |
| 2471 | FBXO28 (HGNC:29046) | MONDO:0001071 | HP:0000006 (AD) | 2018-09-25 | SysID |
| 2671 | POLRMT (HGNC:9200) | MONDO:0001071 | HP:0000007 (AR) | 2018-10-24 | SysID |
| 2723 | H3-3A (HGNC:4764) | MONDO:0001071 | HP:0000006 (AD) | 2018-10-25 | SysID |
| 2818 | ZBTB7A (HGNC:18078) | MONDO:0001071 | HP:0000006 (AD) | 2019-02-19 | SysID |
| 3032 | TRPM3 (HGNC:17992) | MONDO:0001071 | HP:0000006 (AD) | 2019-08-01 | SysID |
| 3202 | KAT8 (HGNC:17933) | MONDO:0001071 | HP:0000007 (AR) | 2020-07-22 | SysID |
| 3217 | SLC6A19 (HGNC:27960) | OMIM:138500_2 | HP:0000006 (AD) | 2020-07-22 | SysID |
| 3278 | LMBRD2 (HGNC:25287) | MONDO:0001071 | HP:0000006 (AD) | 2020-11-11 | SysID |
| 3291 | SELENON (HGNC:15999) | OMIM:255310_5 | HP:0000006 (AD) | 2020-11-13 | SysID |
| 3365 | POLR3B (HGNC:30348) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-05 | SysID |
| 3388 | UFSP2 (HGNC:25640) | MONDO:0001071 | HP:0000007 (AR) | 2021-04-06 | SysID |
| 3393 | KCNN2 (HGNC:6291) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-07 | SysID |
| 3405 | POLRMT (HGNC:9200) | MONDO:0001071 | HP:0000006 (AD) | 2021-04-08 | SysID |
| 3467 | SELENON (HGNC:15999) | OMIM:255310_6 | HP:0000007 (AR) | 2021-04-11 | SysID |
| 3473 | H3-3B (HGNC:4765) | MONDO:0001071 | HP:0000006 (AD) | 2021-07-21 | SysID |
| 3490 | NAA20 (HGNC:15908) | MONDO:0001071 | HP:0000007 (AR) | 2021-07-21 | SysID |
| 3505 | ATP1A2 (HGNC:800) | MONDO:0001071 | HP:0000006 (AD) | 2021-07-22 | SysID |
| 3520 | RNF220 (HGNC:25552) | MONDO:0001071 | HP:0000007 (AR) | 2021-07-22 | SysID |
| 3582 | PI4KA (HGNC:8983) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3584 | AFG2B (HGNC:28762) | MONDO:0005071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3585 | ABHD16A (HGNC:13921) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3586 | SPRED2 (HGNC:17722) | MONDO:0002254 | HP:0000007 (AR) | 2021-11-04 | SysID |
| 3612 | RBL2 (HGNC:9894) | MONDO:0001071 | HP:0000007 (AR) | 2021-11-13 | SysID |
| 3642 | PRORP (HGNC:19958) | MONDO:0002254 | HP:0000007 (AR) | 2021-11-18 | SysID |
| 4417 | TRR-CCT1-1 (HGNC:34638) | MONDO:0001071 | HP:0000006 (AD) | 2024-11-30 | sysndd |
| 4419 | CCT2 (HGNC:1615) | MONDO:0001071 | HP:0000006 (AD) | 2024-11-30 | sysndd |

Prevention

The Phase 76 ontology update safeguard (identify_critical_ontology_changes() + auto-fix + force-apply) now prevents new suffix-gene misalignments by:

  1. Detecting suffix shifts via fingerprint matching (gene + disease_id + inheritance)
  2. Auto-fixing safe shifts (same fingerprint, different suffix number)
  3. Blocking truly critical changes (no fingerprint match) for curator review
  4. Adding compatibility rows (is_active=FALSE) for force-applied changes

These 13 misalignments predate the safeguard and need one-time manual resolution.
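
The fingerprint idea in steps 1 to 3 can be illustrated with a short Python sketch. It is not the code behind identify_critical_ontology_changes(): the function `plan_updates`, its dict inputs, and the example IDs are hypothetical, and it assumes each fingerprint occurs at most once in the new ontology set.

```python
def plan_updates(used_versions, old_set, new_set):
    """Classify ontology changes for versions that entities reference.

    old_set / new_set: dict version -> (hgnc_id, disease_id, inheritance),
    where the tuple is the fingerprint. Returns (auto_fix, blocked).
    """
    # Assumption: fingerprints are unique within new_set.
    new_by_fingerprint = {fp: version for version, fp in new_set.items()}
    auto_fix = {}
    blocked = []
    for version in used_versions:
        fingerprint = old_set[version]
        target = new_by_fingerprint.get(fingerprint)
        if target is None:
            blocked.append(version)      # no match: needs curator review
        elif target != version:
            auto_fix[version] = target   # same fingerprint, new suffix
    return auto_fix, blocked


# Hypothetical update: a new gene is inserted ahead of two existing ones for
# OMIM:000001, and OMIM:000002 changes gene entirely.
old_set = {
    "OMIM:000001_1": ("HGNC:200", "OMIM:000001", "AD"),
    "OMIM:000001_2": ("HGNC:300", "OMIM:000001", "AD"),
    "OMIM:000002": ("HGNC:400", "OMIM:000002", "AR"),
}
new_set = {
    "OMIM:000001_1": ("HGNC:100", "OMIM:000001", "AD"),
    "OMIM:000001_2": ("HGNC:200", "OMIM:000001", "AD"),
    "OMIM:000001_3": ("HGNC:300", "OMIM:000001", "AD"),
    "OMIM:000002": ("HGNC:500", "OMIM:000002", "AR"),
}

auto_fix, blocked = plan_updates(
    ["OMIM:000001_1", "OMIM:000001_2", "OMIM:000002"], old_set, new_set
)
```

Entities on a version listed in `auto_fix` would have their FK rewritten to the mapped version, while versions in `blocked` would wait for a curator.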
