Document Version: 1.0 Last Updated: 2025-01-08 Purpose: Define systematic validation rules for detecting data quality issues in RootsMagic databases
This document provides comprehensive validation rules for identifying data quality issues in RootsMagic databases. These rules help AI agents and data analysts detect:
- Incomplete records - Missing required fields
- Logical inconsistencies - Events in impossible orders (death before birth)
- Orphaned records - References to non-existent entities
- Unsourced facts - Claims without supporting citations
- Invalid values - Data outside acceptable ranges
Severity: CRITICAL
Check:
SELECT p.PersonID, p.Sex
FROM PersonTable p
LEFT JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
WHERE n.NameID IS NULLExpected: 0 records Remediation: Add primary name or mark person as placeholder
Severity: HIGH
Check:
SELECT n.NameID, n.OwnerID, n.Surname, n.Given
FROM NameTable n
WHERE n.IsPrimary = 1
AND (n.Surname IS NULL OR n.Surname = '')
AND (n.Given IS NULL OR n.Given = '')Typical Rate: <2% acceptable (for unknown individuals) Remediation: Research to find missing name components, or use "[Unknown]" placeholder
Severity: MEDIUM
Check:
SELECT e.EventID, e.OwnerID
FROM EventTable e
WHERE e.EventType = 1 -- Birth
AND (e.Date IS NULL OR e.Date = '' OR e.Date = '.')
AND (e.PlaceID = 0 OR e.PlaceID IS NULL)Typical Rate: 10-15% acceptable Remediation: Research vital records, census data for birth information
Severity: HIGH
Check:
SELECT e.EventID, e.OwnerID, p.PersonID, n.Surname, n.Given
FROM EventTable e
JOIN PersonTable p ON e.OwnerID = p.PersonID
JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
WHERE e.EventType = 2 -- Death
AND (e.Date IS NULL OR e.Date = '' OR e.Date = '.')Typical Rate: <1% acceptable (very recent deaths may lack dates) Remediation: Check death certificates, obituaries, cemetery records
Severity: MEDIUM
Check (Python):
import xml.etree.ElementTree as ET
def check_empty_citations(conn):
cursor = conn.cursor()
cursor.execute("SELECT CitationID, SourceID, Fields FROM CitationTable")
empty_citations = []
for cit_id, src_id, blob in cursor.fetchall():
if blob:
xml = blob.decode('utf-8-sig')
root = ET.fromstring(xml)
fields = root.findall('.//Field')
# Check if all field values are empty
all_empty = all(
(f.find('Value').text or '').strip() == ''
for f in fields
)
if all_empty:
empty_citations.append((cit_id, src_id))
return empty_citationsRemediation: Add page numbers, URLs, or other citation details
Severity: CRITICAL
Check:
SELECT
p.PersonID,
n.Surname || ', ' || n.Given as Name,
birth.Date as BirthDate,
birth.SortDate as BirthSort,
death.Date as DeathDate,
death.SortDate as DeathSort
FROM PersonTable p
JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
LEFT JOIN EventTable birth ON p.PersonID = birth.OwnerID AND birth.EventType = 1
LEFT JOIN EventTable death ON p.PersonID = death.OwnerID AND death.EventType = 2
WHERE birth.SortDate IS NOT NULL
AND death.SortDate IS NOT NULL
AND birth.SortDate != 9223372036854775807 -- Not "unknown" date
AND death.SortDate != 9223372036854775807
AND death.SortDate < birth.SortDateExpected: 0 records
Note: SortDate value 9223372036854775807 indicates unknown/missing date
Remediation: Verify dates, check for data entry errors
Severity: CRITICAL
Check:
SELECT
child.PersonID as ChildID,
cn.Surname || ', ' || cn.Given as ChildName,
cb.Date as ChildBirth,
parent.PersonID as ParentID,
pn.Surname || ', ' || pn.Given as ParentName,
pb.Date as ParentBirth
FROM ChildTable ct
JOIN FamilyTable f ON ct.FamilyID = f.FamilyID
JOIN PersonTable child ON ct.ChildID = child.PersonID
JOIN NameTable cn ON child.PersonID = cn.OwnerID AND cn.IsPrimary = 1
LEFT JOIN EventTable cb ON child.PersonID = cb.OwnerID AND cb.EventType = 1
JOIN PersonTable parent ON (f.FatherID = parent.PersonID OR f.MotherID = parent.PersonID)
JOIN NameTable pn ON parent.PersonID = pn.OwnerID AND pn.IsPrimary = 1
LEFT JOIN EventTable pb ON parent.PersonID = pb.OwnerID AND pb.EventType = 1
WHERE cb.SortDate IS NOT NULL
AND pb.SortDate IS NOT NULL
AND cb.SortDate != 9223372036854775807
AND pb.SortDate != 9223372036854775807
AND cb.SortDate < pb.SortDateExpected: 0 records Remediation: Verify parent-child relationships, check for adoption/step-relationships
Severity: MEDIUM (for <12 or >65), HIGH (for <10 or >70)
Check:
SELECT
parent.PersonID,
pn.Surname || ', ' || pn.Given as ParentName,
parent.Sex,
pb.Date as ParentBirth,
cb.Date as ChildBirth,
CAST((cb.SortDate - pb.SortDate) / 10000000000000 AS INTEGER) as AgeAtBirth
FROM ChildTable ct
JOIN FamilyTable f ON ct.FamilyID = f.FamilyID
JOIN PersonTable child ON ct.ChildID = child.PersonID
LEFT JOIN EventTable cb ON child.PersonID = cb.OwnerID AND cb.EventType = 1
JOIN PersonTable parent ON (f.FatherID = parent.PersonID OR f.MotherID = parent.PersonID)
JOIN NameTable pn ON parent.PersonID = pn.OwnerID AND pn.IsPrimary = 1
LEFT JOIN EventTable pb ON parent.PersonID = pb.OwnerID AND pb.EventType = 1
WHERE cb.SortDate IS NOT NULL
AND pb.SortDate IS NOT NULL
AND cb.SortDate != 9223372036854775807
AND pb.SortDate != 9223372036854775807
AND (
(cb.SortDate - pb.SortDate) < 120000000000000 -- < 12 years
OR (cb.SortDate - pb.SortDate) > 650000000000000 -- > 65 years
)Age Calculation: SortDate difference / 10000000000000 ≈ years Remediation: Verify birth dates, check for data entry errors, consider adoption/step-relationships
Severity: HIGH
Check:
SELECT
p.PersonID,
n.Surname || ', ' || n.Given as Name,
birth.Date as BirthDate,
marriage.Date as MarriageDate
FROM PersonTable p
JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
LEFT JOIN EventTable birth ON p.PersonID = birth.OwnerID AND birth.EventType = 1
LEFT JOIN FamilyTable f ON (p.PersonID = f.FatherID OR p.PersonID = f.MotherID)
LEFT JOIN EventTable marriage ON f.FamilyID = marriage.OwnerID AND marriage.EventType = 300
WHERE birth.SortDate IS NOT NULL
AND marriage.SortDate IS NOT NULL
AND birth.SortDate != 9223372036854775807
AND marriage.SortDate != 9223372036854775807
AND marriage.SortDate < birth.SortDateExpected: 0 records Remediation: Verify dates, check for data entry errors
Severity: HIGH (most events), LOW (for Burial, Probate, Will, Obituary)
Check:
SELECT
p.PersonID,
n.Surname || ', ' || n.Given as Name,
death.Date as DeathDate,
ft.Name as EventType,
e.Date as EventDate
FROM PersonTable p
JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
JOIN EventTable death ON p.PersonID = death.OwnerID AND death.EventType = 2
JOIN EventTable e ON p.PersonID = e.OwnerID
JOIN FactTypeTable ft ON e.EventType = ft.FactTypeID
WHERE death.SortDate IS NOT NULL
AND e.SortDate IS NOT NULL
AND death.SortDate != 9223372036854775807
AND e.SortDate != 9223372036854775807
AND e.EventType != 2 -- Not the death event itself
AND e.EventType NOT IN (4, 19, 20, 1000, 1022) -- Exclude: Burial, Probate, Will, Obituary, News
AND e.SortDate > death.SortDateEvents Allowed After Death: Burial, Probate, Will, Obituary, News articles Remediation: Verify event dates, move posthumous events to appropriate fact types
Severity: LOW (common before modern era)
Check:
SELECT
f.FamilyID,
father.Surname || ', ' || father.Given as Father,
mother.Surname || ', ' || mother.Given as Mother,
marriage.Date as MarriageDate,
child.Surname || ', ' || child.Given as FirstChild,
firstborn.Date as FirstChildBirth
FROM FamilyTable f
JOIN NameTable father ON f.FatherID = father.OwnerID AND father.IsPrimary = 1
JOIN NameTable mother ON f.MotherID = mother.OwnerID AND mother.IsPrimary = 1
LEFT JOIN EventTable marriage ON f.FamilyID = marriage.OwnerID AND marriage.EventType = 300
JOIN ChildTable ct ON f.FamilyID = ct.FamilyID
JOIN PersonTable cp ON ct.ChildID = cp.PersonID
JOIN NameTable child ON cp.PersonID = child.OwnerID AND child.IsPrimary = 1
LEFT JOIN EventTable firstborn ON cp.PersonID = firstborn.OwnerID AND firstborn.EventType = 1
WHERE marriage.SortDate IS NOT NULL
AND firstborn.SortDate IS NOT NULL
AND marriage.SortDate != 9223372036854775807
AND firstborn.SortDate != 9223372036854775807
AND firstborn.SortDate < marriage.SortDate
ORDER BY f.FamilyID, firstborn.SortDateNote: Historical context matters - common law marriages, prior relationships Remediation: Verify marriage date, consider multiple marriages
Severity: CRITICAL
Check:
SELECT c.CitationID, c.SourceID, c.CitationName
FROM CitationTable c
LEFT JOIN SourceTable s ON c.SourceID = s.SourceID
WHERE s.SourceID IS NULLExpected: 0 records Remediation: Delete orphaned citations or restore missing sources
Severity: CRITICAL
Check:
-- Person events
SELECT e.EventID, e.OwnerType, e.OwnerID
FROM EventTable e
WHERE e.OwnerType = 0 -- Person
AND NOT EXISTS (SELECT 1 FROM PersonTable p WHERE p.PersonID = e.OwnerID)
UNION
-- Family events
SELECT e.EventID, e.OwnerType, e.OwnerID
FROM EventTable e
WHERE e.OwnerType = 1 -- Family
AND NOT EXISTS (SELECT 1 FROM FamilyTable f WHERE f.FamilyID = e.OwnerID)Expected: 0 records Remediation: Delete orphaned events or restore missing persons/families
Severity: CRITICAL
Check:
SELECT ct.ChildID, ct.FamilyID
FROM ChildTable ct
LEFT JOIN PersonTable p ON ct.ChildID = p.PersonID
LEFT JOIN FamilyTable f ON ct.FamilyID = f.FamilyID
WHERE p.PersonID IS NULL OR f.FamilyID IS NULLExpected: 0 records Remediation: Delete orphaned relationships or restore missing persons/families
Severity: MEDIUM
Check:
SELECT e.EventID, e.OwnerID, e.PlaceID
FROM EventTable e
WHERE e.PlaceID > 0
AND NOT EXISTS (SELECT 1 FROM PlaceTable p WHERE p.PlaceID = e.PlaceID)Expected: 0 records Remediation: Restore missing places or reset PlaceID to 0
Severity: HIGH
Check:
SELECT
ft.Name as EventType,
COUNT(DISTINCT e.EventID) as TotalEvents,
COUNT(DISTINCT cl.LinkID) as SourcedEvents,
COUNT(DISTINCT e.EventID) - COUNT(DISTINCT cl.LinkID) as UnsourcedEvents
FROM EventTable e
JOIN FactTypeTable ft ON e.EventType = ft.FactTypeID
LEFT JOIN CitationLinkTable cl ON e.EventID = cl.OwnerID AND cl.OwnerType = 2
WHERE e.EventType IN (1, 2, 3, 4, 300) -- Birth, Death, Christen, Burial, Marriage
GROUP BY ft.FactTypeID, ft.Name
HAVING COUNT(DISTINCT e.EventID) > 0
ORDER BY (COUNT(DISTINCT e.EventID) - COUNT(DISTINCT cl.LinkID)) DESCAcceptable Rate: <20% unsourced for vital events Sample Data:
- Birth: 83% unsourced
- Death: 74% unsourced
- Marriage: 82% unsourced
Remediation: Research and add citations from vital records, certificates, church records
Severity: LOW
Check:
SELECT s.SourceID, s.Name
FROM SourceTable s
LEFT JOIN CitationTable c ON s.SourceID = c.SourceID
WHERE c.CitationID IS NULLNote: Unused sources may be placeholders for future research Remediation: Delete unused sources or create citations linking them to facts
Severity: MEDIUM
Check (Python):
def check_incomplete_sources(conn):
cursor = conn.cursor()
# Get template field definitions
cursor.execute("""
SELECT s.SourceID, s.Name, s.TemplateID, s.Fields, st.FieldDefs
FROM SourceTable s
JOIN SourceTemplateTable st ON s.TemplateID = st.TemplateID
WHERE s.TemplateID > 0
""")
incomplete = []
for src_id, name, tmpl_id, src_blob, tmpl_blob in cursor.fetchall():
# Parse template to get required source-level fields
tmpl_xml = tmpl_blob.decode('utf-8-sig')
tmpl_root = ET.fromstring(tmpl_xml)
required_fields = [
field.find('FieldName').text
for field in tmpl_root.findall('.//Field')
if field.find('CitationField').text == 'False'
]
# Parse source BLOB to get actual fields
src_xml = src_blob.decode('utf-8-sig')
src_root = ET.fromstring(src_xml)
actual_fields = {
field.find('Name').text: field.find('Value').text or ''
for field in src_root.findall('.//Field')
}
# Check for missing or empty required fields
missing = [
f for f in required_fields
if f not in actual_fields or not actual_fields[f].strip()
]
if missing:
incomplete.append((src_id, name, missing))
return incompleteRemediation: Fill in missing source metadata (author, title, publisher, dates)
Severity: MEDIUM
Check: Compare Date field encoding with SortDate value Implementation: Parse Date field per RM11_Date_Format.md and verify SortDate
Severity: HIGH
Check:
SELECT e.EventID, e.OwnerID, ft.Name, e.Date, e.SortDate
FROM EventTable e
JOIN FactTypeTable ft ON e.EventType = ft.FactTypeID
WHERE e.SortDate IS NOT NULL
AND e.SortDate != 9223372036854775807 -- Unknown date marker
AND (
-- Before year 1000 AD (unlikely for most genealogy)
e.SortDate < 5000000000000000000
-- After current year + 1
OR e.SortDate > 6800000000000000000
)Note: Adjust ranges based on your research period Remediation: Verify dates, check for data entry errors
Severity: MEDIUM
Check:
SELECT
p.PersonID,
n.Surname || ', ' || n.Given as Name,
birth.Date as BirthDate,
death.Date as DeathDate,
CAST((death.SortDate - birth.SortDate) / 10000000000000 AS INTEGER) as Lifespan
FROM PersonTable p
JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
JOIN EventTable birth ON p.PersonID = birth.OwnerID AND birth.EventType = 1
JOIN EventTable death ON p.PersonID = death.OwnerID AND death.EventType = 2
WHERE birth.SortDate IS NOT NULL
AND death.SortDate IS NOT NULL
AND birth.SortDate != 9223372036854775807
AND death.SortDate != 9223372036854775807
AND (
(death.SortDate - birth.SortDate) < 0 -- Negative lifespan
OR (death.SortDate - birth.SortDate) > 1200000000000000 -- > 120 years
)Acceptable Range: 0-120 years (rare exceptions for longevity) Remediation: Verify both dates, check for data entry errors
Severity: MEDIUM
Check:
SELECT PersonID, Sex
FROM PersonTable
WHERE Sex NOT IN (0, 1, 2)Values: 0=Male, 1=Female, 2=Unknown Expected: 0 records Remediation: Correct invalid sex values
Severity: LOW
Check:
SELECT EventID, Proof
FROM EventTable
WHERE Proof NOT IN (0, 1, 2, 3)Values: 0=Blank, 1=Proven, 2=Disproven, 3=Disputed Expected: 0 records Remediation: Correct invalid proof values
Severity: MEDIUM
Check:
SELECT NameID, IsPrimary
FROM NameTable
WHERE IsPrimary NOT IN (0, 1)Expected: 0 records Remediation: Correct invalid boolean values
Severity: HIGH
Check:
SELECT OwnerID, COUNT(*) as PrimaryNameCount
FROM NameTable
WHERE IsPrimary = 1
GROUP BY OwnerID
HAVING COUNT(*) != 1Expected: 0 records Remediation: Designate exactly one name as primary per person
| Severity | Impact | Action Required |
|---|---|---|
| CRITICAL | Database integrity violated | Fix immediately |
| HIGH | Significant genealogical error | Investigate and correct |
| MEDIUM | Possible error or incomplete data | Review when time permits |
| LOW | Minor issue or style preference | Optional improvement |
def generate_quality_report(db_path):
"""Generate comprehensive data quality report."""
conn = connect_rmtree(db_path)
report = {
'database': db_path,
'generated': datetime.now(),
'issues': {
'critical': [],
'high': [],
'medium': [],
'low': []
},
'summary': {}
}
# Run all validation rules
# Rule 1.1: People without names
cursor = conn.cursor()
cursor.execute("""
SELECT COUNT(*)
FROM PersonTable p
LEFT JOIN NameTable n ON p.PersonID = n.OwnerID AND n.IsPrimary = 1
WHERE n.NameID IS NULL
""")
count = cursor.fetchone()[0]
if count > 0:
report['issues']['critical'].append({
'rule': '1.1',
'name': 'People without primary names',
'count': count
})
# Rule 2.1: Death before birth
cursor.execute("""
SELECT COUNT(*)
FROM PersonTable p
JOIN EventTable birth ON p.PersonID = birth.OwnerID AND birth.EventType = 1
JOIN EventTable death ON p.PersonID = death.OwnerID AND death.EventType = 2
WHERE birth.SortDate != 9223372036854775807
AND death.SortDate != 9223372036854775807
AND death.SortDate < birth.SortDate
""")
count = cursor.fetchone()[0]
if count > 0:
report['issues']['critical'].append({
'rule': '2.1',
'name': 'Death before birth',
'count': count
})
# ... more rules ...
# Summary statistics
report['summary'] = {
'total_people': get_count(cursor, 'PersonTable'),
'total_events': get_count(cursor, 'EventTable'),
'total_sources': get_count(cursor, 'SourceTable'),
'total_citations': get_count(cursor, 'CitationTable'),
'critical_issues': len(report['issues']['critical']),
'high_issues': len(report['issues']['high']),
'medium_issues': len(report['issues']['medium']),
'low_issues': len(report['issues']['low'])
}
conn.close()
return report-
Unknown Date Marker: SortDate value
9223372036854775807indicates missing/unknown date - exclude from date comparisons -
Age Calculation: Approximate years =
(SortDate2 - SortDate1) / 10000000000000 -
PlaceID = 0 means no place assigned (not an error, just unknown)
-
IsPrivate flag - respect privacy settings when generating reports
-
Proof levels - use to assess source quality:
- 0 = Blank/unknown
- 1 = Proven (high confidence)
- 2 = Disproven (contradicts evidence)
- 3 = Disputed (conflicting evidence)
-
Multiple issues - prioritize by severity, fix critical issues first
-
Historical context - some "issues" are historically normal (marriage age, children before marriage)
-
Batch processing - run all checks periodically, not per-record
- RM11_Schema_Reference.md - Table structures and relationships
- RM11_Date_Format.md - Date encoding specification
- RM11_BLOB_SourceFields.md - Source metadata extraction
- RM11_BLOB_CitationFields.md - Citation detail extraction
End of Document