
Developer Guide

Development Setup

  1. Clone the repository:

    git clone https://github.com/kimon1230/gedcom-tools.git
    cd gedcom-tools
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
  3. Install in development mode with dev dependencies:

    pip install -e ".[dev]"

Project Structure

gedcom_tools/
├── src/
│   └── gedcom_tools/
│       ├── __init__.py          # Package init, version
│       ├── cli.py               # Main entry point, argument parsing
│       ├── constants.py         # Shared constants (exit codes, thresholds)
│       ├── dates.py             # Shared date parsing utilities
│       ├── graph.py             # Graph algorithms (UnionFind, components, ParentChildGraph, BFS traversal)
│       ├── progress.py          # Terminal UI (Colors, PhaseTracker)
│       ├── phonetics.py         # Shared phonetics (American Soundex and Double Metaphone)
│       ├── utils.py             # Shared utilities (encoding, xref, validation)
│       ├── commands/
│       │   ├── __init__.py      # Commands package init
│       │   ├── isolated.py      # Isolated individuals command
│       │   ├── validate.py      # Validation command handler
│       │   ├── stats/           # Stats command package
│       │   │   ├── __init__.py  # CLI registration, public API
│       │   │   ├── models.py    # Data classes (IndividualData, etc.)
│       │   │   ├── collector.py # StatsCollector class
│       │   │   └── formatters.py # StatsResult with format methods
│       │   ├── search/          # Search command package
│       │   │   ├── __init__.py  # CLI registration, orchestration
│       │   │   ├── models.py    # Data classes (SearchIndividual, etc.)
│       │   │   ├── query.py     # Query parsing and validation
│       │   │   ├── collector.py # Individual extraction/normalization
│       │   │   ├── matcher.py   # Term matching (text, phonetic, wildcard, regex)
│       │   │   ├── relationships.py # BFS ancestor/descendant traversal
│       │   │   └── formatter.py # Text and JSON output formatting
│       │   ├── compare/         # Compare command package
│       │   │   ├── __init__.py  # CLI registration, orchestration
│       │   │   ├── models.py    # Data classes (CompareIndividual, etc.)
│       │   │   ├── collector.py # Individual extraction/normalization
│       │   │   ├── phonetics.py # American Soundex encoding
│       │   │   ├── blocker.py   # Multi-pass blocking
│       │   │   ├── scorer.py    # Weighted Jaro-Winkler scoring
│       │   │   ├── dedup.py     # Greedy one-to-one deduplication
│       │   │   └── formatters.py # Text and JSON output formatting
│       │   ├── duplicates/      # Duplicates command package
│       │   │   ├── __init__.py  # CLI registration, orchestration
│       │   │   ├── models.py    # Data classes (DuplicatesResult)
│       │   │   └── formatters.py # Text and JSON output formatting
│       │   ├── relationship/    # Relationship command package
│       │   │   ├── __init__.py  # CLI registration, xref validation, orchestration
│       │   │   ├── models.py    # Data classes (RelIndividual, RelationshipPath, RelationshipResult)
│       │   │   ├── classifier.py # Relationship classification and description building
│       │   │   ├── algorithm.py # LCA algorithm, half-detection, sorting
│       │   │   └── formatter.py # Text and JSON output formatting
│       │   ├── export/          # Export command package
│       │   │   ├── __init__.py  # CLI registration, format resolution, orchestration
│       │   │   ├── models.py    # Data classes (ExportIndividual, ExportFamily, ExportResult, estimate_living)
│       │   │   ├── collector.py # Individual/family extraction from GEDCOM
│       │   │   └── formatters.py # CSV and JSON output formatting
│       │   ├── convert/         # Convert command package
│       │   │   ├── __init__.py  # CLI registration, orchestration
│       │   │   └── transcoder.py # Encoding transcoder (codec resolution, BOM, CHAR header, NFC)
│       │   └── filter/          # Filter command package
│       │       ├── __init__.py  # CLI registration, validation, pipeline orchestration
│       │       ├── models.py    # Data classes (GedcomLine, GedcomRecord, FilterSpec, FilterResult, RecordCounts)
│       │       ├── parser.py    # Line-level GEDCOM parser (lossless round-trip)
│       │       ├── transforms.py # Strip transforms and subtree extraction
│       │       └── writer.py    # Dangling pointer cleanup, empty family cascade, serialization
│       └── validation/
│           ├── __init__.py      # Public API: validate_file()
│           ├── engine.py        # 4-phase validation orchestrator
│           ├── issues.py        # Error/warning codes and data classes
│           ├── reference.py     # Cross-reference validation
│           ├── result.py        # Result formatting (text/JSON)
│           └── semantic.py      # Semantic validation (dates, cycles)
├── tests/
│   ├── conftest.py              # Pytest fixtures
│   ├── fixtures/                # Test GEDCOM files (555sample.ged, etc.)
│   ├── test_cli.py              # CLI integration tests
│   ├── test_dates.py            # Date parsing utility tests
│   ├── test_graph.py            # Graph algorithm tests
│   ├── test_isolated.py         # Isolated command tests
│   ├── test_progress.py         # Progress UI tests
│   ├── test_stats.py            # Stats command tests
│   ├── test_stats_schema.py     # JSON schema validation tests
│   ├── test_search_*.py         # Search command tests (query, collector, matcher, relationships, formatter, integration)
│   ├── test_compare_*.py        # Compare command tests (scorer, dedup, formatters, integration)
│   ├── test_duplicates_*.py     # Duplicates command tests (formatters, integration)
│   ├── test_export_*.py         # Export command tests (collector, formatters, integration)
│   ├── test_convert.py          # Convert command tests
│   ├── test_filter_*.py         # Filter command tests (parser, writer, transforms, integration)
│   ├── test_relationship.py     # Relationship command tests
│   ├── test_utils.py            # Shared utility tests
│   └── test_validation/         # Validation engine tests
├── docs/
│   ├── isolated.md              # Isolated command documentation
│   ├── validate.md              # Validation error/warning codes
│   ├── stats.md                 # Stats command documentation
│   ├── stats-schema.json        # JSON schema for stats output
│   ├── search.md                # Search command documentation
│   ├── compare.md               # Compare command documentation
│   ├── duplicates.md            # Duplicates command documentation
│   ├── relationship.md          # Relationship command documentation
│   ├── export.md                # Export command documentation
│   ├── convert.md               # Convert command documentation
│   └── filter.md                # Filter command documentation
├── pyproject.toml               # Project metadata and tool config
├── README.md                    # User documentation
├── DEVELOPER.md                 # This file
└── CHANGELOG.md                 # Version history

Architecture

Shared Modules

constants.py — Shared constants used across validation and stats:

  • Exit codes (EXIT_SUCCESS, EXIT_ERROR, EXIT_USAGE_ERROR)
  • Validation thresholds (MAX_LIFESPAN, MIN_PARENT_AGE, MAX_PARENT_AGE_AT_BIRTH)
  • Sibling spacing (MIN_SIBLING_SPACING_MONTHS)
  • SEX validation (VALID_SEX_VALUES: M, F, U, X)
  • Stats thresholds (MIN_MARRIAGE_AGE, MAX_MARRIAGE_AGE, MAX_FIRST_CHILD_AGE, MAX_SPOUSAL_AGE_GAP)

utils.py — Common utilities used across all commands:

  • EncodingInfo dataclass — encoding detection results (detected, declared, BOM)
  • detect_encoding() — BOM detection + declared CHAR header parsing
  • extract_xref() — extract xref string from ged4py records
  • validate_input_file() — shared file existence/readability check
  • count_sources_recursive() — count SOUR citations at all nesting levels
  • strip_bom() — strip byte order mark from raw bytes, returning stripped bytes and BOM type
  • resolve_source_codec() — resolve encoding info to a Python codec name
  • check_output_safety() — validate output path (overwrite, same-file, directory existence)

dates.py — Date parsing and classification:

  • extract_year_from_date() — year extraction from GEDCOM date strings
  • extract_month() — month extraction for birth pattern analysis
  • classify_date_precision() — categorize dates as full/partial/approximate/missing

graph.py — Graph algorithms for family connectivity:

  • UnionFind — union-find data structure with path compression and union by rank
  • find_connected_components() — identify connected components from family member sets
  • ParentChildGraph — directed parent-child graph with spouse pairings
  • build_parent_child_graph() — construct graph from GEDCOM FAM records
  • find_ancestors(), find_descendants() — BFS traversal with depth limit
  • find_ancestors_with_depth() — BFS returning ancestor-to-depth map with truncation flag
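
The union-find structure behind find_connected_components() can be sketched as follows (illustrative only; the UnionFind in graph.py may expose a different API):

```python
# Union-find with path compression and union by rank, as described above.
class UnionFind:
    def __init__(self):
        self.parent: dict = {}
        self.rank: dict = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.rank.setdefault(x, 0)
        if self.parent[x] != x:
            # Path compression: point x directly at its root.
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:  # union by rank
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
```

Calling union() on the members of each FAM record leaves all connected relatives sharing a single root, which is how the connected components fall out.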

Command Architecture

Each command follows the same pattern: register_subcommand(subparsers) wires up argparse, and run(args) is the entry point, returning an exit code. Results are dataclasses with format_text() and format_json() methods.
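
A minimal sketch of that shared result pattern, with illustrative field names:

```python
from dataclasses import dataclass
import json

# Hedged sketch of the result-dataclass pattern: fields hold the computed
# data, and format_text()/format_json() render it for the CLI.
@dataclass
class MyResult:
    total: int
    matched: int

    def format_text(self) -> str:
        return f"Matched {self.matched} of {self.total} individuals"

    def format_json(self) -> str:
        return json.dumps({"total": self.total, "matched": self.matched})
```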

  • validate — 4-phase validation engine (see below)
  • stats — Collector/Models/Formatters pattern: StatsCollector gathers raw data into IndividualData/FamilyData models, then StatsResult computes aggregates and formats output
  • isolated — Builds family member sets from GEDCOM FAM records, runs find_connected_components() via UnionFind, reports singletons (size 1) and isolated pairs (size 2)
  • search — Query/Collector/Matcher/Relationships/Formatter pipeline: parses query syntax into terms, collects individuals with pre-computed Soundex codes, matches via substring/exact/phonetic/wildcard/regex, optionally traverses parent-child graph for ancestor/descendant queries
  • compare — Collector/Models/Phonetics/Blocker/Scorer/Dedup/Formatters pipeline: extracts individuals from two files, generates candidate pairs via multi-pass blocking, scores with weighted Jaro-Winkler, deduplicates greedily, and formats results
  • duplicates — Single-file duplicate detection reusing the compare scoring engine: self-join blocking with self-pair/symmetric filtering, single used set for greedy dedup, own formatters for single-file output
  • relationship — LCA-based relationship classification with half-detection and multi-key sort: loads individuals and parent-child graph, BFS from both endpoints, classifies via generation-pair table, detects half-blood via spouse pairing
  • export — Data extraction to CSV or JSON: collector builds ExportIndividual/ExportFamily from GEDCOM, formatters produce CSV (with optional BOM) or JSON (with meta section, alt_names, notes). Living estimation + privacy redaction via --redact-living
  • convert — Raw byte-level encoding transcoder: reads file as bytes, strips BOM, decodes with source codec, NFC-normalizes ANSEL sources, updates CHAR header, re-encodes in target codec. Supports auto-detection and --from override for non-standard codecs
  • filter — Line-level GEDCOM transformer: parses raw lines into records, applies strip transforms (record-level and line-level) and/or subtree extraction (BFS via ParentChildGraph), cleans dangling pointers, cascades empty families, serializes back preserving encoding/BOM/line endings
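
The greedy one-to-one deduplication shared by compare and duplicates can be sketched like this (function name and pair format are illustrative):

```python
# Greedy one-to-one matching sketch: consume scored pairs best-first and
# keep a pair only if neither side has already been matched.
def greedy_dedup(scored_pairs):
    """scored_pairs: iterable of (score, left_id, right_id) tuples."""
    used_left, used_right = set(), set()
    matches = []
    for score, left, right in sorted(scored_pairs, reverse=True):
        if left in used_left or right in used_right:
            continue
        used_left.add(left)
        used_right.add(right)
        matches.append((score, left, right))
    return matches
```

Because pairs are consumed best-first, each individual is matched at most once, to the highest-scoring counterpart still available.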

Validation Engine (4-Phase Design)

The validation engine (src/gedcom_tools/validation/engine.py) processes GEDCOM files in four sequential phases:

┌─────────────────────────────────────────────────────────────────┐
│                     GEDCOM File Input                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Phase 1: Encoding Detection                                    │
│  - Check BOM (Byte Order Mark)                                  │
│  - Read declared CHAR encoding from header                      │
│  - Detect ANSEL encoding (supported via ansel codec)            │
│  - Report encoding mismatches                                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Phase 2: Structure Parsing                                     │
│  - Verify HEAD/TRLR records exist                               │
│  - Collect all xref definitions (@I1@, @F1@, etc.)              │
│  - Collect all xref usages (references)                         │
│  - Extract individual/family data for semantic checks           │
│  - Build line number map for error reporting                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Phase 3: Reference Validation                                  │
│  - Unresolved references (xref used but never defined)          │
│  - Duplicate definitions (same xref defined twice)              │
│  - Orphaned records (defined but never referenced)              │
│  - Isolated individuals (no family connections)                 │
│  - Empty families (no members)                                  │
│  - Asymmetric links (one-sided FAM↔INDI cross-references)      │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  Phase 4: Semantic Validation                                   │
│  - Ancestry cycles (person is their own ancestor)               │
│  - Date logic (death before birth, etc.)                        │
│  - Age plausibility (parent too young/old, lifespan > 120)      │
│  - Marriage before birth                                        │
│  - Sibling spacing (< 9 months, excluding twins)                │
│  - Sex-role mismatch (HUSB with SEX F, WIFE with SEX M)        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     ValidationResult                            │
│  - List of issues (errors + warnings)                           │
│  - Encoding info                                                │
│  - Record counts                                                │
│  - Format as text or JSON                                       │
└─────────────────────────────────────────────────────────────────┘
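
The ancestry-cycle check in Phase 4 can be sketched as a walk up each person's parent chain; revisiting a node already on the current path means someone is their own ancestor. This is illustrative only; the implementation in semantic.py may differ:

```python
# Illustrative ancestry-cycle detection over a child -> parents mapping.
def find_cycles(parents: dict[str, list[str]]) -> set[str]:
    in_cycle: set[str] = set()

    def visit(node, path):
        if node in path:
            # Everything from the first occurrence onward lies on a cycle.
            in_cycle.update(path[path.index(node):])
            return
        for parent in parents.get(node, []):
            visit(parent, path + [node])

    for person in parents:
        visit(person, [])
    return in_cycle
```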

Design Rationale:

  1. Sequential phases - Each phase depends on data collected by previous phases. Encoding must be detected before parsing; parsing must complete before reference checking.

  2. Separation of concerns - Reference validation (reference.py) and semantic validation (semantic.py) are independent modules that don't know about each other.

  3. Quick vs Full modes - In quick mode, validation stops at the first error. In full mode, all phases run to completion, collecting all issues.

  4. Line number tracking - A byte-offset-to-line-number map is built during phase 2 to provide accurate line numbers in error messages.
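
That offset-to-line map can be sketched with a sorted list of line-start offsets plus a binary search (names are illustrative):

```python
import bisect

# Record the byte offset at which each line starts, then binary-search
# to map any byte offset back to a 1-based line number.
def build_line_starts(raw: bytes) -> list[int]:
    starts = [0]
    for i, byte in enumerate(raw):
        if byte == 0x0A:  # LF ends a line; next line starts at i + 1
            starts.append(i + 1)
    return starts

def line_number(starts: list[int], offset: int) -> int:
    return bisect.bisect_right(starts, offset)  # 1-based line number
```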

Error Codes

Error codes follow a consistent scheme in issues.py:

  • E0xx - Errors (fatal issues that indicate invalid GEDCOM)
  • W0xx - Warnings (issues that may indicate problems but aren't fatal)

The severity is automatically derived from the code prefix.
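
A minimal sketch of that derivation (the actual issues.py may use enums or richer data classes):

```python
# Illustrative: severity derived from the error-code prefix,
# so codes never need a separate severity field.
def severity_for(code: str) -> str:
    return "error" if code.startswith("E") else "warning"
```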

Test Data

  • tests/fixtures/555sample.ged — from gedcom.org, used as the primary regression test fixture.
  • tests/fixtures/royal92.ged — 3,010 individuals of European royalty, created by Denis R. Reid (1992). Used for README sample output and manual testing.
  • Most unit tests use inline GEDCOM strings via tmp_path for isolation and readability.

Running Tests

# Run all tests with coverage
pytest

# Run tests without coverage requirement (useful during development)
pytest --no-cov

# Run specific test file
pytest tests/test_cli.py -v

# Run tests matching a pattern
pytest -k "validate" -v

Coverage requirement: 95%+

Code Quality

Formatting

# Check formatting
black --check .

# Apply formatting
black .

Linting

# Check for issues
ruff check .

# Auto-fix where possible
ruff check . --fix

Type Checking

mypy src/

Security Audit

pip-audit

Adding a New Command

  1. Create a new module in src/gedcom_tools/commands/:

    # src/gedcom_tools/commands/mycommand.py
    
    def register_subcommand(subparsers):
        parser = subparsers.add_parser(
            "mycommand",
            help="Description of the command",
        )
        # Add arguments...
    
    def run(args):
        # Implementation...
        return 0  # Exit code
  2. Register it in src/gedcom_tools/cli.py:

    from gedcom_tools.commands import mycommand
    
    # In create_parser():
    mycommand.register_subcommand(subparsers)
    
    # In _dispatch_command():
    handlers = {
        "validate": validate.run,
        "stats": stats.run,
        "isolated": isolated.run,
        "search": search.run,
        "mycommand": mycommand.run,
    }
  3. Add tests in tests/test_mycommand.py

  4. Update README.md with usage documentation

Exit Codes

Code  Meaning
0     Success
1     Error (runtime error, validation failure, etc.)
2     Usage error (invalid arguments, missing command)

Dependencies

Runtime

  • ged4py - GEDCOM file parsing
  • rapidfuzz - Jaro-Winkler string similarity (used by compare command)
  • DoubleMetaphone - Double Metaphone phonetic encoding for European name matching (used by search, compare, and duplicates commands via --phonetic metaphone)

Development

  • pytest - Testing framework
  • pytest-cov - Coverage reporting
  • jsonschema - JSON schema validation (used in tests)
  • ruff - Linting
  • black - Code formatting
  • mypy - Type checking
  • pip-audit - Security vulnerability scanning

Versioning

This project follows Semantic Versioning.

See CHANGELOG.md for release history and notable changes.

When making changes:

  1. Update version in pyproject.toml and src/gedcom_tools/__init__.py
  2. Add entry to CHANGELOG.md under "Unreleased" section
  3. On release, move "Unreleased" items to new version section with date