Last Updated: 2025-12-30
RMCitecraft is an active automation agent for genealogy research. Unlike passive tools that wait for user input, RMCitecraft actively navigates genealogy websites (FamilySearch, Find a Grave), extracts comprehensive data, interprets it using AI, and synchronizes the results with the user's RootsMagic database. It transforms the role of the genealogist from "data entry clerk" to "research reviewer."
- Active Extraction: Automatically navigates to source URLs to scrape full record details, not just citations.
- AI-Powered Transcription: Uses Large Language Models (LLMs) to transcribe handwriting and interpret unstructured data from census images.
- Sidecar Database Architecture: Maintains a comprehensive census.db research log separate from the main genealogy tree, enabling deep analysis and data quality tracking.
- Crash-Proof Workflow: Robust state management ensures long-running batch processes can pause, resume, and recover from failures without data loss.
- OS: macOS (primary), Windows (planned).
- Integration: RootsMagic 8/9/10/11 (SQLite format).
- Browser: Google Chrome (via CDP/Playwright).
RMCitecraft operates as a local agent with four distinct subsystems:
- The Orchestrator (Python/NiceGUI): The central brain that manages queues, user interaction, and state.
- The Driver (Playwright/CDP): Connects to a user's existing Chrome session (with FamilySearch login) to navigate websites, manage sessions, and download media.
- The Analyst (LLM/Regex): A processing layer that parses raw HTML/Text into structured data.
- The Archivist (Database Layer): Manages census.db (transcription data) and synchronizes validated facts to the RootsMagic database.
- Ingest: Scan RootsMagic for placeholder citations (e.g., "Fed Census: 1950...").
- Queue: Add targets to the active batch session.
- Extract: The Driver navigates to the Source URL.
- Transcribe: The Analyst extracts full household data (names, ages, relationships).
- Store: Save full transcription to census.db.
- Format: Generate Evidence Explained citations and file names.
- Sync: Update RootsMagic (Citations, Media Links) and move downloaded images to the file system.
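The steps above, combined with the pause/resume requirement, could be tracked per citation as a small persisted state machine. The following is a minimal sketch, not the actual RMCitecraft implementation: the `batch_item` table, the `Step` names, and the helper functions are all hypothetical and derived only from the step list above.

```python
import sqlite3
from enum import Enum


class Step(str, Enum):
    # Hypothetical step names mirroring the workflow list above.
    QUEUED = "queued"
    EXTRACTED = "extracted"
    TRANSCRIBED = "transcribed"
    STORED = "stored"
    FORMATTED = "formatted"
    SYNCED = "synced"


ORDER = list(Step)  # Enum members iterate in definition order


def open_state(path=":memory:"):
    """Open (or create) the batch state store; the table name is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_item ("
        "citation_id INTEGER PRIMARY KEY, step TEXT NOT NULL)"
    )
    return conn


def enqueue(conn, citation_id):
    """Add a citation to the batch; re-adding an existing one is a no-op."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO batch_item VALUES (?, ?)",
            (citation_id, Step.QUEUED.value),
        )


def advance(conn, citation_id):
    """Record that the next step finished and return the new step."""
    row = conn.execute(
        "SELECT step FROM batch_item WHERE citation_id = ?", (citation_id,)
    ).fetchone()
    if row is None:
        raise KeyError(citation_id)
    current = Step(row[0])
    if current is Step.SYNCED:
        return current  # already finished
    nxt = ORDER[ORDER.index(current) + 1]
    with conn:
        conn.execute(
            "UPDATE batch_item SET step = ? WHERE citation_id = ?",
            (nxt.value, citation_id),
        )
    return nxt


def pending(conn):
    """Items to resume after a pause or crash, with the last completed step."""
    rows = conn.execute(
        "SELECT citation_id, step FROM batch_item WHERE step != ? "
        "ORDER BY citation_id",
        (Step.SYNCED.value,),
    )
    return [(cid, Step(step)) for cid, step in rows]
```

Because each completed step is committed before the next one starts, a restarted session only needs to call `pending()` to learn where every citation left off.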
Goal: Convert thousands of placeholder citations into fully sourced, media-linked records.
- Supported Records:
  - US Federal Census 1790-1950 (population schedules).
  - Slave Schedules (1850, 1860).
  - Mortality Schedules (1850-1880).
- Extraction:
  - Metadata: Year, State, County, Township, ED, Sheet/Stamp, Line.
  - Household: Full roster extraction (names, relationships, ages) for validation.
  - Hungarian algorithm for optimal person-to-RIN matching.
- Formatting:
  - Generates Footnote, Short Footnote, and Bibliography strictly adhering to Evidence Explained.
  - Year-specific templates (1850 penned pages, 1880+ stamped, 1950 stamps).
  - Special schedule formatting (slave schedules with owner attribution, mortality schedules).
- Validation:
  - 6-criterion validation: year format, census reference, sheet/stamp, ED (1880+), distinct footnote vs short footnote, all forms complete.
  - Cross-referencing extracted data with existing RootsMagic data.
  - Quality check script for batch validation across census years.
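The six validation criteria listed above could be expressed as independent boolean checks. This is a sketch under stated assumptions: the function name, the regular expressions, and the decision to accept "page" alongside "sheet" and "stamp" (for the penned-page years) are illustrative guesses, not the project's actual rules.

```python
import re


def validate_citation(year, footnote, short_footnote, bibliography):
    """Return one pass/fail flag per criterion (all patterns are assumptions)."""
    footnote = footnote or ""
    short_footnote = short_footnote or ""
    bibliography = bibliography or ""
    forms = [footnote, short_footnote, bibliography]

    return {
        # 1. Year is a decennial census year and appears in the footnote.
        "year_format": (
            1790 <= year <= 1950
            and year % 10 == 0
            and re.search(rf"\b{year}\b", footnote) is not None
        ),
        # 2. The footnote identifies itself as a census record.
        "census_reference": (
            re.search(r"\bcensus\b", footnote, re.IGNORECASE) is not None
        ),
        # 3. A sheet, stamp, or page locator followed by a value is present.
        "sheet_or_stamp": (
            re.search(r"\b(sheet|stamp|page)\s+\S+", footnote, re.IGNORECASE)
            is not None
        ),
        # 4. Enumeration district is required only from 1880 onward.
        "enumeration_district": (
            year < 1880
            or re.search(r"\bED\b|enumeration district", footnote, re.IGNORECASE)
            is not None
        ),
        # 5. The short footnote must not be a copy of the full footnote.
        "distinct_short_footnote": footnote.strip() != short_footnote.strip(),
        # 6. All three citation forms are non-empty.
        "all_forms_complete": all(f.strip() for f in forms),
    }
```

Returning a dictionary rather than a single boolean lets a batch quality check report which criterion failed for each citation.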
Goal: Automate the retrieval of burial details and memorial photos.
- Extraction: Birth/Death dates, Cemetery details, Bio text, Family links.
- Image Handling:
  - Download primary headstone photos.
  - Skip generic "flower" images.
  - Deduplicate existing media.
- Citation: Specialized templates for online memorial citations.
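One plausible way to deduplicate existing media, as the image handling list requires, is to compare content hashes instead of file names. The sketch below assumes SHA-256 hashing and hypothetical helper names; the document does not say how RMCitecraft actually detects duplicates.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Identify an image by its bytes, so renamed copies still match."""
    return hashlib.sha256(data).hexdigest()


def new_media(downloads, known_hashes):
    """Return names of downloads whose content is not already stored.

    downloads: iterable of (name, bytes) pairs.
    known_hashes: hashes of media already linked in the database.
    """
    seen = set(known_hashes)
    fresh = []
    for name, data in downloads:
        key = content_key(data)
        if key not in seen:
            seen.add(key)  # also catches duplicates within the same batch
            fresh.append(name)
    return fresh
```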
Goal: Persistent, schema-less (EAV) storage for transcription data that does not fit within RootsMagic's schema.
- Schema: Entity-Attribute-Value pattern to support varying census columns (e.g., "Radio Ownership" in 1930 vs. "Years Married" in 1900).
- Analysis: Enables SQL queries across the entire research set (e.g., "Find all neighbors of ancestors in 1940").
- Quality Control: Tracks "Confidence Score" for every extracted field.
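The Entity-Attribute-Value pattern described above stores each census column as a row, not as a table column, so 1900 and 1930 records can share one table. A minimal sketch follows; the table names, column names, and helper functions are hypothetical and do not reflect the real census.db schema.

```python
import sqlite3

# Hypothetical schema: one row per person, one row per (person, attribute).
SCHEMA = """
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    census_year INTEGER NOT NULL,
    full_name   TEXT NOT NULL
);
CREATE TABLE person_attribute (
    person_id  INTEGER NOT NULL REFERENCES person(person_id),
    attribute  TEXT NOT NULL,
    value      TEXT,
    confidence REAL NOT NULL,
    PRIMARY KEY (person_id, attribute)
);
"""


def add_person(conn, year, name, fields):
    """Insert a person; fields maps attribute -> (value, confidence)."""
    with conn:
        cur = conn.execute(
            "INSERT INTO person (census_year, full_name) VALUES (?, ?)",
            (year, name),
        )
        person_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO person_attribute VALUES (?, ?, ?, ?)",
            [(person_id, attr, val, conf) for attr, (val, conf) in fields.items()],
        )
    return person_id


def low_confidence(conn, threshold):
    """List extracted fields whose confidence score falls below threshold."""
    return conn.execute(
        "SELECT p.full_name, a.attribute, a.value "
        "FROM person p JOIN person_attribute a USING (person_id) "
        "WHERE a.confidence < ? ORDER BY p.person_id, a.attribute",
        (threshold,),
    ).fetchall()
```

The trade-off of EAV is that values are stored as text and typed queries need casts, which is usually acceptable for a research log where column sets differ by census year.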
Goal: Zero-touch file organization.
- Naming Convention: YYYY, State, County - Surname, GivenName.ext.
- Organization: Auto-sorts into folder hierarchy: RootsMagic/Records - Census/[Year] Federal/.
- Linking: Creates MultimediaTable records and links them to:
  - The Citation.
  - The Event (Census Fact).
  - The Source (optional).
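The naming convention and folder hierarchy above can be combined into a single path-building helper. This is an illustrative sketch: the function name and parameters are assumptions, and it does not sanitize characters that are invalid in file names.

```python
from pathlib import PurePosixPath


def census_media_path(year, state, county, surname, given_name, ext):
    """Build the relative destination path for a downloaded census image."""
    # File name pattern: "YYYY, State, County - Surname, GivenName.ext"
    filename = (
        f"{year}, {state}, {county} - {surname}, {given_name}.{ext.lstrip('.')}"
    )
    # Folder pattern: "RootsMagic/Records - Census/[Year] Federal/"
    folder = PurePosixPath("RootsMagic") / "Records - Census" / f"{year} Federal"
    return folder / filename
```

For a 1940 image of John Smith in Noble County, Ohio, this produces a path ending in `1940 Federal/1940, Ohio, Noble - Smith, John.jpg`.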
Goal: Browse and link extracted census data to RootsMagic persons.
- Page View Mode: Browse extracted census pages with all household members displayed.
- Census Form Rendering: Jinja2 templates render 30-line census forms matching original document layout.
- Match Suggestions: Confidence-scored candidate matches using fuzzy name matching.
- Manual RIN Linking: Enter RIN directly or select from household members.
- Link Status Indicators: Visual icons show linked (purple) vs citation-only (gray) status.
- Hybrid Citation Lookup: Finds valid citations via stored ID, RIN lookup, or location matching.
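Match suggestions pair each extracted household member with a RootsMagic person so that the total name similarity is as high as possible. The sketch below is not the Hungarian algorithm itself: it uses an exhaustive search over permutations, which finds the same optimal assignment for small households but scales factorially, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matcher the project uses. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def name_score(a, b):
    """Similarity between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_assignment(extracted, candidates):
    """Assign each extracted name to a distinct RIN, maximizing total score.

    extracted: list of names transcribed from the census page.
    candidates: dict mapping RIN -> name in RootsMagic.
    Requires len(extracted) <= len(candidates).
    Returns a list of (extracted_name, rin, score) tuples.
    """
    rins = list(candidates)
    best_total = -1.0
    best = []
    # Brute force: try every way of giving each extracted name its own RIN.
    for perm in permutations(rins, len(extracted)):
        scored = [
            (name, rin, name_score(name, candidates[rin]))
            for name, rin in zip(extracted, perm)
        ]
        total = sum(score for _, _, score in scored)
        if total > best_total:
            best_total, best = total, scored
    return best
```

The per-pair score can double as the confidence shown next to each suggestion, while the global optimization prevents two extracted names from claiming the same RIN.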
Goal: Real-time monitoring and manual intervention.
- Live Progress: Progress bars for batch operations.
- Review Queue: "Traffic light" system (Green=Auto-Approved, Yellow=Review, Red=Error).
- Manual Override: Form-based editor to correct extraction errors before syncing.
- Image Viewer: Side-by-side view of downloaded images vs. transcription data (275% zoom default).
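The "traffic light" review queue could be driven by the per-field confidence scores stored with each extraction. The following sketch assumes a single auto-approval threshold of 0.9 and treats any extraction error or empty result as Red; both rules and all names are hypothetical, since the document does not state the actual thresholds.

```python
from enum import Enum


class ReviewStatus(Enum):
    GREEN = "auto-approved"
    YELLOW = "review"
    RED = "error"


def classify(field_confidences, error=None, auto_approve_at=0.9):
    """Map one extracted record to a traffic-light status.

    field_confidences: dict of field name -> confidence in [0, 1].
    error: an error message if extraction failed, else None.
    auto_approve_at: assumed threshold; every field must reach it for Green.
    """
    if error is not None or not field_confidences:
        return ReviewStatus.RED
    if min(field_confidences.values()) >= auto_approve_at:
        return ReviewStatus.GREEN
    return ReviewStatus.YELLOW
```

Using the minimum rather than the average means a single doubtful field is enough to send the record to manual review.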
- Language: Python 3.11+.
- Package Manager: uv (strict requirement).
- UI Framework: NiceGUI (Native Mode via PyWebView).
- Browser Automation: Playwright (connected to Chrome via CDP).
- Database: SQLite with ICU extension (required for RMNOCASE collation).
- LLM Integration: LangChain (supporting Anthropic/OpenAI).
- No-Write Default: Database connections default to Read-Only.
- Atomic Transactions: All writes wrapped in transaction blocks.
- Backup Check: Warn user if no recent backup is detected (future).
- Version Pinning: Validate RootsMagic database version before connecting.
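The read-only default and atomic write rules above map directly onto features of Python's standard `sqlite3` module. This is a minimal sketch with hypothetical function names: it opens the database through a SQLite URI with `mode=ro` unless write access is requested, and wraps writes in the connection's transaction context manager. It does not load the ICU extension or check the RootsMagic version.

```python
import sqlite3
from pathlib import Path


def connect(db_path, writable=False):
    """Open a SQLite database, read-only unless writable=True."""
    mode = "rw" if writable else "ro"
    # as_uri() yields a file: URI; mode=ro makes SQLite reject all writes.
    uri = f"{Path(db_path).resolve().as_uri()}?mode={mode}"
    return sqlite3.connect(uri, uri=True)


def apply_updates(conn, statements):
    """Run all statements in one transaction: all succeed or none apply.

    statements: iterable of (sql, params) pairs.
    """
    with conn:  # commits on success, rolls back if any statement raises
        for sql, params in statements:
            conn.execute(sql, params)
```

A write attempted on a read-only connection raises `sqlite3.OperationalError`, so accidental modification of the genealogy file fails loudly and leaves the data unchanged.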
- Extraction: < 5 seconds per page (cached).
- Batch Throughput: Process > 500 citations/hour (unattended).
- Startup: < 2 seconds to interactive UI.
- Establish census.db sidecar architecture.
- Implement robust batch processing state machine.
- Deprecate passive file monitoring and Chrome Extension.
- Finalize Playwright/CDP transition for all extractors.
- Census Extraction Viewer with form rendering.
- 6-criterion citation validation.
- Slave and mortality schedule support.
- LLM integration for census image transcription.
- Hungarian algorithm for household member matching.
- "Household Reconstruction": Automatically creating missing family members in RootsMagic based on census.db data.
- Improved handling of "hard to read" handwritten records.
- Support for State Census records (NY, IA, KS, etc.).
- Generalized "Generic Source" extractor for sites like Ancestry or MyHeritage (long-term).
- Agent: The autonomous process performing tasks.
- CDP (Chrome DevTools Protocol): The interface used to drive the browser.
- EAV (Entity-Attribute-Value): Database pattern for flexible schema.
- ED (Enumeration District): Geographic subdivision used for census taking (1880+).
- Evidence Explained: The citation style guide by Elizabeth Shown Mills; the standard for genealogical citations.
- Hungarian Algorithm: Optimal assignment algorithm used for matching extracted household members to RootsMagic persons.
- RIN (Record Identification Number): Unique identifier for a person in RootsMagic.
- RMNOCASE: Custom collation used by RootsMagic; requires ICU extension.
- Sidecar: The census.db database living alongside the main .rmtree file.