Duplicates Command

The duplicates command scans a single GEDCOM file for potential duplicate individuals using the same probabilistic record linkage engine as the compare command.

Usage

gedcom-tools duplicates <file> [options]

Options

| Option | Description |
|---|---|
| --format {text,json} | Output format (default: text) |
| -v, --verbose | Show timing and per-field scores |
| -q, --quiet | One-line summary |
| --certain-threshold F | Minimum score for certain duplicate (default: 0.85) |
| --probable-threshold F | Minimum score for probable duplicate (default: 0.65) |
| --show-matches {all,certain,probable} | Which matches to show (default: all) |
| --limit N | Max items per output section (text default: 50, JSON default: unlimited) |
| --reject-sex-mismatch | Treat sex mismatches as hard reject (score 0.0) |
| --phonetic {soundex,metaphone} | Phonetic algorithm for blocking and scoring (default: soundex) |

How It Works

The command runs in three phases:

  1. Read the file and extract individuals (name, dates, places, sex)
  2. Find duplicates: multi-pass blocking, weighted Jaro-Winkler scoring, greedy one-to-one deduplication
  3. Format results
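The blocking step in phase 2 can be pictured with a minimal sketch. It assumes each individual is a dict with the xref, surname and birth_year keys seen in the JSON output; the two key functions are hypothetical stand-ins, not the blocking keys the tool actually uses. Each pass groups individuals by one key, and only individuals sharing a block become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def by_surname(person):
    # Hypothetical blocking key: case-folded surname.
    surname = person.get("surname")
    return surname.lower() if surname else None


def by_birth_year(person):
    # Hypothetical blocking key: exact birth year.
    return person.get("birth_year")


def candidate_pairs(individuals, key_funcs):
    """Union of within-block pairs over several blocking passes."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for person in individuals:
            key = key_func(person)
            if key is not None:
                blocks[key].append(person["xref"])
        for xrefs in blocks.values():
            # combinations() over sorted xrefs yields each unordered pair once.
            for a, b in combinations(sorted(xrefs), 2):
                pairs.add((a, b))
    return pairs
```

Only the candidate pairs are scored, which keeps the work well below comparing every individual with every other.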

Differences from Compare

The compare command matches individuals across two different files. The duplicates command matches individuals within a single file:

| Aspect | Compare | Duplicates |
|---|---|---|
| Files | Two input files (A, B) | Single input file |
| Self-pairs | Not possible (different files) | Filtered out (@I1@ vs @I1@) |
| Symmetric pairs | Not possible (A→B direction) | Collapsed ((@I1@, @I2@) = (@I2@, @I1@)) |
| Deduplication | Separate used_a/used_b sets | Single used set (each individual in at most one pair) |
| Unique listing | --list-unique shows unmatched | Not applicable |

All other aspects — scoring, blocking, classification, thresholds — are identical.
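As an illustration of the self-pair and symmetric-pair rows, the sketch below normalizes raw pairs into a canonical form. The function names are hypothetical and the tool may handle this differently internally; the point is only that ordering the two xrefs makes (@I1@, @I2@) and (@I2@, @I1@) the same key.

```python
def canonical_pair(xref_a, xref_b):
    """Return an order-independent key for a pair, or None for a self-pair."""
    if xref_a == xref_b:
        return None  # an individual is never a duplicate of itself
    return (xref_a, xref_b) if xref_a < xref_b else (xref_b, xref_a)


def unique_pairs(raw_pairs):
    """Drop self-pairs and collapse symmetric duplicates."""
    seen = set()
    for a, b in raw_pairs:
        pair = canonical_pair(a, b)
        if pair is not None:
            seen.add(pair)
    return seen
```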

Scoring

The duplicates command uses the same weighted Jaro-Winkler scoring as compare, applied across 7 fields. See Compare: Scoring Approach for field weights, string similarity, phonetic bonus, year proximity bands, and place comparison details.

The --phonetic metaphone option uses Double Metaphone for blocking and scoring, improving recall for European name variants. See Compare: Multi-Pass Blocking for details.
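For readers unfamiliar with the string measure, here is a compact sketch of the textbook Jaro-Winkler similarity. It is not taken from the project's code: the prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults and are assumed here, and the field weights and phonetic bonus described in the compare documentation are not modelled.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (Winkler adjustment)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

The comparison is case-sensitive as written, so callers would typically lowercase both names first.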

Classification

| Classification | Criteria |
|---|---|
| Certain | Score >= certain_threshold AND >= 4 comparable fields |
| Probable | Score >= probable_threshold |
| Non-match | Score < probable_threshold (not shown) |

The insufficient_data flag is set when fewer than 3 comparable fields exist or when no corroborating fields (dates, places) were compared. These matches are annotated with (low confidence) in text output.
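The rules above can be sketched as a small function. This is an illustration under stated assumptions, not the tool's implementation: field_scores is assumed to contain only the fields that were actually comparable, the set of corroborating field names (including "Death Place") is a guess based on the labels in the output examples, and a high score with fewer than 4 comparable fields is assumed to fall back to probable.

```python
# Assumed names for the corroborating (date and place) fields.
CORROBORATING_FIELDS = {"Birth Year", "Death Year", "Birth Place", "Death Place"}


def classify(score, field_scores, certain_threshold=0.85, probable_threshold=0.65):
    """Return (classification, insufficient_data) for one scored pair."""
    comparable = len(field_scores)
    if score >= certain_threshold and comparable >= 4:
        label = "certain"
    elif score >= probable_threshold:
        label = "probable"
    else:
        label = "non-match"
    has_corroboration = any(name in CORROBORATING_FIELDS for name in field_scores)
    insufficient_data = comparable < 3 or not has_corroboration
    return label, insufficient_data
```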

Greedy Deduplication

Pairs are sorted by descending score. Each individual can appear in at most one matched pair. Once an individual is claimed by a higher-scoring pair, it is excluded from lower-scoring pairs.

Note: transitive duplicates are therefore not fully reported. If I1↔I2 scores 0.92 and I1↔I3 scores 0.88, only I1↔I2 is shown and I3 remains unmatched. Use --limit 0 and inspect the probable section for additional leads.
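A minimal sketch of the greedy one-to-one selection, assuming scored pairs arrive as (xref_a, xref_b, score) tuples; the function name and tuple layout are illustrative rather than the tool's internal API.

```python
def greedy_dedup(scored_pairs):
    """Keep the best-scoring pairs so that each individual appears at most once."""
    used = set()  # single set, since both sides come from the same file
    kept = []
    for a, b, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if a in used or b in used:
            continue  # one side was already claimed by a higher-scoring pair
        used.update((a, b))
        kept.append((a, b, score))
    return kept
```

With the I1, I2, I3 example from the note, the 0.88 pair is dropped because I1 was already claimed by the 0.92 pair.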

Output

Text Output

File: family.ged

=== Duplicate Scan Summary ===
  Individuals scanned:   500
  Certain duplicates:      3
  Probable duplicates:     5

=== Certain Duplicates (3) ===
  John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@]  score: 0.95
    Birth Place: "London, England" vs "London, Middlesex, England"

  Mary Jones (1872-1945) [@I3@] ↔ Maria Jones (1873-1945) [@I88@]  score: 0.91
    Given Name: "Mary" vs "Maria"
    Birth Year: "1872" vs "1873"

  William Brown (1900-?) [@I10@] ↔ Wm Brown (1900-1965) [@I55@]  score: 0.88
    Given Name: "William" vs "Wm"
    Death Year: "None" vs "1965"

=== Probable Duplicates (5) ===
  ...

When --show-matches certain is used, the probable section is hidden (summary counts still reflect the full scan). When --show-matches probable is used, the certain section is hidden. Verbose mode adds a per-field score breakdown below each match.

Text Output (Quiet)

Single line:

3 certain, 5 probable

Text Output (Verbose)

Verbose mode shows per-field scores and sex penalty (if applicable):

  John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@]  score: 0.95
    Birth Place: "London, England" vs "London, Middlesex, England"
    [Scores: Surname 1.00, Given Name 1.00, Birth Year 1.00, Death Year 1.00, Birth Place 0.85, Sex 1.00]

When a sex mismatch penalty is applied:

    [Scores: Surname 1.00, Given Name 0.90, Birth Year 1.00, Sex mismatch ×0.70]
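The effect of the sex rule can be sketched as follows. The 0.70 multiplier comes from the verbose example and the 0.0 hard reject from --reject-sex-mismatch; the assumptions here are that the multiplier applies to the combined score and that a missing sex value on either side is not treated as a mismatch.

```python
def apply_sex_rule(base_score, sex_a, sex_b, reject_sex_mismatch=False, penalty=0.70):
    """Adjust a combined score for a sex mismatch (illustrative only)."""
    if sex_a is None or sex_b is None or sex_a == sex_b:
        return base_score  # no mismatch, or not enough data to tell
    if reject_sex_mismatch:
        return 0.0  # hard reject
    return base_score * penalty
```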

JSON Output

{
  "file": "family.ged",
  "encoding": {
    "detected": "UTF-8",
    "has_bom": false,
    "declared": "UTF-8"
  },
  "total_individuals": 500,
  "certain_duplicates": [
    {
      "individual_a": {
        "xref": "@I1@",
        "name": "John Smith",
        "given_name": "John",
        "surname": "Smith",
        "sex": "M",
        "birth_year": 1850,
        "birth_place": "London, England",
        "death_year": 1920,
        "death_place": null
      },
      "individual_b": {
        "xref": "@I42@",
        "name": "John Smith",
        "given_name": "John",
        "surname": "Smith",
        "sex": "M",
        "birth_year": 1850,
        "birth_place": "London, Middlesex, England",
        "death_year": 1920,
        "death_place": null
      },
      "score": 0.95,
      "classification": "certain",
      "field_scores": {
        "Surname": 1.0,
        "Given Name": 1.0,
        "Birth Year": 1.0,
        "Death Year": 1.0,
        "Birth Place": 0.85,
        "Sex": 1.0
      },
      "differences": [
        {
          "field": "Birth Place",
          "value_a": "London, England",
          "value_b": "London, Middlesex, England"
        }
      ]
    }
  ],
  "certain_duplicates_total": 3,
  "probable_duplicates": [],
  "probable_duplicates_total": 5
}

The *_total fields reflect the full count before any --limit truncation or --show-matches filtering, so consumers always know the complete picture.

The insufficient_data key is only present when true:

{
  "score": 0.72,
  "classification": "probable",
  "insufficient_data": true,
  ...
}
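A consumer of the JSON report might summarize it along these lines. The sketch relies only on keys shown in the examples above; the summarize helper itself is hypothetical. It compares the listed matches with the *_total fields to detect truncation or filtering, and reads insufficient_data with a default because the key is omitted when false.

```python
import json


def summarize(report_text):
    """Summarize shown vs. total matches per section of a duplicates report."""
    report = json.loads(report_text)
    summary = {}
    for kind in ("certain", "probable"):
        shown = report.get(f"{kind}_duplicates", [])
        total = report.get(f"{kind}_duplicates_total", len(shown))
        low_confidence = sum(1 for m in shown if m.get("insufficient_data", False))
        summary[kind] = {
            "shown": len(shown),
            "total": total,
            "hidden": total - len(shown),  # removed by --limit or --show-matches
            "low_confidence": low_confidence,
        }
    return summary
```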

Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error during processing |
| 2 | Usage error (file not found, invalid thresholds) |
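When scripting the command, the exit code can be checked before parsing the output. The sketch below assumes gedcom-tools is on PATH; the helper names are hypothetical, and only the documented invocation and exit codes are used.

```python
import subprocess

EXIT_MEANINGS = {
    0: "success",
    1: "error during processing",
    2: "usage error",
}


def describe_exit(code):
    """Map a documented exit code to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_duplicates(path):
    """Run the duplicates command with JSON output; return (code, stdout, stderr)."""
    proc = subprocess.run(
        ["gedcom-tools", "duplicates", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

A caller would parse stdout as JSON only when the returned code is 0, and otherwise report describe_exit(code) together with stderr.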

Known Limitations

  • Greedy deduplication is one-to-one: for a transitive chain (I1↔I2, I2↔I3), only the highest-scoring pair is reported; the third individual remains unmatched
  • No cluster mode: related duplicates are not grouped into transitive sets
  • No family context: matches are field-level only; shared parents/children are not considered as corroborating evidence
  • Blocking may miss pairs with no shared blocking key (rare with 5 passes)
  • Large blocks (500+ individuals sharing a blocking key) are silently capped to avoid quadratic blowup

Related Commands

  • compare — match individuals across two different files
  • search — find individuals using flexible query syntax
  • isolated — find unconnected individuals within a single file