Duplicates Command

The duplicates command scans a single GEDCOM file for potential duplicate individuals using the same probabilistic record linkage engine as the compare command.

Usage

gedcom-tools duplicates <file> [options]

Options

Option	Description
`--format {text,json}`	Output format (default: text)
`-v, --verbose`	Show timing and per-field scores
`-q, --quiet`	One-line summary
`--certain-threshold F`	Minimum score for certain duplicate (default: 0.85)
`--probable-threshold F`	Minimum score for probable duplicate (default: 0.65)
`--show-matches {all,certain,probable}`	Which matches to show (default: all)
`--limit N`	Max items per output section (text default: 50, JSON default: unlimited)
`--reject-sex-mismatch`	Treat sex mismatches as hard reject (score 0.0)
`--phonetic {soundex,metaphone}`	Phonetic algorithm for blocking and scoring (default: soundex)

How It Works

The command runs in three phases:

1. Read the file and extract individuals (name, dates, places, sex)
2. Find duplicates: multi-pass blocking, weighted Jaro-Winkler scoring, greedy one-to-one deduplication
3. Format results

Differences from Compare

The compare command matches individuals across two different files. The duplicates command matches individuals within a single file:

Aspect	Compare	Duplicates
Files	Two input files (A, B)	Single input file
Self-pairs	Not possible (different files)	Filtered out (`@I1@` vs `@I1@`)
Symmetric pairs	Not possible (A→B direction)	Collapsed (`(@I1@, @I2@)` = `(@I2@, @I1@)`)
Deduplication	Separate `used_a`/`used_b` sets	Single `used` set (each individual in at most one pair)
Unique listing	`--list-unique` shows unmatched	Not applicable

All other aspects — scoring, blocking, classification, thresholds — are identical.
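
The self-pair and symmetric-pair handling in the table can be illustrated with a minimal sketch. It assumes individuals are held in a dict keyed by xref and that each blocking pass is a function returning a key; the two passes shown (exact surname, birth year) are hypothetical stand-ins, not the tool's actual blocking keys.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(individuals, key_funcs):
    """Return unordered candidate pairs of xrefs that share a blocking key.

    individuals: dict mapping xref -> record
    key_funcs:   one function per blocking pass, record -> key (or None)
    """
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for xref, record in individuals.items():
            key = key_func(record)
            if key is not None:
                blocks[key].append(xref)
        for members in blocks.values():
            # combinations() never pairs an item with itself, so self-pairs
            # such as (@I1@, @I1@) cannot occur.
            for a, b in combinations(sorted(members), 2):
                # Sorted order makes (@I1@, @I2@) and (@I2@, @I1@) the same
                # tuple, so a pair found in several passes is stored once.
                pairs.add((a, b))
    return pairs


people = {
    "@I1@": {"surname": "Smith", "birth_year": 1850},
    "@I2@": {"surname": "Smith", "birth_year": 1850},
    "@I3@": {"surname": "Jones", "birth_year": 1850},
}
passes = [
    lambda r: r["surname"].lower(),  # hypothetical pass 1: exact surname
    lambda r: r["birth_year"],       # hypothetical pass 2: birth year
]
print(sorted(candidate_pairs(people, passes)))
```

Here `@I1@`/`@I2@` share a key in both passes but appear only once in the result, because the pair is stored as a sorted tuple in a set.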

Scoring

Uses the same weighted Jaro-Winkler scoring as compare across 7 fields. See Compare: Scoring Approach for field weights, string similarity, phonetic bonus, year proximity bands, and place comparison details.

The `--phonetic metaphone` option uses Double Metaphone for blocking and scoring, improving recall for European name variants. See Compare: Multi-Pass Blocking for details.

Classification

Classification	Criteria
Certain	Score >= certain_threshold AND >= 4 comparable fields
Probable	Score >= probable_threshold
Non-match	Score < probable_threshold (not shown)

The `insufficient_data` flag is set when fewer than 3 comparable fields exist or when no corroborating fields (dates, places) were compared. These matches are annotated with `(low confidence)` in text output.
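
The rules above can be read as a small decision function. The sketch below is an interpretation, not the tool's code: the function name, the field-name strings, and the inclusion of "Death Place" among corroborating fields are assumptions. It also assumes that a pair scoring above the certain threshold with fewer than 4 comparable fields falls back to probable.

```python
# Fields treated as corroborating evidence (dates, places). The exact set
# is an assumption based on the field names shown in the output samples.
CORROBORATING = {"Birth Year", "Death Year", "Birth Place", "Death Place"}


def classify(score, compared_fields,
             certain_threshold=0.85, probable_threshold=0.65):
    """Return (classification, insufficient_data) for one scored pair."""
    n = len(compared_fields)
    if score >= certain_threshold and n >= 4:
        label = "certain"
    elif score >= probable_threshold:
        label = "probable"
    else:
        label = "non-match"
    # Flag pairs with too few fields or with no date/place evidence.
    insufficient = n < 3 or not (set(compared_fields) & CORROBORATING)
    return label, insufficient


full = ["Surname", "Given Name", "Birth Year", "Death Year", "Birth Place", "Sex"]
print(classify(0.95, full))
print(classify(0.90, ["Surname", "Given Name", "Sex"]))
```

The second call shows a high-scoring pair that is still only probable and is flagged as low confidence, because nothing beyond names and sex was compared.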

Greedy Deduplication

Pairs are sorted by descending score. Each individual can appear in at most one matched pair. Once an individual is claimed by a higher-scoring pair, it is excluded from lower-scoring pairs.

Note: transitive duplicates are therefore not fully reported. If I1↔I2 scores 0.92 and I1↔I3 scores 0.88, only I1↔I2 is shown and I3 remains unmatched. Use `--limit 0` and inspect the probable section for additional leads.
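
A minimal sketch of this greedy selection follows, assuming scored pairs arrive as `(xref_a, xref_b, score)` tuples. The function name and the tie-breaking behaviour (input order, via a stable sort) are assumptions.

```python
def greedy_one_to_one(scored_pairs):
    """Keep the best-scoring pairs so that each individual appears at most once."""
    used = set()   # single set: both sides of a pair come from the same file
    kept = []
    for a, b, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if a in used or b in used:
            continue  # one side was already claimed by a higher-scoring pair
        used.update((a, b))
        kept.append((a, b, score))
    return kept


pairs = [
    ("@I1@", "@I3@", 0.88),
    ("@I1@", "@I2@", 0.92),
    ("@I4@", "@I5@", 0.70),
]
print(greedy_one_to_one(pairs))
# [('@I1@', '@I2@', 0.92), ('@I4@', '@I5@', 0.7)]
```

The `@I1@`/`@I3@` pair is dropped because `@I1@` was already claimed, which is exactly the transitive case described in the note.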

Output

Text Output

File: family.ged

=== Duplicate Scan Summary ===
  Individuals scanned:   500
  Certain duplicates:      3
  Probable duplicates:     5

=== Certain Duplicates (3) ===
  John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@]  score: 0.95
    Birth Place: "London, England" vs "London, Middlesex, England"

  Mary Jones (1872-1945) [@I3@] ↔ Maria Jones (1873-1945) [@I88@]  score: 0.91
    Given Name: "Mary" vs "Maria"
    Birth Year: "1872" vs "1873"

  William Brown (1900-?) [@I10@] ↔ Wm Brown (1900-1965) [@I55@]  score: 0.88
    Given Name: "William" vs "Wm"
    Death Year: "None" vs "1965"

=== Probable Duplicates (5) ===
  ...

When `--show-matches certain` is used, the probable section is hidden (summary counts still reflect the full scan). When `--show-matches probable` is used, the certain section is hidden. Verbose mode adds a per-field score breakdown below each match.

Text Output (Quiet)

Single line:

3 certain, 5 probable

Text Output (Verbose)

Verbose mode shows per-field scores and sex penalty (if applicable):

  John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@]  score: 0.95
    Birth Place: "London, England" vs "London, Middlesex, England"
    [Scores: Surname 1.00, Given Name 1.00, Birth Year 1.00, Death Year 1.00, Birth Place 0.85, Sex 1.00]

When a sex mismatch penalty is applied:

    [Scores: Surname 1.00, Given Name 0.90, Birth Year 1.00, Sex mismatch ×0.70]
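
One way to read the `×0.70` annotation is as a multiplier applied after the per-field scores are combined. The sketch below makes that assumption explicit. The weights are hypothetical (the real ones are documented under Compare: Scoring Approach), and normalising over the compared fields is also an assumption.

```python
def combine_scores(field_scores, weights, sex_mismatch=False,
                   reject_sex_mismatch=False, sex_penalty=0.70):
    """Weighted average of per-field scores with an optional sex-mismatch penalty."""
    if sex_mismatch and reject_sex_mismatch:
        return 0.0  # --reject-sex-mismatch: hard reject
    total_weight = sum(weights[f] for f in field_scores)
    if total_weight == 0:
        return 0.0
    score = sum(weights[f] * s for f, s in field_scores.items()) / total_weight
    if sex_mismatch:
        score *= sex_penalty  # soft penalty, as in "Sex mismatch ×0.70"
    return score


# Hypothetical weights, for illustration only.
weights = {"Surname": 3.0, "Given Name": 2.0, "Birth Year": 2.0}
scores = {"Surname": 1.00, "Given Name": 0.90, "Birth Year": 1.00}

print(round(combine_scores(scores, weights), 3))
print(round(combine_scores(scores, weights, sex_mismatch=True), 3))
print(combine_scores(scores, weights, sex_mismatch=True, reject_sex_mismatch=True))
```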

JSON Output

{
  "file": "family.ged",
  "encoding": {
    "detected": "UTF-8",
    "has_bom": false,
    "declared": "UTF-8"
  },
  "total_individuals": 500,
  "certain_duplicates": [
    {
      "individual_a": {
        "xref": "@I1@",
        "name": "John Smith",
        "given_name": "John",
        "surname": "Smith",
        "sex": "M",
        "birth_year": 1850,
        "birth_place": "London, England",
        "death_year": 1920,
        "death_place": null
      },
      "individual_b": {
        "xref": "@I42@",
        "name": "John Smith",
        "given_name": "John",
        "surname": "Smith",
        "sex": "M",
        "birth_year": 1850,
        "birth_place": "London, Middlesex, England",
        "death_year": 1920,
        "death_place": null
      },
      "score": 0.95,
      "classification": "certain",
      "field_scores": {
        "Surname": 1.0,
        "Given Name": 1.0,
        "Birth Year": 1.0,
        "Death Year": 1.0,
        "Birth Place": 0.85,
        "Sex": 1.0
      },
      "differences": [
        {
          "field": "Birth Place",
          "value_a": "London, England",
          "value_b": "London, Middlesex, England"
        }
      ]
    }
  ],
  "certain_duplicates_total": 3,
  "probable_duplicates": [],
  "probable_duplicates_total": 5
}

The `*_total` fields reflect the full count before any `--limit` truncation or `--show-matches` filtering, so consumers always know the complete picture.

The `insufficient_data` key is only present when true:

{
  "score": 0.72,
  "classification": "probable",
  "insufficient_data": true,
  ...
}
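
A consumer can combine the two conventions above: compare each list length with its `*_total` to detect truncated or filtered output, and read `insufficient_data` with a default of false. The following is a minimal sketch; the `summarize` helper and the trimmed sample document are illustrative, not part of the tool.

```python
import json

# Trimmed sample in the shape shown above (most fields omitted).
SAMPLE = """
{
  "certain_duplicates": [{"score": 0.95, "classification": "certain"}],
  "certain_duplicates_total": 3,
  "probable_duplicates": [
    {"score": 0.72, "classification": "probable", "insufficient_data": true}
  ],
  "probable_duplicates_total": 5
}
"""


def summarize(report):
    """Count listed, total, hidden, and low-confidence matches per section."""
    summary = {}
    for kind in ("certain", "probable"):
        listed = report.get(f"{kind}_duplicates", [])
        total = report.get(f"{kind}_duplicates_total", len(listed))
        summary[kind] = {
            "listed": len(listed),
            "total": total,
            "hidden": total - len(listed),  # cut by --limit or --show-matches
            # The key is absent unless true, so default to False.
            "low_confidence": sum(
                1 for m in listed if m.get("insufficient_data", False)
            ),
        }
    return summary


summary = summarize(json.loads(SAMPLE))
print(summary)
```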

Exit Codes

Code	Meaning
0	Success
1	Error during processing
2	Usage error (file not found, invalid thresholds)
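
When calling the command from a script, the exit code can be mapped back to the table above. This sketch assumes `gedcom-tools` is on `PATH`. The helper names are hypothetical, and which stream carries error messages is not specified here, so both are returned.

```python
import subprocess

EXIT_MEANINGS = {
    0: "success",
    1: "error during processing",
    2: "usage error",
}


def describe_exit(code):
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_duplicates(path, extra_args=()):
    """Run the duplicates command with JSON output.

    Returns (exit_code, stdout, stderr). Assumes gedcom-tools is installed.
    """
    result = subprocess.run(
        ["gedcom-tools", "duplicates", path, "--format", "json", *extra_args],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


print(describe_exit(2))
```

A caller would typically parse `stdout` as JSON only when the exit code is 0.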

Known Limitations

- Greedy deduplication is one-to-one: transitive chains (I1↔I2, I2↔I3) only report the highest-scoring pair; the third individual remains unmatched
- No cluster mode: related duplicates are not grouped into transitive sets
- No family context: matches are field-level only; shared parents/children are not considered as corroborating evidence
- Blocking may miss pairs with no shared blocking key (rare with 5 passes)
- Large blocks (500+ individuals sharing a blocking key) are silently capped to avoid quadratic blowup

Related Commands

- compare: match individuals across two different files
- search: find individuals using flexible query syntax
- isolated: find unconnected individuals within a single file
