The duplicates command scans a single GEDCOM file for potential duplicate
individuals using the same probabilistic record linkage engine as the
compare command.
`gedcom-tools duplicates <file> [options]`

| Option | Description |
|---|---|
| --format {text,json} | Output format (default: text) |
| -v, --verbose | Show timing and per-field scores |
| -q, --quiet | One-line summary |
| --certain-threshold F | Minimum score for certain duplicate (default: 0.85) |
| --probable-threshold F | Minimum score for probable duplicate (default: 0.65) |
| --show-matches {all,certain,probable} | Which matches to show (default: all) |
| --limit N | Max items per output section (text default: 50, JSON default: unlimited) |
| --reject-sex-mismatch | Treat sex mismatches as hard reject (score 0.0) |
| --phonetic {soundex,metaphone} | Phonetic algorithm for blocking and scoring (default: soundex) |

The command runs in three phases:
- Read the file and extract individuals (name, dates, places, sex)
- Find duplicates: multi-pass blocking, weighted Jaro-Winkler scoring, greedy one-to-one deduplication
- Format results
The compare command matches individuals across two different files. The
duplicates command matches individuals within a single file:
| Aspect | Compare | Duplicates |
|---|---|---|
| Files | Two input files (A, B) | Single input file |
| Self-pairs | Not possible (different files) | Filtered out (@I1@ vs @I1@) |
| Symmetric pairs | Not possible (A→B direction) | Collapsed ((@I1@, @I2@) = (@I2@, @I1@)) |
| Deduplication | Separate used_a/used_b sets | Single used set (each individual in at most one pair) |
| Unique listing | --list-unique shows unmatched | Not applicable |

All other aspects — scoring, blocking, classification, thresholds — are identical.
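
The self-pair and symmetric-pair handling in the table can be pictured with a small Python sketch. The function names `canonical_pair` and `unique_pairs` are hypothetical and are not taken from the gedcom-tools source; the sketch only assumes that individuals are identified by their xref strings.

```python
def canonical_pair(xref_a, xref_b):
    """Return an order-independent key for a pair, or None for a self-pair."""
    if xref_a == xref_b:
        return None  # @I1@ vs @I1@ is filtered out
    return tuple(sorted((xref_a, xref_b)))


def unique_pairs(candidates):
    """Drop self-pairs and collapse (a, b) / (b, a) into a single entry."""
    seen = set()
    result = []
    for a, b in candidates:
        key = canonical_pair(a, b)
        if key is None or key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result


candidates = [
    ("@I1@", "@I1@"),  # self-pair
    ("@I1@", "@I2@"),
    ("@I2@", "@I1@"),  # mirror of the previous pair
    ("@I2@", "@I3@"),
]
print(unique_pairs(candidates))
# → [('@I1@', '@I2@'), ('@I2@', '@I3@')]
```

Sorting the two xrefs gives every unordered pair exactly one representation, so a plain set is enough to detect mirrors.
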
The duplicates command uses the same weighted Jaro-Winkler scoring as compare, across 7 fields.
See Compare: Scoring Approach for field weights,
string similarity, phonetic bonus, year proximity bands, and place comparison
details.
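
As a rough illustration of how per-field similarities might be combined, here is a Python sketch of a weighted mean followed by the sex adjustment. The weights below are invented for the example (the real ones are in Compare: Scoring Approach), the function name `combine_scores` is hypothetical, and treating sex purely as a multiplier is a simplification. The 0.70 multiplier is the one shown in verbose output, and the hard reject mirrors --reject-sex-mismatch.

```python
# Hypothetical weights, for illustration only. The actual values are
# documented in Compare: Scoring Approach.
WEIGHTS = {
    "Surname": 3.0,
    "Given Name": 2.0,
    "Birth Year": 2.0,
    "Death Year": 1.0,
    "Birth Place": 1.0,
}

SEX_MISMATCH_PENALTY = 0.70  # multiplier shown in verbose output


def combine_scores(field_scores, sex_a, sex_b, reject_sex_mismatch=False):
    """Weighted mean over comparable fields, then the sex adjustment.

    field_scores maps a field name to a similarity in [0, 1], or to None
    when the field could not be compared (assumed to be skipped).
    """
    comparable = {
        field: score
        for field, score in field_scores.items()
        if score is not None and field in WEIGHTS
    }
    if not comparable:
        return 0.0
    total_weight = sum(WEIGHTS[field] for field in comparable)
    score = sum(WEIGHTS[f] * s for f, s in comparable.items()) / total_weight
    if sex_a and sex_b and sex_a != sex_b:
        if reject_sex_mismatch:
            return 0.0  # hard reject
        score *= SEX_MISMATCH_PENALTY
    return score


scores = {"Surname": 1.0, "Given Name": 0.90, "Birth Year": 1.0, "Death Year": None}
print(round(combine_scores(scores, "M", "F"), 2))
# → 0.68
```
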
The --phonetic metaphone option uses Double Metaphone for blocking and scoring,
improving recall for European name variants. See
Compare: Multi-Pass Blocking for details.
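
To give a feel for phonetic blocking, the sketch below implements standard American Soundex and uses it as a single blocking key on the surname. This is only one illustrative pass: the actual passes and keys are described in Compare: Multi-Pass Blocking, and the names `soundex` and `candidate_pairs` are hypothetical rather than taken from the tool.

```python
from collections import defaultdict
from itertools import combinations

_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (out + "000")[:4]


def candidate_pairs(individuals):
    """Group (xref, surname) records by Soundex key and pair within each block."""
    blocks = defaultdict(list)
    for xref, surname in individuals:
        key = soundex(surname)
        if key:
            blocks[key].append(xref)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


people = [("@I1@", "Smith"), ("@I42@", "Smyth"), ("@I3@", "Jones"), ("@I88@", "Johns")]
print(candidate_pairs(people))
# → [('@I1@', '@I42@'), ('@I3@', '@I88@')]
```

Only pairs that share a key are scored, which is what keeps the comparison count well below the full n × (n − 1) / 2.
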
| Classification | Criteria |
|---|---|
| Certain | Score >= certain_threshold AND >= 4 comparable fields |
| Probable | Score >= probable_threshold |
| Non-match | Score < probable_threshold (not shown) |
The insufficient_data flag is set when fewer than 3 comparable fields exist or
when no corroborating fields (dates, places) were compared. These matches are
annotated with (low confidence) in text output.
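
The classification rules and the insufficient_data flag can be summarized in a few lines of Python. This is a sketch under stated assumptions: `classify` is a hypothetical name, the field labels follow the ones shown in the sample output, and "Death Place" is assumed to be the label for the death place field.

```python
# Fields treated as corroborating evidence (dates and places).
# "Death Place" is an assumed label; the others appear in the sample output.
CORROBORATING = {"Birth Year", "Death Year", "Birth Place", "Death Place"}


def classify(score, compared_fields, certain_threshold=0.85, probable_threshold=0.65):
    """Return (classification, insufficient_data) for one scored pair."""
    n = len(compared_fields)
    if score >= certain_threshold and n >= 4:
        label = "certain"
    elif score >= probable_threshold:
        label = "probable"
    else:
        label = None  # non-match, not shown
    insufficient = n < 3 or not (set(compared_fields) & CORROBORATING)
    return label, insufficient


full = ["Surname", "Given Name", "Birth Year", "Death Year", "Birth Place", "Sex"]
print(classify(0.95, full))
# → ('certain', False)
print(classify(0.90, ["Surname", "Given Name", "Sex"]))
# → ('probable', True)
```

The second call shows why a high score alone is not enough: with only three comparable fields the pair cannot be certain, and with no dates or places compared it is flagged as low confidence.
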
Pairs are sorted by descending score. Each individual can appear in at most one matched pair. Once an individual is claimed by a higher-scoring pair, it is excluded from lower-scoring pairs.
Note: this means transitive duplicates are not fully reported. If I1↔I2 (score
0.92) and I1↔I3 (score 0.88), only I1↔I2 is shown. I3 remains unmatched. Use
--limit 0 and inspect the probable section for additional leads.
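
The greedy one-to-one selection described above can be sketched as follows. The function name `greedy_one_to_one` is hypothetical, and the input is assumed to be a list of already scored `(xref_a, xref_b, score)` tuples.

```python
def greedy_one_to_one(scored_pairs):
    """Keep the best-scoring pairs so that each individual appears at most once."""
    used = set()  # single set: both sides of a pair come from the same file
    kept = []
    for a, b, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if a in used or b in used:
            continue  # one side was already claimed by a higher-scoring pair
        used.update((a, b))
        kept.append((a, b, score))
    return kept


pairs = [
    ("@I1@", "@I3@", 0.88),
    ("@I1@", "@I2@", 0.92),
    ("@I4@", "@I5@", 0.70),
]
print(greedy_one_to_one(pairs))
# → [('@I1@', '@I2@', 0.92), ('@I4@', '@I5@', 0.7)]
```

The I1↔I3 pair is dropped because I1 was already claimed by the 0.92 match, which is exactly the transitive case from the note.
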
File: family.ged
=== Duplicate Scan Summary ===
Individuals scanned: 500
Certain duplicates: 3
Probable duplicates: 5
=== Certain Duplicates (3) ===
John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@] score: 0.95
Birth Place: "London, England" vs "London, Middlesex, England"
Mary Jones (1872-1945) [@I3@] ↔ Maria Jones (1873-1945) [@I88@] score: 0.91
Given Name: "Mary" vs "Maria"
Birth Year: "1872" vs "1873"
William Brown (1900-?) [@I10@] ↔ Wm Brown (1900-1965) [@I55@] score: 0.88
Given Name: "William" vs "Wm"
Death Year: "None" vs "1965"
=== Probable Duplicates (5) ===
...
When --show-matches certain is used, the probable section is hidden (summary
counts still reflect the full scan). When --show-matches probable is used, the
certain section is hidden. Verbose mode adds a per-field score breakdown below
each match.
Quiet mode (-q) prints a single line:
3 certain, 5 probable
Verbose mode shows per-field scores and sex penalty (if applicable):
John Smith (1850-1920) [@I1@] ↔ John Smith (1850-1920) [@I42@] score: 0.95
Birth Place: "London, England" vs "London, Middlesex, England"
[Scores: Surname 1.00, Given Name 1.00, Birth Year 1.00, Death Year 1.00, Birth Place 0.85, Sex 1.00]
When a sex mismatch penalty is applied:
[Scores: Surname 1.00, Given Name 0.90, Birth Year 1.00, Sex mismatch ×0.70]
{
"file": "family.ged",
"encoding": {
"detected": "UTF-8",
"has_bom": false,
"declared": "UTF-8"
},
"total_individuals": 500,
"certain_duplicates": [
{
"individual_a": {
"xref": "@I1@",
"name": "John Smith",
"given_name": "John",
"surname": "Smith",
"sex": "M",
"birth_year": 1850,
"birth_place": "London, England",
"death_year": 1920,
"death_place": null
},
"individual_b": {
"xref": "@I42@",
"name": "John Smith",
"given_name": "John",
"surname": "Smith",
"sex": "M",
"birth_year": 1850,
"birth_place": "London, Middlesex, England",
"death_year": 1920,
"death_place": null
},
"score": 0.95,
"classification": "certain",
"field_scores": {
"Surname": 1.0,
"Given Name": 1.0,
"Birth Year": 1.0,
"Death Year": 1.0,
"Birth Place": 0.85,
"Sex": 1.0
},
"differences": [
{
"field": "Birth Place",
"value_a": "London, England",
"value_b": "London, Middlesex, England"
}
]
}
],
"certain_duplicates_total": 3,
"probable_duplicates": [],
"probable_duplicates_total": 5
}

The *_total fields reflect the full count before any --limit truncation or
--show-matches filtering, so consumers always know the complete picture.
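
A consumer can compare each list length against its *_total field to see how many matches were truncated or filtered. The Python sketch below works on an already parsed report; `summarize` is a hypothetical helper, and the key names are the ones shown in the JSON example.

```python
import json


def summarize(report):
    """Report shown vs. total matches for each classification."""
    summary = {}
    for kind in ("certain", "probable"):
        shown = report.get(f"{kind}_duplicates", [])
        total = report.get(f"{kind}_duplicates_total", len(shown))
        summary[kind] = {
            "shown": len(shown),
            "total": total,
            "hidden": total - len(shown),  # removed by --limit or --show-matches
        }
    return summary


sample = json.loads("""
{
  "certain_duplicates": [{"score": 0.95}],
  "certain_duplicates_total": 3,
  "probable_duplicates": [],
  "probable_duplicates_total": 5
}
""")
print(summarize(sample))
# → {'certain': {'shown': 1, 'total': 3, 'hidden': 2}, 'probable': {'shown': 0, 'total': 5, 'hidden': 5}}
```
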
The insufficient_data key is only present when true:
{
"score": 0.72,
"classification": "probable",
"insufficient_data": true,
...
}

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error during processing |
| 2 | Usage error (file not found, invalid thresholds) |
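
For scripting, the exit codes can be checked from a small Python wrapper. This is a minimal sketch, assuming the JSON report is written to stdout and error messages to stderr, neither of which is stated above. The helper names `build_command` and `run_duplicates` are hypothetical; the flags are the ones listed in the options table.

```python
import json
import subprocess

EXIT_MEANINGS = {
    0: "success",
    1: "error during processing",
    2: "usage error (file not found, invalid thresholds)",
}


def build_command(path, certain=None, probable=None):
    """Assemble the argument list for a JSON-format duplicate scan."""
    cmd = ["gedcom-tools", "duplicates", path, "--format", "json"]
    if certain is not None:
        cmd += ["--certain-threshold", str(certain)]
    if probable is not None:
        cmd += ["--probable-threshold", str(probable)]
    return cmd


def run_duplicates(path, **thresholds):
    """Run the scan and return the parsed report, raising on a non-zero exit."""
    proc = subprocess.run(
        build_command(path, **thresholds), capture_output=True, text=True
    )
    if proc.returncode != 0:
        meaning = EXIT_MEANINGS.get(proc.returncode, "unknown exit code")
        raise RuntimeError(
            f"duplicates failed ({proc.returncode}: {meaning}): {proc.stderr.strip()}"
        )
    # Assumption: the JSON report is the only content on stdout.
    return json.loads(proc.stdout)
```

A call such as `run_duplicates("family.ged", certain=0.9)` would then either return the report as a dictionary or raise with the exit code's meaning in the message.
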
- Greedy deduplication is one-to-one: transitive chains (I1↔I2, I2↔I3) only report the highest-scoring pair; the third individual remains unmatched
- No cluster mode: related duplicates are not grouped into transitive sets
- No family context: matches are field-level only; shared parents/children are not considered as corroborating evidence
- Blocking may miss pairs with no shared blocking key (rare with 5 passes)
- Large blocks (500+ individuals sharing a blocking key) are silently capped to avoid quadratic blowup