Skip to content
This repository was archived by the owner on Jul 7, 2025. It is now read-only.
This repository was archived by the owner on Jul 7, 2025. It is now read-only.

Gazzeteer can only link dataframes with one column? #65

@havardox

Description

@havardox

Is it not possible with gazzeteer to link dataframes with multiple columns?

Example code:

import pandas as pd
import pandas_dedupe

FIELDS = [
    {"field": "title", "type": "String"},
    {"field": "year", "type": "Exact"},
    {"field": "authors", "type": "String"},
    {"field": "isbn", "type": "String"},
]


# Test data for messy data
messy_data = [
    {"title": "  The Great Gatsby ", "year": 1925, "authors": "F. Scott Fitzerald", "isbn": "9780743273565", "poes_index": 1},
    {"title": "1985", "year": None, "authors": "George Orwel", "isbn": "978045152493", "poes_index": 2},
    {"title": "To Kill a Mocking Bird", "year": 1960, "authors": None, "isbn": "9780061120084", "poes_index": 3},
    {"title": "Great Gatsby", "year": 1925, "authors": "Fitzgerald", "isbn": None, "poes_index": 4},
    {"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": None, "poes_index": 5},
    {"title": "Catcher in the Rye", "year": 1951, "authors": "J.D Salinger", "isbn": "9780316769488", "poes_index": 6},
    {"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392", "poes_index": 7},
    {"title": "Moby Dick", "year": 1851, "authors": "Herman Melvile", "isbn": None, "poes_index": 8},
    {"title": "Moby-Dick or The Whale", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247", "poes_index": 9},
    {"title": "Brave New World", "year": 1932, "authors": "Aldous Huxly", "isbn": None, "poes_index": 10},
]

# Test data for canonical data
canonical_data = [
    {"title": "The Great Gatsby", "year": 1925, "authors": "F. Scott Fitzgerald", "isbn": "9780743273565"},
    {"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": "9780451524935"},
    {"title": "To Kill a Mockingbird", "year": 1960, "authors": "Harper Lee", "isbn": "9780061120084"},
    {"title": "The Catcher in the Rye", "year": 1951, "authors": "J.D. Salinger", "isbn": "9780316769488"},
    {"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392"},
    {"title": "Moby-Dick", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247"},
    {"title": "Brave New World", "year": 1932, "authors": "Aldous Huxley", "isbn": "9780060850524"},
]

# Create dataframes
messy_data_df = pd.DataFrame(messy_data)

canonical_data_df = pd.DataFrame(canonical_data)

# Run the gazetteer process
final_df = pandas_dedupe.gazetteer_dataframe(
    canonical_data_df, messy_data_df, FIELDS
)

print(final_df)

Output:

Importing data ...
Traceback (most recent call last):
  File "/home/havardox/Documents/Projects/Python/dedupe/link_books_pandas.py", line 43, in <module>
    final_df = pandas_dedupe.gazetteer_dataframe(
  File "/home/havardox/.cache/pypoetry/virtualenvs/dedupe-TGGS-TpB-py3.10/lib/python3.10/site-packages/pandas_dedupe/gazetteer_dataframe.py", line 251, in gazetteer_dataframe
    assert len(clean_data.columns)==1, 'Please provide a gazetteer dataframe made of a single variable'
AssertionError: Please provide a gazetteer dataframe made of a single variable

Meanwhile, it's easy to provide multiple fields in dedupe:

import pandas as pd
import dedupe

# Fields definition
FIELDS = [
    {"field": "title", "type": "String"},
    {"field": "year", "type": "Exact"},
    {"field": "authors", "type": "String"},
    {"field": "isbn", "type": "String"},
]

# Test data for messy data
messy_data = [
    {"title": "  The Great Gatsby ", "year": 1925, "authors": "F. Scott Fitzerald", "isbn": "9780743273565", "poes_index": 1},
    {"title": "1985", "year": None, "authors": "George Orwel", "isbn": "978045152493", "poes_index": 2},
    {"title": "To Kill a Mocking Bird", "year": 1960, "authors": None, "isbn": "9780061120084", "poes_index": 3},
    {"title": "Great Gatsby", "year": 1925, "authors": "Fitzgerald", "isbn": None, "poes_index": 4},
    {"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": None, "poes_index": 5},
    {"title": "Catcher in the Rye", "year": 1951, "authors": "J.D Salinger", "isbn": "9780316769488", "poes_index": 6},
    {"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392", "poes_index": 7},
    {"title": "Moby Dick", "year": 1851, "authors": "Herman Melvile", "isbn": None, "poes_index": 8},
    {"title": "Moby-Dick or The Whale", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247", "poes_index": 9},
    {"title": "Brave New World", "year": 1932, "authors": "Aldous Huxly", "isbn": None, "poes_index": 10},
]

# Test data for canonical data
canonical_data = [
    {"title": "The Great Gatsby", "year": 1925, "authors": "F. Scott Fitzgerald", "isbn": "9780743273565"},
    {"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": "9780451524935"},
    {"title": "To Kill a Mockingbird", "year": 1960, "authors": "Harper Lee", "isbn": "9780061120084"},
    {"title": "The Catcher in the Rye", "year": 1951, "authors": "J.D. Salinger", "isbn": "9780316769488"},
    {"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392"},
    {"title": "Moby-Dick", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247"},
    {"title": "Brave New World", "year": 1932, "authors": "Aldous Huxley", "isbn": "9780060850524"},
]

# Load data into a combined dictionary for dedupe processing
messy_data_dict = {i: row for i, row in enumerate(messy_data)}
canonical_data_dict = {i + len(messy_data): row for i, row in enumerate(canonical_data)}

# Define dedupe gazetteer
gazetteer = dedupe.Gazetteer(FIELDS)

# Prepare training data
gazetteer.prepare_training(messy_data_dict, canonical_data_dict)

dedupe.console_label(gazetteer)

gazetteer.train()

gazetteer.cleanup_training()

# Index the canonical data
gazetteer.index(canonical_data_dict)

# Link the messy data to the canonical data
matches = gazetteer.search(messy_data_dict, n_matches=1, generator=True)

linked_results = []

for messy_id, canonical in matches:
    for match_id, score in canonical:
        messy_record = messy_data_dict[messy_id]
        canonical_record = canonical_data_dict[match_id]
        
        linked_results.append({
            "messy_title": messy_record.get("title"),
            "canonical_title": canonical_record.get("title"),
            "messy_year": messy_record.get("year"),
            "canonical_year": canonical_record.get("year"),
            "messy_authors": messy_record.get("authors"),
            "canonical_authors": canonical_record.get("authors"),
            "messy_isbn": messy_record.get("isbn"),
            "canonical_isbn": canonical_record.get("isbn"),
            "score": score,
        })
final_df = pd.DataFrame(linked_results)

print(final_df)

Output:

...
10/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling
              messy_title         canonical_title  messy_year  canonical_year  ...    canonical_authors     messy_isbn canonical_isbn     score
0       The Great Gatsby         The Great Gatsby      1925.0            1925  ...  F. Scott Fitzgerald  9780743273565  9780743273565  0.769821
1                    1985                    1984         NaN            1949  ...        George Orwell   978045152493  9780451524935  0.700635
2  To Kill a Mocking Bird   To Kill a Mockingbird      1960.0            1960  ...           Harper Lee  9780061120084  9780061120084  0.813530
3            Great Gatsby        The Great Gatsby      1925.0            1925  ...  F. Scott Fitzgerald           None  9780743273565  0.673570
4                    1984                    1984      1949.0            1949  ...        George Orwell           None  9780451524935  0.864510
5      Catcher in the Rye  The Catcher in the Rye      1951.0            1951  ...        J.D. Salinger  9780316769488  9780316769488  0.744632
6                Catch-22                Catch-22      1961.0            1961  ...        Joseph Heller  9780684833392  9780684833392  0.834958
7               Moby Dick               Moby-Dick      1851.0            1851  ...      Herman Melville           None  9780142437247  0.819759
8  Moby-Dick or The Whale               Moby-Dick      1851.0            1851  ...      Herman Melville  9780142437247  9780142437247  0.822753
9         Brave New World         Brave New World      1932.0            1932  ...        Aldous Huxley           None  9780060850524  0.850851

[10 rows x 9 columns]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions