Is it not possible with gazzeteer to link dataframes with multiple columns?
Example code:
import pandas as pd
import pandas_dedupe
FIELDS = [
{"field": "title", "type": "String"},
{"field": "year", "type": "Exact"},
{"field": "authors", "type": "String"},
{"field": "isbn", "type": "String"},
]
# Test data for messy data
messy_data = [
{"title": " The Great Gatsby ", "year": 1925, "authors": "F. Scott Fitzerald", "isbn": "9780743273565", "poes_index": 1},
{"title": "1985", "year": None, "authors": "George Orwel", "isbn": "978045152493", "poes_index": 2},
{"title": "To Kill a Mocking Bird", "year": 1960, "authors": None, "isbn": "9780061120084", "poes_index": 3},
{"title": "Great Gatsby", "year": 1925, "authors": "Fitzgerald", "isbn": None, "poes_index": 4},
{"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": None, "poes_index": 5},
{"title": "Catcher in the Rye", "year": 1951, "authors": "J.D Salinger", "isbn": "9780316769488", "poes_index": 6},
{"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392", "poes_index": 7},
{"title": "Moby Dick", "year": 1851, "authors": "Herman Melvile", "isbn": None, "poes_index": 8},
{"title": "Moby-Dick or The Whale", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247", "poes_index": 9},
{"title": "Brave New World", "year": 1932, "authors": "Aldous Huxly", "isbn": None, "poes_index": 10},
]
# Test data for canonical data
canonical_data = [
{"title": "The Great Gatsby", "year": 1925, "authors": "F. Scott Fitzgerald", "isbn": "9780743273565"},
{"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": "9780451524935"},
{"title": "To Kill a Mockingbird", "year": 1960, "authors": "Harper Lee", "isbn": "9780061120084"},
{"title": "The Catcher in the Rye", "year": 1951, "authors": "J.D. Salinger", "isbn": "9780316769488"},
{"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392"},
{"title": "Moby-Dick", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247"},
{"title": "Brave New World", "year": 1932, "authors": "Aldous Huxley", "isbn": "9780060850524"},
]
# Create dataframes
messy_data_df = pd.DataFrame(messy_data)
canonical_data_df = pd.DataFrame(canonical_data)
# Run the gazetteer process
final_df = pandas_dedupe.gazetteer_dataframe(
canonical_data_df, messy_data_df, FIELDS
)
print(final_df)
Output:
Importing data ...
Traceback (most recent call last):
File "/home/havardox/Documents/Projects/Python/dedupe/link_books_pandas.py", line 43, in <module>
final_df = pandas_dedupe.gazetteer_dataframe(
File "/home/havardox/.cache/pypoetry/virtualenvs/dedupe-TGGS-TpB-py3.10/lib/python3.10/site-packages/pandas_dedupe/gazetteer_dataframe.py", line 251, in gazetteer_dataframe
assert len(clean_data.columns)==1, 'Please provide a gazetteer dataframe made of a single variable'
AssertionError: Please provide a gazetteer dataframe made of a single variable
Meanwhile, it's easy to provide multiple fields in dedupe:
import pandas as pd
import dedupe
# Fields definition
FIELDS = [
{"field": "title", "type": "String"},
{"field": "year", "type": "Exact"},
{"field": "authors", "type": "String"},
{"field": "isbn", "type": "String"},
]
# Test data for messy data
messy_data = [
{"title": " The Great Gatsby ", "year": 1925, "authors": "F. Scott Fitzerald", "isbn": "9780743273565", "poes_index": 1},
{"title": "1985", "year": None, "authors": "George Orwel", "isbn": "978045152493", "poes_index": 2},
{"title": "To Kill a Mocking Bird", "year": 1960, "authors": None, "isbn": "9780061120084", "poes_index": 3},
{"title": "Great Gatsby", "year": 1925, "authors": "Fitzgerald", "isbn": None, "poes_index": 4},
{"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": None, "poes_index": 5},
{"title": "Catcher in the Rye", "year": 1951, "authors": "J.D Salinger", "isbn": "9780316769488", "poes_index": 6},
{"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392", "poes_index": 7},
{"title": "Moby Dick", "year": 1851, "authors": "Herman Melvile", "isbn": None, "poes_index": 8},
{"title": "Moby-Dick or The Whale", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247", "poes_index": 9},
{"title": "Brave New World", "year": 1932, "authors": "Aldous Huxly", "isbn": None, "poes_index": 10},
]
# Test data for canonical data
canonical_data = [
{"title": "The Great Gatsby", "year": 1925, "authors": "F. Scott Fitzgerald", "isbn": "9780743273565"},
{"title": "1984", "year": 1949, "authors": "George Orwell", "isbn": "9780451524935"},
{"title": "To Kill a Mockingbird", "year": 1960, "authors": "Harper Lee", "isbn": "9780061120084"},
{"title": "The Catcher in the Rye", "year": 1951, "authors": "J.D. Salinger", "isbn": "9780316769488"},
{"title": "Catch-22", "year": 1961, "authors": "Joseph Heller", "isbn": "9780684833392"},
{"title": "Moby-Dick", "year": 1851, "authors": "Herman Melville", "isbn": "9780142437247"},
{"title": "Brave New World", "year": 1932, "authors": "Aldous Huxley", "isbn": "9780060850524"},
]
# Load data into a combined dictionary for dedupe processing
messy_data_dict = {i: row for i, row in enumerate(messy_data)}
canonical_data_dict = {i + len(messy_data): row for i, row in enumerate(canonical_data)}
# Define dedupe gazetteer
gazetteer = dedupe.Gazetteer(FIELDS)
# Prepare training data
gazetteer.prepare_training(messy_data_dict, canonical_data_dict)
dedupe.console_label(gazetteer)
gazetteer.train()
gazetteer.cleanup_training()
# Index the canonical data
gazetteer.index(canonical_data_dict)
# Link the messy data to the canonical data
matches = gazetteer.search(messy_data_dict, n_matches=1, generator=True)
linked_results = []
for messy_id, canonical in matches:
for match_id, score in canonical:
messy_record = messy_data_dict[messy_id]
canonical_record = canonical_data_dict[match_id]
linked_results.append({
"messy_title": messy_record.get("title"),
"canonical_title": canonical_record.get("title"),
"messy_year": messy_record.get("year"),
"canonical_year": canonical_record.get("year"),
"messy_authors": messy_record.get("authors"),
"canonical_authors": canonical_record.get("authors"),
"messy_isbn": messy_record.get("isbn"),
"canonical_isbn": canonical_record.get("isbn"),
"score": score,
})
final_df = pd.DataFrame(linked_results)
print(final_df)
Output:
...
10/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling
messy_title canonical_title messy_year canonical_year ... canonical_authors messy_isbn canonical_isbn score
0 The Great Gatsby The Great Gatsby 1925.0 1925 ... F. Scott Fitzgerald 9780743273565 9780743273565 0.769821
1 1985 1984 NaN 1949 ... George Orwell 978045152493 9780451524935 0.700635
2 To Kill a Mocking Bird To Kill a Mockingbird 1960.0 1960 ... Harper Lee 9780061120084 9780061120084 0.813530
3 Great Gatsby The Great Gatsby 1925.0 1925 ... F. Scott Fitzgerald None 9780743273565 0.673570
4 1984 1984 1949.0 1949 ... George Orwell None 9780451524935 0.864510
5 Catcher in the Rye The Catcher in the Rye 1951.0 1951 ... J.D. Salinger 9780316769488 9780316769488 0.744632
6 Catch-22 Catch-22 1961.0 1961 ... Joseph Heller 9780684833392 9780684833392 0.834958
7 Moby Dick Moby-Dick 1851.0 1851 ... Herman Melville None 9780142437247 0.819759
8 Moby-Dick or The Whale Moby-Dick 1851.0 1851 ... Herman Melville 9780142437247 9780142437247 0.822753
9 Brave New World Brave New World 1932.0 1932 ... Aldous Huxley None 9780060850524 0.850851
[10 rows x 9 columns]
Is it not possible with gazzeteer to link dataframes with multiple columns?
Example code:
Output:
Meanwhile, it's easy to provide multiple fields in dedupe:
Output: