Sparkly: TF/IDF Blocking for Entity Matching

Sparkly is an open-source tool for the blocking step of entity matching. Entity matching finds tuples from two tables A and B that match, that is, refer to the same real-world entity. It typically proceeds in two steps. The blocking step uses heuristics to quickly identify a relatively small set of tuple pairs that may be matches. The matching step then applies a (rule- or learning-based) matcher to each surviving pair to predict match or no-match. (See this page for details.)

Sparkly focuses on the blocking step, and stands out in four respects:

It can scale to large tables, for example, with tens of millions or hundreds of millions of tuples per table.
It uses the TF/IDF similarity measure to perform blocking.
It outperforms many state-of-the-art blocking solutions. See this paper for details.
Variations of Sparkly have been implemented in industry and used by hundreds of enterprises.

If you have blocking heuristics you want to use, you may want to consider Delex, an even more powerful blocking solution, which allows you to combine multiple blocking heuristics, including TF/IDF. See the Delex homepage for more details, including when to use Sparkly versus Delex.

How Sparkly Works

Let A be the smaller table (the one with fewer tuples). For each tuple b in Table B, Sparkly finds k tuples in Table A that have the highest TF/IDF similarity score with tuple b (where k is a parameter specified by the user). Let these tuples be a₁, a₂, ..., a_k. Then Sparkly returns the tuple pairs (a₁,b), (a₂,b), ..., (a_k,b) as potential matches in its output.
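
The top-k procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sparkly's implementation: the function names (`top_k_blocking`, `overlap_score`) are hypothetical, tables are modeled as plain dicts from tuple id to text, and the default scorer counts shared tokens as a stand-in for TF/IDF. Dropping zero-score tuples is also an assumption made for the sketch.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def overlap_score(a_text: str, b_text: str) -> float:
    """Placeholder similarity (NOT TF/IDF): number of distinct shared tokens."""
    return float(len(set(a_text.lower().split()) & set(b_text.lower().split())))


def top_k_blocking(
    table_a: Dict[str, str],
    table_b: Dict[str, str],
    k: int,
    score: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, str]]:
    """For each tuple b in B, keep the k highest-scoring tuples of A."""
    pairs: List[Tuple[str, str]] = []
    for b_id, b_text in table_b.items():
        scored = ((score(a_text, b_text), a_id) for a_id, a_text in table_a.items())
        # nlargest returns the k best (score, a_id) tuples, best first;
        # ties on score fall back to comparing ids, an arbitrary choice.
        best = heapq.nlargest(k, scored)
        # Assumption of this sketch: tuples with no overlap at all are dropped.
        pairs.extend((a_id, b_id) for s, a_id in best if s > 0)
    return pairs


table_a = {
    "a1": "apple iphone 12 black",
    "a2": "samsung galaxy s21",
    "a3": "apple ipad air",
}
table_b = {"b1": "iphone 12 apple", "b2": "galaxy s21 phone"}

candidate_pairs = top_k_blocking(table_a, table_b, k=2)
print(candidate_pairs)
```

This brute-force version compares every b against every a, which is fine for a toy example but is exactly the cost that blocking at scale has to avoid.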

TF/IDF is a similarity score commonly used in text document retrieval and keyword search on the Web. Many TF/IDF variations exist. Sparkly uses the well-known BM25 variation.
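
To make the scoring concrete, here is a minimal sketch of a textbook Okapi BM25 scorer in plain Python. It is an assumption-laden illustration, not the code Sparkly or Lucene runs: the function name `bm25_scores` is hypothetical, tokenization is a simple lowercase whitespace split, each distinct query term is counted once, and `k1=1.2`, `b=0.75` are commonly used defaults. Lucene's actual scoring may differ in details.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(
    query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75
) -> Dict[str, float]:
    """Score every document in `docs` against `query` with textbook BM25."""
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(doc_tokens)
    if n_docs == 0:
        return {}
    avg_len = sum(len(toks) for toks in doc_tokens.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for toks in doc_tokens.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores: Dict[str, float] = {}
    for doc_id, toks in doc_tokens.items():
        tf = Counter(toks)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight; the 1.0 keeps the IDF non-negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            total += idf * tf[term] * (k1 + 1.0) / norm
        scores[doc_id] = total
    return scores


table_a = {
    "a1": "apple iphone 12 black",
    "a2": "samsung galaxy s21",
    "a3": "apple ipad air",
}
print(bm25_scores("iphone 12 apple", table_a))
```

In this toy table, "apple" appears in two of three tuples and so contributes less than the rarer "iphone" and "12", which is the behavior that makes TF/IDF-style scores useful for blocking.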

Implementation-wise, Sparkly builds an index over the tuples in Table A, then uses this index and a Spark cluster to run the top-k searches quickly. Sparkly uses Lucene to build the index and perform the top-k search. See the paper for details.
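
The role of the index can be illustrated with a toy inverted index in plain Python. This is a simplified sketch, not how Lucene stores or searches its indexes; the function names (`build_inverted_index`, `candidates`) are hypothetical. The idea it shows is that only tuples of A sharing at least one token with b need to be scored, so most of A is never touched for a given b.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(table_a: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of ids of A-tuples that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for a_id, text in table_a.items():
        for token in set(text.lower().split()):
            index[token].add(a_id)
    return index


def candidates(index: Dict[str, Set[str]], b_text: str) -> Set[str]:
    """Return ids of A-tuples that share at least one token with b_text."""
    found: Set[str] = set()
    for token in set(b_text.lower().split()):
        # .get avoids creating empty entries for unseen tokens.
        found |= index.get(token, set())
    return found


table_a = {
    "a1": "apple iphone 12 black",
    "a2": "samsung galaxy s21",
    "a3": "apple ipad air",
}
index = build_inverted_index(table_a)
print(sorted(candidates(index, "iphone 12 apple")))
```

Because the index is built once over A and each lookup for a tuple of B is independent of the others, the lookups can be partitioned across workers, which is the kind of parallelism a Spark cluster provides.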

Case Studies and Performance Statistics

Sparkly can block tables of tens of millions of tuples in hours on relatively small clusters of machines. It scales to hundreds of millions of tuples. See this page for details.

Installation

See the instructions for installing Sparkly on a single machine or on a cluster of machines.

How to Use

See this page, which points to a technical report, slides, a tutorial, and examples.

Further Pointers

See the API documentation. For questions or comments, contact AnHai Doan.
