VChunkJoin Edit String Similarity Join Source Code

Basic information

Source code author: Jianbin Qin
Version: 0.1 (Only experimental)
Contact: jqin@inf.ed.ac.uk
More information: http://qinjianbin.com/

Introduction:

Title: VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
Authors: Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen
Published in TKDE, 2012

Abstract:

Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
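To make the problem statement concrete, here is a minimal brute-force sketch of a self-join under an edit distance constraint: it reports every pair of records whose edit distance is at most tau. It is not the VChunkJoin algorithm and is not taken from this repository; the function names `within_edit_distance` and `naive_edit_join` are hypothetical and serve only as a baseline for what the join computes.

```python
from itertools import combinations


def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """Return True if the edit (Levenshtein) distance of a and b is <= tau."""
    if abs(len(a) - len(b)) > tau:
        return False  # length filter: at least |len(a) - len(b)| edits are needed
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        if min(cur) > tau:
            return False  # row minima never decrease, so the result exceeds tau
        prev = cur
    return prev[-1] <= tau


def naive_edit_join(records: list[str], tau: int) -> list[tuple[int, int]]:
    """Compare all pairs; return index pairs (i, j), i < j, within distance tau."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if within_edit_distance(a, b, tau)]


if __name__ == "__main__":
    sample = ["kitten", "sitting", "mitten", "database"]
    print(naive_edit_join(sample, 2))
```

This baseline verifies a quadratic number of pairs, which is the cost that filtering-based methods such as the one described in the abstract aim to avoid.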

Comments:

This work explores variable-length chunking for edit similarity joins. It runs exceptionally efficiently when a high-quality chunk boundary dictionary (CBD) is available. Finding a high-quality CBD is a hard problem. We presented an initial idea, but this part requires further exploration.
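As a rough illustration of what variable-length, nonoverlapping chunking means, the sketch below cuts a string immediately after every occurrence of a boundary character. This is a deliberate simplification: it assumes single-character boundaries and does not implement the tail-restricted CBD scheme from the paper. The function name `chunk` and the boundary set are hypothetical.

```python
def chunk(text: str, boundary_chars: set[str]) -> list[str]:
    """Split text into nonoverlapping chunks, each ending at a boundary character.

    Assumption for illustration only: a boundary is a single character and
    the chunk includes the boundary character that closes it.
    """
    chunks, start = [], 0
    for pos, ch in enumerate(text):
        if ch in boundary_chars:
            chunks.append(text[start:pos + 1])
            start = pos + 1
    if start < len(text):
        chunks.append(text[start:])  # trailing chunk with no closing boundary
    return chunks


if __name__ == "__main__":
    print(chunk("similarity join", {"i", " "}))
```

Chunk lengths vary with the positions of the boundary characters, and the choice of boundary set determines how many chunks a string yields and how long they are.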

Please cite this paper:

@article{DBLP:journals/tkde/WangQXLS13,
author    = {Wei Wang and
Jianbin Qin and
Chuan Xiao and
Xuemin Lin and
Heng Tao Shen},
title     = {VChunkJoin: An Efficient Algorithm for Edit Similarity Joins},
journal   = {{IEEE} Trans. Knowl. Data Eng.},
volume    = {25},
number    = {8},
pages     = {1916--1929},
year      = {2013},
url       = {https://doi.org/10.1109/TKDE.2012.79},
doi       = {10.1109/TKDE.2012.79},
timestamp = {Sat, 20 May 2017 00:24:23 +0200},
biburl    = {https://dblp.org/rec/bib/journals/tkde/WangQXLS13},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Overview of Programs

Executables

| Program Name  | Description |
|---------------+-------------|
| charstat      | Preprocessing step that suggests a prefix character set and a suffix character set. |
| cbdselect     | Preprocessing step that takes the input text and the prefix and suffix character sets and generates an optimized chunk boundary dictionary (CBD). |
| chunk_ed_join | The join step. It takes the CBD and the input text and outputs the join result. |

Package Manual

Installation

To get and compile the code, say:

$ git clone https://github.com/qinbill/VChunkJoin.git
$ cd VChunkJoin/src/
$ make

Preprocessing

The preprocessing does two things:

Find a character set suggestion.
Generate a CBD.

Find a character set suggestion.

Usage:

$ cat ../data/dblp.sample.10k.gz | gzip -d | ./charstat 2
-s "evwzACPST" -u "abkpyDLMR"

This step prints two parameters: -s "evwzACPST" -u "abkpyDLMR". Use them as parameters for the next step.

Generate a CBD:

Usage:

$ ./cbdselect -h
Usage: Program <-s 'character set one'> 
       <-u 'character set two'> 
       <-m 'last prefix length min bound'> 
       <-d dump final split of string records 
       <-t edit distance> 
       <-i cbd in file name> 
       <-o cbd out put file name > 
       <-h help > 
       <-v version > 

$ cat ../data/dblp.sample.10k.gz | gzip -d | ./cbdselect -s 'aefg' -u 'vsxz' -t 3 -o cbd
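The two preprocessing steps above can be chained by parsing the line that charstat prints and turning it into a cbdselect argument list. The sketch below assumes that charstat writes exactly one line of the form shown earlier (`-s "..." -u "..."`) to standard output. The helper names `parse_charstat_output` and `cbdselect_argv` are hypothetical, and only flags listed in the cbdselect help text are used.

```python
import shlex


def parse_charstat_output(line: str) -> dict[str, str]:
    """Extract the -s and -u character sets from one line of charstat output."""
    tokens = shlex.split(line)                      # handles the double quotes
    opts = dict(zip(tokens[0::2], tokens[1::2]))    # pair each flag with its value
    missing = {"-s", "-u"} - opts.keys()
    if missing:
        raise ValueError(f"missing flags in charstat output: {sorted(missing)}")
    return {"-s": opts["-s"], "-u": opts["-u"]}


def cbdselect_argv(charsets: dict[str, str], tau: int, out_path: str) -> list[str]:
    """Build an argument list for cbdselect (suitable for subprocess.run)."""
    return ["./cbdselect",
            "-s", charsets["-s"],
            "-u", charsets["-u"],
            "-t", str(tau),
            "-o", out_path]


if __name__ == "__main__":
    sets = parse_charstat_output('-s "evwzACPST" -u "abkpyDLMR"')
    print(cbdselect_argv(sets, tau=3, out_path="cbd"))
```

The resulting list can be passed to `subprocess.run` with the decompressed input text on standard input, mirroring the shell pipeline above.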

Join Query Processing

The join processing part takes input text from standard input. Usage:

$ ./chunk_ed_join -h
usage: <-b bound dict file>
       <-g virtual bound random seed>
       <-t edit distance /tau>
       [-d dump all the chunks and strings]
       [-o not join underflow strings]
       [-p dump all the prefix sorted by frequence]
       [-c mute the chunk number filtering
       [-r calculate the final edit-distance result
       [-u print underflow candidates
       [-s print ppjoin running time
       [-h> for help information]
This program output one line in stdout and all the candidates and information in stderr
  --Stdout output format  explanation:
       RNUM    :  All Input Records Number
       CBD     :  The Chunk Boundary chars
       TAU     :  Inputededit distence
       TCADT   :  Total candidates
       RCADT   :  TCADT-UDCDT, it means the candidates exclude underflows
       UDCDT   :  Number of candidates create by underflows
       UDNUM   :  Number of records is underflow
       DCNUM   :  All Distinct Chunk Number
       DWCNUM  :  All Distinct Widow Chunk Number
       DICNUM  :  All Distinct Indexed Chunk Number
       AVGPCL  :  Average Prefix Chunk length
   --Stderr output format explanation
       Candidate line is begin by CAND or BLCAND format is below
       CAND[Candidate id] <Record id of a>[L:record length of a ][T:Chunk number of a]- \
                 -<Record id of b>[L:record lenght of b][T:chunk number of b] <string a>---<string b> 
   --Prefix chunks information format
       PREFIX_CHUNKS: PF[the frequence in prefix] TF[the frequence in the overall]  "prefix string"
Version: 0.0.1.0_PROD

An example:

$ cat ../data/dblp.sample.10k.gz | gzip -d | ./chunk_ed_join -b cbd -t 3
2 5313-2308
RNUM= 10000  TAU= 3 CAND0= 501   CAND1= 2  CAND2= 1  UDCDT= 0  UDNUM= 5  RST_NUM= 1  TCN= 295739  TPL= 46221  APL= 4.624  TIDX= 15146  VCN_SEED= 121231  CBD= "cbd"  TOTAL= 0.124  PRE_PRC= 0.123  JOINT= 0.001  Usage: System usage stats 0.000805 elapsed 0.000668 user 0.000140 system sec
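For scripted experiments it can be convenient to turn the `KEY= value` statistics line above into a dictionary. The sketch below is based only on the sample line shown in this example and assumes the same `KEY= value` layout; the function name `parse_stats_line` is hypothetical and the values are kept as strings. Text that does not follow the `KEY=` pattern, such as the trailing usage statistics, is ignored.

```python
import re

# A key is a run of word characters followed by '='; the value is either a
# double-quoted string or the next run of non-whitespace characters.
_FIELD = re.compile(r'(\w+)=\s*("[^"]*"|\S+)')


def parse_stats_line(line: str) -> dict[str, str]:
    """Parse 'KEY= value' pairs from the statistics line into a dict."""
    return {key: value.strip('"') for key, value in _FIELD.findall(line)}


if __name__ == "__main__":
    sample = ('RNUM= 10000  TAU= 3 CAND0= 501   CAND1= 2  CAND2= 1  '
              'UDCDT= 0  UDNUM= 5  RST_NUM= 1  TCN= 295739  TPL= 46221  '
              'APL= 4.624  TIDX= 15146  VCN_SEED= 121231  CBD= "cbd"  '
              'TOTAL= 0.124  PRE_PRC= 0.123  JOINT= 0.001')
    stats = parse_stats_line(sample)
    print(stats["RNUM"], stats["TAU"], stats["TOTAL"])
```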

Last Modified: <2018-03-29 Mon> by Jianbin Qin
