- Source code author: Jianbin Qin
- Version: 0.1 (Only experimental)
- Contact: jqin@inf.ed.ac.uk
- More information: http://qinjianbin.com/
- Title: VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
- Authors: Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen
- Published in TKDE, 2012
Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
This work explored the problem of variable length chunking in edit string similarity. It runs exceptionally efficient when there is a high quality CBD. Finding a high quality CBD is a really hard problem. We presented initial idea, but this part requires further exploration.
@article{DBLP:journals/tkde/WangQXLS13,
author = {Wei Wang and
Jianbin Qin and
Chuan Xiao and
Xuemin Lin and
Heng Tao Shen},
title = {VChunkJoin: An Efficient Algorithm for Edit Similarity Joins},
journal = {{IEEE} Trans. Knowl. Data Eng.},
volume = {25},
number = {8},
pages = {1916--1929},
year = {2013},
url = {https://doi.org/10.1109/TKDE.2012.79},
doi = {10.1109/TKDE.2012.79},
timestamp = {Sat, 20 May 2017 00:24:23 +0200},
biburl = {https://dblp.org/rec/bib/journals/tkde/WangQXLS13},
bibsource = {dblp computer science bibliography, https://dblp.org}
}Program Name Description
- charstat Preprocess that used to suggest a prefix character set and a suffix character set.
- cbdselect Preprocess that takes input text and prefix and suffix character set to generate a optimized (CBD) dictionary.
- chunk_ed_join The join process. It takes the CBD and text input. Output the join result.
- code, say
$ git clone https://github.com/qinbill/VChunkJoin.git
$ cd VChunkJoin/src/
$ makeThe preprocessing dose three things:
- Find a character set suggestion.
- Generate a CBD.
Usage:
$ cat ../data/dblp.sample.10k.gz | gzip -d | ./charstat 2
-s "evwzACPST" -u "abkpyDLMR"This process will generate two parameters: -s “evwzACPST” -u “abkpyDLMR” Use those two as parameter for the next process.
Usage:
$ ./cbdselect -h
Usage: Program <-s 'character set one'>
<-u 'character set two'>
<-m 'last prefix length min bound'>
<-d dump final split of string records
<-t edit distance>
<-i cbd in file name>
<-o cbd out put file name >
<-h help >
<-v version >
$ cat ../data/dblp.sample.10k.gz | gzip -d | ./cbdselect -s 'aefg' -u 'vsxz' -t 3 -o cbdThe join processing part takes input text from standard input. Usage:
$ ./chunk_ed_join -h
usage: <-b bound dict file>
<-g virtual bound random seed>
<-t edit distance /tau>
[-d dump all the chunks and strings]
[-o not join underflow strings]
[-p dump all the prefix sorted by frequence]
[-c mute the chunk number filtering
[-r calculate the final edit-distance result
[-u print underflow candidates
[-s print ppjoin running time
[-h> for help information]
This program output one line in stdout and all the candidates and information in stderr
--Stdout output format explanation:
RNUM : All Input Records Number
CBD : The Chunk Boundary chars
TAU : Inputededit distence
TCADT : Total candidates
RCADT : TCADT-UDCDT, it means the candidates exclude underflows
UDCDT : Number of candidates create by underflows
UDNUM : Number of records is underflow
DCNUM : All Distinct Chunk Number
DWCNUM : All Distinct Widow Chunk Number
DICNUM : All Distinct Indexed Chunk Number
AVGPCL : Average Prefix Chunk length
--Stderr output format explanation
Candidate line is begin by CAND or BLCAND format is below
CAND[Candidate id] <Record id of a>[L:record length of a ][T:Chunk number of a]- \
-<Record id of b>[L:record lenght of b][T:chunk number of b] <string a>---<string b>
--Prefix chunks information format
PREFIX_CHUNKS: PF[the frequence in prefix] TF[the frequence in the overall] "prefix string"
Version: 0.0.1.0_PRODA example:
$ cat ../data/dblp.sample.10k.gz | gzip -d | ./chunk_ed_join -b cbd -t 3
2 5313-2308
RNUM= 10000 TAU= 3 CAND0= 501 CAND1= 2 CAND2= 1 UDCDT= 0 UDNUM= 5 RST_NUM= 1 TCN= 295739 TPL= 46221 APL= 4.624 TIDX= 15146 VCN_SEED= 121231 CBD= "cbd" TOTAL= 0.124 PRE_PRC= 0.123 JOINT= 0.001 Usage: System usage stats 0.000805 elapsed 0.000668 user 0.000140 system secLast Modified: <2018-03-29 Mon> by Jianbin Qin