Address Finder is a package for matching textual addresses. It implements several similarity scoring methods that target non-uniform textual addresses, which may contain misspellings, missing words, and similar irregularities. The package can be used with Spark.
The following similarity measures are defined in this package.
- LetterPrefixDistance: measures the length of the common prefix in letters
- WordPrefixDistance: measures the length of the common prefix in words
- NumberOverlapDistance: measures the difference between the numbers appearing in the addresses
- NumberSeqDistance: measures the difference between the numbers using the Levenshtein distance
- PriorityWordDistance: measures word differences, giving more weight to a difference when the word appears at the beginning of the string
- StrictNumberOverlapDistance: combines the number overlap and number sequence differences into a binary result
- WordBagDistance: measures the difference between word bags (word counts)
- WordSetDistance: measures the difference in words without considering how many times each word appears
- SymmetricWordSetDistance: similar to WordSetDistance, but symmetric
- SymmetricWordSetWithIDF: similar to SymmetricWordSetDistance, with words weighted by their IDF in the collection
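
To make the ideas behind a few of these measures concrete, here is a minimal, self-contained sketch. It is not the package's implementation: the object `SimilaritySketch` and all of its function names are hypothetical, tokenisation is assumed to be lowercase splitting on whitespace, and the asymmetric word-set difference is assumed to count the words of the first address that are missing from the second.

```scala
// Illustrative sketch only; names and details are assumptions,
// not the API of the Address Finder package.
object SimilaritySketch {
  // Split an address into lowercase word tokens on whitespace.
  def tokens(s: String): Seq[String] =
    s.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Number of leading characters the two strings share.
  def letterPrefixLength(a: String, b: String): Int =
    a.zip(b).takeWhile { case (x, y) => x == y }.length

  // Number of leading words the two strings share.
  def wordPrefixLength(a: String, b: String): Int =
    tokens(a).zip(tokens(b)).takeWhile { case (x, y) => x == y }.length

  // Word-bag difference: sum of absolute differences in word counts.
  def wordBagDifference(a: String, b: String): Int = {
    def counts(s: String): Map[String, Int] =
      tokens(s).groupBy(identity).map { case (w, ws) => w -> ws.size }
    val (ca, cb) = (counts(a), counts(b))
    // Convert to Seq before mapping so equal differences are not merged.
    (ca.keySet ++ cb.keySet).toSeq
      .map(w => math.abs(ca.getOrElse(w, 0) - cb.getOrElse(w, 0)))
      .sum
  }

  // Asymmetric word-set difference: distinct words of `a` missing from `b`.
  def wordSetDifference(a: String, b: String): Int =
    (tokens(a).toSet -- tokens(b).toSet).size

  // Symmetric variant: distinct words present in exactly one of the strings.
  def symmetricWordSetDifference(a: String, b: String): Int = {
    val (sa, sb) = (tokens(a).toSet, tokens(b).toSet)
    ((sa -- sb) ++ (sb -- sa)).size
  }
}
```

Under these assumptions, `SimilaritySketch.letterPrefixLength("2 some street", "2a some street")` is 1, while `SimilaritySketch.symmetricWordSetDifference("2 some street some city", "2a some street some city")` is 2, because only the tokens `2` and `2a` differ. The word-bag variant is sensitive to repetition: comparing "some street some city" with "some street city" gives 1, whereas the word-set variants give 0.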
Searchers can be composed from different components.
Please refer to uk.ac.cdrc.data.utility.text.AddressSearcher for an example of search composition.
import uk.ac.cdrc.data.utility.text.AddressSearcher

// Reference addresses to search in
val addressSetA = IndexedSeq(
  "1 some street some city",
  "2 some street some city")
// Query addresses to look up in addressSetA
val addressSetB = IndexedSeq(
  "2 some street some city",
  "2a some street some city")
val as = AddressSearcher(addressSetA)
// Pair each query with its top result, skipping results flagged as multiTops
val matching = for {
  a <- addressSetB
  r <- as search a
  if !r.multiTops
} yield (r.top, a)