
Fuzzy matching closest neighbor hash-map backed tool for high performance multicolumn V-Look Up style operations


luislascano01/AI_VLookUp


AI VLookUp

This project tackles efficient fuzzy matching between tables by combining several data structures and techniques. Its engine is backed by HashMaps, tokenization, Jaccard similarity, and Damerau distance, plus sorting of small candidate groups. With these, it builds its own database for fuzzy lookup and then queries that database to find the best mapping between the entries of two different tables.
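To make the ingredients above concrete, here is a minimal sketch of tokenization, Jaccard similarity, and a HashMap-backed token index. It is an illustration under assumptions, not the engine's actual code: the class `TokenIndexSketch`, its method names, and the tokenization rule (lowercase, split on non-alphanumeric characters) are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; names and tokenization rule are assumptions.
class TokenIndexSketch {
    // Lowercase the text and split on anything that is not a letter or digit.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Inverted index: token -> indices of the reference rows that contain it.
    static Map<String, Set<Integer>> buildIndex(List<String> rows) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (String tok : tokenize(rows.get(i))) {
                index.computeIfAbsent(tok, k -> new TreeSet<>()).add(i);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("Acme Corp New York", "Globex Inc Boston");
        Map<String, Set<Integer>> index = buildIndex(reference);
        System.out.println(index.get("acme")); // prints [0]
        System.out.println(jaccard(tokenize("ACME corp"), tokenize(reference.get(0)))); // prints 0.5
    }
}
```

The index lets a query look up candidate rows by shared tokens instead of scanning the whole reference table; the similarity measure then only needs to rank that small candidate group.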

Use case

Suppose we have two tables: (1) Fuzzy_Table and (2) Reference_Table. The first is the table we want to "clean" or fill in, based on the second, which provides the expected actual values for each entry. In this case, we attempt to find the ID of each entry based on the values of its other columns. The challenge is that, for any given entry, a column may or may not contain the data needed to perform an "ordinary" V Look Up on its own; and when it does contain data, the text may not exactly match the valid value. Our objective is therefore to use all the non-empty columns of each entry (row) as reference points.
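The "not exactly matching" case is where an edit distance helps. The sketch below computes the optimal string alignment variant of Damerau distance, in which an insertion, deletion, substitution, or swap of two adjacent characters each costs 1. Which Damerau variant the engine uses is not stated here, so treat the variant, the class name `DamerauSketch`, and the sample strings as assumptions.

```java
// Hypothetical illustration; the engine's actual variant may differ.
class DamerauSketch {
    // Optimal string alignment distance via dynamic programming.
    static int distance(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of s
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of t
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // substitution or match
                // Transposition of two adjacent characters.
                if (i > 1 && j > 1
                        && s.charAt(i - 1) == t.charAt(j - 2)
                        && s.charAt(i - 2) == t.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("Jonh", "John"));   // prints 1 (one swap)
        System.out.println(distance("Acme", "Acmee"));  // prints 1 (one insertion)
    }
}
```

A small distance relative to the string length suggests a typo rather than a different value, which is why a cell such as "Jonh" can still serve as a reference point.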

Without going into further detail (see the documentation for more), the program performs a many-to-many (with respect to columns) weighted fuzzy match of each incomplete entry against a complete, valid reference table.
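One way to picture a many-to-many weighted match is as a set of weighted edges between columns of the two tables. The sketch below scores a fuzzy row against each reference row by summing weight times similarity over the edges whose fuzzy cell is non-empty, then picks the best-scoring row. Everything here is an assumption made for illustration: the `Edge` type, the normalization by the sum of used weights, the token-overlap similarity, and the sample data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of weighted many-to-many column matching.
class WeightedMatchSketch {
    // One mapping edge: fuzzy column -> reference column, with a weight.
    static final class Edge {
        final String fuzzyCol, refCol;
        final double weight;
        Edge(String fuzzyCol, String refCol, double weight) {
            this.fuzzyCol = fuzzyCol; this.refCol = refCol; this.weight = weight;
        }
    }

    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) if (!t.isEmpty()) out.add(t);
        return out;
    }

    // Token-set Jaccard similarity, used here as a stand-in similarity measure.
    static double similarity(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        return (double) inter.size() / (ta.size() + tb.size() - inter.size());
    }

    // Weighted average similarity over edges whose fuzzy cell is non-empty.
    static double score(Map<String, String> fuzzyRow, Map<String, String> refRow, List<Edge> edges) {
        double total = 0, weightSum = 0;
        for (Edge e : edges) {
            String f = fuzzyRow.get(e.fuzzyCol), r = refRow.get(e.refCol);
            if (f == null || f.trim().isEmpty() || r == null) continue; // skip empty cells
            total += e.weight * similarity(f, r);
            weightSum += e.weight;
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }

    // Index of the best-scoring reference row, or -1 if there are none.
    static int bestMatch(Map<String, String> fuzzyRow, List<Map<String, String>> refRows, List<Edge> edges) {
        int best = -1;
        double bestScore = -1;
        for (int i = 0; i < refRows.size(); i++) {
            double s = score(fuzzyRow, refRows.get(i), edges);
            if (s > bestScore) { bestScore = s; best = i; }
        }
        return best;
    }

    // Builds a row from alternating column names and values.
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge("Name", "Name", 1.0),
                new Edge("Address", "Address", 1.0),
                new Edge("Address", "Name", 0.5)); // a name may land in the address column
        Map<String, String> fuzzy = row("Name", "", "Address", "Acme Corp 5th Ave");
        List<Map<String, String>> refs = Arrays.asList(
                row("Customer_ID", "C1", "Name", "Acme Corp", "Address", "5th Ave New York"),
                row("Customer_ID", "C2", "Name", "Globex Inc", "Address", "Main St Boston"));
        int best = bestMatch(fuzzy, refs, edges);
        System.out.println(refs.get(best).get("Customer_ID")); // prints C1
    }
}
```

The cross edge from Address to Name is what makes the mapping many-to-many: the empty Name cell is skipped, yet the misplaced company name still contributes to the score.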

Dependencies and Installation

Verify that your Java version is greater than 1.8:

java -version

Install Maven (for example, with Homebrew):

brew install maven

The package dependencies are listed in Engine/pom.xml.

You can build the program (JAR) from source as a Maven project.

Example usage

Continuing the example from the use case, suppose we want to find the Customer_ID for each entry in the Fuzzy_Table. Looking at the table, however, some values may appear in the wrong columns. We therefore map the columns across both tables as shown in Graph 1.

Graph 1: Default mapping configuration

Set-up

For a more detailed guide, see the User Manual.

Download code

git clone https://github.com/luislascano01/AI_VLookUp
cd AI_VLookUp

Build JAR

cd Engine
mvn clean install

Custom run configuration

Use the sample configuration as a reference and edit it to match the Excel tables to be processed.

This configuration includes the header mappings (as seen in Graph 1), the paths to the Excel workbooks, the operating directory path, and the column-wise RegEx set for secondary data.

To view the sample from a terminal:

cat Engine/src/main/resources/header_configuration.yaml

Execution

Sample execution

java -jar Engine/target/ai_vlookup-0.0.1-SNAPSHOT.jar Engine/src/main/resources/header_configuration.yaml

Find the output at:

./OperatingDir/results.csv

Custom execution

Modify header_configuration.yaml according to your needs. Refer to Graph 1 to understand the mapping; note that softmax is applied to the weights. The graph corresponds to the mapping in the sample YAML configuration, specifically its "BackboneConfiguration".
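As a reminder of what softmax does to a set of weights, here is a small sketch. It assumes the softmax is taken over one group of raw weights (for example, the edges leaving a single column); how the engine actually groups the weights is not stated here, and the class name `SoftmaxSketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration; the grouping of weights is an assumption.
class SoftmaxSketch {
    // Numerically stable softmax: subtract the maximum before exponentiating.
    static double[] softmax(double[] weights) {
        double max = Double.NEGATIVE_INFINITY;
        for (double w : weights) max = Math.max(max, w);
        double[] out = new double[weights.length];
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            out[i] = Math.exp(weights[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(softmax(new double[] {1.0, 1.0}))); // prints [0.5, 0.5]
        System.out.println(Arrays.toString(softmax(new double[] {3.0, 1.0, 0.5})));
    }
}
```

The practical consequence for the configuration is that only the relative sizes of the weights matter: the normalized values are positive, sum to 1, and preserve the ordering of the raw weights.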

java -jar Engine/target/ai_vlookup-0.0.1-SNAPSHOT.jar custom_config.yaml

Copyright Notice

© 2025 Luis Lascano. All rights reserved.

Open to use as is through instructed installation for personal use. No permission authorized to copy, modify, or distribute this software (or part of it) and its documentation for any purpose without the express written permission of the copyright holder.
