Implementing Levenshtein Distance #18

@MNSleeper

Description

There are two main ways to implement LD, listed below:

  • Run an LD test on every entry of the array of strings that gets passed to SQL once a file is parsed. This would correct any misspellings in the file's list before it is inserted into SQL. However, two different datasets could spell the same entity differently, so the LD test would have to be run again at link time anyway.
  • Run an LD test after a dataset has been put into SQL but before running the linkage method on that particular table, likely as part of the LinkTable method itself. This would ensure that misspellings are corrected consistently across all tables.

The latter option seems best.
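Whichever placement we pick, the distance itself is the standard dynamic-programming edit distance (insertions, deletions, substitutions). A minimal Python sketch for discussion; the function name is ours and nothing here is in the codebase yet:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance, computed one row at a time (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).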

On top of this, there's one small issue: what if a misspelling gets accepted as the proper way to spell something, and every proper spelling of that word then gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings and only ever correct toward entries in it.
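The internal-dictionary idea could look something like the sketch below: snap each incoming string to the nearest known-good spelling, and leave it untouched when nothing is close enough. Everything here (names, the `max_dist` default, the example dictionary) is hypothetical, not existing project code:

```python
def levenshtein(a: str, b: str) -> int:
    # Standard DP edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def canonicalize(entry: str, dictionary: set, max_dist: int = 2) -> str:
    """Snap `entry` to the closest proper spelling within `max_dist` edits.
    Corrections only ever flow toward the dictionary, so an accepted
    misspelling can't overwrite proper spellings. Unmatched entries pass
    through unchanged."""
    best, best_d = entry, max_dist + 1
    for word in dictionary:
        d = levenshtein(entry, word)
        if d < best_d:
            best, best_d = word, d
    return best
```

Scanning the whole dictionary per entry is O(|dictionary|) per string, which is fine for small vocabularies but would need indexing (e.g. by length or prefix) at scale.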

Finally, what should an acceptable LD threshold be? Should we use a static number, like 3, or base it on some metric like the length of the string being compared?
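One possible answer to the threshold question, sketched under assumed numbers we'd want to tune against real data: scale the allowed distance with string length, so short codes stay strict while longer names get more slack. A flat 3 would let "cat" match "dog" (distance 3), which a length-scaled cutoff avoids:

```python
def max_allowed_distance(s: str) -> int:
    # Assumed heuristic: allow roughly one edit per four characters,
    # but always tolerate at least one edit. The 4 is a guess to tune.
    return max(1, len(s) // 4)
```

Under this rule a two-letter state code tolerates only one edit, while an eleven-character city name tolerates two.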
