Open
Description
There are two main ways to implement LD, listed below:
- Run an LD test on every entry of the array of strings that gets passed to SQL once a file is parsed. This would correct any misspellings in the file list before it gets put into SQL. However, two different datasets could have two different spellings of the same entity, so the LD test would have to be run again.
- Run an LD test after a dataset has been put into SQL but before running the Linkage method on that particular table, likely as a part of the LinkTable method itself. This method would ensure that all misspellings across all tables are corrected.
The latter option seems best.
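For reference, here is a minimal sketch of the LD test itself, assuming "LD" here means Levenshtein distance (the issue does not spell it out). The function name `levenshtein` is hypothetical and not taken from the project's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b` (classic dynamic-programming version)."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Whichever option is chosen, this comparison would be applied to pairs of entity names, either within one parsed file or across the tables being linked.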
On top of this, there's one small issue: what if a misspelling gets accepted as the proper way to spell something, and every proper spelling of the word gets "corrected" to the improper one? My only thought is having an internal dictionary of all proper spellings.
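One way the dictionary idea could work is to only ever correct a name *toward* an entry in the dictionary, so a misspelling can never become the target. The sketch below assumes LD means Levenshtein distance; `correct_spelling`, `known_spellings`, and the default `max_distance` of 2 are hypothetical names and values chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between `a` and `b` (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def correct_spelling(name: str, known_spellings: set[str],
                     max_distance: int = 2) -> str:
    """Return the closest known spelling within `max_distance` edits.

    Names already in the dictionary, or with no close enough match,
    are returned unchanged.
    """
    if name in known_spellings:
        return name
    best, best_distance = name, max_distance + 1
    # Sorting makes the choice deterministic when two entries tie.
    for candidate in sorted(known_spellings):
        distance = levenshtein(name, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best
```

With `known_spellings = {"Toronto", "Ottawa"}`, the input `"Torotno"` is two edits from `"Toronto"` and would be corrected, while an unrelated name such as `"Vancouver"` would be left alone. The open question of who maintains the dictionary is not addressed by this sketch.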
Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it on some metric, like the length of the strings being compared?
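To make the length-based alternative concrete, here is a rough sketch of a threshold that scales with string length but is capped at a static maximum. The name `max_allowed_distance` and the values `chars_per_edit=5` and `cap=3` are placeholders for discussion, not tuned recommendations.

```python
def max_allowed_distance(a: str, b: str,
                         chars_per_edit: int = 5, cap: int = 3) -> int:
    """Largest LD at which `a` and `b` would still count as the same entity.

    Allows one edit per `chars_per_edit` characters of the longer string,
    never more than `cap` edits in total.
    """
    longest = max(len(a), len(b))
    return min(cap, longest // chars_per_edit)
```

With these placeholder values, strings shorter than five characters get a threshold of 0, so short names like "cat" and "car" would never be merged, while long names are still limited to 3 edits. A purely static threshold of 3 would treat both cases the same.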