Implementing Levenshtein Distance #18

@MNSleeper

Description

There are two main ways to implement LD, listed below:

  • Run an LD test on every entry of the array of strings that gets passed to SQL once a file is parsed. This would correct any misspellings in the file's list before it is inserted into SQL. However, two different datasets could spell the same entity differently, so the LD test would have to be run again at link time anyway.
  • Run an LD test after a dataset has been put into SQL but before running the linkage method on that particular table, likely as part of the LinkTable method itself. This would ensure that misspellings are corrected consistently across all tables.

The latter option seems best.
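Whichever placement we pick, the distance itself is the standard dynamic-programming edit distance (insertions, deletions, substitutions). A minimal Python sketch for discussion; the function name is ours and nothing here is in the codebase yet:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance, computed one row at a time (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).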

On top of this, there's one small issue: what if a misspelling gets accepted as the proper way to spell something, and every proper spelling of that word then gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings and only ever correct toward entries in it.
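The internal-dictionary idea could look something like the sketch below: snap each incoming string to the nearest known-good spelling, and leave it untouched when nothing is close enough. Everything here (names, the `max_dist` default, the example dictionary) is hypothetical, not existing project code:

```python
def levenshtein(a: str, b: str) -> int:
    # Standard DP edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def canonicalize(entry: str, dictionary: set, max_dist: int = 2) -> str:
    """Snap `entry` to the closest proper spelling within `max_dist` edits.
    Corrections only ever flow toward the dictionary, so an accepted
    misspelling can't overwrite proper spellings. Unmatched entries pass
    through unchanged."""
    best, best_d = entry, max_dist + 1
    for word in dictionary:
        d = levenshtein(entry, word)
        if d < best_d:
            best, best_d = word, d
    return best
```

Scanning the whole dictionary per entry is O(|dictionary|) per string, which is fine for small vocabularies but would need indexing (e.g. by length or prefix) at scale.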

Finally, what should an acceptable LD threshold be? Should we use a static number, like 3, or base it on some metric like the length of the string being compared?
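One possible answer to the threshold question, sketched under assumed numbers we'd want to tune against real data: scale the allowed distance with string length, so short codes stay strict while longer names get more slack. A flat 3 would let "cat" match "dog" (distance 3), which a length-scaled cutoff avoids:

```python
def max_allowed_distance(s: str) -> int:
    # Assumed heuristic: allow roughly one edit per four characters,
    # but always tolerate at least one edit. The 4 is a guess to tune.
    return max(1, len(s) // 4)
```

Under this rule a two-letter state code tolerates only one edit, while an eleven-character city name tolerates two.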
