Create deduplication data for hand-annotating training data for a simple classifier #646

@janehmueller

Description

  • compute the score of every (string) similarity measure for the names of each pair of subjects
  • apply a threshold to the mean of the scores (to reduce the amount of data)
  • export the duplicate candidates to a file as follows:
    • for every new subject, there should be a list of the subjects in the knowledge base that it can be linked to
    • the subjects in each group should be ranked by the mean of their scores, in descending order
    • every subject should contain all of its subject data, the individual similarity measure scores, and the mean of the scores
    • the export format should be JSON
  • a Python script should then read the exported data and allow it to be annotated manually
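
The steps above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the two similarity measures, the `id` and `name` subject fields, the JSON layout, the default threshold, and all function names are assumptions made for the example, and the sketch assumes that pairs with a mean score at or above the threshold are the ones to keep.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


# Hypothetical similarity measures; the issue does not name the real ones.
def sequence_ratio(a, b):
    """Character-level similarity in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of the whitespace-separated tokens of both names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


MEASURES = {"sequence_ratio": sequence_ratio, "token_jaccard": token_jaccard}


def score_pair(new_name, kb_name):
    """Return every measure's score for one pair of names, plus their mean."""
    scores = {name: fn(new_name, kb_name) for name, fn in MEASURES.items()}
    return scores, mean(scores.values())


def build_candidates(new_subjects, kb_subjects, threshold=0.5):
    """Group knowledge-base candidates under each new subject.

    Pairs whose mean score is below `threshold` are dropped; the rest are
    sorted by mean score in descending order.
    """
    groups = []
    for new in new_subjects:
        candidates = []
        for existing in kb_subjects:
            scores, avg = score_pair(new["name"], existing["name"])
            if avg >= threshold:
                candidates.append(
                    {"subject": existing, "scores": scores, "mean": avg}
                )
        if candidates:
            candidates.sort(key=lambda c: c["mean"], reverse=True)
            groups.append({"subject": new, "candidates": candidates})
    return groups


def export_candidates(groups, path):
    """Write the grouped candidates to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(groups, f, ensure_ascii=False, indent=2)


def annotate(groups, ask=input):
    """Ask for a y/n decision per candidate pair and collect the labels.

    `ask` defaults to the built-in `input`; it is a parameter so the prompt
    can be replaced, for example in tests.
    """
    labels = []
    for group in groups:
        for cand in group["candidates"]:
            prompt = (
                f'{group["subject"]["name"]} == {cand["subject"]["name"]} '
                f'(mean {cand["mean"]:.2f})? [y/n] '
            )
            answer = ask(prompt).strip().lower()
            labels.append(
                {
                    "new_id": group["subject"]["id"],
                    "kb_id": cand["subject"]["id"],
                    "is_duplicate": answer == "y",
                }
            )
    return labels
```

A run would call `build_candidates` on the new subjects and the knowledge-base subjects, write the result with `export_candidates`, and later load that file with `json.load` and pass it to `annotate` to collect the hand-made labels for the classifier.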
