Create deduplication data for hand-annotating training data for a simple classifier

- compute the score of every (string) similarity measure for the names of each pair of subjects
- keep only pairs whose mean score passes a threshold (to reduce the amount of data)
- export duplicate candidates to a file in the following way:
  - for every new subject there should be a list of subjects in the knowledge base that can be linked
  - the grouped subjects should be ranked by the mean of the scores in descending order
  - every subject should contain all the subject data, the similarity measure scores and the mean of the scores
  - the export format should be JSON
- a Python script should then load the exported data and allow it to be annotated manually
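
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the issue does not name the similarity measures, so two stand-ins from the Python standard library are used (`difflib.SequenceMatcher` and a token-level Jaccard score). The subject fields (`id`, `name`), the function names, the JSON key names and the default threshold of `0.5` are all hypothetical.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (stand-in measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical registry; the real measures would be plugged in here.
MEASURES = {"sequence_ratio": sequence_ratio, "token_jaccard": token_jaccard}


def find_candidates(new_subjects, kb_subjects, threshold=0.5):
    """Group knowledge-base candidates under each new subject.

    A candidate is kept only if the mean of all scores is >= threshold.
    Candidates are ranked by that mean in descending order.
    """
    groups = []
    for new in new_subjects:
        candidates = []
        for kb in kb_subjects:
            scores = {
                name: measure(new["name"], kb["name"])
                for name, measure in MEASURES.items()
            }
            mean_score = mean(scores.values())
            if mean_score >= threshold:
                candidates.append(
                    {"subject": kb, "scores": scores, "mean_score": mean_score}
                )
        candidates.sort(key=lambda c: c["mean_score"], reverse=True)
        if candidates:
            groups.append({"new_subject": new, "candidates": candidates})
    return groups


def export_candidates(groups, path):
    """Write the grouped duplicate candidates to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2, ensure_ascii=False)


def annotate(groups, ask=input):
    """Ask for a y/n label per pair; `ask` is injectable so it can be tested."""
    labels = []
    for group in groups:
        new = group["new_subject"]
        for candidate in group["candidates"]:
            kb = candidate["subject"]
            prompt = (
                f"{new['name']!r} vs {kb['name']!r} "
                f"(mean {candidate['mean_score']:.2f}) duplicate? [y/n] "
            )
            answer = ask(prompt).strip().lower()
            labels.append(
                {
                    "new_id": new["id"],
                    "kb_id": kb["id"],
                    "scores": candidate["scores"],
                    "is_duplicate": answer == "y",
                }
            )
    return labels
```

A run would call `find_candidates(new_subjects, kb_subjects)`, pass the result to `export_candidates(groups, "candidates.json")` (the file name is only an example), and later feed the loaded JSON to `annotate`. Keeping the individual scores in each label means the annotated pairs can serve directly as feature vectors for the classifier.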

Create deduplication data for hand-annotating training data for a simple classifier #646
