API Duplicate resource detection #424

@regetz

What

The storage system should be able to detect when a user attempts to add a resource that already exists, and/or when two or more resources already in the database are putative duplicates.

Details

Scenario: A user uploads one or more records (plots, plot observations, taxon observations, projects, references, etc.) as though they are new resources in the system. Depending on how we implement upload functionality, this may mean adding a resource without an identifier (accessionCode), or specifying an identifier that does not exist in the system. We would like the system to recognize when such a resource already appears to exist under some other identifier, and flag this for the user.
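As a starting point for discussion, here is a minimal sketch of exact-match detection for the scenario above. It assumes records are plain dicts and that a plot can be compared on `plot_name`, `latitude`, and `longitude`; those field names, the sample values, and the `plot.1` accession code format are hypothetical and not taken from the actual schema. Only `accessionCode` comes from this issue.

```python
import hashlib
import json

# Hypothetical comparison fields for a plot; the real list is still TBD.
PLOT_KEY_FIELDS = ("plot_name", "latitude", "longitude")


def fingerprint(record, key_fields=PLOT_KEY_FIELDS):
    """Hash the normalized values of key_fields, ignoring the identifier."""
    normalized = []
    for field in key_fields:
        value = record.get(field)
        if isinstance(value, str):
            # Collapse whitespace and ignore case so trivial edits still match.
            value = " ".join(value.split()).lower()
        normalized.append(value)
    payload = json.dumps(normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_existing_duplicates(uploads, existing):
    """Return (upload_index, existing_accession_code) pairs that match exactly."""
    index = {fingerprint(r): r["accessionCode"] for r in existing}
    matches = []
    for i, record in enumerate(uploads):
        code = index.get(fingerprint(record))
        # A match under the same identifier is an update, not a duplicate.
        if code is not None and code != record.get("accessionCode"):
            matches.append((i, code))
    return matches


# Made-up sample data for illustration only.
existing = [
    {"accessionCode": "plot.1", "plot_name": "Santa Rosa plot 1",
     "latitude": 38.44, "longitude": -122.71},
]
uploads = [
    {"plot_name": "  santa rosa  plot 1", "latitude": 38.44, "longitude": -122.71},
    {"plot_name": "Santa Rosa plot 2", "latitude": 38.45, "longitude": -122.70},
]
duplicates = find_existing_duplicates(uploads, existing)
```

Here only the first upload is flagged, against the existing `plot.1`. In a real implementation the fingerprint could be stored in an indexed column so the lookup does not require scanning the table, but that is a design choice this issue leaves open.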

TBD:

  • Do we also check for duplication internally within a bulk upload? For example, a user uploads two new records with different identifiers but the same values in the other fields.
  • What fields should we check for duplication? Do we need to do this on a resource by resource basis?
  • To what extent do we want to do fuzzy matching? For example, do we require the plot name to be identical or merely very similar? If the latter, consider that some fields are likely to be very similar across distinct records (e.g. plot "Santa Rosa plot 1" and "Santa Rosa plot 2"), leading to a high false positive rate in duplicate detection.
  • How many fields have to be the same to qualify as a duplicate? Do we need to (and can we) define a specific and reliable rule on a resource type by resource type basis?
  • When we detect duplicates, what do we do?
    • Fail to add the data, and return a corresponding warning message?
    • In that case, how do we allow the user to override duplicate detection? Do we do this on a record by record basis within an upload request, or apply it to the entire upload request?
  • Do we also want to run periodic duplicate detection on the database as a background job (or expose it as an admin API method that can be run on demand), flagging potential duplicates for review and cleanup? If so, what cleanup do we offer?
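To make the fuzzy matching and field-count questions above concrete, here is a rough sketch using Python's standard `difflib`. It assumes the same hypothetical plot fields (`plot_name`, `latitude`, `longitude`), and the 0.9 name threshold and 1e-4 degree coordinate tolerance are arbitrary placeholders, not proposed values. The rule shown (similar name AND same location) is just one possible per-resource rule.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_putative_duplicate(a, b, name_threshold=0.9, coord_tolerance=1e-4):
    """Require a similar name AND matching coordinates (hypothetical rule)."""
    similar_name = (
        name_similarity(a["plot_name"], b["plot_name"]) >= name_threshold
    )
    same_location = (
        abs(a["latitude"] - b["latitude"]) <= coord_tolerance
        and abs(a["longitude"] - b["longitude"]) <= coord_tolerance
    )
    return similar_name and same_location


def duplicates_within_batch(records):
    """Return index pairs of records in one upload that look like duplicates."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if is_putative_duplicate(a, b)
    ]


# Made-up sample batch for illustration only.
batch = [
    {"plot_name": "Santa Rosa plot 1", "latitude": 38.44, "longitude": -122.71},
    {"plot_name": "Santa Rosa plot 2", "latitude": 38.45, "longitude": -122.70},
    {"plot_name": "Santa Rosa Plot 1", "latitude": 38.44, "longitude": -122.71},
]
name_only_score = name_similarity(batch[0]["plot_name"], batch[1]["plot_name"])
batch_duplicates = duplicates_within_batch(batch)
```

On this sample, "Santa Rosa plot 1" and "Santa Rosa plot 2" score above 0.9 on name alone, which illustrates the false positive concern. Requiring a second field to agree leaves only the first and third records flagged as a pair. The pairwise comparison is quadratic in batch size, so a background job over the whole database would likely need some form of blocking or indexing first.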

Metadata

    Status

    Backlog
