Better tracking of source annotations for adjudication #5

@etgld

Currently, sources are passed around as a collection of json.dumps strings attached to the annotation dataclasses. This is bloated, and it also creates duplicates when we convert the internal annotation objects back to dictionaries for dumping to Label Studio input. Each annotation dataclass corresponds to one annotation type that we want to score individually (DocTimeRel, Adverse Event, etc.), but sources are tracked by ID and offset, so the same source can be shared by several annotations.
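A minimal sketch of the duplication, assuming a simplified stand-in for the internal dataclasses. `InternalAnnotation`, its fields, and the keys of the source dictionary are hypothetical names for illustration, not the project's actual code:

```python
import json
from dataclasses import dataclass, field


@dataclass
class InternalAnnotation:
    # Hypothetical stand-in for one internal annotation dataclass.
    kind: str   # annotation type scored individually, e.g. "DocTimeRel"
    label: str
    # Sources carried as json.dumps strings, as described above.
    sources: list[str] = field(default_factory=list)


# One source span (assumed shape: ID plus character offsets).
source = {"id": "r1", "start": 10, "end": 18}
serialized = json.dumps(source, sort_keys=True)

# Two annotation types that both point at the same source.
annotations = [
    InternalAnnotation("DocTimeRel", "OVERLAP", [serialized]),
    InternalAnnotation("Adverse Event", "AE", [serialized]),
]


def naive_dump(anns):
    """Convert back to dictionaries without looking at identity: duplicates."""
    return [json.loads(s) for ann in anns for s in ann.sources]


def deduped_dump(anns):
    """Keep one dictionary per (id, start, end) key."""
    seen = {}
    for ann in anns:
        for s in ann.sources:
            src = json.loads(s)
            seen.setdefault((src["id"], src["start"], src["end"]), src)
    return list(seen.values())
```

Here `naive_dump` emits the shared source once per annotation type, while `deduped_dump` collapses it by ID and offsets. The dedup step is only a patch over the string-based representation, which is what the options below try to replace.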

To me there seem to be three better possibilities:

  1. A mapping from ID and offsets to the collection of internal annotations for that source (problem: a giant omniscient data structure)
  2. Have each source be exactly one entity (still some bloat)
  3. (Favorite, obviously) Develop better models of the Label Studio annotation types and add a conversion method from the internal dataclasses to those (this could serve as a pivot to bring in Pydantic and fold into Ian's code, if that's ever helpful)
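One way option 3 could look, as a rough sketch rather than a proposal for the final design. It uses plain stdlib dataclasses (Pydantic models could be swapped in later). `SourceKey`, `InternalAnnotation`, `LabelStudioRegion`, and `to_regions` are all hypothetical names, and the region fields are an assumed shape, not Label Studio's actual schema:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceKey:
    # Frozen so it is hashable and can be used as a dictionary key.
    source_id: str
    start: int
    end: int


@dataclass
class InternalAnnotation:
    # Hypothetical internal annotation: one type, one label, one source.
    kind: str
    label: str
    source: SourceKey


@dataclass
class LabelStudioRegion:
    # Assumed model of one output region; not Label Studio's real schema.
    source_id: str
    start: int
    end: int
    labels: dict[str, str] = field(default_factory=dict)  # kind -> label


def to_regions(annotations):
    """Group internal annotations by source so each source is emitted once."""
    regions = {}
    for ann in annotations:
        key = ann.source
        if key not in regions:
            regions[key] = LabelStudioRegion(key.source_id, key.start, key.end)
        regions[key].labels[ann.kind] = ann.label
    return list(regions.values())
```

With this arrangement the internal dataclasses hold a small structured key instead of serialized strings, and the conversion step is the only place that decides how shared sources are merged. `dataclasses.asdict` can then turn each region into a dictionary for dumping.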
