Ingest: Data from INSDC with invalid author format is parsed incorrectly, causing sequences to error #6111

@anna-parker

Description

We receive the error:

Authors: Invalid name(s): 'Lee Cynthia K, [. U. S. ].'; 'Monath Thomas P, [. U. S. ].'; 'Guertin Patrick M, [. U. S. ].' ... and 1 others. Please ensure that authors are separated by semi-colons. Each author's name should be in the format 'last name, first name;'. Last name(s) is mandatory, a comma is mandatory to separate first names/initials from last name. Only ASCII alphabetical characters A-Z are allowed. For example: 'Smith, Anna; Perez, Tom J.; Xu, X.L.;' or 'Xu,;' if the first name is unknown.

for the sequences: JA784072.1, JA784073.1, JA784080.1, JA784081.1, JA784082.1, JA784083.1, JA784086.1

Ingest submitted: Lee Cynthia K, [. U. S. ].; Monath Thomas P, [. U. S. ].; Guertin Patrick M, [. U. S. ].; Hayman Edward G, [. U. S. ].

What we received in the NCBI Virus download: "submitter":{"names":["LEE CYNTHIA K,[.U.S.].","MONATH THOMAS P,[.U.S.].","GUERTIN PATRICK M,[.U.S.].","HAYMAN EDWARD G,[.U.S.]."]}

The expected structure is ["PUSHKO,P.","LUKASHEVICH,I."], so ingest assumes that [.U.S.]. holds the initials. Preprocessing then errors because the resulting names contain characters that the author format does not allow (the square brackets are not ASCII alphabetical characters).
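One possible direction, sketched here only to illustrate the problem: ingest could check that the part after the comma really looks like initials before formatting it, and fall back to the 'Last name,' form otherwise. The function names `format_ncbi_author` and `format_ncbi_authors` and the regular expression are hypothetical and are not taken from the existing ingest code. The sketch also assumes that every name in the download uses a single comma to separate the last name from the initials.

```python
import re

# Hypothetical helper, not the actual ingest implementation.
# Assumption: well-formed NCBI Virus names look like "LAST,INITIALS", e.g. "PUSHKO,P.".
# Initials are accepted only if they are letters, each optionally followed by a dot.
INITIALS_RE = re.compile(r"(?:[A-Za-z]\.?)+")


def format_ncbi_author(raw: str) -> str:
    """Format one name as 'Last, F. M.', or as 'Last,' when the initials look invalid."""
    last, _, initials = raw.partition(",")
    last = last.strip().title()
    initials = initials.strip()
    if not INITIALS_RE.fullmatch(initials):
        # Values such as "[.U.S.]." are not initials, so drop them
        # instead of passing them on to preprocessing.
        return f"{last},"
    letters = [c.upper() for c in initials if c.isalpha()]
    return f"{last}, " + " ".join(f"{c}." for c in letters)


def format_ncbi_authors(names: list[str]) -> str:
    """Join all formatted names with the semicolon separator the error message asks for."""
    return "; ".join(format_ncbi_author(name) for name in names)


if __name__ == "__main__":
    print(format_ncbi_authors(["PUSHKO,P.", "LUKASHEVICH,I."]))
    print(format_ncbi_authors(["LEE CYNTHIA K,[.U.S.].", "MONATH THOMAS P,[.U.S.]."]))
```

With this fallback the affected records would keep "Lee Cynthia K" as the last-name field and lose the bracketed part. The sketch does not try to work out that "Cynthia K" is probably the first name, and whether that result is acceptable is a separate question.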

    Labels

    bug, ingest, preprocessing
