Curation of SMILES strings? #7

@roccomoretti

I'm curious how much curation went into the SMILES strings provided in dataset/LP_PDBBind.csv.

Some entries are missing a SMILES string (e.g. 4or4). From spot checking, these seem to be primarily nucleotides and their derivatives. Is this intentional, or is it a processing issue? All of these are excluded from Clean Level 1 (CL1), but I'm also wondering which way the causality runs: are they missing SMILES because of structural issues, or are they excluded from CL1 because the pipeline didn't give them SMILES?
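For anyone who wants to reproduce the spot check, here is a minimal sketch of how I'd tabulate missing SMILES against CL1 membership. The column names (`pdb_id`, `smiles`, `CL1`) and the `True`/`False` encoding are my assumptions and may not match the real header of dataset/LP_PDBBind.csv. The inline sample is a toy stand-in: only the 4or4 row reflects something stated above, and the `toy*` rows are made up.

```python
import csv
import io
from collections import Counter

# Toy stand-in for dataset/LP_PDBBind.csv. The column names and the
# True/False encoding are ASSUMPTIONS; adjust them to the real header.
# Only the 4or4 row mirrors the issue text; the toy* rows are made up.
SAMPLE = """pdb_id,smiles,CL1
4or4,,False
toy1,CCO,True
toy2,,False
toy3,c1ccccc1,False
"""


def tabulate_missing_smiles(handle, id_col="pdb_id", smiles_col="smiles", cl1_col="CL1"):
    """Count rows by (has_smiles, in_cl1) and collect IDs with no SMILES."""
    table = Counter()
    missing_ids = []
    for row in csv.DictReader(handle):
        has_smiles = bool((row[smiles_col] or "").strip())
        in_cl1 = (row[cl1_col] or "").strip().lower() == "true"
        table[(has_smiles, in_cl1)] += 1
        if not has_smiles:
            missing_ids.append(row[id_col])
    return table, missing_ids


table, missing_ids = tabulate_missing_smiles(io.StringIO(SAMPLE))
print("missing SMILES:", missing_ids)
print("missing SMILES but still in CL1:", table[(False, True)])
```

To run it on the real file, pass `open("dataset/LP_PDBBind.csv", newline="")` instead of the `StringIO` object. A zero in the last line only confirms that the two properties co-occur. It does not say which one causes the other, which is the part I'm asking about.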

I also noticed that a number of SMILES (e.g. 1hti's O=C(O)COP(=O)(O)O) are represented in neutral form, whereas others (e.g. 5uxm's [NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-], just one line above it) carry explicit charges, with some apparent inconsistency: 4dxj's CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O has a protonated amine, but I'm doubtful those phosphonate groups are fully protonated at experimental conditions. Is the presence or absence of charges in these strings intended to be meaningful?
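To make the charge question concrete, here is a rough sketch that sorts the three strings above by how they handle charges. It is plain string matching on my part, not anything from the dataset's pipeline: it reads formal charges from bracket atoms, and it uses a crude substring test (`P(=O)(O)O`, my own choice) to spot a fully protonated phosphate or phosphonate group. A real audit would be better done with a cheminformatics toolkit.

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
# Inside a bracket atom, '+' or '-' always denotes a charge (never a bond).
CHARGE = re.compile(r"(\++|-+)(\d*)")

# Crude, hand-picked fragment for a fully protonated phosphate/phosphonate.
# This is a heuristic ASSUMPTION, not a proper substructure search.
NEUTRAL_PHOSPHO = "P(=O)(O)O"


def formal_charges(smiles):
    """Return the formal charges written explicitly in bracket atoms."""
    charges = []
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue
        sign = 1 if match.group(1)[0] == "+" else -1
        # "[Fe+2]" gives digits; "[Fe++]" is counted by repeated signs.
        magnitude = int(match.group(2)) if match.group(2) else len(match.group(1))
        charges.append(sign * magnitude)
    return charges


def charge_style(smiles):
    """Label a SMILES as 'neutral', 'charged', or 'mixed'."""
    if not formal_charges(smiles):
        return "neutral"
    if NEUTRAL_PHOSPHO in smiles:
        return "mixed"  # explicit charges next to a fully protonated acid group
    return "charged"


examples = {
    "1hti": "O=C(O)COP(=O)(O)O",
    "5uxm": "[NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-]",
    "4dxj": "CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O",
}
for pdb_id, smi in examples.items():
    print(pdb_id, charge_style(smi), formal_charges(smi))
```

Running something like this over the whole SMILES column would show how many entries fall into each bucket, which might indicate whether the charges come from a deliberate protonation step or are inherited from the input files.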

I'm also seeing odd cases like 2jg8, which is supposed to be bound to phosphoserine but is annotated (at CL2) with a SMILES of [NH3+]C(C=O)COP(=O)(O)O. While this matches the atoms present in the structure, it's highly doubtful that an aldehyde rather than a carboxylate is bound. I'd wager the more likely explanation is that the crystallographers omitted the atom due to missing density. (A quick check of the paper and the PDB entry provides no additional guidance, save for an explicit annotation that the OXT atom of SEP is missing coordinates.)
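One way cases like this could be flagged automatically is to compare the heavy-atom composition of the annotated SMILES against a reference SMILES for the named ligand. Below is a minimal sketch of that idea using a small regex tokenizer. The reference string for phosphoserine is something I wrote out by hand from its known structure, not something taken from the dataset, and the tokenizer only handles the common organic subset.

```python
import re
from collections import Counter

# Bracket atoms first, then the organic-subset symbols (two-letter ones first).
TOKEN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Element symbol at the start of a bracket atom, after an optional isotope.
ELEMENT_IN_BRACKET = re.compile(r"^\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for bracket, organic in TOKEN.findall(smiles):
        if bracket:
            match = ELEMENT_IN_BRACKET.match(bracket)
            if match is None:  # e.g. wildcard atoms; ignored in this sketch
                continue
            symbol = match.group(1)
        else:
            symbol = organic
        symbol = symbol.capitalize()  # fold aromatic lowercase into the element
        if symbol != "H":
            counts[symbol] += 1
    return counts


annotated = "[NH3+]C(C=O)COP(=O)(O)O"          # 2jg8 SMILES quoted above
reference = "N[C@@H](COP(=O)(O)O)C(=O)O"       # phosphoserine, hand-written ASSUMPTION

# Counter subtraction keeps only elements the annotated string is short of.
missing = heavy_atom_counts(reference) - heavy_atom_counts(annotated)
print("atoms missing relative to reference:", dict(missing))
```

For 2jg8 the difference comes out as a single oxygen, consistent with the missing OXT. A check like this can't tell an unmodelled atom from a genuinely different ligand, but it would at least surface entries worth a manual look.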
