Description
I'm curious about the extent of curation of the SMILES strings provided in dataset/LP_PDBBind.csv.
Some entries seem to be missing a SMILES string (e.g. 4or4). From spot checking, these seem to be primarily nucleotides and their derivatives. Is this intentional, or is this a processing issue? All of these are excluded from Clean Level 1 (CL1), but I'm also wondering which way the causality runs: are they missing SMILES because of structural issues, or are they excluded from CL1 because the pipeline didn't give them SMILES?
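For anyone who wants to reproduce the spot check, here is a minimal sketch of how the missing-SMILES entries could be listed and cross-checked against CL1. The column names (`pdb_id`, `smiles`, `CL1`) and the toy rows are assumptions made for illustration; they are not taken from the real file, so adjust them to whatever the CSV header actually contains.

```python
import csv
import io


def missing_smiles_report(rows, smiles_col="smiles", cl1_col="CL1"):
    """Return (rows with an empty SMILES, subset of those still flagged as CL1).

    Column names are assumptions; pass the real ones if they differ.
    """
    missing = [r for r in rows if not (r.get(smiles_col) or "").strip()]
    truthy = {"true", "1"}
    in_cl1 = [r for r in missing if (r.get(cl1_col) or "").strip().lower() in truthy]
    return missing, in_cl1


# Toy stand-in for the CSV. IDs and values are made up, not real entries.
TOY_CSV = """pdb_id,smiles,CL1
aaaa,,False
bbbb,CCO,True
cccc,,False
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
missing, in_cl1 = missing_smiles_report(rows)
missing_ids = [r["pdb_id"] for r in missing]
print("missing SMILES:", missing_ids)
print("missing SMILES but still in CL1:", [r["pdb_id"] for r in in_cl1])
```

To run this on the real data, replace the `io.StringIO(TOY_CSV)` handle with the opened `dataset/LP_PDBBind.csv`. An empty second list would be consistent with the observation above, though it would not say which way the exclusion goes.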
I also noticed that a number of SMILES (e.g. 1hti's O=C(O)COP(=O)(O)O) seem to be represented in neutral form, whereas other structures (e.g. 5uxm's [NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-], just one line above it) are represented with explicit charges, albeit with some potential inconsistency (e.g. 4dxj's CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O, where the amine is charged but I'm doubtful those phosphonate groups are fully protonated at experimental conditions). Are the presence/absence of charges in these strings intended to be meaningful?
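To make the charge comparison concrete, the sketch below pulls the explicit formal charges out of the bracket atoms of a SMILES string, using only the standard library. It is a regex heuristic rather than a real SMILES parser (a cheminformatics toolkit would be the more robust choice), and the function name `explicit_charges` is my own.

```python
import re

# Bracket atoms such as [NH3+] or [O-]; charges can only appear inside brackets.
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")
# A charge is a run of '+' or '-' signs, optionally followed by a magnitude.
CHARGE = re.compile(r"(\++|-+)(\d*)")


def explicit_charges(smiles):
    """List the formal charges written explicitly in a SMILES string."""
    charges = []
    for body in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(body)
        if match is None:
            continue  # neutral bracket atom, e.g. [C@@H] or [nH]
        signs, digits = match.groups()
        magnitude = int(digits) if digits else len(signs)
        charges.append(magnitude if signs[0] == "+" else -magnitude)
    return charges


examples = {
    "1hti": "O=C(O)COP(=O)(O)O",
    "5uxm": "[NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-]",
    "4dxj": "CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O",
}
for pdb_id, smiles in examples.items():
    print(pdb_id, explicit_charges(smiles))
```

On the three strings quoted above, this reports no charges for 1hti, a +1/-1 pair for 5uxm, and a lone +1 for 4dxj, which is the pattern that prompted the question.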
I'm also seeing strange inconsistencies, like 2jg8, which is supposed to be bound to phosphoserine but is annotated (at CL2) with a SMILES of [NH3+]C(C=O)COP(=O)(O)O. While this matches the atoms present in the structure, it's highly doubtful that an aldehyde rather than a carboxylate is bound to the structure. I'd wager the more likely explanation is that the crystallographers omitted the atom due to missing density. (A quick check of the paper and the PDB entry does not provide additional guidance, save for an explicit annotation that the OXT atom of SEP is missing coordinates.)
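One rough way to flag cases like this is to compare heavy-atom counts between the annotated SMILES and a reference SMILES for the expected ligand. The sketch below does that with a regex tokenizer, which is a heuristic and not a full SMILES parser. The reference string for phosphoserine (SEP) is my own addition, not something taken from the dataset, and the function names are hypothetical.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then the organic subset.
ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element (regex heuristic)."""
    counts = Counter()
    for token in ATOM.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:
                continue
            symbol = match.group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # fold aromatic lowercase into the element
        if symbol != "H":
            counts[symbol] += 1
    return counts


def missing_atoms(reference, observed):
    """Atoms present in the reference but absent from the observed SMILES."""
    return heavy_atom_counts(reference) - heavy_atom_counts(observed)


SEP_REFERENCE = "N[C@@H](COP(=O)(O)O)C(=O)O"  # assumed reference for phosphoserine
ANNOTATED_2JG8 = "[NH3+]C(C=O)COP(=O)(O)O"    # as quoted above

print(dict(missing_atoms(SEP_REFERENCE, ANNOTATED_2JG8)))
```

For 2jg8 the difference comes out as a single oxygen, which fits the missing OXT atom noted in the PDB entry. A count comparison like this cannot tell which atom is missing or why, so it is only a screening step.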