Curation of SMILES strings?

I'm curious about the extent of curation of the SMILES strings provided in `dataset/LP_PDBBind.csv`.

Some entries seem to be missing a SMILES string (e.g. `4or4`). From spot checking, these seem to be primarily nucleotides and their derivatives. Is this intentional, or is it a processing issue? All of these entries are excluded from Clean Level 1 (CL1), but I'm also wondering which way the causality runs: are they missing SMILES because of structural issues, or are they excluded from CL1 because the pipeline didn't give them SMILES?
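For anyone who wants to reproduce the missing-SMILES observation, a cross-tabulation along these lines should do it. This is only a sketch: the column names `pdb_id`, `smiles`, and `CL1`, and the `True`/`False` encoding of the flag, are my assumptions about the CSV layout, and the inline sample rows are placeholders rather than real entries.

```python
import csv
import io
from collections import Counter


def crosstab_missing_smiles(rows, smiles_col="smiles", flag_col="CL1"):
    """Count rows by (has_smiles, in_cl1).

    Column names and the "true"/"1" flag encoding are assumptions about
    the CSV layout; adjust them to match the actual header.
    """
    counts = Counter()
    for row in rows:
        has_smiles = bool((row.get(smiles_col) or "").strip())
        in_cl1 = (row.get(flag_col) or "").strip().lower() in {"true", "1"}
        counts[(has_smiles, in_cl1)] += 1
    return counts


# Placeholder rows for illustration only (not taken from the dataset).
SAMPLE = """pdb_id,smiles,CL1
xxx1,,False
xxx2,CCO,True
xxx3,CCN,False
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = crosstab_missing_smiles(rows)
for (has_smiles, in_cl1), n in sorted(counts.items()):
    print(f"has_smiles={has_smiles} in_cl1={in_cl1}: {n}")
```

To run it on the real file, replace the `io.StringIO(SAMPLE)` argument with an open handle to `dataset/LP_PDBBind.csv`. A zero count for `(False, True)` would match what I saw, i.e. no entry without SMILES makes it into CL1.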

I also noticed that a number of SMILES (e.g. `1hti`'s `O=C(O)COP(=O)(O)O`) are represented in neutral form, whereas other structures (e.g. `5uxm`'s `[NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-]`, just one line above it) are represented with explicit charges, albeit with some potential inconsistency within a single string (e.g. `4dxj`'s `CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O`, where the amine is protonated but I doubt those phosphonates are fully protonated under experimental conditions). Is the presence or absence of charges in these strings intended to be meaningful?
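To get a rough count of how many entries carry explicit charges, a string-level scan like the following might be enough. It is a sketch under a simplifying assumption: it only reads charge tokens inside bracket atoms with a regular expression and does not parse or validate the SMILES, so a cheminformatics toolkit would be the more robust choice. The function name `formal_charges` is just illustrative.

```python
import re

# Bracket atoms such as [NH3+], [O-], [Fe+2]; a simplification, not a full SMILES parser.
BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
# A run of '+' or '-' signs, optionally followed by a magnitude ("+", "++", "+2").
CHARGE = re.compile(r"(\++|-+)(\d*)")


def formal_charges(smiles):
    """Return the explicit formal charges found in bracket atoms, in order."""
    charges = []
    for body in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(body)
        if match is None:
            continue
        signs, digits = match.groups()
        magnitude = int(digits) if digits else len(signs)
        charges.append(magnitude if signs[0] == "+" else -magnitude)
    return charges


# The three strings quoted above.
examples = {
    "1hti": "O=C(O)COP(=O)(O)O",
    "5uxm": "[NH3+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-]",
    "4dxj": "CCC[NH2+]CC(P(=O)(O)O)P(=O)(O)O",
}
for pdb_id, smi in examples.items():
    charges = formal_charges(smi)
    print(pdb_id, charges, "net", sum(charges))
```

Strings with an empty list are written fully neutral, while a case like `4dxj` shows up as a lone positive charge next to acidic groups that are drawn protonated.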

I'm also seeing strange inconsistencies, like `2jg8`, which is supposed to be bound to phosphoserine but is annotated (at CL2) with a SMILES of `[NH3+]C(C=O)COP(=O)(O)O`. While this matches the atoms present in the structure, it's highly doubtful that the bound ligand is an aldehyde rather than a carboxylate. I'd wager the more likely explanation is that the crystallographers omitted the atom due to missing density. (A quick check of the paper and the PDB entry does not provide additional guidance, save for an explicit annotation that the OXT atom of SEP is missing coordinates.)
