From debugging this BioSimSpace issue it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately, the formatting of these records varies considerably between PDB files, making it hard to develop a single strategy for dealing with them. The following is copied from the above issue thread:
For example, in this PDB there are HETATM records with the same chain identifier before and after the TER. Some examples of the different formatting:
HETATM in chain B before and after TER, followed by HETATMs from chain A.
...
ATOM 2097 HD11 ILE B 36 -7.894 6.751 -22.957 1.00 52.74 H
ATOM 2098 HD12 ILE B 36 -8.945 7.001 -24.122 1.00 52.74 H
ATOM 2099 HD13 ILE B 36 -8.598 5.518 -23.670 1.00 52.74 H
HETATM 2100 N NH2 B 37 -7.355 7.417 -29.288 1.00 58.31 N
TER 2101 NH2 B 37
HETATM 2102 ZN ZN B 101 0.000 0.000 -9.201 0.33 15.72 ZN
HETATM 2103 O AHOH A 201 -30.782 29.811 -17.433 0.50 20.93 O
HETATM 2104 O BHOH A 201 -30.377 31.224 -16.358 0.50 18.33 O
HETATM 2105 O HOH A 202 -10.750 28.703 -23.497 1.00 39.82 O
...
ATOM and HETATM interspersed within the same chain.
...
HETATM 2006 HEABXCP B 31 -6.322 12.783 -15.760 0.37 16.94 H
HETATM 2007 HA AXCP B 31 -6.311 10.105 -16.572 0.63 23.43 H
HETATM 2008 HA BXCP B 31 -5.758 10.612 -16.628 0.37 19.64 H
ATOM 2009 N AHIS B 32 -5.542 10.707 -18.873 0.63 18.98 N
ANISOU 2009 N AHIS B 32 1967 2279 2965 480 -109 265 N
ATOM 2010 N BHIS B 32 -5.238 10.930 -18.956 0.37 18.62 N
ANISOU 2010 N BHIS B 32 1887 2264 2926 494 -76 294 N
ATOM 2011 CA AHIS B 32 -4.988 11.199 -20.158 0.63 21.40 C
...
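To survey how often each layout occurs, something like the following could be run over a batch of PDB files. It is only a sketch: the function name `hetatm_placement` is made up for illustration, and it assumes properly fixed-column records (record name in columns 1-6, serial in columns 7-11, chain ID in column 22), so it will not work on whitespace-collapsed text such as the excerpts above.

```python
def hetatm_placement(lines):
    """Report where each HETATM record sits relative to its chain's TER.

    Returns (serial, chain, placement) tuples, where placement is
    "before_ter", "after_ter", or "no_ter" (the chain never gets a TER).
    Hypothetical helper; assumes fixed-column PDB records.
    """
    terminated = set()  # chain IDs for which a TER has been seen so far
    seen = []           # (serial, chain, was the chain already terminated?)
    for line in lines:
        record = line[:6].strip()
        if record not in ("HETATM", "TER"):
            continue
        chain = line[21:22]
        if record == "TER":
            terminated.add(chain)
        else:
            # Keep the serial as text: large files may not use plain integers.
            seen.append((line[6:11].strip(), chain, chain in terminated))

    result = []
    for serial, chain, after in seen:
        if after:
            placement = "after_ter"
        elif chain in terminated:
            placement = "before_ter"
        else:
            placement = "no_ter"
        result.append((serial, chain, placement))
    return result
```

For the first excerpt this would flag serial 2100 as before the chain B TER, 2102 as after it, and the chain A waters as having no TER at all.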
In my opinion the important thing isn't necessarily the PDB files themselves, but rather what LEaP etc. require in order to function. (In most cases someone will simply be loading a PDB as a starting point for parametrisation.) As such, I have been looking at how pdb4amber processes a range of files with various types of HETATM formatting. In some cases the HETATM records are converted to ATOM records, in others they are left in place, and sometimes they are even moved. ParmEd uses the approach of labelling everything in a non-standard residue (using template name matching) as a HETATM, but I'm not sure how it deals with those that are misplaced.
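The name-matching idea can be sketched roughly as below. This is not ParmEd's code: the set `STANDARD_RESIDUES` and the function `record_type` are illustrative assumptions, and the set only holds the 20 canonical amino acids, whereas a real template list would be larger.

```python
# Illustrative subset only: the 20 canonical amino acids.
# A real template list would also need nucleic acids, AMBER variants, etc.
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def record_type(resname, standard=STANDARD_RESIDUES):
    """Choose the record name from the residue name alone (hypothetical helper)."""
    return "ATOM" if resname.strip().upper() in standard else "HETATM"
```

With this subset, `ILE` and `HIS` map to ATOM, while `NH2`, `ZN`, `HOH` and `XCP` from the excerpts above map to HETATM. Note that this only decides the label; it says nothing about where the record should be written.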
Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some HETATM records need to be placed before the end of a chain (where the TER record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.
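A deliberately simplified illustration of why the position is lost (this is not our actual parser; `round_trip` is a hypothetical stand-in that groups records by chain ID and writes one bare TER per chain, assuming the chain ID is in column 22):

```python
def round_trip(lines):
    """Group ATOM/HETATM records by chain, then write each chain followed by one TER.

    Hypothetical stand-in for a read/write cycle through a chain-based
    data structure. Incoming TER records are discarded on read.
    """
    chains = {}  # chain ID -> records, in order of first appearance
    for line in lines:
        if line[:6].strip() in ("ATOM", "HETATM"):
            chains.setdefault(line[21:22], []).append(line)

    out = []
    for records in chains.values():
        out.extend(records)
        out.append("TER")  # simplified: a real TER also carries serial and residue
    return out
```

Given a chain B with one HETATM before its TER and one after, both come back before the single regenerated TER, because nothing in the grouped structure records which side of the TER each one was on.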
I'll try to determine some rules-of-thumb for the position of various HETATM records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the final TER. This would certainly be the easiest solution.
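As a starting point for testing that last idea, a rough sketch of moving everything after the final TER might look like this. The function name `move_hetatm_to_end` is hypothetical, the records are not renumbered, and TER records are left exactly as they were.

```python
def move_hetatm_to_end(lines):
    """Move every HETATM record to just after the final TER.

    An ANISOU line directly following a moved HETATM travels with it.
    If there is no TER record, the lines are returned unchanged.
    Hypothetical helper; serial numbers and TER contents are not updated.
    """
    kept, moved = [], []
    prev_moved = False
    for line in lines:
        record = line[:6].strip()
        if record == "HETATM" or (record == "ANISOU" and prev_moved):
            moved.append(line)
            prev_moved = True
        else:
            kept.append(line)
            prev_moved = False

    ter_indices = [i for i, l in enumerate(kept) if l[:6].strip() == "TER"]
    if not ter_indices:
        return list(lines)
    cut = ter_indices[-1] + 1  # position just after the final TER
    return kept[:cut] + moved + kept[cut:]
```

One thing to check when testing: in the first excerpt above, the NH2 record (serial 2100) appears to belong to chain B ahead of its TER, so a blanket move would separate it from the chain and leave the TER referring to a residue that is no longer adjacent. The interspersed XCP records in the second excerpt raise the same concern.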