How to handle HETATM records in PDB files #409

@lohedges

From debugging this BioSimSpace issue it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately, the formatting of these records seems to be quite variable between PDBs, making it hard to develop a single strategy for dealing with them.

For example (copied from the above issue thread), in this file there are HETATMs with the same chain identifier before and after the TER. Some examples of the different formatting:

HETATM in chain B before and after TER, followed by HETATMs from chain A.

...
ATOM   2097 HD11 ILE B  36      -7.894   6.751 -22.957  1.00 52.74           H
ATOM   2098 HD12 ILE B  36      -8.945   7.001 -24.122  1.00 52.74           H
ATOM   2099 HD13 ILE B  36      -8.598   5.518 -23.670  1.00 52.74           H
HETATM 2100  N   NH2 B  37      -7.355   7.417 -29.288  1.00 58.31           N
TER    2101      NH2 B  37
HETATM 2102 ZN    ZN B 101       0.000   0.000  -9.201  0.33 15.72          ZN
HETATM 2103  O  AHOH A 201     -30.782  29.811 -17.433  0.50 20.93           O
HETATM 2104  O  BHOH A 201     -30.377  31.224 -16.358  0.50 18.33           O
HETATM 2105  O   HOH A 202     -10.750  28.703 -23.497  1.00 39.82           O
...

ATOM and HETATM interspersed within the same chain.

...
HETATM 2006  HEABXCP B  31      -6.322  12.783 -15.760  0.37 16.94           H
HETATM 2007  HA AXCP B  31      -6.311  10.105 -16.572  0.63 23.43           H
HETATM 2008  HA BXCP B  31      -5.758  10.612 -16.628  0.37 19.64           H
ATOM   2009  N  AHIS B  32      -5.542  10.707 -18.873  0.63 18.98           N
ANISOU 2009  N  AHIS B  32     1967   2279   2965    480   -109    265       N
ATOM   2010  N  BHIS B  32      -5.238  10.930 -18.956  0.37 18.62           N
ANISOU 2010  N  BHIS B  32     1887   2264   2926    494    -76    294       N
ATOM   2011  CA AHIS B  32      -4.988  11.199 -20.158  0.63 21.40           C
...
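To make the two layouts above easier to spot across many files, here is a minimal Python sketch that reads the fixed-width columns of ATOM, HETATM and TER records and reports, for each HETATM, whether a TER for the same chain has already been seen. The names `Record`, `parse_records` and `hetatm_ter_status` are hypothetical helpers for illustration only; they are not part of Sire, BioSimSpace or any other library, and the sketch assumes well-formed columns with decimal serial numbers.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    kind: str               # "ATOM", "HETATM" or "TER"
    serial: Optional[int]   # None when the field is blank (e.g. a bare "TER")
    resname: str
    chain: str


def _to_int(field: str) -> Optional[int]:
    field = field.strip()
    return int(field) if field else None


def parse_records(lines: List[str]) -> List[Record]:
    """Extract record name, serial, residue name and chain ID by column."""
    records = []
    for line in lines:
        kind = line[0:6].strip()
        if kind not in ("ATOM", "HETATM", "TER"):
            continue  # skip ANISOU, REMARK, "..." and anything else
        records.append(
            Record(kind, _to_int(line[6:11]), line[17:20].strip(), line[21:22])
        )
    return records


def hetatm_ter_status(records: List[Record]) -> List[Tuple[Optional[int], str, str, bool]]:
    """For each HETATM, record whether a TER for its chain came earlier."""
    terminated = set()
    status = []
    for rec in records:
        if rec.kind == "TER":
            terminated.add(rec.chain)
        elif rec.kind == "HETATM":
            status.append((rec.serial, rec.resname, rec.chain, rec.chain in terminated))
    return status


# Lines taken from the first example above.
sample = """\
ATOM   2099 HD13 ILE B  36      -8.598   5.518 -23.670  1.00 52.74           H
HETATM 2100  N   NH2 B  37      -7.355   7.417 -29.288  1.00 58.31           N
TER    2101      NH2 B  37
HETATM 2102 ZN    ZN B 101       0.000   0.000  -9.201  0.33 15.72          ZN
HETATM 2103  O  AHOH A 201     -30.782  29.811 -17.433  0.50 20.93           O
""".splitlines()

for row in hetatm_ter_status(parse_records(sample)):
    print(row)
```

Note that the flag only reflects the lines passed in: the chain A waters are reported as not yet terminated here because the TER for chain A is not part of the excerpt.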

In my opinion the important thing isn't necessarily the PDB files themselves, but rather what LEaP etc. require in order to function. (In most cases someone will simply be loading a PDB as a starting point for parametrisation.) As such, I have been looking at how pdb4amber processes a bunch of files that include various types of HETATM formatting. In some cases these records are converted to ATOM records, in others they are left in place, and sometimes they are even moved. ParmEd uses the approach of labelling everything in a non-standard residue (using template name matching) as a HETATM, but I'm not sure how it deals with those that are misplaced.
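As a rough illustration of labelling by residue name, the sketch below rewrites the record name of a coordinate line depending on whether its residue name appears in a set of known templates. This is not ParmEd's implementation: `STANDARD_RESIDUES`, `record_type` and `relabel` are hypothetical names, and the template set is assumed to be just the 20 standard amino acids, so Amber-specific names such as HIE or CYX would be treated as non-standard here.

```python
# Assumption: a deliberately small template list (20 standard amino acids).
STANDARD_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})


def record_type(resname: str) -> str:
    """Return "ATOM" for a templated residue name, otherwise "HETATM"."""
    return "ATOM" if resname.strip().upper() in STANDARD_RESIDUES else "HETATM"


def relabel(line: str) -> str:
    """Rewrite columns 1-6 of an ATOM/HETATM line; leave other lines alone."""
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        return line
    # Residue name lives in columns 18-20; the record name is padded to 6.
    return f"{record_type(line[17:20]):<6}" + line[6:]


print(relabel("HETATM 2009  N  AHIS B  32"))
print(relabel("ATOM   2102 ZN    ZN B 101"))
```

Relabelling only changes the record name. It does nothing about records that sit on the wrong side of a TER, which is the open question above.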

Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some HETATM records need to be placed before the end of a chain (where the TER record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.

I'll try to determine some rules-of-thumb for the position of various HETATM records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the final TER. This would certainly be the easiest solution.
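The "move everything to the end" idea can be tried on raw PDB text before touching the internal data structure. The sketch below is one possible way to do that, under several assumptions: `move_hetatm_to_end` is a hypothetical helper, the file is single-model, serial numbers are not renumbered, and existing TER records are left untouched even when they refer to a moved HETATM residue. Whether LEaP accepts the result is exactly what would need testing.

```python
from typing import List


def move_hetatm_to_end(lines: List[str]) -> List[str]:
    """Move HETATM lines (and their ANISOU lines) after the last ATOM/TER."""
    kept, moved = [], []
    prev_was_het = False
    for line in lines:
        kind = line[0:6].strip()
        if kind == "HETATM":
            moved.append(line)
            prev_was_het = True
        elif kind == "ANISOU" and prev_was_het:
            moved.append(line)  # ANISOU belongs to the preceding HETATM
        else:
            kept.append(line)
            if kind != "ANISOU":
                prev_was_het = False

    # Insert after the last remaining coordinate or TER record.
    insert_at = None
    for i, line in enumerate(kept):
        if line[0:6].strip() in ("ATOM", "ANISOU", "TER"):
            insert_at = i + 1
    if insert_at is None:
        # No ATOM/TER at all: place before any trailing bookkeeping records.
        insert_at = next(
            (i for i, l in enumerate(kept)
             if l[0:6].strip() in ("CONECT", "MASTER", "END")),
            len(kept),
        )
    return kept[:insert_at] + moved + kept[insert_at:]


# Lines from the second example above; the END line is added for illustration.
sample = [
    "HETATM 2007  HA AXCP B  31      -6.311  10.105 -16.572  0.63 23.43           H",
    "HETATM 2008  HA BXCP B  31      -5.758  10.612 -16.628  0.37 19.64           H",
    "ATOM   2009  N  AHIS B  32      -5.542  10.707 -18.873  0.63 18.98           N",
    "ANISOU 2009  N  AHIS B  32     1967   2279   2965    480   -109    265       N",
    "ATOM   2010  N  BHIS B  32      -5.238  10.930 -18.956  0.37 18.62           N",
    "END",
]

for line in move_hetatm_to_end(sample):
    print(line[:26])
```

Working on the text lines keeps the original relative order of the moved records, which makes it easy to diff the output against what pdb4amber produces for the same input.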
