How to handle HETATM records in PDB files

From debugging [this](https://github.com/michellab/BioSimSpace/issues/369) BioSimSpace issue, it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately, the formatting of these records seems to vary considerably between PDB files, making it hard to develop a single strategy for dealing with them. The following is copied from the above issue thread:

For example, in [this](https://www.rcsb.org/structure/7UZO) PDB entry there are `HETATM`s with the same chain identifier before _and_ after the `TER`. Some examples of the different formatting:

`HETATM` in chain `B` before _and_ after `TER`, followed by `HETATM`s from chain `A`.
```
...
ATOM   2097 HD11 ILE B  36      -7.894   6.751 -22.957  1.00 52.74           H
ATOM   2098 HD12 ILE B  36      -8.945   7.001 -24.122  1.00 52.74           H
ATOM   2099 HD13 ILE B  36      -8.598   5.518 -23.670  1.00 52.74           H
HETATM 2100  N   NH2 B  37      -7.355   7.417 -29.288  1.00 58.31           N
TER    2101      NH2 B  37
HETATM 2102 ZN    ZN B 101       0.000   0.000  -9.201  0.33 15.72          ZN
HETATM 2103  O  AHOH A 201     -30.782  29.811 -17.433  0.50 20.93           O
HETATM 2104  O  BHOH A 201     -30.377  31.224 -16.358  0.50 18.33           O
HETATM 2105  O   HOH A 202     -10.750  28.703 -23.497  1.00 39.82           O
...
```
`ATOM` and `HETATM` interspersed within the same chain.
```
...
HETATM 2006  HEABXCP B  31      -6.322  12.783 -15.760  0.37 16.94           H
HETATM 2007  HA AXCP B  31      -6.311  10.105 -16.572  0.63 23.43           H
HETATM 2008  HA BXCP B  31      -5.758  10.612 -16.628  0.37 19.64           H
ATOM   2009  N  AHIS B  32      -5.542  10.707 -18.873  0.63 18.98           N
ANISOU 2009  N  AHIS B  32     1967   2279   2965    480   -109    265       N
ATOM   2010  N  BHIS B  32      -5.238  10.930 -18.956  0.37 18.62           N
ANISOU 2010  N  BHIS B  32     1887   2264   2926    494    -76    294       N
ATOM   2011  CA AHIS B  32      -4.988  11.199 -20.158  0.63 21.40           C
...
```
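To survey how often these patterns occur, a small script can report which `HETATM` records appear before and after the `TER` of their own chain. The following is a minimal sketch, not BioSimSpace code: the function name `hetatm_positions` is made up for illustration, and it assumes well-formed fixed-column records (serial in columns 7-11, residue name in 18-20, chain identifier in 22).

```python
from collections import defaultdict


def hetatm_positions(pdb_lines):
    """Group HETATM records by chain and by position relative to that chain's TER.

    Returns {chain_id: {"before_ter": [...], "after_ter": [...]}}, where each
    entry is a (serial, residue_name) tuple.
    """
    terminated = set()  # chains for which a TER record has been seen
    last_chain = ""     # chain of the most recent ATOM/HETATM record
    report = defaultdict(lambda: {"before_ter": [], "after_ter": []})

    for line in pdb_lines:
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            last_chain = line[21]
        if record == "TER":
            # Some files omit the chain ID on TER; fall back to the last chain seen.
            terminated.add(line[21:22].strip() or last_chain)
        elif record == "HETATM":
            key = "after_ter" if last_chain in terminated else "before_ter"
            report[last_chain][key].append((int(line[6:11]), line[17:20].strip()))

    return dict(report)
```

Note that a record is only counted as "after" if the `TER` for its chain appears in the lines passed in, so the function should be run on whole files rather than on excerpts.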

In my opinion the important thing isn't necessarily the PDB files themselves, but rather what tools such as `LEaP` require in order to function. (In most cases someone will simply be loading a PDB as a starting point for parametrisation.) As such, I have been looking at how `pdb4amber` processes a range of files containing various types of `HETATM` formatting. In some cases these records are converted to `ATOM` records, in others they are left in place, and sometimes they are even moved. ParmEd takes the approach of labelling everything in a non-standard residue (using template name matching) as a `HETATM`, but I'm not sure how it deals with records that are misplaced.
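The name-matching idea can be illustrated with a short sketch. This is not ParmEd's implementation: the function `relabel_record` and the `STANDARD_RESIDUES` set are assumptions made for illustration, and the set only covers the 20 canonical amino acids (no nucleic acids or Amber-specific residue names).

```python
# Assumed template set: the 20 canonical amino acids only.
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def relabel_record(line):
    """Label a coordinate record ATOM or HETATM based on its residue name alone."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        return line  # leave TER, ANISOU, etc. untouched
    residue_name = line[17:20].strip()
    record = "ATOM  " if residue_name in STANDARD_RESIDUES else "HETATM"
    return record + line[6:]
```

This only changes the record label. It does nothing about where a record sits relative to `TER`, which is the open question here.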

Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some `HETATM` records need to be placed before the end of a chain (where the `TER` record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.
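The loss of ordering can be reproduced with a toy model of this round trip. The sketch below is a deliberate simplification and not the actual BioSimSpace/Sire data structure: the hypothetical `round_trip` function groups coordinate records by chain identifier only, then writes each chain back followed by a bare `TER`.

```python
def round_trip(pdb_lines):
    """Toy round trip: regroup records by chain, then emit each chain plus TER."""
    chains = {}  # dicts preserve insertion order, so chains keep first-seen order
    for line in pdb_lines:
        if line[0:6].strip() in ("ATOM", "HETATM"):
            chains.setdefault(line[21], []).append(line)

    output = []
    for records in chains.values():
        output.extend(records)
        output.append("TER")  # simplified: real TER records carry serial/residue info
    return output
```

In this model, a `HETATM` that followed the `TER` of its chain in the input is written before the `TER` in the output, because the original position of `TER` within the chain is never stored.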

I'll try to determine some rules-of-thumb for the position of various `HETATM` records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the final `TER`. This would certainly be the easiest solution.
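As a starting point for testing the last idea, here is a minimal sketch of moving every `HETATM` record to just after the final `TER`. The function name `move_hetatm_after_final_ter` is hypothetical. The sketch does not renumber atom serials or update the residue referenced by existing `TER` records, and whether `LEaP` accepts the result is exactly what would need testing.

```python
def move_hetatm_after_final_ter(pdb_lines):
    """Return a new list of lines with all HETATM records placed after the last TER.

    ANISOU records that immediately follow a HETATM are moved with it.
    If the input has no TER record, the lines are returned unchanged.
    """
    kept, moved = [], []
    last_was_hetatm = False

    for line in pdb_lines:
        record = line[0:6].strip()
        if record == "HETATM":
            moved.append(line)
            last_was_hetatm = True
        elif record == "ANISOU" and last_was_hetatm:
            moved.append(line)  # keep anisotropic data next to its atom
        else:
            kept.append(line)
            last_was_hetatm = False

    ter_indices = [i for i, line in enumerate(kept) if line[0:6].strip() == "TER"]
    if not ter_indices:
        return list(pdb_lines)

    cut = ter_indices[-1] + 1
    return kept[:cut] + moved + kept[cut:]
```

The relative order of the moved records is preserved, so records from different chains remain distinguishable by their chain identifier.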
