Protein file preparation seems to fail if adding hydrogen atoms

Hello,

I have tried to run ```./ai2bmd ./chig.pdb``` to make sure your model gets setup properly and I manage to simulate the MD trajectory given the reference protein downloaded from
https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig.pdb

Now I want to run simulations on predicted monomer structures, for example I choose a rather short one (57 residues) from the AF2DB
https://alphafold.ebi.ac.uk/entry/H2QDC4

Following your readme and observations on how your example pdb files look like, I drafted the following script to prepare AF2 predictions for use in ai2bmd.

```
import os
from pymol import cmd
import sys

def prepare_pdb(input_pdb, output_prefix, n_residues, n_atoms, h_add=1):
    # TODO: below are modifications applied to the PDB file according to the repo's README

    # prepare protein with PyMol

    cmd.load(input_pdb, "molecule")

    if h_add:
        output_prefix += "withH__"
        cmd.h_add("molecule")  # add hydrogens
        cmd.save(f"{output_prefix}h_add.pdb")
    else:
        output_prefix += "withoutH__"

    cmd.wizard("mutagenesis")

    cmd.get_wizard().set_n_cap("acet")  # add n-terminal cap
    selection = "/%s//%s/%s" % ("molecule", "A", 1)  # selection of N-term
    cmd.get_wizard().do_select(selection)
    cmd.get_wizard().apply()
    cmd.save(f"{output_prefix}n_cap.pdb")

    cmd.get_wizard().set_c_cap("nmet")  # add c-terminal cap
    selection = "/%s//%s/%s" % ("molecule", "A", n_residues)  # selection of C-term
    cmd.get_wizard().do_select(selection)
    cmd.get_wizard().apply()
    cmd.save(f"{output_prefix}pymol_prep.pdb")

    # adjust atom names with Amber

    os.system(f"pdb4amber -i {output_prefix}pymol_prep.pdb -o {output_prefix}amber_prep.pdb")

    # ensure no TER separator
    # (and residue numbering is from 1 without gaps since we start from an AF2 predicted structure..)

    pdb_lines = []
    with open(f"{output_prefix}amber_prep.pdb", 'r') as f:
        for line in f:
            if not line.startswith("TER"):
                pdb_lines.append(line.strip())
            else:
                print(f"cleaning up TER line\n{line}")

    with open(f"{output_prefix}noTER_prep.pdb", 'w') as f:
        for line in pdb_lines:
            f.write(f"{line}\n")

    # TODO: below are additional modifications to the PDB file to be most similar to the example files..

    # make sure all records are ATOM and not HETATM as in example files

    pdb_lines = ["ATOM  "+l[6:] if l.startswith("HETATM") else l for l in pdb_lines]

    with open(f"{output_prefix}noHETATM_prep.pdb", 'w') as f:
        for line in pdb_lines:
            f.write(f"{line}\n")

    # in the example files there is no chain identifier

    pdb_lines = [l.replace(" A  ", "    ") if l.startswith("ATOM") else l for l in pdb_lines]

    with open(f"{output_prefix}noCHAIN_prep.pdb", 'w') as f:
        for line in pdb_lines:
            f.write(f"{line}\n")

    # in chig.pdb the header format is different (8 columns instead of 9 in our pdb file at this stage)
    # replace with chig's header (or delete as in other example pdb files?)

    new_header = "CRYST1   25.939   27.327   23.669  90.00  90.00  90.00               1"
    print(f"\nreplacing header\n{pdb_lines[0]}\nwith header from reference CHIG file\n{new_header}")
    pdb_lines[0] = new_header

    with open(f"{output_prefix}chig_header.pdb", 'w') as f:
        for line in pdb_lines:
            f.write(f"{line}\n")

    # in the example files there is "TER" between last "ATOM" and "END"

    pdb_lines = (pdb_lines[:-1] +
                 [f"TER     {n_atoms+1}      NME    {n_residues+2} "] +
                 [pdb_lines[-1]])

    with open(f"{output_prefix}modified_TER.pdb", 'w') as f:
        for line in pdb_lines:
            f.write(f"{line}\n")


if __name__ == '__main__':
    path_to_data = "/path_to_input_pdb_files"

    # example structure for trying out https://alphafold.ebi.ac.uk/entry/H2QDC4
    input_pdb = f"{path_to_data}/AF-H2QDC4-F1-model_v4.pdb"
    output_prefix = f"{path_to_data}/AF-H2QDC4__"
    n_residues = 57  # on purpose we take a short protein to try setting up AI2BMD

    h_add = int(sys.argv[1])
    if h_add:
        n_atoms = 842
    else:
        n_atoms = 418

    prepare_pdb(input_pdb, output_prefix, n_residues, n_atoms, h_add=h_add)

    # python input_prep_cleaned.py 1 > input_prep_withH.txt
    # ./ai2bmd --prot-file AF-H2QDC4__withH__modified_TER.pdb > ai2bmd_withH.txt
    # FIXME: H atoms are causing errors except for last residue C-cap (<NME 59>)
    #     FATAL:  Atom .R<ACE 1>.A<HH31 7> does not have a type.
    #     FATAL:  Atom .R<ACE 1>.A<HH32 8> does not have a type.
    #     FATAL:  Atom .R<ACE 1>.A<HH33 9> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<2HB 18> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<3HB 19> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<2HG 20> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<3HG 21> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<1HE 22> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<2HE 23> does not have a type.
    #     FATAL:  Atom .R<MET 2>.A<3HE 24> does not have a type.
    #     ...
    #     FATAL:  Atom .R<ILE 57>.A<H01 20> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H02 21> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H03 22> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H04 23> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H05 24> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H06 25> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H07 26> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H08 27> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H09 28> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H10 29> does not have a type.
    #     FATAL:  Atom .R<ILE 57>.A<H13 30> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<2HB 15> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<3HB 16> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<2HG 17> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<3HG 18> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<2HD 19> does not have a type.
    #     FATAL:  Atom .R<PRO 58>.A<3HD 20> does not have a type.
    #     /opt/conda/envs/ambertools/bin/teLeap: Fatal Error!
    #     Failed to generate parameters
    #     Exiting LEaP: Errors = 1; Warnings = 2; Notes = 0.

    # python input_prep_cleaned.py 0 > input_prep_withoutH.txt
    # ./ai2bmd --prot-file AF-H2QDC4__withoutH__modified_TER.pdb > ai2bmd_withoutH.txt
```

If I run the preparation with applying ```h_add``` then ```./ai2bmd``` crashes almost instantly and no additional files are created along the input pdb file (AF-H2QDC4__withH__modified_TER.pdb). There seems to be an issue on how H are formatted in all residues besides the last C-cap residue. (see prepared pdb files in the next comment)

If I run preparation but skip ```h_add``` then ```./ai2bmd``` is still running and has already generated additional files:
- AF-H2QDC4__withoutH__modified_TER.pdb
- AF-H2QDC4__withoutH__modified_TER1.inpcrd
- AF-H2QDC4__withoutH__modified_TER.inpcrd
- AF-H2QDC4__withoutH__modified_TER.top
- AF-H2QDC4__withoutH__modified_TER1.top


I am waiting to see whether the simulation will complete with input preparation without adding hydrogens.
Is it fine to run without? Or can anyone share insights on what I should fix in the preparation to successfully run simulation while adding hydrogen atoms?


Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Protein file preparation seems to fail if adding hydrogen atoms #66

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Protein file preparation seems to fail if adding hydrogen atoms #66

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions