Errors with MSA files #31

@chad-hyer

Hey,

I am having some issues with how ProteinNPT handles my MSA files. It took me a while to figure out the A2M format, and my way of generating the files is a little roundabout, but I believe they have the correct structure. I get a strange error saying that the sequences do not all have the same number of columns, even though I have confirmed in my source files that they all have the same length. I've attached the original FASTA I used to make the A2M, as well as the source A2M file I am working with. This error doesn't stop the program from running, but later ProteinNPT throws a FileNotFoundError for one of the hhfiltered_cov_75_maxid_90_minid_0.a2m files. I went to that location and the file exists, so I am a little confused about what is going on.

MSA_files.zip
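
A minimal sketch of the kind of column check described above, assuming the usual A2M convention in which uppercase residues and `-` are match columns while lowercase residues and `.` are insert columns. The helper names (`read_fasta`, `match_columns`, `column_report`) and the toy sequences are hypothetical and are not part of ProteinNPT or HH-suite. It reports both the match-column count and the raw length per record, since the two can disagree.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA/A2M-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def match_columns(seq):
    """Count A2M match-state columns: uppercase residues and '-' gaps."""
    return sum(1 for c in seq if c.isupper() or c == "-")


def column_report(lines):
    """Return {header: (match_columns, total_length)} for each record."""
    return {h: (match_columns(s), len(s)) for h, s in read_fasta(lines)}


# Toy alignment (made up for illustration, not taken from the attached files).
example = """>ref
ACDE-G
>seq2
ACdeDE-G
>seq3
AC..DE-G
""".splitlines()

report = column_report(example)
for header, (n_match, n_total) in report.items():
    print(header, n_match, n_total)

# Upper-casing a sequence turns its insert residues into match residues,
# so the match-column count of that sequence changes.
print(match_columns("ACdeDE-G"), match_columns("ACdeDE-G".upper()))
```

To run it on a real file, pass an open file handle instead of `example`, e.g. `column_report(open(path))`. Under the assumption above, records whose first number differs from the first record's would be the ones a match-column check flags, even if the raw lengths agree or the case of the residues has been changed along the way.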

The error I am getting is as follows:

/scratch/groups/mjewett/proteinnpt_env/lib/python3.12/site-packages/wandb/sdk/launch/builder/build.py:11: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
133+1 records in
133+1 records out
68430 bytes (68 kB) copied, 0.00222253 s, 30.8 MB/s
- 16:44:28.835 INFO: Input file = /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/preprocessed/CA_MSA_Final_UC.a2m

- 16:44:28.835 INFO: Output file = /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/hhfiltered/CA_MSA_Final_hhfiltered_cov_75_maxid_90_minid_0.a2m

- 16:44:28.838 ERROR: CARBONIC ANHYDRASE FAMILY PROTEIN [AQUIFICOTA BACTERIUM]
- 16:44:28.838 ERROR: Error in /big/martin/hh-suite/src/hhalignment.cpp:1244: Compress:

- 16:44:28.838 ERROR: 	sequences in /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/preprocessed/CA_MSA_Final_UC.a2m do not all have the same number of columns, 

- 16:44:28.838 ERROR: 	
e.g. first sequence and sequence GB|NPA13942-1|.

- 16:44:28.838 ERROR: Check input format for '-M a2m' option and consider using '-M first' or '-M 50'

Training ProteinNPT model on the CA_Alanine_Screen assay
####################################################################################################
 Step1: Computing sequence embeddings for the CA_Alanine_Screen assay 
####################################################################################################
Using embedding model: MSA_Transformer
MSA start or MSA end not provided -- Assuming the MSA is covering the full WT sequence
Assay: CA_Alanine_Screen.csv
Traceback (most recent call last):
  File "/scratch/groups/mjewett/ProteinNPT/pipeline.py", line 126, in <module>
    run_embeddings(
  File "/scratch/groups/mjewett/ProteinNPT/embeddings.py", line 182, in main
    MSA_sequences, MSA_weights = process_MSA(MSA_data_folder=MSA_data_folder, MSA_weight_data_folder=MSA_weight_data_folder, MSA_filename=MSA_filename, MSA_weights_filename=MSA_weights_filename, path_to_hhfilter=path_to_hhfilter)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 518, in process_MSA
    MSA_all_sequences, MSA_non_ref_sequences_weights = compute_sequence_weights(MSA_filename=filtered_MSA_filename, MSA_weights_filename=os.path.join(MSA_weight_data_folder, "hhfiltered", MSA_weights_filename))
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 419, in compute_sequence_weights
    processed_MSA = MSA_processing(
                    ^^^^^^^^^^^^^^^
  File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 105, in __init__
    self.gen_alignment()
  File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 128, in gen_alignment
    with open(self.MSA_location, "r") as msa_data:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/hhfiltered/CA_MSA_Final_hhfiltered_cov_75_maxid_90_minid_0.a2m'
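
One way to check what Python itself sees at the path from the traceback is sketched below. This is only an illustration: `describe_path` is a hypothetical helper, not a ProteinNPT function, and the temporary file stands in for the real hhfiltered output. Printing the `repr` of the path can expose stray whitespace or hidden characters, and the size shows whether a file exists but is empty.

```python
import os
import tempfile


def describe_path(path):
    """Return a small dict describing what Python sees at `path`."""
    parent = os.path.dirname(path) or "."
    is_file = os.path.isfile(path)
    return {
        "repr": repr(path),  # exposes stray whitespace or hidden characters
        "exists": os.path.exists(path),
        "is_file": is_file,
        "size_bytes": os.path.getsize(path) if is_file else None,
        "parent_exists": os.path.isdir(parent),
    }


# Demonstration on a temporary directory (stand-in for the hhfiltered folder).
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "example_hhfiltered.a2m")
    with open(present, "w") as handle:
        handle.write(">ref\nACDE-G\n")
    missing = os.path.join(tmp, "missing.a2m")

    info_present = describe_path(present)
    info_missing = describe_path(missing)
    print(info_present)
    print(info_missing)
```

Running `describe_path` on the exact string from the `FileNotFoundError`, from the same environment and node that runs `pipeline.py`, would show whether the file is visible at the moment the script tries to open it.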
