I am having some issues with the way ProteinNPT is handling my MSA files. It took me a bit to figure out A2M files to start with, and my way of getting them is a little roundabout, but I believe my files have the correct structure. hhfilter reports that the sequences do not all have the same number of columns, even though I have confirmed in my source files that they all have the same length. I've attached the original FASTA I used to make the A2M, as well as the source A2M file I am working with. This error doesn't seem to stop the program from running, but later ProteinNPT throws a FileNotFoundError for one of the hhfiltered_cov_75_maxid_90_minid_0.a2m files. I went to the specific location, and the file exists, so I am a little confused about what is going on.
The error I am getting is as follows:
/scratch/groups/mjewett/proteinnpt_env/lib/python3.12/site-packages/wandb/sdk/launch/builder/build.py:11: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
133+1 records in
133+1 records out
68430 bytes (68 kB) copied, 0.00222253 s, 30.8 MB/s
- 16:44:28.835 INFO: Input file = /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/preprocessed/CA_MSA_Final_UC.a2m
- 16:44:28.835 INFO: Output file = /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/hhfiltered/CA_MSA_Final_hhfiltered_cov_75_maxid_90_minid_0.a2m
- 16:44:28.838 ERROR: CARBONIC ANHYDRASE FAMILY PROTEIN [AQUIFICOTA BACTERIUM]
- 16:44:28.838 ERROR: Error in /big/martin/hh-suite/src/hhalignment.cpp:1244: Compress:
- 16:44:28.838 ERROR: sequences in /scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/preprocessed/CA_MSA_Final_UC.a2m do not all have the same number of columns,
- 16:44:28.838 ERROR:
e.g. first sequence and sequence GB|NPA13942-1|.
- 16:44:28.838 ERROR: Check input format for '-M a2m' option and consider using '-M first' or '-M 50'
Training ProteinNPT model on the CA_Alanine_Screen assay
####################################################################################################
Step1: Computing sequence embeddings for the CA_Alanine_Screen assay
####################################################################################################
Using embedding model: MSA_Transformer
MSA start or MSA end not provided -- Assuming the MSA is covering the full WT sequence
Assay: CA_Alanine_Screen.csv
Traceback (most recent call last):
File "/scratch/groups/mjewett/ProteinNPT/pipeline.py", line 126, in <module>
run_embeddings(
File "/scratch/groups/mjewett/ProteinNPT/embeddings.py", line 182, in main
MSA_sequences, MSA_weights = process_MSA(MSA_data_folder=MSA_data_folder, MSA_weight_data_folder=MSA_weight_data_folder, MSA_filename=MSA_filename, MSA_weights_filename=MSA_weights_filename, path_to_hhfilter=path_to_hhfilter)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 518, in process_MSA
MSA_all_sequences, MSA_non_ref_sequences_weights = compute_sequence_weights(MSA_filename=filtered_MSA_filename, MSA_weights_filename=os.path.join(MSA_weight_data_folder, "hhfiltered", MSA_weights_filename))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 419, in compute_sequence_weights
processed_MSA = MSA_processing(
^^^^^^^^^^^^^^^
File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 105, in __init__
self.gen_alignment()
File "/scratch/groups/mjewett/ProteinNPT/proteinnpt/utils/msa_utils.py", line 128, in gen_alignment
with open(self.MSA_location, "r") as msa_data:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/groups/mjewett/Chad_Hyer_Tiling_Experiment/hhfiltered/CA_MSA_Final_hhfiltered_cov_75_maxid_90_minid_0.a2m'
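One way to narrow down the column-count error is to count A2M match columns per sequence instead of raw sequence length. In the A2M convention, uppercase letters and `-` are match columns, while lowercase letters and `.` are insert states, so two sequences can have the same total length but a different number of match columns. The sketch below is only a diagnostic under that assumption; the function names are hypothetical and are not part of ProteinNPT or hh-suite.

```python
def read_fasta_like(path):
    """Yield (header, sequence) pairs from a FASTA/A2M file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def match_columns(sequence):
    """Count A2M match columns: uppercase letters and '-'.

    Lowercase letters and '.' are treated as insert states and skipped.
    """
    return sum(1 for ch in sequence if ch.isupper() or ch == "-")


def report_column_mismatches(path):
    """Compare every sequence against the first one in the file.

    Returns (reference_count, mismatches), where mismatches is a list of
    (header, match_column_count) for sequences that differ from the first.
    """
    records = list(read_fasta_like(path))
    if not records:
        return 0, []
    reference = match_columns(records[0][1])
    mismatches = []
    for header, sequence in records[1:]:
        count = match_columns(sequence)
        if count != reference:
            mismatches.append((header, count))
    return reference, mismatches
```

Running `report_column_mismatches` on both the source A2M and the `CA_MSA_Final_UC.a2m` file from the log may be informative. If the `_UC` file is an uppercased copy (which the file name and the `records in`/`records out` lines suggest, though that is a guess), lowercase insert residues would be counted as match columns after uppercasing, and that alone could make the counts differ between sequences.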