Hi there,
Thanks for the resources!
I noticed there seem to be some differences in how these files were annotated:
'https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_substitutions.zip'
'https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_indels.zip'
import pandas as pd
import pooch

# Each entry maps a key to (URL, SHA256 hash); a hash of None skips verification.
DATA_PATHS = {
    # Clinical
    ## Processed
    'clinical_ProteinGym_substitutions': ('https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_substitutions.zip',
                                          None),
    'clinical_ProteinGym_indels': ('https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_indels.zip',
                                   None),
    ## Raw
    'substitutions_raw_clinical': ('https://marks.hms.harvard.edu/proteingym/substitutions_raw_clinical.zip',
                                   'caa461bd2e0c58501131e7c1ad9d26c118c67704efe1b67c7ff7ca1d72ae7275'),
    'indels_raw_clinical': ('https://marks.hms.harvard.edu/proteingym/indels_raw_clinical.zip',
                            'f9eb7232657ab5732eda8dcb922bf17b228eae212ca794e753ba73a017f40a8d'),
    # DMS
    ## Processed
    'DMS_ProteinGym_substitutions': ('https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_substitutions.zip',
                                     None),
    'DMS_ProteinGym_indels': ('https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_indels.zip',
                              None),
    ## Raw
    'substitutions_raw_DMS': ('https://marks.hms.harvard.edu/proteingym/substitutions_raw_DMS.zip',
                              None),
    'indels_raw_DMS': ('https://marks.hms.harvard.edu/proteingym/indels_raw_DMS.zip',
                       None),
}


def download_data(file_keys=DATA_PATHS.keys()):
    """Download (and unzip, if needed) each requested file; return {key: path(s)}."""
    file_dict = {}
    for fk in file_keys:
        url, known_hash = DATA_PATHS[fk]
        # Zip archives are extracted, so pooch returns a list of extracted file paths.
        if url.endswith('.zip'):
            processor = pooch.Unzip()
        else:
            processor = None
        file_dict[fk] = pooch.retrieve(url,
                                       known_hash=known_hash,
                                       progressbar=True,
                                       processor=processor)
    return file_dict


pd.set_option('display.max_columns', None)
pg_data = download_data(file_keys=['substitutions_raw_clinical', 'indels_raw_clinical'])

# Substitutions: read the first file extracted from the archive.
subs = pd.read_csv(pg_data['substitutions_raw_clinical'][0])
print(subs.shape)
print(subs['Gene'].nunique(), "genes")
print(subs.columns)
subs.head()

# Indels: same inspection, for comparison.
indels = pd.read_csv(pg_data['indels_raw_clinical'][0])
print(indels.shape)
print(indels['Gene'].nunique(), "genes")
print(indels.columns)
indels.head()


The substitutions file uses RefSeq IDs (e.g. NM_152486.4), while the indels file uses Ensembl IDs (e.g. ENST00000263574.5).
This adds an extra step of mapping between ID types.
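To make the mismatch concrete, here is a minimal sketch that tells the two ID styles apart by their prefix. The helper names `id_namespace` and `strip_version` are hypothetical, and the patterns only assume the usual accession shapes (RefSeq: two letters, underscore, digits; human Ensembl transcripts: `ENST` plus 11 digits). It does not do the actual RefSeq-to-Ensembl mapping, which needs an external lookup table.

```python
import re

# Assumed accession shapes, each with an optional ".<version>" suffix.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")      # e.g. NM_152486.4
ENSEMBL_TX_RE = re.compile(r"^ENST\d{11}(\.\d+)?$")    # e.g. ENST00000263574.5


def id_namespace(accession: str) -> str:
    """Return 'refseq', 'ensembl', or 'unknown' for a transcript accession."""
    accession = accession.strip()
    if REFSEQ_RE.match(accession):
        return "refseq"
    if ENSEMBL_TX_RE.match(accession):
        return "ensembl"
    return "unknown"


def strip_version(accession: str) -> str:
    """Drop the version suffix, e.g. 'NM_152486.4' becomes 'NM_152486'."""
    return accession.strip().split(".", 1)[0]
```

Applying `id_namespace` to the ID column of each file would show whether a file mixes namespaces or uses a single one throughout.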
Columns affected by this include:
- Gene
- HGVSc
- HGVSp (missing from indels file)
- Symbol (missing from substitutions file)
- protein (missing from indels file)
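For anyone who wants to reproduce the column comparison, a small sketch follows. The function name `column_diff` is hypothetical; it accepts any two iterables of column names, so it should also work with `subs.columns` and `indels.columns` from the snippet above.

```python
def column_diff(left_cols, right_cols):
    """Report which column names are unique to each side and which are shared."""
    left, right = set(left_cols), set(right_cols)
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        "shared": sorted(left & right),
    }


# Illustrative call using only the column names mentioned in this issue,
# not the full headers of the real files.
diff = column_diff(
    ["Gene", "HGVSc", "HGVSp", "protein"],  # substitutions
    ["Gene", "HGVSc", "Symbol"],            # indels
)
print(diff)
```

With the real files, `column_diff(subs.columns, indels.columns)` gives the full list of columns that differ between the two files.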
If possible, could you update the files to ensure more consistent annotation? I find the Ensembl IDs especially useful for mapping onto other resources and for comparing results within the same transcripts between substitutions and indels.
Thanks,
Brian