Inconsistencies across raw sub vs. indel files #61

Hi there, 

Thanks for the resources!

I noticed there seem to be some differences in how these files were annotated:

'https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_substitutions.zip'
'https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_indels.zip'


```python
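import pandas as pd
import pooch

# Maps each dataset key to (URL, SHA256 hash or None to skip verification)
DATA_PATHS = {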
    # Clinical
    ## Processed
    'clinical_ProteinGym_substitutions':('https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_substitutions.zip',
                                          None),
    'clinical_ProteinGym_indels':('https://marks.hms.harvard.edu/proteingym/clinical_ProteinGym_indels.zip',
                                  None),
    ## Raw
    'substitutions_raw_clinical':('https://marks.hms.harvard.edu/proteingym/substitutions_raw_clinical.zip',
                                  'caa461bd2e0c58501131e7c1ad9d26c118c67704efe1b67c7ff7ca1d72ae7275'), 
    'indels_raw_clinical':('https://marks.hms.harvard.edu/proteingym/indels_raw_clinical.zip',
                           'f9eb7232657ab5732eda8dcb922bf17b228eae212ca794e753ba73a017f40a8d'),
    # DMS
    ## Processed
    'DMS_ProteinGym_substitutions':('https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_substitutions.zip',
                                    None),
    'DMS_ProteinGym_indels':('https://marks.hms.harvard.edu/proteingym/DMS_ProteinGym_indels.zip',
                             None),
    ## Raw
    'substitutions_raw_DMS':('https://marks.hms.harvard.edu/proteingym/substitutions_raw_DMS.zip',
                             None),
    'indels_raw_DMS':('https://marks.hms.harvard.edu/proteingym/indels_raw_DMS.zip',
                      None),
}


def download_data(file_keys=DATA_PATHS.keys()):
    """Download each requested file and return {key: result of pooch.retrieve}."""
    file_dict = {}
    for fk in file_keys:
        url, known_hash = DATA_PATHS[fk]
        # Unzip archives; pooch then returns a list of extracted file paths
        processor = pooch.Unzip() if url.endswith('.zip') else None
        file_dict[fk] = pooch.retrieve(url,
                                       known_hash=known_hash,
                                       progressbar=True,
                                       processor=processor)
    return file_dict
```

```python
pd.set_option('display.max_columns', None)
pg_data = download_data(file_keys=['substitutions_raw_clinical','indels_raw_clinical'])
```


```python
subs = pd.read_csv(pg_data['substitutions_raw_clinical'][0])
print(subs.shape)
print(subs['Gene'].nunique(),"genes")
print(subs.columns)
subs.head() 
```
![Image](https://github.com/user-attachments/assets/0b4c9a50-8a25-4230-a622-1a9f4f4bfdb3)

```python
indels = pd.read_csv(pg_data['indels_raw_clinical'][0])
print(indels.shape)
print(indels['Gene'].nunique(),"genes")
print(indels.columns)
indels.head()
```
![Image](https://github.com/user-attachments/assets/935ae9ff-2137-4944-87d6-ab45770c0b59)

The substitutions file uses RefSeq transcript IDs (e.g. `NM_152486.4`), while the indels file uses Ensembl transcript IDs (e.g. `ENST00000263574.5`).
This adds an extra step of mapping between ID types.
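
To make the mismatch easy to check, here is a minimal sketch that classifies ID strings by their prefix. The function names and regular expressions are my own assumptions (they only cover RefSeq-style `NM_`/`NR_`/`XM_`/`XR_` accessions and Ensembl `ENST` transcript IDs), so treat it as a rough check rather than a complete validator.

```python
import re
from collections import Counter

# Patterns for the two ID styles seen in the files (version suffix optional).
# These are simplified assumptions, not an exhaustive accession grammar.
REFSEQ_RE = re.compile(r'^[NX][MR]_\d+(\.\d+)?$')
ENSEMBL_RE = re.compile(r'^ENST\d{11}(\.\d+)?$')


def classify_transcript_id(tid):
    """Return 'refseq', 'ensembl', or 'unknown' for a transcript ID string."""
    tid = str(tid).strip()
    if REFSEQ_RE.match(tid):
        return 'refseq'
    if ENSEMBL_RE.match(tid):
        return 'ensembl'
    return 'unknown'


def summarize_id_types(ids):
    """Count how many IDs fall into each namespace."""
    return Counter(classify_transcript_id(t) for t in ids)


# The two example IDs quoted above, plus one string that matches neither style
print(summarize_id_types(['NM_152486.4', 'ENST00000263574.5', 'foo']))
```

Passing a whole ID column from each DataFrame to `summarize_id_types` should show whether a file mixes ID types or uses a single one throughout.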

Columns affected by this include:
- `Gene`
- `HGVSc`
- `HGVSp` (missing from the indels file)
- `Symbol` (missing from the substitutions file)
- `protein` (missing from the indels file)
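
The column differences can be listed programmatically. Below is a small sketch; `compare_columns` is a hypothetical helper, and the toy column lists contain only the columns named above, not the full headers of the real files.

```python
def compare_columns(subs_cols, indels_cols):
    """Report which column names are shared and which appear in only one file."""
    subs_cols, indels_cols = set(subs_cols), set(indels_cols)
    return {
        'shared': sorted(subs_cols & indels_cols),
        'only_in_subs': sorted(subs_cols - indels_cols),
        'only_in_indels': sorted(indels_cols - subs_cols),
    }


# Toy column lists limited to the columns mentioned in this issue
subs_cols = ['Gene', 'HGVSc', 'HGVSp', 'protein']
indels_cols = ['Gene', 'HGVSc', 'Symbol']

report = compare_columns(subs_cols, indels_cols)
for key, cols in report.items():
    print(key, cols)
```

With the real data, `compare_columns(subs.columns, indels.columns)` gives the same kind of report for the full headers.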

If possible, could you update the files so they are annotated more consistently? I find the Ensembl IDs especially useful for mapping onto other resources and for comparing results within the same transcripts between subs and indels.
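
For context, this is roughly the workaround currently needed to compare transcripts across the two files. It is only a sketch: the lookup dict stands in for an external RefSeq-to-Ensembl mapping that I would have to source separately, all IDs in it are made-up placeholders (not real transcript pairs), and ignoring the version suffix when matching is my own simplification.

```python
def strip_version(tid):
    """Drop a trailing '.N' version suffix, e.g. 'NM_152486.4' -> 'NM_152486'."""
    return str(tid).split('.', 1)[0]


def to_ensembl(refseq_ids, refseq_to_ensembl):
    """Map RefSeq IDs to Ensembl IDs via a user-supplied dict; None if unmapped."""
    return [refseq_to_ensembl.get(strip_version(t)) for t in refseq_ids]


# Placeholder IDs only: these are NOT real transcript pairs
lookup = {'NM_999991': 'ENST00000000001', 'NM_999992': 'ENST00000000002'}
subs_ids = ['NM_999991.3', 'NM_999992.1', 'NM_999993.2']
indel_ids = ['ENST00000000002.7', 'ENST00000000009.1']

# Translate the substitution IDs, then intersect with the versionless indel IDs
mapped = to_ensembl(subs_ids, lookup)
shared = {m for m in mapped if m} & {strip_version(t) for t in indel_ids}
print(mapped)
print(shared)
```

Any ID missing from the lookup comes back as `None`, so variants on unmapped transcripts drop out of the comparison, which is part of why consistent IDs in the source files would help.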

Thanks, 
Brian
