Regex Typo in Frangieh 2021 processing

Hi,

The perturbation names in the Frangieh 2021 processing notebook are derived from the sgRNA labels using this regex line:
```
adata.obs['perturbation_name'] = [re.sub('[_123]+', '', s) for s in adata.obs['sgRNA']]
```

The pattern should be changed to `'_[123]+'`. The line is meant to strip only the guide number suffix, but `'[_123]+'` is a character class, so right now it also strips any 1, 2, or 3 that appears in the gene name itself.

Unique values in `adata.obs["sgRNA"]`:
```
Index(['A2M_1', 'A2M_2', 'A2M_3', 'ACSL3_1', 'ACSL3_2', 'ACSL3_3', 'ACTA2_1',
       'ACTA2_2', 'ACTA2_3', 'AEBP1_1',
       ...
       'VDAC2_3', 'WBP2_1', 'WBP2_2', 'WBP2_3', 'WNT7A_1', 'WNT7A_2',
       'WNT7A_3', 'XAGE1A_1', 'XAGE1A_2', 'XAGE1A_3'],
      dtype='object', length=818)
```
In the processed data file, the current line turns the first gene, `A2M`, into `AM`. This loses information and makes it difficult to match guides to the data.
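
To make the difference concrete, here is a minimal sketch comparing the current pattern with the proposed one on a few labels copied from the listing above. The list `sgrnas` and the variable names are only for illustration and are not part of the notebook; the `anchored` variant is an extra suggestion of mine, assuming the guide number is always a trailing `_<digits>` suffix.

```python
import re

# A few sgRNA labels taken from the adata.obs["sgRNA"] listing above
sgrnas = ['A2M_1', 'ACSL3_2', 'ACTA2_3', 'VDAC2_3', 'WBP2_1', 'XAGE1A_2']

# Current pattern: a character class, so every '_', '1', '2', '3' is removed
current = [re.sub('[_123]+', '', s) for s in sgrnas]

# Proposed pattern: an underscore followed by one or more of 1, 2, 3
proposed = [re.sub('_[123]+', '', s) for s in sgrnas]

# Optional stricter variant (assumption): only strip a trailing _<digits> suffix
anchored = [re.sub(r'_\d+$', '', s) for s in sgrnas]

print(current)   # ['AM', 'ACSL', 'ACTA', 'VDAC', 'WBP', 'XAGEA']
print(proposed)  # ['A2M', 'ACSL3', 'ACTA2', 'VDAC2', 'WBP2', 'XAGE1A']
```

With the current pattern, distinct genes could also collapse onto the same name if they differ only in those digits, which is another reason to prefer the stricter pattern.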

