compare_gtf_to_ER_01ksb.py crashes when gene_id contains substring "ER" #10
Description
Bug: compare_gtf_to_ER_01ksb.py crashes when gene_id contains substring "ER"
Summary
utilities/compare_gtf_to_ER_01ksb.py raises a ValueError during ER sorting if any gene_id includes the substring "ER" (e.g., CHAER1). The script constructs ER labels as "{gene_id}:ER{n}", but the sorting key uses split("ER"), which breaks when "ER" appears earlier in the gene_id.
Steps to reproduce
- Ensure the ER GTF contains a gene whose gene_id includes "ER", e.g.:
chr4 TranD exon 56697216 56697886 . - . transcript_id "CHAER1"; gene_id "CHAER1";
- Run the script:
python utilities/compare_gtf_to_ER_01ksb.py \
-i <input.gtf> \
-er <er.gtf> \
-o <outdir>
Expected behavior
Script completes successfully and writes the output CSVs:
- *_infoERP.csv
- *_flagER.csv
Actual behavior
Script crashes while building geneDct:
ValueError: invalid literal for int() with base 10: '1:'
Where it fails
The error occurs at the ER sorting step:
geneDct = dict(geneERDf.groupby('gene_id').apply(
lambda x: sorted(set(x['ER']), key=lambda x: int(x.split("ER")[1]))))
When gene_id = "CHAER1", the script generates ER IDs like CHAER1:ER1.
Then "CHAER1:ER1".split("ER") returns ["CHA", "1:", "1"], so split("ER")[1] == "1:", and int("1:") raises ValueError.
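The failure can be reproduced in isolation, without pandas or a GTF file. The snippet below is a minimal sketch that assumes only the "{gene_id}:ER{n}" label format described in this issue; the variable names are illustrative and do not come from the script.

```python
# Stand-alone reproduction of the failing sort key; no pandas or GTF input needed.
label = "CHAER1:ER1"  # label format "{gene_id}:ER{n}", as described in this issue

# The gene_id itself contains "ER", so the string is split in two places.
parts = label.split("ER")
print(parts)

error_message = None
try:
    int(parts[1])  # parts[1] is "1:", not the ER number
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```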
Cause
The ER parsing assumes the first "ER" in the string is the ER suffix delimiter, but "ER" can appear inside gene_id. Using split("ER") is not robust for the constructed ER label format.
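One possible fix is to parse the number from the end of the label rather than from the first "ER". The sketch below assumes every label ends in ":ER{n}"; er_number is a hypothetical helper name, not something that exists in the script, and it has not been tested against the real data.

```python
def er_number(er_label):
    """Return n from a label of the form '{gene_id}:ER{n}' (hypothetical helper)."""
    # rpartition splits on the LAST ":ER", so an "ER" inside gene_id is ignored.
    _, sep, suffix = er_label.rpartition(":ER")
    if not sep:
        raise ValueError(f"unexpected ER label: {er_label!r}")
    return int(suffix)


# Example labels for a gene_id that contains "ER".
labels = {"CHAER1:ER10", "CHAER1:ER2", "CHAER1:ER1"}
sorted_labels = sorted(labels, key=er_number)
print(sorted_labels)
```

With a helper like this, the sort key in the geneDct step could become key=er_number in place of the split("ER") lambda. Sorting on the integer also keeps ER10 after ER2, which a plain string sort would not.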