compare_gtf_to_ER_01ksb.py crashes when gene_id contains substring "ER" #10

@netanyak

Bug: compare_gtf_to_ER_01ksb.py crashes when gene_id contains substring "ER"

Summary

utilities/compare_gtf_to_ER_01ksb.py raises a ValueError during ER sorting if any gene_id includes the substring "ER" (e.g., CHAER1). The script constructs ER labels as "{gene_id}:ER{n}", but the sorting key uses split("ER"), which breaks when "ER" appears earlier in the gene_id.

Steps to reproduce

  1. Ensure the ER GTF contains a gene whose gene_id includes "ER", e.g.:
chr4	TranD	exon	56697216	56697886	.	-	.	transcript_id "CHAER1"; gene_id "CHAER1";
  2. Run the script:
python utilities/compare_gtf_to_ER_01ksb.py \
  -i <input.gtf> \
  -er <er.gtf> \
  -o <outdir>

Expected behavior

Script completes successfully and writes the output CSVs:

  • *_infoERP.csv
  • *_flagER.csv

Actual behavior

Script crashes while building geneDct:

ValueError: invalid literal for int() with base 10: '1:'

Where it fails

The error occurs at the ER sorting step:

geneDct = dict(geneERDf.groupby('gene_id').apply(
    lambda x: sorted(set(x['ER']), key=lambda x: int(x.split("ER")[1]))))

When gene_id = "CHAER1", the script generates ER IDs like CHAER1:ER1.
Then "CHAER1:ER1".split("ER") returns ["CHA", "1:", "1"], so split("ER")[1] == "1:", and int("1:") raises ValueError.
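
The failure can be reproduced in isolation, without any GTF input. This is a minimal sketch; the variable names `label` and `parts` are illustrative and not taken from the script:

```python
# Minimal reproduction of the failing sort key, independent of the script.
label = "CHAER1:ER1"        # ER label built as "{gene_id}:ER{n}"
parts = label.split("ER")   # splits on every "ER", including the one inside the gene_id
print(parts)

try:
    int(parts[1])           # same expression the sort key evaluates
except ValueError as err:
    print(f"ValueError: {err}")
```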

Cause

The ER parsing assumes the first "ER" in the string is the ER suffix delimiter, but "ER" can appear inside gene_id. Using split("ER") is not robust for the constructed ER label format.
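
One possible fix, sketched under the assumption that every ER label ends in ":ER{n}" with an integer n (as the "{gene_id}:ER{n}" format suggests), is to split from the right on ":ER" so that only the final delimiter is used. The helper name `er_index` and the sample labels are hypothetical and not part of the script:

```python
def er_index(label: str) -> int:
    """Return n from a label of the form "{gene_id}:ER{n}"."""
    # rsplit with maxsplit=1 splits only at the last ":ER",
    # so any "ER" inside the gene_id is left untouched.
    return int(label.rsplit(":ER", 1)[1])


# Hypothetical labels for a gene whose gene_id contains "ER".
labels = {"CHAER1:ER10", "CHAER1:ER2", "CHAER1:ER1"}
print(sorted(labels, key=er_index))
```

If this approach suits the script, `er_index` could replace the inner `lambda x: int(x.split("ER")[1])` as the `key` argument to `sorted`. Converting to `int` keeps the numeric ordering, so ER10 still sorts after ER2.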
