
FIRE on targeted seq data: fiber-locations-shuffled.bed.gz is created empty #7

@Strausyatina

Hi Mitchell!
We tried to run FIRE on targeted sequencing data, and the pipeline fails with "polars.exceptions.NoDataError: empty CSV" because fiber-locations-shuffled.bed.gz is created empty.

A BED file containing the complement of the targeted regions was used for exclusion in filtered_and_shuffled_fiber_locations_chromosome.

What could be wrong with how we are using FIRE? Is it suitable for this kind of task?

Config yaml:

ref: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa
ref_name: hg38
n_chunks: 1 # split bam file across x chunks
max_t: 4 # use X threads per chunk
manifest: config/config_targeted_project.tbl # table with samples to process

keep_chromosomes: chr4 # only keep chrs matching this regex.
keep_chromosomes: chr7
keep_chromosomes: chr20
## Force a read coverage instead of calculating it genome wide from the bam file.
## This can be useful if only a subset of the genome has reads.
#force_coverage: 50

## regions to not use when identifying null regions that should not have RE, below are the defaults auto used for hg38.
excludes:
 - workflow/annotations/hg38.fa.sorted.bed
#- workflow/annotations/hg38.gap.bed.gz
#- workflow/annotations/SDs.merged.hg38.bed.gz

## you can optionally specify a model that is not the default.
# model: models/my-custom-model.dat

##
## only used if training a new model
##
# train: True
# dhs: workflow/annotations/GM12878_DHS.bed.gz # regions of suspected regulatory elements
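
One thing that may be worth checking in the config above: keep_chromosomes appears three times. Many YAML loaders silently keep only the last value of a repeated key, so it is possible that only chr20 is in effect. Below is a minimal, standard-library-only sketch that spots repeated top-level keys and tests a single combined regex. The helper name `duplicate_top_level_keys` and the pattern `chr(4|7|20)$` are illustrative assumptions, not taken from FIRE, and how FIRE actually applies the regex is not confirmed here.

```python
import re
from collections import Counter

# Trimmed copy of the relevant config lines from this report.
CONFIG_TEXT = """\
ref_name: hg38
keep_chromosomes: chr4
keep_chromosomes: chr7
keep_chromosomes: chr20
"""


def duplicate_top_level_keys(yaml_text):
    """Return top-level keys that appear more than once.

    This is a naive line scan, not a YAML parser (hypothetical helper).
    """
    keys = []
    for line in yaml_text.splitlines():
        # Skip blank lines, comments, list items, and indented (nested) lines.
        if not line or line[0] in " \t#-":
            continue
        match = re.match(r"([A-Za-z_][\w-]*)\s*:", line)
        if match:
            keys.append(match.group(1))
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Assumed single-regex alternative covering all three target chromosomes.
COMBINED = re.compile(r"chr(4|7|20)$")
chroms = ["chr2", "chr4", "chr7", "chr20", "chr21", "chr4_GL000008v2_random"]
kept = [c for c in chroms if COMBINED.match(c)]

print(duplicate_top_level_keys(CONFIG_TEXT))
print(kept)
```

If the duplicate check reports keep_chromosomes, merging the three lines into one regex value is probably the safer way to express the intent.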

Example of error log:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, time=100440, gpus=0
Select jobs to execute...
[Thu Dec 28 16:04:53 2023]
rule fdr_table:
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    jobid: 0
    reason: Forced execution
    wildcards: sm=bc2031
    threads: 8
    resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, tmpdir=/tmp, time=100440, gpus=0
        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
        
Activating conda environment: ../../../../../../../home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
[INFO][Time elapsed (ms) 1068]: Reading FIRE file: results/bc2031/fiber-calls/FIRE.bed.gz
/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py:486: DeprecationWarning: `the argument comment_char` for `read_csv` is deprecated. It has been renamed to `comment_prefix`.
  fire = pl.read_csv(
[INFO][Time elapsed (ms) 1082]: Reading genome file: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
[INFO][Time elapsed (ms) 1085]: Reading fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz
[INFO][Time elapsed (ms) 1095]: Reading shuffled fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz
Traceback (most recent call last):
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 539, in <module>
    defopt.run(main, show_types=True, version="0.0.1")
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/defopt.py", line 356, in run
    return call()
           ^^^^^^
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 517, in main
    shuffled_locations = pl.read_csv(
                         ^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/io/csv/functions.py", line 369, in read_csv
    df = pl.DataFrame._read_csv(
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/dataframe/frame.py", line 784, in _read_csv
    self._df = PyDataFrame.read_csv(
               ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.NoDataError: empty CSV
[Thu Dec 28 16:04:54 2023]
Error in rule fdr_table:
    jobid: 0
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    conda-env: /home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
    shell:
        
        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Index(['bc2029', 'bc2031', 'bc2025', 'bc2027', 'bc2026', 'bc2032', 'bc2030',
       'bc2028'],
      dtype='object', name='sample')
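
The traceback above ends in pl.read_csv being called on a file with no records. A minimal sketch of a pre-check that could turn this into a clearer message before the read is attempted; `gzip_has_data` is a hypothetical helper, not part of FIRE, and it uses only the Python standard library.

```python
import gzip
import os
import tempfile


def gzip_has_data(path):
    """Return True if the gzip file at `path` decompresses to at least one byte."""
    # A zero-byte file is not even a valid gzip stream; treat it as empty.
    if os.path.getsize(path) == 0:
        return False
    with gzip.open(path, "rb") as handle:
        return handle.read(1) != b""


# Small demonstration with temporary files standing in for the real outputs.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.bed.gz")
    full_path = os.path.join(tmp, "full.bed.gz")

    # A valid gzip stream that contains no records.
    with gzip.open(empty_path, "wb"):
        pass
    # A gzip stream with one BED record.
    with gzip.open(full_path, "wt") as out:
        out.write("chr20\t4680670\t4690391\n")

    empty_result = gzip_has_data(empty_path)
    full_result = gzip_has_data(full_path)

print(empty_result, full_result)
```

Running such a check on fiber-locations-shuffled.bed.gz before fdr_table would confirm that the problem originates in the shuffling step rather than in the FDR script.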

Exclusion bed file:

chr1    1   248956422
chr10   1   133797422
chr11   1   135086622
chr12   1   133275309
chr13   1   114364328
chr14   1   107043718
chr15   1   101991189
chr16   1   90338345
chr17   1   83257441
chr18   1   80373285
chr19   1   58617616
chr2    1   242193529
chr20   1   4680670
chr20   4690391 64444167
chr21   1   46709983
chr22   1   50818468
chr3    1   198295559
chr4    1   3072454
chr4    3077294 190214555
chr5    1   181538259
chr6    1   170805979
chr7    1   140917955
chr7    140927420   159345973
chr8    1   145138636
chr9    1   138394717
chrM    1   16569
chrX    1   156040895
chrY    1   57227415
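
Because this exclusion file is the complement of the targets, the only space left for placing shuffled fibers is the gap between consecutive excluded intervals on chr20, chr4, and chr7. The sketch below computes those gaps from the intervals listed above (only the three chromosomes with two intervals are included). `unexcluded_gaps` is a hypothetical helper written for this report, not part of FIRE.

```python
from collections import defaultdict

# Excluded intervals for the three target chromosomes, copied from the file above.
EXCLUDE_BED = """\
chr20 1 4680670
chr20 4690391 64444167
chr4 1 3072454
chr4 3077294 190214555
chr7 1 140917955
chr7 140927420 159345973
"""


def unexcluded_gaps(bed_text):
    """Return (chrom, gap_start, gap_end, gap_length) for gaps between excluded intervals."""
    by_chrom = defaultdict(list)
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    gaps = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        covered_to = intervals[0][1]
        for start, end in intervals[1:]:
            # Anything between the covered extent and the next start is not excluded.
            if start > covered_to:
                gaps.append((chrom, covered_to, start, start - covered_to))
            covered_to = max(covered_to, end)
    return gaps


gaps = unexcluded_gaps(EXCLUDE_BED)
for chrom, gap_start, gap_end, length in gaps:
    print(chrom, gap_start, gap_end, length)
```

The remaining windows are roughly 9.7 kb (chr20), 4.8 kb (chr4), and 9.5 kb (chr7). If the fibers are longer than these windows and the shuffle requires each fiber to avoid excluded regions entirely, there may be nowhere to place them, which could explain the empty output. This is a guess and has not been checked against the FIRE code.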
