Issue with Arabidopsis data #39

@dverac

Hi! I am setting up FIRE with Arabidopsis thaliana data (TAIR10), also produced with nanopore. I have been able to run PacBio and nanopore data from humans without any issues, but I get an error with the Arabidopsis samples. It looks like it might be an issue with the contig names: in TAIR10 they are 1, 2, 3, 4, 5, Pt, and Mt. The run fails without the keep_chromosomes parameter, and it still fails when I set keep_chromosomes to omit the Mt contig.
The keep_chromosomes values I tried: "^(1|2|3|4|5|Pt)+$" and "^(1|2|3|4|5|Mt|Pt)+$"
I also modified the ref and ref_name parameters in config.yaml to point to this genome.
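
As a quick sanity check on the two patterns, here is a minimal sketch that tests them against the TAIR10 contig names listed above. It assumes the pattern is applied as an ordinary Python regex search; how FIRE actually applies keep_chromosomes is not shown in this issue, and the names `contigs`, `patterns`, and `kept` are only for illustration.

```python
import re

# Contig names as they appear in TAIR10 (from the description above).
contigs = ["1", "2", "3", "4", "5", "Pt", "Mt"]

# The two keep_chromosomes values that were tried.
patterns = {
    "without_Mt": r"^(1|2|3|4|5|Pt)+$",
    "with_Mt": r"^(1|2|3|4|5|Mt|Pt)+$",
}

# Assumption: the filter behaves like re.search on each contig name.
kept = {
    label: [c for c in contigs if re.search(pattern, c)]
    for label, pattern in patterns.items()
}

for label, names in kept.items():
    print(label, names)
```

Under that assumption both patterns select the intended contigs, so the regex itself does not look like the culprit. The trailing `+` is not needed, though: it also lets concatenated names such as "12" match.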

Error:
localrule coverage:
input: temp/AT_control.filtered.nuc/coverage/AT_control.filtered.nuc-v0.1.1.bed.gz
output: results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-median-coverage.txt, results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-minimum-coverage.txt, results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-maximum-coverage.txt
jobid: 0
benchmark: results/AT_control.filtered.nuc/additional-outputs-v0.1.1/benchmarks/coverage/AT_control.filtered.nuc.txt
reason: Forced execution
wildcards: sm=AT_control.filtered.nuc, v=v0.1.1
resources: mem_mb=65536, mem_mib=62500, disk_mb=4096, disk_mib=3907, tmpdir=/scratch/local/jobs/31068711, runtime=200, slurm_account=pi-spott, slurm_partition=caslake

Activating conda environment: ../../../dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_
Activating conda environment: ../../../dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_
Traceback (most recent call last):
  File "/project/spott/1_Shared_projects/AT_Fiber_seq/FIRE/.snakemake/scripts/tmpibtfejrf.cov.py", line 72, in <module>
    df = polars_read()
  File "/project/spott/1_Shared_projects/AT_Fiber_seq/FIRE/.snakemake/scripts/tmpibtfejrf.cov.py", line 53, in polars_read
    pl.read_csv(
  File "/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_/lib/python3.10/site-packages/polars/utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
  File "/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_/lib/python3.10/site-packages/polars/utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
  File "/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_/lib/python3.10/site-packages/polars/utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
  File "/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_/lib/python3.10/site-packages/polars/io/csv/functions.py", line 499, in read_csv
    df = _read_csv_impl(
  File "/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_/lib/python3.10/site-packages/polars/io/csv/functions.py", line 645, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse Mt as dtype i64 at column 'column_1' (column number 1)

  The current offset in the file is 1325113890 bytes.
  
  You might want to try:
    - increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
  - specifying correct dtype with the `dtypes` argument
  - setting `ignore_errors` to `True`,
  - adding `Mt` to the `null_values` list.
  
  Original error: ```remaining bytes non-empty```
  RuleException:
    CalledProcessError in file /project/spott/1_Shared_projects/AT_Fiber_seq/FIRE/workflow/rules/coverages.smk, line 48:
    Command 'source /software/python-anaconda-2024.10-el8-x86_64/bin/activate '/project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_'; set -euo pipefail;  python /project/spott/1_Shared_projects/AT_Fiber_seq/FIRE/.snakemake/scripts/tmpibtfejrf.cov.py' returned non-zero exit status 1.
  [Sun May 18 15:54:45 2025]
  Error in rule coverage:
    jobid: 0
  input: temp/AT_control.filtered.nuc/coverage/AT_control.filtered.nuc-v0.1.1.bed.gz
  output: results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-median-coverage.txt, results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-minimum-coverage.txt, results/AT_control.filtered.nuc/additional-outputs-v0.1.1/coverage/AT_control.filtered.nuc-v0.1.1-maximum-coverage.txt
  conda-env: /project/spott/dveracruz/bin/snakemake_conda_envs/04f36a1cabb48e10bcbd66f83d9ec8ab_
