
Panic in Rust Code When Running tokenise_bio.py on GRCh38.p13 FASTA, Successful on Smaller FASTA File #5

@schultzjack

Description

Issue Description:
While attempting to reproduce the findings in your preprint, I'm encountering an error when running the tokenise_bio.py script on large (3 GB or larger) FASTA files, including GCF_000001405.39_GRCh38.p13: it fails with a panic in the Rust code. The script runs successfully on much smaller (200 KB) FASTA files. I have tried both the original .fna file extension and .fasta, but neither gets past this particular error.

The error log indicates a panic originating in the Hugging Face Tokenizers library (tokenizers-lib), specifically: thread panicked at 'called Result::unwrap() on an Err value: Internal', tokenizers-lib/src/models/unigram/trainer.rs:212:53.
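In case it's relevant, here is the kind of quick sanity check I can run to confirm the FASTA itself parses and to count its residues before tokenising (a minimal plain-Python sketch, not part of tokenise_bio.py; the example file below is illustrative, not the real genome):

```python
from collections import Counter
import os
import tempfile

def fasta_stats(path):
    """Count sequences, total bases, and per-residue frequencies in a FASTA file."""
    n_seqs = 0
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                n_seqs += 1
            else:
                counts.update(line.upper())
    return n_seqs, sum(counts.values()), counts

# Demonstration on a tiny synthetic FASTA (the real input would be the
# GCF_000001405.39_GRCh38.p13_genomic.fna path):
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as f:
    f.write(">chr_test\nACGTNNACGT\nACGT\n")
n, total, counts = fasta_stats(f.name)
os.unlink(f.name)
print(f"{n} sequence(s), {total} bases, N count = {counts['N']}")
# → 1 sequence(s), 14 bases, N count = 2
```

A high proportion of N (ambiguous) bases, common in reference assemblies, is one thing I'd want to rule out as a trigger.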

To Reproduce
Environment: AWS EC2 instance
Conda environment with Python 3.9

Command Used for Error:

RUST_BACKTRACE=full tokenise_bio -i /data/ncbi_dataset/GCF_000001405.39_GRCh38.p13_genomic.fasta -t '/data/generated/ncbi_tokenisers/tokeniser_39_GRCh38.json'

Output & error message: https://gist.github.com/stepwise-ai-dev/f23a79faaedd006bf51d486259440dd5

Steps to Reproduce:
Run the tokenise_bio.py script with GCF_000001405.39_GRCh38.p13_genomic.fna as input

Expected output
I expected output similar to a previous successful execution of tokenise_bio.py on a smaller (200 KB) FASTA file.
Successful output: https://gist.github.com/stepwise-ai-dev/bbd782f3d09afaca6219b2c3a176bb42

Questions:

  1. Is this a known issue with the code or a known issue specific to certain system specifications?
  2. Are there any additional specific requirements or configurations needed for processing larger genome file sizes?
  3. Are there any additional data pre-processing steps required between downloading the NCBI data and using it as input for tokenise_bio.py?

Any guidance on resolving this issue would be greatly appreciated.

Thank you so much!

Labels: bug