
Panic in Rust Code When Running tokenise_bio.py on GRCh38.p13 FASTA, Successful on Smaller FASTA File #5

@schultzjack

Description

Issue Description:
While attempting to reproduce the findings in your preprint, I'm encountering an error when running the tokenise_bio.py script on large (3 GB or larger) FASTA files, including GCF_000001405.39_GRCh38.p13: it fails with a panic in the Rust code. The script runs successfully on much smaller (200 KB) FASTA files. I have tried both the original .fna file extension and .fasta, but neither gets past this particular error.

The error log indicates a panic originating in the Hugging Face Tokenizers library (tokenizers-lib), specifically: thread panicked at 'called Result::unwrap() on an Err value: Internal', tokenizers-lib/src/models/unigram/trainer.rs:212:53.
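In case it's relevant, here is the kind of quick sanity check I can run to confirm the FASTA itself parses and to count its residues before tokenising (a minimal plain-Python sketch, not part of tokenise_bio.py; the example file below is illustrative, not the real genome):

```python
from collections import Counter
import os
import tempfile

def fasta_stats(path):
    """Count sequences, total bases, and per-residue frequencies in a FASTA file."""
    n_seqs = 0
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                n_seqs += 1
            else:
                counts.update(line.upper())
    return n_seqs, sum(counts.values()), counts

# Demonstration on a tiny synthetic FASTA (the real input would be the
# GCF_000001405.39_GRCh38.p13_genomic.fna path):
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as f:
    f.write(">chr_test\nACGTNNACGT\nACGT\n")
n, total, counts = fasta_stats(f.name)
os.unlink(f.name)
print(f"{n} sequence(s), {total} bases, N count = {counts['N']}")
# → 1 sequence(s), 14 bases, N count = 2
```

A high proportion of N (ambiguous) bases, common in reference assemblies, is one thing I'd want to rule out as a trigger.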

To Reproduce
Environment: AWS EC2 instance
Conda environment with Python 3.9

Command Used for Error:

RUST_BACKTRACE=full tokenise_bio -i /data/ncbi_dataset/GCF_000001405.39_GRCh38.p13_genomic.fasta -t '/data/generated/ncbi_tokenisers/tokeniser_39_GRCh38.json'

Output & error message: https://gist.github.com/stepwise-ai-dev/f23a79faaedd006bf51d486259440dd5

Steps to Reproduce:
Run the tokenise_bio.py script with GCF_000001405.39_GRCh38.p13_genomic.fna as input

Expected output
I expected output similar to a previous successful execution of tokenise_bio.py on a smaller (200 KB) FASTA file.
Successful output: https://gist.github.com/stepwise-ai-dev/bbd782f3d09afaca6219b2c3a176bb42

Questions:

  1. Is this a known issue with the code or a known issue specific to certain system specifications?
  2. Are there any additional specific requirements or configurations needed for processing larger genome file sizes?
  3. Are there any additional data pre-processing steps required between downloading the NCBI data and using it as input for tokenise_bio.py?

Any guidance on resolving this issue would be greatly appreciated.

Thank you so much!

Labels: bug