Issue Description:
While attempting to reproduce the findings in your preprint, I encountered an error when running the tokenise_bio.py script on large (3 GB or larger) FASTA files, including GCF_000001405.39_GRCh38.p13: it fails with a panic in the underlying Rust code. The script runs successfully on much smaller FASTA files (~200 KB). I have tried both the original .fna file extension and .fasta, but neither gets past this error.
The error log shows a panic originating in the Rust code of the Hugging Face Tokenizers library, at tokenizers-lib/src/models/unigram/trainer.rs:212:53, with the message: thread panicked at "called `Result::unwrap()` on an `Err` value: Internal".
To Reproduce
Environment: AWS EC2 instance
Conda virtual environment with Python 3.9
Command Used for Error:
RUST_BACKTRACE=full tokenise_bio -i /data/ncbi_dataset/GCF_000001405.39_GRCh38.p13_genomic.fasta -t /data/generated/ncbi_tokenisers/tokeniser_39_GRCh38.json
Output & error message: https://gist.github.com/stepwise-ai-dev/f23a79faaedd006bf51d486259440dd5
Steps to Reproduce:
Run the tokenise_bio.py script with GCF_000001405.39_GRCh38.p13_genomic.fna as input
Expected output
I expected output similar to a previous successful execution of tokenise_bio.py on a smaller (200 KB) FASTA file.
Successful output: https://gist.github.com/stepwise-ai-dev/bbd782f3d09afaca6219b2c3a176bb42
Questions:
- Is this a known issue with the code, or one specific to certain system specifications?
- Are there any additional specific requirements or configurations needed for processing larger genome file sizes?
- Are there any additional data pre-processing steps required between downloading the NCBI data and using it as input for tokenise_bio.py?
Any guidance on resolving this issue would be greatly appreciated.
Thank you so much!