BioSet2Vec is a tool designed to extract k-mer dictionaries from multiple sets of biological sequences using distributed computing. This method is efficient for large-scale biological sequence analysis, enabling users to handle diverse sequence sets, such as DNA sequences, and extract k-mer representations in a distributed fashion. The extracted k-mer dictionaries can be used for downstream tasks like sequence comparison, feature extraction, and machine learning.
- Distributed k-mer extraction: Handles multiple biological sequence sets.
- Scalability: Optimized for large datasets using distributed computation.
- Flexible k-mer size: Adjustable k-mer lengths to suit the user’s specific needs.
- Support for various biological sequences: Compatible with DNA sequences.
- Efficient I/O handling: Works with large files in FASTA, FASTQ, or other common biological sequence formats.
- Easy setup with JSON configuration: Use an input.json file to manage configuration parameters.
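To make the idea of a k-mer dictionary concrete, here is a minimal single-machine sketch in plain Python. It is not BioSet2Vec's distributed implementation; the function name `kmer_counts` is hypothetical and only illustrates what "all k-mers with lengths from k_min to k_max" means for one sequence.

```python
from collections import Counter
from typing import Dict


def kmer_counts(sequence: str, k_min: int, k_max: int) -> Dict[str, int]:
    """Count every overlapping k-mer of length k_min..k_max in one sequence.

    Hypothetical helper for illustration only; BioSet2Vec performs this
    step in a distributed fashion over whole sets of sequences.
    """
    sequence = sequence.upper()
    counts: Counter = Counter()
    for k in range(k_min, k_max + 1):
        # Slide a window of width k over the sequence, one position at a time.
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(kmer_counts("ACGTACG", 3, 4))
```

A sequence shorter than k simply contributes no k-mers of that length.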
- Python ">=3.8,<3.9" https://www.python.org/downloads/release/python-380/
- JDK 8
- Apache Spark (for distributed computing) version 3.5.1
- bioft.jar from https://github.com/Ylenia1211/Bioft.git
Operating system(s): Platform independent
- Install and set up Java:
To install Liberica JDK 1.8.0_422 on your system, you can follow the instructions provided in the official Liberica documentation https://bell-sw.com/pages/downloads/#jdk-8-lts. Once the installation is complete, make sure to configure your development environment to point to the new Java installation.
/usr/libexec/java_home -V
export JAVA_HOME=`/usr/libexec/java_home -v 1.8.0_422-b06`
nano ~/.zshrc
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_422-b06)
Press CTRL+X to exit the editor and Y to save your changes, then check:
source ~/.zshrc
echo $JAVA_HOME
java -version
- Set up Apache Spark version 3.5.1:
Follow the official guide to set up Apache Spark in your environment.
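Before moving on, it can help to confirm that the prerequisites are visible from Python. The sketch below is an optional convenience, not part of BioSet2Vec: `check_environment` is a hypothetical helper, and the `spark-submit` check assumes that Spark's `bin` directory has been added to your PATH, which your Spark setup may not require.

```python
import os
import shutil
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which prerequisites are visible from this interpreter.

    Hypothetical helper for illustration; it only inspects the local
    environment and does not start Java or Spark.
    """
    return {
        # The requirements list asks for Python >=3.8,<3.9.
        "python_3_8": sys.version_info[:2] == (3, 8),
        # A `java` executable reachable through PATH.
        "java_on_path": shutil.which("java") is not None,
        # JAVA_HOME exported as described in the Java step.
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
        # Assumption: Spark's bin directory is on PATH.
        "spark_submit_on_path": shutil.which("spark-submit") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```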
- Clone the repository:
git clone https://github.com/Ylenia1211/BioSet2Vec.git
cd BioSet2Vec
- Place bioft.jar in the working directory (rename the file to "bioft.jar" if it has a different name). bioft.jar is required to run the job; download it and copy it into your project directory:
cp /path/to/bioft.jar ./bioset2vec/bioft.jar
- Create and activate a virtual environment in the BioSet2Vec directory:
python3.8 -m venv venv_bioset2vec
# Activate on Linux/macOS
source venv_bioset2vec/bin/activate
# Or activate on Windows
venv_bioset2vec\Scripts\activate
- Install the "BioSet2Vec" library (with its dependencies). Run:
pip install --upgrade pip
Build the distribution. From the directory that contains setup.py (where bioset2vec is located), run:
pip install wheel
python setup.py sdist bdist_wheel
Once the distributions are created, you can install the package directly from the dist/ directory:
pip install dist/bioset2vec-0.1.0-py3-none-any.whl
Inside the main folder (same level as setup.py), run:
pip install -e .
to install the package locally, or
pip install bioset2vec/dist/bioset2vec-0.1.0-py3-none-any.whl --force-reinstall
Configure BioSet2Vec by creating an input.json file with the following parameters:
{
"path_jar": "./bioset2vec/bioft.jar",
"folder_path": "./set_synthetic/",
"k_min": 3,
"k_max": 4,
"n": 0,
"n_core": 6,
"ram": 6,
"offHeap_size": 2
}
path_jar: Path to the bioft.jar file.
folder_path: Path to the folder containing sequence sets.
k_min and k_max: Minimum and maximum k-mer lengths.
n: Number of Monte Carlo test iterations to perform (set "n": 0 if you only want to compute a simple TF-IDF on your input sets).
n_core: Number of CPU cores to use.
ram: RAM in GB for Spark.
offHeap_size: Off-heap memory size in GB.
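The configuration can also be generated and sanity-checked from Python before a run. The sketch below is only an illustration: `validate_config` and `write_config` are hypothetical helpers, and the constraints they check (for example k_min <= k_max) are common-sense assumptions, not rules documented by BioSet2Vec.

```python
import json
from pathlib import Path
from typing import List

# Keys taken from the example input.json above.
REQUIRED_KEYS = ("path_jar", "folder_path", "k_min", "k_max",
                 "n", "n_core", "ram", "offHeap_size")


def validate_config(config: dict) -> List[str]:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: the checks are assumptions, not BioSet2Vec rules.
    """
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        return [f"missing keys: {', '.join(missing)}"]
    problems = []
    if not 1 <= config["k_min"] <= config["k_max"]:
        problems.append("expected 1 <= k_min <= k_max")
    if config["n"] < 0:
        problems.append("n must be 0 or a positive number of iterations")
    if config["n_core"] < 1 or config["ram"] < 1:
        problems.append("n_core and ram must be at least 1")
    return problems


def write_config(config: dict, path: str = "input.json") -> None:
    """Write the configuration as JSON, refusing to write if checks fail."""
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    Path(path).write_text(json.dumps(config, indent=2))


if __name__ == "__main__":
    write_config({
        "path_jar": "./bioset2vec/bioft.jar",
        "folder_path": "./set_synthetic/",
        "k_min": 3, "k_max": 4, "n": 0,
        "n_core": 6, "ram": 6, "offHeap_size": 2,
    })
```

Keeping validation separate from writing makes it easy to reuse the same checks on a file loaded with `json.load`.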
You can use the functionality directly in your Python scripts:
from bioSet2Vec import BioSet2Vec
# Configure BioSet2Vec
input_file = "input.json"
params = BioSet2Vec.read_parameters_from_file(input_file)
# Run the TF-IDF transformation and save the results
BioSet2Vec.compute(params)
For a hands-on example of how to use BioSet2Vec, see notebooks/example_notebook.ipynb in the notebooks folder.
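For readers unfamiliar with TF-IDF over sequence sets, the following sketch shows one textbook variant in plain Python, treating each sequence set as a "document" and each k-mer as a "term". It is an assumption-laden illustration: the function `tf_idf` is hypothetical, and the exact weighting that BioSet2Vec computes on Spark may differ from the formula used here.

```python
import math
from typing import Dict


def tf_idf(set_counts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Score k-mers per set with tf = count / total and idf = ln(N / df).

    Hypothetical helper; `set_counts` maps a set name to its k-mer counts.
    """
    n_sets = len(set_counts)
    # Document frequency: in how many sets does each k-mer occur?
    doc_freq: Dict[str, int] = {}
    for counts in set_counts.values():
        for kmer in counts:
            doc_freq[kmer] = doc_freq.get(kmer, 0) + 1

    scores: Dict[str, Dict[str, float]] = {}
    for name, counts in set_counts.items():
        total = sum(counts.values())
        scores[name] = {
            kmer: (count / total) * math.log(n_sets / doc_freq[kmer])
            for kmer, count in counts.items()
        }
    return scores


if __name__ == "__main__":
    example = {
        "set_A": {"ACG": 2, "CGT": 2},
        "set_B": {"ACG": 1, "TTT": 3},
    }
    for name, kmers in tf_idf(example).items():
        print(name, kmers)
```

With this variant, a k-mer that occurs in every set receives a score of zero, so the highest-scoring k-mers are those that are frequent in one set and rare across the others.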
The BioSet2Vec library was tested by using the following public data:
- https://afproject.org/app/benchmark/genome/std/assembled/ecoli/dataset/
- https://afproject.org/app/benchmark/genome/std/unassembled/plants/dataset/
- https://hgdownload.cse.ucsc.edu/goldenPath/dm3/bigZips/
To easily test the package use the dataset: Synthetic Data and Results.zip
If you use BioSet2Vec in your research or project, please cite the tool as follows:
@article{galluzzo2025bioset2vec,
title={BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies},
author={Galluzzo, Ylenia and Giancarlo, Raffaele and Rombo, Simona E and Utro, Filippo},
journal={BMC bioinformatics},
volume={26},
number={1},
pages={264},
year={2025},
publisher={Springer}
}