BioSet2Vec

BioSet2Vec is a tool designed to extract k-mer dictionaries from multiple sets of biological sequences using distributed computing. This method is efficient for large-scale biological sequence analysis, enabling users to handle diverse sequence sets, such as DNA sequences, and extract k-mer representations in a distributed fashion. The extracted k-mer dictionaries can be used for downstream tasks like sequence comparison, feature extraction, and machine learning.

Features

Distributed k-mer extraction: Handles multiple biological sequence sets.
Scalability: Optimized for large datasets using distributed computation.
Flexible k-mer size: Adjustable k-mer lengths to suit the user’s specific needs.
Support for various biological sequences: Compatible with DNA sequences.
Efficient I/O handling: Works with large files in FASTA, FASTQ, or other common biological sequence formats.
Easy setup with JSON configuration: Use an input.json file to manage configuration parameters.
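
To make the idea of a k-mer dictionary concrete, here is a minimal single-machine sketch in plain Python. It is not how BioSet2Vec computes its dictionaries (the tool relies on Spark and bioft.jar). The function name `kmer_counts` and the toy sequences are assumptions made up for illustration.

```python
from collections import Counter

def kmer_counts(sequences, k_min, k_max):
    """Count every k-mer with k_min <= k <= k_max over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in range(k_min, k_max + 1):
            # Slide a window of width k over the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts

# Toy set of DNA sequences (hypothetical data, not from the repository).
toy_set = ["ACGTAC", "ACGT"]
dictionary = kmer_counts(toy_set, k_min=3, k_max=4)
print(dictionary.most_common(3))
```

Sequences shorter than k simply contribute no k-mers of that length, because the inner range is empty.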

Requirements

Python 3.8 (version constraint ">=3.8,<3.9"): https://www.python.org/downloads/release/python-380/
JDK 8
Apache Spark (for distributed computing) version 3.5.1
bioft.jar from https://github.com/Ylenia1211/Bioft.git

Operating system(s): Platform independent

Installation

Install and setup JAVA:

Install Liberica JDK 1.8.0_422 by following the official Liberica instructions at https://bell-sw.com/pages/downloads/#jdk-8-lts. Once the installation is complete, configure your environment to point to the new Java installation. The commands below use /usr/libexec/java_home, which is available on macOS only; on other systems, set JAVA_HOME to the JDK installation directory by hand.

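        # List the installed JDKs
        /usr/libexec/java_home -V
        # Set JAVA_HOME for the current shell
        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_422-b06)
        # To make the setting permanent, open ~/.zshrc and add the export line to it
        nano ~/.zshrc
        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_422-b06)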

Press CTRL+X to exit the editor and Y to save your changes, then reload the shell configuration and check the result:

        source ~/.zshrc
        echo $JAVA_HOME
        java -version
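
The same check can be scripted from Python before launching a job. The sketch below is only an illustration: the helper name `java_problems` is hypothetical and not part of BioSet2Vec, and it only tests that JAVA_HOME is set to a directory and that a `java` executable is on the PATH, not which JDK version is installed.

```python
import os
import shutil

def java_problems(environ=os.environ, which=shutil.which):
    """Return a list of human-readable problems with the Java setup."""
    problems = []
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME does not point to a directory: {java_home}")
    if which("java") is None:
        problems.append("no 'java' executable found on PATH")
    return problems

# An empty list means the basic setup looks fine.
for problem in java_problems():
    print("WARNING:", problem)
```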

Set up Apache Spark version 3.5.1:

Follow the official guide to set up Apache Spark in your environment.

Clone the repository:

        git clone https://github.com/Ylenia1211/BioSet2Vec.git
        cd BioSet2Vec

Place bioft.jar in the working directory (rename the file to "bioft.jar" if it has a different name):

bioft.jar is required to run jobs. Download it and copy it into your project directory:
```
cp /path/to/bioft.jar ./bioset2vec/bioft.jar
```
Create and activate a virtual environment in the BioSet2Vec directory:

    python3.8 -m venv venv_bioset2vec
    # Activate on Linux/macOS
    source venv_bioset2vec/bin/activate
    # Or activate on Windows
    venv_bioset2vec\Scripts\activate

Install the BioSet2Vec library and its dependencies. First upgrade pip:
```
    pip install --upgrade pip
```
Build the distribution. From the directory that contains setup.py (the one that also contains the bioset2vec folder), run:
```
   pip install wheel
   python setup.py sdist bdist_wheel
```
Once the distributions are created, you can install the package directly from the dist/ directory:
```
pip install dist/bioset2vec-0.1.0-py3-none-any.whl
```
Inside the main folder (same level as setup.py), run:
```
 pip install -e .
```
to install the package locally in editable mode, or run
```
 pip install bioset2vec/dist/bioset2vec-0.1.0-py3-none-any.whl --force-reinstall
```

Configuration

Configure BioSet2Vec by creating an input.json file with the following parameters:

    {
        "path_jar": "./bioset2vec/bioft.jar",
        "folder_path": "./set_synthetic/",
        "k_min": 3,
        "k_max": 4,
        "n": 0, 
        "n_core": 6,
        "ram": 6,
        "offHeap_size": 2
    }
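
If you prefer to generate the file from a script, the sketch below builds the same parameters in Python, applies a few sanity checks, and writes them out as JSON. The helpers `validate` and `write_config` are hypothetical, and the rules they check are assumptions for illustration, not checks performed by BioSet2Vec itself.

```python
import json

# Values copied from the sample configuration above.
params = {
    "path_jar": "./bioset2vec/bioft.jar",
    "folder_path": "./set_synthetic/",
    "k_min": 3,
    "k_max": 4,
    "n": 0,
    "n_core": 6,
    "ram": 6,
    "offHeap_size": 2,
}

def validate(p):
    """Basic sanity checks (assumed rules, not enforced by the library)."""
    if not 1 <= p["k_min"] <= p["k_max"]:
        raise ValueError("expected 1 <= k_min <= k_max")
    if p["n"] < 0:
        raise ValueError("n must be non-negative")
    for key in ("n_core", "ram"):
        if p[key] <= 0:
            raise ValueError(f"{key} must be positive")
    return p

def write_config(p, path="input.json"):
    """Validate the parameters and write them to a JSON file."""
    with open(path, "w") as fh:
        json.dump(validate(p), fh, indent=4)

if __name__ == "__main__":
    write_config(params)
```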

Parameter Details

path_jar: Path to the bioft.jar file.

folder_path: Path to the folder containing sequence sets.

k_min and k_max: Minimum and maximum k-mer lengths.

n: Number of Monte Carlo test iterations to run. Set "n": 0 to use the package only for a simple TF-IDF computation on your input sets.

n_core: Number of CPU cores to use.

ram: RAM in GB for Spark.

offHeap_size: Off-heap memory size in GB.
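
As a rough guide to what the resource parameters mean, the sketch below shows one way they could translate into Spark settings. The property names are standard Spark configuration keys, but the mapping itself (including running Spark in local mode) is an assumption. This README does not state which properties BioSet2Vec actually sets, and `spark_settings` is a hypothetical helper.

```python
def spark_settings(params):
    """Map BioSet2Vec-style parameters to Spark properties (assumed mapping)."""
    return {
        # Local mode with n_core worker threads (assumption).
        "spark.master": f"local[{params['n_core']}]",
        # ram is given in GB.
        "spark.driver.memory": f"{params['ram']}g",
        # Off-heap memory must be enabled explicitly before a size applies.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{params['offHeap_size']}g",
        # Make the jar available to the Spark job.
        "spark.jars": params["path_jar"],
    }

settings = spark_settings({
    "path_jar": "./bioset2vec/bioft.jar",
    "n_core": 6,
    "ram": 6,
    "offHeap_size": 2,
})
for key, value in settings.items():
    print(f"{key}={value}")
```

With PySpark installed, each pair could be passed to `SparkSession.builder.config(key, value)` when building a session by hand.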

Usage

You can use the functionality directly in your Python scripts:

    from bioSet2Vec import BioSet2Vec

    # Read the configuration for BioSet2Vec
    input_file = "input.json"
    params = BioSet2Vec.read_parameters_from_file(input_file)

    # Run the TF-IDF transformation and save the results
    BioSet2Vec.compute(params)
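
For intuition about the TF-IDF step, the sketch below applies the textbook definition to k-mer counts, treating each sequence set as a "document" and each k-mer as a "term". It is a plain-Python illustration under that assumption. The exact weighting used by BioSet2Vec may differ, and the function name `tfidf` and the toy counts are made up for this example.

```python
import math
from collections import Counter

def tfidf(set_counts):
    """set_counts: {set_name: Counter of k-mer counts}.
    Returns {set_name: {kmer: tf-idf score}}."""
    n_sets = len(set_counts)
    # Document frequency: in how many sets does each k-mer occur?
    df = Counter()
    for counts in set_counts.values():
        df.update(counts.keys())
    scores = {}
    for name, counts in set_counts.items():
        total = sum(counts.values())
        scores[name] = {
            # tf = relative frequency in the set, idf = log(N / df)
            kmer: (c / total) * math.log(n_sets / df[kmer])
            for kmer, c in counts.items()
        }
    return scores

# Toy k-mer counts for two sequence sets (hypothetical data).
toy = {
    "set_A": Counter({"ACG": 3, "CGT": 1}),
    "set_B": Counter({"ACG": 2, "TTT": 2}),
}
scores = tfidf(toy)
print(scores["set_A"])
```

A k-mer that occurs in every set, such as ACG here, gets a score of zero, while k-mers specific to one set are weighted up.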

Other Examples

For a hands-on example of how to use BioSet2Vec, see notebooks/example_notebook.ipynb.

Dataset

The BioSet2Vec library was tested by using the following public data:

To test the package quickly, use the dataset in Synthetic Data and Results.zip.

Citation

If you use BioSet2Vec in your research or project, please cite the tool as follows:

@article{galluzzo2025bioset2vec,
  title={BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies},
  author={Galluzzo, Ylenia and Giancarlo, Raffaele and Rombo, Simona E and Utro, Filippo},
  journal={BMC Bioinformatics},
  volume={26},
  number={1},
  pages={264},
  year={2025},
  publisher={Springer}
}
