BioSet2Vec

BioSet2Vec is a tool designed to extract k-mer dictionaries from multiple sets of biological sequences using distributed computing. This method is efficient for large-scale biological sequence analysis, enabling users to handle diverse sequence sets, such as DNA sequences, and extract k-mer representations in a distributed fashion. The extracted k-mer dictionaries can be used for downstream tasks like sequence comparison, feature extraction, and machine learning.

General Overview

Features

  • Distributed k-mer extraction: Handles multiple biological sequence sets.
  • Scalability: Optimized for large datasets using distributed computation.
  • Flexible k-mer size: Adjustable k-mer lengths to suit the user’s specific needs.
  • Support for various biological sequences: Compatible with DNA sequences.
  • Efficient I/O handling: Works with large files in FASTA, FASTQ, or other common biological sequence formats.
  • Easy setup with JSON configuration: Use an input.json file to manage configuration parameters.
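
To illustrate what a k-mer dictionary contains, here is a minimal single-machine sketch in plain Python. It is not BioSet2Vec's distributed implementation; `parse_fasta` and `kmer_dictionary` are hypothetical helper names introduced only for this example, and the input is a tiny in-memory FASTA string.

```python
from collections import Counter


def parse_fasta(text):
    """Return the sequences in a FASTA-formatted string, one per record."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def kmer_dictionary(sequences, k_min, k_max):
    """Count every k-mer with k_min <= k <= k_max across the sequences."""
    counts = Counter()
    for seq in sequences:
        for k in range(k_min, k_max + 1):
            # Slide a window of length k over the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


fasta = ">seq1\nACGT\nAC\n>seq2\nACG\n"
seqs = parse_fasta(fasta)
kmers = kmer_dictionary(seqs, 3, 4)
print(seqs)
print(dict(kmers))
```

The same counting logic is what a distributed engine would spread across workers; the sketch only shows the per-sequence step.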

Requirements

Operating system(s): Platform independent

Installation

  1. Install and set up Java:

    Install Liberica JDK 1.8.0_422 by following the official Liberica documentation at https://bell-sw.com/pages/downloads/#jdk-8-lts. Once the installation is complete, point your environment at the new Java installation. The commands below use /usr/libexec/java_home, which is available on macOS only, and assume zsh as the shell; on other systems, set JAVA_HOME to the JDK installation directory by the equivalent means.

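        # List the installed JDKs
        /usr/libexec/java_home -V
        # Set JAVA_HOME for the current shell session
        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_422-b06)
        # To make the setting permanent, open ~/.zshrc and add the export line below to it
        nano ~/.zshrc
        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_422-b06)

    In nano, press CTRL+X to exit and Y to save your changes. Then reload the shell configuration and check the result: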

        source ~/.zshrc
        echo $JAVA_HOME
        java -version
 
  2. Set up Apache Spark version 3.5.1:

    Follow the official guide to set up Apache Spark in your environment.

  3. Clone the repository:

    git clone https://github.com/Ylenia1211/BioSet2Vec.git
    cd BioSet2Vec
  4. Place bioft.jar in the working directory (rename the file to "bioft.jar" if it has a different name):

    bioft.jar is required to run the job. Download it and copy it into your project directory:

    cp /path/to/bioft.jar ./bioset2vec/bioft.jar
  5. Create and activate a virtual environment in the BioSet2Vec directory:

    python3.8 -m venv venv_bioset2vec
    # Activate on Linux/macOS
    source venv_bioset2vec/bin/activate
    # Or activate on Windows
    venv_bioset2vec\Scripts\activate
  6. Install the BioSet2Vec library (with its dependencies). First upgrade pip:

        pip install --upgrade pip

    Build the distribution. From the directory that contains setup.py (where bioset2vec is located), run:

       pip install wheel
       python setup.py sdist bdist_wheel

    Once the distributions are created, you can install the package directly from the dist/ directory:

    pip install dist/bioset2vec-0.1.0-py3-none-any.whl

    Alternatively, from the main folder (same level as setup.py), install the package locally in editable mode:

     pip install -e .

    or force a reinstall of the built wheel:

     pip install bioset2vec/dist/bioset2vec-0.1.0-py3-none-any.whl --force-reinstall
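
After completing the steps above, a quick preflight check can catch a missing JDK or jar before a job is launched. The following is a minimal sketch, not part of BioSet2Vec: `preflight` is a hypothetical helper, and the jar location `./bioset2vec/bioft.jar` is taken from the copy command in step 4.

```python
import os
import shutil


def preflight(env, jar_path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME does not point to a directory: {java_home}")
    # Look for a 'java' executable on the PATH given in env.
    if shutil.which("java", path=env.get("PATH")) is None:
        problems.append("no 'java' executable found on PATH")
    if not os.path.isfile(jar_path):
        problems.append(f"jar not found: {jar_path}")
    return problems


for problem in preflight(os.environ, "./bioset2vec/bioft.jar"):
    print("WARNING:", problem)
```

The function takes the environment as an argument rather than reading `os.environ` directly, which makes it easy to exercise with a fake environment.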

Configuration

Configure BioSet2Vec by creating an input.json file with the following parameters:

    {
        "path_jar": "./bioset2vec/bioft.jar",
        "folder_path": "./set_synthetic/",
        "k_min": 3,
        "k_max": 4,
        "n": 0, 
        "n_core": 6,
        "ram": 6,
        "offHeap_size": 2
    }
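
Since a typo in input.json would otherwise only surface once the job starts, it can help to sanity-check the file first. The sketch below is an illustration under stated assumptions: `validate_config` is a hypothetical helper, and the constraints it enforces (k_min no greater than k_max, non-negative n and offHeap_size, positive n_core and ram) are plausible guesses rather than rules documented by BioSet2Vec.

```python
import json

REQUIRED_KEYS = {"path_jar", "folder_path", "k_min", "k_max",
                 "n", "n_core", "ram", "offHeap_size"}


def validate_config(cfg):
    """Raise ValueError if the configuration is incomplete or inconsistent."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumed constraint: k-mer lengths form a non-empty range starting at 1 or more.
    if not 1 <= cfg["k_min"] <= cfg["k_max"]:
        raise ValueError("expected 1 <= k_min <= k_max")
    for key in ("n", "offHeap_size"):
        if cfg[key] < 0:
            raise ValueError(f"{key} must be non-negative")
    for key in ("n_core", "ram"):
        if cfg[key] <= 0:
            raise ValueError(f"{key} must be positive")
    return cfg


config = validate_config(json.loads("""
{
    "path_jar": "./bioset2vec/bioft.jar",
    "folder_path": "./set_synthetic/",
    "k_min": 3,
    "k_max": 4,
    "n": 0,
    "n_core": 6,
    "ram": 6,
    "offHeap_size": 2
}
"""))
print(config["k_min"], config["k_max"])
```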

Parameter Details

path_jar: Path to the bioft.jar file.

folder_path: Path to the folder containing sequence sets.

k_min and k_max: Minimum and maximum k-mer lengths.

n: Number of Monte Carlo test iterations to perform. Set "n": 0 to use the package only to compute a simple TF-IDF on your input sets.

n_core: Number of CPU cores to use.

ram: RAM in GB for Spark.

offHeap_size: Off-heap memory size in GB.
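
To make the resource parameters more concrete, the sketch below shows one way they could translate into standard Spark properties for a local run. The property names are real Spark configuration keys, but the mapping itself is an assumption made for illustration; how BioSet2Vec actually configures Spark internally is not stated here, and `spark_settings` is a hypothetical helper.

```python
def spark_settings(params):
    """Map input.json resource parameters to Spark properties (assumed mapping)."""
    return {
        # Run locally with n_core worker threads.
        "spark.master": f"local[{params['n_core']}]",
        # RAM in GB for the driver JVM.
        "spark.driver.memory": f"{params['ram']}g",
        # Off-heap memory must be enabled explicitly and given a size.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{params['offHeap_size']}g",
        # Ship the jar with the job.
        "spark.jars": params["path_jar"],
    }


settings = spark_settings({
    "path_jar": "./bioset2vec/bioft.jar",
    "n_core": 6,
    "ram": 6,
    "offHeap_size": 2,
})
for key, value in settings.items():
    print(f"{key}={value}")
```

In a PySpark program, pairs like these would typically be passed to `SparkSession.builder.config(key, value)` before the session is created.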

Usage

You can use the functionality directly in your Python scripts:

from bioSet2Vec import BioSet2Vec

# Configure BioSet2Vec 
input_file = "input.json"
params = BioSet2Vec.read_parameters_from_file(input_file)

# Run the TF-IDF transformation and save the results
BioSet2Vec.compute(params)
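
For intuition about what a TF-IDF over k-mer dictionaries measures, here is a small plain-Python sketch that treats each sequence set as a document and each k-mer as a term. It uses the textbook weighting (relative frequency times the log of inverse set frequency), which is an assumption: the exact formula and output format used by BioSet2Vec may differ, and `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter


def tf_idf(set_counts):
    """set_counts maps a set name to a Counter of k-mer occurrences."""
    n_sets = len(set_counts)
    # Set frequency: in how many sets does each k-mer appear?
    df = Counter()
    for counts in set_counts.values():
        df.update(counts.keys())
    scores = {}
    for name, counts in set_counts.items():
        total = sum(counts.values())
        scores[name] = {
            kmer: (c / total) * math.log(n_sets / df[kmer])
            for kmer, c in counts.items()
        }
    return scores


sets = {
    "set_A": Counter({"ACG": 3, "CGT": 1}),
    "set_B": Counter({"ACG": 2, "GTA": 2}),
}
scores = tf_idf(sets)
for name, kmer_scores in scores.items():
    print(name, kmer_scores)
```

A k-mer that occurs in every set (here ACG) scores zero, while k-mers specific to one set receive positive weight, which is what makes the scores useful for comparing sets.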

Other Examples

For a hands-on example of how to use BioSet2Vec, see notebooks/example_notebook.ipynb.

Dataset

The BioSet2Vec library was tested by using the following public data:

To quickly test the package, use the dataset Synthetic Data and Results.zip.

Citation

If you use BioSet2Vec in your research or project, please cite the tool as follows:

@article{galluzzo2025bioset2vec,
  title={BioSet2Vec: extraction of k-mer dictionaries from multiple sets of biological sequences via big data technologies},
  author={Galluzzo, Ylenia and Giancarlo, Raffaele and Rombo, Simona E and Utro, Filippo},
  journal={BMC Bioinformatics},
  volume={26},
  number={1},
  pages={264},
  year={2025},
  publisher={Springer}
}
