Smutin Daniil edited this page Feb 19, 2025 · 1 revision

Data Description

This page provides an overview of the data used in PROBEst, organized into two main groups: Test Data and Database Data. Each group serves a specific purpose in the development, testing, and execution of the tool.


1. Test Data

Test data is used to validate the functionality of PROBEst and ensure its accuracy and reliability. It consists of two main categories: General Test Data and Grid Search Test Data.

a. General Test Data

  • Description: Small "toy" datasets designed for script testing and development.
  • Purpose: Provide a lightweight dataset for quick tests and for getting an overview of the scripts.
  • Location:
    • Stored in the data/test/general/ directory.
    • Required Files:
      • test.fna: A reference genome file for probe generation.
      • fasta_base/true_base/: A folder containing related genomes for universality checks.
      • fasta_base/false_base_1/ and fasta_base/false_base_2/: Folders containing unrelated genomes for specificity checks.
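
The layout listed above can be checked before a test run. The following is a minimal sketch and not part of PROBEst: the function name check_general_test_data is hypothetical, the paths are taken from the list above, and the default path assumes the command is run from the repository root.

```shell
# Hypothetical helper (not shipped with PROBEst): verifies that the
# required general test data files and folders are present.
check_general_test_data() {
    local root="${1:-data/test/general}"   # directory named on this page
    local missing=0
    local sub

    # Reference genome used for probe generation.
    if [ ! -f "$root/test.fna" ]; then
        echo "missing file: $root/test.fna" >&2
        missing=1
    fi

    # Related genomes (universality) and unrelated genomes (specificity).
    for sub in true_base false_base_1 false_base_2; do
        if [ ! -d "$root/fasta_base/$sub" ]; then
            echo "missing folder: $root/fasta_base/$sub" >&2
            missing=1
        fi
    done

    # Exit status 0 only when every required item was found.
    return "$missing"
}
```

A call such as check_general_test_data && echo "general test data looks complete" reports each missing item on standard error and prints the confirmation only when every item is present.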

For the general test, prepare the data with bash scripts/generator/prep_db.sh, or run bash test_run_generator.sh instead.

b. Grid Search Test Data

  • Description: A larger dataset used for grid search testing and parameter optimization.
  • Purpose: Enable comprehensive testing of the tool's performance under various conditions.
  • Location:
    • Stored in the data/test/grid/ directory.
    • Download Script:
      • The dataset can be downloaded using the genome_download.sh script, which relies on ncbi-genome-download.
      • Dependencies:
        • Install ncbi-genome-download using conda:
          conda install -c bioconda ncbi-genome-download
      • Usage:
        bash data/test/grid/genome_download.sh
        This script downloads Borrelia and Rickettsia genomes from NCBI and organizes them into the appropriate directories.

For the grid search test, prepare the data with bash scripts/generator/prep_db.sh.
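
Because genome_download.sh relies on ncbi-genome-download, it can help to confirm that the tool is on the PATH before starting the download. This is a minimal sketch: require_tool is a hypothetical helper name, not something provided by PROBEst, and the usage line assumes the repository root is the working directory.

```shell
# Hypothetical pre-flight check (not part of PROBEst): confirm that a
# command is available on PATH before running a script that needs it.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found; install it first" >&2
        return 1
    fi
}

# Usage, assuming the repository root is the working directory:
#   require_tool ncbi-genome-download && bash data/test/grid/genome_download.sh
```

Chaining with && means the download script starts only when the check succeeds.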

2. Database Data

Database data is used for probe generation and validation. It includes parsed probe databases and data extracted from scientific articles. The database data is divided into two main folders:

a. Open Databases

  • Description: Contains parsed open databases such as ProbeDB and probebase.
  • Purpose: Provide a foundation for universality and specificity checks during probe generation.
  • Location:
    • Stored in the data/databases/open/ directory.
    • Status: Under development.

b. Scientific Articles Data

  • Description: Contains data parsed from scientific articles using advanced text-mining techniques.
  • Purpose: Extract valuable information about nucleotide probes and their testing from previous research.
  • Location:
    • Stored in the data/databases/articles/ directory.
    • Status: Under development.

Accessing Data

  • General Test Data: Located in the data/test/general/ directory.
  • Grid Search Test Data: Located in the data/test/grid/ directory and downloaded using the provided script.
  • Database Data: Located in the data/databases/ directory, with subfolders for open databases (open/) and scientific articles (articles/).
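
To see at a glance which of these directories are populated, a small summary function can help. This is a sketch under assumptions: data_summary is a hypothetical name, the directory names are those listed on this page, and the default root assumes the repository root is the working directory.

```shell
# Hypothetical helper (not part of PROBEst): print how many regular files
# each data directory currently holds, or "missing" if it does not exist.
data_summary() {
    local root="${1:-data}"
    local dir count
    for dir in test/general test/grid databases/open databases/articles; do
        if [ -d "$root/$dir" ]; then
            # Count regular files recursively; strip the padding some wc builds add.
            count="$(find "$root/$dir" -type f | wc -l | tr -d ' ')"
            echo "$dir: $count file(s)"
        else
            echo "$dir: missing"
        fi
    done
}
```

Running data_summary from the repository root prints one line per directory. The database folders are marked as under development, so they may be reported as empty or missing.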

For more details on how to use these datasets, refer to the README or the Contribution Guide.


If you have questions or need assistance, reach out via GitHub Issues.
