Data
This page provides an overview of the data used in PROBEst, organized into two main groups: Test Data and Database Data. Each group serves a specific purpose in the development, testing, and execution of the tool.
Test data is used to validate the functionality of PROBEst and ensure its accuracy and reliability. It consists of two main categories: General Test Data and Grid Search Test Data.
General Test Data
- Description: Small "toy" datasets designed for script testing and development.
- Purpose: Provide a lightweight dataset for quick testing and script overview.
- Location: Stored in the `data/test/general/` directory.
- Obligatory Files:
  - `test.fna`: A reference genome file for probe generation.
  - `fasta_base/true_base/`: A folder containing related genomes for universality checks.
  - `fasta_base/false_base_1` and `fasta_base/false_base_2`: Folders containing unrelated genomes for specificity checks.
For the general test, prepare the data with `bash scripts/generator/prep_db.sh`, or use `bash test_run_generator.sh`.
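Before running the preparation script, it can help to confirm that the obligatory files listed above are in place. The following is a minimal sketch, not part of PROBEst itself: the function name `check_general_test_data` is hypothetical, and the only paths it assumes are the ones named on this page.

```bash
# Hypothetical helper (not shipped with PROBEst): verify that the obligatory
# general test files and folders exist under the given root directory.
check_general_test_data() {
    root="${1:-data/test/general}"
    missing=0

    # Reference genome used for probe generation.
    if [ ! -f "$root/test.fna" ]; then
        echo "missing file: $root/test.fna" >&2
        missing=1
    fi

    # Related genomes (universality) and unrelated genomes (specificity).
    for dir in fasta_base/true_base fasta_base/false_base_1 fasta_base/false_base_2; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing directory: $root/$dir" >&2
            missing=1
        fi
    done

    return "$missing"
}
```

Called from the repository root with no arguments, the function checks `data/test/general/` and returns a non-zero status if anything is missing, so it can be chained in front of the preparation step, for example `check_general_test_data && bash scripts/generator/prep_db.sh`.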
Grid Search Test Data
- Description: A larger dataset used for grid search testing and parameter optimization.
- Purpose: Enable comprehensive testing of the tool's performance under various conditions.
- Location: Stored in the `data/test/grid/` directory.
- Download Script: The dataset can be downloaded using the `genome_download.sh` script, which relies on `ncbi-genome-download`.
- Dependencies: Install `ncbi-genome-download` using conda: `conda install -c bioconda ncbi-genome-download`
- Usage: The script downloads Borrelia and Rickettsia genomes from NCBI and organizes them into the appropriate directories. For the grid search test, prepare the data by running `bash data/test/grid/genome_download.sh`, then `bash scripts/generator/prep_db.sh`.
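Because the download script depends on `ncbi-genome-download`, a small preflight check can fail early with a clear message instead of partway through the download. The sketch below is an illustration only: `require_tool` and `prepare_grid_data` are hypothetical names, and the wrapper assumes it is run from the repository root so that the two script paths given above resolve.

```bash
# Hypothetical helper: succeed only if the named command is available on PATH.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found on PATH" >&2
        return 1
    fi
}

# Hypothetical wrapper: check the dependency, then run the two preparation
# steps in order, stopping at the first failure.
prepare_grid_data() {
    require_tool ncbi-genome-download || return 1
    bash data/test/grid/genome_download.sh &&
        bash scripts/generator/prep_db.sh
}
```

After sourcing these definitions, `prepare_grid_data` runs the full grid search preparation. If the dependency is missing, it returns before any download starts.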
Database data is used for probe generation and validation. It includes parsed probe databases and data extracted from scientific articles. The database data is divided into two main folders:
Open Databases (open/)
- Description: Contains parsed open databases such as ProbeDB and probebase.
- Purpose: Provide a foundation for universality and specificity checks during probe generation.
- Location: Stored in the `data/databases/open/` directory.
- Status: Under development.
Scientific Articles (articles/)
- Description: Contains data parsed from scientific articles using advanced text-mining techniques.
- Purpose: Extract valuable information about nucleotide probes and their testing from previous research.
- Location: Stored in the `data/databases/articles/` directory.
- Status: Under development.
- General Test Data: Located in the `data/test/general/` directory.
- Grid Search Test Data: Located in the `data/test/grid/` directory and downloaded using the provided script.
- Database Data: Located in the `data/databases/` directory, with subfolders for open databases (`open/`) and scientific articles (`articles/`).
For more details on how to use these datasets, refer to the README or the Contribution Guide.
If you have questions or need assistance, feel free to reach out via GitHub Issues.