Modular framework for creating a DB, ingesting, enriching, and exploring clinical variant data. Built for scalable tertiary analysis and interpretation. From raw data to searchable insights.

POs Database Tools

Docker · Python · Shiny · MariaDB

The POs Database Tools project provides a modular framework for creating and managing the POs Database - an adaptable system built to ingest, normalize, and organize tertiary analysis data from diverse sources.

Its core functionality processes raw patient genomic data, extracts variant and clinical metadata, and enriches records through external APIs, ultimately building a comprehensive database to support tertiary analysis.
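The enrichment step talks to external services such as the NCBI E-utilities. As a minimal sketch of that interaction — the helper name and the query term are hypothetical, and the project's actual enrichment code is not shown here — a request URL for a ClinVar lookup can be assembled like this:

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(db, term, api_key=None):
    """Build an NCBI E-utilities esearch URL.

    Hypothetical helper for illustration; the project's real
    enrichment logic may differ.
    """
    params = {"db": db, "term": term, "retmode": "json"}
    if api_key:
        # An API key raises NCBI's rate limit from 3 to 10 requests/s
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = build_esearch_url("clinvar", "NM_000059.4:c.68-7T>A")
```

The same pattern extends to efetch for retrieving full ClinVar records once identifiers are known.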

A lightweight web application complements the backend, offering an interactive interface for querying and exploring the data.

For more details, please refer to the Documentation.ipynb file.

System Requirements

  • Python 3.12.3
  • Docker and Docker Compose (for containerized setup)
  • MySQL/MariaDB (if running without Docker)

Quick Start (with Docker)

Install Docker Desktop for Mac or Windows, which includes Docker Compose. On Linux, make sure you have the latest version of Compose.

1. Clone the repository:

git clone https://github.com/im175pinheiro/POsDBtools.git

cd POsDBtools

2. Create the .env and credentials files from the examples:

cp .env.example .env

cp pos_database_tools/credentials/db_config.yaml.example pos_database_tools/credentials/db_config.yaml

cp pos_database_tools/credentials/ncbi_credentials.yaml.example pos_database_tools/credentials/ncbi_credentials.yaml

Edit these files with your own database credentials if desired, and add an NCBI API key if you have one.

Note: .env credentials must match db_config.yaml.
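For example — the key names below are illustrative only; use the exact keys from .env.example and db_config.yaml.example — the two files must agree on the same user, password, and database:

```yaml
# .env (hypothetical key names -- copy the real ones from .env.example)
#   MYSQL_USER=pos_user
#   MYSQL_PASSWORD=change_me
#   MYSQL_DATABASE=pos_db

# db_config.yaml must then use the matching values:
host: db            # the database service name in docker-compose
user: pos_user
password: change_me
database: pos_db
```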

3. Build and Start the containers:

docker-compose up --build

Note: The Quick Start setup is configured to ingest only the first 20 entries from the data source to ensure rapid deployment and demonstration.

4. Access the web application in your browser (the address and port are printed in the container logs).

To stop the services:

docker-compose down

Overview

Database

The database consists of three core tables plus one or more platform-specific tables.

  • Probands - Analysis and proband metadata, as well as file provenance
  • Variants - Unique variant information, identified by vcf_shorthand
  • Main - Variant entries per analysis
  • Platform - Platform-specific attributes
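The relationships between the core tables can be sketched with a minimal in-memory SQLite stand-in. This is an illustration only: the real database is MariaDB, and every column beyond those named above (plus the sample values) is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Probands: analysis/proband metadata and file provenance
    CREATE TABLE probands (proband_id INTEGER PRIMARY KEY, name TEXT, source_file TEXT);
    -- Variants: unique variants, keyed by vcf_shorthand
    CREATE TABLE variants (vcf_shorthand TEXT PRIMARY KEY, gene TEXT);
    -- Main: one variant entry per analysis, linking the two tables
    CREATE TABLE main (
        entry_id INTEGER PRIMARY KEY,
        proband_id INTEGER REFERENCES probands(proband_id),
        vcf_shorthand TEXT REFERENCES variants(vcf_shorthand)
    );
""")
con.execute("INSERT INTO probands VALUES (1, 'P001', 'run42.xlsx')")
con.execute("INSERT INTO variants VALUES ('1-12345-A-G', 'BRCA2')")
con.execute("INSERT INTO main VALUES (1, 1, '1-12345-A-G')")

# Join Main back to its proband and variant, as the UI queries do
rows = con.execute("""
    SELECT p.name, v.gene FROM main m
    JOIN probands p ON p.proband_id = m.proband_id
    JOIN variants v ON v.vcf_shorthand = m.vcf_shorthand
""").fetchall()
```

Keying Variants on vcf_shorthand lets many Main entries across different probands reference a single variant record.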

Database Schema

User Interface

The application provides several key features that directly support variant interpretation workflows:

  • Search modes: users can query the database either by Proband or Genomic Position, selecting the preferred mode via radio buttons;
  • Dynamic Filters (in Proband search mode): a case filter appears, allowing users to refine results;
  • Results tables: query results are displayed in interactive tables, with each row offering a View button to access detailed variant information;
  • Detailed Information Panels:
    • Variant Information card provides a consistent overview of the variant, including key identifiers and annotations;
    • Interpretation card displays the current assessment of the variant;
    • Related Cases card (in Proband search mode) shows a table listing other probands/cases where the same variant was observed;
    • Case Information card (in Genomic Position search mode) shows metadata about the proband and case associated with the variant.

User Interface

Usage (for devs)

Docker Setup

git clone https://github.com/im175pinheiro/POsDBtools.git

cd POsDBtools
cp .env.example .env

cp pos_database_tools/credentials/db_config.yaml.example pos_database_tools/credentials/db_config.yaml

cp pos_database_tools/credentials/ncbi_credentials.yaml.example pos_database_tools/credentials/ncbi_credentials.yaml

Edit the copied files with your personal credentials.

  • If you are running the database inside Docker, make sure that the connection parameters in credentials/db_config.yaml match the database credentials defined in your .env file.
  • If you are using a local MariaDB installation, the .env file can be ignored, and only db_config.yaml needs to be updated.

Important Note: Set RUN_FULL_SETUP=0 in .env before building, as you will be running scripts directly on your machine.

docker-compose up --build

Dependencies

pip install -r requirements.txt

1. Create the Database Schema

Run the schema creation script to initialize the database structure.

python -m pos_database_tools.create_database_schema

2. Run the Data Ingestion Pipeline

Process data and populate all tables.

python -m pos_database_tools.data_ingestion_pipeline --basedir /path/to/directory/containing/data --limit nritems
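The pipeline's command-line surface can be mirrored with a small argparse sketch. The flag names match the invocation above; the help texts and defaults are assumptions, and the actual ingestion logic is not shown.

```python
import argparse

def build_parser():
    # Mirrors the pipeline's CLI flags; only parsing is sketched here
    p = argparse.ArgumentParser(prog="data_ingestion_pipeline")
    p.add_argument("--basedir", required=True,
                   help="Directory containing the raw data files")
    p.add_argument("--limit", type=int, default=None,
                   help="Ingest at most this many entries (e.g. 20 in the Quick Start)")
    return p

args = build_parser().parse_args(["--basedir", "/data/runs", "--limit", "20"])
```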

3. Launch Web Application

Launch the interface for exploring the data.

cd pos_database_tools/gui_shiny 
DB_CONFIG='../credentials/db_config.yaml' shiny run --reload app.py

Integrating new platforms or data sources

Developers aiming to adapt the tool to new data sources can refer to the Platform_Integration_Guide.ipynb, which outlines the integration process step by step.

Troubleshooting and Validation

Diagnosing Ingestion Problems

A dedicated script, troubleshooting_pmid.py, helps diagnose and debug common issues with data ingestion in this project.

This script is meant to be a flexible troubleshooting tool. If new problems arise with the pmid platform, update the script to include additional checks and logic.

For a more detailed discussion of the troubleshooting results, refer to the Jupyter notebook Documentation.ipynb.

Usage

Run the script from the command line:

python pos_database_tools/tools/troubleshooting_pmid.py --log logfile_path/to/inspect --excel path/to/excel --problem problemnr

Detailed inspection of ClinVar XML result

The script inspect_clinvar_xml.py performs a more thorough analysis of the results of the ClinVar API call for a specific variant of a proband. It writes a text file containing a pretty-printed version of the full XML result.

Usage

Run the script from the command line:

python pos_database_tools/tools/inspect_clinvar_xml.py --proband proband_name --variant variantnumber --excel path/to/excel
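The pretty-printing step can be approximated with the standard library. This is a sketch only: the real script's parsing, arguments, and output file naming are not shown, and the sample XML is invented.

```python
from xml.dom import minidom

def pretty_xml(raw: str) -> str:
    # Re-indent an XML string, as a stand-in for the script's pretty print
    return minidom.parseString(raw).toprettyxml(indent="  ")

sample = "<ClinVarResult><Variant id='1'><Gene>BRCA2</Gene></Variant></ClinVarResult>"
text = pretty_xml(sample)
```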



Acknowledgments

This work is part of the Master's thesis titled 'Design and Implementation of a Database-Driven Software for Genetic Variant Interpretation', completed within the Master in Clinical Bioinformatics (Genome specialization) at the University of Aveiro.

Developed by: Inês Pinheiro, during the MSc Internship at Unilabs Genetics, Portugal.

Contributors: Alberto Pessoa, Unilabs Genetics Team

Supervised by: Alberto Pessoa, Unilabs Genetics Team


License

This project is licensed under the terms of the MIT License.

Contact

Developer: Inês Pinheiro, MSc in Clinical Bioinformatics, University of Aveiro
Email: ines.pinheiro@ua.pt
