Skip to content

prishagarg/Health_Atlas

 
 

Repository files navigation

🩺 Health Atlas: Autonomous Provider Data Validation Service

An intelligent, full-stack AI system that autonomously verifies, corrects, and enriches healthcare provider data from diverse sources.

FeaturesTech StackGetting StartedArchitectureAPI Documentation


🎯 The Problem

Healthcare organizations struggle with one of the industry's most persistent challenges: inaccurate and outdated provider data. Manual validation is time-consuming, error-prone, and doesn't scale. Incorrect provider information leads to:

  • ❌ Patient care disruptions
  • ❌ Revenue loss from denied claims
  • ❌ Regulatory compliance issues
  • ❌ Poor member experience

💡 The Solution

Health Atlas leverages a multi-agent AI system that autonomously validates healthcare provider data at scale, transforming weeks of manual work into minutes of intelligent automation.


✨ Key Features

🚀 Real-Time Bulk Validation

  • Upload CSV files containing provider data and watch the system validate each record in parallel
  • Stream results back to the UI in real-time with live progress tracking
  • Process hundreds of records simultaneously using async architecture

👁️ Vision Language Model (VLM) Ready

  • Architected specifically for VLM integration to extract structured data from unstructured documents
  • Handle scanned PDFs, image-based documents, and handwritten forms
  • Process documents that traditional text parsers cannot read
  • Ready to integrate: Gemini Vision API, GPT-4 Vision, or Claude 3 Vision

🎯 Intelligent Prioritization System

  • Priority Score Algorithm: Combines data accuracy (Confidence Score) with business impact (Member Impact)
  • Automatically flag high-risk records for manual review
  • Focus your team's efforts on the most critical data quality issues first

📊 Actionable Reporting & Dashboards

  • Run Summary Dashboard: At-a-glance metrics for every validation job
    • Total records processed
    • Auto-validated vs. flagged records
    • Breakdown of common error types
    • Confidence score distribution
  • Professional PDF Reports: Export clean, shareable reports for stakeholders
  • Email Generation: Auto-generate follow-up emails for flagged providers

🤖 Multi-Agent AI Engine

A deterministic AI pipeline where specialized agents collaborate:

Agent Role Capabilities
🧠 Data Validation Agent Baseline Verification Cross-checks provider info against official NPI registry, validates physical addresses, verifies credentials
🌐 Information Enrichment Agent Data Enhancement Web scraping for missing data, contact information discovery, specialty validation
🔍 Quality Assurance Agent Integrity Checks Flags inconsistencies, detects mock/fake licenses, calculates reliability scores
🗂️ Directory Management Agent Data Synthesis Standardizes formats, resolves conflicts, generates final validated profiles

🚀 Tech Stack

Category Technologies
AI Backend Python 3.10+, FastAPI, LangGraph, Groq API
Frontend React 18, Vite, Tailwind CSS, jsPDF, React Query
AI/ML LangChain, LangGraph, Vision API Integration Layer
Data Processing Pandas, AsyncIO, PyPDF2
Web Automation Selenium WebDriver
APIs & Services Geoapify (Geocoding), NPI Registry API
Development Tools Faker (test data generation), ESLint, Prettier

🧩 Getting Started

Prerequisites

Before you begin, ensure you have the following installed:

⚙️ 1. Clone the Repository

git clone https://github.com/Rupali_2507/Health_Atlas
cd Health_Atlas

🖥️ 2. Backend Setup

# Navigate to backend directory
cd backend

# Create and activate virtual environment
python -m venv .venv

# On Windows
.\.venv\Scripts\activate

# On macOS/Linux
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configure Environment Variables:

Create a .env file in the backend directory:

# Required API Keys
GROQ_API_KEY="your-groq-api-key-here"
GEOAPIFY_API_KEY="your-geoapify-api-key-here"

# Optional: VLM Integration (Uncomment when ready)
# GOOGLE_API_KEY="your-google-api-key"
# OPENAI_API_KEY="your-openai-api-key"

# Server Configuration
HOST="127.0.0.1"
PORT=8000

Start the Backend Server:

uvicorn main:app --reload

✅ Backend running at: http://127.0.0.1:8000
📚 API Documentation: http://127.0.0.1:8000/docs

💻 3. Frontend Setup

Open a new terminal window:

# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Start development server
npm run dev

✅ Frontend running at: http://localhost:5173

🎉 4. Access the Application

Open your browser and navigate to:


🔬 Backend Deep Dive: Dual-Flow AI Architecture

Health Atlas operates on a sophisticated dual-flow architecture, demonstrating versatility in handling different business processes.

Flow 1: AI Validation Pipeline (Core)

CSV Upload → Parallel Processing → Multi-Agent Analysis → Real-Time Streaming → Summary Report

Key Components:

  1. High-Throughput Async Processing

    • FastAPI backend uses asyncio for concurrent record processing
    • Configurable batch sizes for optimal performance
    • Handles large datasets (10,000+ records) efficiently
  2. Live Streaming Architecture

    • Server-Sent Events (SSE) push results to frontend
    • Real-time progress tracking and log visualization
    • No polling required - true push-based updates
  3. Comprehensive Analysis Pipeline

    • NPI registry cross-validation
    • Address geocoding and verification
    • Website scraping for data enrichment
    • Confidence scoring and flagging logic
  4. Actionable Outputs

    • Downloadable PDF summary reports
    • Prioritized review queue
    • Auto-generated follow-up emails for flagged providers

Flow 2: VLM Document Processing (Future-Ready)

PDF Upload → VLM Analysis → Structured Extraction → Data Validation → Profile Creation

Currently Implemented:

  • PDF text extraction using PyPDF2
  • Structured data parsing
  • Ready-to-integrate VLM API layer

VLM Integration (Ready to Enable):

# Example: Gemini Vision Integration
def analyze_provider_document_vlm(file_path: str) -> dict:
    """
    Extract structured provider data from any document type using VLM.
    Handles: scanned PDFs, images, handwritten forms, etc.
    """
    file = genai.upload_file(path=file_path)
    
    prompt = """
    Extract the following provider information:
    - Full Name
    - NPI Number
    - Specialties
    - Address (Street, City, State, ZIP)
    - Phone and Fax
    - License Numbers
    - Accepting New Patients status
    """
    
    response = model.generate_content([file, prompt])
    return parse_structured_response(response.text)

🧰 The Agent's Toolkit

Health Atlas agents are powered by specialized tools that handle distinct validation tasks.

Function Description Technology
search_npi_registry() 🔎 Connects to official NPI database for baseline verification NPI Registry API
parse_provider_pdf() 📄 Extracts text from provider documents with broad PDF compatibility PyPDF2
parse_provider_pdf_vlm() 👁️ VLM-powered extraction from scanned/image-based documents Gemini Vision API
scrape_provider_website() 🌐 Dynamically scrapes provider websites for enrichment Selenium WebDriver
validate_address() 🗺️ Confirms address accuracy with geographic confidence scoring Geoapify API
calculate_priority_score() 📊 Computes priority based on confidence × member impact Custom Algorithm
generate_follow_up_email() ✉️ Creates professional email templates for flagged records LangChain + Groq

📊 System Architecture

┌─────────────────┐
│   React UI      │  ← User uploads CSV
│   (Frontend)    │  ← Real-time results streaming
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  FastAPI Server │  ← Async job orchestration
│   (Backend)     │  ← Multi-agent coordination
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌─────────┐ ┌──────────┐
│ LangGraph│ │  Tools   │
│  Agents  │ │  Layer   │
└─────────┘ └──────────┘
    │            │
    └─────┬──────┘
          ▼
    ┌──────────────┐
    │ External APIs│
    │ - NPI Reg    │
    │ - Geoapify   │
    │ - Web Scraper│
    └──────────────┘

📈 Performance & Scalability

  • Processing Speed: 100+ records/minute with parallel execution
  • 📦 Batch Processing: Configurable batch sizes for memory optimization
  • 🔄 Async Architecture: Non-blocking I/O for maximum throughput
  • 📊 Scalability: Horizontal scaling ready with minimal configuration

🛣️ Roadmap

Phase 1: Core Validation (✅ Complete)

  • Multi-agent AI pipeline
  • NPI registry integration
  • Address validation
  • Real-time streaming UI
  • PDF reporting

Phase 2: VLM Integration (🚧 In Progress)

  • Gemini Vision API integration
  • Scanned document processing
  • Handwriting recognition
  • Image-based PDF parsing

Phase 3: Advanced Features (📋 Planned)

  • Historical data tracking
  • Automated re-validation scheduling
  • Machine learning-based anomaly detection
  • Multi-tenant architecture
  • API rate limiting and caching
  • Advanced analytics dashboard

Phase 4: Enterprise Ready (🔮 Future)

  • SSO/SAML authentication
  • Role-based access control
  • Audit logging
  • SOC 2 compliance
  • HIPAA compliance features
  • Microservices architecture

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


Final KPI Summary

The system successfully processed all valid records and produced the following final summary:

Analysis of Results vs. Goals

  • KPI Goal Result Status
  • Validation Accuracy 80%+ 88.89% ✅ GOAL ACHIEVED
  • Processing Speed < 300 sec ~732 sec PARTIALLY ACHIEVED*
  • Processing Throughput 500+/hr 517 providers/hr ✅ GOAL ACHIEVED

Note on Processing Speed: The 5-minute target was missed as a deliberate engineering trade-off for the demo. To guarantee a stable run without hitting API rate limits on the free tier, the number of parallel workers was set to 1. The throughput of 517 providers/hour proves the architecture is highly efficient and would easily beat the speed target with a production-level API key.

🙏 Acknowledgments

  • LangChain & LangGraph for the agent orchestration framework
  • Groq for high-speed LLM inference
  • FastAPI for the excellent async web framework
  • React Community for the robust frontend ecosystem

🧭 Vision

Health Atlas represents a step toward self-healing data ecosystems — systems that not only detect but autonomously repair data drift in critical infrastructures like healthcare.

This foundation can scale toward enterprise-grade deployments where data reliability becomes an autonomous service, reducing operational overhead and improving patient outcomes across the healthcare industry.


About

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • JavaScript 57.5%
  • Python 26.4%
  • Java 15.5%
  • Other 0.6%