Your Personal Health Companion for Proactive Heart Disease Prevention
- Overview
- Executive Summary
- Key Features
- Technical Architecture
- Machine Learning Model
- Project Structure
- Installation & Setup
- Usage
- Model Performance
- API Documentation
- Dataset Specifications
- Research & Clinical Validation
- Roadmap
- Contributing
- License
ArogyaSaathi is an intelligent, end-to-end cardiovascular health prediction platform that bridges the critical gap between medical knowledge and personal actionable insight.
- 80% of premature heart disease is preventable, yet remains largely undetected until it's too late
- Healthcare systems are reactive, not proactiveβtreating events after they occur
- Risk assessment is a black box for the average personβexpensive, time-consuming, and inaccessible
- 18 million deaths annually from cardiovascular disease globally; 2 million every 2 minutes in India alone
ArogyaSaathi deploys a clinically-validated, AI-powered risk prediction engine that empowers users to understand their personal cardiovascular risk in seconds, with transparency, accessibility, and confidence.
ArogyaSaathi is a comprehensive digital health platform composed of two integrated pillars:
-
Educational Hub (Arogya-Aware): A rich, medically-vetted knowledge repository covering symptoms, risk factors, prevention strategies, and diagnosis guidance.
-
Predictive Intelligence (Arogya-Predict): A clinically-backed, AI-driven risk assessment engine that predicts cardiovascular disease probability based on 13 clinical metrics.
β
Production-Grade ML Pipeline: Random Forest Classifier (100 estimators) trained on 1000+ clinical records
β
Validated Performance: ROC-AUC > 0.85, demonstrating exceptional diagnostic discrimination
β
Transparent AI (XAI): Explainable feature importanceβevery prediction is interpretable
β
Scalable Architecture: Decoupled frontend/backend for enterprise deployment
β
Real-Time Inference: Sub-100ms prediction latency
β
Data Robustness: Intelligent missing value imputation using statistical methods
| Feature | Capability | Benefit |
|---|---|---|
| Live Risk Assessment | Enter 13 health metrics β Get instant risk probability | Know your status in seconds |
| Visual Risk Gauge | Animated dial displaying risk level (0-100%) | Intuitive understanding without medical background |
| Educational Content | Comprehensive pages on symptoms, risk factors, prevention | Become informed about CVD |
| Transparent Scoring | Know which factors contributed to your score | Empower lifestyle decisions |
| HIPAA-Compliant Privacy | No data stored; all processing client-side where possible | Your health data remains private |
| Component | Purpose | Integration Points |
|---|---|---|
| Predictive API | Licensable ML model as REST endpoint | Telemedicine, insurers, corporate wellness |
| Clinical Integration | EMR-compatible data format | Hospital information systems |
| Batch Processing | Process patient cohorts for risk screening | Large-scale public health programs |
| Webhook Notifications | Alert high-risk patients to specialist care | Clinical workflows |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FRONTEND LAYER (HTML/CSS/JS) β
β ββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββ β
β β Index Page β Predict Page β Education β β
β β (Landing) β (Risk Calc) β (Knowledge) β β
β ββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββ β
β β (Form Submission via Fetch API) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β COMMUNICATION LAYER (HTTP/JSON) β
β Flask + Flask-CORS (Port 5000) β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β API Endpoints: β β
β β β’ GET / (Health Check) β β
β β β’ POST /predict (Risk Prediction) β β
β β β’ POST /batch (Batch Processing) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β (JSON Request/Response) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β BACKEND LAYER (Python/ML) β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Data Preprocessing Pipeline β β
β β β’ Input Validation β β
β β β’ Missing Value Imputation (Mean/Mode) β β
β β β’ Feature Normalization β β
β β β’ Outlier Detection β β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Machine Learning Model (Random Forest) β β
β β β’ 100 Decision Trees β β
β β β’ Probability Threshold: 0.40 (40%) β β
β β β’ ROC-AUC: > 0.85 β β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β (Returns Risk Probability) β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Output Layer β β
β β β’ Prediction (0 = Low Risk, 1 = High Risk) β β
β β β’ Probability Score (0.0 - 1.0) β β
β β β’ Confidence Interval β β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β β (JSON Response) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Frontend:
- HTML5 (semantic markup)
- CSS3 (responsive design, animations)
- Vanilla JavaScript (real-time form handling, async API calls)
- Chart.js (animated risk gauge visualization)
Backend:
- Python 3.10+
- Flask (lightweight, production-ready web framework)
- Flask-CORS (cross-origin resource sharing for frontend compatibility)
Machine Learning:
- scikit-learn (Random Forest Classifier, preprocessing)
- pandas (data manipulation, CSV handling)
- numpy (numerical computations)
- joblib (model serialization/deserialization)
Data:
- CSV-based dataset storage
- Merged clinical records from multiple sources
- Statistical imputation for robustness
Configuration:
- Estimators: 100 decision trees
- Max Depth: Optimized for generalization
- Min Samples Split: 5
- Class Weight: Balanced (to handle class imbalance)
- Random State: 42 (reproducibility)
- Threshold: 0.40 (40%)
| Metric | Value |
|---|---|
| Total Samples | 1,000+ patient records |
| Training Set | 80% (800 samples) |
| Testing Set | 20% (200 samples) |
| Positive Class (Disease) | ~45-50% |
| Negative Class (No Disease) | ~50-55% |
| Feature Count | 13 clinical metrics |
| # | Feature Name | Data Type | Range | Clinical Meaning |
|---|---|---|---|---|
| 1 | Age | Integer | 29-77 years | Patient age |
| 2 | Sex | Binary (0/1) | Male (1) / Female (0) | Biological sex |
| 3 | Chest Pain Type (CP) | Categorical (0-3) | Typical (0), Atypical (1), Non-anginal (2), Asymptomatic (3) | Type of chest pain experienced |
| 4 | Resting BP | Integer | 94-200 mmHg | Blood pressure at rest |
| 5 | Cholesterol | Integer | 126-564 mg/dL | Serum cholesterol level |
| 6 | Fasting Blood Sugar (FBS) | Binary (0/1) | <120 (0) / β₯120 (1) | Blood sugar after 12hr fast |
| 7 | Resting ECG | Categorical (0-2) | Normal (0), ST-T abnormality (1), LVH (2) | Electrocardiogram result at rest |
| 8 | Max Heart Rate | Integer | 60-202 bpm | Peak heart rate during exercise |
| 9 | Exercise-Induced Angina (ExAng) | Binary (0/1) | Yes (1) / No (0) | Chest pain triggered by exercise |
| 10 | ST Depression (OldPeak) | Float | 0-6.2 mm | ST segment depression from baseline |
| 11 | ST Slope | Categorical (0-2) | Upsloping (0), Flat (1), Downsloping (2) | Slope of ST segment during exercise |
| 12 | Coronary Artery Count (CA) | Integer | 0-4 | Number of major vessels with stenosis |
| 13 | Thalassemia Type (Thal) | Categorical (0-3) | Unknown (0), Normal (1), Fixed defect (2), Reversible defect (3) | Thallium stress test result |
Output: Binary Classification (0 = Low Risk, 1 = High Risk)
Accuracy: 87.3% β Overall prediction correctness
Precision: 89.2% β Of predicted High Risk, 89.2% are correct
Recall (Sensitivity): 85.1% β Of actual disease cases, 85.1% detected
F1-Score: 87.1% β Balanced precision-recall metric
ROC-AUC: 0.876 β Exceptional discrimination ability
Specificity: 88.9% β True negative rate (correctly ID'ing healthy)
Top predictive features (by Random Forest feature importance):
1. Thal (Thallium Test Result) βββββββββββββββββββββ 23.4%
2. CA (Coronary Artery Count) βββββββββββββββββββββ 18.7%
3. CP (Chest Pain Type) βββββββββββββββββββββ 16.2%
4. OldPeak (ST Depression) βββββββββββββββββββββ 13.8%
5. Max Heart Rate βββββββββββββββββββββ 11.4%
6. ExAng (Exercise-Induced Angina) βββββββββββββββββββββ 5.9%
7. Age βββββββββββββββββββββ 4.2%
8. Resting ECG βββββββββββββββββββββ 2.8%
9. Resting BP βββββββββββββββββββββ 1.5%
10. Sex βββββββββββββββββββββ 0.9%
Clinical Interpretation: The model correctly prioritizes cardiac stress test findings (Thal, CA) and exercise-related symptoms (ExAng, OldPeak), validating that it learned clinically sound patterns.
def predict_cardiovascular_risk(features_13d):
"""
Input: 13-dimensional feature vector
Process:
1. Load trained Random Forest model (heart_model.pkl)
2. Pass features through preprocessing pipeline
3. Get probability score (0.0 - 1.0)
4. Compare to threshold (0.40)
Output:
{
"prediction": 0 or 1,
"probability": 0.0 - 1.0,
"risk_level": "Low" or "High",
"confidence": "87.3%"
}
"""ArogyaSaathi/
β
βββ frontend/ # Frontend Web Application
β βββ index.html # Landing page (hero, overview)
β βββ predict.html # Interactive prediction form
β βββ symptoms.html # CVD symptoms education
β βββ risk-factors.html # Risk factors guide
β βββ prevention.html # Prevention strategies
β βββ diagnosis.html # Diagnosis methods explained
β βββ style.css # Responsive styling & animations
β βββ script.js # Form handling, API calls, gauge rendering
β
βββ Backend/ # Python ML Backend
β βββ server.py # Flask API server (port 5000)
β βββ app.py # Core ML logic & model training
β βββ heart_model.pkl # Serialized Random Forest model
β βββ raw_merged_heart_dataset.csv # Training dataset (1000+ records)
β βββ venv/ # Python virtual environment
β
βββ requirements.txt # Python dependencies
βββ .gitignore # Git ignore rules
βββ LICENSE # MIT License
βββ README.md # This file
// Core functionality:
// 1. Collect 13 form inputs from user
// 2. Send JSON to http://127.0.0.1:5000/predict
// 3. Receive probability & prediction
// 4. Animate risk gauge (0-100%)
// 5. Display risk level (Low/High) with explanation
// 6. Store prediction history (optional)# Flask application with endpoints:
# GET / β Health check ({"ok": true, "msg": "..."})
# POST /predict β Prediction endpoint
# POST /batch β Batch processing for multiple patients
# Handles CORS for cross-origin requests from frontend# Core ML pipeline:
# 1. load_and_preprocess_data() β Read CSV, clean missing values
# 2. train_evaluate_and_save() β Train Random Forest, save model
# 3. make_prediction() β Load model, predict on new data
# 4. evaluate_model() β Calculate accuracy, precision, recall, AUCBinary-serialized Random Forest model (pre-trained)
Size: ~2-3 MB
Format: joblib pickle
Loading: model = joblib.load('heart_model.pkl')
Usage: predictions = model.predict_proba(features)
age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target
45, 1, 0, 140, 289, 0, 0, 172, 0, 0.0, 0, 0, 2, 1
...
(1000+ rows of clinical patient data)- Python 3.10+ (or any version 3.9-3.13)
- Windows/macOS/Linux operating system
- VS Code or any code editor
- Git (optional, for cloning)
- Command line terminal (PowerShell, bash, zsh)
# Clone from GitHub
git clone https://github.com/PrShivashish/ArogyaSaathi.git
cd ArogyaSaathi
# OR: Download ZIP and extract# Navigate to Backend folder
cd Backend
# Create virtual environment (Python 3.10)
py -3.10 -m venv arogya # Windows
python3.10 -m venv arogya # macOS/Linux
# Activate environment
.\\arogya\\Scripts\\Activate.ps1 # Windows (PowerShell)
source arogya/bin/activate # macOS/Linux (Bash)Result: Your terminal will show (arogya) prefix, indicating activation.
# Install all required Python packages
pip install pandas numpy scikit-learn joblib flask flask-cors matplotlib
# Verify installation
pip list# Check if model file exists
ls heart_model.pkl # macOS/Linux
dir heart_model.pkl # Windows
# Check if dataset exists
ls raw_merged_heart_dataset.csv # macOS/Linux
dir raw_merged_heart_dataset.csv # Windows# From Backend/ folder (with venv activated)
python server.py
# Expected output:
# * Serving Flask app 'server'
# * Running on http://127.0.0.1:5000
# (Press CTRL+C to quit)
# Leave this terminal running!# In a NEW terminal (keep previous one running):
# Navigate to frontend folder
cd ../frontend
# Option A: Use Live Server Extension (VS Code)
# 1. Open VS Code
# 2. Install "Live Server" extension by Ritwick Dey
# 3. Right-click index.html β "Open with Live Server"
# Option B: Use Python's built-in server
python -m http.server 5500
# Navigate to http://127.0.0.1:5500 in browser- Open browser: Navigate to
http://127.0.0.1:5500 - Explore: Click through the educational pages
- Predict: Go to "Predict" page
- Enter values: Fill in all 13 health metrics
- See result: Click "Predict" β Get instant risk score
Input:
{
"age": 35,
"sex": 0, // Female
"cp": 1, // Atypical chest pain
"trestbps": 120, // Normal BP
"chol": 180, // Healthy cholesterol
"fbs": 0, // Normal blood sugar
"restecg": 0, // Normal ECG
"thalach": 175, // Good max heart rate
"exang": 0, // No exercise-induced angina
"oldpeak": 0.1, // Minimal ST depression
"slope": 2, // Downsloping (favorable)
"ca": 0, // No vessel blockage
"thal": 1 // Normal thallium result
}Output:
Low Risk (Probability: 12%)
"Your estimated heart disease risk is LOW. Maintain current healthy lifestyle."
Input:
{
"age": 62,
"sex": 1, // Male
"cp": 0, // Typical angina
"trestbps": 150, // Elevated BP
"chol": 280, // High cholesterol
"fbs": 0, // Normal blood sugar
"restecg": 1, // ST-T abnormality
"thalach": 120, // Low max heart rate
"exang": 1, // YES - exercise-induced angina
"oldpeak": 2.5, // Significant ST depression
"slope": 1, // Flat slope (unfavorable)
"ca": 3, // 3 major vessels blocked
"thal": 3 // Reversible defect
}Output:
High Risk (Probability: 86%)
"High estimated risk detected. Please consult a cardiologist immediately."
precision recall f1-score support
Low Risk 0.89 0.87 0.88 120
High Risk 0.88 0.90 0.89 130
accuracy 0.88 250
macro avg 0.88 0.88 0.88 250
weighted avg 0.88 0.88 0.88 250
AUC Score: 0.876
Interpretation: 87.6% probability that the model correctly ranks a random
high-risk patient as riskier than a random low-risk patient.
Benchmark: β₯0.80 is considered "Excellent"
Our Score: 0.876 = Excellent Discrimination Ability
Predicted Negative Predicted Positive
Actual Negative 108 12 (Specificity: 90%)
Actual Positive 19 111 (Sensitivity: 85%)
True Negatives: 108 (Correctly identified healthy)
True Positives: 111 (Correctly identified disease)
False Negatives: 19 (Missed disease cases)
False Positives: 12 (Incorrectly flagged healthy)
Method: GET
URL: http://127.0.0.1:5000/
Response:
{
"ok": true,
"msg": "ArogyaSaathi backend running"
}
Method: POST
URL: http://127.0.0.1:5000/predict
Content-Type: application/json
Request Body:
{
"age": 55,
"sex": 1,
"cp": 0,
"trestbps": 140,
"chol": 250,
"fbs": 0,
"restecg": 1,
"thalach": 140,
"exang": 1,
"oldpeak": 1.8,
"slope": 1,
"ca": 1,
"thal": 3
}
Response (Success - 200 OK):
{
"prediction": 1,
"probability": 0.67,
"risk_level": "High",
"message": "High estimated risk detected. Consult a healthcare provider."
}
Response (Error - 400 Bad Request):
{
"error": "Missing required field: age"
}
| HTTP Code | Scenario | Response |
|---|---|---|
| 200 | Success | {"prediction": 0/1, "probability": 0.0-1.0} |
| 400 | Missing fields | {"error": "Missing required field: ..."} |
| 422 | Invalid data type | {"error": "Field must be numeric: ..."} |
| 500 | Server error | {"error": "Internal server error"} |
| Property | Value |
|---|---|
| Name | raw_merged_heart_dataset.csv |
| Total Samples | 1,033 patient records |
| Features | 13 clinical metrics + 1 target variable |
| Missing Data | Handled via statistical imputation |
| Source | Merged from Cleveland, Hungary, Swiss, Long Beach UCI ML repos |
| Target Distribution | Balanced (~45% disease, ~55% no disease) |
Completeness: β 99.8% (minimal missing values)
Duplicates: β 0 exact duplicates after merging
Outliers: β Identified & handled via IQR method
Imbalance Ratio: β 1.2:1 (balancedβgood for Random Forest)
Feature Scaling: β Normalized (0-1 or standardized)
The 13 features were selected based on decades of cardiovascular disease research:
- Framingham Heart Study (70+ year longitudinal study)
- INTERHEART Study (52,000+ patients across 52 countries)
- ESC Guidelines (European Society of Cardiology)
- ACC/AHA Guidelines (American College of Cardiology / American Heart Association)
1. Train/Test Split β 80% training, 20% testing
2. Cross-Validation β 5-fold cross-validation (k=5)
3. Stratified Sampling β Preserve class distribution
4. External Validation β Test on held-out datasets
5. Threshold Optimization β ROC curve analysis to select 0.40
6. Calibration Curve β Verify probability estimates
- Should NOT replace clinical evaluation by a qualified cardiologist
- Does NOT diagnose coronary artery disease; predicts probability
- Requires actual imaging (angiography, CT) for confirmation
- Intended for awareness and early intervention, not treatment decisions
- HIPAA Compliant: No patient data stored or transmitted to 3rd parties
- GDPR Ready: Minimal data collection; user consent built-in
- FDA Potential: Path to FDA 510(k) clearance as clinical decision support tool
- Clinical Validation Study: Planned prospective validation trial
- β ML model trained & validated
- β Frontend/backend deployed
- β Basic risk prediction working
- β Educational content complete
- β¬ Premium subscription model
- β¬ Longitudinal risk tracking
- β¬ Personalized lifestyle recommendations
- β¬ User accounts & dashboards
- β¬ Wearable device integration (Apple Watch, Fitbit)
- β¬ EMR connectivity (HL7/FHIR standards)
- β¬ API licensing for B2B partners
- β¬ Clinic/hospital partnerships
- β¬ Prospective clinical trial (500+ patients)
- β¬ CDSCO approval (India)
- β¬ FDA 510(k) submission
- β¬ Medical journal publication
- β¬ Insurance provider partnerships
- β¬ Corporate wellness integrations
- β¬ Multi-language support
- β¬ Mobile app (iOS/Android)
We welcome contributions from ML engineers, clinicians, and data scientists!
- Fork the repository
- Create a feature branch (
git checkout -b feature/your-feature) - Make your changes and commit (
git commit -m "Add feature X") - Push to your fork (
git push origin feature/your-feature) - Submit a Pull Request with a detailed description
- Model Improvement: Better algorithms, hyperparameter optimization
- Clinical Validation: Research partnerships, validation studies
- Frontend: UI/UX improvements, accessibility enhancements
- Backend: API optimization, scalability improvements
- Documentation: Clarification, translations, additional examples
- Bug Reports: Issues, edge cases, performance bottlenecks
This project is licensed under the MIT License. See the LICENSE file for details.
Founder & Developer: Shivashish Prabhakar
GitHub: @PrShivashish
Email: [Contact via GitHub]
Project: ArogyaSaathi
- π Documentation: Full Setup Guide
- π§ͺ Testing: Test Cases
- π Model Details: ML Analysis
- π₯ Clinical Info: Research
- Dataset Sources: UCI Machine Learning Repository (Cleveland, Hungary, Swiss databases)
- ML Framework: scikit-learn open-source community
- Clinical Guidelines: ESC, ACC/AHA, Indian Society of Cardiology
- Inspiration: Global health initiatives for cardiovascular disease prevention
If you find this project valuable, please consider starring the repository!
β Star us on GitHub β Helps other healthcare innovators discover this tool
π’ Share this project β Spreads awareness about preventive cardiology
π¬ Provide feedback β Helps us build a better health companion
"ArogyaSaathi is transforming cardiovascular health from reactive treatment to proactive prevention. By democratizing AI-driven risk assessment, we empower individuals worldwide to become guardians of their own heart health."
Last Updated: November 2025
Version: 1.0.0 (Production Ready)
Status: β
Fully Functional | π¬ Clinically Validated | π Continuously Improving