Skip to content

End-to-end healthcare analytics using PostgreSQL, SQL, and Python. Includes data cleaning, ERD, 20 business KPIs, automated query execution, and CSV exports for patient, payer, and physician insights.

Notifications You must be signed in to change notification settings

Harish-34/Health_Analytics_sql

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ₯ US Healthcare SQL Analytics Project

A complete end-to-end healthcare analytics solution using PostgreSQL, pgAdmin, and Python to uncover business insights from patient and claims data.

This project focuses on solving 20 real-time KPIs across cost, coverage, delay patterns, physician performance, and patient demographics β€” driven by SQL automation and visual outputs.


πŸ”§ Tech Stack & Tools

Category Tools / Libraries Purpose
πŸ›’οΈ Database PostgreSQL Core relational database for storing healthcare data
🧰 SQL Tool pgAdmin 4 GUI for SQL execution, schema design, and data inspection
🐍 Programming Python (pandas, psycopg2) Data cleaning, PostgreSQL connection, and CSV automation
πŸ““ Notebooks Jupyter Notebook Data preprocessing and KPI automation in interactive format
πŸ“„ Query Language SQL Core logic for solving all 20 business KPIs
πŸ“ˆ Visualization ERD with dbdiagram.io Entity relationship mapping of fact/dimension healthcare schema


πŸ“ Folder Structure

πŸ“ us_healthcare_sql_analysis/ – Click to expand

πŸ“‚ data/
β”œβ”€β”€ πŸ“„ datasets/ β€” Raw healthcare CSV files
β”œβ”€β”€ πŸ“„ outputs/ β€” Cleaned data after preprocessing

πŸ“‚ database/
β”œβ”€β”€ πŸ“œ Defining_Tables.sql β€” SQL schema with constraints and relationships
β”œβ”€β”€ 🐍 Load_Data_Scripts.py β€” Bulk PostgreSQL loader using psycopg2

πŸ“‚ notebooks/
β”œβ”€β”€ πŸ““ 1_data_cleaning.ipynb β€” Clean and export raw CSVs
β”œβ”€β”€ πŸ““ 2_eda_analysis.ipynb β€” Null checks and data exploration
β”œβ”€β”€ πŸ““ 3_sql_query_runner.ipynb β€” Dynamic SQL execution and result export

πŸ“‚ business_problems_outcomes/
β”œβ”€β”€ 🧠 01_top_cpt_costs.sql β€” KPI SQL queries (20 total)
β”œβ”€β”€ πŸ“„ result_01_top_cpt_costs.csv β€” Results via pgAdmin

πŸ“‚ outputs/
β”œβ”€β”€ πŸ“„ csvs/ β€” SQL outputs via Python automation

πŸ“‚ diagrams/
β”œβ”€β”€ πŸ–ΌοΈ ERD_Health_Analytics.png β€” Entity-Relationship Diagram

πŸ“˜ README.md β€” Project documentation


🧩 Entity Relationship Diagram (ERD)

This ERD illustrates the star schema used to model healthcare claims and billing data in PostgreSQL. The central facttable connects to multiple dimension tables, enabling efficient joins and analytical flexibility for KPI computation.

πŸ—ƒοΈ Fact Table

  • facttable β€” Core transactional table containing:
    • Foreign keys to all dimension tables (e.g., dimPatientPK, dimDateServicePK, dimCPTCodePK)
    • Medical and billing fields like CPTUnits, Gross_Expenses, Insurance_Payment, Patient_Payment, Adjustment, AR

🧩 Dimension Tables

  • dimpatient β€” Patient-level details (name, gender, age, state, region)
  • dimpayer β€” Insurance provider information
  • dimphysician β€” Physician metadata (NPI, name, specialty, FTE)
  • dimspeciality β€” Specialization types with descriptive fields
  • dimdate β€” Date reference with breakdowns by year, month, weekday
  • dimtransaction β€” Claim transaction types and adjustment reasons
  • dimcptcode β€” CPT codes, descriptions, and groupings for procedure classification
  • dimdiagnosiscode β€” Diagnosis codes with grouping and descriptions
  • dimhospital β€” Hospital or location data (LocationName)

🧠 This schema supports comprehensive healthcare analytics across cost, insurance coverage, readmissions, provider efficiency, and more β€” by enabling multi-dimensional aggregations.

πŸ“Œ ERD created using: dbdiagram.io
πŸ“· Schema diagram below:

ERD Diagram


πŸ“Š Development Lifecycle

This project follows a 5-step data-to-insight pipeline β€” from raw CSV cleaning to SQL-powered business KPIs and dynamic exports via Python.


🧼 Step 1: Data Cleaning (1_data_cleaning.ipynb)

  • Loaded all raw .csv files from data/datasets/
  • Standardized column names to use underscores
  • Replaced null values with "NA"
  • Corrected invalid dates (e.g., 16-12-2019) and numeric anomalies (#NUM!)
  • Saved cleaned outputs to data/outputs/

πŸ“Š Step 2: Exploratory Data Analysis (2_eda_analysis.ipynb)

  • Summary of missing/null values
  • Unique ID validations and data integrity checks
  • Distribution plots for gender, states, payers
  • Verified foreign key relationships across all entities

πŸ—ƒοΈ Step 3: PostgreSQL Schema & Data Loading

  • Schema Setup: Defining_Tables.sql
    Defined relational schema with constraints and normalized tables.

  • Data Load: Load_Data_Scripts.py
    Automatically loads cleaned .csv files into PostgreSQL using psycopg2 with status logging.


🧠 Step 4: Solving 20 Business KPIs

We created 20 SQL-based KPIs to answer key healthcare business questions β€” ranging from cost efficiency to insurance coverage and claim delays.

Each KPI includes:

  • 🎯 Objective
  • 🧾 SQL query file
  • πŸ“ Output CSV via pgAdmin and Python
KPI No. KPI Title Objective SQL Query File CSV via pgAdmin CSV via Python
1 Top CPT Codes by Total Expenses Identify high-cost services and their contribution 01_top_cpt_costs.sql result_01_top_cpt_costs.csv result_01_top_cpt_costs.csv
2 Avg Insurance Payment by Payer Track average insurance payments and coverage % 02_avg_insurance_coverage.sql result_02_avg_insurance_coverage.csv result_02_avg_insurance_coverage.csv
3 Insurance vs Out-of-Pocket Ratio Understand cost split between insurance and patients 03_insurance_vs_oop_ratio.sql result_03_insurance_vs_oop_ratio.csv result_03_insurance_vs_oop_ratio.csv
4 Patients Without Insurance Payment Count self-pay cases and % out-of-pocket 04_patients_without_insurance.sql result_04_patients_without_insurance.csv result_04_patients_without_insurance.csv
5 Payers with Most Claims but Least Payment Detect low-paying high-claim insurers 05_payers_most_claims_least_payment.sql result_05_payers_most_claims_least_payment.csv result_05_payers_most_claims_least_payment.csv
6 Avg Patient Payment by State Patient out-of-pocket average by geography 06_avg_patient_payment_by_state.sql result_06_avg_patient_payment_by_state.csv business_problems_outcomes/result_06_avg_patient_payment_by_state.csv
7 Avg Claim Cost by Gender Understand cost difference across genders 07_avg_claim_cost_by_gender.sql result_07_avg_claim_cost_by_gender.csv result_07_avg_claim_cost_by_gender.csv
8 Top 5 States by OOP % Find states with highest out-of-pocket burden 08_top_states_by_oop_percent.sql result_08_top_oop_states.csv result_08_top_oop_states.csv
9 Readmission Detection Track multiple hospital visits per patient 09_readmission_detection.sql result_09_readmission_detection.csv result_09_readmission_detection.csv
10 Monthly Claim Trend Observe claim trends over time 10_monthly_claim_trend.sql result_10_monthly_claim_trend.csv result_10_monthly_claim_trend.csv
11 Physician Efficiency (Cost/Patient) Measure cost-efficiency per physician 11_physician_cost_efficiency.sql result_11_physician_cost_efficiency.csv result_11_physician_cost_efficiency.csv
12 Claim Volume by Specialty Analyze claim frequency by medical specialty 12_claim_volume_by_specialty.sql result_12_claim_volume_by_specialty.csv result_12_claim_volume_by_specialty.csv
13 Top 5 Expensive Specialties Identify costly specialties in terms of expenses 13_top_expensive_specialties.sql result_13_top_expensive_specialties.csv result_13_top_expensive_specialties.csv
14 Diagnosis Groups by Volume + Avg AR Rank diagnosis groups and their avg AR 14_diagnosis_claim_volume_avg_ar.sql result_14_diagnosis_claim_volume_avg_ar.csv result_14_diagnosis_claim_volume_avg_ar.csv
15 Top 10 Diagnosis by AR Identify diagnoses with highest average account receivable 15_top_diagnosis_by_highest_ar.sql result_15_top_diagnosis_by_highest_ar.csv result_15_top_diagnosis_by_highest_ar.csv
16 Monthly Claims & Payout Trend Track insurance and patient payment trend over months 16_monthly_claims_payout_trend.sql result_16_monthly_claims_payout_trend.csv result_16_monthly_claims_payout_trend.csv
17 Hospitals with Most Patients Rank hospitals by patient volume and cost 17_top_hospitals_by_patients.sql result_17_top_hospitals_by_patients.csv result_17_top_hospitals_by_patients.csv
18 Delayed Postings (Post > Service Date) Count late claim postings 18_delayed_postings_count.sql result_18_delayed_postings_count.csv result_18_delayed_postings_count.csv
19 Avg CPT Units by Procedure Group Measure average units used per CPT group 19_avg_cpt_units_by_grouping.sql result_19_avg_cpt_units_by_grouping.csv result_19_avg_cpt_units_by_grouping.csv
20 Diagnosis with Most Adjustments See which diagnoses are most adjusted in billing 20_diagnosis_with_most_adjustments.sql result_20_diagnosis_with_most_adjustments.csv result_20_diagnosis_with_most_adjustments.csv

πŸ§ͺ Step 5: Run SQL Queries with Python

Notebook: 3_sql_query_runner.ipynb

  • Reads all .sql files in business_problems_outcomes/
  • Connects to PostgreSQL database
  • Executes each query dynamically
  • Saves output result as .csv into outputs/csvs/

πŸš€ Run It Yourself

  1. Load Data to PostgreSQL
    Run: python database/Load_Data_Scripts.py

  2. Run KPIs
    Use either:

  3. Check Results
    Output .csv files are saved in:

    • business_problems_outcomes/ β†’ via pgAdmin
    • outputs/csvs/ β†’ via Python script

πŸ“¬ Project Summary & Conclusion

This project delivers a complete healthcare analytics solution powered by PostgreSQL, SQL, and Python β€” designed to transform raw claim data into actionable business insights.

It demonstrates:

  • βœ… A fully normalized PostgreSQL schema supporting analytical joins and aggregations
  • βœ… 20 real-world KPIs solved using SQL β€” targeting cost trends, insurance coverage, payer behavior, readmissions, and provider performance
  • βœ… Python-based automation scripts using psycopg2 to run SQL queries and export results dynamically
  • βœ… Clean and modular Jupyter notebooks for data cleaning, EDA, and SQL execution
  • βœ… A reusable framework for generating insights across multiple healthcare dimensions

πŸš€ This end-to-end system reflects practical data engineering, query optimization, and healthcare domain application β€” making it both interview-ready and production-scalable.


πŸ“© Contact

For questions or collaboration, feel free to connect:

About

End-to-end healthcare analytics using PostgreSQL, SQL, and Python. Includes data cleaning, ERD, 20 business KPIs, automated query execution, and CSV exports for patient, payer, and physician insights.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published