🏥 US Healthcare SQL Analytics Project

A complete end-to-end healthcare analytics solution using PostgreSQL, pgAdmin, and Python to uncover business insights from patient and claims data.

This project focuses on solving 20 real-time KPIs across cost, coverage, delay patterns, physician performance, and patient demographics — driven by SQL automation and visual outputs.

🔧 Tech Stack & Tools

Category	Tools / Libraries	Purpose
🛢️ Database	PostgreSQL	Core relational database for storing healthcare data
🧰 SQL Tool	pgAdmin 4	GUI for SQL execution, schema design, and data inspection
🐍 Programming	Python (`pandas`, `psycopg2`)	Data cleaning, PostgreSQL connection, and CSV automation
📓 Notebooks	Jupyter Notebook	Data preprocessing and KPI automation in interactive format
📄 Query Language	SQL	Core logic for solving all 20 business KPIs
📈 Visualization	ERD with `dbdiagram.io`	Entity relationship mapping of fact/dimension healthcare schema

📁 Folder Structure

📁 us_healthcare_sql_analysis/ – Click to expand

📂 data/
├── 📄 datasets/ — Raw healthcare CSV files
├── 📄 outputs/ — Cleaned data after preprocessing

📂 database/
├── 📜 Defining_Tables.sql — SQL schema with constraints and relationships
├── 🐍 Load_Data_Scripts.py — Bulk PostgreSQL loader using psycopg2

📂 notebooks/
├── 📓 1_data_cleaning.ipynb — Clean and export raw CSVs
├── 📓 2_eda_analysis.ipynb — Null checks and data exploration
├── 📓 3_sql_query_runner.ipynb — Dynamic SQL execution and result export

📂 business_problems_outcomes/
├── 🧠 01_top_cpt_costs.sql — KPI SQL queries (20 total)
├── 📄 result_01_top_cpt_costs.csv — Results via pgAdmin

📂 outputs/
├── 📄 csvs/ — SQL outputs via Python automation

📂 diagrams/
├── 🖼️ ERD_Health_Analytics.png — Entity-Relationship Diagram

📘 README.md — Project documentation

🧩 Entity Relationship Diagram (ERD)

This ERD illustrates the star schema used to model healthcare claims and billing data in PostgreSQL. The central facttable connects to multiple dimension tables, enabling efficient joins and analytical flexibility for KPI computation.

🗃️ Fact Table

facttable — Core transactional table containing:
- Foreign keys to all dimension tables (e.g., dimPatientPK, dimDateServicePK, dimCPTCodePK)
- Medical and billing fields like CPTUnits, Gross_Expenses, Insurance_Payment, Patient_Payment, Adjustment, AR

🧩 Dimension Tables

dimpatient — Patient-level details (name, gender, age, state, region)
dimpayer — Insurance provider information
dimphysician — Physician metadata (NPI, name, specialty, FTE)
dimspeciality — Specialization types with descriptive fields
dimdate — Date reference with breakdowns by year, month, weekday
dimtransaction — Claim transaction types and adjustment reasons
dimcptcode — CPT codes, descriptions, and groupings for procedure classification
dimdiagnosiscode — Diagnosis codes with grouping and descriptions
dimhospital — Hospital or location data (LocationName)

🧠 This schema supports comprehensive healthcare analytics across cost, insurance coverage, readmissions, provider efficiency, and more — by enabling multi-dimensional aggregations.

📌 ERD created using: dbdiagram.io
📷 Schema diagram below:

📊 Development Lifecycle

This project follows a 5-step data-to-insight pipeline — from raw CSV cleaning to SQL-powered business KPIs and dynamic exports via Python.

🧼 Step 1: Data Cleaning (`1_data_cleaning.ipynb`)

Loaded all raw .csv files from data/datasets/
Standardized column names to use underscores
Replaced null values with "NA"
Corrected invalid dates (e.g., 16-12-2019) and numeric anomalies (#NUM!)
Saved cleaned outputs to data/outputs/

📊 Step 2: Exploratory Data Analysis (`2_eda_analysis.ipynb`)

Summary of missing/null values
Unique ID validations and data integrity checks
Distribution plots for gender, states, payers
Verified foreign key relationships across all entities

🗃️ Step 3: PostgreSQL Schema & Data Loading

Schema Setup: Defining_Tables.sql
Defined relational schema with constraints and normalized tables.
Data Load: Load_Data_Scripts.py
Automatically loads cleaned .csv files into PostgreSQL using psycopg2 with status logging.

🧠 Step 4: Solving 20 Business KPIs

We created 20 SQL-based KPIs to answer key healthcare business questions — ranging from cost efficiency to insurance coverage and claim delays.

Each KPI includes:

🎯 Objective
🧾 SQL query file
📁 Output CSV via pgAdmin and Python

KPI No.	KPI Title	Objective	SQL Query File	CSV via pgAdmin	CSV via Python
1	Top CPT Codes by Total Expenses	Identify high-cost services and their contribution	01_top_cpt_costs.sql	result_01_top_cpt_costs.csv	result_01_top_cpt_costs.csv
2	Avg Insurance Payment by Payer	Track average insurance payments and coverage %	02_avg_insurance_coverage.sql	result_02_avg_insurance_coverage.csv	result_02_avg_insurance_coverage.csv
3	Insurance vs Out-of-Pocket Ratio	Understand cost split between insurance and patients	03_insurance_vs_oop_ratio.sql	result_03_insurance_vs_oop_ratio.csv	result_03_insurance_vs_oop_ratio.csv
4	Patients Without Insurance Payment	Count self-pay cases and % out-of-pocket	04_patients_without_insurance.sql	result_04_patients_without_insurance.csv	result_04_patients_without_insurance.csv
5	Payers with Most Claims but Least Payment	Detect low-paying high-claim insurers	05_payers_most_claims_least_payment.sql	result_05_payers_most_claims_least_payment.csv	result_05_payers_most_claims_least_payment.csv
6	Avg Patient Payment by State	Patient out-of-pocket average by geography	06_avg_patient_payment_by_state.sql	result_06_avg_patient_payment_by_state.csv	business_problems_outcomes/result_06_avg_patient_payment_by_state.csv
7	Avg Claim Cost by Gender	Understand cost difference across genders	07_avg_claim_cost_by_gender.sql	result_07_avg_claim_cost_by_gender.csv	result_07_avg_claim_cost_by_gender.csv
8	Top 5 States by OOP %	Find states with highest out-of-pocket burden	08_top_states_by_oop_percent.sql	result_08_top_oop_states.csv	result_08_top_oop_states.csv
9	Readmission Detection	Track multiple hospital visits per patient	09_readmission_detection.sql	result_09_readmission_detection.csv	result_09_readmission_detection.csv
10	Monthly Claim Trend	Observe claim trends over time	10_monthly_claim_trend.sql	result_10_monthly_claim_trend.csv	result_10_monthly_claim_trend.csv
11	Physician Efficiency (Cost/Patient)	Measure cost-efficiency per physician	11_physician_cost_efficiency.sql	result_11_physician_cost_efficiency.csv	result_11_physician_cost_efficiency.csv
12	Claim Volume by Specialty	Analyze claim frequency by medical specialty	12_claim_volume_by_specialty.sql	result_12_claim_volume_by_specialty.csv	result_12_claim_volume_by_specialty.csv
13	Top 5 Expensive Specialties	Identify costly specialties in terms of expenses	13_top_expensive_specialties.sql	result_13_top_expensive_specialties.csv	result_13_top_expensive_specialties.csv
14	Diagnosis Groups by Volume + Avg AR	Rank diagnosis groups and their avg AR	14_diagnosis_claim_volume_avg_ar.sql	result_14_diagnosis_claim_volume_avg_ar.csv	result_14_diagnosis_claim_volume_avg_ar.csv
15	Top 10 Diagnosis by AR	Identify diagnoses with highest average account receivable	15_top_diagnosis_by_highest_ar.sql	result_15_top_diagnosis_by_highest_ar.csv	result_15_top_diagnosis_by_highest_ar.csv
16	Monthly Claims & Payout Trend	Track insurance and patient payment trend over months	16_monthly_claims_payout_trend.sql	result_16_monthly_claims_payout_trend.csv	result_16_monthly_claims_payout_trend.csv
17	Hospitals with Most Patients	Rank hospitals by patient volume and cost	17_top_hospitals_by_patients.sql	result_17_top_hospitals_by_patients.csv	result_17_top_hospitals_by_patients.csv
18	Delayed Postings (Post > Service Date)	Count late claim postings	18_delayed_postings_count.sql	result_18_delayed_postings_count.csv	result_18_delayed_postings_count.csv
19	Avg CPT Units by Procedure Group	Measure average units used per CPT group	19_avg_cpt_units_by_grouping.sql	result_19_avg_cpt_units_by_grouping.csv	result_19_avg_cpt_units_by_grouping.csv
20	Diagnosis with Most Adjustments	See which diagnoses are most adjusted in billing	20_diagnosis_with_most_adjustments.sql	result_20_diagnosis_with_most_adjustments.csv	result_20_diagnosis_with_most_adjustments.csv

🧪 Step 5: Run SQL Queries with Python

Notebook: 3_sql_query_runner.ipynb

Reads all .sql files in business_problems_outcomes/
Connects to PostgreSQL database
Executes each query dynamically
Saves output result as .csv into outputs/csvs/

🚀 Run It Yourself

Load Data to PostgreSQL
Run: python database/Load_Data_Scripts.py
Run KPIs
Use either:
- SQL in pgAdmin (manually)
- Notebook: 3_sql_query_runner.ipynb
Check Results
Output .csv files are saved in:
- business_problems_outcomes/ → via pgAdmin
- outputs/csvs/ → via Python script

📬 Project Summary & Conclusion

This project delivers a complete healthcare analytics solution powered by PostgreSQL, SQL, and Python — designed to transform raw claim data into actionable business insights.

It demonstrates:

✅ A fully normalized PostgreSQL schema supporting analytical joins and aggregations
✅ 20 real-world KPIs solved using SQL — targeting cost trends, insurance coverage, payer behavior, readmissions, and provider performance
✅ Python-based automation scripts using psycopg2 to run SQL queries and export results dynamically
✅ Clean and modular Jupyter notebooks for data cleaning, EDA, and SQL execution
✅ A reusable framework for generating insights across multiple healthcare dimensions

🚀 This end-to-end system reflects practical data engineering, query optimization, and healthcare domain application — making it both interview-ready and production-scalable.

📩 Contact

For questions or collaboration, feel free to connect:

💼 LinkedIn: Harish Chowdary
🧑‍💻 GitHub: Harish-34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🏥 US Healthcare SQL Analytics Project

🔧 Tech Stack & Tools

📁 Folder Structure

🧩 Entity Relationship Diagram (ERD)

🗃️ Fact Table

🧩 Dimension Tables

📊 Development Lifecycle

🧼 Step 1: Data Cleaning (`1_data_cleaning.ipynb`)

📊 Step 2: Exploratory Data Analysis (`2_eda_analysis.ipynb`)

🗃️ Step 3: PostgreSQL Schema & Data Loading

🧠 Step 4: Solving 20 Business KPIs

🧪 Step 5: Run SQL Queries with Python

🚀 Run It Yourself

📬 Project Summary & Conclusion

📩 Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
business_problems_outcomes		business_problems_outcomes
data		data
database		database
diagrams		diagrams
notebooks		notebooks
outputs/csvs		outputs/csvs
readme.md		readme.md

Harish-34/Health_Analytics_sql

Folders and files

Latest commit

History

Repository files navigation

🏥 US Healthcare SQL Analytics Project

🔧 Tech Stack & Tools

📁 Folder Structure

🧩 Entity Relationship Diagram (ERD)

🗃️ Fact Table

🧩 Dimension Tables

📊 Development Lifecycle

🧼 Step 1: Data Cleaning (1_data_cleaning.ipynb)

📊 Step 2: Exploratory Data Analysis (2_eda_analysis.ipynb)

🗃️ Step 3: PostgreSQL Schema & Data Loading

🧠 Step 4: Solving 20 Business KPIs

🧪 Step 5: Run SQL Queries with Python

🚀 Run It Yourself

📬 Project Summary & Conclusion

📩 Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

🧼 Step 1: Data Cleaning (`1_data_cleaning.ipynb`)

📊 Step 2: Exploratory Data Analysis (`2_eda_analysis.ipynb`)

Packages