A complete end-to-end healthcare analytics solution using PostgreSQL, pgAdmin, and Python to uncover business insights from patient and claims data.
This project solves 20 real-world KPIs across cost, coverage, delay patterns, physician performance, and patient demographics, driven by SQL automation and visual outputs.
| Category | Tools / Libraries | Purpose |
|---|---|---|
| Database | PostgreSQL | Core relational database for storing healthcare data |
| SQL Tool | pgAdmin 4 | GUI for SQL execution, schema design, and data inspection |
| Programming | Python (pandas, psycopg2) | Data cleaning, PostgreSQL connection, and CSV automation |
| Notebooks | Jupyter Notebook | Data preprocessing and KPI automation in interactive format |
| Query Language | SQL | Core logic for solving all 20 business KPIs |
| Visualization | ERD with dbdiagram.io | Entity relationship mapping of fact/dimension healthcare schema |
- `us_healthcare_sql_analysis/`
  - `data/`
    - `datasets/`: raw healthcare CSV files
    - `outputs/`: cleaned data after preprocessing
  - `database/`
    - `Defining_Tables.sql`: SQL schema with constraints and relationships
    - `Load_Data_Scripts.py`: bulk PostgreSQL loader using psycopg2
  - `notebooks/`
    - `1_data_cleaning.ipynb`: clean and export raw CSVs
    - `2_eda_analysis.ipynb`: null checks and data exploration
    - `3_sql_query_runner.ipynb`: dynamic SQL execution and result export
  - `business_problems_outcomes/`
    - `01_top_cpt_costs.sql`: KPI SQL queries (20 total)
    - `result_01_top_cpt_costs.csv`: results via pgAdmin
  - `outputs/`
    - `csvs/`: SQL outputs via Python automation
  - `diagrams/`
    - `ERD_Health_Analytics.png`: entity-relationship diagram
  - `README.md`: project documentation
This ERD illustrates the star schema used to model healthcare claims and billing data in PostgreSQL. The central facttable connects to multiple dimension tables, enabling efficient joins and analytical flexibility for KPI computation.
- `facttable`: core transactional table containing:
  - Foreign keys to all dimension tables (e.g., `dimPatientPK`, `dimDateServicePK`, `dimCPTCodePK`)
  - Medical and billing fields like `CPTUnits`, `Gross_Expenses`, `Insurance_Payment`, `Patient_Payment`, `Adjustment`, `AR`
- `dimpatient`: patient-level details (name, gender, age, state, region)
- `dimpayer`: insurance provider information
- `dimphysician`: physician metadata (NPI, name, specialty, FTE)
- `dimspeciality`: specialization types with descriptive fields
- `dimdate`: date reference with breakdowns by year, month, weekday
- `dimtransaction`: claim transaction types and adjustment reasons
- `dimcptcode`: CPT codes, descriptions, and groupings for procedure classification
- `dimdiagnosiscode`: diagnosis codes with grouping and descriptions
- `dimhospital`: hospital or location data (`LocationName`)
This schema supports comprehensive healthcare analytics across cost, insurance coverage, readmissions, provider efficiency, and more by enabling multi-dimensional aggregations.
ERD created using: dbdiagram.io
Schema diagram: `diagrams/ERD_Health_Analytics.png`
This project follows a 5-step data-to-insight pipeline, from raw CSV cleaning to SQL-powered business KPIs and dynamic exports via Python.
Data cleaning (`1_data_cleaning.ipynb`):
- Loaded all raw `.csv` files from `data/datasets/`
- Standardized column names to use underscores
- Replaced null values with `"NA"`
- Corrected invalid dates (e.g., `16-12-2019`) and numeric anomalies (`#NUM!`)
- Saved cleaned outputs to `data/outputs/`
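The cleaning rules above can be sketched roughly as follows. The project's notebook uses pandas; this version deliberately uses only the Python standard library so it runs without extra installs. It is an illustration, not the notebook's actual code. The function names are hypothetical, and two policies are assumptions: that "corrected invalid dates" means converting day-first strings such as `16-12-2019` to ISO format, and that `#NUM!` cells are replaced with the same `"NA"` placeholder as nulls.

```python
import re
from datetime import datetime

NA = "NA"  # placeholder the README says replaces null values


def standardize_column(name: str) -> str:
    """Collapse runs of non-alphanumeric characters into single underscores."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_")


def clean_value(value):
    """Map missing cells and the Excel error '#NUM!' to NA (assumed policy)."""
    if value is None:
        return NA
    text = str(value).strip()
    if text == "" or text == "#NUM!":
        return NA
    return text


def normalize_date(text: str) -> str:
    """Convert a day-first date (DD-MM-YYYY) to ISO 8601; pass anything else through."""
    try:
        return datetime.strptime(text, "%d-%m-%Y").date().isoformat()
    except ValueError:
        return text


def clean_row(row: dict, date_columns=()) -> dict:
    """Apply the rules above to one record, e.g. a row from csv.DictReader."""
    cleaned = {}
    for key, value in row.items():
        column = standardize_column(key)
        value = clean_value(value)
        if column in date_columns:
            value = normalize_date(value)
        cleaned[column] = value
    return cleaned
```

Each row read with `csv.DictReader` from `data/datasets/` could be passed through `clean_row` and written to `data/outputs/` with `csv.DictWriter`. Date columns are named after standardization, so `"Date Of Service"` would be listed as `"Date_Of_Service"`.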
Exploratory data analysis (`2_eda_analysis.ipynb`):
- Summary of missing/null values
- Unique ID validations and data integrity checks
- Distribution plots for gender, states, payers
- Verified foreign key relationships across all entities
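As a rough illustration of the null, uniqueness, and foreign-key checks listed above, the sketch below works on plain lists of dict rows using only the standard library. The helper names are hypothetical, and it assumes the dimension table's key column carries the same name as the fact table's foreign key (for example `dimPatientPK`), which the source does not confirm.

```python
from collections import Counter


def null_counts(rows, null_markers=("", "NA", None)):
    """Count null-like cells per column across a list of dict rows."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in null_markers:
                counts[column] += 1
    return dict(counts)


def duplicate_keys(rows, key):
    """Return key values that occur more than once (IDs are expected to be unique)."""
    seen = Counter(row[key] for row in rows)
    return sorted(value for value, n in seen.items() if n > 1)


def orphaned_keys(fact_rows, fk, dim_rows, pk):
    """Return foreign-key values in the fact rows that match no dimension row."""
    known = {row[pk] for row in dim_rows}
    return sorted({row[fk] for row in fact_rows} - known)
```

An empty result from `duplicate_keys` and `orphaned_keys` is what a clean load would be expected to show; any value returned points at rows to inspect before the data goes into PostgreSQL.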
PostgreSQL schema and data load:
- Schema setup (`Defining_Tables.sql`): defines the relational schema with constraints and normalized tables.
- Data load (`Load_Data_Scripts.py`): automatically loads the cleaned `.csv` files into PostgreSQL using `psycopg2`, with status logging.
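One common way to bulk-load a CSV with `psycopg2` is PostgreSQL's `COPY ... FROM STDIN` through `cursor.copy_expert`. The sketch below shows that approach with simple status logging; whether `Load_Data_Scripts.py` uses `COPY` or row-by-row inserts is not stated, so treat this as an assumption. The function names, the log format, and the identifier check are all illustrative.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    for name in (table, *columns):
        # Reject anything that is not a plain identifier, since names are
        # interpolated directly into the SQL text.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    column_list = ", ".join(columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv(conn, csv_path, table, columns):
    """Stream one cleaned CSV into a table and log the outcome.

    `conn` is assumed to be a psycopg2 connection; copy_expert is a
    psycopg2 cursor method, not part of the generic Python DB-API.
    """
    sql = build_copy_sql(table, columns)
    try:
        with open(csv_path, "r", encoding="utf-8", newline="") as handle:
            with conn.cursor() as cur:
                cur.copy_expert(sql, handle)
        conn.commit()
        print(f"[OK] {table}: loaded {csv_path}")
    except Exception as exc:
        conn.rollback()
        print(f"[FAIL] {table}: {exc}")
        raise
```

Dimension tables would need to be loaded before `facttable` so that foreign-key constraints are satisfied. If numeric columns still contain the `"NA"` placeholder, adding `NULL 'NA'` to the `COPY` options is one way to map it back to SQL `NULL`.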
We created 20 SQL-based KPIs to answer key healthcare business questions, ranging from cost efficiency to insurance coverage and claim delays.
Each KPI includes:
- Objective
- SQL query file
- Output CSV via pgAdmin and Python
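To give a feel for what one KPI query over the star schema might look like, here is a small runnable sketch in the spirit of `01_top_cpt_costs.sql`. It is a guess at the shape of such a query, not the project's actual SQL. It uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without a server. The `dimcptcode` column names `CptCode` and `CptDesc` are hypothetical, and the rows are made-up toy values.

```python
import sqlite3

# Join the fact table to the CPT dimension and rank codes by total gross expenses.
# Only standard SQL is used, so the same statement should also run on PostgreSQL.
TOP_CPT_COSTS = """
SELECT c.CptCode,
       c.CptDesc,
       SUM(f.Gross_Expenses) AS total_gross_expenses
FROM facttable AS f
JOIN dimcptcode AS c ON c.dimCPTCodePK = f.dimCPTCodePK
GROUP BY c.CptCode, c.CptDesc
ORDER BY total_gross_expenses DESC
LIMIT 2
"""


def demo():
    """Build a two-table toy star schema, run the KPI query, and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dimcptcode (
            dimCPTCodePK INTEGER PRIMARY KEY,
            CptCode TEXT,
            CptDesc TEXT
        );
        CREATE TABLE facttable (
            dimCPTCodePK INTEGER REFERENCES dimcptcode(dimCPTCodePK),
            Gross_Expenses REAL
        );
        -- Toy rows, invented purely for illustration.
        INSERT INTO dimcptcode VALUES
            (1, 'A', 'Procedure A'), (2, 'B', 'Procedure B'), (3, 'C', 'Procedure C');
        INSERT INTO facttable VALUES (1, 100.0), (1, 50.0), (2, 300.0), (3, 20.0);
        """
    )
    rows = conn.execute(TOP_CPT_COSTS).fetchall()
    conn.close()
    return rows
```

Calling `demo()` returns the two CPT codes with the highest summed `Gross_Expenses`. The other KPIs would follow the same pattern, with the join pointing at a different dimension table such as `dimpayer` or `dimphysician`.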
Notebook: `3_sql_query_runner.ipynb`
- Reads all `.sql` files in `business_problems_outcomes/`
- Connects to the PostgreSQL database
- Executes each query dynamically
- Saves each result as a `.csv` in `outputs/csvs/`
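The runner steps above can be sketched as a single function that accepts any Python DB-API connection. With `psycopg2` that would be the object returned by `psycopg2.connect(...)`, with your own connection parameters. This is a minimal sketch rather than the notebook's code: the function name is hypothetical, it assumes each `.sql` file holds one `SELECT` statement, and the output naming (`<query name>.csv`) is an assumption.

```python
import csv
from pathlib import Path


def run_sql_folder(conn, sql_dir, out_dir):
    """Execute every .sql file in sql_dir and write each result set to a CSV in out_dir.

    Assumes one SELECT statement per file, so cursor.description is populated.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for sql_file in sorted(Path(sql_dir).glob("*.sql")):
        query = sql_file.read_text(encoding="utf-8")
        cur = conn.cursor()
        try:
            cur.execute(query)
            header = [col[0] for col in cur.description]  # column names
            rows = cur.fetchall()
        finally:
            cur.close()
        target = out_path / f"{sql_file.stem}.csv"
        with open(target, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(target)
    return written
```

Taking the connection as a parameter keeps the function independent of the driver, so it can be exercised against SQLite in a test and pointed at PostgreSQL in the notebook, for example `run_sql_folder(conn, "business_problems_outcomes", "outputs/csvs")`.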
How to run:
- Load data to PostgreSQL. Run: `python database/Load_Data_Scripts.py`
- Run KPIs. Use either:
  - SQL in pgAdmin (manually)
  - Notebook: `3_sql_query_runner.ipynb`
- Check results. Output `.csv` files are saved in:
  - `business_problems_outcomes/` via pgAdmin
  - `outputs/csvs/` via Python script
This project delivers a complete healthcare analytics solution powered by PostgreSQL, SQL, and Python, designed to transform raw claim data into actionable business insights.
It demonstrates:
- A fully normalized PostgreSQL schema supporting analytical joins and aggregations
- 20 real-world KPIs solved using SQL, targeting cost trends, insurance coverage, payer behavior, readmissions, and provider performance
- Python-based automation scripts using `psycopg2` to run SQL queries and export results dynamically
- Clean and modular Jupyter notebooks for data cleaning, EDA, and SQL execution
- A reusable framework for generating insights across multiple healthcare dimensions
This end-to-end system reflects practical data engineering, query optimization, and healthcare domain application, making it both interview-ready and production-scalable.
For questions or collaboration, feel free to connect:
- LinkedIn: Harish Chowdary
- GitHub: Harish-34
