A machine learning project that predicts heart disease using clinical features from the UCI Cleveland Heart Disease Dataset. The model is built with a Random Forest Classifier and optimized using Scikit-learn + GridSearchCV for hyperparameter tuning.
This model uses 13 clinical features to predict heart disease:
| Feature | Description | Range |
|---|---|---|
age |
Age of the patient in years | 20-100 |
sex |
Sex of the patient (0 = female, 1 = male) | 0-1 |
cp |
Chest pain type (1-4) | 1-4 |
trestbps |
Resting blood pressure (in mm Hg) | 80-200 |
chol |
Serum cholesterol (in mg/dl) | 100-600 |
fbs |
Fasting blood sugar > 120 mg/dl (0 = false, 1 = true) | 0-1 |
restecg |
Resting electrocardiographic results | 0-2 |
thalach |
Maximum heart rate achieved | 60-220 |
exang |
Exercise-induced angina (0 = no, 1 = yes) | 0-1 |
oldpeak |
ST depression induced by exercise relative to rest | 0-6.2 |
slope |
Slope of the peak exercise ST segment | 1-3 |
ca |
Number of major vessels colored by fluoroscopy | 0-3 |
thal |
Thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect) | 3-7 |
The model uses the UCI Cleveland Heart Disease Dataset from the UCI Machine Learning Repository.
- Source: UCI Machine Learning Repository
- Records: 303 patient records
- Target Variable: Binary classification (0 = no heart disease, 1 = heart disease)
- No heart disease (0): ~54%
- Heart disease (1): ~46%
- Random Forest Classifier: An ensemble learning method that operates by constructing a multitude of decision trees at training time.
- Method: GridSearchCV with 5-fold cross-validation
- Parameters Optimized:
n_estimators: [100, 200, 300, 400, 500]max_depth: [None, 10, 20, 30]min_samples_split: [2, 5, 10]min_samples_leaf: [1, 2, 4]bootstrap: [True, False]
- Python 3.8 or higher
- pip (Python package installer)
Install the required packages:
pip install pandas numpy matplotlib scikit-learn joblibOr install all dependencies from requirements.txt:
pip install -r requirements.txtRun the main training script to:
- Download and preprocess the dataset
- Train and optimize the Random Forest model
- Generate feature importance visualization
- Save the model, scaler, and feature names
python main.pyUse the prediction CLI tool to input patient data and get heart disease predictions:
python heart_disease_predictor.pyThe tool will prompt for each clinical feature with validation against expected ranges. Enter values or type 'exit' to quit.
Example session:
Heart Disease Prediction Tool
----------------------------
Loaded feature names: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Number of features: 13
Enter the patient's data (or type 'exit' to quit):
Age of the patient in years (expected range: (20, 100)): 55
Sex of the patient (0 = female, 1 = male) (expected range: (0, 1)): 1
...
Heart disease prediction model/
├── main.py # Main training script
├── heart_disease_predictor.py # Prediction CLI tool
├── features.py # Feature definitions and ranges
├── readme.md # This file
├── .gitignore # Git ignore rules
├── .vscode/ # VSCode settings
├── 9000/ # Additional resources
├── data.json # Project data
├── new_data.csv # Processed dataset
├── prediction_log.csv # Prediction history log
├── feature_importance.png # Feature importance visualization
├── feature_names.pkl # Feature names pickle
├── scaler.pkl # StandardScaler pickle
├── optimized_heart_disease_model.pkl # Trained model pickle
└── heart_disease_model.pkl # Original model pickle
| File | Description |
|---|---|
optimized_heart_disease_model.pkl |
Trained Random Forest model with optimized hyperparameters |
scaler.pkl |
StandardScaler fitted on training data for feature normalization |
feature_names.pkl |
List of feature names used by the model |
feature_importance.png |
Bar chart showing feature importance rankings |
prediction_log.csv |
CSV file logging all predictions with timestamps, patient IDs, input values, predictions, and probabilities |
- Data Loading: Download dataset from UCI repository
- Missing Value Handling: Replace '?' with NaN, then fill with mode
- Type Conversion: Convert 'ca' and 'thal' columns to numeric types
- Target Binarization: Convert target values to binary (0 or 1)
- Original values > 0 become 1 (heart disease present)
- Original values = 0 become 0 (no heart disease)
- Feature Scaling: StandardScaler normalization (mean=0, std=1)
- Train-Test Split: 80% training, 20% testing (random_state=42)
UCI Dataset → Missing Value Handling → Type Conversion → Target Binarization
↓
Feature Scaling → Train-Test Split → GridSearchCV → Optimized Model
The model is evaluated on the test set using the following metrics:
| Metric | Description | Target |
|---|---|---|
| Accuracy | Proportion of correct predictions | > 80% |
| Precision | True positives / (True positives + False positives) | > 80% |
| Recall | True positives / (True positives + False negatives) | > 80% |
| ROC-AUC | Area under the ROC curve | > 85% |
Accuracy: 0.85
Precision: 0.84
Recall: 0.86
ROC-AUC: 0.92
The model identifies the most important features for heart disease prediction. Common top features include:
ca- Number of major vesselsthal- Thalassemia test resultoldpeak- ST depressionthalach- Maximum heart ratecp- Chest pain type
A feature importance visualization is saved as feature_importance.png after training.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- UCI Machine Learning Repository for the Heart Disease Dataset
- scikit-learn for the machine learning tools