This project predicts the risk of a heart attack with machine learning. It focuses on a k-nearest neighbors (KNN) classifier trained on a dataset that has been balanced using the Synthetic Minority Over-sampling Technique (SMOTE).
- Introduction
- Dataset
- Feature Selection
- Data Balancing
- Model Training
- Model Evaluation
- Usage
- Dependencies
- Results
- Contributing
- License
Heart disease is a leading cause of death globally. Early prediction and diagnosis can help reduce the risk of severe outcomes. This project uses machine learning to predict the likelihood of a heart attack from a set of health metrics.
The dataset used in this project contains the following features:
- Age
- Total Cholesterol (totChol)
- Systolic Blood Pressure (sysBP)
- Diastolic Blood Pressure (diaBP)
- Body Mass Index (BMI)
- Heart Rate
- Glucose
The target variable is TenYearCHD, a binary label indicating whether the individual develops coronary heart disease (CHD) within ten years.
Feature selection was performed using the Boruta algorithm to identify the most significant features for prediction. The selected features are:
- Age
- Total Cholesterol
- Systolic Blood Pressure
- Diastolic Blood Pressure
- Body Mass Index
- Heart Rate
- Glucose
The dataset was imbalanced, so SMOTE (Synthetic Minority Over-sampling Technique) was applied to balance it. SMOTE generates synthetic minority-class samples, which gives the model more balanced training data and keeps it from simply favoring the majority class.
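To make the balancing step concrete, here is a minimal standard-library sketch of the interpolation idea behind SMOTE. It is an illustration only, not the project's code: the function `smote_sketch`, the two-feature rows, and the parameter values are all hypothetical. In practice the `imbalanced-learn` package provides this as `imblearn.over_sampling.SMOTE` with a `fit_resample` method, and resampling is normally applied to the training split only so that the test set keeps the real class ratio.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class rows: [age, sysBP] (not taken from the dataset).
minority_rows = [[60.0, 150.0], [62.0, 160.0], [58.0, 155.0], [65.0, 170.0]]
new_rows = smote_sketch(minority_rows, n_new=3)
print(new_rows)
```

Each synthetic row lies on a line segment between two real minority rows, so it stays inside the range of values already observed for that class.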
The model is a k-nearest neighbors (KNN) classifier. GridSearchCV was used to find the best hyperparameters for it.
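The training step might look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the project's dataset, and the parameter grid, split sizes, and random seeds are assumptions; the notebook's actual settings may differ. The names `scaler` and `knn_clf_best` are chosen to match the usage examples in this README.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven selected features (not the project's data).
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# KNN is distance-based, so the features are standardized first.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Hypothetical search grid; the notebook's actual grid may differ.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train_scaled, y_train)

# Best model refit on the whole training split.
knn_clf_best = search.best_estimator_
test_accuracy = knn_clf_best.score(scaler.transform(X_test), y_test)
print(search.best_params_)
print(test_accuracy)
```

Wrapping the scaler and classifier in a scikit-learn `Pipeline` would keep the scaling inside each cross-validation fold; the separate `scaler` is used here only to mirror the prediction examples.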
The model was evaluated using accuracy and a confusion matrix to understand its performance.
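As a small worked example of how the two metrics relate, the sketch below counts the four confusion-matrix cells by hand and derives accuracy from them. The labels are made up for illustration and are not the project's results; `confusion_counts` is a hypothetical helper. scikit-learn offers the same computations as `sklearn.metrics.confusion_matrix` and `sklearn.metrics.accuracy_score`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means 'at risk'."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
recall = tp / (tp + fn)             # share of at-risk cases that were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=1 TP=3
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.80 recall=0.75
```

The confusion matrix is worth reading alongside accuracy because it separates missed at-risk cases (false negatives) from false alarms (false positives), which a single accuracy figure hides.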
To use this project, follow these steps:
- Clone the repository:

      git clone https://github.com/yourusername/heart-attack-risk-prediction.git

- Install the required dependencies:

      pip install -r requirements.txt

- Run the Jupyter notebook to train the model and make predictions.
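- To predict the risk of a heart attack for a high-risk individual:

      # knn_clf_best and scaler are the fitted model and scaler from the notebook.
      # Assumed feature order: age, totChol, sysBP, diaBP, BMI, heart rate, glucose.
      h_risk = [[65, 150, 180, 70, 26.97, 80, 77]]
      prediction_risk = knn_clf_best.predict(scaler.transform(h_risk))
      print('You are safe. 😊' if prediction_risk[0] == 0 else 'Sorry, You are at risk. 👽')
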
- To predict the risk of a heart attack for a low-risk individual:

      # Same assumed feature order as the training data.
      h_safe = [[39, 195, 106, 70, 26.97, 80, 77]]
      prediction_safe = knn_clf_best.predict(scaler.transform(h_safe))
      print('You are safe. 😊' if prediction_safe[0] == 0 else 'Sorry, You are at risk. 👽')

The project requires the following dependencies:
- Python 3.x
- pandas
- numpy
- seaborn
- matplotlib
- scikit-learn
- imbalanced-learn
- statsmodels
The KNN model achieved an accuracy of approximately 85.59%. The confusion matrix provides further insight into the model's performance.
Contributions are welcome! Please fork the repository and create a pull request with your changes.
This project is licensed under the MIT License. See the LICENSE file for more details.