Stroke remains one of the leading causes of death and long-term disability worldwide. Its sudden onset and severe consequences make it a major public health concern. Early detection and risk assessment are vital for enabling timely medical intervention and minimizing damage.
With the rise of digital health records and publicly available datasets, data-driven models have the potential to significantly improve preventative care and clinical outcomes.
This project uses the Stroke Prediction Dataset from Kaggle, which includes health-related features such as age, gender, hypertension, heart disease, BMI, smoking status, and glucose levels.
The goal is to build a machine learning model that predicts an individual's stroke risk from these features. Such a tool can help healthcare professionals identify high-risk patients for early intervention and monitoring.
- Preprocessed the data by handling missing values, encoding categorical variables, and normalizing features.
- Implemented multiple classification algorithms:
  - Logistic Regression
  - K-Nearest Neighbors (KNN)
  - Support Vector Machine (SVM)
- Wrapped each model in a pipeline with SMOTE to address class imbalance.
- Evaluated each model individually on performance metrics.
- Built a soft voting ensemble combining all three classifiers.
- Assessed the ensemble model on a separate test set to ensure generalizability and robustness.
- Programming Language: Python
- Libraries:
  - pandas, numpy: Data manipulation
  - scikit-learn: Model building, preprocessing
  - imblearn: SMOTE for class imbalance
  - matplotlib, seaborn: Data visualization
  - Jupyter Notebook: Development environment
This project demonstrates a practical application of machine learning in healthcare, showcasing a structured workflow, effective handling of imbalanced data, and thoughtful model evaluation techniques.