Heart failure is a critical condition within cardiovascular disease (CVD). CVD is the leading cause of death globally, accounting for approximately 17.9 million deaths annually. Understanding and predicting heart failure risk from factors such as age, ejection fraction, serum creatinine, and serum sodium is crucial for early intervention.
The primary research question is how accurately a model can predict mortality from heart failure based on key patient data.
Data for this study is sourced from a Kaggle dataset titled "Heart Failure Prediction," which includes 13 columns, encompassing both binary and quantitative patient data.
Using R libraries such as tidyverse and tidymodels, the dataset was imported, cleaned, and processed. Variables were renamed for clarity, and data was filtered to focus on patients with a history of diabetes or high blood pressure.
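The import, rename, and filter steps can be sketched as follows. This is a minimal illustration in Python's standard library, not the R tidyverse code the study used. The column names are assumed to match the Kaggle file, the `RENAMES` mapping is hypothetical, and the sample rows are invented for the example.

```python
import csv
import io

# Illustrative rows only: the values are invented, not taken from the dataset.
# Column names are an assumption about the Kaggle file's header.
SAMPLE_CSV = """age,diabetes,high_blood_pressure,ejection_fraction,serum_creatinine,serum_sodium,DEATH_EVENT
75,0,1,20,1.9,130,1
55,0,0,38,1.1,136,0
65,1,0,20,1.3,129,1
50,1,1,30,1.0,137,0
"""

# Hypothetical mapping from raw column names to clearer ones.
RENAMES = {
    "high_blood_pressure": "hypertension",
    "DEATH_EVENT": "death_event",
}


def load_and_filter(text):
    """Parse CSV text, rename columns, keep rows with diabetes or hypertension."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # Apply the renames and convert every field to a number.
        row = {RENAMES.get(name, name): float(value) for name, value in raw.items()}
        if row["diabetes"] == 1 or row["hypertension"] == 1:
            rows.append(row)
    return rows


patients = load_and_filter(SAMPLE_CSV)
```

Here `patients` holds only the rows where at least one of the two history flags is set, which mirrors the filter described above.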
A K-nearest neighbors (KNN) classification model was developed. The data was split into training and testing sets, variables influencing death events were identified through visualization, and the outcome classes were balanced to mitigate bias toward the more common class.
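One way to split and balance the data is sketched below in plain Python. The report does not state the split proportion or the balancing method, so the 75/25 stratified split, the upsampling of the minority class, and the choice to balance only the training set are all assumptions. The toy rows are invented.

```python
import random


def stratified_split(rows, label_key, train_fraction=0.75, seed=1):
    """Split rows into train/test sets, preserving the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not shuffled
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def upsample(rows, label_key, seed=1):
    """Resample minority classes with replacement up to the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy data: 12 survivors and 4 deaths (invented for illustration).
rows = [{"id": i, "death_event": 0} for i in range(12)]
rows += [{"id": 100 + i, "death_event": 1} for i in range(4)]

train, test = stratified_split(rows, "death_event")
balanced_train = upsample(train, "death_event")
```

Leaving the test set unbalanced keeps the evaluation representative of the real class mix, which is why this sketch upsamples only after splitting.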
The KNN model underwent tuning to identify the optimal number of neighbors, with 2 neighbors providing the best balance between accuracy and model simplicity. The model achieved an accuracy of approximately 72% in predicting heart failure mortality.
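The tuning step can be illustrated with a small from-scratch KNN and a cross-validated search over candidate values of k. This is a sketch under stated assumptions rather than the tidymodels workflow the study used: 5-fold cross-validation, Euclidean distance, unweighted voting, and predictors that are already standardized. The two-cluster data is invented.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points nearest to `point`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=5, seed=1):
    """Mean accuracy of k-NN over `folds` cross-validation folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        held_out = idx[f::folds]  # every folds-th shuffled index
        held_set = set(held_out)
        tr_x = [xs[i] for i in idx if i not in held_set]
        tr_y = [ys[i] for i in idx if i not in held_set]
        hits = sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / folds


def tune_k(xs, ys, candidates):
    """Return (best_k, scores), where scores maps each k to its CV accuracy."""
    scores = {k: cv_accuracy(xs, ys, k) for k in candidates}
    best_k = max(scores, key=lambda k: (scores[k], -k))  # ties go to smaller k
    return best_k, scores


# Toy, already-standardized data: two well-separated clusters (invented).
rng = random.Random(0)
xs = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(20)]
xs += [(rng.gauss(3, 0.3), rng.gauss(3, 0.3)) for _ in range(20)]
ys = [0] * 20 + [1] * 20

best_k, scores = tune_k(xs, ys, candidates=range(1, 8))
```

With an even k such as 2, the vote can tie. In this sketch a tie resolves to the class that appears first among the nearest neighbors, which is one possible convention and not necessarily the one used in the study.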
Various visualizations, including faceted box plots and a confusion matrix, were employed to explore the data and assess model performance.
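A confusion matrix and the metrics derived from it can be computed as in the following sketch. The labels are invented for illustration and are not the study's results. Treating a death event as the positive class is an assumption.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts


def summarize(cm):
    """Derive accuracy, recall, and precision from the confusion counts."""
    total = sum(cm.values())
    positives = cm["tp"] + cm["fn"]
    flagged = cm["tp"] + cm["fp"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "recall": cm["tp"] / positives if positives else float("nan"),
        "precision": cm["tp"] / flagged if flagged else float("nan"),
    }


# Invented labels for illustration (1 = death event, 0 = survived).
actual = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

cm = confusion_matrix(actual, predicted)
metrics = summarize(cm)
```

In this toy example the accuracy is 0.7 while the recall is only 0.5, meaning half of the death events are missed. For a mortality model, recall on the death class is therefore worth reporting alongside overall accuracy.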
The model predicts death from heart failure with approximately 72% accuracy, which is low given the life-and-death nature of the predictions. The limited accuracy might be due to weak relationships between the selected predictors and the outcome, or to the inclusion of too many predictors.
Improving the model's accuracy could significantly enhance early warning systems for heart failure risk, potentially saving lives. This research lays the groundwork for more comprehensive models that could incorporate a wider array of variables, including family history.
Further research is needed to explore the causal relationships between key variables and mortality. Additionally, exploring the impact of reducing the number of predictors on model accuracy and employing more extensive cross-validation could refine the model's predictive capabilities.
This report outlines the development and evaluation of a machine learning model aimed at predicting heart failure mortality. Despite its current limitations, the model represents a promising step towards leveraging data science in the fight against heart disease, with significant potential for future improvement and application in medical practice.