In this final project, fellow INFO 1998 student Ethan Huang and I build several machine learning models to determine the most predictive features in a diabetes risk-factors dataset drawn from the CDC's Behavioral Risk Factor Surveillance System (BRFSS) survey (obtained through Kaggle).
Link to dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/
Models used -- Decision Tree, Balanced Decision Tree (file entitled "Selection Bias"), Perceptron, SVM, Logistic Regression.
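As a rough illustration of how the five models listed above could be compared on accuracy, here is a minimal scikit-learn sketch. It is not the notebook's actual code: the helper name `compare_models`, the 80/20 split, the feature scaling, and the use of `class_weight="balanced"` for the balanced tree are all assumptions, and the data is a synthetic stand-in generated with `make_classification` rather than the BRFSS dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=0):
    """Fit each model on a train split and return test accuracy by model name.

    Hypothetical helper for illustration; the notebook may differ.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        # Assumption: "balanced" is taken to mean class-weighted training.
        "Balanced Decision Tree": DecisionTreeClassifier(
            class_weight="balanced", random_state=seed
        ),
        # Linear and kernel models are scaled first (an assumption).
        "Perceptron": make_pipeline(StandardScaler(), Perceptron(random_state=seed)),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic, imbalanced stand-in data; replace with the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)
results = compare_models(X, y)
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

On an imbalanced dataset, accuracy alone can flatter models that mostly predict the majority class, so it may be worth reporting recall or F1 alongside it.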
Our most accurate models were the Perceptron and Logistic Regression. The most significant risk factors across models appear to be HighBP and GenHlth. In other words, the risk factors of greatest concern for predicting diabetes are blood pressure and general health -- this is important for future ML models, preventive measures, and surveys/research.
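One way feature significance could be compared across models is to rank features by each model's own importance measure. The sketch below is an assumption about how such a ranking might be computed, not the notebook's method: `rank_features` is a hypothetical helper, the feature names are generic placeholders, and the data is synthetic. It uses a decision tree's `feature_importances_` and the absolute values of logistic regression coefficients on standardized inputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def rank_features(X, y, names, seed=0):
    """Return feature names ordered from most to least important, per model.

    Hypothetical helper for illustration only.
    """
    # Standardize so coefficient magnitudes are comparable across features.
    X_scaled = StandardScaler().fit_transform(X)
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    logreg = LogisticRegression(max_iter=1000).fit(X_scaled, y)
    scores = {
        "Decision Tree": tree.feature_importances_,
        "Logistic Regression": np.abs(logreg.coef_[0]),
    }
    return {
        model: [names[i] for i in np.argsort(s)[::-1]]
        for model, s in scores.items()
    }


# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
rankings = rank_features(X, y, names)
for model, ordered in rankings.items():
    print(model, "->", ordered[:3])
```

Features that rank near the top for several models at once are the more convincing candidates, since any single model's importance measure has its own biases.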
Visualizations used -- Bar Graphs, Pie Charts.
Please see the attached Jupyter Notebook, entitled "INFO 1998 Final Project.ipynb", for pre-run results, visualizations, and some notes/comments.