This repository contains a machine learning project aimed at forecasting depression levels based on a combination of activity data, survey results, and mobile data. The dataset used includes information such as user activity, sleep data, and survey results (specifically the PHQ-4 test) to predict depression levels. The project includes various stages such as data preprocessing, exploratory data analysis (EDA), feature engineering, dimensionality reduction using PCA, and model application.
The present repository contains the solutions to the FDS Final Project for the year 2024/2025.
- Emre Yesil (1emreyesil)
- Recep Yılmaz (Rezeb)
- Nihal Yaman Yılmaz (Nihal-yaman)
- main.ipynb: The main notebook containing the solutions to the project, along with a command to install the required packages.
- dataset (folder): Contains the dataset CSV files used in the project.
- main.rar: A compressed copy of main.ipynb. The notebook is large, so use this archive if you have trouble downloading it.
The project is organized as follows:
- Data Preprocessing

  In this step, data cleaning and feature selection were performed. Features that showed the least correlation with depression scores were removed based on correlation matrix analysis and OLS (Ordinary Least Squares) reports. This helped improve the model's efficiency and focus on relevant features.

- Exploratory Data Analysis (EDA)

  EDA was performed to understand the distribution of the features, identify any missing values, and understand the relationships between the features and the target variable (depression score). Various visualizations, including histograms, box plots, and correlation heatmaps, were used to explore the dataset.

- Feature Engineering

  New features were derived from the existing data to better capture the patterns related to depression. Additionally, feature scaling techniques were applied to normalize the data, ensuring that the features are comparable in terms of scale.

- Principal Component Analysis (PCA)

  PCA was applied to reduce the dimensionality of the dataset, making it easier to visualize and interpret the data. This step helped to identify the most important components that contribute to the variance in the data.

- Model Application

  Several machine learning models were applied to predict depression scores. To handle class imbalance, SMOTE (Synthetic Minority Over-sampling Technique) was used to oversample the minority class. Hyperparameter tuning using GridSearch was also conducted to optimize model performance, with a focus on finding the best learning rate and number of leaves for tree-based models.

- Conclusion

  The models were evaluated using different performance metrics, including accuracy, precision, recall, and F1-score. The results indicated that the feature engineering and model selection steps significantly impacted model performance. Further improvements could be made by enhancing the dataset with additional features and applying more advanced algorithms such as deep learning.
The PHQ-4 (Patient Health Questionnaire-4) is a four-item screening tool designed to assess the severity of depression and anxiety. The test consists of the following questions:
- Little interest or pleasure in doing things?
- Feeling down, depressed, or hopeless?
- Feeling nervous, anxious, or on edge?
- Not being able to stop or control worrying?
These questions are used to calculate the PHQ-4 score. In this project, the first two questions were used to calculate the depression score, which is the key target variable for model training and evaluation.
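As an illustration of how a depression score could be derived from the first two questions, here is a minimal sketch. It assumes the standard PHQ response scale, where each item is scored from 0 ("Not at all") to 3 ("Nearly every day"), giving a two-item total between 0 and 6. The function name `depression_score` and the text encoding of the answers are hypothetical; the dataset may store responses differently.

```python
# Standard PHQ response options mapped to their 0-3 item scores.
RESPONSE_SCORES = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}


def depression_score(little_interest: str, feeling_down: str) -> int:
    """Sum the scores of the two depression items (hypothetical helper)."""
    return RESPONSE_SCORES[little_interest] + RESPONSE_SCORES[feeling_down]


score = depression_score("Several days", "Nearly every day")
print(score)  # 4
```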
The CSV files we used are available in the dataset folder. Our main data source was the College Experience Study Dataset. Thanks for the collaboration.