This project analyzes factors associated with depression among students using a dataset of over 27000 observations. The objective was to explore how variables such as academic pressure, lifestyle habits, financial stress and psychological factors relate to depression outcomes and to build a composite risk score that helps identify students with a high likelihood of depression.
The analysis combines Excel, SQL, Python and Tableau for data cleaning, analysis and visualization. Results indicate strong relationships between depression and academic pressure, financial stress, sleep duration, dietary habits and psychological factors. Students with high composite risk scores show more than double the depression rate of those with low risk scores, and the high-risk group represents approximately 30% of the dataset population.
The project aims to answer the following questions:
- How prevalent is depression among students in the dataset?
- Does depression vary across demographic characteristics such as gender or age group?
- What academic factors are associated with higher depression levels?
- Do lifestyle habits correlate with depression outcomes?
- What role do psychological indicators such as suicidal thoughts or family mental health history play?
- Can a combined risk score be constructed to help us identify which students are at a higher risk of depression?
The dataset contains 27901 observations and 18 variables, structured as follows:
| No | Field | Type | Description |
|---|---|---|---|
| 1 | student_id | Integer | Unique identifier assigned to each student |
| 2 | gender | String | Gender of the student (Male/Female) |
| 3 | age | Integer | Age of the student |
| 4 | city | String | City or region where the student resides |
| 5 | profession | String | Field of work or study of the student |
| 6 | academic_pressure | Integer | Measure indicating the level of pressure the student faces in academic settings |
| 7 | work_pressure | Integer | Measure of the pressure related to work or job responsibilities |
| 8 | cgpa | Float | Cumulative grade point average of the student |
| 9 | student_satisfaction | Integer | Indicator of how satisfied the students are with their studies |
| 10 | job_satisfaction | Integer | Measure of how satisfied the students are with their job or work environment |
| 11 | sleep_duration | String | Average number of hours the student sleeps per day |
| 12 | dietary_habits | String | Assessment of the student’s eating patterns and nutritional habits |
| 13 | student_degree | String | Academic degree or program that the student is pursuing |
| 14 | suicidal_thoughts | String | Binary indicator (Yes/No) that reflects whether the student has ever experienced suicidal ideation |
| 15 | study_work_hours | Integer | Average number of hours per day the student dedicates to work or study |
| 16 | financial_stress | Integer | Measure of the stress experienced due to financial concerns |
| 17 | family_mental_history | String | Indicates whether there is a family history of mental illness (Yes/No) |
| 18 | depression | Integer | Target variable that indicates whether the student is experiencing depression (Yes = 1/No = 0) |
- The dataset is observational and does not imply causal relationships.
- Psychological variables such as suicidal thoughts may be self-reported and subject to bias.
- The suicidal_thoughts variable does not specify whether the student experiences these thoughts daily, weekly, monthly or only once in their life.
- The dataset contains no socioeconomic or institutional environment variables.
- The dataset contains no variables indicating whether the student has experienced depression before, how frequently, or whether they sought help.
- Some variables contain categorical groupings (e.g. sleep_duration or dietary_habits) that limit precision.
- Inconsistent values in numerical fields were replaced with the mean of the whole field.
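- The composite risk score sums its factors with equal weights and without rescaling, so factors with wider ranges (such as study hours) influence the score more than the binary indicators.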
- Only students with a 'profession' value of 'Student' were considered. These students represent more than 99% of the dataset.
- Values in the 'city' field point to locations in India.
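The note above about how each factor contributes to the composite risk score can be illustrated with a small sketch. It compares an unweighted sum of raw values with a sum of min-max scaled values. The rows are made up for illustration, and min-max scaling is only one possible way to make the contributions comparable; it is not the method used in this project.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the cleaned table, values are invented.
toy = pd.DataFrame({
    "academic_pressure": [1, 5, 3],
    "financial_stress": [1, 5, 2],
    "study_hours": [12, 0, 6],
    "suicidal_binary": [0, 1, 0],
})

# Unweighted sum of raw values: a wide-range column such as study_hours
# can outweigh a 0/1 indicator.
raw_score = toy.sum(axis=1)

# Min-max scaling maps every column to the [0, 1] interval before summing,
# so each factor can add at most 1 to the score.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
scaled_score = scaled.sum(axis=1)

print(pd.DataFrame({"raw": raw_score, "scaled": scaled_score}))
```

In this toy data the first row has the highest raw score purely because of its study hours, while the second row ranks highest once every factor is on the same scale.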
The analysis was conducted using the following tools:
- Excel - Data exploration and null-value handling
- PostgreSQL - Data storage
- Visual Studio Code - Main workspace for queries and coding
- SQL - Data cleaning and transformation
- Python (Pandas, Seaborn) - Data analysis and feature engineering
- Tableau - Interactive dashboard creation and visualization
- GitHub - Project version control and documentation
Data cleaning was carried out with SQL in Visual Studio Code.
Key steps:
- Replaced '?' values in the financial_stress column with the column mean of 3
- Checked all fields for duplicates and null values; none were found
- Created a new table (student_survey_clean) with the following fields:
CREATE TABLE student_survey_clean AS
SELECT
student_id,
gender,
age,
academic_pressure,
cgpa,
student_satisfaction,
sleep_duration,
dietary_habits,
student_degree,
suicidal_thoughts,
study_work_hours as study_hours,
financial_stress,
family_mental_history,
depression
FROM student_depression
WHERE profession = 'Student';

- The 'city' field was left out of the new table because it contained many inconsistent values.
- Fields related to employment (e.g., 'work_pressure', 'job_satisfaction') were excluded from the analysis.
- Created a new field with ranges for the 'cgpa' field to simplify analysis and visualization:
ALTER TABLE student_survey_clean
ADD COLUMN cgpa_range TEXT;

-- BETWEEN is inclusive and CASE returns the first matching branch,
-- so a boundary value such as 2.0 is labeled '1-2', not '2-3'.
UPDATE student_survey_clean
SET cgpa_range =
CASE
WHEN cgpa < 1 THEN 'Below 1'
WHEN cgpa BETWEEN 1 AND 2 THEN '1-2'
WHEN cgpa BETWEEN 2 AND 3 THEN '2-3'
WHEN cgpa BETWEEN 3 AND 4 THEN '3-4'
WHEN cgpa BETWEEN 4 AND 5 THEN '4-5'
WHEN cgpa BETWEEN 5 AND 6 THEN '5-6'
WHEN cgpa BETWEEN 6 AND 7 THEN '6-7'
WHEN cgpa BETWEEN 7 AND 8 THEN '7-8'
WHEN cgpa BETWEEN 8 AND 9 THEN '8-9'
ELSE '9-10'
END;

The resulting table consists of 27870 observations and 15 variables.
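For readers who prefer to follow the cleaning steps in Python, the sketch below mirrors them with Pandas on a few made-up rows: replacing '?' with the column mean, checking for duplicates and nulls, and keeping only rows where profession is 'Student'. The project itself did this in SQL; the sample values and the rounding of the mean to a whole number are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw rows; the real table has 27901 rows and 18 columns.
raw = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "profession": ["Student", "Student", "Teacher", "Student"],
    "financial_stress": ["1", "?", "5", "3"],
})

# '?' cannot be parsed as a number, so errors="coerce" turns it into NaN.
stress = pd.to_numeric(raw["financial_stress"], errors="coerce")

# Fill the gaps with the column mean, rounded to a whole number (assumption).
fill_value = round(stress.mean())
raw["financial_stress"] = stress.fillna(fill_value).astype(int)

# Duplicate and null checks, reported as simple counts.
duplicate_rows = int(raw.duplicated().sum())
null_cells = int(raw.isna().sum().sum())

# Keep students only and drop the now-constant profession column.
clean = raw[raw["profession"] == "Student"].drop(columns="profession")

print(fill_value, duplicate_rows, null_cells)
print(clean)
```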
The data was analyzed in Python with the Pandas library. The cleaned table was loaded into a Jupyter notebook through a SQL-to-Python connection using the 'sqlalchemy' library. The main steps were as follows:
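A minimal sketch of loading a SQL table into a Pandas DataFrame is shown below. To stay runnable without a database server, it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the PostgreSQL connection made through sqlalchemy in this project. The three sample rows are invented.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for PostgreSQL (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_survey_clean ("
    "student_id INTEGER, academic_pressure INTEGER, depression INTEGER)"
)
conn.executemany(
    "INSERT INTO student_survey_clean VALUES (?, ?, ?)",
    [(1, 5, 1), (2, 2, 0), (3, 4, 1)],  # hypothetical rows
)
conn.commit()

# Read the whole table into a DataFrame, then release the connection.
df = pd.read_sql_query("SELECT * FROM student_survey_clean", conn)
conn.close()

print(df)
```

With sqlalchemy, the connection object would instead be an engine returned by `sqlalchemy.create_engine`, built from a PostgreSQL connection URL whose host, database name and credentials depend on the local setup.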
First, the distributions of gender, age and cgpa range were calculated, visualized and paired with their respective depression rate.
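The step above, pairing a distribution with its depression rate, could look roughly like the following sketch. It groups by one categorical column and computes the group size, its share of the population and the mean of the binary depression field. The sample rows are invented, and the column names `students`, `depression_rate` and `share` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Hypothetical sample; depression is the 0/1 target described in the data dictionary.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "depression": [1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of rows equal to 1, i.e. the depression rate.
summary = (
    df.groupby("gender")["depression"]
      .agg(students="count", depression_rate="mean")
)

# Share of the whole population that falls in each group.
summary["share"] = summary["students"] / len(df)

print(summary)
```

The same pattern applies to age groups or `cgpa_range` by changing the column passed to `groupby`.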

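To explore the relationships between the variables, a correlation matrix was computed and plotted as a heatmap.

import seaborn as sns
import matplotlib.pyplot as plt

# corr is not defined in the original snippet; it is assumed here to be the
# pairwise correlation of the numeric columns of the loaded DataFrame df.
corr = df.select_dtypes('number').corr()

plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

The correlation matrix shows a positive correlation between depression and academic pressure, study hours and financial stress.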
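Besides reading the heatmap, the correlations with the target can be ranked directly. The sketch below does this on a few invented rows, so the coefficients are not the project's results. It only shows the mechanics of extracting and sorting the depression column of the matrix.

```python
import pandas as pd

# Hypothetical rows; values are invented to produce clear positive and negative signs.
df = pd.DataFrame({
    "academic_pressure": [1, 2, 3, 4, 5, 5],
    "study_hours": [2, 4, 6, 8, 10, 12],
    "student_satisfaction": [5, 4, 4, 2, 1, 1],
    "depression": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()

# Keep only the correlations with the target and sort from most positive to most negative.
target_corr = (
    corr["depression"]
    .drop("depression")
    .sort_values(ascending=False)
)

print(target_corr)
```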
New fields based on 'suicidal_thoughts' and 'family_mental_history' were created to encode these fields as numerical values.

df['suicidal_binary'] = df['suicidal_thoughts'].map({'Yes': 1, 'No': 0})
df['family_history_binary'] = df['family_mental_history'].map({'Yes': 1, 'No': 0})

Fields related to academic performance, finances, lifestyle and psychological factors were also compared against the overall depression rate and visualized.
To identify which students were at risk due to pressure from multiple factors, a new field was created by summing the numerical values of the dimensions mentioned above. A second field then segmented each student into one of three risk categories.
df['risk_score'] = (
    df['academic_pressure'] +
    df['financial_stress'] +
    df['study_hours'] +
    df['suicidal_binary'] +
    df['family_history_binary']
)

# pd is pandas (import pandas as pd); qcut splits the scores into three quantile-based groups
df['risk_level'] = pd.qcut(
    df['risk_score'],
    q=3,
    labels=['Low Risk','Medium Risk','High Risk']
)

The number of students in each category and the depression rate per category were then calculated and visualized.
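The per-category counts and depression rates described above can be obtained with a single grouped aggregation. The following is a minimal sketch on invented scores. The column names `students` and `depression_rate` are hypothetical labels, and the numbers are not the project's results.

```python
import pandas as pd

# Hypothetical risk scores and outcomes.
df = pd.DataFrame({
    "risk_score": [3, 4, 8, 9, 14, 15],
    "depression": [0, 0, 0, 1, 1, 1],
})

# Same segmentation as in the analysis: three quantile-based groups.
df["risk_level"] = pd.qcut(
    df["risk_score"],
    q=3,
    labels=["Low Risk", "Medium Risk", "High Risk"],
)

# Number of students and share with depression in each risk level.
by_level = (
    df.groupby("risk_level", observed=True)["depression"]
      .agg(students="count", depression_rate="mean")
)

print(by_level)
```

On real data with many tied integer scores, `pd.qcut` cannot always split the population into exact thirds, so the groups may differ in size.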
Academic Pressure strongly correlates with depression
The depression rate increases roughly fourfold from low (1) to very high (5) academic pressure. Even so, almost half (43%) of the students who report no academic pressure (0) are affected by depression.
Study Hours shows a positive correlation with depression
Depression increases as more study hours are spent per day. Students who study 6 or more hours are above the overall depression rate, with 10 to 12 hours showing the strongest association.
Poor lifestyle habits are positively correlated with depression
The depression rate increases as dietary habits worsen: more than 70% of students with unhealthy diets report depression. Sleep duration follows a similar pattern, with one deviation: students who sleep 7 to 8 hours have a higher depression rate than students who sleep 5 to 6 hours.
Negative psychological factors are highly prevalent in the dataset population
65.22% of the students report having experienced suicidal thoughts at least once in their life, and 62.52% report a family history of mental illness. These factors may influence the prevalence of depression, and the rate at which both are found is cause for concern.
The composite risk score effectively separates risk groups
Students with high risk scores are associated with high rates of depression: 88.78% of students in the High Risk category report depression.
Based on these insights, several practical actions could help improve students' mental health.
Institutions can run periodic surveys to identify the areas in which students experience academic pressure and how severe it is. The results can then guide efforts to mitigate the impact of a high-pressure academic environment.
Healthy lifestyle habits, or the lack of them, influence both the academic and personal outcomes of students. Institutions could invest in initiatives that educate students and their parents or guardians about diet and sleep. These programs could focus on improving sleeping conditions, the benefits of napping, balanced meals and the importance of hydration, with guidance adapted to different financial brackets.
The risk score makes it possible to extract patterns that may help identify students who are at the high risk level or heading towards it. This allows institutions and government initiatives to reach the students who need help.
The analysis highlights several factors associated with depression among students, including academic pressure, financial stress, lifestyle habits and psychological factors. The constructed risk score separates students into risk tiers, helping to identify those most in need of help and the patterns they present.




