Team Members: Nneka Asuzu, Ruchira Malhotra
Repository: GitHub Link
“Behind every data point is a person and their daily choices. Our goal is to visualize responsibly, inclusively, and with integrity.”
— Team 7, Obesity Visualization Project
- Team 7 – Obesity Visualization Project
- Contents
- Purpose & Overview
- Business Question
- Goals & Objectives
- Dataset Overview
- Ethical, Privacy, and Inclusivity Considerations
- Limitations & Risks
- Techniques & Tools
- Project Workflow & Status
- 📂 Project Folder Structure
- Team Members & Roles
- Timeline & Milestones
- Team Agreements
- Final Deliverables
- Key Insights & Findings
- Visuals Showcase
- Next Steps & Future Work
- Video Reflection
- Reproducibility Note
- References & Links
This project explores and visualizes obesity trends based on eating habits, physical activity, and demographics.
Goal: uncover meaningful patterns associated with different obesity levels and communicate them through clear, ethical, data-driven visualizations that can inform public health intervention strategies.
What relationships exist between lifestyle habits, demographic factors, and obesity levels, and how can effective data visualization help uncover meaningful patterns for public health understanding?
- Explore distributions of numeric and categorical variables
- Visualize relationships between lifestyle factors and obesity levels
- Detect and handle duplicates, inconsistencies, and outliers
- Create a reproducible, insight-driven visualization report
- Communicate findings for public health awareness and strategy development
- Source: UCI Machine Learning Repository – Obesity Dataset
- Records: 2,111 individuals
- Attributes: 17 features (demographics, eating habits, physical activity)
- Target Variable: `NObeyesdad` (7 categories: Insufficient Weight → Obesity Type III)
- Composition: ~23% real survey data (Mexico, Peru, Colombia) + ~77% synthetic via SMOTE
Note: results reflect dataset patterns, not real-world prevalence
- Anonymized and largely synthetic (~77% SMOTE-generated); no PII included
- Re-identification risk is minimal: most records are simulated, and none carry direct identifiers
- Neutral, respectful labeling
- Colorblind-safe palettes and accessible formatting
- Interpretations emphasize patterns, not individual judgment
- Accessible design (high-contrast visuals, clear labels)
- Fair representation of gender and age groups
- Cultural context acknowledged for Latin American origins
- Transparent disclosure of SMOTE balancing for fairness
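One way to keep the colorblind-safe commitment consistent across charts is to fix a single category-to-color mapping and reuse it everywhere. The sketch below is an illustration, not the project's actual palette: it assumes the Okabe-Ito colors and the seven class labels as they are spelled in the UCI file, and the names `OBESITY_LEVELS` and `OBESITY_PALETTE` are hypothetical.

```python
# Seven NObeyesdad levels, ordered from lowest to highest weight category.
# Assumption: label spellings match the UCI CSV.
OBESITY_LEVELS = [
    "Insufficient_Weight",
    "Normal_Weight",
    "Overweight_Level_I",
    "Overweight_Level_II",
    "Obesity_Type_I",
    "Obesity_Type_II",
    "Obesity_Type_III",
]

# Okabe-Ito colors (black omitted), a palette designed to stay
# distinguishable under common forms of color vision deficiency.
OKABE_ITO = [
    "#56B4E9",  # sky blue
    "#0072B2",  # blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#E69F00",  # orange
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
]

# Hypothetical shared mapping: one fixed color per obesity level.
OBESITY_PALETTE = dict(zip(OBESITY_LEVELS, OKABE_ITO))
```

A dictionary like this can be passed to seaborn plotting functions as `palette=` when `NObeyesdad` is the hue variable, so each level keeps the same color in every figure.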
- Synthetic Data Bias (SMOTE)
- Limited Geographic Scope (Mexico, Peru, Colombia)
- Cohort Demographics (young adults)
- Self-Reported Responses (recall bias)
- Unmeasured Health Factors (thyroid issues, genetics)
- Association, Not Causation (correlational patterns only)
- Languages: Python 3
- Environment: VS Code, Jupyter Notebook
- Libraries: pandas, numpy, matplotlib, seaborn, plotly, scipy
- Data Preprocessing: Standardized categories, removed 24 duplicate rows (2,111 → 2,087 final records), confirmed 0 missing values
- Visualization Techniques: Countplots, Boxplots, Violinplots, Histograms, KDE, Correlation Heatmaps, FacetGrids
- Statistical Analysis: ANOVA (numeric features), Cramér's V (categorical), Pearson correlation
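The two association measures named above can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper names `anova_f` and `cramers_v` are hypothetical, Cramér's V is computed without bias correction, and the small `toy` frame is made up purely to show the calls.

```python
import numpy as np
import pandas as pd
from scipy import stats


def anova_f(df, numeric_col, group_col):
    """One-way ANOVA of a numeric feature across the levels of group_col."""
    groups = [g[numeric_col].to_numpy() for _, g in df.groupby(group_col)]
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value


def cramers_v(df, col_a, col_b):
    """Cramér's V (no bias correction) between two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up rows for illustration only; these are not project data.
toy = pd.DataFrame({
    "Weight": [50, 52, 51, 70, 72, 71, 95, 97, 96],
    "FAVC": ["no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"],
    "NObeyesdad": ["Normal_Weight"] * 3
    + ["Overweight_Level_I"] * 3
    + ["Obesity_Type_I"] * 3,
})

f_stat, p_value = anova_f(toy, "Weight", "NObeyesdad")
v = cramers_v(toy, "FAVC", "NObeyesdad")
```

A larger F statistic suggests the numeric feature differs more between obesity levels, and V ranges from 0 (no association) to 1 (perfect association). Both describe association only, in line with the limitations listed above.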
| Phase | Focus | Owner | Status | Key Deliverable |
|---|---|---|---|---|
| Week 1 | Setup, Cleaning, Initial EDA | Nneka | ✅ Completed | Cleaned dataset & initial distributions |
| Week 2 | Visualization, Interpretation, Insights, Conclusion | Nneka | ✅ Completed | Full analysis, insights, data story |
| Week 2 | Presentation prep, Showcase, Final Review | Ruchira | ✅ Completed | Presentation slides & showcase |
This visual map shows the layout and purpose of each directory and file in the project repository.
Obesity_project/
│
├── 📁 data/
│ ├── 📂 raw/ # Original, untouched dataset file
│ └── 📂 processed/ # Cleaned, deduplicated, and engineered data
│
├── 📁 notebooks/ # All Jupyter notebooks
│ └── 📄 obesity_analysis.ipynb # Main analysis notebook with EDA, visuals, and insights
│
├── 📁 visuals/ # Final, saved versions of all charts (PNG format)
│ ├── graph1_nobeyesdad_distribution.png
│ ├── graph2_numeric_features_distribution.png
│ ├── graph3_boxplots_numeric_by_obesity.png
│ ├── graph4_anova_numeric_strength.png
│ ├── graph5_categorical_features_distribution.png
│ ├── graph6_categorical_cramers_v.png
│ ├── graph7_correlation_heatmap.png
│ └── graph8_pairplot_optimal_obesity.png
│
├── 📁 docs/ # Documentation, Data Dictionary, Appendix
│ └── 📄 data_dictionary.md # Feature descriptions & coding
│
├── 📄 README.md # Complete documentation with workflow, insights, and ethical considerations
├── 📄 requirements.txt # List of all Python libraries needed
└── 🚫 .gitignore # Files Git should ignore
| Name | Role | Responsibilities |
|---|---|---|
| Nneka Asuzu | Project Lead & Data Science Lead | Repo setup, cleaning, EDA, visualization, insights, conclusions, README |
| Ruchira Malhotra | Presentation & Review Specialist | Slide design, showcase preparation, documentation review |
| Week | Focus | Owner | Status |
|---|---|---|---|
| Week 1 | Repo setup, data cleaning, EDA draft | Nneka | ✅ Completed |
| Week 2 | Visualization, Interpretation, Insights, Conclusion | Nneka | ✅ Completed |
| Week 2 | Presentation prep, Showcase, Final Review | Ruchira | ✅ Completed |
- Communication: Share progress on Slack after each session
- Collaboration: Minimum 2 meaningful commits per week
- Code Consistency: Follow PEP8; document all visualizations
- Reproducibility: Notebook runs from start to finish without error
- Team Support: Issues addressed within 12 hours
- Jupyter Notebook: `obesity_analysis.ipynb`
- Visuals Folder: Final plots (PNG format)
- Presentation Slides: Team 7 Showcase Deck
- README.md: Complete documentation with workflow, insights, and ethical considerations
- Video Reflection: Links to 3–5 minute team member videos
- Primary Behavioral Risk: FAVC (frequent consumption of high-calorie food) – the behavioral factor most strongly associated with severe obesity
- Primary Protective Factor: SCC (calorie consumption monitoring) – nearly all obesity cases occur among respondents who do not monitor calories
- Fixed Risk Baseline: Family history of overweight – the dominant non-modifiable factor
- Gender Disparity: Obesity Type III is largely female; Obesity Type II is largely male
- Lower-Priority Targets: Age & TUE (time using technology devices) – weak predictors, so de-prioritized for interventions
Key visualizations from our analysis are saved as PNG files in the `visuals/` folder.
Visualizations highlight the interplay of lifestyle, demographics, and obesity levels, providing quick insight into patterns discovered during EDA.
Based on our EDA findings, Team 7 recommends the following next steps for further analysis and actionable insights:
- Feature Engineering & Clustering
  - Create composite behavioral indices to combine related eating and activity metrics.
  - Explore patterns in numeric features using K-Means or Hierarchical Clustering to see if Gender and Obesity Level groupings emerge.
- Predictive Modeling
  - Train a classification model (Random Forest or XGBoost) using key predictors: FAVC, SCC, Family History, and Gender.
  - Goal: Accurately predict obesity categories (`NObeyesdad`) and quantify feature importance.
- Model Interpretation & Validation
  - Use feature importance to validate EDA findings.
  - Confirm whether FAVC and SCC are indeed the strongest behavioral drivers.
  - Leverage model insights to refine public health interventions.
- Targeted Visualization & Deep-Dive Analysis
  - Investigate non-linear patterns in categorical features (e.g., CAEC and CALC ‘Sometimes’ category).
  - Refine visualizations and messaging for more effective intervention targeting.
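As a rough illustration of the composite-index idea in the first step, the sketch below standardizes two columns and combines them. Everything here is an assumption for illustration: the `activity_index` name, the equal weighting, the sign convention, and the use of the `FAF` (physical activity frequency) and `TUE` columns as they are named in the UCI file. The project has not defined such an index.

```python
import pandas as pd


def zscore(series):
    """Standardize a numeric column to mean 0 and unit (population) variance."""
    return (series - series.mean()) / series.std(ddof=0)


def add_activity_index(df):
    """Return a copy with a hypothetical composite 'activity_index' column.

    Assumption: higher FAF raises the index and higher TUE lowers it,
    with equal weight given to each.
    """
    out = df.copy()
    out["activity_index"] = zscore(out["FAF"]) - zscore(out["TUE"])
    return out


# Made-up rows for illustration only; these are not project data.
toy = pd.DataFrame({"FAF": [0.0, 1.0, 2.0, 3.0], "TUE": [2.0, 1.0, 1.0, 0.0]})
indexed = add_activity_index(toy)
```

Standardizing first keeps a column with a wider numeric range from dominating the index. Whether equal weights are appropriate would need to be checked against the data.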
- Nneka’s Reflection Video – Covers the project overview, key insights, challenges (including synthetic data limitations), cross-validating patterns using multiple visualizations, and personal contributions to team workflow.
- Ruchira’s Reflection Video – Team member reflection on learning experience, challenges, and role in the project.
- All data cleaning, preprocessing, and EDA steps are scripted and deterministic
- Dataset loaded from fixed source; no random sampling or modeling used
- Results and visualizations are fully reproducible by any team member
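A deterministic cleaning step of the kind described above might look like the sketch below. It is an assumption about the general shape, not the notebook's actual code: the function name `clean` is hypothetical, and the real standardization rules may differ from the simple whitespace stripping shown here.

```python
import pandas as pd


def clean(df):
    """Strip stray whitespace from text columns, then drop exact duplicate rows."""
    out = df.copy()
    for col in out.columns:
        # Only touch text columns; numeric columns are left unchanged.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    # No sampling or randomness: the same input always gives the same output.
    return out.drop_duplicates().reset_index(drop=True)


# Made-up rows for illustration only; these are not project data.
toy = pd.DataFrame({
    "Age": [21, 21, 30],
    "FAVC": ["yes", " yes", "no"],
})
cleaned = clean(toy)
```

Because the function works on a copy and uses no random state, rerunning it on the raw file should reproduce the same processed dataset each time.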
- Dataset: UCI Repository
- Team Project Guidelines: UofT DSI





