MATH 3480: Machine Learning
A survey was completed to examine the mental health of teenagers and young adults. The goal of this project is to perform comprehensive data preprocessing on the young-people-survey-responses dataset to prepare it for predicting loneliness in young people.
Submission Format: Submit a written report (PDF or Word document) that includes all deliverables, code snippets, figures with captions, and documentation of your preprocessing decisions.
Your report should include:
- Code snippets: Relevant code illustrating key steps (not your entire notebook)
- Figures and visualizations: Charts, graphs, and tables at readable sizes
- Figure captions: Descriptive captions for every figure
- Explanations: Clear documentation of steps and justifications for decisions
- Results: Outcomes of each preprocessing step with statistics or summaries
-
Dataset Overview
- Report dimensions (rows × columns)
- Table showing first 5-10 rows
- Summary of data types
-
Initial Data Quality
- Identify the target variable (what we are trying to predict)
- Summary table of missing values per column
Suggested Figures: Sample data table, target distribution bar chart
Important Note: EDA and preprocessing are iterative processes done side-by-side. You need EDA to understand what preprocessing is required, and you need preprocessing to properly analyze some variables. You may find yourself moving back and forth between Parts 2 and 3.
-
Univariate Analysis
- For numerical variables: Include histograms or box plots for key variables
- For categorical variables: Include bar charts showing frequency distributions for key variables
- Caption each figure noting important observations (skewness, outliers, imbalanced categories, etc.)
- Identify which variables have distributions that may require special handling
-
Relationship to Target Variable (Bivariate Analysis)
- Create visualizations exploring the relationship between important variables and the target (loneliness)
- For numerical predictors: Use box plots, violin plots, or scatter plots grouped by target
- For categorical predictors: Use grouped bar charts or stacked bar charts
- Caption each figure explaining what relationship (if any) you observe
-
Correlation Analysis
- Create a correlation heatmap for numerical variables
- Identify highly correlated variable pairs (correlation > 0.7 or < -0.7)
- Discuss potential multicollinearity concerns
- Note which variables appear most correlated with the target
-
Key Findings Summary
- Write 1-2 paragraphs summarizing your most important findings:
- Which variables appear most predictive of loneliness?
- Which variables show little to no relationship with the target?
- Are there any surprising relationships?
- What preprocessing concerns did you identify (outliers, skewness, etc.)?
- Write 1-2 paragraphs summarizing your most important findings:
Suggested Figures:
- Histograms/box plots of key numerical variables
- Bar charts of key categorical variables
- Box plots or violin plots showing variable distributions by target class
- Correlation heatmap
- Scatter plots for interesting relationships
Note: This EDA will inform your preprocessing decisions in Part 3. For example:
- High correlation between variables may lead you to remove redundant features
- Severely skewed distributions may affect your imputation method choice
- Variables showing no relationship to the target may be candidates for removal
- Variable Removal Documentation
- Table listing removed variables with justifications
- Reference your EDA findings when explaining removals
| Variable Name | Reason for Removal |
|---|---|
| respondent_id | Unique identifier, no predictive value |
| variable_x | No correlation with target (see EDA correlation heatmap) |
- Feature Count Summary
- Starting vs. remaining feature count
-
Missing Data Analysis
- Visualization of missing value percentages per column
- Summary of missing values per row
-
Removal Documentation
- Number of columns removed (>10% missing)
- Number of rows removed (>10% missing)
- Before/after dataset dimensions
-
Imputation Strategy Table
| Variable Name | Variable Type | % Missing | Imputation Method | Justification |
|---|---|---|---|---|
| age | Numerical | 3.2% | Median | Data showed right skew in EDA |
- Imputation Results
- Before/after comparison of missing values
Suggested Figures: Missing value bar chart, before/after comparison, distribution plots of imputed values
-
Categorical Variable Identification
- List all categorical variables
- Classify each as nominal or ordinal
-
Encoding Strategy Table
| Variable Name | Variable Type | Encoding Method | Justification | Resulting Columns |
|---|---|---|---|---|
| gender | Nominal | One-Hot | No inherent order | 3 columns |
- Encoding Results
- Feature count before and after encoding
- Example of transformed variable (small sample)
Suggested Figures: Before/after encoding example, feature count comparison chart
-
Split Strategy
- State train-test ratio and random_state value
- Brief justification
-
Dataset Dimensions Table
| Dataset | Samples | Features | % of Total |
|---|---|---|---|
| Training | 850 | 45 | 70% |
| Testing | 365 | 45 | 30% |
- Target Distribution Analysis
- Side-by-side visualizations of target distribution in train vs. test sets
- Caption confirming balanced split
Suggested Figures: Side-by-side bar charts, dimension summary table
- Scaling Decision Table
| Feature Name | Data Type | Scaling Applied? | Justification |
|---|---|---|---|
| age | Continuous | Yes | Wide range observed in EDA |
| gender_male | Binary | No | Already 0/1 |
-
Scaling Methodology
- Confirm scaler fit on training data only
- Explain why (prevents data leakage)
-
Scaling Verification
- Summary statistics showing mean ≈ 0, std ≈ 1 for scaled features
-
Visual Comparison
- Distribution plots for 2-3 features before/after scaling
Suggested Figures: Before/after box plots, summary statistics table
If included: Confirm multiple linear regression model ran without errors.
- Processing Summary Table
| Preprocessing Step | Initial | Final | Change |
|---|---|---|---|
| Samples | 1,500 | 1,215 | -285 |
| Features | 150 | 45 | -105 |
| Missing Values | 8,450 | 0 | All handled |
- Key Insights
- Most important findings from EDA
- Most challenging aspect of preprocessing
- Variables with most missing data
- New features from one-hot encoding
- Any concerns about final dataset
- How EDA informed your preprocessing decisions
- Length: 5 is ideal, but can be 3-7 pages (including figures)
- Font: 11-12 point, readable font
- Margins: 1-inch on all sides
- File format: PDF or .docx
- Figures: Readable size with numbered captions, does not take up excessive space
- Code: Only relevant snippets, properly formatted
- Writing: Clear, professional, concise
- All sections included with deliverables
- EDA section includes sufficient visualizations and analysis
- EDA findings referenced in preprocessing decisions
- All figures captioned and referenced in text
- All tables clearly formatted
- Justifications for all major decisions
- Well-organized with clear headings
- Proofread for errors
- File named: LastName_FirstName_Preprocessing_Project.pdf
| Category | Criteria | Points |
|---|---|---|
| 1. Data Loading & Exploration | Dataset overview with dimensions, sample, data types | 1 |
| Initial data quality assessment with figures | 1 | |
| 2. Exploratory Data Analysis | Univariate analysis with appropriate visualizations and observations | 2 |
| Relationship to target analysis with clear visualizations | 2 | |
| Correlation analysis with heatmap and interpretation | 1.5 | |
| Key findings summary showing thoughtful analysis | 1.5 | |
| 3. Preprocessing | Discussion of Feature Selection (removed variables) | 2 |
| Missing value analysis with visualizations | 2 | |
| Discussion of imputation strategy | 2 | |
| All categorical variables identified and classified | 1 | |
| Clear justifications for encoding choices | 2 | |
| 4. Train-Test Split | Split strategy explained with appropriate ratio | 3 |
| 5. Feature Scaling | Scaling decision table with justifications | 1 |
| Correct methodology (fit on train, no leakage) | 2 | |
| 6. Report Quality | Well-organized, professionally formatted; Figures/tables with captions, properly sized; Clear writing, free of major errors; Summary with thoughtful insights connecting EDA to preprocessing |
3 |
| 7. Code | Code file is included and works without errors | 3 |
| Total: | 30 |
This assignment was created with the assistance of AI.