A comprehensive computer vision pipeline for analyzing human engagement and behavioral patterns from video data using MediaPipe and machine learning techniques.
- Overview
- Features
- Project Structure
- Installation
- Usage
- Pipeline Workflow
- Data Processing
- Results and Analysis
- Configuration
- Troubleshooting
- Contributing
This project implements an automated system for analyzing human engagement and behavioral patterns from video recordings. It processes 20 participant videos (approximately 10 seconds each) through a multi-stage pipeline that extracts facial landmarks, body pose, and behavioral metrics to classify various engagement states.
The system can detect and classify:
- Gaze Directions: Left, Right, Center, Up, Down
- Engagement States: Drowsiness, Yawning, Hand Raising, Forward Pose
- Emotional Expressions: Happy, Sad, Surprise
- ✅ Multi-modal Analysis: Combines facial landmarks, eye tracking, and body pose detection
- ✅ Real-time Processing: Efficient frame-by-frame video analysis
- ✅ Automated Threshold Optimization: Uses Optuna for hyperparameter tuning
- ✅ Comprehensive Reporting: Generates detailed performance metrics and visualizations
- ✅ Ground Truth Generation: Creates frame-level binary classification labels
- ✅ Data Enrichment: Adds hand-raise detection and pose analysis
- MediaPipe integration for robust facial and pose landmark detection
- Scipy-based geometric calculations for feature extraction
- Scikit-learn for performance evaluation
- Matplotlib/Seaborn for visualization
- Pandas for efficient data manipulation
20-person-engagment-test/
├── README.md # This file
├── CLAUDE.md # Development notes and troubleshooting
├── optimized_thresholds.json # ML-optimized detection thresholds
├── venv/ # Python virtual environment
│
├── videos/ # Input video files
│ ├── 1.mp4 # Subject 1 video
│ ├── 2.mp4 # Subject 2 video
│ └── ... (up to 20.mp4)
│
├── processed_data/ # Stage 1: Raw extracted features
│ ├── 1_data.csv
│ ├── 2_data.csv
│ └── ... (20 files)
│
├── enriched_processed_data/ # Stage 2: Enhanced with pose data
│ ├── 1_data_enriched.csv
│ ├── 2_data_enriched.csv
│ └── ... (20 files)
│
├── Results/ # Analysis outputs and visualizations
│ ├── Global/ # Aggregated performance metrics
│ ├── PerVideoFrames/ # Frame-level analysis
│ └── Subject_X/ # Individual subject results
│
├── abdalla's stuff/ # Ground truth generation utilities
│ ├── generate_ground_truth.py
│ └── ground_truth_output/ # Frame-level binary labels
│
└── Pipeline Scripts:
├── 1_process_videos.py # Extract features from videos
├── 2_optimize_thresholds.py # ML threshold optimization
├── 3_enrich_data.py # Add pose detection
├── 3.5_enrich_all_data.py # Batch enrichment
└── 4_generate_report_figures.py # Analysis and visualization
- Python 3.8+
- OpenCV
- MediaPipe
- Required Python packages (see requirements below)
1. Clone the repository:

   git clone <repository-url>
   cd 20-person-engagment-test

2. Create and activate a virtual environment:

   python3 -m venv venv
   source venv/bin/activate   # On macOS/Linux
   # OR
   venv\Scripts\activate      # On Windows

3. Install dependencies:

   pip install pandas numpy opencv-python mediapipe scipy scikit-learn matplotlib seaborn optuna
Run the complete pipeline in sequence:
# Activate virtual environment
source venv/bin/activate
# Stage 1: Extract features from videos
python 1_process_videos.py
# Stage 2: Optimize detection thresholds
python 2_optimize_thresholds.py
# Stage 3: Enrich data with pose information
python 3_enrich_data.py
# Stage 4: Generate analysis reports
python 4_generate_report_figures.py

Stage 1 (1_process_videos.py) processes the video files and extracts facial landmarks, eye aspect ratios, and gaze metrics; a short sketch of this loop follows the output list below.

python 1_process_videos.py

Outputs: CSV files in processed_data/ containing:
- Eye Aspect Ratio (EAR) for drowsiness detection
- Mouth Aspect Ratio (MAR) for yawning detection
- Gaze direction ratios
- Head pose angles
- Facial expression metrics
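The extraction script itself is not reproduced in this README. The following is a minimal sketch of the per-frame loop, assuming MediaPipe FaceMesh, OpenCV, and pandas; a placeholder feature stands in for the full EAR/MAR/gaze/head-pose calculations described in the feature section further down.

```python
# Minimal sketch of the Stage 1 frame loop (illustrative, not the actual
# 1_process_videos.py). Assumes MediaPipe FaceMesh with iris refinement.
import cv2
import mediapipe as mp
import pandas as pd

def process_video(video_path: str, out_csv: str) -> None:
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True
    )
    cap = cv2.VideoCapture(video_path)
    rows, frame_idx = [], 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_face_landmarks:
            lm = result.multi_face_landmarks[0].landmark
            # The real script would compute EAR, MAR, gaze ratios, and head pose
            # here; as a placeholder we record one landmark's normalized position.
            rows.append({"frame": frame_idx, "nose_x": lm[1].x, "nose_y": lm[1].y})
        frame_idx += 1
    cap.release()
    face_mesh.close()
    pd.DataFrame(rows).to_csv(out_csv, index=False)

# Example: process_video("videos/1.mp4", "processed_data/1_data.csv")
```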
Stage 2 (2_optimize_thresholds.py) uses Optuna to find the detection thresholds that maximize accuracy; a minimal optimization sketch follows the output list below.

python 2_optimize_thresholds.py

Outputs: optimized_thresholds.json with ML-tuned parameters for:
- Drowsiness detection (EAR threshold, frame count)
- Yawning detection (MAR threshold, frame count)
- Gaze direction thresholds
- Head pose boundaries
- Hand raise sensitivity
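2_optimize_thresholds.py is not shown in this README; the sketch below only illustrates the general Optuna pattern. The scoring function is a toy placeholder standing in for the real comparison against ground-truth labels, and the search ranges are chosen purely for illustration.

```python
# Illustrative Optuna threshold search (not the repository script).
import json
import optuna

def score_against_ground_truth(params: dict) -> float:
    # Placeholder so the sketch runs end-to-end; the real objective would apply
    # `params` to the extracted features and return accuracy vs. ground truth.
    return 1.0 - abs(params["ear_thresh"] - 0.3)

def objective(trial: optuna.Trial) -> float:
    params = {
        "ear_thresh": trial.suggest_float("ear_thresh", 0.1, 0.5),
        "drowsy_frames": trial.suggest_int("drowsy_frames", 5, 90),
        "mar_thresh": trial.suggest_float("mar_thresh", 0.2, 0.9),
        "yawn_frames": trial.suggest_int("yawn_frames", 3, 30),
        "hand_raise_thresh": trial.suggest_float("hand_raise_thresh", 0.0, 1.0),
    }
    return score_against_ground_truth(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
with open("optimized_thresholds.json", "w") as f:
    json.dump(study.best_params, f, indent=2)
```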
Stage 3 (3_enrich_data.py) adds hand-raise detection using pose estimation.

python 3_enrich_data.py

Outputs: enhanced CSV files in enriched_processed_data/ with an additional hand_raise_metric column.
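The precise hand_raise_metric definition is not given in this README. One plausible formulation, assuming MediaPipe Pose landmarks in normalized image coordinates (where y grows downward), is the vertical offset of the higher wrist above its shoulder:

```python
# One possible hand-raise metric from MediaPipe Pose landmarks (illustrative;
# the actual definition in 3_enrich_data.py may differ).
import mediapipe as mp

mp_pose = mp.solutions.pose

def hand_raise_metric(pose_landmarks) -> float:
    """Positive when a wrist is above its shoulder; image y grows downward,
    so 'above' means a smaller y value. `pose_landmarks` is the
    results.pose_landmarks object returned by mp_pose.Pose().process()."""
    lm = pose_landmarks.landmark
    left = lm[mp_pose.PoseLandmark.LEFT_SHOULDER].y - lm[mp_pose.PoseLandmark.LEFT_WRIST].y
    right = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER].y - lm[mp_pose.PoseLandmark.RIGHT_WRIST].y
    return max(left, right)
```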
Stage 4 (4_generate_report_figures.py) creates comprehensive analysis reports and visualizations; a plotting sketch follows the output list below.

python 4_generate_report_figures.py

Outputs:
- Performance metrics for each behavioral class
- Confusion matrices
- Per-subject and global accuracy reports
- ROC curves and statistical summaries
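The report script is not reproduced here. Below is a minimal sketch of how one of the confusion-matrix figures could be generated with scikit-learn and seaborn, assuming per-frame binary label and prediction arrays for a single class.

```python
# Illustrative confusion-matrix figure for a single behavioral class.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, class_name: str, out_png: str) -> None:
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["Absent", "Present"],
                yticklabels=["Absent", "Present"])
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title(f"{class_name} confusion matrix")
    plt.savefig(out_png, dpi=150, bbox_inches="tight")
    plt.close()

# Example: plot_confusion(gt, pred, "Drowsy", "Results/Global/Global_Drowsy_confusion_matrix.png")
```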
Generate frame-level binary classification labels:

cd "abdalla's stuff"
python generate_ground_truth.py

Outputs: binary classification files in ground_truth_output/, each containing a 14-dimensional binary vector per frame covering the engagement states listed below.
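The exact column order and file layout in ground_truth_output/ are not documented in this section; the example below only illustrates the idea of a 14-dimensional binary vector per frame, using assumed column names derived from the scenario list further down.

```python
# Illustrative frame-level ground-truth row: one binary flag per scenario.
# Column names and their order are assumptions for illustration only.
import pandas as pd

SCENARIOS = [
    "gaze_left", "gaze_right", "gaze_center",
    "looking_left", "looking_right", "looking_up", "looking_down",
    "drowsy", "yawning", "forward_pose", "hand_raise",
    "emotion_happy", "emotion_sad", "emotion_surprise",
]

# Example: a frame labeled as centered gaze with a raised hand.
row = {name: 0 for name in SCENARIOS}
row["gaze_center"] = 1
row["hand_raise"] = 1
print(pd.DataFrame([row]).to_string(index=False))
```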
- Input: 20 MP4 video files (numbered 1.mp4 through 20.mp4)
- Processing:
- MediaPipe facial landmark detection
- Eye aspect ratio calculation
- Mouth aspect ratio calculation
- Gaze direction estimation
- Head pose angle extraction
- Output: Raw feature CSV files
- Input: Raw feature data + ground truth scenarios
- Processing:
- Data cleaning (outlier removal, trimming)
- Optuna-based hyperparameter optimization
- Cross-validation for robust threshold selection
- Output: Optimized threshold parameters
- Input: Raw features + original videos
- Processing:
- MediaPipe pose detection
- Hand-raise metric calculation
- Feature augmentation
- Output: Enriched feature datasets
- Input: Enriched data + optimized thresholds
- Processing:
- Binary classification application
- Performance metric calculation
- Visualization generation
- Output: Comprehensive analysis reports
The system recognizes 14 distinct behavioral scenarios:
- Gaze Left: Eyes directed left
- Gaze Right: Eyes directed right
- Gaze Center: Forward-looking gaze
- Looking Left: Head turned left
- Looking Right: Head turned right
- Looking Up: Head tilted up
- Looking Down: Head tilted down
- Drowsy: Extended eye closure (EAR < threshold for N frames)
- Yawning: Mouth opening pattern (MAR > threshold for N frames)
- Forward Pose: Attentive upright posture
- Hand Raise: Hand positioned above shoulder level
- Emotion Happy: Smile detection
- Emotion Sad: Downward facial expression
- Emotion Surprise: Wide eyes and open mouth
- Eye Aspect Ratio (EAR): (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|)
- Mouth Aspect Ratio (MAR): |top - bottom| / |left - right|
- Gaze Ratio: (iris_x - eye_corner_left) / eye_width
- Head Pose: Euler angles from facial landmark geometry
- Hand Raise: normalized hand landmark position relative to the shoulders
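A small sketch of the ratio calculations above, assuming 2D landmark points given as (x, y) tuples; the landmark indices used by the actual scripts are not shown in this README.

```python
# Illustrative implementations of the feature ratios above.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    # EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|)
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

def mouth_aspect_ratio(top, bottom, left, right):
    # MAR = |top - bottom| / |left - right|
    return dist(top, bottom) / dist(left, right)

def gaze_ratio(iris_x, eye_corner_left_x, eye_width):
    # ~0 when the iris sits at the left corner, ~1 at the right corner
    return (iris_x - eye_corner_left_x) / eye_width
```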
- Temporal Trimming: Removes first/last 10% of each scenario segment
- Outlier Removal: IQR-based filtering (3σ threshold)
- Missing Data Handling: Forward-fill and interpolation
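A hedged sketch of these cleaning steps, assuming each scenario segment is a pandas DataFrame of per-frame features; the trim fraction and outlier rule follow the description above, but the actual scripts may differ.

```python
# Illustrative cleaning steps for one scenario segment (a DataFrame of frames).
import pandas as pd

def clean_segment(df: pd.DataFrame, trim_frac: float = 0.10) -> pd.DataFrame:
    # Temporal trimming: drop the first and last 10% of frames in the segment.
    n = len(df)
    df = df.iloc[int(n * trim_frac): n - int(n * trim_frac)].copy()

    # Outlier removal: mask values far outside the interquartile range.
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - 3 * iqr) | (numeric > q3 + 3 * iqr)  # multiplier is an assumption
    df[numeric.columns] = numeric.mask(mask)

    # Missing data: forward-fill, then interpolate anything still missing.
    df[numeric.columns] = df[numeric.columns].ffill().interpolate()
    return df
```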
- Global_PerVideo_Accuracy.csv: per-subject accuracy summary
- Global_[Class]_performance_metrics.csv: precision, recall, and F1-score per class
- Global_[Class]_confusion_matrix.png: classification confusion matrices
- Individual performance breakdowns
- Subject-specific behavioral patterns
- Temporal analysis of engagement states
- Frame-by-frame classification outputs
- Temporal engagement trajectories
- Detailed behavioral timelines
The system evaluates performance using:
- Accuracy: Overall classification correctness
- Precision: Positive prediction accuracy
- Recall: True positive detection rate
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under receiver operating characteristic curve
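These metrics map directly onto scikit-learn. A sketch for one class, assuming frame-level binary labels and, for ROC-AUC, a continuous score such as the raw metric before thresholding:

```python
# Computing the evaluation metrics for one behavioral class with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_class(y_true, y_pred, y_score):
    """y_true/y_pred are binary frame labels; y_score is a continuous score
    (e.g., the raw EAR/MAR-based metric) used for ROC-AUC."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score),
    }
```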
Based on optimized thresholds, the system achieves:
- Drowsiness Detection: ~85-90% accuracy
- Gaze Direction: ~80-85% accuracy
- Hand Raise: ~75-80% accuracy
- Emotional States: ~70-75% accuracy
Key detection thresholds (auto-optimized):
{
"ear_thresh": 0.358, // Eye closure threshold
"drowsy_frames": 52, // Consecutive frames for drowsiness
"mar_thresh": 0.400, // Mouth opening threshold
"yawn_frames": 5, // Consecutive frames for yawning
"hand_raise_thresh": 0.249, // Hand position threshold
"gaze_left_thresh": 0.445, // Left gaze boundary
"gaze_right_thresh": 0.573, // Right gaze boundary
"pose_forward_thresh": 0.174, // Forward pose angle
"smile_thresh": -0.001 // Smile detection sensitivity
}
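Note that the // comments above are annotations only; a real optimized_thresholds.json would contain plain JSON. The sketch below shows how these thresholds might be loaded and applied to an enriched CSV. The column names (ear, mar, hand_raise_metric) and the consecutive-frame rule are assumptions based on the descriptions above, not the repository's exact code.

```python
# Illustrative application of optimized_thresholds.json to an enriched CSV.
# Column names and the consecutive-frame rule are assumptions.
import json
import pandas as pd

with open("optimized_thresholds.json") as f:
    t = json.load(f)

df = pd.read_csv("enriched_processed_data/1_data_enriched.csv")

# Drowsy: EAR below threshold for at least `drowsy_frames` consecutive frames.
eyes_closed = (df["ear"] < t["ear_thresh"]).astype(int)
df["drowsy"] = eyes_closed.rolling(t["drowsy_frames"]).min().fillna(0).astype(bool)

# Yawning: MAR above threshold for at least `yawn_frames` consecutive frames.
mouth_open = (df["mar"] > t["mar_thresh"]).astype(int)
df["yawning"] = mouth_open.rolling(t["yawn_frames"]).min().fillna(0).astype(bool)

# Hand raise: single-frame threshold on the enrichment metric.
df["hand_raise"] = df["hand_raise_metric"] > t["hand_raise_thresh"]
```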