- Overview
- Demo & Results
- System Architecture
- Action Classes & Key Mapping
- Dataset Pipeline
- Model Architecture & Training
- Real-Time Inference Pipeline
- Performance Metrics
- Project Structure
- Installation & Setup
- How to Run
- Technical Details
- Requirements
A production-grade, real-time AI system that enables touchless game control through full-body gesture recognition. The system uses a deep learning pipeline combining AI-powered person tracking, body pose estimation, and a sequential deep learning model trained on tens of thousands of action sequences, allowing a player to control a game character using nothing but their body movements.
No controller. No keyboard. Your body IS the controller.
| Feature | Detail |
|---|---|
| Test Accuracy | 99.69% on held-out test set |
| Inference | Sub-100 ms end-to-end real-time latency |
| Pose Points | 33 full-body skeletal landmarks (MediaPipe) |
| Action Classes | 5: Jump, Kick, Punch, MoveForward, MoveBackward |
| Dataset Size | 40,865 labeled sequences of shape (30, 132) |
| Model | Stacked LSTM + BatchNorm + Dense layers |
| Smoothing | Majority voting over last 5 predictions |
Confusion matrix (rows = actual, columns = predicted):

```
              Jump  Kick  MoveBackward  MoveForward  Punch
Jump          1482     0             0            4      0
Kick             1  2138             4            2      0
MoveBackward     0     0          1724           10      0
MoveForward      0     0             1         1940      0
Punch            0     1             2            0    864
```
```
              precision    recall  f1-score   support

Jump               1.00      1.00      1.00      1486
Kick               1.00      1.00      1.00      2145
MoveBackward       1.00      0.99      1.00      1734
MoveForward        0.99      1.00      1.00      1941
Punch              1.00      1.00      1.00       867

accuracy                               1.00      8173
macro avg          1.00      1.00      1.00      8173
weighted avg       1.00      1.00      1.00      8173
```
```
REAL-TIME ACTION RECOGNITION PIPELINE

Webcam Feed
    │
    ▼
YOLO Person Tracking ───► Bounding Box + Track ID
    │
    ▼
MediaPipe Pose Estimation ───► 33 Skeletal Landmarks
    │                          (x, y, z, visibility × 33 = 132)
    ▼
Hip Centralization ───► Normalize relative to hip center
    │
    ▼
Sliding Window ───► 30-frame temporal sequence
    │
    ▼
LSTM Deep Learning ───► 5-class action prediction
    │
    ▼
Majority Voting ───► Smoothed prediction (k=5 frames)
    │
    ▼
Virtual Keyboard ───► Game Control (↑ → ← C V)
```
| Body Action | Key Triggered | Game Use |
|---|---|---|
| Jump | ↑ (Up Arrow) | Character jump |
| Punch | C | Attack / punch move |
| Kick | V | Kick attack |
| MoveForward | → (Right Arrow) | Move character right |
| MoveBackward | ← (Left Arrow) | Move character left |
The system includes an ActionMemory manager that:
- Tracks the previously executed action
- Mode 1 (Block Repeat): prevents re-triggering the same action until a new one is detected, avoiding button spamming
- Mode 2 (Allow Repeat): executes every valid prediction continuously, suited for movement actions
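A minimal sketch of this logic (the ActionMemory name and its two modes come from the description above; the method and attribute names are assumptions):

```python
class ActionMemory:
    """Remembers the last executed action and gates repeats."""

    def __init__(self, allow_repeat=False):
        self.allow_repeat = allow_repeat  # Mode 2 when True
        self.previous = None

    def should_execute(self, action):
        # Mode 2 (Allow Repeat): fire on every valid prediction
        if self.allow_repeat:
            self.previous = action
            return True
        # Mode 1 (Block Repeat): fire only when the action changes
        if action != self.previous:
            self.previous = action
            return True
        return False
```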
NTU RGB+D Action Recognition Dataset
Academic access required; submit a request via the official portal.
```python
# Action code → label mapping used:
action_map = {
    "A024": "Kick",          # Kick (NTU class 24)
    "A051": "Kick",          # Kick (NTU class 51)
    "A050": "Punch",         # Punch
    "A026": "Jump",          # Jump up
    "A027": "Jump",          # Jump down
    "A059": "MoveForward",   # Walking forward
    "A060": "MoveBackward",  # Walking backward
}
```

- Total source videos: 2,880
- Videos extracted (selected actions): 336 → 800+ after multi-session accumulation
For each video, the pipeline:
- Runs YOLO tracking to isolate the target person (user manually selects Track ID)
- Crops the person region per frame
- Runs MediaPipe Pose on the crop and extracts 33 landmarks × 4 values (x, y, z, visibility) = 132 features
- Saves extracted pose sequences to `mediapipe_pose.csv` incrementally
Robust tracking features:
- IoU-based re-identification across frames
- Predicted bounding box interpolation during occlusions
- Expansion of search area for lost tracks
- Full-frame fallback detection (up to 30 frames without detection)
- Skip tracking (Space = irrelevant action, ESC = skip video)
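The IoU-based re-identification step compares each new detection against the last known box of the tracked person; a minimal overlap score might look like this (the `iou` function name and the `(x1, y1, x2, y2)` box format are assumptions):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection whose IoU with the previous box exceeds a threshold is treated as the same person, which is what allows tracking to survive brief occlusions.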
```
# Per-video processing:
# 1. Extract pose joint array: shape (frames, 132)
# 2. Hip centralization: subtract hip midpoint from all joints (x, y, z)
# 3. Sliding window with step=1: shape (N_windows, 30, 132)

# Final saved dataset:
sequences: shape (40865, 30, 132)
labels:    shape (40865,)   # action class strings
video_ids: shape (40865,)   # source video filename
```

Final Dataset Stats:
| Split | Samples |
|---|---|
| Train (64%) | ~26,154 |
| Validation (16%) | ~6,538 |
| Test (20%) | 8,173 |
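The step-1 sliding window above can be sketched in NumPy (the `sliding_windows` helper name is an assumption):

```python
import numpy as np

def sliding_windows(frames, window=30, step=1):
    """Turn a (n_frames, 132) pose array into (n_windows, window, 132)."""
    if len(frames) < window:
        return np.empty((0, window, frames.shape[1]))
    return np.stack([frames[i:i + window]
                     for i in range(0, len(frames) - window + 1, step)])
```

With step=1, a 40-frame video yields 11 overlapping 30-frame windows; this dense overlap is how a few hundred videos expand into tens of thousands of training sequences.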
Input Shape: (30, 132) = 30 time steps × 132 features
```
        │
        ▼
┌──────────────────────────────┐
│ LSTM(128, return_sequences)  │ ← Temporal pattern capture
│ BatchNormalization           │
│ Dropout(0.3)                 │
├──────────────────────────────┤
│ LSTM(64)                     │ ← Sequence summary
│ BatchNormalization           │
│ Dropout(0.3)                 │
├──────────────────────────────┤
│ Dense(128, ReLU)             │
│ BatchNormalization           │
│ Dropout(0.3)                 │
├──────────────────────────────┤
│ Dense(64, ReLU)              │
│ BatchNormalization           │
│ Dropout(0.3)                 │
├──────────────────────────────┤
│ Dense(5, Softmax)            │ ← 5-class output
└──────────────────────────────┘
```
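The stack above could be built in Keras roughly as follows (a sketch: the layer sizes, optimizer, and loss come from this README's tables, while `build_model` is an assumed helper name):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Input, LSTM, BatchNormalization,
                                     Dropout, Dense)

def build_model(n_classes=5, timesteps=30, features=132):
    """Stacked LSTM + BatchNorm + Dense classifier over pose sequences."""
    model = Sequential([
        Input(shape=(timesteps, features)),
        LSTM(128, return_sequences=True),   # Temporal pattern capture
        BatchNormalization(),
        Dropout(0.3),
        LSTM(64),                           # Sequence summary
        BatchNormalization(),
        Dropout(0.3),
        Dense(128, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(n_classes, activation="softmax"),  # 5-class output
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```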
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss Function | Sparse Categorical Crossentropy |
| Epochs | 100 (max) |
| Early Stopping Patience | 10 epochs |
| Batch Size | 32 |
| Best Epoch | 36 |
| Total Epochs Run | 46 (early stopped) |
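The early-stopping setup in the table could be configured roughly as follows (a sketch; `restore_best_weights` and the checkpoint path are assumptions based on the saved files listed later in this README):

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop after 10 epochs without val_loss improvement
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    # Keep only the best model seen so far
    ModelCheckpoint("Models/best_lstm_model.keras",
                    monitor="val_loss", save_best_only=True),
]

# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=100, batch_size=32,
#           callbacks=callbacks)
```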
| Epoch | Train Acc | Val Acc | Val Loss |
|---|---|---|---|
| 1 | 62.85% | 75.55% | 0.6471 |
| 4 | 89.78% | 91.60% | 0.2390 |
| 8 | 94.76% | 97.49% | 0.0858 |
| 17 | 97.56% | 98.69% | 0.0419 |
| 25 | 98.25% | 99.31% | 0.0267 |
| 36 | 99.05% | 99.53% | 0.0150 (Best) |
| 46 | 99.13% | 98.65% | 0.0453 (Early Stop) |
- Camera Init: opens the webcam at 1280×720
- Person Selection: YOLO detects and tracks all people; the user types a Track ID + Enter
- Pose Loop: for every frame of the selected person:
  - Crop the bounding box
  - Extract 33 MediaPipe landmarks (132 values)
  - Append to a rolling deque (max=60 frames)
- Prediction: when the deque holds ≥ 30 frames:
  - Take the last 30 frames
  - Apply hip centralization
  - Run through the LSTM model
  - Apply a confidence threshold (≥ 0.5)
- Smoothing: majority vote over the last 5 predictions
- Key Press: map the smoothed action to a virtual key press via `pynput`
- Memory Check: ActionMemory decides whether to execute or block
```
┌────────────────────────────────┐
│ Prediction: Jump               │
│ Confidence: 0.97               │
│ Previous:   Kick               │
│ Actions Exec: 42               │
│                                │
│ [Tracking box + pose skeleton] │
└────────────────────────────────┘
```
```
┌────────────────────────────────────────────────────┐
│             Model Performance Summary              │
├────────────────────────────────────────────────────┤
│ Test Accuracy          ████████████████████ 99.69% │
│ Train Accuracy (best)  ████████████████████ 99.05% │
│ Val Accuracy (best)    ████████████████████ 99.53% │
│ Test Loss              0.009 (extremely low)       │
│ Inference Latency      <100 ms (real-time)         │
└────────────────────────────────────────────────────┘
```
Per-Class Performance:
| Action | Precision | Recall | F1-Score | Test Support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1,486 |
| Kick | 1.00 | 1.00 | 1.00 | 2,145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1,734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1,941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| Overall | 1.00 | 1.00 | 1.00 | 8,173 |
```
Real_Time_Game_Control/
│
├── code/
│   ├── dataset_Preparation.ipynb   # Full data pipeline
│   ├── lstm_training.ipynb         # Model training & evaluation
│   └── real_time_model.ipynb       # Live inference + game control
│
├── Dataset/
│   ├── nturgb+d_rgb/               # Raw NTU RGB+D videos (not tracked)
│   ├── video/                      # Filtered action videos
│   ├── mediapipe_pose.csv          # Extracted pose data
│   ├── yolo_pose.csv               # YOLO-extracted pose data
│   ├── skipped_videos.csv          # Skipped video log
│   └── mediapipe_dataset_132.npz   # Final training dataset
│
├── Models/
│   ├── best_lstm_model.keras       # Best saved model (Keras format)
│   ├── best_lstm_model.h5          # Best saved model (H5 format)
│   ├── label_encoder.pkl           # Sklearn LabelEncoder
│   ├── class_info.pkl              # Class names & mappings
│   ├── yolo11n-pose.pt             # YOLO pose model (nano)
│   └── yolo11s-pose.pt             # YOLO pose model (small)
│
└── README.md
```
- Python 3.10+
- CUDA-compatible GPU (recommended for training)
- Webcam (for real-time inference)
```bash
git clone https://github.com/uqasha524/real-time-action-recognition.git
cd real-time-action-recognition
```

```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
```

```bash
pip install tensorflow
pip install torch torchvision
pip install ultralytics       # YOLO
pip install mediapipe
pip install opencv-python
pip install scikit-learn
pip install pandas numpy
pip install matplotlib seaborn
pip install pynput            # Virtual keyboard control
pip install jupyter
```

```python
from ultralytics import YOLO

# Auto-downloads on first use
model = YOLO('yolo11n-pose.pt')  # Nano (faster)
model = YOLO('yolo11s-pose.pt')  # Small (more accurate)
```

Request access to the NTU RGB+D Dataset at:
Submit an academic request form; approval typically takes 1-3 days.
If you already have the trained model files in /Models/:

```bash
jupyter notebook code/real_time_model.ipynb
```

- Run all cells
- A webcam window will open
- Type the Track ID of the person you want to track, then press Enter
- Start performing gestures; game keys will be triggered automatically!
Step 1: Dataset Preparation
```bash
jupyter notebook code/dataset_Preparation.ipynb
```

- Update `source_dir` to your NTU RGB+D dataset path
- Run the extraction cell and manually select the person Track ID for each video
- Press Space to skip irrelevant videos, ESC to skip during tracking
- Wait for `mediapipe_pose.csv` and `mediapipe_dataset_132.npz` to be generated
Step 2: Train the Model
```bash
jupyter notebook code/lstm_training.ipynb
```

- Run all cells
- Training takes ~30-60 minutes on GPU (46 epochs ran, best at epoch 36)
- The model is saved automatically to `/Models/`
Step 3: Real-Time Inference
```bash
jupyter notebook code/real_time_model.ipynb
```

All joint coordinates are normalized relative to the hip midpoint before training and inference:

```python
# MediaPipe Pose hip landmark indices: LEFT_HIP = 23, RIGHT_HIP = 24
hip_x = (frame[LEFT_HIP*4]   + frame[RIGHT_HIP*4])   / 2
hip_y = (frame[LEFT_HIP*4+1] + frame[RIGHT_HIP*4+1]) / 2
hip_z = (frame[LEFT_HIP*4+2] + frame[RIGHT_HIP*4+2]) / 2

# Subtract hip from all joints (x, y, z only; visibility unchanged)
frame[0::4] -= hip_x  # All X coordinates
frame[1::4] -= hip_y  # All Y coordinates
frame[2::4] -= hip_z  # All Z coordinates
```

This makes the model position-invariant: the same action is recognized regardless of where the person stands in the frame.
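Applied to a whole window at once, the same normalization can be vectorized with NumPy (a sketch; the `centralize` helper name is an assumption, while 23 and 24 are MediaPipe's hip landmark indices):

```python
import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24  # MediaPipe Pose landmark indices

def centralize(window):
    """Subtract the hip midpoint from every joint in a (frames, 132) array."""
    w = window.reshape(len(window), 33, 4).copy()
    hip = (w[:, LEFT_HIP, :3] + w[:, RIGHT_HIP, :3]) / 2  # (frames, 3)
    w[:, :, :3] -= hip[:, None, :]  # broadcast over all 33 joints
    return w.reshape(len(window), 132)
```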
```python
# Window size: 30 frames
# Step size:   1 (maximum overlap for dense training data)
# Each window: shape (30, 132)

# For real-time:
from collections import deque

pose_sequence = deque(maxlen=60)  # Rolling buffer
# Predict on the last 30 frames every frame
```

```python
from statistics import mean

def smooth_predictions(predictions, k=5):
    """Majority vote over the last k (label, confidence) pairs."""
    recent = predictions[-k:]
    labels = [p[0] for p in recent]
    majority = max(set(labels), key=labels.count)  # Majority vote
    conf = mean([p[1] for p in recent if p[0] == majority])
    return majority, conf
```

```python
CONF_THRESHOLD = 0.5
# Only actions with softmax probability >= 0.5 are considered valid
# Others are labeled "Low Confidence" and not mapped to any key
```

```python
import os

# Filename format: ActionName_SeqNumber.avi
# e.g.: Jump_701.avi         → label = "Jump"
#       MoveBackward_123.avi → label = "MoveBackward"
def extract_action_label(filename):
    parts = os.path.splitext(filename)[0].split("_")
    if parts[-1].isdigit():
        return "_".join(parts[:-1])  # Remove trailing sequence number
    return "_".join(parts)
```

```
tensorflow>=2.12
torch>=2.0
ultralytics>=8.0
mediapipe>=0.10
opencv-python>=4.8
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
pynput>=1.7
jupyter>=1.0
```
- Add more action classes (Duck, Roll, Sprint, Block)
- Replace LSTM with Transformer-based architecture (e.g., TimeSformer)
- Deploy as standalone executable (no Jupyter required)
- Add multi-player support (track multiple IDs simultaneously)
- Train on custom game-specific gesture sets
- Export to ONNX for faster CPU inference
This project is for academic and research purposes. The NTU RGB+D dataset is subject to its own license terms from NTU Singapore.