# RealTime-GameActions-UsingLSTM
## 🎯 Overview

A real-time AI system that enables touchless game control through full-body gesture recognition. The pipeline combines YOLO person tracking, MediaPipe body-pose estimation, and an LSTM sequence model trained on tens of thousands of action sequences, allowing a player to control a game character using nothing but their body movements.

No controller. No keyboard. Your body IS the controller.

### 🔑 Key Highlights

| Feature | Detail |
|---|---|
| 🎯 Test Accuracy | 99.69% on the held-out test set |
| ⚡ Inference | Sub-100 ms end-to-end real-time latency |
| 🦴 Pose Points | 33 full-body skeletal landmarks (MediaPipe) |
| 🎮 Action Classes | 5: Jump, Kick, Punch, MoveForward, MoveBackward |
| 📊 Dataset Size | 40,865 labeled sequences of shape (30, 132) |
| 🧠 Model | Stacked LSTM + BatchNorm + Dense layers |
| 🔁 Smoothing | Majority vote over the last 5 predictions |

## 🎬 Demo & Results

### Confusion Matrix (Test Set, 8,173 samples)

```
              Jump    Kick  MoveBackward  MoveForward  Punch
Jump          1482       0             0            4      0
Kick             1    2138             4            2      0
MoveBackward     0       0          1724           10      0
MoveForward      0       0             1         1940      0
Punch            0       1             2            0    864
```

### Classification Report

```
              precision    recall  f1-score   support
        Jump       1.00      1.00      1.00      1486
        Kick       1.00      1.00      1.00      2145
MoveBackward       1.00      0.99      1.00      1734
 MoveForward       0.99      1.00      1.00      1941
       Punch       1.00      1.00      1.00       867

    accuracy                           1.00      8173
   macro avg       1.00      1.00      1.00      8173
weighted avg       1.00      1.00      1.00      8173
```

๐Ÿ—๏ธ System Architecture

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘                REAL-TIME ACTION RECOGNITION PIPELINE                 โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘                                                                      โ•‘
โ•‘  ๐Ÿ“ท Webcam Feed                                                      โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  ๐Ÿ” YOLO Person Tracking  โ”€โ”€โ–บ  Bounding Box + Track ID              โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  ๐Ÿฆด MediaPipe Pose Estimation  โ”€โ”€โ–บ  33 Skeletal Landmarks           โ•‘
โ•‘       โ”‚                         (x, y, z, visibility ร— 33 = 132)   โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  โš–๏ธ  Hip Centralization   โ”€โ”€โ–บ  Normalize relative to hip center     โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  ๐ŸชŸ Sliding Window        โ”€โ”€โ–บ  30-frame temporal sequence           โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  ๐Ÿง  LSTM Deep Learning    โ”€โ”€โ–บ  5-class action prediction            โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  ๐Ÿ—ณ๏ธ  Majority Voting      โ”€โ”€โ–บ  Smoothed prediction (k=5 frames)    โ•‘
โ•‘       โ”‚                                                              โ•‘
โ•‘       โ–ผ                                                              โ•‘
โ•‘  โŒจ๏ธ  Virtual Keyboard     โ”€โ”€โ–บ  Game Control (โ†‘ โ†’ โ† C V)            โ•‘
โ•‘                                                                      โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

## 🎮 Action Classes & Key Mapping

| 🕹️ Body Action | ⌨️ Key Triggered | 🎯 Game Use |
|---|---|---|
| 🦘 Jump | ↑ (Up Arrow) | Character jump |
| 👊 Punch | C | Attack / punch move |
| 🦵 Kick | V | Kick attack |
| 🏃 MoveForward | → (Right Arrow) | Move character right |
| ⬅️ MoveBackward | ← (Left Arrow) | Move character left |

### Action Memory System

The system includes an ActionMemory manager that:

- Tracks the previously executed action
- **Mode 1 (Block Repeat):** prevents re-triggering the same action until a new one is detected, avoiding button spamming
- **Mode 2 (Allow Repeat):** executes every valid prediction continuously, suited for movement actions
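The ActionMemory implementation itself is not shown in this README; a minimal sketch of the two modes described above (class and method names here are illustrative, not the repository's actual code) could look like:

```python
class ActionMemory:
    """Remembers the last executed action and gates re-execution.

    block_repeat=True  -> Mode 1: suppress repeats until a new action appears.
    block_repeat=False -> Mode 2: execute every valid prediction.
    """

    def __init__(self, block_repeat: bool = True):
        self.block_repeat = block_repeat
        self.previous = None

    def should_execute(self, action):
        if action is None:
            return False
        if self.block_repeat and action == self.previous:
            return False  # Same action still held: block the re-trigger
        self.previous = action
        return True
```

In Mode 1 a held Jump fires once until a different action is detected; in Mode 2 repeated MoveForward predictions keep firing, which is what continuous movement needs.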

## 📊 Dataset Pipeline

### Source Dataset

NTU RGB+D Action Recognition Dataset

Academic access required; submit a request via the official portal.

### Step 1: Raw Video Extraction

```python
# Action code → label mapping used:
action_map = {
    "A024": "Kick",        # Kick (NTU class 24)
    "A051": "Kick",        # Kick (NTU class 51)
    "A050": "Punch",       # Punch
    "A026": "Jump",        # Jump up
    "A027": "Jump",        # Jump down
    "A059": "MoveForward", # Walking forward
    "A060": "MoveBackward",# Walking backward
}
```

- Total source videos: 2,880
- Videos extracted (selected actions): 336, growing to 800+ after multi-session accumulation

### Step 2: Pose Extraction (Dual-Model)

For each video, the pipeline:

1. Runs YOLO tracking to isolate the target person (the user manually selects a Track ID)
2. Crops the person region per frame
3. Runs MediaPipe Pose on the crop and extracts 33 landmarks × 4 values (x, y, z, visibility) = 132 features
4. Saves the extracted pose sequences to `mediapipe_pose.csv` incrementally

Robust tracking features:

- IoU-based re-identification across frames
- Predicted bounding-box interpolation during occlusions
- Expansion of the search area for lost tracks
- Full-frame fallback detection (up to 30 frames without detection)
- Skip controls (Space = irrelevant action, ESC = skip video)
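The 33 × 4 → 132 flattening in step 3 is just a reshape; a small NumPy sketch (with dummy landmark values standing in for MediaPipe output):

```python
import numpy as np

# Dummy MediaPipe output: 33 landmarks, each (x, y, z, visibility)
landmarks = np.random.rand(33, 4)

# Flatten to the 132-value per-frame feature vector:
# [x0, y0, z0, v0, x1, y1, z1, v1, ...]
frame_features = landmarks.reshape(-1)
assert frame_features.shape == (132,)

# Strided slices recover each channel, which is exactly the frame[0::4]
# indexing that hip centralization relies on later:
assert np.allclose(frame_features[0::4], landmarks[:, 0])  # all x values
assert np.allclose(frame_features[3::4], landmarks[:, 3])  # all visibilities
```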

### Step 3: Dataset Assembly (NPZ Format)

```python
# Per-video processing:
# 1. Extract pose joint array: shape (frames, 132)
# 2. Hip centralization: subtract the hip midpoint from all joints (x, y, z)
# 3. Sliding window with step=1: shape (N_windows, 30, 132)

# Final saved dataset:
# sequences: shape (40865, 30, 132)
# labels:    shape (40865,)           # action class strings
# video_ids: shape (40865,)           # source video filename
```
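The step-1 sliding window can be sketched with NumPy; for an N-frame video it yields N - 29 overlapping windows, which is how a few hundred videos expand into 40,865 sequences (dummy data below):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

frames = np.random.rand(100, 132)     # one video: 100 frames × 132 features

# Slide a 30-frame window along the time axis with step=1
windows = sliding_window_view(frames, window_shape=30, axis=0)
windows = windows.transpose(0, 2, 1)  # -> (N_windows, 30, 132)

assert windows.shape == (100 - 30 + 1, 30, 132)
assert np.allclose(windows[0], frames[:30])  # first window = first 30 frames
```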

**Final Dataset Stats:**

| Split | Samples |
|---|---|
| Train (64%) | ~26,154 |
| Validation (16%) | ~6,538 |
| Test (20%) | 8,173 |

## 🧠 Model Architecture & Training

### LSTM Network Architecture

```
Input shape: (30, 132)  →  30 time steps × 132 features
                │
┌───────────────▼───────────────┐
│  LSTM(128, return_sequences)  │   ← temporal pattern capture
│  BatchNormalization           │
│  Dropout(0.3)                 │
├───────────────────────────────┤
│  LSTM(64)                     │   ← sequence summary
│  BatchNormalization           │
│  Dropout(0.3)                 │
├───────────────────────────────┤
│  Dense(128, ReLU)             │
│  BatchNormalization           │
│  Dropout(0.3)                 │
├───────────────────────────────┤
│  Dense(64, ReLU)              │
│  BatchNormalization           │
│  Dropout(0.3)                 │
├───────────────────────────────┤
│  Dense(5, Softmax)            │   ← 5-class output
└───────────────────────────────┘
```
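As a sanity check on the sizes above, trainable parameter counts follow from the standard formulas: an LSTM layer has 4·((d_in + h)·h + h) weights and a Dense layer (d_in + 1)·units; BatchNorm layers add a further 4 parameters per feature, half of them trainable. A quick arithmetic sketch:

```python
def lstm_params(d_in, h):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((d_in + h) * h + h)

def dense_params(d_in, units):
    return (d_in + 1) * units  # weights + bias

assert lstm_params(132, 128) == 133_632  # LSTM(128) on 132 input features
assert lstm_params(128, 64) == 49_408    # LSTM(64)
assert dense_params(64, 128) == 8_320    # Dense(128) after LSTM(64)
assert dense_params(128, 64) == 8_256    # Dense(64)
assert dense_params(64, 5) == 325        # Dense(5) softmax head
```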

### Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss Function | Sparse Categorical Crossentropy |
| Epochs | 100 (max) |
| Early Stopping Patience | 10 epochs |
| Batch Size | 32 |
| Best Epoch | 36 |
| Total Epochs Run | 46 (early stopped) |

### Training Progress (Key Epochs)

| Epoch | Train Acc | Val Acc | Val Loss | Note |
|---|---|---|---|---|
| 1 | 62.85% | 75.55% | 0.6471 | |
| 4 | 89.78% | 91.60% | 0.2390 | |
| 8 | 94.76% | 97.49% | 0.0858 | |
| 17 | 97.56% | 98.69% | 0.0419 | |
| 25 | 98.25% | 99.31% | 0.0267 | |
| 36 | 99.05% | 99.53% | 0.0150 | ← Best |
| 46 | 99.13% | 98.65% | 0.0453 | Early stop |

## ⚡ Real-Time Inference Pipeline

### How It Works (Live)

1. **Camera Init:** opens the webcam at 1280×720
2. **Person Selection:** YOLO detects and tracks all people; the user types a Track ID and presses Enter
3. **Pose Loop:** for every frame of the selected person:
   - Crop the bounding box
   - Extract 33 MediaPipe landmarks (132 values)
   - Append to a rolling deque (max = 60 frames)
4. **Prediction:** once the deque holds ≥ 30 frames:
   - Take the last 30 frames
   - Apply hip centralization
   - Run the LSTM model
   - Apply a confidence threshold (≥ 0.5)
5. **Smoothing:** majority vote over the last 5 predictions
6. **Key Press:** map the smoothed action to a virtual key press via pynput
7. **Memory Check:** ActionMemory decides whether to execute or block
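Steps 3-7 above can be condensed into a framework-free skeleton; the model, landmark centralization, smoothing, key-press, and memory objects are passed in as stubs here, so all names are illustrative rather than the repository's actual code:

```python
from collections import deque

SEQ_LEN = 30
CONF_THRESHOLD = 0.5
pose_buffer = deque(maxlen=60)   # rolling landmark buffer (step 3)
history = []                     # per-frame (label, confidence) pairs

def infer_step(frame_features, predict, centralize, smooth, press_key, memory):
    """One pass of steps 3-7: buffer, predict, threshold, smooth, key press."""
    pose_buffer.append(frame_features)
    if len(pose_buffer) < SEQ_LEN:
        return None                                # not enough frames yet
    window = [centralize(f) for f in list(pose_buffer)[-SEQ_LEN:]]
    label, conf = predict(window)                  # LSTM forward pass (step 4)
    if conf < CONF_THRESHOLD:
        label = "Low Confidence"
    history.append((label, conf))
    action, _ = smooth(history)                    # majority vote (step 5)
    if action != "Low Confidence" and memory.should_execute(action):
        press_key(action)                          # virtual key press (steps 6-7)
    return action
```

With 30 frames buffered, each subsequent call produces one smoothed prediction and, depending on the memory mode, at most one key press.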

### Live Display Overlay

```
┌────────────────────────────────┐
│ Prediction:    Jump            │
│ Confidence:    0.97            │
│ Previous:      Kick            │
│ Actions Exec:  42              │
│                                │
│ [Tracking box + pose skeleton] │
└────────────────────────────────┘
```

## 📈 Performance Metrics

```
📊 Model Performance Summary

Test Accuracy          ████████████████████  99.69%
Train Accuracy (best)  ████████████████████  99.05%
Val Accuracy (best)    ████████████████████  99.53%
Test Loss              0.009
Inference Latency      <100 ms (real-time)
```

### Per-Class Performance

| Action | Precision | Recall | F1-Score | Test Support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1,486 |
| Kick | 1.00 | 1.00 | 1.00 | 2,145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1,734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1,941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| **Overall** | 1.00 | 1.00 | 1.00 | 8,173 |

๐Ÿ“ Project Structure

Real_Time_Game_Control/
โ”‚
โ”œโ”€โ”€ ๐Ÿ“‚ code/
โ”‚   โ”œโ”€โ”€ dataset_Preparation.ipynb    # Full data pipeline
โ”‚   โ”œโ”€โ”€ lstm_training.ipynb          # Model training & evaluation
โ”‚   โ””โ”€โ”€ real_time_model.ipynb        # Live inference + game control
โ”‚
โ”œโ”€โ”€ ๐Ÿ“‚ Dataset/
โ”‚   โ”œโ”€โ”€ nturgb+d_rgb/                # Raw NTU RGB+D videos (not tracked)
โ”‚   โ”œโ”€โ”€ video/                       # Filtered action videos
โ”‚   โ”œโ”€โ”€ mediapipe_pose.csv           # Extracted pose data
โ”‚   โ”œโ”€โ”€ yolo_pose.csv                # YOLO-extracted pose data
โ”‚   โ”œโ”€โ”€ skipped_videos.csv           # Skipped video log
โ”‚   โ””โ”€โ”€ mediapipe_dataset_132.npz    # Final training dataset
โ”‚
โ”œโ”€โ”€ ๐Ÿ“‚ Models/
โ”‚   โ”œโ”€โ”€ best_lstm_model.keras        # Best saved model (Keras format)
โ”‚   โ”œโ”€โ”€ best_lstm_model.h5           # Best saved model (H5 format)
โ”‚   โ”œโ”€โ”€ label_encoder.pkl            # Sklearn LabelEncoder
โ”‚   โ”œโ”€โ”€ class_info.pkl               # Class names & mappings
โ”‚   โ”œโ”€โ”€ yolo11n-pose.pt              # YOLO pose model (nano)
โ”‚   โ””โ”€โ”€ yolo11s-pose.pt              # YOLO pose model (small)
โ”‚
โ””โ”€โ”€ README.md

๐Ÿ› ๏ธ Installation & Setup

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (recommended for training)
  • Webcam (for real-time inference)

1. Clone the Repository

git clone https://github.com/uqasha524/real-time-action-recognition.git
cd real-time-action-recognition

2. Create Virtual Environment

python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate

3. Install Dependencies

pip install tensorflow
pip install torch torchvision
pip install ultralytics        # YOLO
pip install mediapipe
pip install opencv-python
pip install scikit-learn
pip install pandas numpy
pip install matplotlib seaborn
pip install pynput             # Virtual keyboard control
pip install jupyter

### 4. Download YOLO Pose Models

```python
from ultralytics import YOLO

# Auto-downloads on first use
model = YOLO('yolo11n-pose.pt')  # Nano (faster)
model = YOLO('yolo11s-pose.pt')  # Small (more accurate)
```

### 5. Download the Dataset

Request access to the NTU RGB+D Dataset at:

🔗 https://rose1.ntu.edu.sg/dataset/actionRecognition/

Submit an academic request form; approval typically takes 1-3 days.


## 🚀 How to Run

### Option A: Run the Pre-Trained Model (Real-Time Only)

If you already have the trained model files in `/Models/`:

```bash
jupyter notebook code/real_time_model.ipynb
```

1. Run all cells
2. A webcam window will open
3. Type the Track ID of the person you want to track, then press Enter
4. Start performing gestures; game keys will be triggered automatically

### Option B: Full Pipeline (Train from Scratch)

#### Step 1: Dataset Preparation

```bash
jupyter notebook code/dataset_Preparation.ipynb
```

- Update `source_dir` to your NTU RGB+D dataset path
- Run the extraction cell and manually select a person Track ID for each video
- Press Space to skip irrelevant videos, ESC to skip during tracking
- Wait for `mediapipe_pose.csv` and `mediapipe_dataset_132.npz` to be generated

#### Step 2: Train the Model

```bash
jupyter notebook code/lstm_training.ipynb
```

- Run all cells
- Training takes ~30-60 minutes on a GPU (46 epochs ran; best at epoch 36)
- The model is saved automatically to `/Models/`

#### Step 3: Real-Time Inference

```bash
jupyter notebook code/real_time_model.ipynb
```

## 🔧 Technical Details

### Hip Centralization

All joint coordinates are normalized relative to the hip midpoint before training and inference:

```python
# MediaPipe pose landmark indices: LEFT_HIP = 23, RIGHT_HIP = 24
hip_x = (frame[LEFT_HIP*4]   + frame[RIGHT_HIP*4])   / 2
hip_y = (frame[LEFT_HIP*4+1] + frame[RIGHT_HIP*4+1]) / 2
hip_z = (frame[LEFT_HIP*4+2] + frame[RIGHT_HIP*4+2]) / 2

# Subtract the hip midpoint from all joints (x, y, z only; visibility unchanged)
frame[0::4] -= hip_x   # All X coordinates
frame[1::4] -= hip_y   # All Y coordinates
frame[2::4] -= hip_z   # All Z coordinates
```

This makes the model position-invariant: the same action is recognized regardless of where the person stands in the frame.
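A quick NumPy check (dummy pose data; hip indices 23 and 24 follow MediaPipe's landmark ordering, and the `centralize` helper is illustrative) confirms the translation invariance claimed above:

```python
import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24   # MediaPipe pose landmark indices

def centralize(frame):
    frame = frame.copy()
    hip = [(frame[LEFT_HIP*4 + i] + frame[RIGHT_HIP*4 + i]) / 2 for i in range(3)]
    for i in range(3):          # shift x, y, z; leave visibility (index 3) alone
        frame[i::4] -= hip[i]
    return frame

pose = np.random.rand(132)
shifted = pose.copy()
shifted[0::4] += 0.3            # same pose, person standing further right
shifted[1::4] -= 0.1            # and higher in the frame

# Identical features after centralization, wherever the person stands
assert np.allclose(centralize(pose), centralize(shifted))
```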

### Sliding Window

```python
# Window size: 30 frames
# Step size: 1 (maximum overlap for dense training data)
# Each window: shape (30, 132)

# For real-time:
pose_sequence = deque(maxlen=60)  # Rolling buffer
# Predict on the last 30 frames, every frame
```

### Prediction Smoothing

```python
from statistics import mean

def smooth_predictions(predictions, k=5):
    """Majority vote over the last k (label, confidence) predictions."""
    recent = predictions[-k:]
    labels = [p[0] for p in recent]
    majority = max(set(labels), key=labels.count)  # Majority vote
    conf = mean([p[1] for p in recent if p[0] == majority])
    return majority, conf
```

### Confidence Filtering

```python
CONF_THRESHOLD = 0.5
# Only actions with softmax probability >= 0.5 are considered valid;
# others are labeled "Low Confidence" and not mapped to any key.
```

### Dataset Label Extraction

```python
import os

# Filename format: ActionName_SeqNumber.avi
# e.g.: Jump_701.avi         → label = "Jump"
#       MoveBackward_123.avi → label = "MoveBackward"

def extract_action_label(filename):
    parts = os.path.splitext(filename)[0].split("_")
    if parts[-1].isdigit():
        return "_".join(parts[:-1])  # Remove the trailing sequence number
    return "_".join(parts)
```

## 📦 Requirements

```
tensorflow>=2.12
torch>=2.0
ultralytics>=8.0
mediapipe>=0.10
opencv-python>=4.8
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
pynput>=1.7
jupyter>=1.0
```

## 🔮 Future Improvements

- Add more action classes (Duck, Roll, Sprint, Block)
- Replace the LSTM with a Transformer-based architecture (e.g., TimeSformer)
- Deploy as a standalone executable (no Jupyter required)
- Add multi-player support (track multiple IDs simultaneously)
- Train on custom game-specific gesture sets
- Export to ONNX for faster CPU inference

## 📄 License

This project is for academic and research purposes. The NTU RGB+D dataset is subject to its own license terms from NTU Singapore.


Built with ❤️ using Python · TensorFlow · MediaPipe · YOLO · OpenCV
