A Practical Computer Vision–Based HCI Project
Ever found yourself wanting to pause or skip a video without reaching for the keyboard—maybe while just being lazy in the best possible way?
This project attempts to turn that idea into a working system.
The Gesture-Controlled Media Player is a real-time hand gesture–based media control interface that uses a webcam to interpret simple, intuitive hand gestures and translate them into playback commands such as play/pause and seek forward/backward.
Under the hood, it combines MediaPipe hand landmark detection, geometric reasoning, and OS-level keyboard automation. The focus is on robustness, clarity, and real-time performance, not flashy but fragile tricks.
| Gesture | Action |
|---|---|
| ✊ Closed fist (held briefly) | Play / Pause |
| ☝️ Index finger on left half of the screen | Seek backward |
| ☝️ Index finger on right half of the screen | Seek forward |
Other nice-to-haves:
- Built-in cooldowns to avoid accidental multiple triggers
- Live visual overlays so you know what the system is detecting
- Works with any browser-based video player, not just YouTube
This project explores vision-based human–computer interaction (HCI) by mapping where your hand is in space to what action the system performs over time.
Instead of relying on heavy, opaque gesture classifiers, the system uses:
- Explicit hand landmark geometry
- Per-finger state estimation
- Simple spatial reasoning
- Time-based stability checks
The result is a system that is easier to understand, debug, and extend.
```bash
git clone https://github.com/SohamB-42/gesture-controlled-media-player.git
cd gesture-controlled-media-player
pip install opencv-python mediapipe pyautogui
```

Ensure that HandTrackingModule.py (MediaPipe wrapper) is present in the project directory.

```bash
python main.py
```

Press q to exit cleanly.
The system controls media playback by simulating standard keyboard inputs using pyautogui. As a result, it works with any application that responds to common media keys, provided the application window is in focus.
- Web-based players (YouTube, Netflix, Prime Video, Coursera, etc.)
- Desktop media players (VLC, Windows Media Player)
- Presentation software (PowerPoint, Google Slides)
- Custom video players and demo applications
The system sends the following key events:
- Space: Play / Pause
- ← / →: Seek backward / forward
This design keeps the project platform-agnostic and application-independent, avoiding tight coupling with any specific media player or API.
- Detects and tracks up to two hands simultaneously
- Each hand is represented by 21 two-dimensional landmarks
- Frames are horizontally flipped for intuitive left–right interaction
Each finger is classified as extended or not extended using relative landmark positions:
- Index to pinky fingers use vertical (Y-axis) comparisons
- The thumb uses horizontal (X-axis) comparisons to account for mirrored camera input
This produces a compact representation:
[thumb, index, middle, ring, pinky]
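The per-finger comparison can be sketched as follows. This is a minimal illustration, not the project's exact code: it assumes the 21 landmarks arrive as (x, y) pixel pairs using the standard MediaPipe Hands indexing (4 = thumb tip, 8/12/16/20 = the other fingertips, with the PIP joint two indices below each tip), and the thumb's comparison direction depends on which hand is shown and on the mirrored frame.

```python
# Landmark indices of the five fingertips (MediaPipe Hands convention).
TIP_IDS = [4, 8, 12, 16, 20]

def fingers_up(lm):
    """Return [thumb, index, middle, ring, pinky] as 1 (extended) / 0.

    lm is a list of 21 (x, y) pixel coordinates. Image y grows downward,
    so a fingertip "above" its PIP joint has a *smaller* y value.
    """
    states = []
    # Thumb: horizontal (X-axis) comparison against the joint below the tip.
    # The direction of this comparison is an assumption; it flips with
    # handedness and with the mirrored camera frame.
    states.append(1 if lm[4][0] > lm[3][0] else 0)
    # Index through pinky: vertical (Y-axis) comparison, tip vs. PIP joint.
    for tip in TIP_IDS[1:]:
        states.append(1 if lm[tip][1] < lm[tip - 2][1] else 0)
    return states
```

The resulting five-element list is the compact representation the rest of the pipeline reasons about.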
A closed fist gesture is identified using multiple checks:
- Number of extended non-thumb fingers
- Palm compactness (distance between index and pinky fingertips)
- Thresholds scaled relative to frame width
The gesture must be held briefly, and a cooldown is enforced to prevent rapid toggling.
This makes play/pause intentional rather than accidental.
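The checks above can be combined into a small sketch. The constant names mirror the configuration shown later in this README, but the 0.25 × frame-width compactness threshold and the class structure are illustrative assumptions, not the project's exact implementation.

```python
import time

FIST_HOLD_SECONDS = 0.14  # how long a fist must be held to trigger
FIST_COOLDOWN = 0.9       # minimum delay between play/pause toggles

def is_fist(fingers, index_tip, pinky_tip, frame_width):
    """Fist = no non-thumb fingers extended and a compact palm.

    The 0.25 * frame_width spread threshold is an assumed value,
    scaled to the frame so it works at any resolution.
    """
    non_thumb_up = sum(fingers[1:])
    spread = abs(index_tip[0] - pinky_tip[0])
    return non_thumb_up == 0 and spread < 0.25 * frame_width

class FistToggle:
    """Fires play/pause only after a brief hold, then enforces a cooldown."""

    def __init__(self):
        self.hold_start = None
        self.last_toggle = float("-inf")

    def update(self, fist_now, now=None):
        """Return True exactly once per intentional fist gesture."""
        now = time.time() if now is None else now
        if not fist_now:
            self.hold_start = None  # hand opened: reset the hold timer
            return False
        if self.hold_start is None:
            self.hold_start = now
        held = now - self.hold_start >= FIST_HOLD_SECONDS
        cooled = now - self.last_toggle >= FIST_COOLDOWN
        if held and cooled:
            self.last_toggle = now
            self.hold_start = None  # require a fresh hold for the next toggle
            return True
        return False
```

Separating the stateless geometric check (`is_fist`) from the stateful timing logic (`FistToggle`) keeps both halves easy to test in isolation.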
The camera frame is divided vertically into two regions:
| SEEK BACKWARD | SEEK FORWARD |
When the index finger is extended:
- Presence in the left region triggers a backward seek
- Presence in the right region triggers a forward seek
Each direction has its own cooldown for controlled interaction.
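The zone logic can be sketched as a single function. This is an assumed shape, not the project's exact code: SEEK_COOLDOWN matches the configuration shown later, and the "lone extended index finger" condition is inferred from the gesture table.

```python
import time

SEEK_COOLDOWN = 0.45  # delay between consecutive seek actions (seconds)

# One timestamp per direction, so left and right cool down independently.
last_seek = {"left": float("-inf"), "right": float("-inf")}

def seek_direction(fingers, index_x, frame_width, now=None):
    """Return 'left', 'right', or None for the current frame.

    fingers is the [thumb, index, middle, ring, pinky] state list;
    index_x is the index fingertip's x pixel coordinate.
    """
    now = time.time() if now is None else now
    if fingers[1] != 1 or sum(fingers[2:]) != 0:
        return None  # require a lone extended index finger
    side = "left" if index_x < frame_width // 2 else "right"
    if now - last_seek[side] < SEEK_COOLDOWN:
        return None  # this direction is still cooling down
    last_seek[side] = now
    return side
```

Because each direction keeps its own timestamp, holding a finger in one half produces a steady, controllable stream of seeks rather than a burst.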
The system simulates standard keyboard inputs using pyautogui:
| Action | Key |
|---|---|
| Play / Pause | Space |
| Seek backward | ← |
| Seek forward | → |
Because it uses standard keys, the system is platform- and application-agnostic.
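The dispatch from recognized gesture to key press reduces to a small mapping. The event names below ("fist", "seek_left", "seek_right") are illustrative placeholders; in the live loop the chosen key name would be passed to `pyautogui.press()`.

```python
# Gesture event -> pyautogui key name (event names are assumptions).
GESTURE_KEYS = {
    "fist": "space",        # Play / Pause
    "seek_left": "left",    # Seek backward (←)
    "seek_right": "right",  # Seek forward (→)
}

def key_for(event):
    """Return the key name for a gesture event, or None if unmapped."""
    return GESTURE_KEYS.get(event)
```

Keeping the mapping in one table makes it trivial to rebind gestures, e.g. to volume keys, without touching the detection code.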
The system is designed to be user-configurable, allowing gesture sensitivity and responsiveness to be tuned based on personal preference, camera quality, and lighting conditions.
Key parameters can be adjusted directly in the source code:
```python
CAMERA_ID = 0             # Change if using an external webcam
DRAW = True               # Toggle on-screen visual overlays
SEEK_COOLDOWN = 0.45      # Delay between consecutive seek actions (seconds)
FIST_HOLD_SECONDS = 0.14  # How long a fist must be held to trigger play/pause
FIST_COOLDOWN = 0.9       # Minimum delay between play/pause toggles
```

- Python
- OpenCV – video capture and rendering
- MediaPipe Hands – real-time hand landmark detection
- pyautogui – keyboard automation
- Basic geometry & timing logic
- Requires reasonable lighting conditions
- Extreme hand rotations can reduce accuracy
- Gesture set is intentionally minimal to prioritize reliability
Open-source and free to use, modify, and extend.