A native macOS application that translates real-world hand gestures and eye gaze, captured by the MacBook's camera, into standard system inputs (mouse movement, clicks, scrolling) to emulate the intuitive Gaze-and-Pinch control scheme of Apple Vision Pro.
This application brings Apple Vision Pro's revolutionary gaze-and-pinch interaction paradigm to macOS, allowing you to control your Mac using only your eyes and hands - no mouse or trackpad required. By leveraging the built-in FaceTime camera and advanced computer vision algorithms, the app tracks your eye movements to position the cursor and recognizes hand gestures for clicking, dragging, and scrolling.
- Eye-Controlled Cursor: Look anywhere on your screen and the cursor follows your gaze
- Gesture-Based Interaction: Pinch your fingers to click, hold to drag, move your hand to scroll
- Real-Time Processing: Low-latency tracking with smooth cursor movement
- Accessibility Integration: Works with any application that accepts standard mouse input
- Gaze Tracking: Real-time eye tracking using Vision Framework to control cursor position
- Hand Gesture Recognition: Pinch gestures for clicks and drags, hand movements for scrolling
- System Input Synthesis: Converts gaze and gestures into actual mouse events
- Multi-point Calibration: Robust calibration system for accurate gaze mapping
- Visual Feedback: Real-time reticle showing gaze position and gesture states
- Camera Preview: Live camera feed with overlay indicators
- macOS 15.6 or later
- MacBook with built-in FaceTime camera
- Camera and Accessibility permissions
- Clone or download this project
- Open VisionOS Simulator.xcodeproj in Xcode
- Build and run the project
When you first launch the app, you'll need to grant two permissions:
- Camera Permission: Required to capture video from your FaceTime camera
- Accessibility Permission: Required to synthesize mouse events and control the cursor
The app will guide you through granting these permissions.
For accurate gaze tracking, you need to calibrate the system:
- Click "Start Tracking" to begin camera capture
- Click "Calibrate" to open the calibration interface
- Follow the on-screen instructions to look at each calibration point
- Complete all 5 calibration points for optimal accuracy
Once calibrated:
- Click "Enable Input" to activate gaze and gesture control
- Look at different areas of your screen - the cursor will follow your gaze
- Use pinch gestures to click:
- Quick pinch: Single click
- Hold pinch: Drag operation
- Move your hand while not pinching to scroll
- CameraService: Manages AVFoundation camera capture and video processing
- GazeTrackingService: Processes face landmarks to estimate gaze direction
- HandTrackingService: Detects hand poses and gestures
- InputSynthesisService: Converts tracking data into system mouse events
- VisionCoordinator: Orchestrates all services and manages state
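As a rough, hypothetical sketch of how a coordinator in this role could combine the services' outputs, the snippet below pairs the latest gaze sample with the latest gesture state using Combine. The type and publisher names (CoordinatorSketch, gazePoints, pinchStates) are illustrative assumptions, not the project's actual API.

```swift
import Combine
import CoreGraphics
import Foundation

/// Hypothetical coordinator sketch: fan the latest gaze sample and gesture
/// state out to the input-synthesis step. Names and publishers are assumed,
/// not taken from the project.
final class CoordinatorSketch {
    private var cancellables = Set<AnyCancellable>()

    let gazePoints = PassthroughSubject<CGPoint, Never>()   // normalized gaze samples
    let pinchStates = PassthroughSubject<Bool, Never>()     // true while pinching

    func start(synthesize: @escaping (CGPoint, Bool) -> Void) {
        gazePoints
            .combineLatest(pinchStates)
            .receive(on: DispatchQueue.main)
            .sink { point, isPinching in
                // Hand the paired state to the input-synthesis service.
                synthesize(point, isPinching)
            }
            .store(in: &cancellables)
    }
}
```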
- Video Capture: AVCaptureSession captures video frames from FaceTime camera
- Dual Vision Processing:
- VNDetectFaceLandmarksRequest for eye tracking
- VNDetectHumanHandPoseRequest for hand gesture recognition
- Coordinate Mapping: Maps normalized gaze coordinates to screen coordinates
- Input Synthesis: Generates CGEvent mouse events
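As an illustration of the dual-request step, the sketch below runs both Vision requests against a single captured frame. The function name and error handling are simplified assumptions.

```swift
import Vision

/// Sketch of the dual-request step: run face-landmark and hand-pose detection
/// on a single captured frame. Error handling is reduced to the essentials.
func analyze(pixelBuffer: CVPixelBuffer) throws -> (face: VNFaceObservation?, hand: VNHumanHandPoseObservation?) {
    let faceRequest = VNDetectFaceLandmarksRequest()
    let handRequest = VNDetectHumanHandPoseRequest()
    handRequest.maximumHandCount = 1

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
    try handler.perform([faceRequest, handRequest])

    return (faceRequest.results?.first, handRequest.results?.first)
}
```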
- Pinch Detection: Monitors distance between index finger tip and thumb tip
- Tap vs Drag: Duration-based classification (short pinch = tap, long pinch = drag)
- Scroll Detection: Hand movement without pinch gesture
FaceTime Camera → AVCaptureSession → Video Frames → Vision Framework Processing
The app continuously captures video from your MacBook's FaceTime camera at high resolution, processing frames through Apple's Vision Framework for real-time analysis.
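The following sketch shows one common way such a capture pipeline is set up with AVFoundation; the preset, queue label, and class name are assumptions rather than the project's actual configuration.

```swift
import AVFoundation

/// Sketch of a camera capture pipeline that delivers frames to a delegate.
/// The preset, queue label, and class name are illustrative assumptions.
final class CaptureSketch: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let videoQueue = DispatchQueue(label: "camera.frames")

    func configure() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: camera)

        session.beginConfiguration()
        session.sessionPreset = .high
        if session.canAddInput(input) { session.addInput(input) }

        let output = AVCaptureVideoDataOutput()
        output.alwaysDiscardsLateVideoFrames = true
        output.setSampleBufferDelegate(self, queue: videoQueue)
        if session.canAddOutput(output) { session.addOutput(output) }
        session.commitConfiguration()

        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Forward the pixel buffer to the Vision processing stage here.
        _ = CMSampleBufferGetImageBuffer(sampleBuffer)
    }
}
```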
Face Detection → Eye Landmark Extraction → Pupil Center Calculation → Gaze Direction Estimation → Screen Coordinate Mapping
Step-by-step breakdown:
- Face Detection: Vision Framework identifies your face in the video frame
- Eye Landmark Extraction: Detects detailed eye contours, eyelids, and pupil positions
- Pupil Center Calculation: Calculates the precise center of each pupil
- Gaze Direction: Estimates where you're looking based on pupil position relative to eye landmarks
- Coordinate Mapping: Converts normalized gaze coordinates to actual screen pixel coordinates
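A simplified sketch of the pupil-center and gaze-direction steps follows, assuming the pupil landmark is expressed as an offset inside the eye contour's bounding box; the project's exact landmark math may differ.

```swift
import Vision
import CoreGraphics

/// Sketch of the pupil-center and gaze-direction steps: express the pupil
/// landmark as an offset inside the eye contour's bounding box. Illustrative
/// only; the real landmark math may differ.
func normalizedGaze(from face: VNFaceObservation) -> CGPoint? {
    guard let eye = face.landmarks?.leftEye,
          let pupilPoint = face.landmarks?.leftPupil?.normalizedPoints.first else { return nil }

    // Bounding box of the eye contour in the face's normalized space.
    let xs = eye.normalizedPoints.map(\.x)
    let ys = eye.normalizedPoints.map(\.y)
    guard let minX = xs.min(), let maxX = xs.max(),
          let minY = ys.min(), let maxY = ys.max(),
          maxX > minX, maxY > minY else { return nil }

    // Pupil position as a 0...1 offset inside the eye box, a crude gaze proxy.
    return CGPoint(x: (pupilPoint.x - minX) / (maxX - minX),
                   y: (pupilPoint.y - minY) / (maxY - minY))
}
```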
Hand Detection → Key Point Extraction → Gesture Classification → Action Mapping
Detailed process:
- Hand Detection: Identifies hand presence and extracts skeletal key points
- Key Point Tracking: Monitors wrist, index finger tip, and thumb tip positions
- Distance Calculation: Measures distance between index finger and thumb
- Gesture Classification (see the sketch after this list):
- Pinch: index finger and thumb tips closer than a small threshold (roughly 3 cm, measured in normalized image coordinates)
- Tap: Pinch duration < 0.3 seconds
- Drag: Pinch duration ≥ 0.3 seconds
- Scroll: Hand movement without pinch gesture
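A minimal sketch of this classification using Vision's hand-pose joints is shown below. The 0.05 normalized-distance threshold and 0.3 s cutoff mirror the heuristics above but are illustrative values, and the type names are hypothetical.

```swift
import Foundation
import Vision

/// Sketch of the distance/duration classification. The 0.05 normalized
/// distance and 0.3 s cutoff mirror the heuristics above but are illustrative.
enum PinchGesture { case none, tap, drag }

struct PinchClassifier {
    private var pinchStart: Date?

    mutating func update(with hand: VNHumanHandPoseObservation,
                         at time: Date = Date()) throws -> PinchGesture {
        let thumb = try hand.recognizedPoint(.thumbTip)
        let index = try hand.recognizedPoint(.indexTip)
        guard thumb.confidence > 0.3, index.confidence > 0.3 else { return .none }

        // Index-to-thumb distance in Vision's normalized image coordinates.
        let distance = hypot(thumb.location.x - index.location.x,
                             thumb.location.y - index.location.y)
        let isPinching = distance < 0.05

        if isPinching {
            if pinchStart == nil { pinchStart = time }
            // Held long enough: report an ongoing drag.
            return time.timeIntervalSince(pinchStart!) >= 0.3 ? .drag : .none
        } else if let start = pinchStart {
            pinchStart = nil
            // Released quickly: a single tap (click).
            return time.timeIntervalSince(start) < 0.3 ? .tap : .none
        }
        return .none
    }
}
```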
Tracking Data → Smoothing Algorithms → CGEvent Generation → System Input
Processing pipeline:
- Data Smoothing: Applies weighted averaging to reduce jitter
- Threshold Filtering: Removes movements below minimum thresholds
- Event Generation: Creates native macOS mouse events using Core Graphics
- System Integration: Posts events to the system event queue
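A stripped-down sketch of the event-generation step follows. Real code would also pace events and emit drag updates, but the CGEvent calls shown are the standard Core Graphics synthesis mechanism.

```swift
import CoreGraphics

/// Sketch of the CGEvent step: move the cursor and synthesize a left click.
func moveCursor(to point: CGPoint) {
    CGEvent(mouseEventSource: nil, mouseType: .mouseMoved,
            mouseCursorPosition: point, mouseButton: .left)?
        .post(tap: .cghidEventTap)
}

func click(at point: CGPoint) {
    for type in [CGEventType.leftMouseDown, .leftMouseUp] {
        CGEvent(mouseEventSource: nil, mouseType: type,
                mouseCursorPosition: point, mouseButton: .left)?
            .post(tap: .cghidEventTap)
    }
}
```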
The calibration system uses a 2D affine transformation matrix to map gaze coordinates to screen coordinates:
Screen_X = a × Gaze_X + b × Gaze_Y + c
Screen_Y = d × Gaze_X + e × Gaze_Y + f
Where the coefficients (a, b, c, d, e, f) are calculated using least squares regression from the 5-point calibration data.
Calibration collects gaze samples at five screen positions:
- Top-left corner
- Top-right corner
- Bottom-left corner
- Bottom-right corner
- Center point
This creates a mapping matrix that accounts for individual differences in eye movement patterns and head position.
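Because the Screen_X and Screen_Y equations decouple, each triple of coefficients can be fit independently as a 3-unknown least-squares problem. The sketch below solves it via the normal equations using simd; the type and function names (GazeCalibration, fitAxis) are hypothetical.

```swift
import simd

/// Sketch of fitting one row of the affine map by least squares, solving
/// the normal equations (A^T A)^-1 A^T b. Names are hypothetical.
enum GazeCalibration {
    /// Returns (a, b, c) such that screen ≈ a·gazeX + b·gazeY + c.
    static func fitAxis(gaze: [SIMD2<Double>], screen: [Double]) -> SIMD3<Double>? {
        guard gaze.count == screen.count, gaze.count >= 3 else { return nil }
        var ata = double3x3()                 // accumulates A^T A
        var atb = SIMD3<Double>(repeating: 0) // accumulates A^T b
        for (g, s) in zip(gaze, screen) {
            let row = SIMD3<Double>(g.x, g.y, 1)
            // Outer product row·row^T, built column by column (symmetric).
            ata = ata + double3x3(columns: (row * row.x, row * row.y, row * row.z))
            atb += row * s
        }
        return ata.inverse * atb
    }
}
```

With the 5-point data, such a routine would be called twice: once with the screen X targets to obtain (a, b, c) and once with the screen Y targets to obtain (d, e, f).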
The reticle provides real-time feedback about system state:
- Blue: Ready state (tracking active, no gesture) - Look around to move cursor
- Green: Click detected (short pinch) - Quick pinch gesture registered
- Red: Drag state (long pinch) - Hold pinch for dragging operations
- Gray: Not tracking or low confidence - System needs better visibility
- Initialization: Start tracking → Calibrate → Enable input synthesis
- Daily Use: Look at target → Pinch to click → Move hand to scroll
- Advanced Operations:
- Text Selection: Look at start → Pinch and hold → Look at end → Release
- Window Management: Look at window → Pinch and hold → Look at destination
- Scrolling: Look at scrollable area → Move hand vertically
- Immediate: Basic cursor movement works right away
- 5 minutes: Comfortable with clicking and basic navigation
- 15 minutes: Proficient with dragging and scrolling
- 30 minutes: Natural interaction patterns established
Problem: Eye tracking can be noisy and inaccurate, especially with varying lighting conditions.
Solutions:
- Multi-point Calibration: 5-point calibration system accounts for individual differences
- Smoothing Algorithms: Weighted averaging reduces jitter and improves stability
- Confidence Filtering: Only processes high-confidence detections (>30% threshold)
- Eye Aspect Ratio: Monitors eye openness to ensure reliable tracking
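One simple form of the confidence filtering and weighted averaging described above is an exponential filter, sketched below. The 0.3 confidence cutoff comes from this section; the blend factor is an illustrative choice, and the type name is hypothetical.

```swift
import CoreGraphics

/// Sketch of confidence filtering plus a weighted average (exponential filter).
struct GazeSmoother {
    private var smoothed: CGPoint?
    let blend: CGFloat = 0.25   // weight given to the newest sample

    mutating func add(_ point: CGPoint, confidence: Float) -> CGPoint? {
        guard confidence > 0.3 else { return smoothed }   // drop low-confidence samples
        guard let previous = smoothed else {
            smoothed = point
            return point
        }
        // Blend the new sample into the running estimate to damp jitter.
        let next = CGPoint(x: previous.x + blend * (point.x - previous.x),
                           y: previous.y + blend * (point.y - previous.y))
        smoothed = next
        return next
    }
}
```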
Problem: Hand gestures need to be distinguished from natural hand movements.
Solutions:
- Distance Thresholds: Precise pinch detection using normalized distance measurements
- Duration Classification: Time-based distinction between taps and drags
- Movement Filtering: Minimum movement thresholds prevent accidental scrolling
- State Tracking: Maintains gesture state across frames for consistency
Problem: Converting tracking data into reliable system input events.
Solutions:
- Accessibility Framework: Verifies trust with AXIsProcessTrusted before posting events
- CGEvent Generation: Creates native macOS mouse events for universal compatibility
- Coordinate Transformation: Accurate mapping from camera space to screen space
- Event Timing: Proper delays and timing for natural interaction feel
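Verifying (and prompting for) that trust is a short call into the Accessibility API, roughly as sketched below.

```swift
import ApplicationServices

/// Sketch of checking, and prompting for, the Accessibility permission
/// that event synthesis requires.
func ensureAccessibilityPermission() -> Bool {
    let promptKey = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    return AXIsProcessTrustedWithOptions([promptKey: true] as CFDictionary)
}
```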
Problem: Real-time computer vision processing can be computationally expensive.
Solutions:
- Frame Skipping: Processes every 2nd frame to maintain 30fps performance
- Efficient Algorithms: Optimized coordinate calculations and transformations
- Background Processing: Vision processing on dedicated background queue
- Memory Management: Proper cleanup and resource management
- Frame processing is limited to every 2nd frame for performance
- Smoothing algorithms reduce jitter in gaze and hand tracking
- Confidence thresholds filter out low-quality detections
- Efficient coordinate mapping minimizes computational overhead
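The frame-skipping strategy reduces to a small counter consulted in the capture delegate, roughly as follows; the helper type is hypothetical.

```swift
/// Sketch of the frame-skipping gate: lets only every 2nd frame through
/// to Vision processing. The type name is hypothetical.
final class FrameGate {
    private var frameIndex = 0

    /// Returns true when the current frame should be processed.
    func shouldProcess() -> Bool {
        frameIndex += 1
        return frameIndex % 2 == 0
    }
}
```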
- Poor Tracking Accuracy:
- Ensure good lighting on your face
- Recalibrate the system
- Check that your face is clearly visible in the camera preview
- Gestures Not Working:
- Ensure your hand is clearly visible to the camera
- Check that Accessibility permissions are granted
- Try adjusting your hand position relative to the camera
- Cursor Not Moving:
- Verify camera permissions are granted
- Check that gaze tracking is active (blue reticle)
- Ensure you're looking at the screen, not away from it
- Close other camera-intensive applications
- Ensure good lighting conditions
- Keep your face and hands clearly visible to the camera
- Avoid rapid head movements during use
VisionOS Simulator/
├── Models/
│ └── GazeState.swift # State models for gaze and hand tracking
├── Services/
│ ├── CameraService.swift # Camera capture and video processing
│ ├── GazeTrackingService.swift # Eye tracking and gaze estimation
│ ├── HandTrackingService.swift # Hand gesture recognition
│ ├── InputSynthesisService.swift # System input generation
│ └── VisionCoordinator.swift # Main service coordinator
├── Views/
│ ├── CameraPreviewView.swift # Camera preview component
│ ├── ReticleView.swift # Gaze reticle overlay
│ ├── StatusView.swift # System status display
│ └── CalibrationView.swift # Calibration interface
├── Extensions/
│ └── NotificationNames.swift # Custom notification names
└── ContentView.swift # Main application interface
- AVFoundation: Camera capture and video processing
- Vision Framework: Face and hand landmark detection
- Core Graphics: System input event generation
- SwiftUI: User interface
- Combine: Reactive programming and state management
This project demonstrates how modern computer vision can create powerful accessibility tools, potentially helping users with motor disabilities interact with their computers using only their eyes and minimal hand movements.
The project showcases the capabilities of Apple's Vision Framework and demonstrates how to integrate computer vision with system-level input synthesis, serving as a reference implementation for similar applications.
The codebase provides a foundation for research in:
- Human-computer interaction
- Eye tracking algorithms
- Gesture recognition systems
- Accessibility technology development
- Support for multiple monitor setups
- Customizable gesture mappings
- Advanced calibration algorithms
- Performance optimization for older hardware
- Integration with accessibility features
- Support for additional gesture types
- Machine learning-based gesture recognition
- Integration with other Apple frameworks (ARKit, Core ML)
- Cross-platform compatibility (iOS, iPadOS)
- Voice command integration
- Custom gesture creation tools
This project is for educational and research purposes. Please ensure you comply with Apple's guidelines and terms of service when using system input synthesis features.
This is a demonstration project showcasing the capabilities of Vision Framework and system input synthesis on macOS. Feel free to explore the code and adapt it for your own projects.