
Vision Pro Style Control for macOS

A native macOS application that translates real-world hand gestures and eye gaze, captured by the MacBook's camera, into standard system inputs (mouse movement, clicks, scrolling) to emulate the intuitive Gaze-and-Pinch control scheme of Apple Vision Pro.

What This Project Does

This application brings Apple Vision Pro's revolutionary gaze-and-pinch interaction paradigm to macOS, allowing you to control your Mac using only your eyes and hands, with no mouse or trackpad required. By leveraging the built-in FaceTime camera and advanced computer vision algorithms, the app tracks your eye movements to position the cursor and recognizes hand gestures for clicking, dragging, and scrolling.

Core Functionality

  • Eye-Controlled Cursor: Look anywhere on your screen and the cursor follows your gaze
  • Gesture-Based Interaction: Pinch your fingers to click, hold to drag, move your hand to scroll
  • Real-Time Processing: Low-latency tracking with smooth cursor movement
  • Accessibility Integration: Works with any application that accepts standard mouse input

Features

  • Gaze Tracking: Real-time eye tracking using Vision Framework to control cursor position
  • Hand Gesture Recognition: Pinch gestures for clicks and drags, hand movements for scrolling
  • System Input Synthesis: Converts gaze and gestures into actual mouse events
  • Multi-point Calibration: Robust calibration system for accurate gaze mapping
  • Visual Feedback: Real-time reticle showing gaze position and gesture states
  • Camera Preview: Live camera feed with overlay indicators

Requirements

  • macOS 15.6 or later
  • MacBook with built-in FaceTime camera
  • Camera and Accessibility permissions

Installation

  1. Clone or download this project
  2. Open VisionOS Simulator.xcodeproj in Xcode
  3. Build and run the project

Setup and Usage

1. Permissions

When you first launch the app, you'll need to grant two permissions:

  • Camera Permission: Required to capture video from your FaceTime camera
  • Accessibility Permission: Required to synthesize mouse events and control the cursor

The app will guide you through granting these permissions.
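
Under the hood, these checks map to two system APIs. The sketch below shows roughly how they can be queried; the PermissionCheck helper is illustrative and not necessarily how this codebase is organized.

```swift
import AVFoundation
import ApplicationServices

// Hypothetical helper illustrating how the two permissions can be checked.
enum PermissionCheck {
    /// Ask for camera access; the completion runs with `true` once granted.
    static func requestCamera(completion: @escaping (Bool) -> Void) {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            completion(true)
        case .notDetermined:
            AVCaptureDevice.requestAccess(for: .video, completionHandler: completion)
        default:
            completion(false)
        }
    }

    /// Check Accessibility trust, optionally showing the system prompt.
    static func isAccessibilityTrusted(prompt: Bool = true) -> Bool {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: prompt] as CFDictionary
        return AXIsProcessTrustedWithOptions(options)
    }
}
```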

2. Calibration

For accurate gaze tracking, you need to calibrate the system:

  1. Click "Start Tracking" to begin camera capture
  2. Click "Calibrate" to open the calibration interface
  3. Follow the on-screen instructions to look at each calibration point
  4. Complete all 5 calibration points for optimal accuracy

3. Using the System

Once calibrated:

  1. Click "Enable Input" to activate gaze and gesture control
  2. Look at different areas of your screen; the cursor will follow your gaze
  3. Use pinch gestures to click:
    • Quick pinch: Single click
    • Hold pinch: Drag operation
  4. Move your hand while not pinching to scroll

Technical Architecture

Core Components

  • CameraService: Manages AVFoundation camera capture and video processing
  • GazeTrackingService: Processes face landmarks to estimate gaze direction
  • HandTrackingService: Detects hand poses and gestures
  • InputSynthesisService: Converts tracking data into system mouse events
  • VisionCoordinator: Orchestrates all services and manages state

Vision Pipeline

  1. Video Capture: AVCaptureSession captures video frames from FaceTime camera
  2. Dual Vision Processing:
    • VNDetectFaceLandmarksRequest for eye tracking
    • VNDetectHumanHandPoseRequest for hand gesture recognition
  3. Coordinate Mapping: Maps normalized gaze coordinates to screen coordinates
  4. Input Synthesis: Generates CGEvent mouse events
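
A hedged sketch of what the dual-request step can look like, assuming a CVPixelBuffer coming from the capture delegate (function and variable names here are illustrative):

```swift
import Vision

// Run the face-landmarks and hand-pose requests on one captured frame.
// Sketch of the general approach, not the project's exact code.
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
    let faceRequest = VNDetectFaceLandmarksRequest()
    let handRequest = VNDetectHumanHandPoseRequest()
    handRequest.maximumHandCount = 1

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
    try handler.perform([faceRequest, handRequest])

    if let face = faceRequest.results?.first {
        // Eye landmarks feed the gaze estimator.
        _ = face.landmarks?.leftPupil
    }
    if let hand = handRequest.results?.first {
        // Finger tips feed the gesture classifier.
        _ = try hand.recognizedPoint(.indexTip)
    }
}
```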

Gesture Recognition

  • Pinch Detection: Monitors distance between index finger tip and thumb tip
  • Tap vs Drag: Duration-based classification (short pinch = tap, long pinch = drag)
  • Scroll Detection: Hand movement without pinch gesture
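
In terms of Vision's hand-pose output, the pinch-distance check boils down to a few lines; the threshold and confidence cutoff below are illustrative values, not the project's tuned constants:

```swift
import Vision
import CoreGraphics

/// Returns true when the thumb tip and index tip are close enough to count as a pinch.
/// The 0.05 normalized-distance threshold is an illustrative value.
func isPinching(_ hand: VNHumanHandPoseObservation, threshold: CGFloat = 0.05) -> Bool {
    guard
        let thumb = try? hand.recognizedPoint(.thumbTip),
        let index = try? hand.recognizedPoint(.indexTip),
        thumb.confidence > 0.3, index.confidence > 0.3   // ignore low-confidence joints
    else { return false }

    let dx = thumb.location.x - index.location.x
    let dy = thumb.location.y - index.location.y
    return dx * dx + dy * dy < threshold * threshold
}
```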

How It Works - Detailed Technical Flow

1. Video Capture Pipeline

FaceTime Camera → AVCaptureSession → Video Frames → Vision Framework Processing

The app continuously captures video from your MacBook's FaceTime camera at high resolution, processing frames through Apple's Vision Framework for real-time analysis.
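
A minimal capture setup along these lines might look as follows; the FrameCapture type is illustrative rather than the project's actual CameraService:

```swift
import AVFoundation

// Illustrative capture setup; CameraService is expected to do something similar.
final class FrameCapture: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let videoQueue = DispatchQueue(label: "camera.frames")

    func start() throws {
        session.sessionPreset = .high

        // Built-in FaceTime camera.
        guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .unspecified) else {
            return
        }
        let input = try AVCaptureDeviceInput(device: camera)
        if session.canAddInput(input) { session.addInput(input) }

        // Deliver frames to a background queue for Vision processing.
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: videoQueue)
        if session.canAddOutput(output) { session.addOutput(output) }

        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Hand the pixel buffer to the Vision pipeline.
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        _ = pixelBuffer
    }
}
```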

2. Eye Tracking Process

Face Detection → Eye Landmark Extraction → Pupil Center Calculation → Gaze Direction Estimation → Screen Coordinate Mapping

Step-by-step breakdown:

  1. Face Detection: Vision Framework identifies your face in the video frame
  2. Eye Landmark Extraction: Detects detailed eye contours, eyelids, and pupil positions
  3. Pupil Center Calculation: Calculates the precise center of each pupil
  4. Gaze Direction: Estimates where you're looking based on pupil position relative to eye landmarks
  5. Coordinate Mapping: Converts normalized gaze coordinates to actual screen pixel coordinates
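
One simple way to turn those landmarks into a normalized gaze estimate is to relate the pupil position to the eye's extent. The sketch below only illustrates the landmark plumbing for one eye; the real estimator also has to combine both eyes, head pose, and calibration:

```swift
import Vision
import CoreGraphics

/// Rough gaze proxy: where the left pupil sits inside the left eye's bounding extent,
/// expressed as 0...1 in each axis. Illustrative only.
func normalizedPupilOffset(for face: VNFaceObservation) -> CGPoint? {
    guard
        let eye = face.landmarks?.leftEye?.normalizedPoints,
        let pupil = face.landmarks?.leftPupil?.normalizedPoints.first,
        let minX = eye.map(\.x).min(), let maxX = eye.map(\.x).max(),
        let minY = eye.map(\.y).min(), let maxY = eye.map(\.y).max(),
        maxX > minX, maxY > minY
    else { return nil }

    return CGPoint(x: (pupil.x - minX) / (maxX - minX),
                   y: (pupil.y - minY) / (maxY - minY))
}
```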

3. Hand Gesture Recognition

Hand Detection → Key Point Extraction → Gesture Classification → Action Mapping

Detailed process:

  1. Hand Detection: Identifies hand presence and extracts skeletal key points
  2. Key Point Tracking: Monitors wrist, index finger tip, and thumb tip positions
  3. Distance Calculation: Measures distance between index finger and thumb
  4. Gesture Classification:
    • Pinch: thumb-to-index distance below a small normalized threshold (roughly a 3 cm gap)
    • Tap: Pinch duration < 0.3 seconds
    • Drag: Pinch duration ≥ 0.3 seconds
    • Scroll: Hand movement without pinch gesture
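
The duration-based split above can be modeled as a small state machine. The sketch below uses the 0.3-second threshold from the list; type and method names are illustrative:

```swift
import Foundation

enum GestureEvent {
    case tap, dragBegan, dragEnded, none
}

/// Classifies pinch transitions into taps and drags using a 0.3 s threshold.
/// Illustrative sketch; the project's HandTrackingService may differ in detail.
final class PinchClassifier {
    private var pinchStart: Date?
    private var dragging = false
    let dragThreshold: TimeInterval = 0.3

    func update(isPinching: Bool, now: Date = Date()) -> GestureEvent {
        switch (isPinching, pinchStart) {
        case (true, nil):
            pinchStart = now                      // pinch just started
            return .none
        case (true, let start?):
            if !dragging, now.timeIntervalSince(start) >= dragThreshold {
                dragging = true                   // held long enough: it's a drag
                return .dragBegan
            }
            return .none
        case (false, .some):
            defer { pinchStart = nil; dragging = false }
            // A short pinch that never crossed the drag threshold counts as a tap.
            return dragging ? .dragEnded : .tap
        case (false, nil):
            return .none
        }
    }
}
```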

4. Input Synthesis

Tracking Data → Smoothing Algorithms → CGEvent Generation → System Input

Processing pipeline:

  1. Data Smoothing: Applies weighted averaging to reduce jitter
  2. Threshold Filtering: Removes movements below minimum thresholds
  3. Event Generation: Creates native macOS mouse events using Core Graphics
  4. System Integration: Posts events to the system event queue
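
The event-generation step uses Core Graphics directly. A minimal sketch of the kinds of calls involved (not necessarily the project's exact wrapper):

```swift
import CoreGraphics

// Minimal examples of synthesizing mouse input with CGEvent.
// Requires the app to be trusted for Accessibility (AXIsProcessTrusted).
func moveCursor(to point: CGPoint) {
    CGEvent(mouseEventSource: nil, mouseType: .mouseMoved,
            mouseCursorPosition: point, mouseButton: .left)?
        .post(tap: .cghidEventTap)
}

func click(at point: CGPoint) {
    for type in [CGEventType.leftMouseDown, .leftMouseUp] {
        CGEvent(mouseEventSource: nil, mouseType: type,
                mouseCursorPosition: point, mouseButton: .left)?
            .post(tap: .cghidEventTap)
    }
}

func scroll(byLines lines: Int32) {
    // Line-based scroll wheel event on the vertical axis.
    CGEvent(scrollWheelEvent2Source: nil, units: .line,
            wheelCount: 1, wheel1: lines, wheel2: 0, wheel3: 0)?
        .post(tap: .cghidEventTap)
}
```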

5. Calibration Mathematics

The calibration system uses a 2D affine transformation matrix to map gaze coordinates to screen coordinates:

Screen_X = a × Gaze_X + b × Gaze_Y + c
Screen_Y = d × Gaze_X + e × Gaze_Y + f

Where the coefficients (a, b, c, d, e, f) are calculated using least squares regression from the 5-point calibration data.
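
One way to obtain those coefficients from the calibration samples is to solve the least-squares normal equations directly; the helper below is an illustrative sketch, not the project's implementation:

```swift
import CoreGraphics
import simd

/// Fits Screen = A · [Gaze_X, Gaze_Y, 1] by least squares, matching the affine form above.
struct AffineGazeMap {
    let rowX: simd_double3   // (a, b, c)
    let rowY: simd_double3   // (d, e, f)

    static func fit(gaze: [CGPoint], screen: [CGPoint]) -> AffineGazeMap? {
        guard gaze.count == screen.count, gaze.count >= 3 else { return nil }

        // Normal equations: (GᵀG) x = Gᵀs, where each row of G is [gx, gy, 1].
        var gtg = simd_double3x3()              // zero matrix
        var gtsX = simd_double3(repeating: 0)
        var gtsY = simd_double3(repeating: 0)

        for (g, s) in zip(gaze, screen) {
            let row = simd_double3(Double(g.x), Double(g.y), 1)
            let outer = simd_double3x3(columns: (row * row.x, row * row.y, row * row.z))
            gtg = gtg + outer
            gtsX += row * Double(s.x)
            gtsY += row * Double(s.y)
        }

        let inv = gtg.inverse                   // fine for well-spread calibration points
        return AffineGazeMap(rowX: inv * gtsX, rowY: inv * gtsY)
    }

    func screenPoint(for gaze: CGPoint) -> CGPoint {
        let v = simd_double3(Double(gaze.x), Double(gaze.y), 1)
        return CGPoint(x: simd_dot(rowX, v), y: simd_dot(rowY, v))
    }
}
```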

Calibration System

The calibration system uses a 5-point calibration process:

  1. Top-left corner
  2. Top-right corner
  3. Bottom-left corner
  4. Bottom-right corner
  5. Center point

This creates a mapping matrix that accounts for individual differences in eye movement patterns and head position.

User Experience & Interaction Patterns

Visual Feedback System

The reticle provides real-time feedback about system state:

  • Blue: Ready state (tracking active, no gesture) - Look around to move cursor
  • Green: Click detected (short pinch) - Quick pinch gesture registered
  • Red: Drag state (long pinch) - Hold pinch for dragging operations
  • Gray: Not tracking or low confidence - System needs better visibility

Interaction Workflow

  1. Initialization: Start tracking → Calibrate → Enable input synthesis
  2. Daily Use: Look at target → Pinch to click → Move hand to scroll
  3. Advanced Operations:
    • Text Selection: Look at start → Pinch and hold → Look at end → Release
    • Window Management: Look at window → Pinch and hold → Look at destination
    • Scrolling: Look at scrollable area → Move hand vertically

Learning Curve

  • Immediate: Basic cursor movement works right away
  • 5 minutes: Comfortable with clicking and basic navigation
  • 15 minutes: Proficient with dragging and scrolling
  • 30 minutes: Natural interaction patterns established

Technical Challenges & Solutions

Challenge 1: Eye Tracking Accuracy

Problem: Eye tracking can be noisy and inaccurate, especially with varying lighting conditions.

Solutions:

  • Multi-point Calibration: 5-point calibration system accounts for individual differences
  • Smoothing Algorithms: Weighted averaging reduces jitter and improves stability
  • Confidence Filtering: Only processes high-confidence detections (>30% threshold)
  • Eye Aspect Ratio: Monitors eye openness to ensure reliable tracking
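
The smoothing step can be as simple as an exponentially weighted moving average (one common form of weighted averaging); an illustrative sketch with a made-up tuning value:

```swift
import CoreGraphics

/// Exponentially weighted smoothing of gaze points: newer samples count more, older ones decay.
/// `alpha` is an illustrative tuning value, not the project's constant.
struct GazeSmoother {
    var alpha: CGFloat = 0.25
    private var smoothed: CGPoint?

    mutating func add(_ sample: CGPoint) -> CGPoint {
        guard let previous = smoothed else {
            smoothed = sample
            return sample
        }
        let next = CGPoint(x: previous.x + alpha * (sample.x - previous.x),
                           y: previous.y + alpha * (sample.y - previous.y))
        smoothed = next
        return next
    }
}
```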

Challenge 2: Gesture Recognition Reliability

Problem: Hand gestures need to be distinguished from natural hand movements.

Solutions:

  • Distance Thresholds: Precise pinch detection using normalized distance measurements
  • Duration Classification: Time-based distinction between taps and drags
  • Movement Filtering: Minimum movement thresholds prevent accidental scrolling
  • State Tracking: Maintains gesture state across frames for consistency

Challenge 3: System Integration

Problem: Converting tracking data into reliable system input events.

Solutions:

  • Accessibility Framework: Uses AXIsProcessTrusted for proper system integration
  • CGEvent Generation: Creates native macOS mouse events for universal compatibility
  • Coordinate Transformation: Accurate mapping from camera space to screen space
  • Event Timing: Proper delays and timing for natural interaction feel

Challenge 4: Performance Optimization

Problem: Real-time computer vision processing can be computationally expensive.

Solutions:

  • Frame Skipping: Processes every 2nd frame to maintain 30fps performance
  • Efficient Algorithms: Optimized coordinate calculations and transformations
  • Background Processing: Vision processing on dedicated background queue
  • Memory Management: Proper cleanup and resource management
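
Frame skipping in the capture delegate amounts to a single counter; a brief illustrative sketch:

```swift
import AVFoundation

// Process only every other frame to keep Vision work within budget.
// Sketch only; counter name and stride are illustrative.
final class SkippingDelegate: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private var frameIndex = 0

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        frameIndex += 1
        guard frameIndex % 2 == 0,
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Hand pixelBuffer to the Vision pipeline on this background queue.
        _ = pixelBuffer
    }
}
```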

Performance Considerations

  • Frame processing is limited to every 2nd frame for performance
  • Smoothing algorithms reduce jitter in gaze and hand tracking
  • Confidence thresholds filter out low-quality detections
  • Efficient coordinate mapping minimizes computational overhead

Troubleshooting

Common Issues

  1. Poor Tracking Accuracy

    • Ensure good lighting on your face
    • Recalibrate the system
    • Check that your face is clearly visible in the camera preview
  2. Gestures Not Working

    • Ensure your hand is clearly visible to the camera
    • Check that Accessibility permissions are granted
    • Try adjusting your hand position relative to the camera
  3. Cursor Not Moving

    • Verify camera permissions are granted
    • Check that gaze tracking is active (blue reticle)
    • Ensure you're looking at the screen, not away from it

Performance Tips

  • Close other camera-intensive applications
  • Ensure good lighting conditions
  • Keep your face and hands clearly visible to the camera
  • Avoid rapid head movements during use

Development

Project Structure

VisionOS Simulator/
├── Models/
│   └── GazeState.swift          # State models for gaze and hand tracking
├── Services/
│   ├── CameraService.swift      # Camera capture and video processing
│   ├── GazeTrackingService.swift # Eye tracking and gaze estimation
│   ├── HandTrackingService.swift # Hand gesture recognition
│   ├── InputSynthesisService.swift # System input generation
│   └── VisionCoordinator.swift  # Main service coordinator
├── Views/
│   ├── CameraPreviewView.swift  # Camera preview component
│   ├── ReticleView.swift        # Gaze reticle overlay
│   ├── StatusView.swift         # System status display
│   └── CalibrationView.swift    # Calibration interface
├── Extensions/
│   └── NotificationNames.swift  # Custom notification names
└── ContentView.swift            # Main application interface

Key Technologies

  • AVFoundation: Camera capture and video processing
  • Vision Framework: Face and hand landmark detection
  • Core Graphics: System input event generation
  • SwiftUI: User interface
  • Combine: Reactive programming and state management

Project Significance

Accessibility Impact

This project demonstrates how modern computer vision can create powerful accessibility tools, potentially helping users with motor disabilities interact with their computers using only their eyes and minimal hand movements.

Technology Demonstration

The project showcases the capabilities of Apple's Vision Framework and demonstrates how to integrate computer vision with system-level input synthesis, serving as a reference implementation for similar applications.

Research Applications

The codebase provides a foundation for research in:

  • Human-computer interaction
  • Eye tracking algorithms
  • Gesture recognition systems
  • Accessibility technology development

Future Enhancements

Short-term Improvements

  • Support for multiple monitor setups
  • Customizable gesture mappings
  • Advanced calibration algorithms
  • Performance optimization for older hardware

Long-term Vision

  • Integration with accessibility features
  • Support for additional gesture types
  • Machine learning-based gesture recognition
  • Integration with other Apple frameworks (ARKit, Core ML)
  • Cross-platform compatibility (iOS, iPadOS)
  • Voice command integration
  • Custom gesture creation tools

License

This project is for educational and research purposes. Please ensure you comply with Apple's guidelines and terms of service when using system input synthesis features.

Contributing

This is a demonstration project showcasing the capabilities of Vision Framework and system input synthesis on macOS. Feel free to explore the code and adapt it for your own projects.
