A native macOS application that translates real-world hand gestures and eye gaze, captured by the MacBook's camera, into standard system inputs (mouse movement, clicks, scrolling) to emulate the intuitive Gaze-and-Pinch control scheme of Apple Vision Pro.
This application brings Apple Vision Pro's revolutionary gaze-and-pinch interaction paradigm to macOS, allowing you to control your Mac using only your eyes and hands - no mouse or trackpad required. By leveraging the built-in FaceTime camera and advanced computer vision algorithms, the app tracks your eye movements to position the cursor and recognizes hand gestures for clicking, dragging, and scrolling.
- Eye-Controlled Cursor: Look anywhere on your screen and the cursor follows your gaze
- Gesture-Based Interaction: Pinch your fingers to click, hold to drag, move your hand to scroll
- Real-Time Processing: Low-latency tracking with smooth cursor movement
- Accessibility Integration: Works with any application that accepts standard mouse input
- Gaze Tracking: Real-time eye tracking using Vision Framework to control cursor position
- Hand Gesture Recognition: Pinch gestures for clicks and drags, hand movements for scrolling
- System Input Synthesis: Converts gaze and gestures into actual mouse events
- Multi-point Calibration: Robust calibration system for accurate gaze mapping
- Visual Feedback: Real-time reticle showing gaze position and gesture states
- Camera Preview: Live camera feed with overlay indicators
- macOS 15.6 or later
- MacBook with built-in FaceTime camera
- Camera and Accessibility permissions
- Clone or download this project
- Open VisionOS Simulator.xcodeproj in Xcode
- Build and run the project
When you first launch the app, you'll need to grant two permissions:
- Camera Permission: Required to capture video from your FaceTime camera
- Accessibility Permission: Required to synthesize mouse events and control the cursor
The app will guide you through granting these permissions.
For accurate gaze tracking, you need to calibrate the system:
- Click "Start Tracking" to begin camera capture
- Click "Calibrate" to open the calibration interface
- Follow the on-screen instructions to look at each calibration point
- Complete all 5 calibration points for optimal accuracy
Once calibrated:
- Click "Enable Input" to activate gaze and gesture control
- Look at different areas of your screen - the cursor will follow your gaze
- Use pinch gestures to click:
- Quick pinch: Single click
- Hold pinch: Drag operation
- Move your hand while not pinching to scroll
- CameraService: Manages AVFoundation camera capture and video processing
- GazeTrackingService: Processes face landmarks to estimate gaze direction
- HandTrackingService: Detects hand poses and gestures
- InputSynthesisService: Converts tracking data into system mouse events
- VisionCoordinator: Orchestrates all services and manages state
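As a rough, hypothetical sketch of how a coordinator in this role could combine the services' outputs, the snippet below pairs the latest gaze sample with the latest gesture state using Combine. The type and publisher names (CoordinatorSketch, gazePoints, pinchStates) are illustrative assumptions, not the project's actual API.

```swift
import Combine
import CoreGraphics
import Foundation

/// Hypothetical coordinator sketch: fan the latest gaze sample and gesture
/// state out to the input-synthesis step. Names and publishers are assumed,
/// not taken from the project.
final class CoordinatorSketch {
    private var cancellables = Set<AnyCancellable>()

    let gazePoints = PassthroughSubject<CGPoint, Never>()   // normalized gaze samples
    let pinchStates = PassthroughSubject<Bool, Never>()     // true while pinching

    func start(synthesize: @escaping (CGPoint, Bool) -> Void) {
        gazePoints
            .combineLatest(pinchStates)
            .receive(on: DispatchQueue.main)
            .sink { point, isPinching in
                // Hand the paired state to the input-synthesis service.
                synthesize(point, isPinching)
            }
            .store(in: &cancellables)
    }
}
```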
- Video Capture: AVCaptureSession captures video frames from FaceTime camera
- Dual Vision Processing:
- VNDetectFaceLandmarksRequest for eye tracking
- VNDetectHumanHandPoseRequest for hand gesture recognition
- Coordinate Mapping: Maps normalized gaze coordinates to screen coordinates
- Input Synthesis: Generates CGEvent mouse events
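As an illustration of the dual-request step, the sketch below runs both Vision requests against a single captured frame. The function name and error handling are simplified assumptions.

```swift
import Vision

/// Sketch of the dual-request step: run face-landmark and hand-pose detection
/// on a single captured frame. Error handling is reduced to the essentials.
func analyze(pixelBuffer: CVPixelBuffer) throws -> (face: VNFaceObservation?, hand: VNHumanHandPoseObservation?) {
    let faceRequest = VNDetectFaceLandmarksRequest()
    let handRequest = VNDetectHumanHandPoseRequest()
    handRequest.maximumHandCount = 1

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
    try handler.perform([faceRequest, handRequest])

    return (faceRequest.results?.first, handRequest.results?.first)
}
```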
- Pinch Detection: Monitors distance between index finger tip and thumb tip
- Tap vs Drag: Duration-based classification (short pinch = tap, long pinch = drag)
- Scroll Detection: Hand movement without pinch gesture
FaceTime Camera → AVCaptureSession → Video Frames → Vision Framework Processing
The app continuously captures video from your MacBook's FaceTime camera at high resolution, processing frames through Apple's Vision Framework for real-time analysis.
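The following sketch shows one common way such a capture pipeline is set up with AVFoundation; the preset, queue label, and class name are assumptions rather than the project's actual configuration.

```swift
import AVFoundation

/// Sketch of a camera capture pipeline that delivers frames to a delegate.
/// The preset, queue label, and class name are illustrative assumptions.
final class CaptureSketch: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let videoQueue = DispatchQueue(label: "camera.frames")

    func configure() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: camera)

        session.beginConfiguration()
        session.sessionPreset = .high
        if session.canAddInput(input) { session.addInput(input) }

        let output = AVCaptureVideoDataOutput()
        output.alwaysDiscardsLateVideoFrames = true
        output.setSampleBufferDelegate(self, queue: videoQueue)
        if session.canAddOutput(output) { session.addOutput(output) }
        session.commitConfiguration()

        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Forward the pixel buffer to the Vision processing stage here.
        _ = CMSampleBufferGetImageBuffer(sampleBuffer)
    }
}
```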
Face Detection → Eye Landmark Extraction → Pupil Center Calculation → Gaze Direction Estimation → Screen Coordinate Mapping
Step-by-step breakdown:
- Face Detection: Vision Framework identifies your face in the video frame
- Eye Landmark Extraction: Detects detailed eye contours, eyelids, and pupil positions
- Pupil Center Calculation: Calculates the precise center of each pupil
- Gaze Direction: Estimates where you're looking based on pupil position relative to eye landmarks
- Coordinate Mapping: Converts normalized gaze coordinates to actual screen pixel coordinates
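A simplified sketch of the pupil-center and gaze-direction steps follows, assuming the pupil landmark is expressed as an offset inside the eye contour's bounding box; the project's exact landmark math may differ.

```swift
import Vision
import CoreGraphics

/// Sketch of the pupil-center and gaze-direction steps: express the pupil
/// landmark as an offset inside the eye contour's bounding box. Illustrative
/// only; the real landmark math may differ.
func normalizedGaze(from face: VNFaceObservation) -> CGPoint? {
    guard let eye = face.landmarks?.leftEye,
          let pupilPoint = face.landmarks?.leftPupil?.normalizedPoints.first else { return nil }

    // Bounding box of the eye contour in the face's normalized space.
    let xs = eye.normalizedPoints.map(\.x)
    let ys = eye.normalizedPoints.map(\.y)
    guard let minX = xs.min(), let maxX = xs.max(),
          let minY = ys.min(), let maxY = ys.max(),
          maxX > minX, maxY > minY else { return nil }

    // Pupil position as a 0...1 offset inside the eye box, a crude gaze proxy.
    return CGPoint(x: (pupilPoint.x - minX) / (maxX - minX),
                   y: (pupilPoint.y - minY) / (maxY - minY))
}
```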
Hand Detection → Key Point Extraction → Gesture Classification → Action Mapping
Detailed process:
- Hand Detection: Identifies hand presence and extracts skeletal key points
- Key Point Tracking: Monitors wrist, index finger tip, and thumb tip positions
- Distance Calculation: Measures distance between index finger and thumb
- Gesture Classification (see the sketch after this list):
- Pinch: index finger and thumb tips closer than a small threshold (roughly 3 cm, measured in normalized image coordinates)
- Tap: Pinch duration < 0.3 seconds
- Drag: Pinch duration ≥ 0.3 seconds
- Scroll: Hand movement without pinch gesture
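A minimal sketch of this classification using Vision's hand-pose joints is shown below. The 0.05 normalized-distance threshold and 0.3 s cutoff mirror the heuristics above but are illustrative values, and the type names are hypothetical.

```swift
import Foundation
import Vision

/// Sketch of the distance/duration classification. The 0.05 normalized
/// distance and 0.3 s cutoff mirror the heuristics above but are illustrative.
enum PinchGesture { case none, tap, drag }

struct PinchClassifier {
    private var pinchStart: Date?

    mutating func update(with hand: VNHumanHandPoseObservation,
                         at time: Date = Date()) throws -> PinchGesture {
        let thumb = try hand.recognizedPoint(.thumbTip)
        let index = try hand.recognizedPoint(.indexTip)
        guard thumb.confidence > 0.3, index.confidence > 0.3 else { return .none }

        // Index-to-thumb distance in Vision's normalized image coordinates.
        let distance = hypot(thumb.location.x - index.location.x,
                             thumb.location.y - index.location.y)
        let isPinching = distance < 0.05

        if isPinching {
            if pinchStart == nil { pinchStart = time }
            // Held long enough: report an ongoing drag.
            return time.timeIntervalSince(pinchStart!) >= 0.3 ? .drag : .none
        } else if let start = pinchStart {
            pinchStart = nil
            // Released quickly: a single tap (click).
            return time.timeIntervalSince(start) < 0.3 ? .tap : .none
        }
        return .none
    }
}
```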
Tracking Data → Smoothing Algorithms → CGEvent Generation → System Input
Processing pipeline:
- Data Smoothing: Applies weighted averaging to reduce jitter
- Threshold Filtering: Removes movements below minimum thresholds
- Event Generation: Creates native macOS mouse events using Core Graphics
- System Integration: Posts events to the system event queue
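A stripped-down sketch of the event-generation step follows. Real code would also pace events and emit drag updates, but the CGEvent calls shown are the standard Core Graphics synthesis mechanism.

```swift
import CoreGraphics

/// Sketch of the CGEvent step: move the cursor and synthesize a left click.
func moveCursor(to point: CGPoint) {
    CGEvent(mouseEventSource: nil, mouseType: .mouseMoved,
            mouseCursorPosition: point, mouseButton: .left)?
        .post(tap: .cghidEventTap)
}

func click(at point: CGPoint) {
    for type in [CGEventType.leftMouseDown, .leftMouseUp] {
        CGEvent(mouseEventSource: nil, mouseType: type,
                mouseCursorPosition: point, mouseButton: .left)?
            .post(tap: .cghidEventTap)
    }
}
```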
The calibration system uses a 2D affine transformation matrix to map gaze coordinates to screen coordinates:
Screen_X = a × Gaze_X + b × Gaze_Y + c
Screen_Y = d × Gaze_X + e × Gaze_Y + f
Where the coefficients (a, b, c, d, e, f) are calculated using least squares regression from the 5-point calibration data.
Calibration collects gaze samples at five screen positions:
- Top-left corner
- Top-right corner
- Bottom-left corner
- Bottom-right corner
- Center point
This creates a mapping matrix that accounts for individual differences in eye movement patterns and head position.
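Because the Screen_X and Screen_Y equations decouple, each triple of coefficients can be fit independently as a 3-unknown least-squares problem. The sketch below solves it via the normal equations using simd; the type and function names (GazeCalibration, fitAxis) are hypothetical.

```swift
import simd

/// Sketch of fitting one row of the affine map by least squares, solving
/// the normal equations (A^T A)^-1 A^T b. Names are hypothetical.
enum GazeCalibration {
    /// Returns (a, b, c) such that screen ≈ a·gazeX + b·gazeY + c.
    static func fitAxis(gaze: [SIMD2<Double>], screen: [Double]) -> SIMD3<Double>? {
        guard gaze.count == screen.count, gaze.count >= 3 else { return nil }
        var ata = double3x3()                 // accumulates A^T A
        var atb = SIMD3<Double>(repeating: 0) // accumulates A^T b
        for (g, s) in zip(gaze, screen) {
            let row = SIMD3<Double>(g.x, g.y, 1)
            // Outer product row·row^T, built column by column (symmetric).
            ata = ata + double3x3(columns: (row * row.x, row * row.y, row * row.z))
            atb += row * s
        }
        return ata.inverse * atb
    }
}
```

With the 5-point data, such a routine would be called twice: once with the screen X targets to obtain (a, b, c) and once with the screen Y targets to obtain (d, e, f).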
The reticle provides real-time feedback about system state:
- Blue: Ready state (tracking active, no gesture) - Look around to move cursor
- Green: Click detected (short pinch) - Quick pinch gesture registered
- Red: Drag state (long pinch) - Hold pinch for dragging operations
- Gray: Not tracking or low confidence - System needs better visibility
- Initialization: Start tracking → Calibrate → Enable input synthesis
- Daily Use: Look at target → Pinch to click → Move hand to scroll
- Advanced Operations:
- Text Selection: Look at start → Pinch and hold → Look at end → Release
- Window Management: Look at window → Pinch and hold → Look at destination
- Scrolling: Look at scrollable area → Move hand vertically
- Immediate: Basic cursor movement works right away
- 5 minutes: Comfortable with clicking and basic navigation
- 15 minutes: Proficient with dragging and scrolling
- 30 minutes: Natural interaction patterns established
Problem: Eye tracking can be noisy and inaccurate, especially with varying lighting conditions.
Solutions:
- Multi-point Calibration: 5-point calibration system accounts for individual differences
- Smoothing Algorithms: Weighted averaging reduces jitter and improves stability
- Confidence Filtering: Only processes high-confidence detections (>30% threshold)
- Eye Aspect Ratio: Monitors eye openness to ensure reliable tracking
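One simple form of the confidence filtering and weighted averaging described above is an exponential filter, sketched below. The 0.3 confidence cutoff comes from this section; the blend factor is an illustrative choice, and the type name is hypothetical.

```swift
import CoreGraphics

/// Sketch of confidence filtering plus a weighted average (exponential filter).
struct GazeSmoother {
    private var smoothed: CGPoint?
    let blend: CGFloat = 0.25   // weight given to the newest sample

    mutating func add(_ point: CGPoint, confidence: Float) -> CGPoint? {
        guard confidence > 0.3 else { return smoothed }   // drop low-confidence samples
        guard let previous = smoothed else {
            smoothed = point
            return point
        }
        // Blend the new sample into the running estimate to damp jitter.
        let next = CGPoint(x: previous.x + blend * (point.x - previous.x),
                           y: previous.y + blend * (point.y - previous.y))
        smoothed = next
        return next
    }
}
```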
Problem: Hand gestures need to be distinguished from natural hand movements.
Solutions:
- Distance Thresholds: Precise pinch detection using normalized distance measurements
- Duration Classification: Time-based distinction between taps and drags
- Movement Filtering: Minimum movement thresholds prevent accidental scrolling
- State Tracking: Maintains gesture state across frames for consistency
Problem: Converting tracking data into reliable system input events.
Solutions:
- Accessibility Framework: Verifies trust with AXIsProcessTrusted before posting events
- CGEvent Generation: Creates native macOS mouse events for universal compatibility
- Coordinate Transformation: Accurate mapping from camera space to screen space
- Event Timing: Proper delays and timing for natural interaction feel
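Verifying (and prompting for) that trust is a short call into the Accessibility API, roughly as sketched below.

```swift
import ApplicationServices

/// Sketch of checking, and prompting for, the Accessibility permission
/// that event synthesis requires.
func ensureAccessibilityPermission() -> Bool {
    let promptKey = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    return AXIsProcessTrustedWithOptions([promptKey: true] as CFDictionary)
}
```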
Problem: Real-time computer vision processing can be computationally expensive.
Solutions:
- Frame Skipping: Processes every 2nd frame to maintain 30fps performance
- Efficient Algorithms: Optimized coordinate calculations and transformations
- Background Processing: Vision processing on dedicated background queue
- Memory Management: Proper cleanup and resource management
- Frame processing is limited to every 2nd frame for performance
- Smoothing algorithms reduce jitter in gaze and hand tracking
- Confidence thresholds filter out low-quality detections
- Efficient coordinate mapping minimizes computational overhead
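The frame-skipping strategy reduces to a small counter consulted in the capture delegate, roughly as follows; the helper type is hypothetical.

```swift
/// Sketch of the frame-skipping gate: lets only every 2nd frame through
/// to Vision processing. The type name is hypothetical.
final class FrameGate {
    private var frameIndex = 0

    /// Returns true when the current frame should be processed.
    func shouldProcess() -> Bool {
        frameIndex += 1
        return frameIndex % 2 == 0
    }
}
```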
- Poor Tracking Accuracy:
- Ensure good lighting on your face
- Recalibrate the system
- Check that your face is clearly visible in the camera preview
- Gestures Not Working:
- Ensure your hand is clearly visible to the camera
- Check that Accessibility permissions are granted
- Try adjusting your hand position relative to the camera
- Cursor Not Moving:
- Verify camera permissions are granted
- Check that gaze tracking is active (blue reticle)
- Ensure you're looking at the screen, not away from it
- Close other camera-intensive applications
- Ensure good lighting conditions
- Keep your face and hands clearly visible to the camera
- Avoid rapid head movements during use
VisionOS Simulator/
├── Models/
│ └── GazeState.swift # State models for gaze and hand tracking
├── Services/
│ ├── CameraService.swift # Camera capture and video processing
│ ├── GazeTrackingService.swift # Eye tracking and gaze estimation
│ ├── HandTrackingService.swift # Hand gesture recognition
│ ├── InputSynthesisService.swift # System input generation
│ └── VisionCoordinator.swift # Main service coordinator
├── Views/
│ ├── CameraPreviewView.swift # Camera preview component
│ ├── ReticleView.swift # Gaze reticle overlay
│ ├── StatusView.swift # System status display
│ └── CalibrationView.swift # Calibration interface
├── Extensions/
│ └── NotificationNames.swift # Custom notification names
└── ContentView.swift # Main application interface
- AVFoundation: Camera capture and video processing
- Vision Framework: Face and hand landmark detection
- Core Graphics: System input event generation
- SwiftUI: User interface
- Combine: Reactive programming and state management
This project demonstrates how modern computer vision can create powerful accessibility tools, potentially helping users with motor disabilities interact with their computers using only their eyes and minimal hand movements.
The project showcases the capabilities of Apple's Vision Framework and demonstrates how to integrate computer vision with system-level input synthesis, serving as a reference implementation for similar applications.
The codebase provides a foundation for research in:
- Human-computer interaction
- Eye tracking algorithms
- Gesture recognition systems
- Accessibility technology development
- Support for multiple monitor setups
- Customizable gesture mappings
- Advanced calibration algorithms
- Performance optimization for older hardware
- Integration with accessibility features
- Support for additional gesture types
- Machine learning-based gesture recognition
- Integration with other Apple frameworks (ARKit, Core ML)
- Cross-platform compatibility (iOS, iPadOS)
- Voice command integration
- Custom gesture creation tools
This project is for educational and research purposes. Please ensure you comply with Apple's guidelines and terms of service when using system input synthesis features.
This is a demonstration project showcasing the capabilities of Vision Framework and system input synthesis on macOS. Feel free to explore the code and adapt it for your own projects.