Claude/add cpu mps support p6 mzo #375

Conversation
This change enables SAM3 to run on Mac M4 and other non-CUDA systems by:
- Creating a device utility module (sam3/utils/device.py) for automatic device detection with priority CUDA > MPS > CPU (see the sketch after this list)
- Adding PyTorch-based fallbacks for Triton kernels:
  - sigmoid_focal_loss.py: pure PyTorch implementation for CPU/MPS
  - edt.py: SciPy-based EDT implementation for CPU/MPS
- Updating device detection in model_builder.py to auto-detect the best available device instead of assuming CUDA
- Replacing hardcoded .cuda() calls with device-agnostic .to(device) throughout the codebase:
  - io_utils.py: video/image loading now respects the selected device
  - sam3_tracker_base.py: memory features use the correct device
  - sam3_tracking_predictor.py: image inference uses the inference state's device
  - sam3_video_predictor.py: model initialization uses get_device()
- Adding MPS-aware fallbacks in perflib:
  - nms.py: falls back to the CPU implementation for MPS
  - connected_components.py: falls back to the CPU implementation for MPS
- Fixing CUDA-specific backend calls in transformer.py to only run on CUDA devices

Note: Distributed training features (NCCL backend) still require CUDA, as that is an inherent limitation of NCCL.
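A minimal sketch of the CUDA > MPS > CPU selection described in the first bullet; the actual sam3/utils/device.py may expose more helpers and options:

```python
# Sketch only: device selection with priority CUDA > MPS > CPU.
import torch

def get_device() -> torch.device:
    """Pick the best available device."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def get_device_str() -> str:
    """String form, convenient for configs and logging."""
    return str(get_device())
```

Callers can then write `model.to(get_device())` instead of hardcoding `model.cuda()`.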
Test coverage includes:
- Device utility module functions
- Sigmoid focal loss on CPU/MPS
- EDT (Euclidean Distance Transform) on CPU/MPS
- NMS on CPU/MPS
- Connected components on CPU/MPS
- Transformer attention modules on CPU
- Model builder device parameter handling

MPS tests are automatically skipped when MPS is not available.
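The MPS skip behaviour could look roughly like the pytest pattern below; the test name and tensor shapes are illustrative, not the PR's actual tests:

```python
# Sketch: skip MPS-only tests on machines without an MPS backend.
import pytest
import torch

requires_mps = pytest.mark.skipif(
    not torch.backends.mps.is_available(),
    reason="MPS backend not available on this machine",
)

@requires_mps
def test_tensor_lands_on_mps():
    x = torch.randn(4, 3, device="mps")
    assert x.device.type == "mps"
```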
Adds a comprehensive example script for real-time camera segmentation using SAM3. Features include:
- Auto-detection mode for automatic object segmentation
- Interactive point-based prompting (left/right click)
- Multi-device support (CUDA, MPS, CPU)
- FPS tracking and display overlay
- Frame saving and pause functionality
Move the decord import inside the video-loading conditional block so it is only imported when MP4 files are actually being loaded. This prevents import errors when decord is not installed but video loading is not needed.
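A hedged sketch of the lazy-import pattern; the function name load_video_frames and the error handling are illustrative, not the real io_utils.py code:

```python
# Sketch: defer the decord import until an MP4 actually needs decoding,
# so environments without decord can still use image-folder loading.
def load_video_frames(path: str):
    if path.lower().endswith(".mp4"):
        import decord  # only imported on this branch
        return decord.VideoReader(path)
    raise ValueError(f"Unsupported video format: {path}")
```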
Hi @eleviidev! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA Signed.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
- position_encoding.py: Use get_device() for precomputed position encodings
- decoder.py: Use get_device() for coordinate cache initialization
- vl_combiner.py: Default device to None, use get_device_str() at runtime
- sam3_image_processor.py: Default device to None, use get_device_str()
Rewrote the live camera segmentation script to use the correct SAM3 inference API via Sam3Processor instead of calling the model directly. Key changes:
- Use Sam3Processor.set_image() to process frames
- Use Sam3Processor.set_text_prompt() for text-based detection
- Use Sam3Processor.add_geometric_prompt() for interactive box prompts
- Results accessed via the state dict (masks, boxes, scores)
PyTorch's grid_sample has bugs on MPS with certain tensor configurations. Added _grid_sample_mps_safe() that falls back to CPU for MPS devices.
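A sketch of what such a wrapper might look like, assuming the real _grid_sample_mps_safe() simply round-trips through CPU for MPS tensors; extra arguments and edge cases are omitted:

```python
# Sketch: run grid_sample on CPU when the input lives on MPS, then move back.
import torch
import torch.nn.functional as F

def _grid_sample_mps_safe(inp: torch.Tensor, grid: torch.Tensor, **kwargs) -> torch.Tensor:
    if inp.device.type == "mps":
        out = F.grid_sample(inp.cpu(), grid.cpu(), **kwargs)
        return out.to(inp.device)
    return F.grid_sample(inp, grid, **kwargs)
```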
pin_memory() is a CUDA-specific optimization that doesn't work on MPS. Added device type checks to skip pin_memory() on non-CUDA devices.

Files fixed:
- geometry_encoders.py
- sam3_video_inference.py
- sam3_tracker_base.py
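The guard presumably looks something like this; the helper name maybe_pin is hypothetical, and in the PR the check is applied inline at each call site:

```python
# Sketch: pin_memory() only speeds up CPU -> CUDA transfers, so skip it otherwise.
import torch

def maybe_pin(t: torch.Tensor, target_device: torch.device) -> torch.Tensor:
    if target_device.type == "cuda":
        return t.pin_memory()
    return t  # MPS / CPU: pinning is unsupported or pointless
```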
torch._assert_async is not implemented for MPS devices. Use a regular assert on MPS as a fallback.

Files fixed:
- geometry_encoders.py
- sam3_image.py
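A minimal sketch of the fallback, assuming the check is keyed on the tensor's device type; the helper name assert_nonzero is illustrative and the exact call sites differ per file:

```python
# Sketch: synchronous Python assert on MPS, async device-side assert elsewhere.
import torch

def assert_nonzero(cond: torch.Tensor, message: str) -> None:
    if cond.device.type == "mps":
        # torch._assert_async is not implemented for MPS; .item() forces a sync.
        assert bool(cond.item()), message
    else:
        torch._assert_async(cond)
```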
Added command-line options to improve performance on MPS/CPU:
- --skip-frames N: only process every Nth frame (default: 1)
- --resolution N: lower the model resolution (default: 1008, try 512/768)

These options help achieve usable frame rates on Apple Silicon.
The model's precomputed positional encodings (freqs_cis) are sized for the 1008 resolution, so other resolutions cause shape mismatches. Use --skip-frames for performance improvement instead.
Added --half flag to convert model to float16, which can speed up inference on Apple Silicon by reducing memory bandwidth requirements.
Sam3Processor now automatically converts input images to match the model's dtype (float16 or float32), enabling half precision inference.
Match boxes dtype to img_feats dtype in roi_align call to support half precision inference.
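A hedged sketch of the dtype fix around torchvision's roi_align; the surrounding function and parameter values are illustrative:

```python
# Sketch: under --half the features are float16 but the boxes may still be
# float32; roi_align requires matching dtypes, so cast the boxes.
import torch
from torchvision.ops import roi_align

def pooled_features(img_feats: torch.Tensor, boxes: torch.Tensor,
                    output_size=(7, 7), spatial_scale=1.0) -> torch.Tensor:
    boxes = boxes.to(dtype=img_feats.dtype)
    return roi_align(img_feats, [boxes], output_size=output_size,
                     spatial_scale=spatial_scale)
```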
Metal Performance Shaders fails on mixed-dtype matrix multiplication, so half precision currently only works on CUDA, not MPS.
- Added --track flag to enable memory-based mask propagation between frames
- Fixed Sam3TrackerPredictor for MPS compatibility (autocast, storage device)
- When tracking is enabled, masks follow objects between full inference frames
- This allows higher frame rates while maintaining visual continuity
- Labels and confidence scores now displayed on each detected object mask
- Added info panel on the right side showing the list of detected objects
- Panel shows object label, confidence score (color-coded), and size
- Confidence scores are stored and tracked between frames
Users can now specify comma-separated prompts (e.g., --prompt "person, car, dog") to detect multiple object types simultaneously. Each detection is labeled with its corresponding prompt name in both the mask overlay and the info panel.
Features:
- Live video streaming with segmentation overlay
- Multi-prompt detection configuration via web UI
- Object count limits with show/hide toggle for each prompt type
- Verbose mode showing tracking, frame count, queue size
- Claude Vision API integration for detailed object analysis
- Command center style dark theme interface
- Real-time system log display
- Confidence threshold and skip-frames controls

Usage: python examples/web_command_center/app.py --prompt "person, car"
- Add setup_device_optimizations() with MPS memory management
- Add mps_synchronize() for explicit GPU synchronization
- Add empty_cache() for both CUDA and MPS memory cleanup (see the sketch after this list)
- Enable device optimizations in live camera and web command center

These optimizations help improve performance on Apple Silicon (M1/M2/M3/M4) by better utilizing the Metal GPU backend.
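A minimal sketch of what the cross-backend helpers might look like; the names follow the commit message, but the real implementations may differ:

```python
# Sketch: memory/synchronization helpers that cover both CUDA and MPS.
import torch

def empty_cache() -> None:
    """Release cached allocator memory on whichever backend is active."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()

def mps_synchronize() -> None:
    """Explicitly wait for queued Metal kernels to finish (no-op elsewhere)."""
    if torch.backends.mps.is_available():
        torch.mps.synchronize()
```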
New features added to the web command center:
- Memory Tracking: store mask history for object re-identification
- Persistent Object IDs: stable IDs across frames using IoU matching (see the sketch after this list)
- Fill Holes: morphological hole filling in masks
- Smooth Edges: edge smoothing with configurable kernel
- Non-Overlapping Masks: prevent mask overlaps (higher confidence wins)
- Boundary Suppression: ignore detections near frame edges
- Occlusion Suppression: remove heavily overlapped detections
- Hotstart Mode: require N frames before confirming a detection

All features have UI toggles in the Features tab with configurable parameters.
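A hedged sketch of IoU-based ID matching as described in the Persistent Object IDs bullet; the real tracker keeps more state (confidence, history, hotstart counters), and the helper names here are illustrative:

```python
# Sketch: give each new mask the ID of the best-overlapping previous mask,
# or a fresh ID if nothing overlaps enough. (Simplified: no uniqueness pass.)
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def match_ids(prev: dict[int, np.ndarray], new_masks: list[np.ndarray],
              next_id: int, iou_thresh: float = 0.5) -> list[int]:
    ids = []
    for mask in new_masks:
        best_id, best_iou = None, iou_thresh
        for obj_id, prev_mask in prev.items():
            iou = mask_iou(mask, prev_mask)
            if iou > best_iou:
                best_id, best_iou = obj_id, iou
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        ids.append(best_id)
    return ids
```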
- Add comprehensive SAM3-to-COCO label mapping (80+ entries) for YOLO compatibility
- Integrate YOLOv12 classification on SAM3 detected regions
- Add YOLOv12 pose estimation for person-like detections (17 keypoints)
- Implement skeleton/keypoint overlay visualization with configurable colors
- Add feature toggles: YOLO classify, pose, skeleton, keypoint labels, label spoofing
- Add configurable parameters: thresholds, keypoint radius, skeleton thickness
- Update UI with YOLO Integration section in Features tab
Backend:
- Add voice search state tracking to CommandCenter
- Add Claude-powered voice query parsing (parse_voice_query_with_claude)
- Handle natural language queries like "help me find a red car"
- Support multiple object detection from a single voice command
- Add /api/voice_search endpoint for voice query processing
- Add /api/voice_feedback and /api/toggle_voice endpoints

Frontend:
- Add Voice Search tab with large microphone button
- Add microphone button next to prompts input for quick access
- Implement Web Speech API for voice recognition
- Add real-time transcription feedback during listening
- Add TTS (Text-to-Speech) section with voice selection
- Add voice search history with click-to-reapply
- Visual feedback: listening (red pulse), processing (yellow)
- Speak search confirmations when TTS enabled
Backend:
- Add camera detection function with platform-specific camera naming
- Add camera settings to CommandCenter (current_camera_id, flip states)
- Add switch_camera() and reset_detection_state() functions
- Apply flip transformations in generate_frames() before processing
- Add /api/cameras endpoint to list available cameras
- Add /api/switch_camera endpoint to change cameras
- Add /api/flip_camera and /api/set_flip endpoints for mirror control
- Detect available cameras at startup

Frontend:
- Add Camera tab with camera selection dropdown
- Add refresh button to re-scan available cameras
- Add Flip Horizontal and Flip Vertical buttons with visual feedback
- Show current camera info and flip status
- Add camera tips section
- Style camera controls with consistent theme
- Add helper functions for mask validation and bounding box extraction
- Update detections during optical flow tracking frames instead of only on keyframes
- Remove objects only when they actually leave the frame (mask becomes invalid)
- Build the new detections list atomically on keyframes to avoid race conditions
- Keep tracked objects visible between SAM3 keyframe refreshes
- Support loading the API key from a .env file via python-dotenv
- Add --api-key command line argument option
- Pass api_key explicitly to the Anthropic client
- Add helpful error messages when the API key is missing
- Show a warning at startup if no API key is configured
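A sketch of the key-resolution order described above (CLI flag first, then .env / environment variable); build_client and its argument are hypothetical names:

```python
# Sketch: resolve the Anthropic API key from --api-key, then .env / environment.
import os
from dotenv import load_dotenv
from anthropic import Anthropic

def build_client(cli_key: str | None = None) -> Anthropic:
    load_dotenv()  # pulls ANTHROPIC_API_KEY from a local .env file if present
    api_key = cli_key or os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise SystemExit("No API key found: pass --api-key or set ANTHROPIC_API_KEY")
    return Anthropic(api_key=api_key)
```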
- YOLO12 doesn't exist - changed to YOLO11 (yolo11n-cls.pt, yolo11n-pose.pt)
- Added fallback to YOLOv8 models if YOLO11 fails
- Models auto-download on first use from the ultralytics hub
- YOLO12 released Feb 2025, so try it first for classification and pose
- Falls back to YOLO11, then YOLOv8, if pretrained weights are not available
- Cleaner loop-based fallback logic with logging for each attempt
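The loop-based fallback could look roughly like this; the weight filenames follow the commit messages but are not verified against the code:

```python
# Sketch: try newer YOLO classification weights first, fall back to older ones.
import logging
from ultralytics import YOLO

def load_yolo_classifier() -> YOLO | None:
    for weights in ("yolo12n-cls.pt", "yolo11n-cls.pt", "yolov8n-cls.pt"):
        try:
            logging.info("Trying YOLO weights: %s", weights)
            return YOLO(weights)  # auto-downloads from the ultralytics hub
        except Exception as exc:
            logging.warning("Failed to load %s: %s", weights, exc)
    return None  # YOLO features disabled if nothing loads
```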
- Add current_raw_frame to store the frame before overlay processing
- Use the raw frame for Claude object analysis instead of the display frame
- Fixes an issue where Claude was seeing blue mask overlays in the analysis
Features:
- Mask-based cropping: use the SAM3 mask to crop objects cleanly for Claude analysis (white background outside the mask)
- Describe Scene button: analyze the entire frame with Claude
- Voice commands: support "describe scene", "describe object X", "analyze the first object", etc.
- HTTPS support: use the --https flag to enable SSL for microphone access
  - Auto-generates self-signed certificates if cryptography is installed
  - Or provide --ssl-cert and --ssl-key for custom certificates

UI changes:
- Added "Describe Scene" button in the Claude Analysis panel
- Pass mask_index when analyzing objects for better cropping
- Handle describe actions in voice command processing
Reference Image Search:
- Upload a reference image to find similar objects
- "Search by Description" mode: Claude describes the image, SAM3 finds it
- "Search by Visual Match" mode: CLIP compares embeddings for similarity
- CLIP model loading with the transformers library

Draw to Search:
- Draw Box: click and drag on the video to select an object region
- Click Point: click on an object to segment it
- Canvas overlay for drawing interaction
- Sends geometric prompts (box/point) to SAM3

New API endpoints:
- /api/upload_reference - Upload a reference image with mode selection
- /api/clear_reference - Clear the reference search
- /api/reference_status - Get reference search status
- /api/draw_prompt - Send a box/point geometric prompt
- /api/clear_draw_prompt - Clear pending prompts

UI changes:
- New "Reference Search" tab with upload and draw controls
- Canvas overlay on video for draw-to-search
- Visual match threshold slider
Features:
- Full navigation UI overlay with directional arrows and distance indicators
- Voice guidance with TTS and proximity beep sounds (frequency changes with distance)
- "Navigate" button on each detected object
- Location memory system - remembers where objects were found
- Claude scene analysis for obstacle detection and location context
- HTTPS is now the default mode for microphone/camera access
- Visual distance ring that pulses when the object is reachable
- Success sound/announcement when the object is reached
- Auto-stop navigation after reaching the target
SQLite Database:
- Full database schema for sessions, detections, analysis, navigation, obstacles
- Migrated location memory from JSON to SQLite
- History APIs for detections, analysis, and navigation
- Session statistics tracking

Obstacle Detection During Navigation:
- SAM3-based obstacle detection running in parallel during navigation
- Predefined obstacle prompts (stairs, edges, furniture, doors, etc.)
- Severity levels (high/medium/low) with color-coded masks
- Distance estimation based on object size in frame
- Visual overlays with warning triangles and labels
- Audio alerts with different beep patterns per severity
- TTS announcements for obstacle warnings
- Cooldown system to prevent alert spam

Post-Navigation Dialog:
- Shows a dialog when navigation ends asking the user to continue or pause
- TTS-enabled for accessibility
- Remembers if the target was reached

Other improvements:
- Session ID tracking for all database operations
- Event logging for navigation start/stop
- Obstacle history saved to database
This is a much smarter approach to obstacle detection.

Before (Static List):
- Used a hardcoded list of "obstacle" words (stairs, chair, table, etc.)
- Would incorrectly flag the user's target as an obstacle
- No understanding of spatial relationships
- No context about what's actually in the path

After (Claude AI):
- Claude analyzes the scene with context about the navigation target
- Understands the target is NOT an obstacle (won't flag it)
- Identifies only objects that are physically in the path to the target
- Provides spatial context (left, right, center, floor, ahead)
- Explains WHY something is an obstacle (reason field)
- Suggests a safe direction when obstacles are present
- Understands environment type (room, hallway, outdoor, etc.)

Technical changes:
- Added analyze_obstacles_with_claude() for intelligent analysis
- Claude returns: environment, path_clear, obstacles[], safe_direction
- Rate-limited to avoid excessive API calls (3 second cache; see the sketch after this list)
- Updated overlay to use position-based regions instead of masks
- Shows a "PATH CLEAR" indicator when Claude confirms a safe path
- Enhanced UI alerts with position and reason context
- Different visual styles for high/medium/low severity
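The 3-second cache mentioned under "Technical changes" might be implemented along these lines; the wrapper below is a sketch that takes the PR's analyze_obstacles_with_claude() (or any analysis function) as a parameter rather than reproducing its actual code:

```python
# Sketch: reuse the last Claude analysis for min_interval seconds to limit API calls.
import time

_last_result: dict | None = None
_last_call: float = 0.0

def analyze_obstacles_cached(frame, target: str, analyze_fn,
                             min_interval: float = 3.0) -> dict | None:
    global _last_result, _last_call
    now = time.monotonic()
    if _last_result is not None and now - _last_call < min_interval:
        return _last_result            # serve the cached analysis
    _last_result = analyze_fn(frame, target)  # expensive Claude API call
    _last_call = now
    return _last_result
```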
This implements a two-layer detection system like the robot obstacle avoidance project, but enhanced with AI understanding:
Layer 1 - OpenCV Real-Time (every frame; see the code sketch after this list):
- Bilateral filtering to reduce noise while preserving edges
- Canny edge detection to find object boundaries
- Contour detection to identify obstacle shapes
- Region-based analysis (left/center/right/floor paths)
- Edge density calculation for proximity estimation
- Floor clearance analysis for trip hazards
Layer 2 - Claude AI (every 3 seconds):
- Contextual understanding of what obstacles are
- Knows the navigation target is NOT an obstacle
- Explains WHY something is dangerous
- Suggests safe direction to move
How they work together:
- OpenCV: "There's something in front of you!" (immediate, ~20ms)
- Claude: "It's a glass coffee table between you and the mug,
move right to avoid it" (smart, ~1-2s)
Visual overlay improvements:
- OpenCV detections: dashed bounding boxes with [CV] label
- Claude detections: solid regions with reason text
- Shows "PATH CLEAR" when safe or "Go: [direction]" for guidance
- Floor analysis suggests clearest path (left/center/right)
Proximity estimation:
- Position in frame (lower = closer)
- Edge density (higher = larger/closer object)
- Floor uniformity (uniform = clear, edges = obstacles)
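A rough sketch of the Layer 1 OpenCV pipeline above (bilateral filter, Canny edges, per-region edge density); the thresholds and the simple left/center/right split are illustrative assumptions, not the app's actual values:

```python
# Sketch: edge-density screening of the frame to flag obstructed regions.
import cv2
import numpy as np

def opencv_obstacle_regions(frame_bgr: np.ndarray,
                            density_thresh: float = 0.08) -> dict[str, bool]:
    """Return which thirds of the frame (left/center/right) look obstructed."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    smooth = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
    edges = cv2.Canny(smooth, 50, 150)

    h, w = edges.shape
    thirds = {
        "left": edges[:, : w // 3],
        "center": edges[:, w // 3 : 2 * w // 3],
        "right": edges[:, 2 * w // 3 :],
    }
    # Higher edge density in a region suggests a larger / closer object there.
    return {name: (region > 0).mean() > density_thresh
            for name, region in thirds.items()}
```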
Implements advanced obstacle detection using only a single RGB camera:

Layer 1: OpenCV Edge Detection (every frame, ~20ms)
- Canny edges, contours, bilateral filtering
- Immediate response for sudden obstacles

Layer 2: AI Depth Estimation (MiDaS/Depth Anything)
- LIDAR-like depth perception from a single camera
- Actual distance measurement, not just presence detection
- Tries Depth Anything (2024 SOTA) first, falls back to MiDaS

Layer 3: Optical Flow Collision Detection (see the sketch after this list)
- Biomimetic technique (how insects detect collisions)
- Detects approaching objects via motion expansion
- Estimates time-to-collision (TTC)

Layer 4: Claude AI Analysis (every 3 seconds)
- Semantic understanding of obstacles
- Context-aware (knows the target is NOT an obstacle)
- Explains WHY something is dangerous

Additional features:
- Ground plane segmentation for walkable area detection
- Temporal obstacle tracking (detects if obstacles are approaching)
- Multi-position analysis (left/center/right/full-width)
- Approach detection with speed estimation
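The optical-flow collision layer (Layer 3 above) could be sketched as dense Farneback flow between consecutive grayscale frames, with the divergence of the flow field used as a looming/expansion cue; the exact method and thresholds in the app may differ:

```python
# Sketch: positive mean divergence of the dense flow field indicates the scene
# is expanding, i.e. something is approaching the camera.
import cv2
import numpy as np

def flow_expansion(prev_gray: np.ndarray, gray: np.ndarray) -> float:
    """prev_gray/gray: consecutive single-channel uint8 frames."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    dudx = np.gradient(flow[..., 0], axis=1)
    dvdy = np.gradient(flow[..., 1], axis=0)
    return float(np.mean(dudx + dvdy))
```

The expansion value can then be thresholded, or inverted into a crude time-to-collision estimate, to trigger the "approaching object" warning.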
Implements Apple Maps-style AR navigation with:

Visual Features:
- Canvas overlay for drawing an animated floor path to the target
- Large animated chevron arrows (>>>) pointing in the direction of travel
- Glowing green path line from user to target position
- Pulsing target marker with crosshairs
- Animated path dashes that flow toward the target
- Searching animation when the target is not visible

AR Info Display:
- Direction indicator (arrow + text)
- Distance estimation display (~2m, ~5m+, etc.)
- Target name display
- "Real View Navigation" status badge

Smart Path Routing:
- Calculates a bezier curve path from the bottom of the screen to the target
- Routes around detected obstacles (curves left/right)
- Perspective-adjusted to look like a floor path
- Updates in real-time with detection data

Backend Updates:
- Added target_bbox to the navigation status response
- Added obstacle position data for AR path routing

The path automatically:
- Shows when the target is detected
- Hides and shows the searching animation when the target is lost
- Curves around obstacles detected by the 4-layer system
- Updates distance/direction in real-time
Fixes:
- Added 'up' and 'down' direction mappings to the AR display (the backend returns these, but the frontend was missing the handling)
- Added 'down' chevron rotation (180 degrees)
- Added 'unknown' direction handling for the searching state
- Fixed the duplicate stopNavigation override to call arNavPath.stop() (the override was missing AR cleanup, causing the animation to continue)