Claude/add cpu mps support p6 mzo #375

Conversation
This change enables SAM3 to run on Mac M4 and other non-CUDA systems by:
- Creating a device utility module (sam3/utils/device.py) for automatic device detection with priority CUDA > MPS > CPU (see the sketch after this list)
- Adding PyTorch-based fallbacks for Triton kernels:
  - sigmoid_focal_loss.py: pure PyTorch implementation for CPU/MPS
  - edt.py: SciPy-based EDT implementation for CPU/MPS
- Updating device detection in model_builder.py to auto-detect the best available device instead of assuming CUDA
- Replacing hardcoded .cuda() calls with device-agnostic .to(device) throughout the codebase:
  - io_utils.py: video/image loading now respects the selected device
  - sam3_tracker_base.py: memory features use the correct device
  - sam3_tracking_predictor.py: image inference uses the inference state's device
  - sam3_video_predictor.py: model initialization uses get_device()
- Adding MPS-aware fallbacks in perflib:
  - nms.py: falls back to the CPU implementation for MPS
  - connected_components.py: falls back to the CPU implementation for MPS
- Fixing CUDA-specific backend calls in transformer.py to only run on CUDA devices

Note: Distributed training features (NCCL backend) still require CUDA, as that is an inherent limitation of NCCL.
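A minimal sketch of the CUDA > MPS > CPU selection described in the first bullet; the actual sam3/utils/device.py may expose more helpers and options:

```python
# Sketch only: device selection with priority CUDA > MPS > CPU.
import torch

def get_device() -> torch.device:
    """Pick the best available device."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def get_device_str() -> str:
    """String form, convenient for configs and logging."""
    return str(get_device())
```

Callers can then write `model.to(get_device())` instead of hardcoding `model.cuda()`.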
Test coverage includes:
- Device utility module functions
- Sigmoid focal loss on CPU/MPS
- EDT (Euclidean Distance Transform) on CPU/MPS
- NMS on CPU/MPS
- Connected components on CPU/MPS
- Transformer attention modules on CPU
- Model builder device parameter handling

MPS tests are automatically skipped when MPS is not available.
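The MPS skip behaviour could look roughly like the pytest pattern below; the test name and tensor shapes are illustrative, not the PR's actual tests:

```python
# Sketch: skip MPS-only tests on machines without an MPS backend.
import pytest
import torch

requires_mps = pytest.mark.skipif(
    not torch.backends.mps.is_available(),
    reason="MPS backend not available on this machine",
)

@requires_mps
def test_tensor_lands_on_mps():
    x = torch.randn(4, 3, device="mps")
    assert x.device.type == "mps"
```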
Adds a comprehensive example script for real-time camera segmentation using SAM3. Features include:
- Auto-detection mode for automatic object segmentation
- Interactive point-based prompting (left/right click)
- Multi-device support (CUDA, MPS, CPU)
- FPS tracking and display overlay
- Frame saving and pause functionality
Move the decord import inside the video-loading conditional block so it is only imported when MP4 files are actually being loaded. This prevents import errors when decord is not installed but video loading is not needed.
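A hedged sketch of the lazy-import pattern; the function name load_video_frames and the error handling are illustrative, not the real io_utils.py code:

```python
# Sketch: defer the decord import until an MP4 actually needs decoding,
# so environments without decord can still use image-folder loading.
def load_video_frames(path: str):
    if path.lower().endswith(".mp4"):
        import decord  # only imported on this branch
        return decord.VideoReader(path)
    raise ValueError(f"Unsupported video format: {path}")
```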
Hi @eleviidev! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA Signed.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
- position_encoding.py: Use get_device() for precomputed position encodings
- decoder.py: Use get_device() for coordinate cache initialization
- vl_combiner.py: Default device to None, use get_device_str() at runtime
- sam3_image_processor.py: Default device to None, use get_device_str()
Rewrote the live camera segmentation script to use the correct SAM3 inference API via Sam3Processor instead of calling the model directly. Key changes:
- Use Sam3Processor.set_image() to process frames
- Use Sam3Processor.set_text_prompt() for text-based detection
- Use Sam3Processor.add_geometric_prompt() for interactive box prompts
- Results accessed via the state dict (masks, boxes, scores)
PyTorch's grid_sample has bugs on MPS with certain tensor configurations. Added _grid_sample_mps_safe() that falls back to CPU for MPS devices.
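A sketch of what such a wrapper might look like, assuming the real _grid_sample_mps_safe() simply round-trips through CPU for MPS tensors; extra arguments and edge cases are omitted:

```python
# Sketch: run grid_sample on CPU when the input lives on MPS, then move back.
import torch
import torch.nn.functional as F

def _grid_sample_mps_safe(inp: torch.Tensor, grid: torch.Tensor, **kwargs) -> torch.Tensor:
    if inp.device.type == "mps":
        out = F.grid_sample(inp.cpu(), grid.cpu(), **kwargs)
        return out.to(inp.device)
    return F.grid_sample(inp, grid, **kwargs)
```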
pin_memory() is a CUDA-specific optimization that doesn't work on MPS. Added device type checks to skip pin_memory() on non-CUDA devices.

Files fixed:
- geometry_encoders.py
- sam3_video_inference.py
- sam3_tracker_base.py
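The guard presumably looks something like this; the helper name maybe_pin is hypothetical, and in the PR the check is applied inline at each call site:

```python
# Sketch: pin_memory() only speeds up CPU -> CUDA transfers, so skip it otherwise.
import torch

def maybe_pin(t: torch.Tensor, target_device: torch.device) -> torch.Tensor:
    if target_device.type == "cuda":
        return t.pin_memory()
    return t  # MPS / CPU: pinning is unsupported or pointless
```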
torch._assert_async is not implemented for MPS devices. Use a regular assert on MPS as a fallback.

Files fixed:
- geometry_encoders.py
- sam3_image.py
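A minimal sketch of the fallback, assuming the check is keyed on the tensor's device type; the helper name assert_nonzero is illustrative and the exact call sites differ per file:

```python
# Sketch: synchronous Python assert on MPS, async device-side assert elsewhere.
import torch

def assert_nonzero(cond: torch.Tensor, message: str) -> None:
    if cond.device.type == "mps":
        # torch._assert_async is not implemented for MPS; .item() forces a sync.
        assert bool(cond.item()), message
    else:
        torch._assert_async(cond)
```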
Added command-line options to improve performance on MPS/CPU:
- --skip-frames N: only process every Nth frame (default: 1)
- --resolution N: lower the model resolution (default: 1008, try 512/768)

These options help achieve usable frame rates on Apple Silicon.
The model's precomputed positional encodings (freqs_cis) are sized for the 1008 resolution, so other resolutions cause shape mismatches. Use --skip-frames for performance improvement instead.
Added --half flag to convert model to float16, which can speed up inference on Apple Silicon by reducing memory bandwidth requirements.
Sam3Processor now automatically converts input images to match the model's dtype (float16 or float32), enabling half precision inference.
Match boxes dtype to img_feats dtype in roi_align call to support half precision inference.
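A hedged sketch of the dtype fix around torchvision's roi_align; the surrounding function and parameter values are illustrative:

```python
# Sketch: under --half the features are float16 but the boxes may still be
# float32; roi_align requires matching dtypes, so cast the boxes.
import torch
from torchvision.ops import roi_align

def pooled_features(img_feats: torch.Tensor, boxes: torch.Tensor,
                    output_size=(7, 7), spatial_scale=1.0) -> torch.Tensor:
    boxes = boxes.to(dtype=img_feats.dtype)
    return roi_align(img_feats, [boxes], output_size=output_size,
                     spatial_scale=spatial_scale)
```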
Metal Performance Shaders fails on mixed-dtype matrix multiplication, so half precision currently only works on CUDA, not MPS.
- Added --track flag to enable memory-based mask propagation between frames
- Fixed Sam3TrackerPredictor for MPS compatibility (autocast, storage device)
- When tracking is enabled, masks follow objects between full inference frames
- This allows higher frame rates while maintaining visual continuity
- Labels and confidence scores now displayed on each detected object mask
- Added info panel on the right side showing the list of detected objects
- Panel shows object label, confidence score (color-coded), and size
- Confidence scores are stored and tracked between frames
Users can now specify comma-separated prompts (e.g., --prompt "person, car, dog") to detect multiple object types simultaneously. Each detection is labeled with its corresponding prompt name in both the mask overlay and the info panel.
Features:
- Live video streaming with segmentation overlay
- Multi-prompt detection configuration via web UI
- Object count limits with show/hide toggle for each prompt type
- Verbose mode showing tracking, frame count, queue size
- Claude Vision API integration for detailed object analysis
- Command center style dark theme interface
- Real-time system log display
- Confidence threshold and skip-frames controls

Usage: python examples/web_command_center/app.py --prompt "person, car"
- Add setup_device_optimizations() with MPS memory management
- Add mps_synchronize() for explicit GPU synchronization
- Add empty_cache() for both CUDA and MPS memory cleanup (see the sketch after this list)
- Enable device optimizations in live camera and web command center

These optimizations help improve performance on Apple Silicon (M1/M2/M3/M4) by better utilizing the Metal GPU backend.
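A minimal sketch of what the cross-backend helpers might look like; the names follow the commit message, but the real implementations may differ:

```python
# Sketch: memory/synchronization helpers that cover both CUDA and MPS.
import torch

def empty_cache() -> None:
    """Release cached allocator memory on whichever backend is active."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()

def mps_synchronize() -> None:
    """Explicitly wait for queued Metal kernels to finish (no-op elsewhere)."""
    if torch.backends.mps.is_available():
        torch.mps.synchronize()
```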
New features added to the web command center:
- Memory Tracking: store mask history for object re-identification
- Persistent Object IDs: stable IDs across frames using IoU matching (see the sketch after this list)
- Fill Holes: morphological hole filling in masks
- Smooth Edges: edge smoothing with configurable kernel
- Non-Overlapping Masks: prevent mask overlaps (higher confidence wins)
- Boundary Suppression: ignore detections near frame edges
- Occlusion Suppression: remove heavily overlapped detections
- Hotstart Mode: require N frames before confirming a detection

All features have UI toggles in the Features tab with configurable parameters.
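A hedged sketch of IoU-based ID matching as described in the Persistent Object IDs bullet; the real tracker keeps more state (confidence, history, hotstart counters), and the helper names here are illustrative:

```python
# Sketch: give each new mask the ID of the best-overlapping previous mask,
# or a fresh ID if nothing overlaps enough. (Simplified: no uniqueness pass.)
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def match_ids(prev: dict[int, np.ndarray], new_masks: list[np.ndarray],
              next_id: int, iou_thresh: float = 0.5) -> list[int]:
    ids = []
    for mask in new_masks:
        best_id, best_iou = None, iou_thresh
        for obj_id, prev_mask in prev.items():
            iou = mask_iou(mask, prev_mask)
            if iou > best_iou:
                best_id, best_iou = obj_id, iou
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        ids.append(best_id)
    return ids
```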
- Add comprehensive SAM3-to-COCO label mapping (80+ entries) for YOLO compatibility
- Integrate YOLOv12 classification on SAM3 detected regions
- Add YOLOv12 pose estimation for person-like detections (17 keypoints)
- Implement skeleton/keypoint overlay visualization with configurable colors
- Add feature toggles: YOLO classify, pose, skeleton, keypoint labels, label spoofing
- Add configurable parameters: thresholds, keypoint radius, skeleton thickness
- Update UI with YOLO Integration section in Features tab
Backend:
- Add voice search state tracking to CommandCenter
- Add Claude-powered voice query parsing (parse_voice_query_with_claude)
- Handle natural language queries like "help me find a red car"
- Support multiple object detection from a single voice command
- Add /api/voice_search endpoint for voice query processing
- Add /api/voice_feedback and /api/toggle_voice endpoints

Frontend:
- Add Voice Search tab with large microphone button
- Add microphone button next to prompts input for quick access
- Implement Web Speech API for voice recognition
- Add real-time transcription feedback during listening
- Add TTS (Text-to-Speech) section with voice selection
- Add voice search history with click-to-reapply
- Visual feedback: listening (red pulse), processing (yellow)
- Speak search confirmations when TTS enabled
Backend:
- Add camera detection function with platform-specific camera naming
- Add camera settings to CommandCenter (current_camera_id, flip states)
- Add switch_camera() and reset_detection_state() functions
- Apply flip transformations in generate_frames() before processing
- Add /api/cameras endpoint to list available cameras
- Add /api/switch_camera endpoint to change cameras
- Add /api/flip_camera and /api/set_flip endpoints for mirror control
- Detect available cameras at startup

Frontend:
- Add Camera tab with camera selection dropdown
- Add refresh button to re-scan available cameras
- Add Flip Horizontal and Flip Vertical buttons with visual feedback
- Show current camera info and flip status
- Add camera tips section
- Style camera controls with consistent theme
- Add helper functions for mask validation and bounding box extraction
- Update detections during optical flow tracking frames instead of only on keyframes
- Remove objects only when they actually leave the frame (mask becomes invalid)
- Build the new detections list atomically on keyframes to avoid race conditions
- Keep tracked objects visible between SAM3 keyframe refreshes
- Support loading the API key from a .env file via python-dotenv
- Add --api-key command line argument option
- Pass api_key explicitly to the Anthropic client
- Add helpful error messages when the API key is missing
- Show a warning at startup if no API key is configured
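A sketch of the key-resolution order described above (CLI flag first, then .env / environment variable); build_client and its argument are hypothetical names:

```python
# Sketch: resolve the Anthropic API key from --api-key, then .env / environment.
import os
from dotenv import load_dotenv
from anthropic import Anthropic

def build_client(cli_key: str | None = None) -> Anthropic:
    load_dotenv()  # pulls ANTHROPIC_API_KEY from a local .env file if present
    api_key = cli_key or os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise SystemExit("No API key found: pass --api-key or set ANTHROPIC_API_KEY")
    return Anthropic(api_key=api_key)
```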
- YOLO12 doesn't exist - changed to YOLO11 (yolo11n-cls.pt, yolo11n-pose.pt)
- Added fallback to YOLOv8 models if YOLO11 fails
- Models auto-download on first use from the ultralytics hub
- YOLO12 released Feb 2025, so try it first for classification and pose
- Falls back to YOLO11, then YOLOv8, if pretrained weights are not available
- Cleaner loop-based fallback logic with logging for each attempt
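The loop-based fallback could look roughly like this; the weight filenames follow the commit messages but are not verified against the code:

```python
# Sketch: try newer YOLO classification weights first, fall back to older ones.
import logging
from ultralytics import YOLO

def load_yolo_classifier() -> YOLO | None:
    for weights in ("yolo12n-cls.pt", "yolo11n-cls.pt", "yolov8n-cls.pt"):
        try:
            logging.info("Trying YOLO weights: %s", weights)
            return YOLO(weights)  # auto-downloads from the ultralytics hub
        except Exception as exc:
            logging.warning("Failed to load %s: %s", weights, exc)
    return None  # YOLO features disabled if nothing loads
```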
- Add current_raw_frame to store the frame before overlay processing
- Use the raw frame for Claude object analysis instead of the display frame
- Fixes an issue where Claude was seeing blue mask overlays in the analysis
Features:
- Mask-based cropping: use the SAM3 mask to crop objects cleanly for Claude analysis (white background outside the mask)
- Describe Scene button: analyze the entire frame with Claude
- Voice commands: support "describe scene", "describe object X", "analyze the first object", etc.
- HTTPS support: use the --https flag to enable SSL for microphone access
  - Auto-generates self-signed certificates if cryptography is installed
  - Or provide --ssl-cert and --ssl-key for custom certificates

UI changes:
- Added "Describe Scene" button in the Claude Analysis panel
- Pass mask_index when analyzing objects for better cropping
- Handle describe actions in voice command processing
Reference Image Search:
- Upload a reference image to find similar objects
- "Search by Description" mode: Claude describes the image, SAM3 finds it
- "Search by Visual Match" mode: CLIP compares embeddings for similarity
- CLIP model loading with the transformers library

Draw to Search:
- Draw Box: click and drag on the video to select an object region
- Click Point: click on an object to segment it
- Canvas overlay for drawing interaction
- Sends geometric prompts (box/point) to SAM3

New API endpoints:
- /api/upload_reference - Upload a reference image with mode selection
- /api/clear_reference - Clear the reference search
- /api/reference_status - Get reference search status
- /api/draw_prompt - Send a box/point geometric prompt
- /api/clear_draw_prompt - Clear pending prompts

UI changes:
- New "Reference Search" tab with upload and draw controls
- Canvas overlay on video for draw-to-search
- Visual match threshold slider
Features:
- Full navigation UI overlay with directional arrows and distance indicators
- Voice guidance with TTS and proximity beep sounds (frequency changes with distance)
- "Navigate" button on each detected object
- Location memory system - remembers where objects were found
- Claude scene analysis for obstacle detection and location context
- HTTPS is now the default mode for microphone/camera access
- Visual distance ring that pulses when the object is reachable
- Success sound/announcement when the object is reached
- Auto-stop navigation after reaching the target
SQLite Database:
- Full database schema for sessions, detections, analysis, navigation, obstacles
- Migrated location memory from JSON to SQLite
- History APIs for detections, analysis, and navigation
- Session statistics tracking

Obstacle Detection During Navigation:
- SAM3-based obstacle detection running in parallel during navigation
- Predefined obstacle prompts (stairs, edges, furniture, doors, etc.)
- Severity levels (high/medium/low) with color-coded masks
- Distance estimation based on object size in frame
- Visual overlays with warning triangles and labels
- Audio alerts with different beep patterns per severity
- TTS announcements for obstacle warnings
- Cooldown system to prevent alert spam

Post-Navigation Dialog:
- Shows a dialog when navigation ends asking the user to continue or pause
- TTS-enabled for accessibility
- Remembers if the target was reached

Other improvements:
- Session ID tracking for all database operations
- Event logging for navigation start/stop
- Obstacle history saved to database
This is a much smarter approach to obstacle detection.

Before (Static List):
- Used a hardcoded list of "obstacle" words (stairs, chair, table, etc.)
- Would incorrectly flag the user's target as an obstacle
- No understanding of spatial relationships
- No context about what's actually in the path

After (Claude AI):
- Claude analyzes the scene with context about the navigation target
- Understands the target is NOT an obstacle (won't flag it)
- Identifies only objects that are physically in the path to the target
- Provides spatial context (left, right, center, floor, ahead)
- Explains WHY something is an obstacle (reason field)
- Suggests a safe direction when obstacles are present
- Understands environment type (room, hallway, outdoor, etc.)

Technical changes:
- Added analyze_obstacles_with_claude() for intelligent analysis
- Claude returns: environment, path_clear, obstacles[], safe_direction
- Rate-limited to avoid excessive API calls (3 second cache; see the sketch after this list)
- Updated overlay to use position-based regions instead of masks
- Shows a "PATH CLEAR" indicator when Claude confirms a safe path
- Enhanced UI alerts with position and reason context
- Different visual styles for high/medium/low severity
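The 3-second cache mentioned under "Technical changes" might be implemented along these lines; the wrapper below is a sketch that takes the PR's analyze_obstacles_with_claude() (or any analysis function) as a parameter rather than reproducing its actual code:

```python
# Sketch: reuse the last Claude analysis for min_interval seconds to limit API calls.
import time

_last_result: dict | None = None
_last_call: float = 0.0

def analyze_obstacles_cached(frame, target: str, analyze_fn,
                             min_interval: float = 3.0) -> dict | None:
    global _last_result, _last_call
    now = time.monotonic()
    if _last_result is not None and now - _last_call < min_interval:
        return _last_result            # serve the cached analysis
    _last_result = analyze_fn(frame, target)  # expensive Claude API call
    _last_call = now
    return _last_result
```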
This implements a two-layer detection system like the robot obstacle avoidance project, but enhanced with AI understanding:
Layer 1 - OpenCV Real-Time (every frame; see the code sketch after this list):
- Bilateral filtering to reduce noise while preserving edges
- Canny edge detection to find object boundaries
- Contour detection to identify obstacle shapes
- Region-based analysis (left/center/right/floor paths)
- Edge density calculation for proximity estimation
- Floor clearance analysis for trip hazards
Layer 2 - Claude AI (every 3 seconds):
- Contextual understanding of what obstacles are
- Knows the navigation target is NOT an obstacle
- Explains WHY something is dangerous
- Suggests safe direction to move
How they work together:
- OpenCV: "There's something in front of you!" (immediate, ~20ms)
- Claude: "It's a glass coffee table between you and the mug,
move right to avoid it" (smart, ~1-2s)
Visual overlay improvements:
- OpenCV detections: dashed bounding boxes with [CV] label
- Claude detections: solid regions with reason text
- Shows "PATH CLEAR" when safe or "Go: [direction]" for guidance
- Floor analysis suggests clearest path (left/center/right)
Proximity estimation:
- Position in frame (lower = closer)
- Edge density (higher = larger/closer object)
- Floor uniformity (uniform = clear, edges = obstacles)
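A rough sketch of the Layer 1 OpenCV pipeline above (bilateral filter, Canny edges, per-region edge density); the thresholds and the simple left/center/right split are illustrative assumptions, not the app's actual values:

```python
# Sketch: edge-density screening of the frame to flag obstructed regions.
import cv2
import numpy as np

def opencv_obstacle_regions(frame_bgr: np.ndarray,
                            density_thresh: float = 0.08) -> dict[str, bool]:
    """Return which thirds of the frame (left/center/right) look obstructed."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    smooth = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
    edges = cv2.Canny(smooth, 50, 150)

    h, w = edges.shape
    thirds = {
        "left": edges[:, : w // 3],
        "center": edges[:, w // 3 : 2 * w // 3],
        "right": edges[:, 2 * w // 3 :],
    }
    # Higher edge density in a region suggests a larger / closer object there.
    return {name: (region > 0).mean() > density_thresh
            for name, region in thirds.items()}
```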
Implements advanced obstacle detection using only a single RGB camera:

Layer 1: OpenCV Edge Detection (every frame, ~20ms)
- Canny edges, contours, bilateral filtering
- Immediate response for sudden obstacles

Layer 2: AI Depth Estimation (MiDaS/Depth Anything)
- LIDAR-like depth perception from a single camera
- Actual distance measurement, not just presence detection
- Tries Depth Anything (2024 SOTA) first, falls back to MiDaS

Layer 3: Optical Flow Collision Detection (see the sketch after this list)
- Biomimetic technique (how insects detect collisions)
- Detects approaching objects via motion expansion
- Estimates time-to-collision (TTC)

Layer 4: Claude AI Analysis (every 3 seconds)
- Semantic understanding of obstacles
- Context-aware (knows the target is NOT an obstacle)
- Explains WHY something is dangerous

Additional features:
- Ground plane segmentation for walkable area detection
- Temporal obstacle tracking (detects if obstacles are approaching)
- Multi-position analysis (left/center/right/full-width)
- Approach detection with speed estimation
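The optical-flow collision layer (Layer 3 above) could be sketched as dense Farneback flow between consecutive grayscale frames, with the divergence of the flow field used as a looming/expansion cue; the exact method and thresholds in the app may differ:

```python
# Sketch: positive mean divergence of the dense flow field indicates the scene
# is expanding, i.e. something is approaching the camera.
import cv2
import numpy as np

def flow_expansion(prev_gray: np.ndarray, gray: np.ndarray) -> float:
    """prev_gray/gray: consecutive single-channel uint8 frames."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    dudx = np.gradient(flow[..., 0], axis=1)
    dvdy = np.gradient(flow[..., 1], axis=0)
    return float(np.mean(dudx + dvdy))
```

The expansion value can then be thresholded, or inverted into a crude time-to-collision estimate, to trigger the "approaching object" warning.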
Implements Apple Maps-style AR navigation with:

Visual Features:
- Canvas overlay for drawing an animated floor path to the target
- Large animated chevron arrows (>>>) pointing in the direction of travel
- Glowing green path line from user to target position
- Pulsing target marker with crosshairs
- Animated path dashes that flow toward the target
- Searching animation when the target is not visible

AR Info Display:
- Direction indicator (arrow + text)
- Distance estimation display (~2m, ~5m+, etc.)
- Target name display
- "Real View Navigation" status badge

Smart Path Routing:
- Calculates a bezier curve path from the bottom of the screen to the target
- Routes around detected obstacles (curves left/right)
- Perspective-adjusted to look like a floor path
- Updates in real-time with detection data

Backend Updates:
- Added target_bbox to the navigation status response
- Added obstacle position data for AR path routing

The path automatically:
- Shows when the target is detected
- Hides and shows the searching animation when the target is lost
- Curves around obstacles detected by the 4-layer system
- Updates distance/direction in real-time
Fixes:
- Added 'up' and 'down' direction mappings to the AR display (the backend returns these, but the frontend was missing the handling)
- Added 'down' chevron rotation (180 degrees)
- Added 'unknown' direction handling for the searching state
- Fixed the duplicate stopNavigation override to call arNavPath.stop() (the override was missing AR cleanup, causing the animation to continue)