
# IRIS - Iterative Reasoning with Image Segmentation

A modern web application powered by Meta's SAM 3 (Segment Anything Model 3) and Claude for grounded visual reasoning. IRIS lets an LLM iteratively query a segmentation model to verify visual facts instead of hallucinating them, with an intuitive web interface for interactive analysis.

## Why IRIS vs Standard Vision Models?

**The Problem:** When you send an image directly to Claude Vision or other multimodal LLMs, they can hallucinate counts, positions, and relationships.

**The Solution:** IRIS forces Claude to "show its work" using actual computer vision tools.

### Key Differences

| Scenario | Standard Vision LLM | IRIS (Grounded) |
|---|---|---|
| Counting | "I see approximately 7 people" ❌ Could be wrong | Calls `segment_concept("person")` → `{"count": 5}` ✅ Exact |
| Spatial Reasoning | "The car appears to be in the parking space" ❌ Approximation | Calculates IoU between masks → "89% overlap" ✅ Precise |
| Verification | "Most workers appear to be wearing hard hats" ❌ Uncertain | Segments 8 people, 7 hard hats → "Worker #4 missing hard hat" ✅ Specific |
| Video Tracking | "Crowd density increases" ❌ Vague | Frame 0: 3 people, Frame 5: 12 people, Frame 9: 5 people ✅ Precise temporal data |
| Proof | Text description only | Visual masks + bounding boxes + confidence scores ✅ Verifiable |

### Real-World Example

**Question:** "Are all workers wearing proper PPE?"

**Standard Vision LLM Response:**

```
"I can see several workers. Most appear to be wearing hard hats,
though one in the back may not be."
```

**IRIS Response:**

```
1. segment_concept("person")              → 8 workers detected
2. segment_concept("hard hat")            → 7 hard hats detected
3. analysis_spatial("person", "hard hat") → 7 overlapping pairs
4. Result: Worker #4 at position [245, 180] has no hard hat

"7 out of 8 workers are wearing hard hats. Worker #4 is not compliant."
+ Visual overlay showing worker #4 circled without hard hat detection
```
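The compliance check above boils down to pairing each detected person with an overlapping hard-hat detection and flagging anyone left unpaired. A minimal sketch of that matching step, assuming `[x1, y1, x2, y2]` pixel boxes (the helper names and box format are illustrative, not IRIS's actual internals):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def find_unprotected(people, hats, thresh=0.05):
    """Return indices of person boxes with no overlapping hat box."""
    return [i for i, p in enumerate(people)
            if not any(iou(p, h) > thresh for h in hats)]

people = [[0, 0, 10, 30], [20, 0, 30, 30], [40, 0, 50, 30]]
hats   = [[1, 0, 9, 6],   [41, 0, 49, 6]]   # middle person has no hat
print(find_unprotected(people, hats))
```

A small overlap threshold suffices here, since a hat box covers only a sliver of its wearer's person box.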

### When to Use IRIS

Choose IRIS when you need:

- Exact counts (not "several" or "many")
- Spatial measurements (distance, overlap %, containment)
- Verification with proof (show me where X is detected)
- Video object tracking (how does count change frame-by-frame?)
- ML dataset export (COCO, YOLO annotations for training)
- Audit trails (visual evidence of detections)

Standard vision LLMs are fine for:

- General scene descriptions
- OCR / text reading
- Creative/artistic analysis
- Cases where approximate answers are acceptable

## Features

### 1. Grounded Describing Agent

The LLM receives an image and calls segmentation tools to verify facts:

```python
segment_concept("red traffic light")  # -> 1 instance at (x1, y1, x2, y2)
segment_concept("pedestrian")         # -> 3 instances
segment_concept("crosswalk")          # -> 1 instance

# Claude can now accurately answer: "Is this car running a red light?"
```

### 2. Video Analysis Agent

Claude analyzes video frames with SAM 3 segmentation tools:

```python
# Claude extracts frames and tracks objects across time
segment_concept_in_frame(0, "person")      # -> 5 people in frame 0
segment_concept_all_frames("person")       # -> track count changes over time

# Claude answers: "How does crowd density change throughout the video?"
```

### 3. Mask Visualization

Save segmentation results with visual overlays:

- Semi-transparent colored masks
- Corner-bracket-style bounding boxes
- Indexed labels `[01]`, `[02]`, etc.
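Rendering a semi-transparent mask overlay reduces to alpha-blending the mask color over the underlying pixels. A pure-Python sketch of the blend step (the actual renderer in `src/viz.py` is more involved; the list-of-tuples image format here is just for illustration):

```python
def overlay_mask(image, mask, color, alpha=0.4):
    """Alpha-blend `color` over pixels where `mask` is True.

    image: H x W list of (r, g, b) tuples; mask: H x W list of bools.
    """
    out = []
    for row_px, row_m in zip(image, mask):
        out_row = []
        for px, m in zip(row_px, row_m):
            if m:
                px = tuple(int((1 - alpha) * p + alpha * c)
                           for p, c in zip(px, color))
            out_row.append(px)
        out.append(out_row)
    return out

img = [[(0, 0, 0), (0, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
mask = [[True, False], [False, True]]
result = overlay_mask(img, mask, (255, 128, 0))
print(result)
```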

## Installation

```bash
cd sam3_vision_tools

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows
# source venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

# For RTX 50 series GPUs (Blackwell architecture):
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu130
```

### Environment Variables

Copy `.env.example` to `.env` and fill in your actual values:

```bash
cp .env.example .env
```

Then edit `.env` with your credentials:

```bash
ANTHROPIC_API_KEY=sk-ant-...     # Required for the Claude API
HF_TOKEN=hf_...                  # Required for the gated SAM 3 model
```

### Requirements

- Python 3.9+
- Node.js 18+ (for the web frontend)
- PyTorch 2.9.0+ (for RTX 50 series) or PyTorch 2.0+ (older GPUs)
- CUDA 13.0 (RTX 50 series) or CUDA 12.x (older GPUs)
- ~2 GB disk space for SAM 3 model weights
- ffmpeg (for video processing and downsampling)

## Quick Start

### Web Interface (Recommended)

1. Start the backend server:

   ```bash
   # Using Python directly (development)
   python server.py  # Runs on http://localhost:8000

   # Or using uvicorn (production-ready)
   uvicorn server:app --reload --port 8000
   ```

2. Start the frontend (in a new terminal):

   ```bash
   cd web
   npm install  # First time only
   npm run dev  # Runs on http://localhost:3000
   ```

3. Open http://localhost:3000 in your browser.

### CLI Examples (Alternative)

```bash
# Run the grounded describer demo
python examples.py --demo grounded

# Direct tool usage demo
python examples.py --demo tools
```

### Python API

```python
from src.agents.grounded_describer import GroundedDescriberAgent

agent = GroundedDescriberAgent(model="claude-sonnet-4-5-20250929")

result = agent.analyze(
    image_path="traffic_scene.jpg",
    question="Is this car running a red light?",
    candidate_concepts=["red traffic light", "green traffic light", "car"],
)

print(result["answer"])
```

## Web Interface

IRIS includes a modern web UI built with Next.js 16 and React 19 for interactive visual analysis with real-time feedback.

*Web interface demo*

### Features

- Real-time streaming chat with Claude using Server-Sent Events (SSE)
- Drag-and-drop image/video upload with instant preview
- Live visualization overlay with TensorPoint design system styling (dark theme with orange accents)
- Tool execution panel showing active segmentations and results
- Lightbox view for detailed inspection (click an image to expand, Esc to close)
- Video timeline with formatted timestamps (MM:SS.mmm) and progress indicators
- Performance settings for video processing (frame skip, resolution, processing mode)
- Responsive design with a dark theme optimized for visual analysis

### API Endpoints

The FastAPI backend (`server.py`) exposes the following endpoints:

| Endpoint | Method | Purpose |
|---|---|---|
| `/api/health` | GET | Server status and model-loaded state |
| `/api/upload/image` | POST | Upload an image; returns dimensions and storage URL |
| `/api/upload/video` | POST | Upload a video with configurable processing mode (`frame_extraction` or `whole_video`), frame sampling, and resolution settings |
| `/api/preload` | POST | Preload the SAM 3 model (warm start) with SSE progress updates |
| `/api/chat` | POST | Streaming chat with Claude via SSE (returns events: `tool_call`, `tool_result`, `visualization`, `text`, `done`, `error`) |
| `/api/media/current` | GET | Retrieve the current media as base64 |
| `/visualizations/{file}` | GET | Static file serving of generated mask visualizations |

### SSE Event Types

When streaming chat responses, the following event types are emitted:

- `status` - Connection status updates
- `tool_call` - Claude invoked a tool (includes tool name and input parameters)
- `tool_result` - Tool execution completed (includes result data)
- `visualization` - Mask visualization image generated (includes URL to fetch)
- `frame_visualization` - Per-frame video visualization (for video analysis)
- `text` - Claude response text chunk (streaming)
- `done` - Chat turn completed
- `error` - Error occurred during processing
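On the client side, each SSE event arrives as an `event:` line followed by a `data:` line carrying a JSON payload. A minimal parsing sketch (the payload fields shown are hypothetical, not the server's exact schema):

```python
import json

def parse_sse(stream_text):
    """Yield (event_type, payload) pairs from raw SSE text."""
    event = None
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event:
            yield event, json.loads(line[len("data:"):].strip())
            event = None

raw = (
    "event: tool_call\n"
    'data: {"tool": "segment_concept", "input": {"concept": "person"}}\n\n'
    "event: text\n"
    'data: {"delta": "I found 5 people."}\n\n'
    "event: done\n"
    'data: {}\n\n'
)
events = list(parse_sse(raw))
for etype, payload in events:
    print(etype, payload)
```

A real client would read the HTTP response incrementally rather than from a string, but the field parsing is the same.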

## Architecture

*Architecture diagram*

```
sam3_vision_tools/
├── server.py              # FastAPI backend with SSE streaming (uvicorn)
├── examples.py            # CLI demos (grounded describer, tool usage)
├── src/
│   ├── __init__.py
│   ├── sam3_engine.py     # Core SAM 3 wrapper (image & video models)
│   ├── claude_tools.py    # Tool definitions for Claude
│   ├── viz.py             # Mask visualization system (TensorPoint design)
│   ├── video_utils.py     # Video trimming and metadata utilities
│   └── agents/
│       └── grounded_describer.py  # Grounded visual Q&A agent
├── web/                   # Next.js 16 frontend (React 19)
│   ├── src/
│   │   ├── app/           # Next.js app router pages
│   │   │   ├── layout.tsx
│   │   │   └── page.tsx
│   │   ├── components/    # React components
│   │   │   ├── chat-panel.tsx
│   │   │   ├── preview-panel.tsx
│   │   │   ├── tools-panel.tsx
│   │   │   ├── settings-modal.tsx  # Performance settings for video
│   │   │   ├── media-upload.tsx
│   │   │   └── ui/        # shadcn/ui components
│   │   ├── contexts/      # React contexts
│   │   │   └── settings-context.tsx
│   │   └── lib/           # API client and utilities
│   │       └── api.ts
│   ├── package.json
│   └── tailwind.config.ts
└── requirements.txt
```

## Available Tools

| Tool | Description |
|---|---|
| `segment_concept` | Segment all instances of a text-described concept |
| `segment_multiple_concepts` | Segment multiple concepts in one call |
| `segment_with_box` | Segment using a bounding-box constraint |
| `segment_with_point` | Segment the object at a specific point |
| `compute_mask_overlap` | Compare two segmentation results (IoU) |
| `get_image_dimensions` | Get image width/height |

### Video Tools

| Tool | Description |
|---|---|
| `segment_concept_in_frame` | Segment a concept in a specific frame, with timestamp |
| `segment_concept_all_frames` | Track a concept across all frames with a temporal summary. Supports both frame-extraction mode (sampled frames) and whole-video mode (native `SAM3VideoModel` tracking with temporal consistency) |
| `get_video_info` | Get frame count, timestamps, dimensions, duration |

### Analysis Tools

| Tool | Description |
|---|---|
| `analysis_summarize` | Generate a comprehensive segmentation summary with statistics (confidence breakdown, size distribution, spatial clustering) |
| `analysis_spatial` | Analyze spatial relationships between concepts (overlapping, nearby, and far pairs with IoU and distance metrics) |
| `analysis_compare_concepts` | Compare multiple concepts by count, total_area, or avg_confidence, with ranking |
| `export_dataset` | Export segmentation annotations in COCO JSON, YOLO txt, or Pascal VOC XML format for ML training |
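As a reference point for the YOLO txt format: each object becomes one line, `class cx cy w h`, with the box center and size normalized by the image dimensions. A conversion sketch from an assumed `[x1, y1, x2, y2]` pixel box (not IRIS's actual exporter):

```python
def box_to_yolo(box, img_w, img_h, class_id=0):
    """Convert an [x1, y1, x2, y2] pixel box to a YOLO annotation line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # normalized center x
    cy = (y1 + y2) / 2 / img_h   # normalized center y
    w = (x2 - x1) / img_w        # normalized width
    h = (y2 - y1) / img_h        # normalized height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(box_to_yolo([100, 50, 300, 250], img_w=640, img_h=480))
```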

### Video Analysis Tools

| Tool | Description |
|---|---|
| `video_track_changes` | Temporal change detection: compare specific frames, analyze count timelines, track object entry/exit events |
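Entry/exit detection over a count timeline can be sketched as diffing consecutive per-frame counts (a simplification of what `video_track_changes` reports; the input shape is an assumption):

```python
def count_events(counts):
    """Detect entry/exit events from a list of per-frame object counts."""
    events = []
    for frame in range(1, len(counts)):
        delta = counts[frame] - counts[frame - 1]
        if delta > 0:
            events.append((frame, f"+{delta} entered"))
        elif delta < 0:
            events.append((frame, f"{delta} exited"))
    return events

print(count_events([3, 3, 5, 5, 2]))
```

Note this detects net count changes only; one object leaving while another enters in the same interval cancels out, which is why native `SAM3VideoModel` tracking with per-object identities is more reliable.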

## Running Examples

```bash
# Grounded describer demo (interactive Q&A with segmentation)
python examples.py --demo grounded

# Direct tool usage demo (programmatic API)
python examples.py --demo tools
```

For advanced usage, see `examples.py`, which demonstrates:

- Using the `GroundedDescriberAgent` for visual Q&A
- Direct `SAM3Engine` API calls
- Tool integration with Claude

## GPU Support

Tested on:

- NVIDIA RTX 5070 (Blackwell, sm_120) - PyTorch 2.9.0 + CUDA 13.0
- NVIDIA RTX 30/40 series - PyTorch 2.0+ with CUDA 12.x

## Performance Tips

1. **GPU acceleration**: SAM 3 runs much faster on a GPU (CUDA recommended)
2. **Batch concepts**: Use `segment_multiple_concepts` for efficiency
3. **Caching**: Segmentation results are cached per session to avoid redundant computation
4. **Video processing modes**:
   - Frame extraction mode: samples N frames evenly (default: 15/min, configurable)
   - Whole-video mode: uses `SAM3VideoModel` for temporal tracking with frame skip (default: 2x speedup)
5. **Video resolution**: Downsample to 720p or 480p for faster processing (configurable in settings)
6. **Context management**: Message history is automatically truncated to the last 20 messages to prevent unbounded context growth
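Per-session caching (tip 3) can be modeled as memoizing on a (media, concept) key, so repeated `segment_concept` calls for the same concept return instantly. A toy sketch with a stand-in segmentation function (the key shape and result format are assumptions, not IRIS's internals):

```python
class SegmentationCache:
    """Memoize segmentation results per (media_id, concept) within a session."""
    def __init__(self, segment_fn):
        self.segment_fn = segment_fn
        self.store = {}
        self.hits = 0

    def segment(self, media_id, concept):
        key = (media_id, concept)
        if key in self.store:
            self.hits += 1          # served from cache, no model call
        else:
            self.store[key] = self.segment_fn(media_id, concept)
        return self.store[key]

# Stand-in for a real SAM 3 call; records how often it actually runs.
calls = []
def fake_segment(media_id, concept):
    calls.append(concept)
    return {"concept": concept, "count": 2}

cache = SegmentationCache(fake_segment)
cache.segment("img1", "person")
cache.segment("img1", "person")   # second call hits the cache
print(len(calls), cache.hits)
```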

## License

- SAM 3 model (Meta AI license)
- Claude API (Anthropic terms of service)
