4 changes: 2 additions & 2 deletions .github/copilot-instructions.md
@@ -122,7 +122,7 @@ Built-in Tools (zoom, crop) + Provider Tools (detect, classify, etc.)
- **Error modes**: retry (default), skip, fail
- **Observability**: Full tracing and metrics integration

**Testing:** 336 comprehensive tests in `test_tool_schema.py`, `test_tool_registry.py`, `test_agent_loop.py`, `test_tool_prompts.py`, `test_tool_call_parser.py`, etc. All passing with zero regressions.
**Testing:** 342 comprehensive tests in `test_tool_schema.py`, `test_tool_registry.py`, `test_agent_loop.py`, `test_tool_prompts.py`, `test_tool_call_parser.py`, etc. All passing with zero regressions.

**Documentation:** See `docs/VLM_TOOL_CALLING_SUMMARY.md` for complete architecture details, design decisions, limitations, and future roadmap.

@@ -152,7 +152,7 @@ pytest tests/test_track_node.py -v # Track graph node (39 tests)

# VLM tool-calling test suites (v1.7.0)
pytest tests/test_tool_schema.py -v # Tool schema (33 tests)
pytest tests/test_tool_registry.py -v # Tool registry (44 tests)
pytest tests/test_tool_registry.py -v # Tool registry (49 tests)
pytest tests/test_agent_loop.py -v # Agent loop (51 tests)
pytest tests/test_tool_prompts.py -v # Tool prompts (18 tests)
pytest tests/test_tool_call_parser.py -v # Tool call parser (51 tests)
2 changes: 2 additions & 0 deletions .gitignore
@@ -167,6 +167,8 @@ runs/
# Dev Test Artifacts
examples/inference/outputs/
bm_test/
example_test.sh
example_test.ps1

# Auto Claude data directory
.auto-claude/
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,26 @@ Versions follow [Semantic Versioning](https://semver.org/).

---

## [1.9.1] - 2026-03-08

### Changed

- Refactored graph flow notation from `→` to `>` in all examples, scripts, and documentation for consistency with the DSL operator syntax
- Updated expected output structure descriptions in examples and docs to match the new `>` notation

### Added

- `ToolRegistry` now requires `text_prompts` for zero-shot providers (GroundingDINO, OWL-ViT, CLIP) and raises `ValueError` when they are missing
- Improved tool schema generation: zero-shot providers automatically include a `text_prompts` parameter in their generated `ToolSchema`
- Tests for zero-shot provider detection and `text_prompts` schema requirement in `test_tool_registry.py`
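
The new contract can be sketched as follows — the function and marker names here are illustrative stand-ins, not MATA's actual internals:

```python
# Hypothetical sketch of the zero-shot text_prompts requirement described above;
# names are illustrative, not MATA's real API.
ZERO_SHOT_MARKER = "ZeroShot"

def check_text_prompts(provider_class_name: str, text_prompts) -> None:
    """Raise ValueError when a zero-shot provider is used without the
    text prompts it needs to ground detection."""
    if ZERO_SHOT_MARKER in provider_class_name and not text_prompts:
        raise ValueError(
            f"{provider_class_name} is a zero-shot provider and requires "
            "non-empty text_prompts (e.g. ['cat', 'dog'])."
        )
```

Supervised detectors pass through unchanged; only class names carrying the zero-shot marker trigger the check.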

### Fixed

- SAM adapter: minor issue where prompt-less calls could silently produce empty masks instead of raising a clear error
- Video tracking examples: corrected frame iteration and output path handling in `examples/track/`

---

## [1.9.0] - 2026-03-02

### Added
70 changes: 3 additions & 67 deletions README.md
@@ -1171,72 +1171,7 @@ export MATA_CONFIG=/path/to/config.json

## 🛣️ Roadmap

### ✅ Completed (v1.9.0 - Current)

#### **OCR / Text Extraction** — Five backends, graph nodes, evaluation pipeline

- ✅ **Five OCR backends**: EasyOCR (80+ languages), PaddleOCR (multilingual), Tesseract (classic), GOT-OCR2 (HuggingFace end-to-end), TrOCR (HuggingFace line-level)
- ✅ **`mata.run("ocr", ...)` API**: Unified entry point — `model=` selects backend or HuggingFace ID
- ✅ **`mata.load("ocr", ...)` API**: Returns persistent adapter for repeated inference
- ✅ **`OCRResult` type**: `.full_text`, `.regions` (list of `TextRegion` with bbox + score + text)
- ✅ **Multi-format export**: `.save("out.txt")`, `.save("out.csv")`, `.save("out.json")`, `.save("overlay.png")`
- ✅ **`filter_by_score()`**: Confidence-threshold filtering on OCR results
- ✅ **`OCRText` graph artifact**: Strongly-typed artifact for graph pipelines
- ✅ **`OCR` graph node**: Accepts `Image` or `ROIs` input, aggregates per-crop results with `instance_id` correlation
- ✅ **`ExtractROIs` graph node**: Crops detection regions for downstream OCR
- ✅ **`OCRWrapper`**: Protocol-based capability wrapper enabling OCR as a graph provider
- ✅ **VLM tool integration**: `"ocr"` registered in `ToolRegistry` and `TASK_SCHEMA_DEFAULTS` for agent mode
- ✅ **UniversalLoader routing**: Bare engine names (`"easyocr"`, `"paddleocr"`, `"tesseract"`) routed via `_EXTERNAL_OCR_ENGINES`; HuggingFace OCR IDs routed through `_load_from_huggingface()`
- ✅ **Optional dependencies**: EasyOCR, PaddleOCR, Tesseract declared as optional extras in `pyproject.toml`
- ✅ **71 evaluation tests**: `test_eval_ocr.py` — all passing, zero regressions against 4307+ total
- ⏳ **`mata.val("ocr", ...)` evaluation**: `OCRMetrics` (word accuracy, character accuracy, precision, recall, F1) with COCO-Text JSON dataset loader | PENDING for v1.9.1 release due to dataset licensing review.

### ✅ Completed (v1.8)

#### **Object Tracking** — ByteTrack + BotSort

- ✅ **Vendored ByteTrack**: Zero-dependency implementation in `src/mata/trackers/` (no yolox/ultralytics)
- ✅ **Vendored BotSort**: IoU + Global Motion Compensation (GMC via sparse optical flow)
- ✅ **`mata.track()` API**: One-liner video/stream/webcam/image-dir tracking
- ✅ **`mata.load("track", ...)` API**: Returns `TrackingAdapter` for persistent per-frame tracking
- ✅ **Multiple source types**: Video files, RTSP streams, webcams, image directories, single images
- ✅ **Track ID rendering**: `show_track_ids=True` with deterministic per-track colors
- ✅ **Trajectory trails**: `show_trails=True` — PIL-native polyline history rendering
- ✅ **CSV/JSON export**: MOT-compatible CSV export, multi-frame JSON with metadata
- ✅ **Graph node upgrade**: `Track` node uses vendored trackers, `BotSortWrapper` added
- ✅ **Graph presets**: BotSort variants added to surveillance/driving presets
- ✅ **YAML config**: Tracker settings in `~/.mata/models.yaml` under `track:` task
- ✅ **687 tests**: All passing, zero regressions against 4047+ total

### ✅ Completed (v1.6)

#### **Graph System Architecture** - Multi-task workflows with parallel execution

- ✅ **Artifact Type System**: Strongly-typed vision primitives (Image, Detections, Masks, Keypoints, Tracks, ROIs)
- ✅ **Task Graph Builder**: Fluent API for composing multi-task pipelines (Detect → Segment → Pose)
- ✅ **Parallel Execution**: Automatic parallelization of independent tasks (1.5-3x speedup, 41x in benchmarks)
- ✅ **Conditional Branching**: Result-driven workflow control with If/else, HasLabel, CountAbove, ScoreAbove
- ✅ **Temporal Processing**: Video inference with BYTETrack/IoU tracking and frame policies
- ✅ **Capability Providers**: Protocol-based model registry with lazy loading
- ✅ **VLM Graph Nodes**: VLMDescribe, VLMDetect, VLMQuery, PromoteEntities for Entity→Instance workflows
- ✅ **Visualization Nodes**: Native Annotate and NMS nodes reusing existing PIL/matplotlib backends
- ✅ **Pre-built Presets**: 8 graph presets (detection+segmentation, scene analysis, VLM workflows, tracking)
- ✅ **Observability**: Metrics collection, execution tracing, and provenance tracking
- ✅ **`mata.infer()` API**: New public API for graph execution with flat provider dicts
- ✅ **Backward Compatibility**: 100% compatible with existing `mata.load()`/`mata.run()` APIs
- ✅ **Comprehensive Testing**: 2185 tests, >80% coverage

### ✅ Completed (v1.5.3)

- ✅ **Multi-Task Support**: Detection, classification, segmentation, depth estimation, vision-language models (VLM)
- ✅ **Zero-Shot Capabilities**: CLIP (classify), GroundingDINO/OWL-ViT (detect), SAM/SAM3 (segment)
- ✅ **Vision-Language Models**: Image captioning, VQA, visual understanding with Qwen3-VL - February 2026
- ✅ **Universal Loader**: llama.cpp-style loading with 5-strategy auto-detection
- ✅ **Multi-Format Runtime**: PyTorch, ONNX Runtime, TorchScript, Torchvision support
- ✅ **Torchvision CNN Detection**: Apache 2.0 licensed models (RetinaNet, Faster R-CNN, FCOS, SSD) - February 2026
- ✅ **Export & Visualization**: JSON/CSV/image/crops with dual backends (PIL/matplotlib)
- ✅ **Plugin Removal**: Simplified architecture, -1,268 lines of legacy code
- ✅ **Comprehensive Testing**: 405 tests (exceeded 202+ target), 60-85% coverage
> **For a full history of completed features, see [CHANGELOG.md](CHANGELOG.md).**

### 🔄 In Progress

@@ -1245,7 +1180,7 @@ export MATA_CONFIG=/path/to/config.json
- ⏳ **ReID model integration**: Feature embeddings via HuggingFace ReID models
- ⏳ **Cross-camera tracking**: Match track IDs across camera feeds
- ⏳ **BotSort ReID mode**: Enable `with_reid=true` in botsort config
- **Status**: Planned for v1.9
- **Status**: Planned for v1.9.x

#### **2. KACA Integration** - MIT-licensed CNN detection with PyTorch and ONNX support

@@ -1261,6 +1196,7 @@ export MATA_CONFIG=/path/to/config.json
- 🔄 **Model Recommendations**: Suggest best models based on task and hardware constraints
- 🔄 **Batch Model Download**: Pre-download common models for air-gapped environments
- 🔄 **Enhanced Search**: Filter by task, license, performance metrics
- **Status**: Planned for v2.x

### ⏳ Planned (v2.0 - Q2 2026)

35 changes: 33 additions & 2 deletions docs/VLM_TOOL_CALLING_SUMMARY.md
@@ -1,9 +1,10 @@
# VLM Tool-Calling Agent System — Architecture Summary

**Version**: 1.7.0
**Version**: 1.7.1
**Implementation Date**: February 16, 2026
**Last Updated**: March 8, 2026
**Status**: ✅ Production Ready
**Test Coverage**: 336 comprehensive tests, all passing
**Test Coverage**: 342 comprehensive tests, all passing

---

@@ -270,6 +271,31 @@ mata.infer(

---

### 8. **Provider-Aware Schema Generation for Zero-Shot Models** _(v1.7.1)_

**Decision**: `ToolRegistry` introspects the actual provider at registration time and upgrades `text_prompts` to `required=True` for zero-shot adapters.

**Rationale**:

- **VLM must know `text_prompts` is required** — The default `TASK_SCHEMA_DEFAULTS["detect"]` marks `text_prompts` as optional (correct for supervised detectors like RT-DETR, YOLO). But zero-shot models (GroundingDINO, OWL-ViT) **cannot run without class names**. If the schema shows the parameter as optional, the VLM's system prompt will say _"optional"_ and the agent will omit it, causing a `TypeError` or `InvalidInputError` at execution time.
- **Zero-shot contract is enforced at the adapter level** — `HuggingFaceZeroShotDetectAdapter.predict()` keeps `text_prompts` as a required positional argument. The fix is upstream: make the _schema_ match the adapter's actual contract.
- **Clean detection via class name** — All MATA zero-shot adapters have `"ZeroShot"` in their class name. `_is_zero_shot_provider()` unwraps one layer of wrapper (e.g., `DetectorWrapper.adapter`) and checks the underlying class name — no new class attributes or protocol changes needed.
- **`TASK_SCHEMA_DEFAULTS` stays generic** — The shared default schema is not modified; customization happens per-provider at `ToolRegistry` construction time.

**Agentic chain this enables**:

```
VLM: "I see an unknown object. Let me classify it."
→ classifier(region=[80,120,220,300]) → "cat (0.92)"
VLM: "It's a cat. Let me find all cats using the detector."
→ detector(text_prompts="cat") → 2 cats detected
VLM: "Found 2 cats at [80,120,220,300] and [300,130,440,280]. Summary..."
```

**Implementation**: `_is_zero_shot_provider()` + upgraded `_schema_for_capability(capability, tool_name, provider)` in `src/mata/core/tool_registry.py` (v1.7.1).

---

### 7. **Multi-Format Tool Call Parsing**

**Decision**: Support fenced blocks (` ```tool_call `), XML (`<tool_call>`), and raw JSON.
@@ -560,6 +586,11 @@ result = AgentResult(
- **Impact**: VLMs may output `"0.5"` instead of `0.5` for floats
- **Solution**: Comprehensive type coercion in `validate_tool_call()`
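
The coercion step can be sketched as follows — a simplified stand-in for the real `validate_tool_call()` logic:

```python
def coerce_value(value, target_type):
    """Best-effort coercion for VLM-emitted strings like "0.5" or "true"."""
    if isinstance(value, target_type):
        return value
    if target_type is bool and isinstance(value, str):
        # bool("false") would be True, so strings need explicit handling.
        return value.strip().lower() in ("true", "1", "yes")
    try:
        return target_type(value)  # e.g. float("0.5")
    except (TypeError, ValueError):
        raise ValueError(f"cannot coerce {value!r} to {target_type.__name__}")
```

Values already of the right type pass through untouched, so well-behaved model outputs pay no cost.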

#### ~~4. Zero-Shot Detector Omits `text_prompts`~~ _(Fixed — v1.7.1)_

- **Was**: `TASK_SCHEMA_DEFAULTS["detect"]` marked `text_prompts` as optional, causing the VLM to omit it. Zero-shot adapters require it, so the call failed with `TypeError`.
- **Fix**: `ToolRegistry._schema_for_capability()` now introspects the actual provider via `_is_zero_shot_provider()` and upgrades `text_prompts` to `required=True` for zero-shot adapters. The VLM's system prompt now correctly says the parameter is required, so the agent always populates it from its own reasoning.

---

## Future Directions
4 changes: 2 additions & 2 deletions examples/classify/basic_classification.py
@@ -1,6 +1,6 @@
"""Basic Classification Examples — MATA Framework

Progressive examples: one-shot → load/reuse → model comparison → filtering.
Progressive examples: one-shot > load/reuse > model comparison > filtering.
Run: python examples/classify/basic_classification.py
"""

@@ -31,7 +31,7 @@ def load_and_reuse():
for _ in range(2):
result = classifier.predict(get_image())
top1 = result.get_top1()
print(f" {top1.label_name}: {top1.score * 100:.2f}%")
print(f" to {top1.label_name}: {top1.score * 100:.2f}%")


# === Section 3: Access Results (.get_top1, top-5 predictions) ===
2 changes: 1 addition & 1 deletion examples/classify/clip_zeroshot.py
@@ -150,7 +150,7 @@ def example4_batch_classification():
for name, img in images:
result = classifier.predict(img, text_prompts=text_prompts, top_k=2)
top2 = [(p.label_name, f"{p.score:.4f}") for p in result.predictions]
print(f" {name:15s} {top2}")
print(f" {name:15s} to {top2}")


def main():
6 changes: 4 additions & 2 deletions examples/depth/basic_depth.py
@@ -38,7 +38,7 @@ def example_depth_v1():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
result.save(OUTPUT_DIR / "depth_v1.png", colormap="magma")
result.save(OUTPUT_DIR / "depth_v1.json")
print(f"Saved {OUTPUT_DIR}/depth_v1.png and depth_v1.json")
print(f"Saved to {OUTPUT_DIR}/depth_v1.png and depth_v1.json")


# === Section 2: One-Shot Depth (Depth Anything V2) ===
@@ -61,7 +61,7 @@ def example_depth_v2():
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
result.save(OUTPUT_DIR / "depth_v2.png", colormap="magma")
result.save(OUTPUT_DIR / "depth_v2.json")
print(f"Saved {OUTPUT_DIR}/depth_v2.png and depth_v2.json")
print(f"Saved to {OUTPUT_DIR}/depth_v2.png and depth_v2.json")


# === Section 3: Load Once, Predict Many ===
@@ -105,6 +105,8 @@ def main():
except Exception as exc:
print(f" [error] load-once: {exc}")

print("\nDone.")


if __name__ == "__main__":
main()
6 changes: 3 additions & 3 deletions examples/detect/basic_detection.py
@@ -87,12 +87,12 @@ def section_export(output_dir: Path):
# Save to .json file
json_path = output_dir / "detections.json"
result.save(str(json_path))
print(f"[export] Saved JSON {json_path}")
print(f"[export] Saved JSON to {json_path}")

# Save annotated image (overlay bboxes on the source image)
img_path = output_dir / "detections_overlay.jpg"
result.save(str(img_path))
print(f"[export] Saved image {img_path}")
print(f"[export] Saved image to {img_path}")


# === Section 6: Config Aliases ===
@@ -104,7 +104,7 @@ def section_config_aliases():
mata.register_model("detect", "my-rtdetr", "PekingU/rtdetr_r50vd", threshold=0.6)

detector = mata.load("detect", "my-detr")
print(f"[alias] Loaded 'my-detr' {detector.__class__.__name__}")
print(f"[alias] Loaded 'my-detr' as {detector.__class__.__name__}")

# Config-file aliases work the same way — set them in .mata/models.yaml
# and load by name without calling register_model() in code.
19 changes: 10 additions & 9 deletions examples/detect/zeroshot_detection.py
@@ -88,7 +88,7 @@ def example_grounding_dino():
output_image = draw_detections(image.copy(), result, text_prompts)
output_path = "examples/images/output_grounding_dino.jpg"
output_image.save(output_path)
print(f"\n Saved visualization to: {output_path}")
print(f"\n Saved visualization to: {output_path}")

return result

@@ -119,7 +119,7 @@ def example_owlvit_v2():
output_image = draw_detections(image.copy(), result, text_prompts)
output_path = "examples/images/output_owlvit_v2.jpg"
output_image.save(output_path)
print(f"\n Saved visualization to: {output_path}")
print(f"\n Saved visualization to: {output_path}")

return result

@@ -153,7 +153,7 @@ def example_batch_processing():
for instance in result.instances:
print(f" - {instance.label_name}: {instance.score:.3f}")

print(f"\n Processed {len(images)} images in batch")
print(f"\n Processed {len(images)} images in batch")

return results

@@ -214,9 +214,9 @@ def example_model_comparison():
print(f" OWL-ViT v2: {len(result_owlv2.instances)} objects")

print("\n[Results] Model comparison:")
print(f" ├─ GroundingDINO: {len(result_gdino.instances)} detections")
print(f" ├─ OWL-ViT v1: {len(result_owlv1.instances)} detections")
print(f" └─ OWL-ViT v2: {len(result_owlv2.instances)} detections")
print(f" - GroundingDINO: {len(result_gdino.instances)} detections")
print(f" - OWL-ViT v1: {len(result_owlv1.instances)} detections")
print(f" - OWL-ViT v2: {len(result_owlv2.instances)} detections")

return result_gdino, result_owlv1, result_owlv2

@@ -238,17 +238,18 @@ def main():
example_model_comparison()

print("\n" + "=" * 70)
print(" All examples completed successfully!")
print(" All examples completed successfully!")
print("=" * 70)
print("\nNext steps:")
print(" 1. Check the output images in examples/images/")
print(" 2. Try with your own images")
print(" 3. Experiment with different text prompts")
print(" 4. Explore the GroundingDINOSAM pipeline: examples/segment/grounding_sam_pipeline.py")
print(" 4. Explore the GroundingDINO then SAM pipeline: examples/segment/grounding_sam_pipeline.py")
print()
print("Done")

except Exception as e:
print(f"\n Error: {e}", file=sys.stderr)
print(f"\n Error: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)
2 changes: 1 addition & 1 deletion examples/graph/README.md
@@ -18,7 +18,7 @@ These examples demonstrate the fundamental graph system capabilities:

| Example | Description | Key Features |
| ------------------------------------------- | -------------------------------------------- | ------------------------------------------------ |
| ✅ [simple_pipeline.py](simple_pipeline.py) | Detection → Filter → Segmentation → Fuse | `mata.infer()`, `Graph.then()`, basic pipeline |
| ✅ [simple_pipeline.py](simple_pipeline.py) | Detection > Filter > Segmentation > Fuse | `mata.infer()`, `Graph.then()`, basic pipeline |
| [parallel_tasks.py](parallel_tasks.py) | Parallel detection + classification + depth | `Graph.parallel()`, `ParallelScheduler`, speedup |
| [video_tracking.py](video_tracking.py) | Video processing with object tracking | `VideoProcessor`, `Track`, frame policies |
| [vlm_workflows.py](vlm_workflows.py) | VLM grounded detection & scene understanding | `VLMDetect`, `PromoteEntities`, VLM presets |