Visual AI API

High-performance visual analysis API with NVIDIA Triton Inference Server.

Object detection, face recognition, visual search, OCR, and embeddings - all through a unified REST API with TensorRT acceleration.


Quick Start

Copy & Run (One Line)

git clone https://github.com/davidamacey/OpenProcessor.git && cd OpenProcessor && ./scripts/setup.sh

That's it! The setup script automatically:

  • Pulls pre-built Docker images from Docker Hub (~32GB total)
  • Detects your GPU and selects the optimal profile
  • Downloads required models (~500MB, ~16 seconds)
  • Exports models to TensorRT (~30-60 min first time, one-time only)
  • Starts all services and runs smoke tests

First-time setup takes ~30-60 minutes (mostly TensorRT compilation). Subsequent starts take seconds.

Non-Interactive Setup

# Clone and setup with defaults (no prompts)
git clone https://github.com/davidamacey/OpenProcessor.git && cd OpenProcessor && ./scripts/setup.sh --yes

# Or specify a profile explicitly
./scripts/setup.sh --profile=standard --gpu=0 --yes

Verify Installation

curl http://localhost:4603/health
# {"status":"healthy","version":"0.1.0",...}

# Quick test with an image
curl -X POST http://localhost:4603/detect -F "image=@your-image.jpg"
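
Because first-time setup can take a while, it may help to poll the health endpoint from a script rather than retrying curl by hand. The following is a minimal sketch using only the Python standard library. It assumes the /health response is JSON with a "status" field equal to "healthy" once the service is ready, as in the sample above; the function names, attempt count, and delay are illustrative and not part of the project.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:4603/health"  # endpoint shown in the curl example


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 30,   # illustrative default, not a project setting
    delay_s: float = 2.0,  # illustrative default, not a project setting
) -> bool:
    """Poll until the service reports status == "healthy" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling wait_until_healthy() with no arguments polls the local service; passing a custom fetch callable makes the loop easy to exercise without a running server.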

Management Commands

./scripts/openprocessor.sh status    # Check service health
./scripts/openprocessor.sh logs -f   # View live logs
./scripts/openprocessor.sh restart   # Restart all services
./scripts/openprocessor.sh help      # See all commands

See INSTALLATION.md for manual installation, troubleshooting, and advanced options.

Docker Hub

Pre-built images are available on Docker Hub (pulled automatically by setup):

docker pull davidamacey/openprocessor:latest        # FastAPI service (~14GB)
docker pull davidamacey/openprocessor-triton:latest  # Triton server (~18GB)

GPU Compatibility

| Profile | VRAM | GPUs | Throughput |
|---|---|---|---|
| minimal | 6-8GB | RTX 3060, RTX 4060 | ~5 RPS |
| standard | 12-24GB | RTX 3080, RTX 4090 | ~15 RPS |
| full | 48GB+ | A6000, A100 | ~50 RPS |

Switch profiles: ./scripts/openprocessor.sh profile <name>
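
To make the profile table concrete, here is a small sketch that maps available VRAM to a profile name. It is not the logic used by setup.sh, which is not shown here. The table leaves the 8-12GB and 24-48GB ranges unassigned, so the sketch assumes the smaller profile applies in those gaps; pick_profile is a hypothetical helper name.

```python
def pick_profile(vram_gb: float) -> str:
    """Map GPU memory in GB to a profile name from the GPU Compatibility table.

    Assumption: in the ranges the table does not cover (8-12GB, 24-48GB),
    fall back to the smaller profile.
    """
    if vram_gb < 6:
        raise ValueError(f"{vram_gb}GB is below the 6GB listed for 'minimal'")
    if vram_gb < 12:
        return "minimal"
    if vram_gb < 48:
        return "standard"
    return "full"


for vram in (8, 24, 48):
    print(vram, pick_profile(vram))
# 8 minimal
# 24 standard
# 48 full
```

The returned name could then be passed to ./scripts/openprocessor.sh profile <name>.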


Setup Timing (RTX A6000, 48GB)

First-time setup is dominated by one-time TensorRT model compilation. Subsequent starts take only seconds since compiled engines are cached.

| Step | Time | Notes |
|---|---|---|
| Model downloads | ~16s | 5 models, ~500MB from HuggingFace/GitHub |
| Docker image pull | ~30s | Pre-built from Docker Hub (~32GB total) |
| TensorRT exports | ~31 min | One-time only, cached after first run |
| Service startup | ~33s | Triton loads cached TRT engines |
| Total first run | ~32 min | |
| Subsequent starts | ~30s | Just docker compose up -d |
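
As a quick sanity check, the first-run total can be reproduced from the individual steps. The sketch below only restates the figures from the timing table in seconds; the variable names are illustrative.

```python
# First-run steps from the Setup Timing table, converted to seconds.
first_run_steps_s = {
    "model downloads": 16,
    "docker image pull": 30,
    "tensorrt exports": 31 * 60,
    "service startup": 33,
}

total_s = sum(first_run_steps_s.values())
total_min = total_s / 60
export_share = first_run_steps_s["tensorrt exports"] / total_s

print(f"total: {total_s}s = {total_min:.1f} min")  # total: 1939s = 32.3 min
print(f"TensorRT share: {export_share:.0%}")       # TensorRT share: 96%
```

The sum matches the ~32 min total, and it shows that nearly all of the first-run time is TensorRT compilation.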

TensorRT Export Breakdown:

| Model | Export Time | Engine Size |
|---|---|---|
| YOLO11 detection | ~8 min | ~45 MB |
| SCRFD face detection | ~2 min | ~20 MB |
| ArcFace embeddings | ~2 min | ~86 MB |
| MobileCLIP image encoder | ~4 min | ~35 MB |
| MobileCLIP text encoder | ~1 min | ~125 MB |
| PaddleOCR (det + rec) | ~15 min | ~15 MB |

Times vary by GPU: GPUs with more CUDA cores export faster, while lower-VRAM GPUs (8-12GB) may take 45-60 minutes in total.


API Endpoints

All endpoints are available on port 4603.

Object Detection

| Endpoint | Method | Description |
|---|---|---|
| /detect | POST | YOLO object detection (single image) |
| /detect/batch | POST | Batch detection (up to 64 images) |
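
Since /detect/batch accepts at most 64 images per request, a larger set has to be split client-side. The sketch below shows only the chunking step; the multipart field name that the batch endpoint expects is not documented in this section, so the upload call itself is left out. The chunked helper and the file names are illustrative.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 64  # per-request limit stated for /detect/batch


def chunked(items: Sequence[T], size: int = MAX_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


paths = [f"img_{i:04d}.jpg" for i in range(150)]  # hypothetical file names
batch_sizes = [len(batch) for batch in chunked(paths)]
print(batch_sizes)  # [64, 64, 22]
```

The same helper applies to /ingest/batch, which lists the same 64-image limit.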

Face Recognition

| Endpoint | Method | Description |
|---|---|---|
| /faces/detect | POST | Face detection with landmarks (SCRFD) |
| /faces/recognize | POST | Detection + ArcFace 512-dim embeddings |
| /faces/verify | POST | 1:1 face comparison (two images) |
| /faces/search | POST | Find similar faces in index |
| /faces/identify | POST | 1:N face identification |
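
For readers who want to compare embeddings returned by /faces/recognize on the client side, cosine similarity is a common choice for ArcFace-style vectors. The sketch below is an illustration of that general technique, not a description of how /faces/verify is implemented; the function names and the 0.5 threshold are placeholders to be tuned on your own data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def same_person(a: Sequence[float], b: Sequence[float], threshold: float = 0.5) -> bool:
    """Placeholder decision rule; the threshold is an assumption, not a project value."""
    return cosine_similarity(a, b) >= threshold
```

With real data, a and b would be two 512-float lists taken from the "embeddings" field of the face recognition response.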

Embeddings

| Endpoint | Method | Description |
|---|---|---|
| /embed/image | POST | MobileCLIP image embedding (512-dim) |
| /embed/text | POST | MobileCLIP text embedding (512-dim) |
| /embed/batch | POST | Batch image embeddings |
| /embed/boxes | POST | Per-box crop embeddings |

Visual Search

| Endpoint | Method | Description |
|---|---|---|
| /search/image | POST | Image-to-image similarity search |
| /search/text | POST | Text-to-image search |
| /search/face | POST | Face similarity search |
| /search/ocr | POST | Search images by text content |
| /search/object | POST | Object-level search (vehicles, people) |

Data Ingestion

| Endpoint | Method | Description |
|---|---|---|
| /ingest | POST | Ingest image (auto-indexes faces, OCR, objects) |
| /ingest/batch | POST | Batch ingest (up to 64 images) |
| /ingest/directory | POST | Bulk ingest from server directory |

OCR (Text Extraction)

| Endpoint | Method | Description |
|---|---|---|
| /ocr/predict | POST | Extract text from image (PP-OCRv5) |
| /ocr/batch | POST | Batch OCR processing |

Combined Analysis

| Endpoint | Method | Description |
|---|---|---|
| /analyze | POST | All models on single image (YOLO + faces + CLIP + OCR) |
| /analyze/batch | POST | Batch combined analysis |

Clustering & Albums

| Endpoint | Method | Description |
|---|---|---|
| /clusters/train/{index} | POST | Train FAISS clustering for an index |
| /clusters/stats/{index} | GET | Get cluster statistics |
| /clusters/{index}/{id} | GET | Get cluster members |
| /clusters/albums | GET | List auto-generated albums |

Data Retrieval

| Endpoint | Method | Description |
|---|---|---|
| /query/image/{id} | GET | Get stored image data/metadata |
| /query/stats | GET | Index statistics for all indexes |
| /query/duplicates | GET | List duplicate groups |

Health & Monitoring

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Service health check |
| /health/models | GET | Triton model status |

Usage Examples

Python

import requests

# Object Detection
with open('image.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/detect', files={'image': f})
result = resp.json()
# {"detections": [{"x1": 0.1, "y1": 0.2, "x2": 0.3, "y2": 0.4, "confidence": 0.95, "class_id": 0, "class_name": "person"}], ...}

# Face Recognition
with open('photo.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/faces/recognize', files={'image': f})
print(resp.json())
# {"num_faces": 2, "faces": [...], "embeddings": [[...512 floats...], ...]}

# Image Embedding
with open('image.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/embed/image', files={'image': f})
embedding = resp.json()['embedding']  # 512-dim vector

# Text-to-Image Search
resp = requests.post('http://localhost:4603/search/text',
                    json={'query': 'a red sports car', 'top_k': 10})
results = resp.json()['results']

# Image Ingestion (auto-indexes everything)
with open('photo.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/ingest',
                        files={'image': f},
                        data={'image_id': 'photo_001'})
print(resp.json())
# {"status": "success", "image_id": "photo_001", "indexed": {"global": true, "faces": 2, "vehicles": 1}, ...}

# OCR
with open('document.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/ocr/predict', files={'image': f})
print(resp.json())
# {"num_texts": 5, "texts": ["Invoice", "Total: $100"], ...}

# Combined Analysis (everything in one call)
with open('scene.jpg', 'rb') as f:
    resp = requests.post('http://localhost:4603/analyze', files={'image': f})
result = resp.json()
# {"detections": [...], "faces": [...], "global_embedding": [...], "ocr": {...}}

cURL

# Detection
curl -X POST http://localhost:4603/detect -F "image=@photo.jpg"

# Face Recognition
curl -X POST http://localhost:4603/faces/recognize -F "image=@face.jpg"

# Text Search
curl -X POST http://localhost:4603/search/text \
    -H "Content-Type: application/json" \
    -d '{"query": "sunset beach", "top_k": 10}'

# Ingestion
curl -X POST http://localhost:4603/ingest \
    -F "image=@photo.jpg" \
    -F "image_id=my_photo_001"

Response Formats

Detection Response

{
  "detections": [
    {
      "x1": 0.094, "y1": 0.278, "x2": 0.870, "y2": 0.989,
      "confidence": 0.918,
      "class_id": 0,
      "class_name": "person"
    }
  ],
  "image": {"width": 1920, "height": 1080},
  "inference_time_ms": 12.5
}

Note: Coordinates are normalized (0.0-1.0). Multiply by image width/height for pixels.
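
The conversion described in the note can be sketched as follows, using the sample detection response above as input. The helper name to_pixel_box and the choice to round to the nearest integer are illustrative assumptions.

```python
def to_pixel_box(det: dict, width: int, height: int) -> tuple:
    """Scale a normalized (0.0-1.0) box to integer pixel coordinates."""
    return (
        round(det["x1"] * width),
        round(det["y1"] * height),
        round(det["x2"] * width),
        round(det["y2"] * height),
    )


# Sample Detection Response from above, trimmed to the fields used here.
response = {
    "detections": [
        {"x1": 0.094, "y1": 0.278, "x2": 0.870, "y2": 0.989,
         "confidence": 0.918, "class_id": 0, "class_name": "person"}
    ],
    "image": {"width": 1920, "height": 1080},
}

size = response["image"]
for det in response["detections"]:
    print(det["class_name"], to_pixel_box(det, size["width"], size["height"]))
# person (180, 300, 1670, 1068)
```

The x values are multiplied by the image width and the y values by the image height, so the image size returned in the same response is all that is needed.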

Face Recognition Response

{
  "num_faces": 2,
  "faces": [
    {
      "box": {"x1": 0.30, "y1": 0.10, "x2": 0.50, "y2": 0.40},
      "confidence": 0.98,
      "landmarks": [[0.35, 0.20], [0.45, 0.20], [0.40, 0.28], [0.36, 0.35], [0.44, 0.35]]
    }
  ],
  "embeddings": [[...512 floats...]],
  "inference_time_ms": 25.3
}

Search Response

{
  "status": "success",
  "results": [
    {"image_id": "img_001", "score": 0.95, "image_path": "/path/to/image.jpg"}
  ],
  "total_results": 10,
  "search_time_ms": 15.2
}
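
A typical client-side step after a search is to keep only the stronger matches. The sketch below assumes that a higher "score" means a closer match, which the sample above suggests but does not state; the 0.8 cutoff, the helper name, and the extra result rows are illustrative.

```python
def top_matches(response: dict, min_score: float = 0.8) -> list:
    """Return image_ids of results scoring at least min_score, best first."""
    hits = [r for r in response.get("results", []) if r["score"] >= min_score]
    hits.sort(key=lambda r: r["score"], reverse=True)
    return [r["image_id"] for r in hits]


# Hypothetical response in the shape of the Search Response above.
sample = {
    "status": "success",
    "results": [
        {"image_id": "img_002", "score": 0.81, "image_path": "/path/to/b.jpg"},
        {"image_id": "img_001", "score": 0.95, "image_path": "/path/to/a.jpg"},
        {"image_id": "img_003", "score": 0.42, "image_path": "/path/to/c.jpg"},
    ],
}
print(top_matches(sample))  # ['img_001', 'img_002']
```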

Ingest Response

{
  "status": "success",
  "image_id": "photo_001",
  "num_detections": 5,
  "num_faces": 2,
  "embedding_norm": 1.0,
  "indexed": {
    "global": true,
    "vehicles": 1,
    "people": 2,
    "faces": 2,
    "ocr": true
  },
  "ocr": {
    "num_texts": 3,
    "full_text": "Invoice Total: $100",
    "indexed": true
  },
  "total_time_ms": 850.4
}

Architecture

Client (Port 4603)
       |
       v
  +----------+
  | yolo-api |  FastAPI service (all endpoints)
  +----------+
       |
       v
  +--------------+     +------------+
  | triton-server|     | opensearch |
  | (GPU)        |     | (k-NN)     |
  +--------------+     +------------+

Services:

  • yolo-api (port 4603): FastAPI service handling all requests
  • triton-server (ports 4600-4602): NVIDIA Triton Inference Server with TensorRT models
  • opensearch (port 4607): Vector database for similarity search
  • prometheus/grafana (ports 4604/4605): Monitoring stack
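
When a service appears to be down, a quick TCP probe of the ports listed above can narrow down which container is not listening. This is a minimal standard-library sketch. It assumes the ports are published on localhost and that the prometheus/grafana ports map to 4604 and 4605 in that order; the function names are illustrative.

```python
import socket

# Host ports listed in the Services section above.
SERVICE_PORTS = {
    "yolo-api": [4603],
    "triton-server": [4600, 4601, 4602],
    "opensearch": [4607],
    "prometheus": [4604],  # assumed from the order "prometheus/grafana (4604/4605)"
    "grafana": [4605],
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service to True only if all of its ports accept connections."""
    return {
        name: all(port_open(host, p) for p in ports)
        for name, ports in SERVICE_PORTS.items()
    }
```

An open port only shows that something is listening; /health and /health/models remain the authoritative checks for the API and the Triton models.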

Models

| Model | Purpose | Backend |
|---|---|---|
| YOLO11 | Object detection | TensorRT End2End |
| SCRFD-10G | Face detection + landmarks | TensorRT |
| ArcFace | Face embeddings (512-dim) | TensorRT |
| MobileCLIP | Image/text embeddings (512-dim) | TensorRT |
| PP-OCRv5 | Text detection + recognition | TensorRT |

All models use FP16 precision with dynamic batching for optimal throughput.


Performance

Measured Latency (single request):

| Operation | Time | Throughput |
|---|---|---|
| Object Detection | 140-170ms | ~6-7 RPS |
| Face Detection | 100-150ms | ~7-10 RPS |
| Face Recognition | 105-130ms | ~8-9 RPS |
| Image Embedding (CLIP) | 6-8ms | ~120 RPS |
| Text Embedding (CLIP) | 5-17ms | ~60-200 RPS |
| OCR Prediction | 170-350ms | ~3-6 RPS |
| Full Analyze | 280-430ms | ~2-3 RPS |
| Single Image Ingest | 750-950ms | ~1-1.3 RPS |
| Batch Ingest (50 images) | 7.3s total | ~6.8 images/sec |

Batch processing provides ~2-3x throughput improvement over sequential single-image processing.
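
The throughput column follows from the latency column when requests are issued one at a time, and the batch figure is simply images divided by elapsed time. A short sketch of that arithmetic, using numbers from the performance table (the helper name is illustrative):

```python
def sequential_rps(latency_ms: float) -> float:
    """Requests per second when requests are issued one at a time."""
    return 1000.0 / latency_ms


# Object Detection: 140-170ms per request.
print(f"{sequential_rps(170):.1f}-{sequential_rps(140):.1f} RPS")  # 5.9-7.1 RPS

# Batch Ingest: 50 images in 7.3s.
batch_rate = 50 / 7.3
print(f"{batch_rate:.1f} images/sec")  # 6.8 images/sec
```

Concurrent requests can exceed these sequential figures, since the models are served with dynamic batching.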


System Requirements

Minimum:

  • NVIDIA GPU with 8GB+ VRAM (Ampere or newer)
  • 16GB RAM, 16 CPU cores
  • Docker with NVIDIA Container Toolkit

Recommended:

  • NVIDIA A100/A6000/RTX 4090 (16GB+)
  • 64GB RAM, 48+ CPU cores
  • NVMe SSD for image storage

Configuration

Worker Count

# docker-compose.yml (set one value; a YAML mapping cannot repeat the command key)
command: --workers=64    # Production
# command: --workers=2   # Development

GPU Selection

# docker-compose.yml
device_ids: ['0', '2']  # Use GPUs 0 and 2

Testing

Run the comprehensive test suite to verify all functionality:

# Full system test (32 tests covering all endpoints)
source .venv/bin/activate
python tests/test_full_system.py 2>&1 | tee test_results/test_results.txt

# Visual validation (draws bounding boxes on test images)
python tests/validate_visual_results.py 2>&1 | tee test_results/visual_validation.txt

# View annotated test images
ls test_results/*.jpg

Test Coverage:

  • ✅ All ML model endpoints (detection, faces, CLIP, OCR)
  • ✅ Single and batch processing
  • ✅ Directory ingest pipeline (50+ images)
  • ✅ OpenSearch indexing and search
  • ✅ Visual validation with bounding boxes

Benchmarking

cd benchmarks
./build.sh
./triton_bench --mode quick    # 30-second test
./triton_bench --mode full     # Full benchmark

See benchmarks/README.md for detailed benchmarking guide.


Documentation


Attribution

See ATTRIBUTION.md for the third-party components this project uses and for complete licensing information.


Built for maximum throughput - Process 100K+ images in minutes, visual search in milliseconds.