MediaSearch

A media transcription, indexing, and search platform built exclusively for VAST Data infrastructure. MediaSearch ingests audio and video files, transcribes them using pluggable ASR engines, generates vector embeddings for semantic search, and provides keyword, semantic, and hybrid search capabilities with sub-second latency.

Status: ✅ Feature Complete - All 50 adapter methods for production VAST deployment are implemented and tested.

Problem Statement

Organizations with large media libraries need to make spoken content searchable. Traditional approaches require manual transcription or proprietary SaaS solutions that don't integrate with on-premises storage. MediaSearch provides an open, self-hosted solution that:

Automatically transcribes media files as they're uploaded
Supports multiple ASR engines (NVIDIA NIMs, Whisper, or bring-your-own)
Enables precise search with timestamp-accurate results
Scales horizontally for enterprise workloads
Runs on VAST Data infrastructure in production, with local development support

Implementation Status

VAST Adapters (Production)

Component	Methods	Status	Features
VAST DataBase	30/30	✅ Complete	Asset management, versioning, transcript storage, 384-dim vector search, DLQ operations
VAST DataEngine Queue	9/9	✅ Complete	Job enqueuing (immediate & delayed), consumption, acknowledgment, retry with backoff, DLQ integration
VAST DataEngine S3	11/11	✅ Complete	Object operations, bucket management, presigned URLs, event notifications
TOTAL	50/50	✅ 100% COMPLETE	Ready for production deployment

Search Capabilities

✅ Keyword Search: LIKE-based pattern matching on transcript segments
✅ Semantic Search: Vector similarity using cosine distance (384-dimensional embeddings)
✅ Hybrid Search: Combined keyword + semantic with configurable weights
✅ Visibility Filtering: All queries automatically restrict results to ACTIVE assets (prevents exposure of partial or staging data)
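
The list above does not say how the keyword and semantic scores are combined. The following is a minimal sketch, assuming a normalized weighted sum and a keyword score already scaled to [0, 1]. The names `cosineDistance`, `hybridScore`, and `HybridWeights` are hypothetical and are not taken from the MediaSearch codebase.

```typescript
// Hypothetical helper: cosine distance = 1 - cosine similarity, in [0, 2].
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine distance is undefined for a zero vector");
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical weight pair; the actual configuration keys are not documented here.
interface HybridWeights {
  keyword: number;
  semantic: number;
}

// Assumption: keywordScore is in [0, 1]; distance is a cosine distance in [0, 2].
function hybridScore(
  keywordScore: number,
  distance: number,
  weights: HybridWeights
): number {
  const total = weights.keyword + weights.semantic;
  if (total <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  // Map distance [0, 2] onto a similarity in [0, 1] so both terms share a scale.
  const semanticScore = 1 - distance / 2;
  return (weights.keyword * keywordScore + weights.semantic * semanticScore) / total;
}
```

Dividing by the sum of the weights keeps the combined score in [0, 1] whatever weights are configured, so results from differently tuned queries stay comparable.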

Architecture Features

✅ Transaction Support: Atomic operations for asset versioning and state transitions
✅ Error Handling: Dead-letter queue with automatic retry logic and triage classification
✅ Python Sidecar: Full SQL query support with complex filtering (WHERE, ORDER BY, LIMIT, array_cosine_distance)
✅ Comprehensive Testing: 100+ unit tests with mocks (no VAST cluster required)
✅ Production Logging: Detailed console logging for debugging and monitoring
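
To illustrate the retry and dead-letter behavior listed above, here is a minimal sketch assuming capped exponential backoff. `RetryPolicy`, `RetryDecision`, `nextRetry`, and the delay values are hypothetical. Only the default of 5 attempts comes from this README (`MAX_RETRY_ATTEMPTS`).

```typescript
// Hypothetical policy shape; only maxAttempts has a documented default (5).
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// failedAttempts is the number of attempts that have already failed (1 after the first failure).
function nextRetry(failedAttempts: number, policy: RetryPolicy): RetryDecision {
  if (failedAttempts >= policy.maxAttempts) {
    // Retry budget exhausted: hand the job to the dead-letter queue.
    return { action: "dead-letter" };
  }
  // Double the delay on each failure, capped at maxDelayMs.
  const uncapped = policy.baseDelayMs * Math.pow(2, failedAttempts - 1);
  return { action: "retry", delayMs: Math.min(uncapped, policy.maxDelayMs) };
}
```

Returning a decision value instead of sleeping inside the function keeps the logic testable without timers, and lets a queue adapter translate the delay into a delayed enqueue.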

Architecture

MediaSearch is designed as a VAST-native platform that leverages VAST Data's unified infrastructure:

┌─────────────────┐     ┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  VAST S3 Bucket │────▶│   Ingest    │────▶│ Orchestrator │────▶│ Search API  │
│  (Media Upload) │     │   Service   │     │              │     │             │
└─────────────────┘     └─────────────┘     └──────────────┘     └─────────────┘
                               │                    │
                               ▼                    ▼
                    ┌────────────────┐      ┌───────────┐
                    │ VAST DataBase  │      │ ASR/Embed │
                    │ (Index+Vectors)│      │  Engines  │
                    └────────────────┘      └───────────┘

Production Stack (VAST Data)

VAST Data provides a unified infrastructure that combines storage, compute, and database in a single system:

Component	VAST Service	Implementation	Purpose
Object Storage	VAST S3-compatible buckets	`vast-dataengine/S3 adapter` (11 methods)	Media file storage with event notifications
Compute	VAST DataEngine	`vast-dataengine/Queue adapter` (9 methods)	Serverless job execution for transcription, embedding generation
Database	VAST DataBase	`vast-database adapter` (30 methods)	Relational tables + 384-dim vector embeddings with semantic search
Python Bridge	HTTP RPC Sidecar	`services/vast-db-sidecar/app.py`	Flask server bridging Node.js to VAST Python SDK for database operations

Local Development Stack

For development and testing, swappable adapters provide equivalent functionality:

Production	Local Equivalent	Notes
VAST S3	MinIO	S3-compatible object storage
VAST DataEngine	Direct Node.js	In-process execution
VAST DataBase	PostgreSQL + pgvector	Relational + vector search

Key Design Principles:

VAST-first architecture: Production uses VAST DataBase + DataEngine exclusively
Adapter pattern: All infrastructure behind port interfaces with swappable implementations
Lifecycle management: Versioned assets with atomic visibility transitions (STAGING → ACTIVE → ARCHIVED)
Idempotent operations: Safe retries with exponential backoff and dead-letter queue
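
The lifecycle principle above can be sketched as a small state machine. This is an illustration only: it assumes transitions are strictly forward (STAGING → ACTIVE → ARCHIVED), which the arrow notation suggests but the README does not state, and `AssetState`, `transition`, and `isSearchable` are hypothetical names.

```typescript
type AssetState = "STAGING" | "ACTIVE" | "ARCHIVED";

// Assumption: only forward transitions are legal.
const ALLOWED_TRANSITIONS: Record<AssetState, AssetState[]> = {
  STAGING: ["ACTIVE"],
  ACTIVE: ["ARCHIVED"],
  ARCHIVED: [],
};

// Returns the new state, or throws if the move is not allowed.
function transition(from: AssetState, to: AssetState): AssetState {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}

// Mirrors the visibility rule: only ACTIVE assets are returned by search.
function isSearchable(state: AssetState): boolean {
  return state === "ACTIVE";
}
```

In a real adapter the state change would run inside a database transaction so that a new version becomes visible atomically. This sketch covers only the validation step.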

See docs/ARCHITECTURE.md for detailed design documentation.

Non-Goals

This project intentionally avoids certain architectural patterns:

No PostgreSQL in production: VAST DataBase provides all relational and vector storage needs
No Redis/message queues: VAST DataEngine handles job orchestration natively
No external object stores: Media lives in VAST S3-compatible storage only
No cloud-specific services: No AWS SQS, Azure Service Bus, or GCP Pub/Sub dependencies

The local development adapters (PostgreSQL, Redis, MinIO) exist solely for offline development and are never deployed to production. This constraint is enforced by CI guardrails.

Prerequisites

Node.js 20+
pnpm 9+
Docker & Docker Compose

Quick Start

# Clone repository
git clone https://github.com/ssotoa70/mediasearch.git
cd mediasearch

# Install dependencies
pnpm install

# Start local infrastructure (PostgreSQL, Redis, MinIO)
docker compose up -d

# Build all packages
pnpm build

# Run end-to-end demo
pnpm demo

The demo will:

Start all services
Upload a sample file
Process it through the pipeline
Execute search queries

Manual Testing

# Upload a file to MinIO (localhost:9001, user: minioadmin)
# Or trigger ingestion manually:
curl -X POST http://localhost:3000/events/s3 \
  -H "Content-Type: application/json" \
  -d '{"event_type": "ObjectCreated", "bucket": "media", "object_key": "test.mp3"}'

# Search (after processing completes)
curl "http://localhost:3001/search?q=hello&type=keyword"
curl "http://localhost:3001/search?q=greeting&type=semantic"

Configuration

Copy .env.example to .env and configure:

Variable	Description	Default
`BACKEND`	Infrastructure backend (`vast` or `local`)	`local`
`ASR_ENGINE`	ASR engine (`NVIDIA_NIMS`, `WHISPER`, `BYO`, `STUB`)	`STUB`
`EMBEDDING_MODEL`	Embedding model name	`stub`
`EMBEDDING_DIMENSION`	Vector dimension	`384`
`MEDIA_BUCKET`	S3 bucket for media files	`media`
`JOB_CONCURRENCY`	Parallel job processing	`4`
`MAX_RETRY_ATTEMPTS`	Retries before DLQ	`5`

See .env.example for complete configuration options.
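
As an illustration of how these variables might be read and validated at startup, here is a minimal sketch. The variable names and defaults come from the table above. `MediaSearchConfig`, `loadConfig`, and the helper functions are hypothetical and do not describe the project's actual loader.

```typescript
type Backend = "vast" | "local";
type AsrEngine = "NVIDIA_NIMS" | "WHISPER" | "BYO" | "STUB";
type Env = Record<string, string | undefined>;

interface MediaSearchConfig {
  backend: Backend;
  asrEngine: AsrEngine;
  embeddingModel: string;
  embeddingDimension: number;
  mediaBucket: string;
  jobConcurrency: number;
  maxRetryAttempts: number;
}

// Returns the fallback when unset, otherwise checks the value against the allowed set.
function oneOf<T extends string>(
  name: string,
  value: string | undefined,
  allowed: readonly T[],
  fallback: T
): T {
  if (value === undefined || value === "") return fallback;
  if (allowed.indexOf(value as T) === -1) {
    throw new Error(`${name} must be one of ${allowed.join(", ")}; got "${value}"`);
  }
  return value as T;
}

// Returns the fallback when unset, otherwise requires a finite positive integer.
function positiveInt(name: string, value: string | undefined, fallback: number): number {
  if (value === undefined || value === "") return fallback;
  const n = Number(value);
  if (!isFinite(n) || n <= 0 || Math.floor(n) !== n) {
    throw new Error(`${name} must be a positive integer; got "${value}"`);
  }
  return n;
}

// Defaults below mirror the Default column of the configuration table.
function loadConfig(env: Env): MediaSearchConfig {
  return {
    backend: oneOf<Backend>("BACKEND", env["BACKEND"], ["vast", "local"], "local"),
    asrEngine: oneOf<AsrEngine>(
      "ASR_ENGINE",
      env["ASR_ENGINE"],
      ["NVIDIA_NIMS", "WHISPER", "BYO", "STUB"],
      "STUB"
    ),
    embeddingModel: env["EMBEDDING_MODEL"] || "stub",
    embeddingDimension: positiveInt("EMBEDDING_DIMENSION", env["EMBEDDING_DIMENSION"], 384),
    mediaBucket: env["MEDIA_BUCKET"] || "media",
    jobConcurrency: positiveInt("JOB_CONCURRENCY", env["JOB_CONCURRENCY"], 4),
    maxRetryAttempts: positiveInt("MAX_RETRY_ATTEMPTS", env["MAX_RETRY_ATTEMPTS"], 5),
  };
}
```

In a Node.js service this would typically be called once as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.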

Services

Service	Port	Description
Ingest	3000	S3 event handler, job creation
Search API	3001	Keyword, semantic, hybrid search
Triage	3002	DLQ management, quarantine handling
Orchestrator	3003	Job processing, ASR, embeddings

Project Structure

mediasearch/
├── packages/
│   ├── domain/              # Business logic, types, port interfaces
│   └── adapters/
│       ├── local-postgres/  # PostgreSQL adapter (dev)
│       ├── local-queue/     # Redis/BullMQ adapter (dev)
│       ├── local-s3/        # MinIO adapter (dev)
│       ├── vast-database/   # VAST DataBase adapter (prod)
│       └── vast-dataengine/ # VAST DataEngine adapter (prod)
├── services/
│   ├── ingest/              # S3 event ingestion
│   ├── orchestrator/        # Job processing pipeline
│   ├── search-api/          # Search endpoints
│   └── triage/              # DLQ management
├── db/migrations/           # Database schema
├── docker/                  # Dockerfiles
├── docs/                    # Documentation
└── scripts/                 # Utility scripts

Documentation

Roadmap

Near-term (API Stabilization)

Finalize search API contract and response schema
Add OpenAPI/Swagger documentation
Define comprehensive error codes and messages

Mid-term (Relevance & Quality)

Search relevance tuning (BM25 weights, vector similarity thresholds)
Multi-language ASR support with auto-detection
Speaker diarization improvements
Confidence score calibration

Long-term (Enterprise Features)

Multi-tenant support with namespace isolation
Webhook notifications for processing events
Prometheus metrics and Grafana dashboards
Transcript editing and correction UI

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0 - see LICENSE for details.
