diff --git a/MEDIA_API.md b/MEDIA_API.md
new file mode 100644
index 0000000..3467114
--- /dev/null
+++ b/MEDIA_API.md
@@ -0,0 +1,954 @@
+# Media Generation API Specification
+
+OpenAI-compatible API specification for generating images, music, video and more using various AI providers.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Features](#features)
+- [API Endpoints](#api-endpoints)
+  - [Image Generation](#image-generation)
+  - [Image Editing](#image-editing)
+  - [Image Variations](#image-variations)
+  - [Text-to-Speech](#text-to-speech)
+  - [Speech-to-Text](#speech-to-text)
+  - [Audio Translation](#audio-translation)
+  - [Music Generation](#music-generation)
+  - [Video Generation](#video-generation)
+- [Provider Support](#provider-support)
+- [Authentication](#authentication)
+- [Usage Examples](#usage-examples)
+- [Implementation Guide](#implementation-guide)
+- [Error Handling](#error-handling)
+
+## Overview
+
+This API specification provides a unified, OpenAI-compatible interface for various media generation services. It enables developers to:
+
+- Generate images from text prompts
+- Edit and create variations of existing images
+- Convert text to speech
+- Transcribe audio to text with speaker diarization
+- Translate audio to English
+- Generate music from descriptions
+- Create videos from text or images
+
+The API follows OpenAI's standards, making it easy to integrate with existing applications and switch between different providers seamlessly.
+
+## Features
+
+### Image Generation
+- Multiple AI models: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
+- Flexible sizing and quality options
+- Streaming support for progressive image generation
+- Transparent backgrounds (PNG/WebP)
+- Style control (vivid vs. natural)
+
+### Audio Services
+- **Text-to-Speech**: Multiple voices, streaming support, various audio formats
+- **Speech-to-Text**: High-accuracy transcription with speaker diarization
+- **Translation**: Automatic translation to English from any language
+
+### Music Generation
+- Text-to-music generation
+- Style and genre control
+- Instrumental or vocal options
+- Async task-based processing
+
+### Video Generation (Planned)
+- Text-to-video generation
+- Image-to-video animation
+- Multiple resolution and FPS options
+
+## API Endpoints
+
+### Image Generation
+
+#### POST `/v1/images/generations`
+
+Generate images from text prompts.
+
+**Request Body:**
+```json
+{
+  "prompt": "A cute baby sea otter",
+  "model": "gpt-image-1",
+  "n": 1,
+  "size": "1024x1024",
+  "quality": "high",
+  "response_format": "b64_json"
+}
+```
+
+**Response:**
+```json
+{
+  "created": 1713833628,
+  "data": [
+    {
+      "b64_json": "iVBORw0KGgoAAAANSUhEUgA...",
+      "revised_prompt": "A cute baby sea otter floating on its back..."
+    }
+  ],
+  "usage": {
+    "total_tokens": 100,
+    "input_tokens": 50,
+    "output_tokens": 50
+  }
+}
+```
+
+**Supported Models:**
+- `dall-e-2`: Fast, economical, basic quality
+- `dall-e-3`: High quality, single image only
+- `gpt-image-1`: Latest model, supports streaming and advanced features
+- `midjourney`: Artistic style generation
+- `flux`: Fast generation
+- `stable-diffusion`: Open-source alternative
+
+### Image Editing
+
+#### POST `/v1/images/edits`
+
+Edit or extend images with AI assistance.
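As a rough illustration of the multipart form described in this endpoint's Request Body listing, the sketch below assembles the text and file parts in Python. The helper name `build_image_edit_form`, the `image/png` content type, and sending `stream` as the strings `"true"`/`"false"` are assumptions made for this example; the spec itself does not define them.

```python
from __future__ import annotations

from typing import Optional


def build_image_edit_form(
    images: list[tuple[str, bytes]],
    prompt: str,
    model: str = "gpt-image-1",
    mask: Optional[tuple[str, bytes]] = None,
    stream: bool = False,
) -> tuple[dict[str, str], list[tuple[str, tuple[str, bytes, str]]]]:
    """Return (text_fields, file_fields) for a multipart/form-data request.

    Hypothetical helper: `images` and `mask` are (filename, raw bytes) pairs.
    """
    if not images:
        raise ValueError("at least one image is required")

    # Text parts: multipart values travel as strings.
    text_fields = {
        "prompt": prompt,
        "model": model,
        "stream": "true" if stream else "false",  # assumed encoding
    }

    # File parts: the `image[]` field name repeats once per input image.
    # The PNG content type is an assumption for this sketch.
    file_fields = [
        ("image[]", (name, data, "image/png")) for name, data in images
    ]
    if mask is not None:
        mask_name, mask_data = mask
        file_fields.append(("mask", (mask_name, mask_data, "image/png")))

    return text_fields, file_fields
```

The two return values have the shape that the third-party `requests` library accepts as `data=` and `files=`, so a client could pass them to a POST against `/v1/images/edits`. The base URL and authentication header are left out here because they depend on the deployment.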
+
+**Request Body (multipart/form-data):**
+```
+image[]:
+image[]: (optional, multiple images)
+mask: (optional)
+prompt: "Add a festive red bow"
+model: "gpt-image-1"
+stream: true
+```
+
+**Streaming Response:**
+```
+event: image_edit.partial_image
+data: {"type":"image_edit.partial_image","b64_json":"...","partial_image_index":0}
+
+event: image_edit.completed
+data: {"type":"image_edit.completed","b64_json":"...","usage":{"total_tokens":100}}
+```
+
+### Image Variations
+
+#### POST `/v1/images/variations`
+
+Create variations of an existing image.
+
+**Request Body (multipart/form-data):**
+```
+image:
+model: "dall-e-2"
+n: 2
+size: "1024x1024"
+```
+
+**Response:**
+```json
+{
+  "created": 1589478378,
+  "data": [
+    {
+      "url": "https://..."
+    },
+    {
+      "url": "https://..."
+    }
+  ]
+}
+```
+
+### Text-to-Speech
+
+#### POST `/v1/audio/speech`
+
+Generate audio from text.
+
+**Request Body:**
+```json
+{
+  "model": "gpt-4o-mini-tts",
+  "input": "The quick brown fox jumped over the lazy dog.",
+  "voice": "alloy",
+  "response_format": "mp3",
+  "speed": 1.0
+}
+```
+
+**Response:**
+Binary audio data (application/octet-stream)
+
+**Available Voices:**
+- `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`
+
+**Supported Formats:**
+- `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
+
+### Speech-to-Text
+
+#### POST `/v1/audio/transcriptions`
+
+Transcribe audio to text with optional speaker identification.
+
+**Request Body (multipart/form-data):**
+```
+file: