diff --git a/MEDIA_API.md b/MEDIA_API.md
new file mode 100644
index 0000000..3467114
--- /dev/null
+++ b/MEDIA_API.md
@@ -0,0 +1,954 @@
+# Media Generation API Specification
+
+OpenAI-compatible API specification for generating images, music, video and more using various AI providers.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Features](#features)
+- [API Endpoints](#api-endpoints)
+  - [Image Generation](#image-generation)
+  - [Image Editing](#image-editing)
+  - [Image Variations](#image-variations)
+  - [Text-to-Speech](#text-to-speech)
+  - [Speech-to-Text](#speech-to-text)
+  - [Audio Translation](#audio-translation)
+  - [Music Generation](#music-generation)
+  - [Video Generation](#video-generation)
+- [Provider Support](#provider-support)
+- [Authentication](#authentication)
+- [Usage Examples](#usage-examples)
+- [Implementation Guide](#implementation-guide)
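+- [Error Handling](#error-handling)
+- [Rate Limiting](#rate-limiting)
+- [Monitoring and Observability](#monitoring-and-observability)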
+
+## Overview
+
+This API specification provides a unified, OpenAI-compatible interface for various media generation services. It enables developers to:
+
+- Generate images from text prompts
+- Edit and create variations of existing images
+- Convert text to speech
+- Transcribe audio to text, optionally with speaker diarization
+- Translate audio to English
+- Generate music from descriptions
+- Create videos from text or images
+
+The API follows OpenAI's request and response conventions, so existing OpenAI client code can target it by changing the base URL, and requests are routed to a provider based on the `model` parameter.
+
+## Features
+
+### Image Generation
+- Multiple AI models: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
+- Flexible sizing and quality options
+- Streaming support for progressive image generation
+- Transparent backgrounds (PNG/WebP)
+- Style control (vivid vs. natural)
+
+### Audio Services
+- **Text-to-Speech**: Multiple voices, streaming support, various audio formats
+- **Speech-to-Text**: High-accuracy transcription with speaker diarization
+- **Translation**: Automatic translation to English from any supported source language
+
+### Music Generation
+- Text-to-music generation
+- Style and genre control
+- Instrumental or vocal options
+- Async task-based processing
+
+### Video Generation (Planned)
+- Text-to-video generation
+- Image-to-video animation
+- Multiple resolution and FPS options
+
+## API Endpoints
+
+### Image Generation
+
+#### POST `/v1/images/generations`
+
+Generate images from text prompts.
+
+**Request Body:**
+```json
+{
+  "prompt": "A cute baby sea otter",
+  "model": "gpt-image-1",
+  "n": 1,
+  "size": "1024x1024",
+  "quality": "high",
+  "response_format": "b64_json"
+}
+```
+
+**Response:**
+```json
+{
+  "created": 1713833628,
+  "data": [
+    {
+      "b64_json": "iVBORw0KGgoAAAANSUhEUgA...",
+      "revised_prompt": "A cute baby sea otter floating on its back..."
+    }
+  ],
+  "usage": {
+    "total_tokens": 100,
+    "input_tokens": 50,
+    "output_tokens": 50
+  }
+}
+```
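+
+A minimal sketch of how a client might persist the first image from this response. The helper name `save_first_image` and the output path are illustrative assumptions, not part of the specification; only the `data[0].b64_json` field comes from the response shown above.
+
+```python
+import base64
+from pathlib import Path
+
+
+def save_first_image(response: dict, path: str) -> int:
+    """Decode data[0].b64_json from a generations response and write it to path.
+
+    Returns the number of bytes written. (Hypothetical helper for illustration.)
+    """
+    b64_payload = response["data"][0]["b64_json"]
+    image_bytes = base64.b64decode(b64_payload)
+    Path(path).write_bytes(image_bytes)
+    return len(image_bytes)
+```
+
+Called as `save_first_image(response_json, "otter.png")`, this writes the decoded bytes to disk; a response that uses `url` instead of `b64_json` would need a download step instead.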
+
+**Supported Models:**
+- `dall-e-2`: Fast, economical, basic quality
+- `dall-e-3`: High quality, single image only
+- `gpt-image-1`: Latest model, supports streaming and advanced features
+- `midjourney`: Artistic style generation
+- `flux`: Fast generation
+- `stable-diffusion`: Open-source alternative
+
+### Image Editing
+
+#### POST `/v1/images/edits`
+
+Edit or extend images with AI assistance.
+
+**Request Body (multipart/form-data):**
+```
+image[]: <binary file data>
+image[]: <binary file data> (optional, multiple images)
+mask: <binary file data> (optional)
+prompt: "Add a festive red bow"
+model: "gpt-image-1"
+stream: true
+```
+
+**Streaming Response:**
+```
+event: image_edit.partial_image
+data: {"type":"image_edit.partial_image","b64_json":"...","partial_image_index":0}
+
+event: image_edit.completed
+data: {"type":"image_edit.completed","b64_json":"...","usage":{"total_tokens":100}}
+```
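+
+The stream above uses the server-sent events framing: an `event:` line, a `data:` line, and a blank line between events. The following is a rough sketch of a client-side parser for that framing; `iter_sse_events` is a hypothetical name, and the parser is deliberately simplified (it ignores `id:` and `retry:` fields and comment lines).
+
+```python
+import json
+from typing import Iterable, Iterator
+
+
+def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
+    """Yield (event_name, payload) pairs, one per event block in the stream."""
+    event_name = None
+    data_parts: list[str] = []
+    for raw in lines:
+        line = raw.rstrip("\r\n")
+        if line.startswith("event:"):
+            event_name = line[len("event:"):].strip()
+        elif line.startswith("data:"):
+            # Simplification: trims surrounding whitespace from the data value
+            data_parts.append(line[len("data:"):].strip())
+        elif line == "" and data_parts:
+            # A blank line terminates the current event
+            yield event_name or "message", json.loads("\n".join(data_parts))
+            event_name, data_parts = None, []
+    if data_parts:
+        # Flush a final event that was not followed by a blank line
+        yield event_name or "message", json.loads("\n".join(data_parts))
+```
+
+A caller would typically render each `image_edit.partial_image` payload as a preview and treat `image_edit.completed` as the final result.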
+
+### Image Variations
+
+#### POST `/v1/images/variations`
+
+Create variations of an existing image.
+
+**Request Body (multipart/form-data):**
+```
+image: <binary PNG file>
+model: "dall-e-2"
+n: 2
+size: "1024x1024"
+```
+
+**Response:**
+```json
+{
+  "created": 1589478378,
+  "data": [
+    {
+      "url": "https://..."
+    },
+    {
+      "url": "https://..."
+    }
+  ]
+}
+```
+
+### Text-to-Speech
+
+#### POST `/v1/audio/speech`
+
+Generate audio from text.
+
+**Request Body:**
+```json
+{
+  "model": "gpt-4o-mini-tts",
+  "input": "The quick brown fox jumped over the lazy dog.",
+  "voice": "alloy",
+  "response_format": "mp3",
+  "speed": 1.0
+}
+```
+
+**Response:**
+Binary audio data (application/octet-stream)
+
+**Available Voices:**
+- `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`
+
+**Supported Formats:**
+- `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
+
+### Speech-to-Text
+
+#### POST `/v1/audio/transcriptions`
+
+Transcribe audio to text with optional speaker identification.
+
+**Request Body (multipart/form-data):**
+```
+file: <audio file>
+model: "gpt-4o-transcribe"
+language: "en"
+response_format: "verbose_json"
+timestamp_granularities[]: "word"
+timestamp_granularities[]: "segment"
+```
+
+**Response (verbose_json):**
+```json
+{
+  "task": "transcribe",
+  "language": "en",
+  "duration": 27.4,
+  "text": "Imagine the wildest idea...",
+  "words": [
+    {
+      "word": "Imagine",
+      "start": 0.0,
+      "end": 0.5
+    }
+  ],
+  "segments": [
+    {
+      "id": 0,
+      "start": 0.0,
+      "end": 5.0,
+      "text": "Imagine the wildest idea..."
+    }
+  ],
+  "usage": {
+    "type": "tokens",
+    "input_tokens": 14,
+    "output_tokens": 45,
+    "total_tokens": 59
+  }
+}
+```
+
+**Speaker Diarization:**
+
+For multi-speaker transcription, use `gpt-4o-transcribe-diarize` model:
+
+```
+file: <audio file>
+model: "gpt-4o-transcribe-diarize"
+response_format: "diarized_json"
+chunking_strategy: "auto"
+known_speaker_names[]: "agent"
+known_speaker_references[]: "data:audio/wav;base64,AAA..."
+```
+
+**Diarized Response:**
+```json
+{
+  "task": "transcribe",
+  "duration": 27.4,
+  "text": "Agent: Thanks for calling...\nA: Hi, I'm trying...",
+  "segments": [
+    {
+      "type": "transcript.text.segment",
+      "id": "seg_001",
+      "start": 0.0,
+      "end": 4.7,
+      "text": "Thanks for calling OpenAI support.",
+      "speaker": "agent"
+    },
+    {
+      "type": "transcript.text.segment",
+      "id": "seg_002",
+      "start": 4.7,
+      "end": 11.8,
+      "text": "Hi, I'm trying to enable diarization.",
+      "speaker": "A"
+    }
+  ]
+}
+```
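+
+As an illustration of consuming the diarized response, the sketch below totals speaking time per speaker and renders a readable transcript. Both function names are hypothetical; the code relies only on the `segments`, `speaker`, `start`, `end`, and `text` fields shown above.
+
+```python
+from collections import defaultdict
+
+
+def speaking_time_by_speaker(diarized: dict) -> dict[str, float]:
+    """Sum segment durations (end - start) per speaker label."""
+    totals: dict[str, float] = defaultdict(float)
+    for segment in diarized["segments"]:
+        totals[segment["speaker"]] += segment["end"] - segment["start"]
+    return dict(totals)
+
+
+def format_transcript(diarized: dict) -> str:
+    """Render one 'speaker: text' line per segment, in segment order."""
+    return "\n".join(f'{s["speaker"]}: {s["text"]}' for s in diarized["segments"])
+```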
+
+### Audio Translation
+
+#### POST `/v1/audio/translations`
+
+Translate audio from any supported source language to English.
+
+**Request Body (multipart/form-data):**
+```
+file: <audio file>
+model: "whisper-1"
+response_format: "json"
+```
+
+**Response:**
+```json
+{
+  "text": "Hello, how are you?"
+}
+```
+
+### Music Generation
+
+#### POST `/v1/music/generations`
+
+Generate music from text descriptions (async).
+
+**Request Body:**
+```json
+{
+  "prompt": "An upbeat electronic dance track with heavy bass",
+  "model": "suno-v3.5",
+  "duration": 120,
+  "style": "electronic",
+  "instrumental": false
+}
+```
+
+**Response (202 Accepted):**
+```json
+{
+  "task_id": "music_abc123",
+  "status": "pending",
+  "estimated_completion_time": 30
+}
+```
+
+#### GET `/v1/music/generations/{task_id}`
+
+Check music generation status.
+
+**Response:**
+```json
+{
+  "task_id": "music_abc123",
+  "status": "completed",
+  "progress": 100,
+  "result": {
+    "url": "https://storage.example.com/music_abc123.mp3",
+    "duration": 120.5,
+    "format": "mp3"
+  }
+}
+```
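+
+Because generation is asynchronous, a client has to poll the status endpoint until the task finishes. The following is one possible polling loop, not a prescribed client. `wait_for_task` and its parameters are assumptions made for illustration, and `fetch_status` stands in for whatever function performs the `GET /v1/music/generations/{task_id}` request and returns the parsed JSON.
+
+```python
+import time
+from typing import Callable
+
+
+def wait_for_task(
+    fetch_status: Callable[[str], dict],
+    task_id: str,
+    poll_interval: float = 2.0,
+    timeout: float = 300.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> dict:
+    """Poll until the task reports 'completed' or 'failed', or the timeout elapses."""
+    deadline = time.monotonic() + timeout
+    while True:
+        status = fetch_status(task_id)
+        if status["status"] == "completed":
+            return status["result"]
+        if status["status"] == "failed":
+            raise RuntimeError(f"Task {task_id} failed: {status.get('error')}")
+        if time.monotonic() >= deadline:
+            raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")
+        sleep(poll_interval)
+```
+
+Passing `sleep` as a parameter keeps the loop testable without real delays; a production client might also use `estimated_completion_time` from the 202 response to choose the first wait.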
+
+### Video Generation
+
+#### POST `/v1/video/generations`
+
+Generate video from text or images (async, planned feature).
+
+**Request Body:**
+```json
+{
+  "prompt": "A serene sunset over a mountain landscape",
+  "model": "runway-gen3",
+  "duration": 5,
+  "resolution": "1280x720",
+  "fps": 30
+}
+```
+
+**Response (202 Accepted):**
+```json
+{
+  "task_id": "video_xyz789",
+  "status": "pending",
+  "estimated_completion_time": 120
+}
+```
+
+## Provider Support
+
+The API is designed to work with multiple providers through a unified interface:
+
+### Image Providers
+
+| Provider | Models | Features |
+|----------|--------|----------|
+| OpenAI | DALL-E 2, DALL-E 3, GPT-Image-1 | Generation, editing, variations |
+| Midjourney | midjourney | High-quality artistic generation |
+| Flux | flux | Fast generation |
+| Stable Diffusion | stable-diffusion | Open-source, customizable |
+
+### Audio Providers
+
+| Provider | Models | Features |
+|----------|--------|----------|
+| OpenAI | TTS-1, TTS-1-HD, GPT-4o-mini-TTS | Text-to-speech |
+| OpenAI | Whisper-1, GPT-4o-transcribe, GPT-4o-transcribe-diarize | Speech-to-text, diarization |
+
+### Music Providers
+
+| Provider | Models | Features |
+|----------|--------|----------|
+| Suno AI | suno-v3, suno-v3.5 | Music generation |
+
+### Video Providers (Planned)
+
+| Provider | Models | Features |
+|----------|--------|----------|
+| RunwayML | runway-gen3 | Text-to-video |
+| Stability AI | stability-video | Video generation |
+
+## Authentication
+
+The API supports two authentication methods:
+
+### API Key Authentication
+
+Include your API key in the request header:
+
+```bash
+curl -X POST https://api.example.com/v1/images/generations \
+  -H "X-API-Key: your-api-key" \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "A sunset"}'
+```
+
+### Bearer Token Authentication
+
+Use a bearer token for JWT-based authentication:
+
+```bash
+curl -X POST https://api.example.com/v1/images/generations \
+  -H "Authorization: Bearer your-token" \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "A sunset"}'
+```
+
+## Usage Examples
+
+### Python
+
+```python
+from openai import OpenAI
+
+# Initialize client
+client = OpenAI(
+    api_key="your-api-key",
+    base_url="https://api.example.com/v1"
+)
+
+# Generate an image
+response = client.images.generate(
+    prompt="A futuristic cityscape at sunset",
+    model="gpt-image-1",
+    size="1536x1024",
+    quality="high",
+    n=1
+)
+
+print(response.data[0].b64_json[:40])  # gpt-image-1 responses carry base64 data in b64_json, not a URL
+
+# Transcribe audio
+with open("audio.mp3", "rb") as audio_file:
+    transcription = client.audio.transcriptions.create(
+        file=audio_file,
+        model="gpt-4o-transcribe",
+        response_format="verbose_json",
+        timestamp_granularities=["word", "segment"]
+    )
+    print(transcription.text)
+
+# Generate speech
+response = client.audio.speech.create(
+    model="gpt-4o-mini-tts",
+    voice="alloy",
+    input="Hello, this is a test."
+)
+
+response.write_to_file("output.mp3")  # stream_to_file is deprecated on non-streaming responses
+```
+
+### Node.js
+
+```javascript
+import OpenAI from 'openai';
+import fs from 'fs';
+
+const client = new OpenAI({
+  apiKey: 'your-api-key',
+  baseURL: 'https://api.example.com/v1'
+});
+
+// Generate an image
+const imageResponse = await client.images.generate({
+  prompt: 'A futuristic cityscape at sunset',
+  model: 'gpt-image-1',
+  size: '1536x1024',
+  quality: 'high',
+  n: 1
+});
+
+console.log(imageResponse.data[0].b64_json.slice(0, 40)); // gpt-image-1 responses carry base64 data, not a URL
+
+// Transcribe audio
+const transcription = await client.audio.transcriptions.create({
+  file: fs.createReadStream('audio.mp3'),
+  model: 'gpt-4o-transcribe',
+  response_format: 'verbose_json',
+  timestamp_granularities: ['word', 'segment']
+});
+
+console.log(transcription.text);
+
+// Generate speech
+const mp3 = await client.audio.speech.create({
+  model: 'gpt-4o-mini-tts',
+  voice: 'alloy',
+  input: 'Hello, this is a test.'
+});
+
+const buffer = Buffer.from(await mp3.arrayBuffer());
+await fs.promises.writeFile('output.mp3', buffer);
+```
+
+### cURL
+
+```bash
+# Generate an image
+curl https://api.example.com/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $API_KEY" \
+  -d '{
+    "model": "gpt-image-1",
+    "prompt": "A cute baby sea otter",
+    "n": 1,
+    "size": "1024x1024"
+  }'
+
+# Transcribe audio
+curl https://api.example.com/v1/audio/transcriptions \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@audio.mp3" \
+  -F model="gpt-4o-transcribe"
+
+# Generate speech
+curl https://api.example.com/v1/audio/speech \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gpt-4o-mini-tts",
+    "input": "The quick brown fox jumped over the lazy dog.",
+    "voice": "alloy"
+  }' \
+  --output speech.mp3
+```
+
+## Implementation Guide
+
+### For API Gateway Integration
+
+To implement this specification in your API Gateway:
+
+#### 1. Provider Abstraction Layer
+
+Create a provider abstraction that maps unified requests to provider-specific APIs:
+
+```javascript
+// providers/imageProvider.js
+class ImageProvider {
+  async generateImage(params) {
+    switch (params.model) {
+      case 'dall-e-2':
+      case 'dall-e-3':
+      case 'gpt-image-1':
+        return this.openaiProvider.generate(params);
+      case 'midjourney':
+        return this.midjourneyProvider.generate(params);
+      case 'flux':
+        return this.fluxProvider.generate(params);
+      case 'stable-diffusion':
+        return this.stabilityProvider.generate(params);
+      default:
+        throw new Error(`Unsupported model: ${params.model}`);
+    }
+  }
+
+  // Normalize provider-specific responses to unified format
+  normalizeResponse(providerResponse, provider) {
+    // Convert provider response to OpenAI-compatible format
+    return {
+      created: Math.floor(Date.now() / 1000),
+      data: this.normalizeImages(providerResponse, provider),
+      usage: this.calculateUsage(providerResponse)
+    };
+  }
+}
+```
+
+#### 2. Request Routing
+
+Route requests based on the model parameter:
+
+```javascript
+// routes/images.js
+router.post('/v1/images/generations', async (req, res) => {
+  try {
+    const params = req.body;
+
+    // Validate request
+    validateImageRequest(params);
+
+    // Route to appropriate provider
+    const provider = getProviderForModel(params.model);
+    const result = await provider.generateImage(params);
+
+    // Return normalized response
+    res.json(result);
+  } catch (error) {
+    handleError(error, res);
+  }
+});
+```
+
+#### 3. Streaming Support
+
+Implement streaming for progressive image generation:
+
+```javascript
+async function streamImageGeneration(params, res) {
+  res.setHeader('Content-Type', 'text/event-stream');
+  res.setHeader('Cache-Control', 'no-cache');
+  res.setHeader('Connection', 'keep-alive');
+
+  const provider = getProviderForModel(params.model);
+
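+  let lastChunk = null;
+
+  for await (const chunk of provider.generateImageStream(params)) {
+    const event = {
+      type: 'image_generation.partial_image',
+      b64_json: chunk.partialImage,
+      partial_image_index: chunk.index
+    };
+
+    res.write(`event: image_generation.partial_image\n`);
+    res.write(`data: ${JSON.stringify(event)}\n\n`);
+    lastChunk = chunk;
+  }
+
+  // Assumption: the last chunk from the provider carries the finished image
+  const finalEvent = {
+    type: 'image_generation.completed',
+    b64_json: lastChunk?.partialImage ?? null
+  };
+
+  res.write(`event: image_generation.completed\n`);
+  res.write(`data: ${JSON.stringify(finalEvent)}\n\n`);
+  res.end();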
+}
+```
+
+#### 4. Async Task Management
+
+For music and video generation, implement task-based processing:
+
+```javascript
+// services/taskManager.js
+class TaskManager {
+  constructor() {
+    this.tasks = new Map();
+  }
+
+  async createTask(type, params) {
+    const taskId = generateTaskId();
+
+    const task = {
+      id: taskId,
+      type: type,
+      status: 'pending',
+      progress: 0,
+      params: params,
+      createdAt: new Date()
+    };
+
+    this.tasks.set(taskId, task);
+
+    // Start async processing
+    this.processTask(taskId);
+
+    return {
+      task_id: taskId,
+      status: 'pending',
+      estimated_completion_time: this.estimateTime(type, params)
+    };
+  }
+
+  async processTask(taskId) {
+    const task = this.tasks.get(taskId);
+    task.status = 'processing';
+
+    try {
+      const provider = this.getProviderForTask(task);
+
+      // Process with progress updates
+      const result = await provider.generate(task.params, (progress) => {
+        task.progress = progress;
+      });
+
+      task.status = 'completed';
+      task.result = result;
+    } catch (error) {
+      task.status = 'failed';
+      task.error = error.message;
+    }
+  }
+
+  getTaskStatus(taskId) {
+    const task = this.tasks.get(taskId);
+    if (!task) {
+      throw new Error('Task not found');
+    }
+
+    return {
+      task_id: task.id,
+      status: task.status,
+      progress: task.progress,
+      result: task.result || null,
+      error: task.error || null
+    };
+  }
+}
+```
+
+#### 5. Failover Strategy
+
+Implement automatic failover between providers:
+
+```javascript
+class ProviderFailover {
+  constructor(providers) {
+    this.providers = providers; // Ordered list of providers
+  }
+
+  async executeWithFailover(operation, params) {
+    let lastError;
+
+    for (const provider of this.providers) {
+      try {
+        console.log(`Trying provider: ${provider.name}`);
+        const result = await provider[operation](params);
+        return result;
+      } catch (error) {
+        console.error(`Provider ${provider.name} failed:`, error);
+        lastError = error;
+
+        // Continue to next provider
+        continue;
+      }
+    }
+
+    // All providers failed
+    throw new Error(`All providers failed. Last error: ${lastError?.message ?? 'no providers configured'}`);
+  }
+}
+
+// Usage
+const imageProviders = [
+  new OpenAIProvider(),
+  new MidjourneyProvider(),
+  new FluxProvider()
+];
+
+const failover = new ProviderFailover(imageProviders);
+const result = await failover.executeWithFailover('generateImage', params);
+```
+
+### Database Schema
+
+For tracking tasks and usage:
+
+```sql
+CREATE TABLE media_tasks (
+  id VARCHAR(255) PRIMARY KEY,
+  user_id VARCHAR(255) NOT NULL,
+  type VARCHAR(50) NOT NULL, -- 'image', 'audio', 'music', 'video'
+  operation VARCHAR(50) NOT NULL, -- 'generate', 'edit', 'transcribe', etc.
+  status VARCHAR(50) NOT NULL, -- 'pending', 'processing', 'completed', 'failed'
+  progress INTEGER DEFAULT 0,
+  params JSONB NOT NULL,
+  result JSONB,
+  error TEXT,
+  provider VARCHAR(100),
+  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+  completed_at TIMESTAMP
+);
+
+CREATE INDEX idx_tasks_user_id ON media_tasks(user_id);
+CREATE INDEX idx_tasks_status ON media_tasks(status);
+CREATE INDEX idx_tasks_created_at ON media_tasks(created_at);
+
+CREATE TABLE media_usage (
+  id SERIAL PRIMARY KEY,
+  user_id VARCHAR(255) NOT NULL,
+  task_id VARCHAR(255),
+  endpoint VARCHAR(255) NOT NULL,
+  model VARCHAR(100),
+  provider VARCHAR(100),
+  tokens_used INTEGER,
+  duration_seconds NUMERIC,
+  cost NUMERIC(10, 6),
+  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_usage_user_id ON media_usage(user_id);
+CREATE INDEX idx_usage_created_at ON media_usage(created_at);
+```
+
+## Error Handling
+
+The API uses standard HTTP status codes and returns errors in a consistent format:
+
+### Error Response Format
+
+```json
+{
+  "error": {
+    "message": "Invalid parameter: prompt is required",
+    "type": "invalid_request_error",
+    "code": "invalid_parameter",
+    "param": "prompt"
+  }
+}
+```
+
+### Common Error Codes
+
+| Status Code | Error Type | Description |
+|-------------|------------|-------------|
+| 400 | invalid_request_error | Invalid parameters or malformed request |
+| 401 | authentication_error | Invalid or missing API key |
+| 403 | permission_error | Insufficient permissions |
+| 404 | not_found_error | Resource not found |
+| 429 | rate_limit_error | Too many requests |
+| 500 | server_error | Internal server error |
+| 503 | service_unavailable | Service temporarily unavailable |
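+
+To make the error format concrete, here is a small sketch that flattens an error response into a record a client could log or branch on. The names `describe_error` and `RETRYABLE_STATUS_CODES` are hypothetical, and the retryable set (429, 500, 503) is an assumption drawn from the retry guidance in this document rather than a rule the specification states.
+
+```python
+# Assumption: these statuses are treated as transient and worth retrying
+RETRYABLE_STATUS_CODES = {429, 500, 503}
+
+
+def describe_error(status_code: int, body: dict) -> dict:
+    """Flatten an error response into a single record for logging or retry logic."""
+    error = body.get("error", {})
+    return {
+        "status_code": status_code,
+        "type": error.get("type", "unknown_error"),
+        "code": error.get("code"),
+        "param": error.get("param"),
+        "message": error.get("message", ""),
+        "retryable": status_code in RETRYABLE_STATUS_CODES,
+    }
+```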
+
+### Error Handling Best Practices
+
+1. **Implement Retry Logic**: For transient errors (500, 503), implement exponential backoff
+2. **Validate Input**: Check parameters before sending requests
+3. **Handle Rate Limits**: Implement request queuing and throttling
+4. **Log Errors**: Track errors for debugging and monitoring
+5. **Provide Fallbacks**: Switch to alternative providers when primary fails
+
+### Example Error Handling
+
+```javascript
+async function generateImageWithRetry(params, maxRetries = 3) {
+  for (let attempt = 1; attempt <= maxRetries; attempt++) {
+    try {
+      return await client.images.generate(params);
+    } catch (error) {
+      // Retry only rate limits (429) and server errors (5xx)
+      const retryable = error.status === 429 || error.status >= 500;
+      if (!retryable || attempt === maxRetries) {
+        // Non-retryable error, or retries exhausted: surface the original error
+        throw error;
+      }
+
+      // Exponential backoff: 2s, 4s, 8s, ...
+      const waitTime = Math.pow(2, attempt) * 1000;
+      console.log(`Attempt ${attempt} failed, retrying in ${waitTime} ms...`);
+      await new Promise(resolve => setTimeout(resolve, waitTime));
+    }
+  }
+}
+```
+
+## Rate Limiting
+
+Implement rate limiting to prevent abuse:
+
+```javascript
+const rateLimit = require('express-rate-limit');
+
+const limiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // Limit each IP to 100 requests per windowMs
+  message: {
+    error: {
+      message: 'Too many requests, please try again later.',
+      type: 'rate_limit_error',
+      code: 'rate_limit_exceeded'
+    }
+  },
+  standardHeaders: true,
+  legacyHeaders: false,
+});
+
+// Apply to all media endpoints
+app.use('/v1/images/', limiter);
+app.use('/v1/audio/', limiter);
+app.use('/v1/music/', limiter);
+app.use('/v1/video/', limiter);
+```
+
+## Monitoring and Observability
+
+Track key metrics for your media API:
+
+### Key Metrics
+
+1. **Request Metrics**
+   - Total requests per endpoint
+   - Request success/failure rate
+   - Average response time
+   - P95/P99 latency
+
+2. **Provider Metrics**
+   - Provider success rate
+   - Provider failover frequency
+   - Provider response time
+
+3. **Resource Metrics**
+   - Token usage
+   - Cost per request
+   - Storage usage (for generated media)
+
+4. **Business Metrics**
+   - Active users
+   - Popular models
+   - Feature adoption
+
+### Logging Example
+
+```javascript
+function logMediaRequest(req, result, duration) {
+  logger.info('Media API Request', {
+    endpoint: req.path,
+    method: req.method,
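+    model: req.body?.model, // body may be absent for multipart requests
+    provider: result.provider,
+    duration_ms: duration,
+    status: 'success',
+    user_id: req.user?.id, // undefined if no auth middleware populated req.user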
+    tokens_used: result.usage?.total_tokens
+  });
+}
+```
+
+## Contributing
+
+Contributions to this specification are welcome! Please:
+
+1. Review the OpenAI API standards
+2. Ensure backwards compatibility
+3. Add tests for new features
+4. Update documentation
+
+## License
+
+This specification is released under the MIT License. See [LICENSE](LICENSE) for details.
+
+## Related Links
+
+- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
+- [OpenAPI Specification](https://www.openapis.org/)
+- [API Gateway Architecture](https://github.com/deep-assistant/api-gateway/blob/main/ARCHITECTURE.md)
+- [Telegram Bot Integration](https://github.com/deep-assistant/telegram-bot/blob/main/ARCHITECTURE.md)
diff --git a/README_MEDIA_API.md b/README_MEDIA_API.md
new file mode 100644
index 0000000..d86afdc
--- /dev/null
+++ b/README_MEDIA_API.md
@@ -0,0 +1,402 @@
+# Media Generation API for Deep Assistant
+
+**OpenAI-compatible API specification for images, music, video and more**
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![OpenAPI 3.1](https://img.shields.io/badge/OpenAPI-3.1-green.svg)](https://www.openapis.org/)
+
+## Quick Start
+
+This repository contains a comprehensive OpenAI-compatible API specification for media generation services, including images, audio (speech and transcription), music, and video.
+
+### Files
+
+- **[media-api-spec.yaml](media-api-spec.yaml)** - Complete OpenAPI 3.1 specification
+- **[MEDIA_API.md](MEDIA_API.md)** - Full documentation with examples and implementation guide
+
+## What's Inside
+
+### Supported Media Types
+
+#### 🖼️ Images
+- **Generation**: Create images from text prompts
+- **Editing**: Modify existing images with AI
+- **Variations**: Generate similar versions of images
+
+**Supported Providers**: DALL-E 2/3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
+
+#### 🎵 Audio
+- **Text-to-Speech**: Convert text to natural-sounding audio
+- **Speech-to-Text**: Transcribe audio with speaker identification
+- **Translation**: Translate audio to English
+
+**Supported Providers**: OpenAI (TTS-1, Whisper, GPT-4o models)
+
+#### 🎼 Music
+- **Music Generation**: Create original music from text descriptions
+- **Style Control**: Specify genre, mood, and instrumentation
+
+**Supported Providers**: Suno AI
+
+#### 🎬 Video (Planned)
+- **Text-to-Video**: Generate videos from descriptions
+- **Image-to-Video**: Animate still images
+
+**Planned Providers**: RunwayML, Stability AI
+
+## API Overview
+
+### Base URL
+```
+https://api.example.com/v1
+```
+
+### Authentication
+```bash
+# API Key
+curl -H "X-API-Key: your-api-key" ...
+
+# Bearer Token
+curl -H "Authorization: Bearer your-token" ...
+```
+
+### Quick Examples
+
+#### Generate an Image
+```bash
+curl https://api.example.com/v1/images/generations \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "A serene mountain landscape at sunset",
+    "model": "gpt-image-1",
+    "size": "1536x1024",
+    "quality": "high"
+  }'
+```
+
+#### Transcribe Audio
+```bash
+curl https://api.example.com/v1/audio/transcriptions \
+  -H "Authorization: Bearer $API_KEY" \
+  -F file="@meeting.mp3" \
+  -F model="gpt-4o-transcribe-diarize" \
+  -F response_format="diarized_json"
+```
+
+#### Generate Speech
+```bash
+curl https://api.example.com/v1/audio/speech \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gpt-4o-mini-tts",
+    "input": "Welcome to our service!",
+    "voice": "alloy"
+  }' \
+  --output welcome.mp3
+```
+
+#### Create Music
+```bash
+curl https://api.example.com/v1/music/generations \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "Upbeat electronic dance music with heavy bass",
+    "duration": 120,
+    "style": "electronic"
+  }'
+```
+
+## Key Features
+
+### 🔄 Provider Abstraction
+Switch between different AI providers seamlessly using a unified interface. The API automatically handles provider-specific quirks and normalizes responses.
+
+### 📡 Streaming Support
+Progressive generation for images and audio, allowing you to show users partial results as they're created.
+
+### 🎯 Async Task Management
+Long-running operations (music, video) use task-based processing with status polling.
+
+### ⚡ Auto Failover
+Automatic fallback to alternative providers when the primary provider fails.
+
+### 📊 Usage Tracking
+Built-in token and cost tracking for all operations.
+
+## Integration
+
+### For Developers
+
+Use this specification to:
+1. **Build client SDKs** compatible with OpenAI libraries
+2. **Integrate into existing applications** that use OpenAI APIs
+3. **Switch providers** without changing application code
+
+### For API Providers
+
+Implement this specification to:
+1. **Offer OpenAI-compatible endpoints** for easy adoption
+2. **Support multiple AI providers** with unified routing
+3. **Enable seamless migration** for OpenAI users
+
+### For the Deep Assistant Ecosystem
+
+This specification is designed to integrate with:
+- **[API Gateway](https://github.com/deep-assistant/api-gateway)** - Central routing and failover
+- **[Telegram Bot](https://github.com/deep-assistant/telegram-bot)** - User-facing media generation
+- **[GPTutor](https://github.com/deep-assistant/GPTutor)** - Educational platform with image generation
+- **[Web Capture](https://github.com/deep-assistant/web-capture)** - Content capture with media processing
+
+## Documentation
+
+### Full Documentation
+See **[MEDIA_API.md](MEDIA_API.md)** for complete documentation including:
+- Detailed endpoint specifications
+- Request/response examples in multiple languages
+- Implementation guide for API Gateway
+- Error handling and best practices
+- Provider integration patterns
+- Database schemas
+- Monitoring and observability
+
+### OpenAPI Specification
+See **[media-api-spec.yaml](media-api-spec.yaml)** for the machine-readable API specification.
+
+You can:
+- Generate client SDKs using [OpenAPI Generator](https://openapi-generator.tech/)
+- Import into API testing tools like [Postman](https://www.postman.com/) or [Insomnia](https://insomnia.rest/)
+- Generate documentation with [Swagger UI](https://swagger.io/tools/swagger-ui/)
+- Validate requests/responses automatically
+
+## Usage with OpenAI SDKs
+
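+The image and audio endpoints are compatible with the official OpenAI SDKs; just change the base URL. The music and video endpoints are not covered by those SDKs and need direct HTTP calls.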
+
+### Python
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    api_key="your-api-key",
+    base_url="https://api.example.com/v1"
+)
+
+# Use as normal
+response = client.images.generate(
+    prompt="A beautiful sunset",
+    model="gpt-image-1"
+)
+```
+
+### Node.js
+```javascript
+import OpenAI from 'openai';
+
+const client = new OpenAI({
+  apiKey: 'your-api-key',
+  baseURL: 'https://api.example.com/v1'
+});
+
+// Use as normal
+const response = await client.images.generate({
+  prompt: 'A beautiful sunset',
+  model: 'gpt-image-1'
+});
+```
+
+## API Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/v1/images/generations` | POST | Generate images from text |
+| `/v1/images/edits` | POST | Edit/extend existing images |
+| `/v1/images/variations` | POST | Create image variations |
+| `/v1/audio/speech` | POST | Text-to-speech generation |
+| `/v1/audio/transcriptions` | POST | Speech-to-text transcription |
+| `/v1/audio/translations` | POST | Translate audio to English |
+| `/v1/music/generations` | POST | Generate music (async) |
+| `/v1/music/generations/{task_id}` | GET | Get music generation status |
+| `/v1/video/generations` | POST | Generate video (async, planned) |
+
+## Provider Comparison
+
+### Image Generation
+
+| Feature | DALL-E 2 | DALL-E 3 | GPT-Image-1 | Midjourney | Flux |
+|---------|----------|----------|-------------|------------|------|
+| Max Resolution | 1024x1024 | 1792x1024 | 1536x1024 | 2048x2048 | 1024x1024 |
+| Multiple Images | ✅ (1-10) | ❌ (1 only) | ✅ (1-10) | ✅ | ✅ |
+| Streaming | ❌ | ❌ | ✅ | ❌ | ✅ |
+| Transparent BG | ❌ | ❌ | ✅ | ❌ | ❌ |
+| Editing | ✅ | ❌ | ✅ | ✅ | ❌ |
+| Speed | Fast | Medium | Medium | Slow | Very Fast |
+| Cost | $ | $$ | $$$ | $$ | $ |
+
+### Audio Models
+
+| Feature | TTS-1 | TTS-1-HD | GPT-4o-mini-TTS | Whisper-1 | GPT-4o-Transcribe |
+|---------|-------|----------|------------|-----------|-------------------|
+| Quality | Standard | High | Very High | Good | Excellent |
+| Voices | 6 | 6 | 11 | N/A | N/A |
+| Streaming | ✅ | ✅ | ✅ | N/A | ✅ |
+| Diarization | N/A | N/A | N/A | ❌ | ✅ (`gpt-4o-transcribe-diarize`) |
+| Speed | Very Fast | Fast | Medium | Fast | Medium |
+| Cost | $ | $$ | $$$ | $ | $$$ |
+
+## Implementation Status
+
+| Feature | Status | Notes |
+|---------|--------|-------|
+| Image Generation | ✅ Ready | Full OpenAI compatibility |
+| Image Editing | ✅ Ready | Supports GPT-Image-1, DALL-E 2 |
+| Image Variations | ✅ Ready | DALL-E 2 only |
+| Text-to-Speech | ✅ Ready | Multiple voices and formats |
+| Speech-to-Text | ✅ Ready | With diarization support |
+| Audio Translation | ✅ Ready | Whisper-based |
+| Music Generation | ✅ Ready | Async task-based |
+| Video Generation | 🚧 Planned | Specification ready |
+
+## Roadmap
+
+### Phase 1: Core Media APIs ✅
+- [x] OpenAPI 3.1 specification
+- [x] Images (generation, editing, variations)
+- [x] Audio (speech, transcription, translation)
+- [x] Music generation
+- [x] Video generation specification
+
+### Phase 2: API Gateway Integration 🚧
+- [ ] Implement provider abstraction layer
+- [ ] Add failover logic
+- [ ] Implement streaming endpoints
+- [ ] Add task management for async operations
+- [ ] Set up monitoring and logging
+
+### Phase 3: Provider Integrations 🚧
+- [ ] OpenAI (DALL-E, Whisper, TTS)
+- [ ] Midjourney API integration
+- [ ] Flux API integration
+- [ ] Suno AI integration
+- [ ] RunwayML integration
+
+### Phase 4: Advanced Features 📋
+- [ ] Batch processing
+- [ ] Webhook notifications for async tasks
+- [ ] Advanced caching
+- [ ] CDN integration for media delivery
+- [ ] Cost optimization strategies
+
+### Phase 5: Client Libraries 📋
+- [ ] Python SDK
+- [ ] Node.js SDK
+- [ ] Go SDK
+- [ ] Ruby SDK
+
+## Architecture Integration
+
+This media API specification fits into the Deep Assistant architecture:
+
+```
+┌─────────────────┐
+│  Client Apps    │
+│ (Bot, Web, App) │
+└────────┬────────┘
+         │
+         ↓
+┌─────────────────┐
+│  API Gateway    │◄─── This Specification
+│  (Media Router) │
+└────────┬────────┘
+         │
+     ┌───┴────────┬────────────┬────────────┐
+     ↓            ↓            ↓            ↓
+┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
+│  OpenAI  │ │Midjourney│ │ Suno AI  │ │ RunwayML │
+│ Provider │ │ Provider │ │ Provider │ │ Provider │
+└──────────┘ └──────────┘ └──────────┘ └──────────┘
+```
+
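The routing step in the diagram can be sketched as a small registry that maps each model name to a provider handler. This is an illustrative Python sketch, not the gateway's actual design: `MediaRouter`, its methods, the stub handlers, and the `suno-music` model identifier are all hypothetical.

```python
from typing import Callable, Dict, List

# A handler takes the decoded request body and returns a response body.
Handler = Callable[[dict], dict]


class MediaRouter:
    """Hypothetical router: picks a provider handler from the request's model."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._model_to_provider: Dict[str, str] = {}

    def register(self, provider: str, handler: Handler, models: List[str]) -> None:
        """Attach a provider handler and the model names it serves."""
        self._handlers[provider] = handler
        for model in models:
            self._model_to_provider[model] = provider

    def dispatch(self, request: dict) -> dict:
        """Forward the request to the provider registered for its model."""
        model = request.get("model")
        provider = self._model_to_provider.get(model)
        if provider is None:
            raise ValueError(f"no provider registered for model {model!r}")
        return self._handlers[provider](request)


if __name__ == "__main__":
    router = MediaRouter()
    # Stub handlers stand in for real provider clients.
    router.register("openai", lambda req: {"provider": "openai"}, ["dall-e-3", "tts-1"])
    router.register("suno", lambda req: {"provider": "suno"}, ["suno-music"])
    print(router.dispatch({"model": "tts-1", "input": "Hello", "voice": "alloy"}))
```

Keeping the model-to-provider mapping in one place is also where failover could later be added, for example by registering an ordered list of providers per model.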
+## Testing
+
+### Validate OpenAPI Specification
+
+```bash
+# Using openapi-generator-cli
+npx @openapitools/openapi-generator-cli validate -i media-api-spec.yaml
+
+# Using swagger-cli
+npx swagger-cli validate media-api-spec.yaml
+```
+
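Alongside the validators above, a quick local check can confirm that every `$ref` in the specification points at something that exists. The Python sketch below works on the JSON form of the spec once it is loaded into a dictionary. The helper names are illustrative, and the resolver only walks objects, which is enough for `#/components/...` references but not for pointers into arrays.

```python
from typing import Any, Iterator, List


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every local "$ref" string found anywhere in the document."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            yield ref
        for value in node.values():
            yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def resolve(spec: dict, ref: str) -> bool:
    """Return True if a local pointer such as '#/components/schemas/X' exists."""
    node: Any = spec
    for part in ref[2:].split("/"):
        part = part.replace("~1", "/").replace("~0", "~")  # JSON Pointer escapes
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def unresolved_refs(spec: dict) -> List[str]:
    """Return the sorted, de-duplicated list of references that do not resolve."""
    return sorted({ref for ref in iter_refs(spec) if not resolve(spec, ref)})


if __name__ == "__main__":
    # Tiny stand-in document with one valid and one dangling reference.
    sample = {
        "paths": {"/images/generations": {"post": {"schema": {"$ref": "#/components/schemas/A"}}}},
        "components": {"schemas": {"A": {"properties": {"b": {"$ref": "#/components/schemas/B"}}}}},
    }
    print(unresolved_refs(sample))
```

To run it against the real file, load it with `json.load` (for example from `media-api-spec.json`) and pass the result to `unresolved_refs`; an empty list means no dangling local references were found.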
+### Generate Mock Server
+
+```bash
+# Using Prism
+npx @stoplight/prism-cli mock media-api-spec.yaml
+```
+
+### Generate Documentation
+
+```bash
+# Using Redoc
+npx @redocly/cli build-docs media-api-spec.yaml
+
+# Using Swagger UI
+docker run -p 8080:8080 -e SWAGGER_JSON=/spec/media-api-spec.yaml \
+  -v "$(pwd)":/spec swaggerapi/swagger-ui
+```
+
+## Contributing
+
+We welcome contributions! Please:
+
+1. **Check existing issues** or create a new one
+2. **Follow OpenAPI standards** and OpenAI conventions
+3. **Add examples** for new features
+4. **Update documentation** in both YAML and Markdown
+5. **Test your changes** with validation tools
+
+See the main [CONTRIBUTING](https://github.com/deep-assistant/master-plan/blob/main/CONTRIBUTING.md) guide for more details.
+
+## License
+
+This specification is released under the MIT License. See [LICENSE](LICENSE) for details.
+
+## Support
+
+- **Issues**: [GitHub Issues](https://github.com/deep-assistant/master-plan/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/deep-assistant/master-plan/discussions)
+- **Documentation**: [MEDIA_API.md](MEDIA_API.md)
+
+## Related Projects
+
+- **[master-plan](https://github.com/deep-assistant/master-plan)** - Main repository and roadmap
+- **[api-gateway](https://github.com/deep-assistant/api-gateway)** - OpenAI-compatible API gateway
+- **[telegram-bot](https://github.com/deep-assistant/telegram-bot)** - Telegram bot with media features
+- **[GPTutor](https://github.com/deep-assistant/GPTutor)** - Educational AI with image generation
+- **[web-capture](https://github.com/deep-assistant/web-capture)** - Web page capture service
+
+## Acknowledgments
+
+This specification is based on:
+- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
+- [OpenAPI 3.1 Specification](https://spec.openapis.org/oas/v3.1.0)
+- Community feedback and best practices
+
+## See Also
+
+- [OpenAI Images API](https://platform.openai.com/docs/api-reference/images)
+- [OpenAI Audio API](https://platform.openai.com/docs/api-reference/audio)
+- [OpenAI API Best Practices](https://platform.openai.com/docs/guides/production-best-practices)
+- [Suno AI](https://www.suno.ai/)
+- [Midjourney](https://www.midjourney.com/)
+- [Stability AI](https://stability.ai/)
+
+---
+
+**Built with ❤️ by the Deep Assistant team**
+
+For questions or support, please open an issue or discussion on GitHub.
diff --git a/media-api-spec.json b/media-api-spec.json
new file mode 100644
index 0000000..1c7dfba
--- /dev/null
+++ b/media-api-spec.json
@@ -0,0 +1,1680 @@
+{
+  "openapi": "3.1.0",
+  "info": {
+    "title": "Media Generation API",
+    "description": "OpenAI-compatible API specification for media generation including images, music, video and more.\n\nThis specification provides a standardized interface for various media generation providers,\nallowing seamless integration with different AI models for creating images, audio, and video content.\n\n## Features\n- Image generation, editing, and variations\n- Text-to-speech audio generation\n- Speech-to-text transcription\n- Audio translation\n- Music generation (asynchronous, task-based)\n- Video generation (planned)\n\n## Provider Support\nThis API specification is designed to work with multiple providers:\n- **Images**: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion\n- **Audio (Speech)**: GPT-4o-mini-TTS, TTS-1, TTS-1-HD\n- **Audio (Transcription)**: GPT-4o-transcribe, Whisper\n- **Music**: Suno AI\n- **Video**: RunwayML, Stability AI Video (planned)\n",
+    "version": "1.0.0",
+    "contact": {
+      "name": "Deep Assistant",
+      "url": "https://github.com/deep-assistant/master-plan"
+    },
+    "license": {
+      "name": "MIT",
+      "url": "https://github.com/deep-assistant/master-plan/blob/main/LICENSE"
+    }
+  },
+  "servers": [
+    {
+      "url": "https://api.example.com/v1",
+      "description": "Production server"
+    },
+    {
+      "url": "http://localhost:3000/v1",
+      "description": "Development server"
+    }
+  ],
+  "security": [
+    {
+      "ApiKeyAuth": []
+    },
+    {
+      "BearerAuth": []
+    }
+  ],
+  "tags": [
+    {
+      "name": "Images",
+      "description": "Generate, edit, and create variations of images"
+    },
+    {
+      "name": "Audio",
+      "description": "Text-to-speech, speech-to-text, and audio translation"
+    },
+    {
+      "name": "Music",
+      "description": "Music generation from text prompts"
+    },
+    {
+      "name": "Video",
+      "description": "Video generation and editing"
+    }
+  ],
+  "paths": {
+    "/images/generations": {
+      "post": {
+        "operationId": "createImage",
+        "tags": [
+          "Images"
+        ],
+        "summary": "Create image",
+        "description": "Creates an image given a text prompt. Supports multiple AI models including DALL-E, GPT-Image-1,\nMidjourney, Flux, and Stable Diffusion.\n\nLearn more at the [OpenAI image generation guide](https://platform.openai.com/docs/guides/images).\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateImageRequest"
+              },
+              "examples": {
+                "basic": {
+                  "summary": "Basic image generation",
+                  "value": {
+                    "prompt": "A cute baby sea otter",
+                    "model": "gpt-image-1",
+                    "n": 1,
+                    "size": "1024x1024"
+                  }
+                },
+                "streaming": {
+                  "summary": "Streaming image generation",
+                  "value": {
+                    "prompt": "A futuristic cityscape at sunset",
+                    "model": "gpt-image-1",
+                    "stream": true,
+                    "size": "1536x1024",
+                    "quality": "high"
+                  }
+                }
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully generated image(s)",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ImagesResponse"
+                }
+              },
+              "text/event-stream": {
+                "schema": {
+                  "$ref": "#/components/schemas/ImageGenerationStreamEvent"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          },
+          "429": {
+            "$ref": "#/components/responses/RateLimitExceeded"
+          },
+          "500": {
+            "$ref": "#/components/responses/InternalServerError"
+          }
+        }
+      }
+    },
+    "/images/edits": {
+      "post": {
+        "operationId": "createImageEdit",
+        "tags": [
+          "Images"
+        ],
+        "summary": "Create image edit",
+        "description": "Creates an edited or extended image given one or more source images and a text prompt.\nThis endpoint supports `gpt-image-1` and `dall-e-2`.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "multipart/form-data": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateImageEditRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully edited image(s)",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ImagesResponse"
+                }
+              },
+              "text/event-stream": {
+                "schema": {
+                  "$ref": "#/components/schemas/ImageEditStreamEvent"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          },
+          "429": {
+            "$ref": "#/components/responses/RateLimitExceeded"
+          }
+        }
+      }
+    },
+    "/images/variations": {
+      "post": {
+        "operationId": "createImageVariation",
+        "tags": [
+          "Images"
+        ],
+        "summary": "Create image variation",
+        "description": "Creates variations of a given image. Currently only supports `dall-e-2`.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "multipart/form-data": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateImageVariationRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully created image variation(s)",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ImagesResponse"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          }
+        }
+      }
+    },
+    "/audio/speech": {
+      "post": {
+        "operationId": "createSpeech",
+        "tags": [
+          "Audio"
+        ],
+        "summary": "Create speech",
+        "description": "Generates audio from the input text using text-to-speech models.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateSpeechRequest"
+              },
+              "examples": {
+                "basic": {
+                  "summary": "Basic speech generation",
+                  "value": {
+                    "model": "gpt-4o-mini-tts",
+                    "input": "The quick brown fox jumped over the lazy dog.",
+                    "voice": "alloy"
+                  }
+                },
+                "streaming": {
+                  "summary": "Streaming speech with SSE",
+                  "value": {
+                    "model": "gpt-4o-mini-tts",
+                    "input": "Hello, world!",
+                    "voice": "shimmer",
+                    "stream_format": "sse"
+                  }
+                }
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully generated speech audio",
+            "headers": {
+              "Transfer-Encoding": {
+                "schema": {
+                  "type": "string"
+                },
+                "description": "`chunked` when the audio is streamed as it is generated"
+              }
+            },
+            "content": {
+              "application/octet-stream": {
+                "schema": {
+                  "type": "string",
+                  "format": "binary"
+                }
+              },
+              "text/event-stream": {
+                "schema": {
+                  "$ref": "#/components/schemas/SpeechStreamEvent"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          }
+        }
+      }
+    },
+    "/audio/transcriptions": {
+      "post": {
+        "operationId": "createTranscription",
+        "tags": [
+          "Audio"
+        ],
+        "summary": "Create transcription",
+        "description": "Transcribes audio into text. Supports various response formats including JSON, verbose JSON,\ndiarized JSON (speaker identification), and streaming.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "multipart/form-data": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateTranscriptionRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully transcribed audio",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "oneOf": [
+                    {
+                      "$ref": "#/components/schemas/TranscriptionResponseJson"
+                    },
+                    {
+                      "$ref": "#/components/schemas/TranscriptionResponseVerboseJson"
+                    },
+                    {
+                      "$ref": "#/components/schemas/TranscriptionResponseDiarizedJson"
+                    }
+                  ]
+                }
+              },
+              "text/event-stream": {
+                "schema": {
+                  "$ref": "#/components/schemas/TranscriptionStreamEvent"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          }
+        }
+      }
+    },
+    "/audio/translations": {
+      "post": {
+        "operationId": "createTranslation",
+        "tags": [
+          "Audio"
+        ],
+        "summary": "Create translation",
+        "description": "Translates audio from any supported language into English text.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "multipart/form-data": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateTranslationRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successfully translated audio",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "oneOf": [
+                    {
+                      "$ref": "#/components/schemas/TranslationResponseJson"
+                    },
+                    {
+                      "$ref": "#/components/schemas/TranslationResponseVerboseJson"
+                    }
+                  ]
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          }
+        }
+      }
+    },
+    "/music/generations": {
+      "post": {
+        "operationId": "createMusic",
+        "tags": [
+          "Music"
+        ],
+        "summary": "Create music",
+        "description": "Generates music from a text prompt. Supports Suno AI and other music generation models.\n\n**Note**: This endpoint returns a task ID for async processing. Use the status endpoint\nto check generation progress and retrieve the final audio file.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateMusicRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "202": {
+            "description": "Music generation task accepted",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/MusicTaskResponse"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          }
+        }
+      }
+    },
+    "/music/generations/{task_id}": {
+      "get": {
+        "operationId": "getMusicStatus",
+        "tags": [
+          "Music"
+        ],
+        "summary": "Get music generation status",
+        "description": "Retrieves the status and result of a music generation task.\n",
+        "parameters": [
+          {
+            "name": "task_id",
+            "in": "path",
+            "required": true,
+            "schema": {
+              "type": "string"
+            },
+            "description": "The task ID returned from the music generation request"
+          }
+        ],
+        "responses": {
+          "200": {
+            "description": "Task status retrieved successfully",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/MusicStatusResponse"
+                }
+              }
+            }
+          },
+          "404": {
+            "$ref": "#/components/responses/NotFound"
+          }
+        }
+      }
+    },
+    "/video/generations": {
+      "post": {
+        "operationId": "createVideo",
+        "tags": [
+          "Video"
+        ],
+        "summary": "Create video",
+        "description": "Generates video from text prompt or images. Supports various video generation models.\n\n**Note**: This is a planned feature. Implementation details may change.\n",
+        "requestBody": {
+          "required": true,
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/CreateVideoRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "202": {
+            "description": "Video generation task accepted",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/VideoTaskResponse"
+                }
+              }
+            }
+          },
+          "400": {
+            "$ref": "#/components/responses/BadRequest"
+          },
+          "401": {
+            "$ref": "#/components/responses/Unauthorized"
+          },
+          "501": {
+            "description": "Not implemented yet"
+          }
+        }
+      }
+    }
+  },
+  "components": {
+    "securitySchemes": {
+      "ApiKeyAuth": {
+        "type": "apiKey",
+        "in": "header",
+        "name": "X-API-Key",
+        "description": "API key for authentication"
+      },
+      "BearerAuth": {
+        "type": "http",
+        "scheme": "bearer",
+        "bearerFormat": "JWT",
+        "description": "Bearer token authentication"
+      }
+    },
+    "schemas": {
+      "CreateImageRequest": {
+        "type": "object",
+        "required": [
+          "prompt"
+        ],
+        "properties": {
+          "prompt": {
+            "type": "string",
+            "description": "A text description of the desired image(s). Maximum length:\n- 32000 characters for `gpt-image-1`\n- 4000 characters for `dall-e-3`\n- 1000 characters for `dall-e-2`\n",
+            "example": "A cute baby sea otter",
+            "maxLength": 32000
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "dall-e-2",
+              "dall-e-3",
+              "gpt-image-1",
+              "gpt-image-1-mini",
+              "midjourney",
+              "flux",
+              "stable-diffusion"
+            ],
+            "default": "dall-e-2",
+            "description": "The model to use for image generation",
+            "nullable": true
+          },
+          "n": {
+            "type": "integer",
+            "minimum": 1,
+            "maximum": 10,
+            "default": 1,
+            "description": "The number of images to generate. Must be between 1 and 10.\nFor `dall-e-3`, only `n=1` is supported.\n",
+            "nullable": true
+          },
+          "size": {
+            "type": "string",
+            "enum": [
+              "auto",
+              "256x256",
+              "512x512",
+              "1024x1024",
+              "1536x1024",
+              "1024x1536",
+              "1792x1024",
+              "1024x1792"
+            ],
+            "default": "auto",
+            "description": "The size of the generated images:\n- `gpt-image-1`: `1024x1024`, `1536x1024` (landscape), `1024x1536` (portrait), or `auto` (default)\n- `dall-e-2`: `256x256`, `512x512`, or `1024x1024`\n- `dall-e-3`: `1024x1024`, `1792x1024`, or `1024x1792`\n",
+            "nullable": true
+          },
+          "quality": {
+            "type": "string",
+            "enum": [
+              "standard",
+              "hd",
+              "low",
+              "medium",
+              "high",
+              "auto"
+            ],
+            "default": "auto",
+            "description": "The quality of the image that will be generated:\n- `auto` (default): automatically select the best quality\n- `high`, `medium`, `low`: supported for `gpt-image-1`\n- `hd`, `standard`: supported for `dall-e-3`\n- `standard`: only option for `dall-e-2`\n",
+            "nullable": true
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "url",
+              "b64_json"
+            ],
+            "default": "url",
+            "description": "The format in which generated images are returned. Must be one of `url` or `b64_json`.\nURLs are only valid for 60 minutes after generation.\nNote: `gpt-image-1` always returns base64-encoded images.\n",
+            "nullable": true
+          },
+          "output_format": {
+            "type": "string",
+            "enum": [
+              "png",
+              "jpeg",
+              "webp"
+            ],
+            "default": "png",
+            "description": "The format in which the generated images are returned.\nOnly supported for `gpt-image-1`. Must be one of `png`, `jpeg`, or `webp`.\n",
+            "nullable": true
+          },
+          "output_compression": {
+            "type": "integer",
+            "minimum": 0,
+            "maximum": 100,
+            "default": 100,
+            "description": "The compression level (0-100%) for generated images.\nOnly supported for `gpt-image-1` with `webp` or `jpeg` output formats.\n",
+            "nullable": true
+          },
+          "stream": {
+            "type": "boolean",
+            "default": false,
+            "description": "Generate the image in streaming mode. Only supported for `gpt-image-1`.\n",
+            "nullable": true
+          },
+          "background": {
+            "type": "string",
+            "enum": [
+              "transparent",
+              "opaque",
+              "auto"
+            ],
+            "default": "auto",
+            "description": "Sets transparency for the background. Only supported for `gpt-image-1`.\n- `transparent`: requires PNG or WebP output format\n- `opaque`: solid background\n- `auto` (default): model determines best background\n",
+            "nullable": true
+          },
+          "style": {
+            "type": "string",
+            "enum": [
+              "vivid",
+              "natural"
+            ],
+            "default": "vivid",
+            "description": "The style of generated images. Only supported for `dall-e-3`.\n- `vivid`: hyper-real and dramatic images\n- `natural`: more natural, less hyper-real images\n",
+            "nullable": true
+          },
+          "moderation": {
+            "type": "string",
+            "enum": [
+              "low",
+              "auto"
+            ],
+            "default": "auto",
+            "description": "Content moderation level for `gpt-image-1`:\n- `low`: less restrictive filtering\n- `auto` (default): standard filtering\n",
+            "nullable": true
+          },
+          "user": {
+            "type": "string",
+            "description": "A unique identifier representing your end-user, for monitoring and abuse detection.\n",
+            "example": "user-1234"
+          }
+        }
+      },
+      "CreateImageEditRequest": {
+        "type": "object",
+        "required": [
+          "image",
+          "prompt"
+        ],
+        "properties": {
+          "image": {
+            "type": "array",
+            "items": {
+              "type": "string",
+              "format": "binary"
+            },
+            "minItems": 1,
+            "description": "One or more source images to edit. Must be valid PNG files, less than 4MB each.\n"
+          },
+          "mask": {
+            "type": "string",
+            "format": "binary",
+            "description": "An image indicating which areas of `image` to edit. Must be a valid PNG file,\nless than 4MB, and have the same dimensions as `image`.\n",
+            "nullable": true
+          },
+          "prompt": {
+            "type": "string",
+            "description": "A text description of the desired edited image(s)",
+            "example": "A cute baby sea otter wearing a beret",
+            "maxLength": 32000
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "dall-e-2",
+              "gpt-image-1"
+            ],
+            "default": "dall-e-2",
+            "description": "The model to use for image editing",
+            "nullable": true
+          },
+          "n": {
+            "type": "integer",
+            "minimum": 1,
+            "maximum": 10,
+            "default": 1,
+            "description": "The number of images to generate",
+            "nullable": true
+          },
+          "size": {
+            "type": "string",
+            "enum": [
+              "256x256",
+              "512x512",
+              "1024x1024"
+            ],
+            "default": "1024x1024",
+            "nullable": true
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "url",
+              "b64_json"
+            ],
+            "default": "url",
+            "nullable": true
+          },
+          "stream": {
+            "type": "boolean",
+            "default": false,
+            "description": "Generate the edit in streaming mode (gpt-image-1 only)",
+            "nullable": true
+          },
+          "user": {
+            "type": "string",
+            "nullable": true
+          }
+        }
+      },
+      "CreateImageVariationRequest": {
+        "type": "object",
+        "required": [
+          "image"
+        ],
+        "properties": {
+          "image": {
+            "type": "string",
+            "format": "binary",
+            "description": "The image to use as the basis for variations. Must be a valid PNG file,\nless than 4MB, and square.\n"
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "dall-e-2"
+            ],
+            "default": "dall-e-2",
+            "description": "Only `dall-e-2` is supported",
+            "nullable": true
+          },
+          "n": {
+            "type": "integer",
+            "minimum": 1,
+            "maximum": 10,
+            "default": 1,
+            "description": "The number of variations to generate",
+            "nullable": true
+          },
+          "size": {
+            "type": "string",
+            "enum": [
+              "256x256",
+              "512x512",
+              "1024x1024"
+            ],
+            "default": "1024x1024",
+            "nullable": true
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "url",
+              "b64_json"
+            ],
+            "default": "url",
+            "nullable": true
+          },
+          "user": {
+            "type": "string",
+            "nullable": true
+          }
+        }
+      },
+      "ImagesResponse": {
+        "type": "object",
+        "properties": {
+          "created": {
+            "type": "integer",
+            "description": "Unix timestamp of when the image(s) were created"
+          },
+          "data": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/ImageObject"
+            }
+          },
+          "usage": {
+            "$ref": "#/components/schemas/UsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "ImageObject": {
+        "type": "object",
+        "properties": {
+          "url": {
+            "type": "string",
+            "format": "uri",
+            "description": "The URL of the generated image (valid for 60 minutes)",
+            "nullable": true
+          },
+          "b64_json": {
+            "type": "string",
+            "description": "The base64-encoded image data, returned when `response_format` is `b64_json` (always for `gpt-image-1`)",
+            "nullable": true
+          },
+          "revised_prompt": {
+            "type": "string",
+            "description": "The prompt actually used to generate the image, present only if the model revised the original prompt",
+            "nullable": true
+          }
+        }
+      },
+      "ImageGenerationStreamEvent": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "image_generation.partial_image",
+              "image_generation.completed"
+            ]
+          },
+          "b64_json": {
+            "type": "string",
+            "description": "Base64-encoded partial or complete image"
+          },
+          "partial_image_index": {
+            "type": "integer",
+            "description": "Index of the partial image in streaming sequence",
+            "nullable": true
+          },
+          "usage": {
+            "$ref": "#/components/schemas/UsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "ImageEditStreamEvent": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "image_edit.partial_image",
+              "image_edit.completed"
+            ]
+          },
+          "b64_json": {
+            "type": "string"
+          },
+          "partial_image_index": {
+            "type": "integer",
+            "nullable": true
+          },
+          "usage": {
+            "$ref": "#/components/schemas/UsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "CreateSpeechRequest": {
+        "type": "object",
+        "required": [
+          "model",
+          "input",
+          "voice"
+        ],
+        "properties": {
+          "model": {
+            "type": "string",
+            "enum": [
+              "tts-1",
+              "tts-1-hd",
+              "gpt-4o-mini-tts",
+              "gpt-4o-tts"
+            ],
+            "description": "The TTS model to use"
+          },
+          "input": {
+            "type": "string",
+            "maxLength": 4096,
+            "description": "The text to generate audio for"
+          },
+          "voice": {
+            "type": "string",
+            "enum": [
+              "alloy",
+              "ash",
+              "ballad",
+              "coral",
+              "echo",
+              "fable",
+              "onyx",
+              "nova",
+              "sage",
+              "shimmer",
+              "verse"
+            ],
+            "description": "The voice to use for generation"
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "mp3",
+              "opus",
+              "aac",
+              "flac",
+              "wav",
+              "pcm"
+            ],
+            "default": "mp3",
+            "description": "The format to return the audio in",
+            "nullable": true
+          },
+          "speed": {
+            "type": "number",
+            "minimum": 0.25,
+            "maximum": 4.0,
+            "default": 1.0,
+            "description": "The speed of the generated audio (0.25 to 4.0)",
+            "nullable": true
+          },
+          "stream_format": {
+            "type": "string",
+            "enum": [
+              "raw",
+              "sse"
+            ],
+            "description": "The streaming format:\n- `raw`: standard binary streaming\n- `sse`: server-sent events format\n",
+            "nullable": true
+          }
+        }
+      },
+      "CreateTranscriptionRequest": {
+        "type": "object",
+        "required": [
+          "file",
+          "model"
+        ],
+        "properties": {
+          "file": {
+            "type": "string",
+            "format": "binary",
+            "description": "The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.\nMaximum file size: 25 MB.\n"
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "whisper-1",
+              "gpt-4o-transcribe",
+              "gpt-4o-mini-transcribe",
+              "gpt-4o-transcribe-diarize"
+            ],
+            "description": "The transcription model to use"
+          },
+          "language": {
+            "type": "string",
+            "description": "The language of the input audio in ISO-639-1 format (e.g., 'en', 'es').\nSupplying the input language improves accuracy and latency.\n",
+            "nullable": true
+          },
+          "prompt": {
+            "type": "string",
+            "description": "Optional text to guide the model's style or continue a previous audio segment.\n",
+            "nullable": true
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "json",
+              "text",
+              "srt",
+              "verbose_json",
+              "vtt",
+              "diarized_json"
+            ],
+            "default": "json",
+            "description": "The format of the transcript output",
+            "nullable": true
+          },
+          "temperature": {
+            "type": "number",
+            "minimum": 0,
+            "maximum": 1,
+            "default": 0,
+            "description": "The sampling temperature (0 to 1)",
+            "nullable": true
+          },
+          "timestamp_granularities": {
+            "type": "array",
+            "items": {
+              "type": "string",
+              "enum": [
+                "word",
+                "segment"
+              ]
+            },
+            "description": "The timestamp granularities to include in the response.\nOnly applicable for `verbose_json` format.\n",
+            "nullable": true
+          },
+          "stream": {
+            "type": "boolean",
+            "default": false,
+            "description": "Stream the transcription as it's processed",
+            "nullable": true
+          },
+          "chunking_strategy": {
+            "type": "string",
+            "enum": [
+              "auto",
+              "fixed"
+            ],
+            "description": "Strategy for chunking long audio files (diarization only):\n- `auto`: automatically determine chunk size\n- `fixed`: use fixed chunk size\n",
+            "nullable": true
+          },
+          "known_speaker_names": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "description": "Array of known speaker names for diarization",
+            "nullable": true
+          },
+          "known_speaker_references": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "description": "Array of base64-encoded voice reference audio samples for the known speakers",
+            "nullable": true
+          }
+        }
+      },
+      "CreateTranslationRequest": {
+        "type": "object",
+        "required": [
+          "file",
+          "model"
+        ],
+        "properties": {
+          "file": {
+            "type": "string",
+            "format": "binary",
+            "description": "The audio file to translate (translates to English)"
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "whisper-1"
+            ],
+            "description": "The translation model to use"
+          },
+          "prompt": {
+            "type": "string",
+            "nullable": true
+          },
+          "response_format": {
+            "type": "string",
+            "enum": [
+              "json",
+              "text",
+              "srt",
+              "verbose_json",
+              "vtt"
+            ],
+            "default": "json",
+            "nullable": true
+          },
+          "temperature": {
+            "type": "number",
+            "minimum": 0,
+            "maximum": 1,
+            "default": 0,
+            "nullable": true
+          }
+        }
+      },
+      "TranscriptionResponseJson": {
+        "type": "object",
+        "properties": {
+          "text": {
+            "type": "string",
+            "description": "The transcribed text"
+          },
+          "usage": {
+            "$ref": "#/components/schemas/AudioUsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "TranscriptionResponseVerboseJson": {
+        "type": "object",
+        "properties": {
+          "task": {
+            "type": "string",
+            "enum": [
+              "transcribe"
+            ]
+          },
+          "language": {
+            "type": "string",
+            "description": "The detected language"
+          },
+          "duration": {
+            "type": "number",
+            "description": "Duration of the audio in seconds"
+          },
+          "text": {
+            "type": "string",
+            "description": "The transcribed text"
+          },
+          "words": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/TranscriptionWord"
+            },
+            "nullable": true
+          },
+          "segments": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/TranscriptionSegment"
+            },
+            "nullable": true
+          },
+          "usage": {
+            "$ref": "#/components/schemas/AudioUsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "TranscriptionResponseDiarizedJson": {
+        "type": "object",
+        "properties": {
+          "task": {
+            "type": "string",
+            "enum": [
+              "transcribe"
+            ]
+          },
+          "duration": {
+            "type": "number",
+            "description": "Duration of the audio in seconds"
+          },
+          "text": {
+            "type": "string",
+            "description": "The full transcribed text with speaker labels"
+          },
+          "segments": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/DiarizedSegment"
+            }
+          },
+          "usage": {
+            "$ref": "#/components/schemas/AudioUsageInfo",
+            "nullable": true
+          }
+        }
+      },
+      "TranscriptionWord": {
+        "type": "object",
+        "properties": {
+          "word": {
+            "type": "string"
+          },
+          "start": {
+            "type": "number"
+          },
+          "end": {
+            "type": "number"
+          }
+        }
+      },
+      "TranscriptionSegment": {
+        "type": "object",
+        "properties": {
+          "id": {
+            "type": "integer"
+          },
+          "seek": {
+            "type": "integer"
+          },
+          "start": {
+            "type": "number"
+          },
+          "end": {
+            "type": "number"
+          },
+          "text": {
+            "type": "string"
+          },
+          "tokens": {
+            "type": "array",
+            "items": {
+              "type": "integer"
+            }
+          },
+          "temperature": {
+            "type": "number"
+          },
+          "avg_logprob": {
+            "type": "number"
+          },
+          "compression_ratio": {
+            "type": "number"
+          },
+          "no_speech_prob": {
+            "type": "number"
+          }
+        }
+      },
+      "DiarizedSegment": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "transcript.text.segment"
+            ]
+          },
+          "id": {
+            "type": "string"
+          },
+          "start": {
+            "type": "number"
+          },
+          "end": {
+            "type": "number"
+          },
+          "text": {
+            "type": "string"
+          },
+          "speaker": {
+            "type": "string",
+            "description": "The identified speaker name or ID"
+          }
+        }
+      },
+      "TranslationResponseJson": {
+        "type": "object",
+        "properties": {
+          "text": {
+            "type": "string",
+            "description": "The translated text (in English)"
+          }
+        }
+      },
+      "TranslationResponseVerboseJson": {
+        "type": "object",
+        "properties": {
+          "task": {
+            "type": "string",
+            "enum": [
+              "translate"
+            ]
+          },
+          "language": {
+            "type": "string",
+            "description": "The source language detected"
+          },
+          "duration": {
+            "type": "number"
+          },
+          "text": {
+            "type": "string"
+          }
+        }
+      },
+      "TranscriptionStreamEvent": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "transcript.text.delta",
+              "transcript.completed"
+            ]
+          },
+          "delta": {
+            "type": "string",
+            "description": "Incremental transcription text",
+            "nullable": true
+          },
+          "text": {
+            "type": "string",
+            "description": "Full transcription text (on completed)",
+            "nullable": true
+          }
+        }
+      },
+      "SpeechStreamEvent": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "speech.audio_delta",
+              "speech.completed"
+            ]
+          },
+          "delta": {
+            "type": "string",
+            "format": "byte",
+            "description": "Base64-encoded audio chunk",
+            "nullable": true
+          }
+        }
+      },
+      "CreateMusicRequest": {
+        "type": "object",
+        "required": [
+          "prompt"
+        ],
+        "properties": {
+          "prompt": {
+            "type": "string",
+            "description": "Text description of the music to generate",
+            "example": "An upbeat electronic dance track with heavy bass"
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "suno-v3",
+              "suno-v3.5"
+            ],
+            "default": "suno-v3.5",
+            "nullable": true
+          },
+          "duration": {
+            "type": "integer",
+            "minimum": 10,
+            "maximum": 300,
+            "description": "Duration of the music in seconds (10-300)",
+            "nullable": true
+          },
+          "style": {
+            "type": "string",
+            "description": "Musical style or genre",
+            "nullable": true
+          },
+          "instrumental": {
+            "type": "boolean",
+            "default": false,
+            "description": "Whether to generate instrumental music (no vocals)",
+            "nullable": true
+          }
+        }
+      },
+      "MusicTaskResponse": {
+        "type": "object",
+        "properties": {
+          "task_id": {
+            "type": "string",
+            "description": "Unique identifier for the music generation task"
+          },
+          "status": {
+            "type": "string",
+            "enum": [
+              "pending",
+              "processing"
+            ],
+            "description": "Current status of the task"
+          },
+          "estimated_completion_time": {
+            "type": "integer",
+            "description": "Estimated time to completion in seconds",
+            "nullable": true
+          }
+        }
+      },
+      "MusicStatusResponse": {
+        "type": "object",
+        "properties": {
+          "task_id": {
+            "type": "string"
+          },
+          "status": {
+            "type": "string",
+            "enum": [
+              "pending",
+              "processing",
+              "completed",
+              "failed"
+            ]
+          },
+          "progress": {
+            "type": "integer",
+            "minimum": 0,
+            "maximum": 100,
+            "description": "Progress percentage (0-100)"
+          },
+          "result": {
+            "type": "object",
+            "nullable": true,
+            "properties": {
+              "url": {
+                "type": "string",
+                "format": "uri",
+                "description": "URL to download the generated music"
+              },
+              "duration": {
+                "type": "number",
+                "description": "Actual duration of the generated music in seconds"
+              },
+              "format": {
+                "type": "string",
+                "description": "Audio format (e.g., mp3, wav)"
+              }
+            }
+          },
+          "error": {
+            "type": "string",
+            "nullable": true,
+            "description": "Error message if the task failed"
+          }
+        }
+      },
+      "CreateVideoRequest": {
+        "type": "object",
+        "required": [
+          "prompt"
+        ],
+        "properties": {
+          "prompt": {
+            "type": "string",
+            "description": "Text description of the video to generate"
+          },
+          "model": {
+            "type": "string",
+            "enum": [
+              "runway-gen3",
+              "stability-video"
+            ],
+            "nullable": true
+          },
+          "duration": {
+            "type": "integer",
+            "minimum": 2,
+            "maximum": 10,
+            "description": "Duration of the video in seconds",
+            "nullable": true
+          },
+          "resolution": {
+            "type": "string",
+            "enum": [
+              "512x512",
+              "768x768",
+              "1024x1024",
+              "1280x720",
+              "1920x1080"
+            ],
+            "default": "1280x720",
+            "nullable": true
+          },
+          "fps": {
+            "type": "integer",
+            "enum": [
+              24,
+              30,
+              60
+            ],
+            "default": 30,
+            "description": "Frames per second",
+            "nullable": true
+          },
+          "source_image": {
+            "type": "string",
+            "format": "binary",
+            "description": "Optional source image for image-to-video generation",
+            "nullable": true
+          }
+        }
+      },
+      "VideoTaskResponse": {
+        "type": "object",
+        "properties": {
+          "task_id": {
+            "type": "string"
+          },
+          "status": {
+            "type": "string",
+            "enum": [
+              "pending",
+              "processing"
+            ]
+          },
+          "estimated_completion_time": {
+            "type": "integer",
+            "nullable": true
+          }
+        }
+      },
+      "UsageInfo": {
+        "type": "object",
+        "properties": {
+          "total_tokens": {
+            "type": "integer"
+          },
+          "input_tokens": {
+            "type": "integer"
+          },
+          "output_tokens": {
+            "type": "integer"
+          },
+          "input_tokens_details": {
+            "type": "object",
+            "properties": {
+              "text_tokens": {
+                "type": "integer"
+              },
+              "image_tokens": {
+                "type": "integer"
+              }
+            }
+          }
+        }
+      },
+      "AudioUsageInfo": {
+        "type": "object",
+        "properties": {
+          "type": {
+            "type": "string",
+            "enum": [
+              "tokens",
+              "duration"
+            ]
+          },
+          "input_tokens": {
+            "type": "integer",
+            "nullable": true
+          },
+          "input_token_details": {
+            "type": "object",
+            "nullable": true,
+            "properties": {
+              "text_tokens": {
+                "type": "integer"
+              },
+              "audio_tokens": {
+                "type": "integer"
+              }
+            }
+          },
+          "output_tokens": {
+            "type": "integer",
+            "nullable": true
+          },
+          "total_tokens": {
+            "type": "integer",
+            "nullable": true
+          },
+          "seconds": {
+            "type": "number",
+            "nullable": true,
+            "description": "Duration in seconds (for duration-based usage)"
+          }
+        }
+      },
+      "Error": {
+        "type": "object",
+        "properties": {
+          "error": {
+            "type": "object",
+            "properties": {
+              "message": {
+                "type": "string",
+                "description": "A human-readable error message"
+              },
+              "type": {
+                "type": "string",
+                "description": "Error type identifier"
+              },
+              "code": {
+                "type": "string",
+                "description": "Error code",
+                "nullable": true
+              },
+              "param": {
+                "type": "string",
+                "description": "The parameter that caused the error",
+                "nullable": true
+              }
+            }
+          }
+        }
+      }
+    },
+    "responses": {
+      "BadRequest": {
+        "description": "Bad request - Invalid parameters",
+        "content": {
+          "application/json": {
+            "schema": {
+              "$ref": "#/components/schemas/Error"
+            },
+            "example": {
+              "error": {
+                "message": "Invalid parameter: prompt is required",
+                "type": "invalid_request_error",
+                "code": "invalid_parameter",
+                "param": "prompt"
+              }
+            }
+          }
+        }
+      },
+      "Unauthorized": {
+        "description": "Unauthorized - Invalid or missing API key",
+        "content": {
+          "application/json": {
+            "schema": {
+              "$ref": "#/components/schemas/Error"
+            },
+            "example": {
+              "error": {
+                "message": "Invalid API key provided",
+                "type": "invalid_request_error",
+                "code": "invalid_api_key"
+              }
+            }
+          }
+        }
+      },
+      "NotFound": {
+        "description": "Resource not found",
+        "content": {
+          "application/json": {
+            "schema": {
+              "$ref": "#/components/schemas/Error"
+            },
+            "example": {
+              "error": {
+                "message": "The requested resource was not found",
+                "type": "invalid_request_error",
+                "code": "not_found"
+              }
+            }
+          }
+        }
+      },
+      "RateLimitExceeded": {
+        "description": "Rate limit exceeded",
+        "content": {
+          "application/json": {
+            "schema": {
+              "$ref": "#/components/schemas/Error"
+            },
+            "example": {
+              "error": {
+                "message": "Rate limit exceeded. Please try again later.",
+                "type": "rate_limit_error",
+                "code": "rate_limit_exceeded"
+              }
+            }
+          }
+        }
+      },
+      "InternalServerError": {
+        "description": "Internal server error",
+        "content": {
+          "application/json": {
+            "schema": {
+              "$ref": "#/components/schemas/Error"
+            },
+            "example": {
+              "error": {
+                "message": "An internal server error occurred",
+                "type": "server_error",
+                "code": "internal_error"
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
\ No newline at end of file
diff --git a/media-api-spec.yaml b/media-api-spec.yaml
new file mode 100644
index 0000000..d54fee4
--- /dev/null
+++ b/media-api-spec.yaml
@@ -0,0 +1,1332 @@
+openapi: 3.1.0
+info:
+  title: Media Generation API
+  description: |
+    OpenAI-compatible API specification for media generation including images, music, video and more.
+
+    This specification provides a standardized interface for various media generation providers,
+    allowing seamless integration with different AI models for creating images, audio, and video content.
+
+    ## Features
+    - Image generation, editing, and variations
+    - Text-to-speech audio generation
+    - Speech-to-text transcription
+    - Audio translation
+    - Video generation (future support)
+    - Music generation (future support)
+
+    ## Provider Support
+    This API specification is designed to work with multiple providers:
+    - **Images**: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
+    - **Audio (Speech)**: GPT-4o-mini-TTS, TTS-1, TTS-1-HD
+    - **Audio (Transcription)**: GPT-4o-transcribe, Whisper
+    - **Music**: Suno AI
+    - **Video**: RunwayML, Stability AI Video (planned)
+
+  version: 1.0.0
+  contact:
+    name: Deep Assistant
+    url: https://github.com/deep-assistant/master-plan
+  license:
+    name: MIT
+    url: https://github.com/deep-assistant/master-plan/blob/main/LICENSE
+
+servers:
+  - url: https://api.example.com/v1
+    description: Production server
+  - url: http://localhost:3000/v1
+    description: Development server
+
+security:
+  - ApiKeyAuth: []
+  - BearerAuth: []
+
+tags:
+  - name: Images
+    description: Generate, edit, and create variations of images
+  - name: Audio
+    description: Text-to-speech, speech-to-text, and audio translation
+  - name: Music
+    description: Music generation from text prompts
+  - name: Video
+    description: Video generation and editing
+
+paths:
+  /images/generations:
+    post:
+      operationId: createImage
+      tags:
+        - Images
+      summary: Create image
+      description: |
+        Creates an image given a text prompt. Supports multiple AI models including DALL-E, GPT-Image-1,
+        Midjourney, Flux, and Stable Diffusion.
+
+        Learn more at the [OpenAI image generation guide](https://platform.openai.com/docs/guides/images).
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/CreateImageRequest'
+            examples:
+              basic:
+                summary: Basic image generation
+                value:
+                  prompt: "A cute baby sea otter"
+                  model: "gpt-image-1"
+                  n: 1
+                  size: "1024x1024"
+              streaming:
+                summary: Streaming image generation
+                value:
+                  prompt: "A futuristic cityscape at sunset"
+                  model: "gpt-image-1"
+                  stream: true
+                  size: "1536x1024"
+                  quality: "high"
+      responses:
+        '200':
+          description: Successfully generated image(s)
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ImagesResponse'
+            text/event-stream:
+              schema:
+                $ref: '#/components/schemas/ImageGenerationStreamEvent'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+        '429':
+          $ref: '#/components/responses/RateLimitExceeded'
+        '500':
+          $ref: '#/components/responses/InternalServerError'
+
+  /images/edits:
+    post:
+      operationId: createImageEdit
+      tags:
+        - Images
+      summary: Create image edit
+      description: |
+        Creates an edited or extended image given one or more source images and a text prompt.
+        This endpoint supports `gpt-image-1` and `dall-e-2`.
+      requestBody:
+        required: true
+        content:
+          multipart/form-data:
+            schema:
+              $ref: '#/components/schemas/CreateImageEditRequest'
+      responses:
+        '200':
+          description: Successfully edited image(s)
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ImagesResponse'
+            text/event-stream:
+              schema:
+                $ref: '#/components/schemas/ImageEditStreamEvent'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+        '429':
+          $ref: '#/components/responses/RateLimitExceeded'
+
+  /images/variations:
+    post:
+      operationId: createImageVariation
+      tags:
+        - Images
+      summary: Create image variation
+      description: |
+        Creates variations of a given image. Currently only supports `dall-e-2`.
+      requestBody:
+        required: true
+        content:
+          multipart/form-data:
+            schema:
+              $ref: '#/components/schemas/CreateImageVariationRequest'
+      responses:
+        '200':
+          description: Successfully created image variation(s)
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/ImagesResponse'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+
+  /audio/speech:
+    post:
+      operationId: createSpeech
+      tags:
+        - Audio
+      summary: Create speech
+      description: |
+        Generates audio from the input text using text-to-speech models.
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/CreateSpeechRequest'
+            examples:
+              basic:
+                summary: Basic speech generation
+                value:
+                  model: "gpt-4o-mini-tts"
+                  input: "The quick brown fox jumped over the lazy dog."
+                  voice: "alloy"
+              streaming:
+                summary: Streaming speech with SSE
+                value:
+                  model: "gpt-4o-mini-tts"
+                  input: "Hello, world!"
+                  voice: "shimmer"
+                  stream_format: "sse"
+      responses:
+        '200':
+          description: Successfully generated speech audio
+          headers:
+            Transfer-Encoding:
+              schema:
+                type: string
+              description: Set to `chunked` when the audio is streamed
+          content:
+            application/octet-stream:
+              schema:
+                type: string
+                format: binary
+            text/event-stream:
+              schema:
+                $ref: '#/components/schemas/SpeechStreamEvent'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+
+  /audio/transcriptions:
+    post:
+      operationId: createTranscription
+      tags:
+        - Audio
+      summary: Create transcription
+      description: |
+        Transcribes audio into text. Supports various response formats including JSON, verbose JSON,
+        diarized JSON (speaker identification), and streaming.
+      requestBody:
+        required: true
+        content:
+          multipart/form-data:
+            schema:
+              $ref: '#/components/schemas/CreateTranscriptionRequest'
+      responses:
+        '200':
+          description: Successfully transcribed audio
+          content:
+            application/json:
+              schema:
+                oneOf:
+                  - $ref: '#/components/schemas/TranscriptionResponseJson'
+                  - $ref: '#/components/schemas/TranscriptionResponseVerboseJson'
+                  - $ref: '#/components/schemas/TranscriptionResponseDiarizedJson'
+            text/event-stream:
+              schema:
+                $ref: '#/components/schemas/TranscriptionStreamEvent'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+
+  /audio/translations:
+    post:
+      operationId: createTranslation
+      tags:
+        - Audio
+      summary: Create translation
+      description: |
+        Translates audio from any supported language into English text.
+      requestBody:
+        required: true
+        content:
+          multipart/form-data:
+            schema:
+              $ref: '#/components/schemas/CreateTranslationRequest'
+      responses:
+        '200':
+          description: Successfully translated audio
+          content:
+            application/json:
+              schema:
+                oneOf:
+                  - $ref: '#/components/schemas/TranslationResponseJson'
+                  - $ref: '#/components/schemas/TranslationResponseVerboseJson'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+
+  /music/generations:
+    post:
+      operationId: createMusic
+      tags:
+        - Music
+      summary: Create music
+      description: |
+        Generates music from a text prompt. Supports Suno AI and other music generation models.
+
+        **Note**: Generation is asynchronous. This endpoint returns a task ID; poll
+        `GET /music/generations/{task_id}` to check progress and retrieve the final audio file.
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/CreateMusicRequest'
+      responses:
+        '202':
+          description: Music generation task accepted
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/MusicTaskResponse'
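+              # Illustrative example only: the task ID and timing below are hypothetical
+              # values, not output captured from a real provider.
+              example:
+                task_id: "task_abc123"
+                status: pending
+                estimated_completion_time: 90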
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+
+  /music/generations/{task_id}:
+    get:
+      operationId: getMusicStatus
+      tags:
+        - Music
+      summary: Get music generation status
+      description: |
+        Retrieves the status and result of a music generation task.
+      parameters:
+        - name: task_id
+          in: path
+          required: true
+          schema:
+            type: string
+          description: The task ID returned from the music generation request
+      responses:
+        '200':
+          description: Task status retrieved successfully
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/MusicStatusResponse'
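+              # Illustrative example of a finished task, assuming the hypothetical task ID
+              # "task_abc123"; the URL and duration are placeholders, not real data.
+              example:
+                task_id: "task_abc123"
+                status: completed
+                progress: 100
+                result:
+                  url: "https://example.com/music/task_abc123.mp3"
+                  duration: 60.0
+                  format: mp3
+                error: null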
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+        '404':
+          $ref: '#/components/responses/NotFound'
+
+  /video/generations:
+    post:
+      operationId: createVideo
+      tags:
+        - Video
+      summary: Create video
+      description: |
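+        Generates video from a text prompt, optionally guided by a source image.
+        Available models: `runway-gen3` and `stability-video`.
+
+        **Note**: This is a planned feature and implementation details may change.
+        Until it is implemented, requests may return `501`.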
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/CreateVideoRequest'
+      responses:
+        '202':
+          description: Video generation task accepted
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/VideoTaskResponse'
+        '400':
+          $ref: '#/components/responses/BadRequest'
+        '401':
+          $ref: '#/components/responses/Unauthorized'
+        '501':
+          description: Not implemented yet
+
+components:
+  securitySchemes:
+    ApiKeyAuth:
+      type: apiKey
+      in: header
+      name: X-API-Key
+      description: API key for authentication
+    BearerAuth:
+      type: http
+      scheme: bearer
+      bearerFormat: JWT
+      description: Bearer token authentication
+
+  schemas:
+    # Image Generation Schemas
+    CreateImageRequest:
+      type: object
+      required:
+        - prompt
+      properties:
+        prompt:
+          type: string
+          description: |
+            A text description of the desired image(s). Maximum length:
+            - 32000 characters for `gpt-image-1`
+            - 4000 characters for `dall-e-3`
+            - 1000 characters for `dall-e-2`
+          example: "A cute baby sea otter"
+          maxLength: 32000
+        model:
+          type: string
+          enum:
+            - dall-e-2
+            - dall-e-3
+            - gpt-image-1
+            - gpt-image-1-mini
+            - midjourney
+            - flux
+            - stable-diffusion
+          default: dall-e-2
+          description: The model to use for image generation
+          nullable: true
+        n:
+          type: integer
+          minimum: 1
+          maximum: 10
+          default: 1
+          description: |
+            The number of images to generate. Must be between 1 and 10.
+            For `dall-e-3`, only `n=1` is supported.
+          nullable: true
+        size:
+          type: string
+          enum:
+            - auto
+            - "256x256"
+            - "512x512"
+            - "1024x1024"
+            - "1536x1024"
+            - "1024x1536"
+            - "1792x1024"
+            - "1024x1792"
+          default: auto
+          description: |
+            The size of the generated images:
+            - `gpt-image-1`: `1024x1024`, `1536x1024` (landscape), `1024x1536` (portrait), or `auto` (default)
+            - `dall-e-2`: `256x256`, `512x512`, or `1024x1024`
+            - `dall-e-3`: `1024x1024`, `1792x1024`, or `1024x1792`
+          nullable: true
+        quality:
+          type: string
+          enum:
+            - standard
+            - hd
+            - low
+            - medium
+            - high
+            - auto
+          default: auto
+          description: |
+            The quality of the image that will be generated:
+            - `auto` (default): automatically select the best quality
+            - `high`, `medium`, `low`: supported for `gpt-image-1`
+            - `hd`, `standard`: supported for `dall-e-3`
+            - `standard`: only option for `dall-e-2`
+          nullable: true
+        response_format:
+          type: string
+          enum:
+            - url
+            - b64_json
+          default: url
+          description: |
+            The format in which generated images are returned. Must be one of `url` or `b64_json`.
+            URLs are only valid for 60 minutes after generation.
+            Note: `gpt-image-1` always returns base64-encoded images.
+          nullable: true
+        output_format:
+          type: string
+          enum:
+            - png
+            - jpeg
+            - webp
+          default: png
+          description: |
+            The format in which the generated images are returned.
+            Only supported for `gpt-image-1`. Must be one of `png`, `jpeg`, or `webp`.
+          nullable: true
+        output_compression:
+          type: integer
+          minimum: 0
+          maximum: 100
+          default: 100
+          description: |
+            The compression level (0-100%) for generated images.
+            Only supported for `gpt-image-1` with `webp` or `jpeg` output formats.
+          nullable: true
+        stream:
+          type: boolean
+          default: false
+          description: |
+            Generate the image in streaming mode. Only supported for `gpt-image-1`.
+          nullable: true
+        background:
+          type: string
+          enum:
+            - transparent
+            - opaque
+            - auto
+          default: auto
+          description: |
+            Sets transparency for the background. Only supported for `gpt-image-1`.
+            - `transparent`: requires PNG or WebP output format
+            - `opaque`: solid background
+            - `auto` (default): model determines best background
+          nullable: true
+        style:
+          type: string
+          enum:
+            - vivid
+            - natural
+          default: vivid
+          description: |
+            The style of generated images. Only supported for `dall-e-3`.
+            - `vivid`: hyper-real and dramatic images
+            - `natural`: more natural, less hyper-real images
+          nullable: true
+        moderation:
+          type: string
+          enum:
+            - low
+            - auto
+          default: auto
+          description: |
+            Content moderation level for `gpt-image-1`:
+            - `low`: less restrictive filtering
+            - `auto` (default): standard filtering
+          nullable: true
+        user:
+          type: string
+          description: |
+            A unique identifier representing your end-user, for monitoring and abuse detection.
+          example: "user-1234"
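+      # Illustrative request body, assuming `gpt-image-1`. The values are examples chosen to
+      # satisfy the constraints described above (a transparent background requires PNG or
+      # WebP output); they are not defaults.
+      example:
+        prompt: "A cute baby sea otter"
+        model: gpt-image-1
+        n: 1
+        size: "1024x1024"
+        quality: high
+        output_format: png
+        background: transparent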
+
+    CreateImageEditRequest:
+      type: object
+      required:
+        - image
+        - prompt
+      properties:
+        image:
+          type: array
+          items:
+            type: string
+            format: binary
+          minItems: 1
+          description: |
+            One or more source images to edit. Must be valid PNG files, less than 4MB each.
+        mask:
+          type: string
+          format: binary
+          description: |
+            An image indicating which areas of `image` to edit. Must be a valid PNG file,
+            less than 4MB, and have the same dimensions as `image`.
+          nullable: true
+        prompt:
+          type: string
+          description: A text description of the desired edited image(s)
+          example: "A cute baby sea otter wearing a beret"
+          maxLength: 32000
+        model:
+          type: string
+          enum:
+            - dall-e-2
+            - gpt-image-1
+          default: dall-e-2
+          description: The model to use for image editing
+          nullable: true
+        n:
+          type: integer
+          minimum: 1
+          maximum: 10
+          default: 1
+          description: The number of images to generate
+          nullable: true
+        size:
+          type: string
+          enum:
+            - "256x256"
+            - "512x512"
+            - "1024x1024"
+          default: "1024x1024"
+          nullable: true
+        response_format:
+          type: string
+          enum:
+            - url
+            - b64_json
+          default: url
+          nullable: true
+        stream:
+          type: boolean
+          default: false
+          description: Generate the edit in streaming mode. Only supported for `gpt-image-1`.
+          nullable: true
+        user:
+          type: string
+          nullable: true
+
+    CreateImageVariationRequest:
+      type: object
+      required:
+        - image
+      properties:
+        image:
+          type: string
+          format: binary
+          description: |
+            The image to use as the basis for variations. Must be a valid PNG file,
+            less than 4MB, and square.
+        model:
+          type: string
+          enum:
+            - dall-e-2
+          default: dall-e-2
+          description: Only `dall-e-2` is supported
+          nullable: true
+        n:
+          type: integer
+          minimum: 1
+          maximum: 10
+          default: 1
+          description: The number of variations to generate
+          nullable: true
+        size:
+          type: string
+          enum:
+            - "256x256"
+            - "512x512"
+            - "1024x1024"
+          default: "1024x1024"
+          nullable: true
+        response_format:
+          type: string
+          enum:
+            - url
+            - b64_json
+          default: url
+          nullable: true
+        user:
+          type: string
+          nullable: true
+
+    ImagesResponse:
+      type: object
+      properties:
+        created:
+          type: integer
+          description: Unix timestamp of when the image(s) were created
+        data:
+          type: array
+          items:
+            $ref: '#/components/schemas/ImageObject'
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/UsageInfo'
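+      # Illustrative response for a URL-format result. The timestamp, URL, and revised
+      # prompt are hypothetical placeholders.
+      example:
+        created: 1700000000
+        data:
+          - url: "https://example.com/images/otter.png"
+            revised_prompt: "A cute baby sea otter floating on its back"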
+
+    ImageObject:
+      type: object
+      properties:
+        url:
+          type: string
+          format: uri
+          description: The URL of the generated image (valid for 60 minutes)
+          nullable: true
+        b64_json:
+          type: string
+          description: The base64-encoded image data (present when the image is returned as `b64_json`)
+          nullable: true
+        revised_prompt:
+          type: string
+          description: The revised prompt used to generate the image, if the model rewrote the original prompt
+          nullable: true
+
+    ImageGenerationStreamEvent:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - image_generation.partial_image
+            - image_generation.completed
+        b64_json:
+          type: string
+          description: Base64-encoded partial or complete image
+        partial_image_index:
+          type: integer
+          description: Index of the partial image in streaming sequence
+          nullable: true
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/UsageInfo'
+
+    ImageEditStreamEvent:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - image_edit.partial_image
+            - image_edit.completed
+        b64_json:
+          type: string
+        partial_image_index:
+          type: integer
+          nullable: true
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/UsageInfo'
+
+    # Audio Schemas
+    CreateSpeechRequest:
+      type: object
+      required:
+        - model
+        - input
+        - voice
+      properties:
+        model:
+          type: string
+          enum:
+            - tts-1
+            - tts-1-hd
+            - gpt-4o-mini-tts
+            - gpt-4o-tts
+          description: The TTS model to use
+        input:
+          type: string
+          maxLength: 4096
+          description: The text to generate audio for
+        voice:
+          type: string
+          enum:
+            - alloy
+            - ash
+            - ballad
+            - coral
+            - echo
+            - fable
+            - onyx
+            - nova
+            - sage
+            - shimmer
+            - verse
+          description: The voice to use for generation
+        response_format:
+          type: string
+          enum:
+            - mp3
+            - opus
+            - aac
+            - flac
+            - wav
+            - pcm
+          default: mp3
+          description: The format to return the audio in
+          nullable: true
+        speed:
+          type: number
+          minimum: 0.25
+          maximum: 4.0
+          default: 1.0
+          description: The speed of the generated audio (0.25 to 4.0)
+          nullable: true
+        stream_format:
+          type: string
+          enum:
+            - raw
+            - sse
+          description: |
+            The streaming format:
+            - `raw`: standard binary streaming
+            - `sse`: server-sent events format
+          nullable: true
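+      # Illustrative request body. The input text is a made-up sample; the other values are
+      # taken from the enums and ranges defined above.
+      example:
+        model: tts-1
+        input: "Hello, world."
+        voice: alloy
+        response_format: mp3
+        speed: 1.0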
+
+    CreateTranscriptionRequest:
+      type: object
+      required:
+        - file
+        - model
+      properties:
+        file:
+          type: string
+          format: binary
+          description: |
+            The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.
+            Maximum file size: 25MB.
+        model:
+          type: string
+          enum:
+            - whisper-1
+            - gpt-4o-transcribe
+            - gpt-4o-mini-transcribe
+            - gpt-4o-transcribe-diarize
+          description: The transcription model to use
+        language:
+          type: string
+          description: |
+            The language of the input audio in ISO-639-1 format (e.g., 'en', 'es').
+            Supplying the input language improves accuracy and latency.
+          nullable: true
+        prompt:
+          type: string
+          description: |
+            Optional text to guide the model's style or continue a previous audio segment.
+          nullable: true
+        response_format:
+          type: string
+          enum:
+            - json
+            - text
+            - srt
+            - verbose_json
+            - vtt
+            - diarized_json
+          default: json
+          description: The format of the transcript output
+          nullable: true
+        temperature:
+          type: number
+          minimum: 0
+          maximum: 1
+          default: 0
+          description: The sampling temperature (0 to 1)
+          nullable: true
+        timestamp_granularities:
+          type: array
+          items:
+            type: string
+            enum:
+              - word
+              - segment
+          description: |
+            The timestamp granularities to include in the response.
+            Only applicable for `verbose_json` format.
+          nullable: true
+        stream:
+          type: boolean
+          default: false
+          description: Stream the transcription as it's processed
+          nullable: true
+        chunking_strategy:
+          type: string
+          enum:
+            - auto
+            - fixed
+          description: |
+            Strategy for chunking long audio files (diarization only):
+            - `auto`: automatically determine chunk size
+            - `fixed`: use fixed chunk size
+          nullable: true
+        known_speaker_names:
+          type: array
+          items:
+            type: string
+          description: Array of known speaker names for diarization
+          nullable: true
+        known_speaker_references:
+          type: array
+          items:
+            type: string
+          description: Base64-encoded reference audio clips for the known speakers
+          nullable: true
+
+    CreateTranslationRequest:
+      type: object
+      required:
+        - file
+        - model
+      properties:
+        file:
+          type: string
+          format: binary
+          description: The audio file to translate into English
+        model:
+          type: string
+          enum:
+            - whisper-1
+          description: The translation model to use
+        prompt:
+          type: string
+          nullable: true
+        response_format:
+          type: string
+          enum:
+            - json
+            - text
+            - srt
+            - verbose_json
+            - vtt
+          default: json
+          nullable: true
+        temperature:
+          type: number
+          minimum: 0
+          maximum: 1
+          default: 0
+          nullable: true
+
+    TranscriptionResponseJson:
+      type: object
+      properties:
+        text:
+          type: string
+          description: The transcribed text
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/AudioUsageInfo'
+
+    TranscriptionResponseVerboseJson:
+      type: object
+      properties:
+        task:
+          type: string
+          enum:
+            - transcribe
+        language:
+          type: string
+          description: The detected language
+        duration:
+          type: number
+          description: Duration of the audio in seconds
+        text:
+          type: string
+          description: The transcribed text
+        words:
+          type: array
+          items:
+            $ref: '#/components/schemas/TranscriptionWord'
+          nullable: true
+        segments:
+          type: array
+          items:
+            $ref: '#/components/schemas/TranscriptionSegment'
+          nullable: true
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/AudioUsageInfo'
+
+    TranscriptionResponseDiarizedJson:
+      type: object
+      properties:
+        task:
+          type: string
+          enum:
+            - transcribe
+        duration:
+          type: number
+          description: Duration of the audio in seconds
+        text:
+          type: string
+          description: The full transcribed text with speaker labels
+        segments:
+          type: array
+          items:
+            $ref: '#/components/schemas/DiarizedSegment'
+        usage:
+          nullable: true
+          allOf:
+            - $ref: '#/components/schemas/AudioUsageInfo'
+
+    TranscriptionWord:
+      type: object
+      properties:
+        word:
+          type: string
+        start:
+          type: number
+        end:
+          type: number
+
+    TranscriptionSegment:
+      type: object
+      properties:
+        id:
+          type: integer
+        seek:
+          type: integer
+        start:
+          type: number
+        end:
+          type: number
+        text:
+          type: string
+        tokens:
+          type: array
+          items:
+            type: integer
+        temperature:
+          type: number
+        avg_logprob:
+          type: number
+        compression_ratio:
+          type: number
+        no_speech_prob:
+          type: number
+
+    DiarizedSegment:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - transcript.text.segment
+        id:
+          type: string
+        start:
+          type: number
+        end:
+          type: number
+        text:
+          type: string
+        speaker:
+          type: string
+          description: The identified speaker name or ID
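+      # Illustrative diarized segment. The ID, timings, text, and speaker label are
+      # hypothetical values shown only to clarify the shape of the object.
+      example:
+        type: transcript.text.segment
+        id: "seg_001"
+        start: 0.0
+        end: 4.2
+        text: "Thanks for joining the call."
+        speaker: "speaker_1"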
+
+    TranslationResponseJson:
+      type: object
+      properties:
+        text:
+          type: string
+          description: The translated text (in English)
+
+    TranslationResponseVerboseJson:
+      type: object
+      properties:
+        task:
+          type: string
+          enum:
+            - translate
+        language:
+          type: string
+          description: The source language detected
+        duration:
+          type: number
+        text:
+          type: string
+
+    TranscriptionStreamEvent:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - transcript.text.delta
+            - transcript.completed
+        delta:
+          type: string
+          description: Incremental transcription text
+          nullable: true
+        text:
+          type: string
+          description: Full transcription text (on completed)
+          nullable: true
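+      # Illustrative incremental event. A `transcript.completed` event would carry the full
+      # transcript in `text` instead of `delta`. The text shown is a made-up sample.
+      example:
+        type: transcript.text.delta
+        delta: "Hello"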
+
+    SpeechStreamEvent:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - speech.audio_delta
+            - speech.completed
+        delta:
+          type: string
+          format: byte
+          description: Base64-encoded audio chunk
+          nullable: true
+
+    # Music Generation Schemas
+    CreateMusicRequest:
+      type: object
+      required:
+        - prompt
+      properties:
+        prompt:
+          type: string
+          description: Text description of the music to generate
+          example: "An upbeat electronic dance track with heavy bass"
+        model:
+          type: string
+          enum:
+            - suno-v3
+            - suno-v3.5
+          default: suno-v3.5
+          nullable: true
+        duration:
+          type: integer
+          minimum: 10
+          maximum: 300
+          description: Duration of the music in seconds (10-300)
+          nullable: true
+        style:
+          type: string
+          description: Musical style or genre
+          nullable: true
+        instrumental:
+          type: boolean
+          default: false
+          description: Whether to generate instrumental music (no vocals)
+          nullable: true
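+      # Illustrative request body for a 60-second instrumental track. The style value is a
+      # hypothetical free-text genre; the other values follow the constraints above.
+      example:
+        prompt: "An upbeat electronic dance track with heavy bass"
+        model: suno-v3.5
+        duration: 60
+        style: "electronic"
+        instrumental: true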
+
+    MusicTaskResponse:
+      type: object
+      properties:
+        task_id:
+          type: string
+          description: Unique identifier for the music generation task
+        status:
+          type: string
+          enum:
+            - pending
+            - processing
+          description: Current status of the task
+        estimated_completion_time:
+          type: integer
+          description: Estimated time to completion in seconds
+          nullable: true
+
+    MusicStatusResponse:
+      type: object
+      properties:
+        task_id:
+          type: string
+        status:
+          type: string
+          enum:
+            - pending
+            - processing
+            - completed
+            - failed
+        progress:
+          type: integer
+          minimum: 0
+          maximum: 100
+          description: Progress percentage (0-100)
+        result:
+          type: object
+          nullable: true
+          properties:
+            url:
+              type: string
+              format: uri
+              description: URL to download the generated music
+            duration:
+              type: number
+              description: Actual duration of the generated music in seconds
+            format:
+              type: string
+              description: Audio format (e.g., mp3, wav)
+        error:
+          type: string
+          nullable: true
+          description: Error message if the task failed
+
+    # Video Generation Schemas
+    CreateVideoRequest:
+      type: object
+      required:
+        - prompt
+      properties:
+        prompt:
+          type: string
+          description: Text description of the video to generate
+        model:
+          type: string
+          enum:
+            - runway-gen3
+            - stability-video
+          nullable: true
+        duration:
+          type: integer
+          minimum: 2
+          maximum: 10
+          description: Duration of the video in seconds
+          nullable: true
+        resolution:
+          type: string
+          enum:
+            - "512x512"
+            - "768x768"
+            - "1024x1024"
+            - "1280x720"
+            - "1920x1080"
+          default: "1280x720"
+          nullable: true
+        fps:
+          type: integer
+          enum:
+            - 24
+            - 30
+            - 60
+          default: 30
+          description: Frames per second
+          nullable: true
+        source_image:
+          type: string
+          format: byte
+          description: Optional base64-encoded source image for image-to-video generation
+          nullable: true
+
+    VideoTaskResponse:
+      type: object
+      properties:
+        task_id:
+          type: string
+        status:
+          type: string
+          enum:
+            - pending
+            - processing
+        estimated_completion_time:
+          type: integer
+          nullable: true
+
+    # Common Schemas
+    UsageInfo:
+      type: object
+      properties:
+        total_tokens:
+          type: integer
+        input_tokens:
+          type: integer
+        output_tokens:
+          type: integer
+        input_tokens_details:
+          type: object
+          properties:
+            text_tokens:
+              type: integer
+            image_tokens:
+              type: integer
+
+    AudioUsageInfo:
+      type: object
+      properties:
+        type:
+          type: string
+          enum:
+            - tokens
+            - duration
+        input_tokens:
+          type: integer
+          nullable: true
+        input_token_details:
+          type: object
+          nullable: true
+          properties:
+            text_tokens:
+              type: integer
+            audio_tokens:
+              type: integer
+        output_tokens:
+          type: integer
+          nullable: true
+        total_tokens:
+          type: integer
+          nullable: true
+        seconds:
+          type: number
+          nullable: true
+          description: Duration in seconds (for duration-based usage)
+
+    Error:
+      type: object
+      properties:
+        error:
+          type: object
+          properties:
+            message:
+              type: string
+              description: A human-readable error message
+            type:
+              type: string
+              description: Error type identifier
+            code:
+              type: string
+              description: Error code
+              nullable: true
+            param:
+              type: string
+              description: The parameter that caused the error
+              nullable: true
+
+  responses:
+    BadRequest:
+      description: Bad request - Invalid parameters
+      content:
+        application/json:
+          schema:
+            $ref: '#/components/schemas/Error'
+          example:
+            error:
+              message: "Invalid parameter: prompt is required"
+              type: "invalid_request_error"
+              code: "invalid_parameter"
+              param: "prompt"
+
+    Unauthorized:
+      description: Unauthorized - Invalid or missing API key
+      content:
+        application/json:
+          schema:
+            $ref: '#/components/schemas/Error'
+          example:
+            error:
+              message: "Invalid API key provided"
+              type: "invalid_request_error"
+              code: "invalid_api_key"
+
+    NotFound:
+      description: Resource not found
+      content:
+        application/json:
+          schema:
+            $ref: '#/components/schemas/Error'
+          example:
+            error:
+              message: "The requested resource was not found"
+              type: "invalid_request_error"
+              code: "not_found"
+
+    RateLimitExceeded:
+      description: Rate limit exceeded
+      content:
+        application/json:
+          schema:
+            $ref: '#/components/schemas/Error'
+          example:
+            error:
+              message: "Rate limit exceeded. Please try again later."
+              type: "rate_limit_error"
+              code: "rate_limit_exceeded"
+
+    InternalServerError:
+      description: Internal server error
+      content:
+        application/json:
+          schema:
+            $ref: '#/components/schemas/Error'
+          example:
+            error:
+              message: "An internal server error occurred"
+              type: "server_error"
+              code: "internal_error"