From 59703b8fe9afdc8f3e391293185aafe07f16fbc9 Mon Sep 17 00:00:00 2001
From: konard
Date: Thu, 30 Oct 2025 05:16:46 +0100
Subject: [PATCH 1/3] Initial commit with task details for issue #13

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: undefined
---
 CLAUDE.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..16bf872
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,5 @@
+Issue to solve: undefined
+Your prepared branch: issue-13-3d80f6f5
+Your prepared working directory: /tmp/gh-issue-solver-1761797804749
+
+Proceed.
\ No newline at end of file
From db18f6fa004e829ec1afc20df1091a94cd17aa84 Mon Sep 17 00:00:00 2001
From: konard
Date: Thu, 30 Oct 2025 05:25:57 +0100
Subject: [PATCH 2/3] Add comprehensive media generation API specification
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit implements issue #13 by providing a complete OpenAI-compatible
API specification for media generation including images, audio, music, and video.
Changes:
- Add media-api-spec.yaml: Complete OpenAPI 3.1 specification with 9 endpoints
- Add media-api-spec.json: JSON version of the specification
- Add MEDIA_API.md: Comprehensive documentation with examples and implementation guide
- Add README_MEDIA_API.md: Quick start guide and overview

Features:
- Image generation, editing, and variations (DALL-E, GPT-Image-1, Midjourney, Flux, Stable Diffusion)
- Text-to-speech with 11 voices (TTS-1, GPT-4o-TTS)
- Speech-to-text with diarization (Whisper, GPT-4o-transcribe)
- Audio translation to English
- Music generation with Suno AI (async task-based)
- Video generation specification (planned)

Implementation details:
- Provider abstraction layer for seamless switching
- Streaming support for progressive generation
- Async task management for long-running operations
- Auto failover between providers
- OpenAI SDK compatibility
- Comprehensive error handling
- Usage tracking and monitoring

Documentation includes:
- Complete API reference with request/response examples
- Provider comparison tables
- Implementation guide for API Gateway integration
- Database schemas for task and usage tracking
- Error handling best practices
- Rate limiting strategies
- Code examples in Python, Node.js, and cURL

This specification is ready for implementation in the api-gateway service
and can be integrated with telegram-bot and other Deep Assistant services.
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 MEDIA_API.md        |  954 ++++++++++++++++++++++++
 README_MEDIA_API.md |  402 +++++++++++
 media-api-spec.json | 1680 +++++++++++++++++++++++++++++++++++++++++++
 media-api-spec.yaml | 1332 ++++++++++++++++++++++++++++++++++
 4 files changed, 4368 insertions(+)
 create mode 100644 MEDIA_API.md
 create mode 100644 README_MEDIA_API.md
 create mode 100644 media-api-spec.json
 create mode 100644 media-api-spec.yaml

diff --git a/MEDIA_API.md b/MEDIA_API.md
new file mode 100644
index 0000000..3467114
--- /dev/null
+++ b/MEDIA_API.md
@@ -0,0 +1,954 @@
+# Media Generation API Specification
+
+OpenAI-compatible API specification for generating images, music, video and more using various AI providers.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Features](#features)
+- [API Endpoints](#api-endpoints)
+  - [Image Generation](#image-generation)
+  - [Image Editing](#image-editing)
+  - [Image Variations](#image-variations)
+  - [Text-to-Speech](#text-to-speech)
+  - [Speech-to-Text](#speech-to-text)
+  - [Audio Translation](#audio-translation)
+  - [Music Generation](#music-generation)
+  - [Video Generation](#video-generation)
+- [Provider Support](#provider-support)
+- [Authentication](#authentication)
+- [Usage Examples](#usage-examples)
+- [Implementation Guide](#implementation-guide)
+- [Error Handling](#error-handling)
+
+## Overview
+
+This API specification provides a unified, OpenAI-compatible interface for various media generation services. It enables developers to:
+
+- Generate images from text prompts
+- Edit and create variations of existing images
+- Convert text to speech
+- Transcribe audio to text with speaker diarization
+- Translate audio to English
+- Generate music from descriptions
+- Create videos from text or images
+
+The API follows OpenAI's standards, making it easy to integrate with existing applications and switch between different providers seamlessly.
+
+## Features
+
+### Image Generation
+- Multiple AI models: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
+- Flexible sizing and quality options
+- Streaming support for progressive image generation
+- Transparent backgrounds (PNG/WebP)
+- Style control (vivid vs. natural)
+
+### Audio Services
+- **Text-to-Speech**: Multiple voices, streaming support, various audio formats
+- **Speech-to-Text**: High-accuracy transcription with speaker diarization
+- **Translation**: Automatic translation to English from any language
+
+### Music Generation
+- Text-to-music generation
+- Style and genre control
+- Instrumental or vocal options
+- Async task-based processing
+
+### Video Generation (Planned)
+- Text-to-video generation
+- Image-to-video animation
+- Multiple resolution and FPS options
+
+## API Endpoints
+
+### Image Generation
+
+#### POST `/v1/images/generations`
+
+Generate images from text prompts.
+
+**Request Body:**
+```json
+{
+  "prompt": "A cute baby sea otter",
+  "model": "gpt-image-1",
+  "n": 1,
+  "size": "1024x1024",
+  "quality": "high",
+  "response_format": "b64_json"
+}
+```
+
+**Response:**
+```json
+{
+  "created": 1713833628,
+  "data": [
+    {
+      "b64_json": "iVBORw0KGgoAAAANSUhEUgA...",
+      "revised_prompt": "A cute baby sea otter floating on its back..."
+    }
+  ],
+  "usage": {
+    "total_tokens": 100,
+    "input_tokens": 50,
+    "output_tokens": 50
+  }
+}
+```
+
+**Supported Models:**
+- `dall-e-2`: Fast, economical, basic quality
+- `dall-e-3`: High quality, single image only
+- `gpt-image-1`: Latest model, supports streaming and advanced features
+- `midjourney`: Artistic style generation
+- `flux`: Fast generation
+- `stable-diffusion`: Open-source alternative
+
+### Image Editing
+
+#### POST `/v1/images/edits`
+
+Edit or extend images with AI assistance.
+
+**Request Body (multipart/form-data):**
+```
+image[]:
+image[]: (optional, multiple images)
+mask: (optional)
+prompt: "Add a festive red bow"
+model: "gpt-image-1"
+stream: true
+```
+
+**Streaming Response:**
+```
+event: image_edit.partial_image
+data: {"type":"image_edit.partial_image","b64_json":"...","partial_image_index":0}
+
+event: image_edit.completed
+data: {"type":"image_edit.completed","b64_json":"...","usage":{"total_tokens":100}}
+```
+
+### Image Variations
+
+#### POST `/v1/images/variations`
+
+Create variations of an existing image.
+
+**Request Body (multipart/form-data):**
+```
+image:
+model: "dall-e-2"
+n: 2
+size: "1024x1024"
+```
+
+**Response:**
+```json
+{
+  "created": 1589478378,
+  "data": [
+    {
+      "url": "https://..."
+    },
+    {
+      "url": "https://..."
+    }
+  ]
+}
+```
+
+### Text-to-Speech
+
+#### POST `/v1/audio/speech`
+
+Generate audio from text.
+
+**Request Body:**
+```json
+{
+  "model": "gpt-4o-mini-tts",
+  "input": "The quick brown fox jumped over the lazy dog.",
+  "voice": "alloy",
+  "response_format": "mp3",
+  "speed": 1.0
+}
+```
+
+**Response:**
+Binary audio data (application/octet-stream)
+
+**Available Voices:**
+- `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`
+
+**Supported Formats:**
+- `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`
+
+### Speech-to-Text
+
+#### POST `/v1/audio/transcriptions`
+
+Transcribe audio to text with optional speaker identification.
+
+**Request Body (multipart/form-data):**
+```
+file: