
Inferno API Documentation

Table of Contents

  • Overview
  • Authentication
  • REST API Endpoints
  • WebSocket API
  • OpenAI-Compatible API
  • Metrics & Monitoring
  • Error Handling
  • Rate Limiting
  • Examples
  • SDK Support
  • Webhooks
  • API Versioning
  • Security Best Practices
  • Support

Overview

Inferno provides multiple API interfaces for AI/ML model inference:

  • REST API: Standard HTTP endpoints for synchronous inference
  • WebSocket API: Real-time bidirectional streaming
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API
  • Metrics API: Prometheus-compatible metrics endpoint

Base URL

http://localhost:8080

Content Types

  • Request: application/json
  • Response: application/json
  • Streaming: text/event-stream (SSE) or WebSocket

Authentication

Inferno supports multiple authentication methods:

API Key Authentication

Include your API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

JWT Token Authentication

For session-based authentication:

Authorization: Bearer YOUR_JWT_TOKEN

Obtaining Credentials

Generate API Key

inferno security api-key create --user USER_ID --name "My API Key"

Login for JWT Token

curl -X POST http://localhost:8080/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "pass"}'

REST API Endpoints

Health Check

Check service health status.

GET /health

Response:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "models_loaded": 2
}

List Models

Get available models.

GET /models

Response:

{
  "models": [
    {
      "id": "llama-2-7b",
      "name": "Llama 2 7B",
      "type": "gguf",
      "size_bytes": 7516192768,
      "loaded": true,
      "context_size": 4096,
      "capabilities": ["text-generation", "embeddings"]
    }
  ]
}

Load Model

Load a model into memory.

POST /models/{model_id}/load

Request:

{
  "gpu_layers": 32,
  "context_size": 2048,
  "batch_size": 512
}

Response:

{
  "status": "loaded",
  "model_id": "llama-2-7b",
  "memory_usage_bytes": 8589934592,
  "load_time_ms": 5432
}
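
A minimal Python sketch of the load call, using the request and response fields shown above (the API key is a placeholder):

import requests

BASE_URL = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Load the model with explicit GPU offload, context, and batch settings.
resp = requests.post(
    f"{BASE_URL}/models/llama-2-7b/load",
    headers=HEADERS,
    json={"gpu_layers": 32, "context_size": 2048, "batch_size": 512},
)
resp.raise_for_status()
info = resp.json()
print(f"loaded in {info['load_time_ms']} ms, "
      f"{info['memory_usage_bytes'] / 2**30:.1f} GiB resident")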

Unload Model

Unload a model from memory.

POST /models/{model_id}/unload

Response:

{
  "status": "unloaded",
  "model_id": "llama-2-7b"
}

Inference

Run inference on a loaded model.

POST /inference

Request:

{
  "model": "llama-2-7b",
  "prompt": "What is the capital of France?",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1,
  "stop": ["\n", "###"],
  "stream": false
}

Response:

{
  "id": "inf_123456",
  "model": "llama-2-7b",
  "choices": [
    {
      "text": "The capital of France is Paris.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 7,
    "total_tokens": 15
  },
  "created": 1704067200,
  "processing_time_ms": 234
}

Streaming Inference

Stream inference results using Server-Sent Events.

POST /inference/stream

Request: Same as regular inference with "stream": true

Response (SSE):

data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}

Embeddings

Generate text embeddings.

POST /embeddings

Request:

{
  "model": "llama-2-7b",
  "input": ["Hello world", "How are you?"],
  "encoding_format": "float"
}

Response:

{
  "model": "llama-2-7b",
  "data": [
    {
      "embedding": [0.023, -0.445, 0.192, ...],
      "index": 0
    },
    {
      "embedding": [0.011, -0.234, 0.567, ...],
      "index": 1
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
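
Embeddings are plain float vectors, so downstream similarity math needs no extra tooling. A short Python sketch that requests the two embeddings above and compares them (endpoint and field names as documented; cosine similarity is one common choice):

import math
import requests

resp = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "llama-2-7b",
          "input": ["Hello world", "How are you?"],
          "encoding_format": "float"},
)
resp.raise_for_status()
a, b = [item["embedding"] for item in resp.json()["data"]]

# Cosine similarity between the two returned vectors.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")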

Batch Processing

Submit batch inference jobs.

POST /batch

Request:

{
  "model": "llama-2-7b",
  "requests": [
    {"id": "req1", "prompt": "What is AI?"},
    {"id": "req2", "prompt": "Explain quantum computing"}
  ],
  "max_tokens": 100,
  "webhook_url": "https://example.com/webhook"
}

Response:

{
  "batch_id": "batch_789",
  "status": "processing",
  "total_requests": 2,
  "created": 1704067200
}

Get Batch Status

Check batch job status.

GET /batch/{batch_id}

Response:

{
  "batch_id": "batch_789",
  "status": "completed",
  "completed": 2,
  "failed": 0,
  "total": 2,
  "results_url": "/batch/batch_789/results"
}
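
A submit-and-poll sketch in Python (webhook_url is omitted here in favor of polling; the shape of the payload behind results_url is not documented above, so it is printed raw):

import time
import requests

BASE_URL = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

batch = requests.post(
    f"{BASE_URL}/batch",
    headers=HEADERS,
    json={"model": "llama-2-7b",
          "requests": [{"id": "req1", "prompt": "What is AI?"},
                       {"id": "req2", "prompt": "Explain quantum computing"}],
          "max_tokens": 100},
).json()

# Poll until the batch reaches a terminal state.
while True:
    status = requests.get(f"{BASE_URL}/batch/{batch['batch_id']}",
                          headers=HEADERS).json()
    if status["status"] != "processing":
        break
    time.sleep(2)

results = requests.get(BASE_URL + status["results_url"], headers=HEADERS)
print(results.json())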

WebSocket API

Connect to the WebSocket endpoint for real-time streaming:

ws://localhost:8080/ws

Connection

const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'auth',
    token: 'YOUR_API_KEY'
  }));
};

Request Format

{
  "type": "inference",
  "id": "req_123",
  "model": "llama-2-7b",
  "prompt": "Tell me a story",
  "max_tokens": 200,
  "stream": true
}

Response Format

{
  "type": "token",
  "id": "req_123",
  "token": "Once",
  "index": 0
}

Message Types

  • auth: Authentication
  • inference: Inference request
  • cancel: Cancel ongoing inference
  • ping/pong: Keep-alive
  • error: Error message
  • token: Streaming token
  • complete: Inference complete
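
Putting the message types together, a minimal Python sketch of the full flow, assuming the third-party websockets package (auth, then a streaming inference request, then tokens until complete):

import asyncio
import json
import websockets  # pip install websockets

async def stream(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        # Authenticate first, then issue the inference request.
        await ws.send(json.dumps({"type": "auth", "token": "YOUR_API_KEY"}))
        await ws.send(json.dumps({
            "type": "inference", "id": "req_1", "model": "llama-2-7b",
            "prompt": prompt, "max_tokens": 200, "stream": True,
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "token":
                print(msg["token"], end="", flush=True)
            elif msg["type"] in ("complete", "error"):
                break

asyncio.run(stream("Tell me a story"))

An in-flight request can presumably be cancelled by sending {"type": "cancel", "id": "req_1"} on the same connection, though the exact cancel payload is not specified above.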

OpenAI-Compatible API

Inferno provides OpenAI API compatibility for easy migration.

Chat Completions

POST /v1/chat/completions

Request:

{
  "model": "llama-2-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather like?"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "stream": false
}

Response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-2-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I don't have access to real-time weather data..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 15,
    "total_tokens": 35
  }
}
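
Because the endpoints mirror OpenAI's, the official openai Python client (v1+) can typically be pointed at Inferno by overriding its base URL; a sketch:

from openai import OpenAI  # pip install openai

# Route the official client to the local Inferno server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_API_KEY")

reply = client.chat.completions.create(
    model="llama-2-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like?"},
    ],
    max_tokens=100,
)
print(reply.choices[0].message.content)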

Completions (Legacy)

POST /v1/completions

Request:

{
  "model": "llama-2-7b",
  "prompt": "Once upon a time",
  "max_tokens": 50,
  "temperature": 0.8
}

Models List

GET /v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-2-7b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    }
  ]
}

Metrics & Monitoring

Prometheus Metrics

GET /metrics

Response (Prometheus format):

# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234

# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890

OpenTelemetry Traces

GET /traces

Response:

{
  "traces": [
    {
      "trace_id": "abc123",
      "span_id": "def456",
      "operation_name": "inference.llama-2-7b",
      "start_time": "2024-01-01T12:00:00Z",
      "duration_ms": 234,
      "status": "ok"
    }
  ]
}

Custom Metrics

POST /metrics/custom

Request:

{
  "name": "custom_metric",
  "value": 42.5,
  "type": "gauge",
  "labels": {
    "environment": "production"
  }
}
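
A one-call Python sketch mirroring the request body above (authentication is assumed to work as for the other endpoints):

import requests

# Report a gauge reading; labels distinguish deployment environments.
requests.post(
    "http://localhost:8080/metrics/custom",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"name": "custom_metric", "value": 42.5, "type": "gauge",
          "labels": {"environment": "production"}},
).raise_for_status()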

Error Handling

All API errors follow a consistent format:

{
  "error": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'gpt-5' not found",
    "details": {
      "available_models": ["llama-2-7b", "mistral-7b"]
    }
  },
  "request_id": "req_abc123",
  "timestamp": "2024-01-01T12:00:00Z"
}
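
Client code can surface this envelope instead of a bare HTTP status. A small Python helper, assuming every non-2xx response carries the JSON structure above:

import requests

def raise_for_api_error(response: requests.Response) -> None:
    # Turn the structured error envelope into a descriptive exception.
    if response.ok:
        return
    body = response.json()
    err = body["error"]
    raise RuntimeError(
        f"{err['code']}: {err['message']} (request_id={body.get('request_id')})"
    )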

Error Codes

  • INVALID_REQUEST: Malformed request
  • AUTHENTICATION_FAILED: Invalid credentials
  • AUTHORIZATION_FAILED: Insufficient permissions
  • MODEL_NOT_FOUND: Model doesn't exist
  • MODEL_NOT_LOADED: Model not in memory
  • RATE_LIMIT_EXCEEDED: Too many requests
  • CONTEXT_LENGTH_EXCEEDED: Input too long
  • INFERENCE_FAILED: Processing error
  • TIMEOUT: Request timeout
  • INTERNAL_ERROR: Server error

HTTP Status Codes

  • 200 OK: Success
  • 400 Bad Request: Invalid request
  • 401 Unauthorized: Authentication required
  • 403 Forbidden: Access denied
  • 404 Not Found: Resource not found
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server error
  • 503 Service Unavailable: Service overloaded

Rate Limiting

Rate limits are enforced per API key or IP address:

Default Limits

  • Requests per minute: 60
  • Requests per hour: 1000
  • Tokens per minute: 10000
  • Concurrent requests: 10

Rate Limit Headers

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30

Rate Limit Response

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "retry_after": 30
  }
}
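
Retry logic should honor the server's suggested wait rather than hammering the endpoint. A Python sketch that backs off on 429, preferring the X-RateLimit-Reset-After header and falling back to the body's retry_after (the exponential last-resort delay is an assumption):

import time
import requests

def post_with_retry(url: str, max_attempts: int = 5, **kwargs) -> requests.Response:
    # Retry on 429, sleeping for the server-suggested interval.
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait = response.headers.get("X-RateLimit-Reset-After")
        if wait is None:
            wait = response.json()["error"].get("retry_after", 2 ** attempt)
        time.sleep(float(wait))
    return response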

Examples

Python Example

import requests
import json

# Configuration
API_KEY = "your_api_key"
BASE_URL = "http://localhost:8080"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Simple inference
response = requests.post(
    f"{BASE_URL}/inference",
    headers=headers,
    json={
        "model": "llama-2-7b",
        "prompt": "What is machine learning?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["text"])

# Streaming inference with SSE
import sseclient  # pip install sseclient-py

response = requests.post(
    f"{BASE_URL}/inference/stream",
    headers=headers,
    json={
        "model": "llama-2-7b",
        "prompt": "Explain quantum physics",
        "max_tokens": 200,
        "stream": True
    },
    stream=True
)

client = sseclient.SSEClient(response)
for event in client.events():
    data = json.loads(event.data)
    if "token" in data:
        print(data["token"], end="", flush=True)
    elif "done" in data:
        break

JavaScript/TypeScript Example

// Configuration
const API_KEY = 'your_api_key';
const BASE_URL = 'http://localhost:8080';

// Simple inference
async function runInference(prompt: string): Promise<string> {
  const response = await fetch(`${BASE_URL}/inference`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'llama-2-7b',
      prompt: prompt,
      max_tokens: 100,
      temperature: 0.7
    })
  });

  const result = await response.json();
  return result.choices[0].text;
}

// WebSocket streaming (Node 22+ ships a global WebSocket; on older
// versions, use the 'ws' package: import WebSocket from 'ws')
function streamInference(prompt: string) {
  const ws = new WebSocket(`ws://localhost:8080/ws`);

  ws.onopen = () => {
    // Authenticate
    ws.send(JSON.stringify({
      type: 'auth',
      token: API_KEY
    }));

    // Send inference request
    ws.send(JSON.stringify({
      type: 'inference',
      id: 'req_' + Date.now(),
      model: 'llama-2-7b',
      prompt: prompt,
      max_tokens: 200,
      stream: true
    }));
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.type === 'token') {
      process.stdout.write(data.token);
    } else if (data.type === 'complete') {
      console.log('\nDone!');
      ws.close();
    } else if (data.type === 'error') {
      console.error('Error:', data.message);
      ws.close();
    }
  };
}

cURL Examples

# Health check
curl http://localhost:8080/health

# List models
curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/models

# Run inference
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }' \
  http://localhost:8080/inference

# Stream inference
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Tell me a joke",
    "max_tokens": 100,
    "stream": true
  }' \
  http://localhost:8080/inference/stream

# OpenAI-compatible chat
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ]
  }' \
  http://localhost:8080/v1/chat/completions

Go Example

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

const (
    API_KEY  = "your_api_key"
    BASE_URL = "http://localhost:8080"
)

type InferenceRequest struct {
    Model       string  `json:"model"`
    Prompt      string  `json:"prompt"`
    MaxTokens   int     `json:"max_tokens"`
    Temperature float64 `json:"temperature"`
}

type InferenceResponse struct {
    Choices []struct {
        Text string `json:"text"`
    } `json:"choices"`
}

func runInference(prompt string) (string, error) {
    reqBody := InferenceRequest{
        Model:       "llama-2-7b",
        Prompt:      prompt,
        MaxTokens:   100,
        Temperature: 0.7,
    }

    jsonData, _ := json.Marshal(reqBody)

    req, err := http.NewRequest("POST", BASE_URL+"/inference",
        bytes.NewBuffer(jsonData))
    if err != nil {
        return "", err
    }

    req.Header.Set("Authorization", "Bearer "+API_KEY)
    req.Header.Set("Content-Type", "application/json")

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var result InferenceResponse
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return "", err
    }

    if len(result.Choices) > 0 {
        return result.Choices[0].Text, nil
    }

    return "", fmt.Errorf("no response")
}

Rust Example

use reqwest;
use serde::{Deserialize, Serialize};

const API_KEY: &str = "your_api_key";
const BASE_URL: &str = "http://localhost:8080";

#[derive(Serialize)]
struct InferenceRequest {
    model: String,
    prompt: String,
    max_tokens: u32,
    temperature: f32,
}

#[derive(Deserialize)]
struct InferenceResponse {
    choices: Vec<Choice>,
}

#[derive(Deserialize)]
struct Choice {
    text: String,
}

async fn run_inference(prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    let request = InferenceRequest {
        model: "llama-2-7b".to_string(),
        prompt: prompt.to_string(),
        max_tokens: 100,
        temperature: 0.7,
    };

    let response = client
        .post(format!("{}/inference", BASE_URL))
        .header("Authorization", format!("Bearer {}", API_KEY))
        .json(&request)
        .send()
        .await?
        .json::<InferenceResponse>()
        .await?;

    response
        .choices
        .first()
        .map(|c| c.text.clone())
        .ok_or_else(|| "no choices returned".into())
}

SDK Support

Official SDKs are planned for:

  • Python (inferno-python)
  • JavaScript/TypeScript (@inferno/client)
  • Go (github.com/inferno-ai/go-client)
  • Rust (inferno-client)
  • Java (io.inferno:client)
  • C# (Inferno.Client)

Webhooks

Configure webhooks for async events:

{
  "webhook_url": "https://example.com/webhook",
  "events": ["inference.complete", "batch.complete", "model.loaded"],
  "secret": "webhook_secret_key"
}

Webhook Payload

{
  "event": "inference.complete",
  "timestamp": "2024-01-01T12:00:00Z",
  "data": {
    "request_id": "req_123",
    "model": "llama-2-7b",
    "tokens_generated": 50,
    "duration_ms": 234
  },
  "signature": "sha256=abcdef123456..."
}
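
The sha256= prefix suggests an HMAC-SHA256 over the raw request body, keyed with the configured webhook secret; that convention is an assumption here, so verify it against Inferno's actual signing scheme before relying on this sketch:

import hashlib
import hmac

def verify_signature(secret: str, raw_body: bytes, signature: str) -> bool:
    # Recompute the assumed HMAC-SHA256 and compare in constant time.
    expected = "sha256=" + hmac.new(
        secret.encode(), raw_body, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)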

API Versioning

The API follows semantic versioning:

  • Current version: v1
  • Version in URL: /v1/endpoint
  • Header: API-Version: 1.0

Deprecation Policy

  • Deprecated endpoints marked with Deprecation header
  • Minimum 6 months notice before removal
  • Migration guides provided

Security Best Practices

  1. Always use HTTPS in production
  2. Rotate API keys regularly
  3. Implement request signing for webhooks
  4. Use rate limiting to prevent abuse
  5. Enable audit logging
  6. Validate and sanitize all inputs
  7. Implement timeout for long-running requests
  8. Use authentication for all endpoints

Support