
Inferno API Documentation

Table of Contents

  • Overview
  • Authentication
  • REST API Endpoints
  • WebSocket API
  • OpenAI-Compatible API
  • Metrics & Monitoring
  • Error Handling
  • Rate Limiting
  • Examples
  • SDK Support
  • Webhooks
  • API Versioning
  • Security Best Practices
  • Support

Overview

Inferno provides multiple API interfaces for AI/ML model inference:

  • REST API: Standard HTTP endpoints for synchronous inference
  • WebSocket API: Real-time bidirectional streaming
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API
  • Metrics API: Prometheus-compatible metrics endpoint

Base URL

http://localhost:8080

Content Types

  • Request: application/json
  • Response: application/json
  • Streaming: text/event-stream (SSE) or WebSocket

Authentication

Inferno supports multiple authentication methods:

API Key Authentication

Include your API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

JWT Token Authentication

For session-based authentication:

Authorization: Bearer YOUR_JWT_TOKEN

Obtaining Credentials

Generate API Key

inferno security api-key create --user USER_ID --name "My API Key"

Login for JWT Token

curl -X POST http://localhost:8080/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "pass"}'

REST API Endpoints

Health Check

Check service health status.

GET /health

Response:

{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "models_loaded": 2
}

List Models

Get available models.

GET /models

Response:

{
  "models": [
    {
      "id": "llama-2-7b",
      "name": "Llama 2 7B",
      "type": "gguf",
      "size_bytes": 7516192768,
      "loaded": true,
      "context_size": 4096,
      "capabilities": ["text-generation", "embeddings"]
    }
  ]
}

Load Model

Load a model into memory.

POST /models/{model_id}/load

Request:

{
  "gpu_layers": 32,
  "context_size": 2048,
  "batch_size": 512
}

Response:

{
  "status": "loaded",
  "model_id": "llama-2-7b",
  "memory_usage_bytes": 8589934592,
  "load_time_ms": 5432
}
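
A minimal Python sketch of the load call, using the request and response fields shown above (the API key is a placeholder):

import requests

BASE_URL = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Load the model with explicit GPU offload, context, and batch settings.
resp = requests.post(
    f"{BASE_URL}/models/llama-2-7b/load",
    headers=HEADERS,
    json={"gpu_layers": 32, "context_size": 2048, "batch_size": 512},
)
resp.raise_for_status()
info = resp.json()
print(f"loaded in {info['load_time_ms']} ms, "
      f"{info['memory_usage_bytes'] / 2**30:.1f} GiB resident")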

Unload Model

Unload a model from memory.

POST /models/{model_id}/unload

Response:

{
  "status": "unloaded",
  "model_id": "llama-2-7b"
}

Inference

Run inference on a loaded model.

POST /inference

Request:

{
  "model": "llama-2-7b",
  "prompt": "What is the capital of France?",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1,
  "stop": ["\n", "###"],
  "stream": false
}

Response:

{
  "id": "inf_123456",
  "model": "llama-2-7b",
  "choices": [
    {
      "text": "The capital of France is Paris.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 7,
    "total_tokens": 15
  },
  "created": 1704067200,
  "processing_time_ms": 234
}

Streaming Inference

Stream inference results using Server-Sent Events.

POST /inference/stream

Request: Same as regular inference with "stream": true

Response (SSE):

data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}

Embeddings

Generate text embeddings.

POST /embeddings

Request:

{
  "model": "llama-2-7b",
  "input": ["Hello world", "How are you?"],
  "encoding_format": "float"
}

Response:

{
  "model": "llama-2-7b",
  "data": [
    {
      "embedding": [0.023, -0.445, 0.192, ...],
      "index": 0
    },
    {
      "embedding": [0.011, -0.234, 0.567, ...],
      "index": 1
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
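
Embeddings are plain float vectors, so downstream similarity math needs no extra tooling. A short Python sketch that requests the two embeddings above and compares them (endpoint and field names as documented; cosine similarity is one common choice):

import math
import requests

resp = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "llama-2-7b",
          "input": ["Hello world", "How are you?"],
          "encoding_format": "float"},
)
resp.raise_for_status()
a, b = [item["embedding"] for item in resp.json()["data"]]

# Cosine similarity between the two returned vectors.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")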

Batch Processing

Submit batch inference jobs.

POST /batch

Request:

{
  "model": "llama-2-7b",
  "requests": [
    {"id": "req1", "prompt": "What is AI?"},
    {"id": "req2", "prompt": "Explain quantum computing"}
  ],
  "max_tokens": 100,
  "webhook_url": "https://example.com/webhook"
}

Response:

{
  "batch_id": "batch_789",
  "status": "processing",
  "total_requests": 2,
  "created": 1704067200
}

Get Batch Status

Check batch job status.

GET /batch/{batch_id}

Response:

{
  "batch_id": "batch_789",
  "status": "completed",
  "completed": 2,
  "failed": 0,
  "total": 2,
  "results_url": "/batch/batch_789/results"
}
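
A submit-and-poll sketch in Python (webhook_url is omitted here in favor of polling; the shape of the payload behind results_url is not documented above, so it is printed raw):

import time
import requests

BASE_URL = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

batch = requests.post(
    f"{BASE_URL}/batch",
    headers=HEADERS,
    json={"model": "llama-2-7b",
          "requests": [{"id": "req1", "prompt": "What is AI?"},
                       {"id": "req2", "prompt": "Explain quantum computing"}],
          "max_tokens": 100},
).json()

# Poll until the batch reaches a terminal state.
while True:
    status = requests.get(f"{BASE_URL}/batch/{batch['batch_id']}",
                          headers=HEADERS).json()
    if status["status"] != "processing":
        break
    time.sleep(2)

results = requests.get(BASE_URL + status["results_url"], headers=HEADERS)
print(results.json())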

WebSocket API

Connect to the WebSocket endpoint for real-time streaming:

ws://localhost:8080/ws

Connection

const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'auth',
    token: 'YOUR_API_KEY'
  }));
};

Request Format

{
  "type": "inference",
  "id": "req_123",
  "model": "llama-2-7b",
  "prompt": "Tell me a story",
  "max_tokens": 200,
  "stream": true
}

Response Format

{
  "type": "token",
  "id": "req_123",
  "token": "Once",
  "index": 0
}

Message Types

  • auth: Authentication
  • inference: Inference request
  • cancel: Cancel ongoing inference
  • ping/pong: Keep-alive
  • error: Error message
  • token: Streaming token
  • complete: Inference complete
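
Putting the message types together, a minimal Python sketch of the full flow, assuming the third-party websockets package (auth, then a streaming inference request, then tokens until complete):

import asyncio
import json
import websockets  # pip install websockets

async def stream(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        # Authenticate first, then issue the inference request.
        await ws.send(json.dumps({"type": "auth", "token": "YOUR_API_KEY"}))
        await ws.send(json.dumps({
            "type": "inference", "id": "req_1", "model": "llama-2-7b",
            "prompt": prompt, "max_tokens": 200, "stream": True,
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "token":
                print(msg["token"], end="", flush=True)
            elif msg["type"] in ("complete", "error"):
                break

asyncio.run(stream("Tell me a story"))

An in-flight request can presumably be cancelled by sending {"type": "cancel", "id": "req_1"} on the same connection, though the exact cancel payload is not specified above.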

OpenAI-Compatible API

Inferno provides OpenAI API compatibility for easy migration.

Chat Completions

POST /v1/chat/completions

Request:

{
  "model": "llama-2-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather like?"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "stream": false
}

Response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "llama-2-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I don't have access to real-time weather data..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 15,
    "total_tokens": 35
  }
}
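
Because the endpoints mirror OpenAI's, the official openai Python client (v1+) can typically be pointed at Inferno by overriding its base URL; a sketch:

from openai import OpenAI  # pip install openai

# Route the official client to the local Inferno server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_API_KEY")

reply = client.chat.completions.create(
    model="llama-2-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like?"},
    ],
    max_tokens=100,
)
print(reply.choices[0].message.content)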

Completions (Legacy)

POST /v1/completions

Request:

{
  "model": "llama-2-7b",
  "prompt": "Once upon a time",
  "max_tokens": 50,
  "temperature": 0.8
}

Models List

GET /v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-2-7b",
      "object": "model",
      "created": 1704067200,
      "owned_by": "local"
    }
  ]
}

Metrics & Monitoring

Prometheus Metrics

GET /metrics

Response (Prometheus format):

# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234

# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890

OpenTelemetry Traces

GET /traces

Response:

{
  "traces": [
    {
      "trace_id": "abc123",
      "span_id": "def456",
      "operation_name": "inference.llama-2-7b",
      "start_time": "2024-01-01T12:00:00Z",
      "duration_ms": 234,
      "status": "ok"
    }
  ]
}

Custom Metrics

POST /metrics/custom

Request:

{
  "name": "custom_metric",
  "value": 42.5,
  "type": "gauge",
  "labels": {
    "environment": "production"
  }
}
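
A one-call Python sketch mirroring the request body above (authentication is assumed to work as for the other endpoints):

import requests

# Report a gauge reading; labels distinguish deployment environments.
requests.post(
    "http://localhost:8080/metrics/custom",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"name": "custom_metric", "value": 42.5, "type": "gauge",
          "labels": {"environment": "production"}},
).raise_for_status()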

Error Handling

All API errors follow a consistent format:

{
  "error": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'gpt-5' not found",
    "details": {
      "available_models": ["llama-2-7b", "mistral-7b"]
    }
  },
  "request_id": "req_abc123",
  "timestamp": "2024-01-01T12:00:00Z"
}
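
Client code can surface this envelope instead of a bare HTTP status. A small Python helper, assuming every non-2xx response carries the JSON structure above:

import requests

def raise_for_api_error(response: requests.Response) -> None:
    # Turn the structured error envelope into a descriptive exception.
    if response.ok:
        return
    body = response.json()
    err = body["error"]
    raise RuntimeError(
        f"{err['code']}: {err['message']} (request_id={body.get('request_id')})"
    )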

Error Codes

  • INVALID_REQUEST: Malformed request
  • AUTHENTICATION_FAILED: Invalid credentials
  • AUTHORIZATION_FAILED: Insufficient permissions
  • MODEL_NOT_FOUND: Model doesn't exist
  • MODEL_NOT_LOADED: Model not in memory
  • RATE_LIMIT_EXCEEDED: Too many requests
  • CONTEXT_LENGTH_EXCEEDED: Input too long
  • INFERENCE_FAILED: Processing error
  • TIMEOUT: Request timeout
  • INTERNAL_ERROR: Server error

HTTP Status Codes

  • 200 OK: Success
  • 400 Bad Request: Invalid request
  • 401 Unauthorized: Authentication required
  • 403 Forbidden: Access denied
  • 404 Not Found: Resource not found
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server error
  • 503 Service Unavailable: Service overloaded

Rate Limiting

Rate limits are enforced per API key or IP address:

Default Limits

  • Requests per minute: 60
  • Requests per hour: 1000
  • Tokens per minute: 10000
  • Concurrent requests: 10

Rate Limit Headers

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30

Rate Limit Response

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "retry_after": 30
  }
}
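
Retry logic should honor the server's suggested wait rather than hammering the endpoint. A Python sketch that backs off on 429, preferring the X-RateLimit-Reset-After header and falling back to the body's retry_after (the exponential last-resort delay is an assumption):

import time
import requests

def post_with_retry(url: str, max_attempts: int = 5, **kwargs) -> requests.Response:
    # Retry on 429, sleeping for the server-suggested interval.
    for attempt in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait = response.headers.get("X-RateLimit-Reset-After")
        if wait is None:
            wait = response.json()["error"].get("retry_after", 2 ** attempt)
        time.sleep(float(wait))
    return response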

Examples

Python Example

import requests
import json

# Configuration
API_KEY = "your_api_key"
BASE_URL = "http://localhost:8080"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Simple inference
response = requests.post(
    f"{BASE_URL}/inference",
    headers=headers,
    json={
        "model": "llama-2-7b",
        "prompt": "What is machine learning?",
        "max_tokens": 100,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["text"])

# Streaming inference with SSE
import sseclient  # pip install sseclient-py

response = requests.post(
    f"{BASE_URL}/inference/stream",
    headers=headers,
    json={
        "model": "llama-2-7b",
        "prompt": "Explain quantum physics",
        "max_tokens": 200,
        "stream": True
    },
    stream=True
)

client = sseclient.SSEClient(response)
for event in client.events():
    data = json.loads(event.data)
    if "token" in data:
        print(data["token"], end="", flush=True)
    elif "done" in data:
        break

JavaScript/TypeScript Example

// Configuration
const API_KEY = 'your_api_key';
const BASE_URL = 'http://localhost:8080';

// Simple inference
async function runInference(prompt: string): Promise<string> {
  const response = await fetch(`${BASE_URL}/inference`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'llama-2-7b',
      prompt: prompt,
      max_tokens: 100,
      temperature: 0.7
    })
  });

  const result = await response.json();
  return result.choices[0].text;
}

// WebSocket streaming (Node 22+ ships a global WebSocket; on older
// versions, use the 'ws' package: import WebSocket from 'ws')
function streamInference(prompt: string) {
  const ws = new WebSocket(`ws://localhost:8080/ws`);

  ws.onopen = () => {
    // Authenticate
    ws.send(JSON.stringify({
      type: 'auth',
      token: API_KEY
    }));

    // Send inference request
    ws.send(JSON.stringify({
      type: 'inference',
      id: 'req_' + Date.now(),
      model: 'llama-2-7b',
      prompt: prompt,
      max_tokens: 200,
      stream: true
    }));
  };

  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.type === 'token') {
      process.stdout.write(data.token);
    } else if (data.type === 'complete') {
      console.log('\nDone!');
      ws.close();
    } else if (data.type === 'error') {
      console.error('Error:', data.message);
      ws.close();
    }
  };
}

cURL Examples

# Health check
curl http://localhost:8080/health

# List models
curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/models

# Run inference
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }' \
  http://localhost:8080/inference

# Stream inference
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Tell me a joke",
    "max_tokens": 100,
    "stream": true
  }' \
  http://localhost:8080/inference/stream

# OpenAI-compatible chat
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ]
  }' \
  http://localhost:8080/v1/chat/completions

Go Example

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

const (
    API_KEY  = "your_api_key"
    BASE_URL = "http://localhost:8080"
)

type InferenceRequest struct {
    Model       string  `json:"model"`
    Prompt      string  `json:"prompt"`
    MaxTokens   int     `json:"max_tokens"`
    Temperature float64 `json:"temperature"`
}

type InferenceResponse struct {
    Choices []struct {
        Text string `json:"text"`
    } `json:"choices"`
}

func runInference(prompt string) (string, error) {
    reqBody := InferenceRequest{
        Model:       "llama-2-7b",
        Prompt:      prompt,
        MaxTokens:   100,
        Temperature: 0.7,
    }

    jsonData, _ := json.Marshal(reqBody)

    req, err := http.NewRequest("POST", BASE_URL+"/inference",
        bytes.NewBuffer(jsonData))
    if err != nil {
        return "", err
    }

    req.Header.Set("Authorization", "Bearer "+API_KEY)
    req.Header.Set("Content-Type", "application/json")

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var result InferenceResponse
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return "", err
    }

    if len(result.Choices) > 0 {
        return result.Choices[0].Text, nil
    }

    return "", fmt.Errorf("no response")
}

Rust Example

use reqwest;
use serde::{Deserialize, Serialize};

const API_KEY: &str = "your_api_key";
const BASE_URL: &str = "http://localhost:8080";

#[derive(Serialize)]
struct InferenceRequest {
    model: String,
    prompt: String,
    max_tokens: u32,
    temperature: f32,
}

#[derive(Deserialize)]
struct InferenceResponse {
    choices: Vec<Choice>,
}

#[derive(Deserialize)]
struct Choice {
    text: String,
}

async fn run_inference(prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    let request = InferenceRequest {
        model: "llama-2-7b".to_string(),
        prompt: prompt.to_string(),
        max_tokens: 100,
        temperature: 0.7,
    };

    let response = client
        .post(format!("{}/inference", BASE_URL))
        .header("Authorization", format!("Bearer {}", API_KEY))
        .json(&request)
        .send()
        .await?
        .json::<InferenceResponse>()
        .await?;

    response
        .choices
        .first()
        .map(|c| c.text.clone())
        .ok_or_else(|| "no choices returned".into())
}

SDK Support

Official SDKs are planned for:

  • Python (inferno-python)
  • JavaScript/TypeScript (@inferno/client)
  • Go (github.com/inferno-ai/go-client)
  • Rust (inferno-client)
  • Java (io.inferno:client)
  • C# (Inferno.Client)

Webhooks

Configure webhooks for async events:

{
  "webhook_url": "https://example.com/webhook",
  "events": ["inference.complete", "batch.complete", "model.loaded"],
  "secret": "webhook_secret_key"
}

Webhook Payload

{
  "event": "inference.complete",
  "timestamp": "2024-01-01T12:00:00Z",
  "data": {
    "request_id": "req_123",
    "model": "llama-2-7b",
    "tokens_generated": 50,
    "duration_ms": 234
  },
  "signature": "sha256=abcdef123456..."
}
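
The sha256= prefix suggests an HMAC-SHA256 over the raw request body, keyed with the configured webhook secret; that convention is an assumption here, so verify it against Inferno's actual signing scheme before relying on this sketch:

import hashlib
import hmac

def verify_signature(secret: str, raw_body: bytes, signature: str) -> bool:
    # Recompute the assumed HMAC-SHA256 and compare in constant time.
    expected = "sha256=" + hmac.new(
        secret.encode(), raw_body, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)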

API Versioning

The API follows semantic versioning:

  • Current version: v1
  • Version in URL: /v1/endpoint
  • Header: API-Version: 1.0

Deprecation Policy

  • Deprecated endpoints marked with Deprecation header
  • Minimum 6 months notice before removal
  • Migration guides provided

Security Best Practices

  1. Always use HTTPS in production
  2. Rotate API keys regularly
  3. Implement request signing for webhooks
  4. Use rate limiting to prevent abuse
  5. Enable audit logging
  6. Validate and sanitize all inputs
  7. Implement timeout for long-running requests
  8. Use authentication for all endpoints

Support