- Overview
- Authentication
- REST API Endpoints
- WebSocket API
- OpenAI-Compatible API
- Metrics & Monitoring
- Error Handling
- Rate Limiting
- Examples
Inferno provides multiple API interfaces for AI/ML model inference:
- REST API: Standard HTTP endpoints for synchronous inference
- WebSocket API: Real-time bidirectional streaming
- OpenAI-Compatible API: Drop-in replacement for OpenAI API
- Metrics API: Prometheus-compatible metrics endpoint
Base URL: http://localhost:8080
- Request: application/json
- Response: application/json
- Streaming: text/event-stream (SSE) or WebSocket
Inferno supports multiple authentication methods:
Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
For session-based authentication, use a JWT:
Authorization: Bearer YOUR_JWT_TOKEN
Create an API key with the CLI:
inferno security api-key create --user USER_ID --name "My API Key"
Obtain a JWT by logging in:
curl -X POST http://localhost:8080/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "pass"}'
Check service health status.
GET /health
Response:
{
"status": "healthy",
"version": "0.1.0",
"uptime_seconds": 3600,
"models_loaded": 2
}
Get available models.
GET /models
Response:
{
"models": [
{
"id": "llama-2-7b",
"name": "Llama 2 7B",
"type": "gguf",
"size_bytes": 7516192768,
"loaded": true,
"context_size": 4096,
"capabilities": ["text-generation", "embeddings"]
}
]
}
Load a model into memory.
POST /models/{model_id}/load
Request:
{
"gpu_layers": 32,
"context_size": 2048,
"batch_size": 512
}
Response:
{
"status": "loaded",
"model_id": "llama-2-7b",
"memory_usage_bytes": 8589934592,
"load_time_ms": 5432
}
Unload a model from memory.
POST /models/{model_id}/unload
Response:
{
"status": "unloaded",
"model_id": "llama-2-7b"
}
Run inference on a loaded model.
POST /inference
Request:
{
"model": "llama-2-7b",
"prompt": "What is the capital of France?",
"max_tokens": 100,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.1,
"stop": ["\n", "###"],
"stream": false
}
Response:
{
"id": "inf_123456",
"model": "llama-2-7b",
"choices": [
{
"text": "The capital of France is Paris.",
"index": 0,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 7,
"total_tokens": 15
},
"created": 1704067200,
"processing_time_ms": 234
}
Stream inference results using Server-Sent Events.
POST /inference/stream
Request: same as regular inference, with "stream": true
Response (SSE):
data: {"token": "The", "index": 0}
data: {"token": " capital", "index": 1}
data: {"token": " of", "index": 2}
data: {"token": " France", "index": 3}
data: {"token": " is", "index": 4}
data: {"token": " Paris", "index": 5}
data: {"token": ".", "index": 6}
data: {"done": true, "finish_reason": "stop"}
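Such a stream can be consumed with plain requests (no SSE library) by iterating over response lines and decoding each data: payload. This is a sketch against the endpoint shown above; the parse_sse_line helper name is illustrative, not part of the API.

```python
import json
import requests

def parse_sse_line(line: bytes):
    """Decode one 'data: {...}' SSE line into a dict; None for other lines."""
    if not line.startswith(b"data: "):
        return None
    return json.loads(line[len(b"data: "):])

def stream_tokens(base_url, api_key, prompt):
    """Yield tokens from POST /inference/stream as they arrive."""
    resp = requests.post(
        f"{base_url}/inference/stream",
        headers={"Authorization": f"Bearer {api_key}",
                 "Accept": "text/event-stream"},
        json={"model": "llama-2-7b", "prompt": prompt,
              "max_tokens": 100, "stream": True},
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        event = parse_sse_line(line)
        if event is None:
            continue  # skip blank keep-alive lines between events
        if event.get("done"):
            break
        yield event["token"]
```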
Generate text embeddings.
POST /embeddings
Request:
{
"model": "llama-2-7b",
"input": ["Hello world", "How are you?"],
"encoding_format": "float"
}
Response:
{
"model": "llama-2-7b",
"data": [
{
"embedding": [0.023, -0.445, 0.192, ...],
"index": 0
},
{
"embedding": [0.011, -0.234, 0.567, ...],
"index": 1
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
Submit batch inference jobs.
POST /batch
Request:
{
"model": "llama-2-7b",
"requests": [
{"id": "req1", "prompt": "What is AI?"},
{"id": "req2", "prompt": "Explain quantum computing"}
],
"max_tokens": 100,
"webhook_url": "https://example.com/webhook"
}
Response:
{
"batch_id": "batch_789",
"status": "processing",
"total_requests": 2,
"created": 1704067200
}
Check batch job status.
GET /batch/{batch_id}
Response:
{
"batch_id": "batch_789",
"status": "completed",
"completed": 2,
"failed": 0,
"total": 2,
"results_url": "/batch/batch_789/results"
}
Connect to the WebSocket endpoint for real-time streaming:
ws://localhost:8080/ws
const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'auth',
token: 'YOUR_API_KEY'
}));
};
Inference request message:
{
"type": "inference",
"id": "req_123",
"model": "llama-2-7b",
"prompt": "Tell me a story",
"max_tokens": 200,
"stream": true
}
Streamed token message:
{
"type": "token",
"id": "req_123",
"token": "Once",
"index": 0
}
Message types:
- auth: Authentication
- inference: Inference request
- cancel: Cancel ongoing inference
- ping/pong: Keep-alive
- error: Error message
- token: Streaming token
- complete: Inference complete
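The WebSocket flow above can be driven from Python; the sketch below assumes the third-party websockets package and the message shapes shown in this section. The helper names are illustrative.

```python
import asyncio
import json

def auth_message(api_key: str) -> str:
    """First message after connecting: authenticate the session."""
    return json.dumps({"type": "auth", "token": api_key})

def inference_message(req_id: str, prompt: str, max_tokens: int = 200) -> str:
    """A streaming inference request message."""
    return json.dumps({
        "type": "inference", "id": req_id, "model": "llama-2-7b",
        "prompt": prompt, "max_tokens": max_tokens, "stream": True,
    })

async def stream(prompt: str, api_key: str) -> str:
    """Authenticate, request inference, and collect streamed tokens."""
    import websockets  # third-party: pip install websockets
    tokens = []
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        await ws.send(auth_message(api_key))
        await ws.send(inference_message("req_1", prompt))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "token":
                tokens.append(msg["token"])
            elif msg["type"] in ("complete", "error"):
                break
    return "".join(tokens)

# asyncio.run(stream("Tell me a story", "YOUR_API_KEY"))
```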
Inferno provides OpenAI API compatibility for easy migration.
POST /v1/chat/completionsRequest:
{
"model": "llama-2-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the weather like?"}
],
"temperature": 0.7,
"max_tokens": 100,
"stream": false
}
Response:
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1704067200,
"model": "llama-2-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I don't have access to real-time weather data..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 15,
"total_tokens": 35
}
}
POST /v1/completions
Request:
{
"model": "llama-2-7b",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.8
}
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "llama-2-7b",
"object": "model",
"created": 1704067200,
"owned_by": "local"
}
]
}
GET /metrics
Response (Prometheus format):
# HELP inferno_inference_requests_total Total inference requests
# TYPE inferno_inference_requests_total counter
inferno_inference_requests_total{model="llama-2-7b"} 1234
# HELP inferno_inference_duration_seconds Inference duration
# TYPE inferno_inference_duration_seconds histogram
inferno_inference_duration_seconds_bucket{le="0.1"} 100
inferno_inference_duration_seconds_bucket{le="0.5"} 450
inferno_inference_duration_seconds_bucket{le="1.0"} 890
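Since the endpoint emits the standard Prometheus text format, a stock Prometheus server can scrape it directly. The job name and interval below are illustrative, not mandated by Inferno:

```yaml
scrape_configs:
  - job_name: inferno
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]
    scrape_interval: 15s
```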
GET /traces
Response:
{
"traces": [
{
"trace_id": "abc123",
"span_id": "def456",
"operation_name": "inference.llama-2-7b",
"start_time": "2024-01-01T12:00:00Z",
"duration_ms": 234,
"status": "ok"
}
]
}
POST /metrics/custom
Request:
{
"name": "custom_metric",
"value": 42.5,
"type": "gauge",
"labels": {
"environment": "production"
}
}
All API errors follow a consistent format:
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'gpt-5' not found",
"details": {
"available_models": ["llama-2-7b", "mistral-7b"]
}
},
"request_id": "req_abc123",
"timestamp": "2024-01-01T12:00:00Z"
}
Error codes:
- INVALID_REQUEST: Malformed request
- AUTHENTICATION_FAILED: Invalid credentials
- AUTHORIZATION_FAILED: Insufficient permissions
- MODEL_NOT_FOUND: Model doesn't exist
- MODEL_NOT_LOADED: Model not in memory
- RATE_LIMIT_EXCEEDED: Too many requests
- CONTEXT_LENGTH_EXCEEDED: Input too long
- INFERENCE_FAILED: Processing error
- TIMEOUT: Request timeout
- INTERNAL_ERROR: Server error
HTTP status codes:
- 200 OK: Success
- 400 Bad Request: Invalid request
- 401 Unauthorized: Authentication required
- 403 Forbidden: Access denied
- 404 Not Found: Resource not found
- 429 Too Many Requests: Rate limit exceeded
- 500 Internal Server Error: Server error
- 503 Service Unavailable: Service overloaded
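Of these, 429 and 503 are transient and worth retrying. The sketch below backs off exponentially and honors the retry_after hint that rate-limit errors carry (see Rate Limiting); the helper names are illustrative.

```python
import time
import requests

RETRYABLE = {429, 503}  # transient: rate limited or temporarily overloaded

def backoff_delay(body: dict, attempt: int) -> float:
    """Server-provided retry_after if present, else exponential backoff."""
    try:
        return float(body["error"]["retry_after"])
    except (KeyError, TypeError, ValueError):
        return float(2 ** attempt)

def post_with_retry(url, headers, payload, max_attempts=5):
    """POST, retrying transient failures with backoff."""
    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code not in RETRYABLE:
            resp.raise_for_status()  # non-retryable errors surface here
            return resp.json()
        try:
            body = resp.json()
        except ValueError:
            body = {}
        time.sleep(backoff_delay(body, attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```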
Rate limits are enforced per API key or IP address:
- Requests per minute: 60
- Requests per hour: 1000
- Tokens per minute: 10000
- Concurrent requests: 10
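A client can also pace itself proactively using the X-RateLimit-* response headers described below, rather than waiting for a 429. The header names come from this section; the even-spreading policy in this helper is an assumption, not part of the API.

```python
def pace_delay(headers: dict) -> float:
    """Seconds to sleep before the next request, from rate-limit headers.

    Spreads the remaining request budget evenly over the rest of the
    window; returns the full reset interval when the budget is spent.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_after = float(headers.get("X-RateLimit-Reset-After", 0))
    if remaining <= 0:
        return reset_after                # budget exhausted: wait it out
    return reset_after / (remaining + 1)  # spread remaining requests
```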
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067260
X-RateLimit-Reset-After: 30
When a limit is exceeded, the API returns 429 with:
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Please retry after 30 seconds.",
"retry_after": 30
}
}
Python:
import requests
import json
# Configuration
API_KEY = "your_api_key"
BASE_URL = "http://localhost:8080"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
# Simple inference
response = requests.post(
f"{BASE_URL}/inference",
headers=headers,
json={
"model": "llama-2-7b",
"prompt": "What is machine learning?",
"max_tokens": 100,
"temperature": 0.7
}
)
result = response.json()
print(result["choices"][0]["text"])
# Streaming inference with SSE
import sseclient
response = requests.post(
f"{BASE_URL}/inference/stream",
headers=headers,
json={
"model": "llama-2-7b",
"prompt": "Explain quantum physics",
"max_tokens": 200,
"stream": True
},
stream=True
)
client = sseclient.SSEClient(response)
for event in client.events():
data = json.loads(event.data)
if "token" in data:
print(data["token"], end="", flush=True)
elif "done" in data:
        break
TypeScript:
// Configuration
const API_KEY = 'your_api_key';
const BASE_URL = 'http://localhost:8080';
// Simple inference
async function runInference(prompt: string): Promise<string> {
const response = await fetch(`${BASE_URL}/inference`, {
method: 'POST',
headers: {
'Authorization': `Bearer ${API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'llama-2-7b',
prompt: prompt,
max_tokens: 100,
temperature: 0.7
})
});
const result = await response.json();
return result.choices[0].text;
}
// WebSocket streaming
function streamInference(prompt: string) {
const ws = new WebSocket(`ws://localhost:8080/ws`);
ws.onopen = () => {
// Authenticate
ws.send(JSON.stringify({
type: 'auth',
token: API_KEY
}));
// Send inference request
ws.send(JSON.stringify({
type: 'inference',
id: 'req_' + Date.now(),
model: 'llama-2-7b',
prompt: prompt,
max_tokens: 200,
stream: true
}));
};
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'token') {
process.stdout.write(data.token);
} else if (data.type === 'complete') {
console.log('\nDone!');
ws.close();
} else if (data.type === 'error') {
console.error('Error:', data.message);
ws.close();
}
};
}
cURL:
# Health check
curl http://localhost:8080/health
# List models
curl -H "Authorization: Bearer $API_KEY" \
http://localhost:8080/models
# Run inference
curl -X POST \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b",
"prompt": "Hello, how are you?",
"max_tokens": 50
}' \
http://localhost:8080/inference
# Stream inference
curl -X POST \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "llama-2-7b",
"prompt": "Tell me a joke",
"max_tokens": 100,
"stream": true
}' \
http://localhost:8080/inference/stream
# OpenAI-compatible chat
curl -X POST \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b",
"messages": [
{"role": "user", "content": "What is 2+2?"}
]
}' \
http://localhost:8080/v1/chat/completions
Go:
package main
import (
"bytes"
"encoding/json"
"fmt"
"net/http"
)
const (
API_KEY = "your_api_key"
BASE_URL = "http://localhost:8080"
)
type InferenceRequest struct {
Model string `json:"model"`
Prompt string `json:"prompt"`
MaxTokens int `json:"max_tokens"`
Temperature float64 `json:"temperature"`
}
type InferenceResponse struct {
Choices []struct {
Text string `json:"text"`
} `json:"choices"`
}
func runInference(prompt string) (string, error) {
reqBody := InferenceRequest{
Model: "llama-2-7b",
Prompt: prompt,
MaxTokens: 100,
Temperature: 0.7,
}
jsonData, err := json.Marshal(reqBody)
if err != nil {
	return "", err
}
req, err := http.NewRequest("POST", BASE_URL+"/inference",
bytes.NewBuffer(jsonData))
if err != nil {
return "", err
}
req.Header.Set("Authorization", "Bearer "+API_KEY)
req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
return "", err
}
defer resp.Body.Close()
var result InferenceResponse
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
	return "", err
}
if len(result.Choices) > 0 {
return result.Choices[0].Text, nil
}
return "", fmt.Errorf("no response")
}
Rust:
use reqwest;
use serde::{Deserialize, Serialize};
const API_KEY: &str = "your_api_key";
const BASE_URL: &str = "http://localhost:8080";
#[derive(Serialize)]
struct InferenceRequest {
model: String,
prompt: String,
max_tokens: u32,
temperature: f32,
}
#[derive(Deserialize)]
struct InferenceResponse {
choices: Vec<Choice>,
}
#[derive(Deserialize)]
struct Choice {
text: String,
}
async fn run_inference(prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
let client = reqwest::Client::new();
let request = InferenceRequest {
model: "llama-2-7b".to_string(),
prompt: prompt.to_string(),
max_tokens: 100,
temperature: 0.7,
};
let response = client
.post(format!("{}/inference", BASE_URL))
.header("Authorization", format!("Bearer {}", API_KEY))
.json(&request)
.send()
.await?
.json::<InferenceResponse>()
.await?;
    let choice = response.choices.first().ok_or("no choices returned")?;
    Ok(choice.text.clone())
}
Official SDKs are planned for:
- Python (inferno-python)
- JavaScript/TypeScript (@inferno/client)
- Go (github.com/inferno-ai/go-client)
- Rust (inferno-client)
- Java (io.inferno:client)
- C# (Inferno.Client)
Configure webhooks for async events:
{
"webhook_url": "https://example.com/webhook",
"events": ["inference.complete", "batch.complete", "model.loaded"],
"secret": "webhook_secret_key"
}
Webhook payload delivered to your endpoint:
{
"event": "inference.complete",
"timestamp": "2024-01-01T12:00:00Z",
"data": {
"request_id": "req_123",
"model": "llama-2-7b",
"tokens_generated": 50,
"duration_ms": 234
},
"signature": "sha256=abcdef123456..."
}
The API follows semantic versioning:
- Current version: v1
- Version in URL: /v1/endpoint
- Header: API-Version: 1.0
Deprecation policy:
- Deprecated endpoints are marked with a Deprecation header
- Minimum six months' notice before removal
- Migration guides provided
- Always use HTTPS in production
- Rotate API keys regularly
- Implement request signing for webhooks
- Use rate limiting to prevent abuse
- Enable audit logging
- Validate and sanitize all inputs
- Implement timeout for long-running requests
- Use authentication for all endpoints
- Documentation: https://github.com/ringo380/inferno/wiki
- GitHub Issues: https://github.com/ringo380/inferno/issues
- GitHub Discussions: https://github.com/ringo380/inferno/discussions
- Enterprise: contact the maintainer for specialized installation assistance and pricing