An OpenAI-compatible proxy server for inference backends, starting with SGLang support and designed for extensibility.
This project provides a production-ready inference server that:
- Exposes OpenAI-compatible API endpoints (`/v1/completions`, `/v1/chat/completions`)
- Proxies requests to backend inference frameworks (currently SGLang)
- Supports structured logging with configurable outputs
- Provides graceful shutdown and health monitoring
- Designed for extensibility to support multiple backends (vLLM, TGI, etc.)
- ✅ OpenAI API Compatibility: Drop-in replacement for OpenAI API endpoints
- ✅ SGLang Backend: Full support for SGLang inference framework
- ✅ GPU Environment Tracking: Automatic GPU info collection (model, memory, driver, CUDA version) in every response
- ✅ Extensible Architecture: Abstract backend interface for easy additions
- ✅ Structured Logging: Configurable logging to console and/or file
- ✅ Configuration Flexibility: CLI args, environment variables, or defaults
- ✅ Health Monitoring: Built-in health check endpoint
- ✅ Graceful Shutdown: Proper cleanup on SIGTERM/SIGINT
- ✅ Integration Tests: Comprehensive tests using OpenAI SDK
# Clone the repository
git clone https://github.com/apuslabs/deterministic-inference.git
cd deterministic-inference
# Install with uv
uv sync --prerelease=allow
# Or install in development mode
uv pip install -e .
# Install with test dependencies
uv pip install -e ".[test]"Using the installed command:
deterministic-inference-server --model-path qwen/qwen2.5-0.5b-instructOr using Python module:
python -m deterministic_inference --model-path qwen/qwen2.5-0.5b-instructUsing the OpenAI Python SDK:
from openai import OpenAI
# Point to your local server
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="dummy-key" # Not validated, but required by SDK
)
# Text completion
response = client.completions.create(
model="your-model",
prompt="Once upon a time",
max_tokens=50
)
# Chat completion
response = client.chat.completions.create(
model="your-model",
messages=[
{"role": "user", "content": "Hello!"}
]
)
# The response includes an 'environment' field with GPU information
# Note: Access via the response's raw JSON if needed, as the OpenAI SDK may not expose custom fields

All completion responses include an `environment` field with GPU information:
{
"id": "cmpl-xxx",
"object": "chat.completion",
"created": 1234567890,
"model": "your-model",
"choices": [...],
"usage": {...},
"environment": "{\"gpu_count\": 2, \"gpus\": [{\"index\": 0, \"name\": \"NVIDIA A100\", \"memory_total\": 40960, \"memory_unit\": \"MiB\"}, ...], \"driver_version\": \"535.104.05\", \"cuda_version\": \"12.2+\"}"
}

The `environment` field contains a JSON string with the following (a parsing example follows this list):
- gpu_count: Number of GPUs available
- gpus: Array of GPU details (index, name, memory)
- driver_version: NVIDIA driver version
- cuda_version: CUDA version
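
Because `environment` is itself a JSON string, it needs a second decode after parsing the response body. Below is a minimal sketch using the `requests` library; the URL and model name are placeholders for your deployment:

import json

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# The environment field is a JSON string, so decode it separately.
env = json.loads(data["environment"])
print(env["gpu_count"], env["driver_version"], env["cuda_version"])
for gpu in env["gpus"]:
    print(gpu["index"], gpu["name"], gpu["memory_total"], gpu["memory_unit"])
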
Using curl:
# Health check
curl http://localhost:8080/health
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'

# Flags: --model-path (required), --backend (backend type, default: sglang),
# --backend-host / --backend-port (backend server address),
# --proxy-host / --proxy-port (proxy server address),
# --log-level (log level), --log-file (optional log file)
deterministic-inference-server \
  --model-path /path/to/model \
  --backend sglang \
  --backend-host 127.0.0.1 \
  --backend-port 30000 \
  --proxy-host 0.0.0.0 \
  --proxy-port 8080 \
  --log-level INFO \
  --log-file /var/log/inference.log

export INFERENCE_MODEL_PATH=/path/to/model
export INFERENCE_BACKEND=sglang
export INFERENCE_BACKEND_HOST=127.0.0.1
export INFERENCE_BACKEND_PORT=30000
export INFERENCE_PROXY_HOST=0.0.0.0
export INFERENCE_PROXY_PORT=8080
export INFERENCE_LOG_LEVEL=INFO
export INFERENCE_LOG_FILE=/var/log/inference.log
deterministic-inference-server

Configuration is loaded in the following order, with later sources overriding earlier ones (a sketch of this layering follows the list):
- Defaults - Sensible defaults built into the application
- Environment Variables - Prefixed with `INFERENCE_`
- CLI Arguments - Highest priority, override everything else
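
A minimal sketch of this layering, assuming hypothetical field names in a `Config` dataclass (the real `config.py` may be structured differently):

import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    model_path: Optional[str] = None   # required, so no built-in default
    backend: str = "sglang"            # built-in default
    proxy_host: str = "0.0.0.0"
    proxy_port: int = 8080

def load_config(cli_args: dict) -> Config:
    cfg = Config()                                           # 1. defaults
    if "INFERENCE_MODEL_PATH" in os.environ:                 # 2. environment variables
        cfg.model_path = os.environ["INFERENCE_MODEL_PATH"]
    if "INFERENCE_PROXY_PORT" in os.environ:
        cfg.proxy_port = int(os.environ["INFERENCE_PROXY_PORT"])
    for key, value in cli_args.items():                      # 3. CLI args override everything
        if value is not None:
            setattr(cfg, key, value)
    return cfg
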
# Start with minimal configuration
deterministic-inference-server --model-path /models/llama-2-7b

# Start with production settings
deterministic-inference-server \
--model-path /models/llama-2-7b \
--proxy-host 0.0.0.0 \
--proxy-port 8080 \
--log-level INFO \
--log-file /var/log/inference.log

# Enable debug logging
deterministic-inference-server \
--model-path /models/llama-2-7b \
--log-level DEBUG

This server is designed to be called from Erlang applications:
% Start server process
ServerPort = open_port(
{spawn, "deterministic-inference-server --model-path /path/to/model --proxy-port 8080"},
[exit_status, {line, 1024}]
),
% Wait for server to be ready
timer:sleep(5000),
% Check health (httpc requires the inets application to be started)
{ok, _} = application:ensure_all_started(inets),
{ok, {{_, 200, _}, _, Body}} = httpc:request("http://localhost:8080/health"),
% Use the server...
% Stop server gracefully
port_close(ServerPort).

GET /health
Returns server and backend status:
{
"status": "healthy",
"backend": {
"status": "ready",
"healthy": true,
"url": "http://127.0.0.1:30000"
}
}
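
Because the backend can take a while to load a large model, clients may want to poll this endpoint before sending requests. A rough sketch using `requests`; the base URL and timeout values are placeholders:

import time

import requests

def wait_until_healthy(base_url: str = "http://localhost:8080", timeout_s: float = 300.0) -> None:
    """Poll /health until the proxy reports a healthy backend or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=5)
            if resp.ok and resp.json().get("status") == "healthy":
                return
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(2)
    raise TimeoutError("inference server did not become healthy in time")
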
POST /v1/completions

Compatible with OpenAI's `/v1/completions` endpoint.
POST /v1/chat/completions
Compatible with OpenAI's `/v1/chat/completions` endpoint.
Run the integration tests (requires the server to be running):
# Install test dependencies
uv pip install -e ".[test]"
# Start the server in one terminal
deterministic-inference-server --model-path /path/to/model
# Run tests in another terminal
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=deterministic_inference
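
For reference, an integration test in this style might look like the sketch below. This is a hypothetical example, not the contents of `tests/test_integration.py`, and it assumes a server is already listening on localhost:8080:

import pytest
from openai import OpenAI

@pytest.fixture(scope="module")
def client() -> OpenAI:
    # The API key is not validated by the proxy, but the SDK requires one.
    return OpenAI(base_url="http://localhost:8080/v1", api_key="dummy-key")

def test_chat_completion_returns_content(client: OpenAI) -> None:
    response = client.chat.completions.create(
        model="your-model",
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=16,
    )
    assert response.choices
    assert response.choices[0].message.content
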
deterministic-inference/
├── src/
│ └── deterministic_inference/
│ ├── __init__.py # Package initialization
│ ├── __main__.py # Module entry point
│ ├── cli.py # CLI argument parsing
│ ├── config.py # Configuration management
│ ├── environment.py # GPU environment collection
│ ├── logging_config.py # Logging setup
│ ├── server.py # Main server orchestration
│ ├── backends/
│ │ ├── base.py # Abstract backend interface
│ │ └── sglang.py # SGLang implementation
│ └── proxy/
│ └── handler.py # OpenAI API proxy handler
├── tests/
│ └── test_integration.py # Integration tests
├── docs/
│ └── REQUIREMENTS.md # Detailed requirements
├── pyproject.toml # Package configuration
├── test_gpu_env.py # GPU environment test script
└── README.md # This file
The server consists of four main components:
- Environment Collector (`environment.py`): Collects GPU information at startup using gpustat (see the sketch after this list)
- Backend Manager (`backends/`): Manages the lifecycle of inference backend processes (SGLang, etc.)
- Proxy Server (`proxy/`): HTTP server that implements the OpenAI-compatible API and injects environment info
- Server Orchestrator (`server.py`): Coordinates the lifecycle of all components
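
As a rough illustration of the Environment Collector, the sketch below builds a similar JSON string with gpustat's `new_query()`; the exact fields and helpers in `environment.py` may differ, and the CUDA version is omitted here:

import json

import gpustat

def collect_environment() -> str:
    """Build a JSON string similar to the 'environment' field in responses."""
    stats = gpustat.new_query()
    gpus = [
        {
            "index": gpu.index,
            "name": gpu.name,
            "memory_total": gpu.memory_total,  # gpustat reports MiB
            "memory_unit": "MiB",
        }
        for gpu in stats
    ]
    return json.dumps({
        "gpu_count": len(gpus),
        "gpus": gpus,
        "driver_version": stats.driver_version,
        # cuda_version is collected elsewhere in the real server; omitted in this sketch.
    })
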
To add a new backend (a sketch of the interface follows these steps):

- Create a new class inheriting from `Backend` in `backends/`
- Implement the required methods: `start_server()`, `stop_server()`, `health_check()`, `is_running()`
- Update `server.py` to recognize the new backend type
- Add configuration options in `config.py` and `cli.py`
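
A sketch of what the interface and a new backend skeleton might look like; the method names come from the list above, while the class and constructor details are assumptions:

from abc import ABC, abstractmethod

class Backend(ABC):
    """Abstract backend interface (signatures assumed from the method list above)."""

    @abstractmethod
    def start_server(self) -> None:
        """Launch the backend inference server process."""

    @abstractmethod
    def stop_server(self) -> None:
        """Terminate the backend process and clean up resources."""

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the backend answers its health endpoint."""

    @abstractmethod
    def is_running(self) -> bool:
        """Return True if the backend process is alive."""

class VLLMBackend(Backend):
    """Hypothetical new backend skeleton; vLLM support is not implemented in this project."""

    def start_server(self) -> None: ...
    def stop_server(self) -> None: ...
    def health_check(self) -> bool: ...
    def is_running(self) -> bool: ...
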
Logs include structured information:
- Timestamp
- Log level
- Module name
- Message
Example log output:
2025-10-23 10:30:45 | INFO | deterministic_inference.server | Starting Deterministic Inference Server
2025-10-23 10:30:45 | INFO | deterministic_inference.backends.sglang | Starting SGLang server with model: /models/llama-2-7b
2025-10-23 10:30:50 | INFO | deterministic_inference.backends.sglang | SGLang server is ready
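
A formatter along the following lines would produce log lines in that shape; this is a sketch, and the actual setup in `logging_config.py` may differ:

import logging
from typing import Optional

def setup_logging(level: str = "INFO", log_file: Optional[str] = None) -> None:
    """Configure console (and optional file) logging with a pipe-delimited format."""
    formatter = logging.Formatter(
        fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    root = logging.getLogger()
    root.setLevel(level)
    for handler in handlers:
        handler.setFormatter(formatter)
        root.addHandler(handler)
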
If the server fails to start:

- Check if the model path exists and is accessible
- Verify ports are not already in use
- Check logs for detailed error messages
If the backend fails to start:

- Ensure SGLang is installed: `pip install sglang`
- Check backend logs in the server output
- Verify backend health: `curl http://localhost:8080/health`
- Increase the startup timeout: `--backend-startup-timeout 600`
If clients cannot connect:

- Verify firewall settings
- Check if proxy host/port are correct
- Ensure backend is running: check health endpoint
Contributions are welcome! Please ensure:
- Code follows existing style
- New backends implement the `Backend` interface
- Integration tests pass
- Documentation is updated
MIT License - see LICENSE file for details
Jax - jax@apus.network