Quick Start | Documentation | OpenAI API Compatibility | Discord
Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.
Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

- Chat Completions & Embeddings — standard `/v1/chat/completions`, `/v1/completions`, and `/v1/embeddings` endpoints, compatible with any OpenAI client
- Responses API — server-side agentic orchestration with tool calling, MCP server integration, and built-in file search (RAG) in a single API call (learn more)
- Vector Stores & Files — `/v1/vector_stores` and `/v1/files` for managed document storage and search
- Batches — `/v1/batches` for offline batch processing
- Open Responses conformant — the Responses API implementation passes the Open Responses conformance test suite
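The Responses API call with built-in file search can also be sketched as a raw HTTP request. This is a hedged, standard-library-only sketch: it assumes a Llama Stack server at `localhost:8321` and an already-created vector store ID (`vs_123` is a placeholder), and the payload shape follows the OpenAI Responses API convention for the `file_search` tool.

```python
import json
import urllib.request


def build_responses_payload(model, prompt, vector_store_id):
    """Build a /v1/responses request body that enables built-in
    file search (RAG) against a single vector store."""
    return {
        "model": model,
        "input": prompt,
        "tools": [
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
        ],
    }


def post_responses(base_url, payload):
    """POST the payload to the server's Responses endpoint and
    return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (against a running server):
#   payload = build_responses_payload(
#       "llama-3.3-70b", "Summarize the uploaded docs", "vs_123")
#   print(post_responses("http://localhost:8321", payload))
```

Because the tool runs server-side, a single request covers retrieval and generation; no client-side orchestration loop is needed.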
Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                           Llama Stack Server                            │
│                  (same API, same code, any environment)                 │
│                                                                         │
│  /v1/chat/completions   /v1/responses   /v1/vector_stores   /v1/files   │
│  /v1/embeddings         /v1/batches     /v1/models       /v1/connectors │
├───────────────────┬──────────────────┬──────────────────────────────────┤
│  Inference        │  Vector stores   │  Tools & connectors              │
│  Ollama           │  FAISS           │  MCP servers                     │
│  vLLM, TGI        │  Milvus          │  Brave, Tavily (web search)      │
│  AWS Bedrock      │  Qdrant          │  File search (built-in RAG)      │
│  Azure OpenAI     │  PGVector        │                                  │
│  Fireworks        │  ChromaDB        │  File storage & processing       │
│  Together         │  Weaviate        │  Local filesystem, S3            │
│  ...15+ more      │  Elasticsearch   │  PDF, HTML (file processors)     │
│                   │  SQLite-vec      │                                  │
└───────────────────┴──────────────────┴──────────────────────────────────┘
```
See the provider documentation for the full list.
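The provider wiring lives in a distribution's run config. The snippet below is an illustrative sketch only — the provider IDs, types, and URLs are assumptions following the general shape of Llama Stack run configs; check the provider documentation for the exact schema your version expects.

```yaml
# run.yaml (sketch): the application code never changes —
# only the provider block behind the API does.
providers:
  inference:
    # Local development: Ollama
    - provider_id: ollama
      provider_type: remote::ollama
      config:
        url: http://localhost:11434
    # Production: swap in vLLM by changing only this block
    # - provider_id: vllm
    #   provider_type: remote::vllm
    #   config:
    #     url: http://vllm.internal:8000
```

Clients keep pointing at the same Llama Stack endpoints either way; the swap is invisible to application code.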
Install and run a Llama Stack server:
```shell
# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via uv
uv pip install llama-stack

# Start the server (uses the starter distribution with Ollama)
llama stack run
```

Then connect with any OpenAI client — Python, TypeScript, curl, or any framework that speaks the OpenAI API.
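For clients without an OpenAI SDK, the same endpoint can be hit directly over HTTP. A minimal standard-library sketch, assuming a server at `localhost:8321` (the `Bearer fake` token mirrors the placeholder key used above):

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages):
    """Build an OpenAI-compatible /v1/chat/completions request
    without any third-party SDK."""
    body = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer fake",  # placeholder key, as above
        },
    )


# Usage (against a running server):
#   req = build_chat_request("http://localhost:8321", "llama-3.3-70b",
#                            [{"role": "user", "content": "Hello"}])
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```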
See the Quick Start guide for detailed setup.
- Documentation — full reference
- OpenAI API Compatibility — endpoint coverage and provider matrix
- Getting Started Notebook — text and vision inference walkthrough
- Contributing — how to contribute
Client SDKs:
| Language | SDK | Package |
|---|---|---|
| Python | llama-stack-client-python | `llama-stack-client` (PyPI) |
| TypeScript | llama-stack-client-typescript | `llama-stack-client` (npm) |
We hold community calls every Thursday at 9:00 AM PST — see the Community Event on Discord for details.
Thanks to all our amazing contributors!