Llama Stack


Quick Start | Documentation | OpenAI API Compatibility | Discord

Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.

Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)

What you get

  • Chat Completions & Embeddings — standard /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints, compatible with any OpenAI client
  • Responses API — server-side agentic orchestration with tool calling, MCP server integration, and built-in file search (RAG) in a single API call (learn more)
  • Vector Stores & Files — /v1/vector_stores and /v1/files for managed document storage and search
  • Batches — /v1/batches for offline batch processing
  • Open Responses conformant — the Responses API implementation passes the Open Responses conformance test suite
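
As a rough sketch, a single Responses API request can combine built-in file search and an MCP server tool. The field names below follow the OpenAI Responses API shape; the model name, vector store ID, and MCP server URL are illustrative placeholders, not values from this repository:

```python
import json

# Illustrative request body for a single /v1/responses call that mixes
# built-in file search (RAG) with an external MCP server tool.
# The model name, vector store ID, and server URL are placeholders.
request_body = {
    "model": "llama-3.3-70b",
    "input": "What does the design doc say about batching?",
    "tools": [
        {"type": "file_search", "vector_store_ids": ["vs_abc123"]},
        {"type": "mcp", "server_label": "docs", "server_url": "http://localhost:3000/mcp"},
    ],
}

print(json.dumps(request_body, indent=2))
```

Both tools are resolved server-side in one call, so the client does not orchestrate the retrieval or tool-call loop itself.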

Use any model, use any infrastructure

Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.

┌─────────────────────────────────────────────────────────────────────────┐
│                          Llama Stack Server                             │
│               (same API, same code, any environment)                    │
│                                                                         │
│  /v1/chat/completions  /v1/responses  /v1/vector_stores  /v1/files      │
│  /v1/embeddings        /v1/batches    /v1/models         /v1/connectors │
├───────────────────┬──────────────────┬──────────────────────────────────┤
│  Inference        │  Vector stores   │  Tools & connectors              │
│    Ollama         │    FAISS         │    MCP servers                   │
│    vLLM, TGI      │    Milvus        │    Brave, Tavily (web search)    │
│    AWS Bedrock    │    Qdrant        │    File search (built-in RAG)    │
│    Azure OpenAI   │    PGVector      │                                  │
│    Fireworks      │    ChromaDB      │  File storage & processing       │
│    Together       │    Weaviate      │    Local filesystem, S3          │
│    ...15+ more    │    Elasticsearch │    PDF, HTML (file processors)   │
│                   │    SQLite-vec    │                                  │
└───────────────────┴──────────────────┴──────────────────────────────────┘

See the provider documentation for the full list.
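
Because the request shape is identical regardless of backend, switching providers amounts to changing the model identifier. A minimal sketch (the model names below are illustrative, not the exact identifiers a given distribution registers):

```python
# The same OpenAI-style chat request works against any configured provider;
# only the model identifier changes. Model names here are illustrative.
def chat_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

dev = chat_request("ollama/llama3.2:3b", "Hello")   # local development via Ollama
prod = chat_request("vllm/llama-3.3-70b", "Hello")  # production via vLLM

# Everything except the model identifier is unchanged.
assert dev["messages"] == prod["messages"]
```
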

Get started

Install and run a Llama Stack server:

# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via uv
uv pip install llama-stack

# Start the server (uses the starter distribution with Ollama)
llama stack run

Then connect with any OpenAI client — Python, TypeScript, curl, or any framework that speaks the OpenAI API.
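
Since the server speaks plain OpenAI-style HTTP, even the Python standard library is enough. This sketch builds the request with urllib against the default port shown above; the network call itself is commented out because it needs a running server:

```python
import json
import urllib.request

# Chat-completions endpoint of a local Llama Stack server (default port 8321).
url = "http://localhost:8321/v1/chat/completions"

payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
)
# with urllib.request.urlopen(req) as resp:   # requires a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The `Authorization` header can carry any placeholder value for a local server, mirroring the `api_key="fake"` in the Python example above.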

See the Quick Start guide for detailed setup.

Resources

Client SDKs:

  • Python — llama-stack-client-python
  • TypeScript — llama-stack-client-typescript

Community

We hold regular community calls every Thursday at 09:00 AM PST — see the Community Event on Discord for details.


Thanks to all our amazing contributors!

