Dynamic multi-instance manager for HuggingFace Text Embeddings Inference (TEI). Run multiple embedding models simultaneously with intelligent resource management, health monitoring, and automatic recovery.
TEI Manager is designed for teams running multiple embedding models on a single GPU host who want:
- Unified API - One gRPC endpoint to route requests to any model
- Simple operations - REST API for instance lifecycle, no orchestrator required
- Production basics - Health checks, auto-restart, metrics, state persistence
Not a fit if you need: Multi-node clustering, request queuing, per-tenant quotas, or Kubernetes-native autoscaling. For those, consider Ray Serve, vLLM, or KServe.
```mermaid
flowchart LR
    subgraph Clients
        C1[REST Client]
        C2[gRPC Client]
    end
    subgraph TEI Manager
        API[REST API<br/>:9000]
        MUX[gRPC Multiplexer<br/>:9001]
        HM[Health Monitor]
        ST[State Persistence]
    end
    subgraph TEI Instances
        T1[bge-small<br/>GPU 0 · :8080]
        T2[bge-large<br/>GPU 1 · :8081]
        T3[splade<br/>GPU 0 · :8082]
    end
    C1 --> API
    C2 --> MUX
    API --> T1 & T2 & T3
    MUX --> T1 & T2 & T3
    HM --> T1 & T2 & T3
    ST -.-> API
```
Request flow:
- Clients send embedding requests to the gRPC Multiplexer (port 9001)
- Multiplexer routes to the target instance based on `instance_name`
- TEI instance processes the request on its assigned GPU
- Response returns through the multiplexer to the client
Management flow:
- REST API (port 9000) handles instance lifecycle (create/start/stop/delete)
- Health Monitor checks each instance periodically, auto-restarts on failure
- State Persistence saves instance configs to disk for crash recovery
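The auto-restart behavior can be sketched as a simple failure counter. This is a hypothetical illustration in Python (the actual monitor is part of the Rust codebase; the names `HealthState`, `should_restart`, and the threshold of 3 are assumptions for the sketch):

```python
# Illustrative sketch of an auto-restart decision: reset the failure count on
# a healthy check, and trigger a restart after `threshold` consecutive misses.
from dataclasses import dataclass

@dataclass
class HealthState:
    consecutive_failures: int = 0

def should_restart(state: HealthState, check_ok: bool, threshold: int = 3) -> bool:
    """Fold one health-check result into the state; True means restart now."""
    if check_ok:
        state.consecutive_failures = 0
        return False
    state.consecutive_failures += 1
    return state.consecutive_failures >= threshold

s = HealthState()
results = [True, False, False, False]   # instance goes unhealthy
decisions = [should_restart(s, ok) for ok in results]
print(decisions)  # the final failed check trips the restart threshold
```

Counting consecutive failures (rather than restarting on the first miss) avoids restart loops on transient slowness.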
- Dynamic Instance Management - Create, start, stop, restart, and delete TEI instances via REST API
- Model Registry - Track, download, and verify HuggingFace models before deployment
- Multi-GPU Support - Pin instances to specific GPUs or share across all available GPUs
- gRPC Multiplexer - Unified streaming gRPC endpoint for routing requests to multiple instances
- Arrow Batch Embeddings - High-throughput batch embedding via Arrow IPC with LZ4 compression
- Rust Benchmark Client - Built-in gRPC client for benchmarking and integration examples
- State Persistence - Automatic state saving with atomic writes and crash recovery
- Health Monitoring - Continuous health checks with configurable auto-restart on failure
- Prometheus Metrics - Built-in metrics export for monitoring instance lifecycle and operations
- mTLS Authentication - Optional mutual TLS for secure gRPC connections
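The "atomic writes" used for state persistence follow a standard pattern: write to a temporary file in the same directory, flush to disk, then rename over the target, so a crash mid-write never leaves a truncated state file. A Python sketch of the pattern (the manager itself implements this in Rust; the function name is hypothetical):

```python
# Atomic-write sketch: temp file + fsync + rename. os.replace is atomic on
# POSIX filesystems, so readers see either the old or the new file, never a
# partial one.
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())              # ensure bytes hit disk before rename
        os.replace(tmp, path)                 # atomic swap into place
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("state.toml", "api_port = 9000\n")
print(open("state.toml").read())
```

The temp file must live in the same directory as the target: `os.replace` across filesystems is not atomic.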
TEI Manager images are built on the TEI gRPC base images, which provide GPU-optimized kernels for embedding inference.
Tag format: `{manager_version}-tei-{tei_version}[-{variant}]`
See latest releases for current image tags.
| Variant | Tag suffix | Base Image | Target |
|---|---|---|---|
| Multi-arch | (none) | `text-embeddings-inference:{tei}-grpc` | Auto-detect GPU |
| CPU | `-cpu` | `text-embeddings-inference:cpu-{tei}-grpc` | No GPU required |
| Ada | `-ada` | `text-embeddings-inference:89-{tei}-grpc` | RTX 40xx, L4, L40, L40S |
| Hopper | `-hopper` | `text-embeddings-inference:hopper-{tei}-grpc` | H100, H200 |
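For scripting against the registry, the tag format can be parsed with a small helper (illustrative only; the helper and the version numbers below are made up, not part of TEI Manager):

```python
# Parse the Docker tag format {manager_version}-tei-{tei_version}[-{variant}]
# into its components. Variant names match the table above.
import re

TAG_RE = re.compile(
    r"^(?P<manager>[^-]+)-tei-(?P<tei>[^-]+)(?:-(?P<variant>cpu|ada|hopper))?$"
)

def parse_tag(tag: str) -> dict:
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not a TEI Manager tag: {tag}")
    return m.groupdict()  # variant is None for the multi-arch tag

print(parse_tag("0.5.0-tei-1.6-hopper"))
print(parse_tag("0.5.0-tei-1.6"))
```

Note the regex assumes plain dotted versions; pre-release versions containing hyphens would need a stricter pattern.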
```bash
# Pull the image for your GPU architecture (replace <version> from the latest release)
docker pull ghcr.io/nazq/tei-manager:<version>         # Multi-arch (auto-detect)
docker pull ghcr.io/nazq/tei-manager:<version>-cpu     # CPU-only (no GPU)
docker pull ghcr.io/nazq/tei-manager:<version>-ada     # Ada (RTX 40xx, L4, L40, L40S)
docker pull ghcr.io/nazq/tei-manager:<version>-hopper  # Hopper (H100, H200)

# Run with GPU support
docker run -d --gpus all \
  --name tei-manager \
  -p 9000:9000 \
  -p 9001:9001 \
  -p 8080-8089:8080-8089 \
  ghcr.io/nazq/tei-manager:<version>
```

```bash
# Create an embedding instance
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "bge-small", "model_id": "BAAI/bge-small-en-v1.5"}'

# Wait for the instance to be ready (~30s for model download)
curl http://localhost:9000/instances/bge-small

# Generate embeddings via REST (direct to TEI)
curl -X POST http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello world"}'
```

```bash
# Generate embeddings via gRPC multiplexer
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"},
  "request": {"inputs": "Hello world", "truncate": true, "normalize": true}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed

# Get instance info
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Info

# List available services
grpcurl -plaintext localhost:9001 list
```

The gRPC multiplexer provides a unified endpoint for routing embedding requests to any managed instance.
| Method | Description |
|---|---|
| `Embed` | Generate dense embeddings for a single text |
| `EmbedStream` | Streaming dense embeddings |
| `EmbedSparse` | Generate sparse embeddings (SPLADE) |
| `EmbedArrow` | High-throughput batch dense embedding via Arrow IPC |
| `EmbedSparseArrow` | High-throughput batch sparse embedding via Arrow IPC |
| `Rerank` | Rerank documents by relevance |
| `Tokenize` | Tokenize text |
| `Info` | Get model information |
The `EmbedArrow` and `EmbedSparseArrow` endpoints enable high-throughput batch processing using the Apache Arrow IPC format with LZ4 compression:
```bash
# Dense embeddings via Arrow IPC
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"},
  "arrow_ipc": "<base64-encoded-arrow-ipc>",
  "truncate": true,
  "normalize": true
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedArrow

# Sparse embeddings via Arrow IPC (SPLADE models)
grpcurl -plaintext -d '{
  "target": {"instance_name": "splade"},
  "arrow_ipc": "<base64-encoded-arrow-ipc>",
  "truncate": true
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedSparseArrow
```

Benefits:
- Process thousands of texts in a single request
- LZ4 compression reduces network overhead
- Efficient memory layout for batch processing
- Dense: returns embeddings as Arrow `FixedSizeList<Float32>` for zero-copy access
- Sparse: returns embeddings as Arrow `List<Struct<index:u32, value:f32>>` for variable-length sparse vectors
TEI Manager includes a built-in Rust benchmark client for testing throughput and latency. This also serves as a complete example for integrating with the gRPC API from Rust.
```bash
# Build from source
cargo build --release --bin bench-client

# Or run directly
cargo run --release --bin bench-client -- --help
```

```bash
# Standard mode: concurrent single-text requests
bench-client -e http://localhost:9001 -i bge-small \
  --mode standard --num-texts 10000 --batch-size 100

# Arrow mode: batched Arrow IPC requests (recommended for throughput)
bench-client -e http://localhost:9001 -i bge-small \
  --mode arrow --num-texts 100000 --batch-size 1000

# With mTLS
bench-client -e https://localhost:9001 -i bge-small \
  --cert client.pem --key client-key.pem --ca ca.pem \
  --mode arrow --num-texts 100000 --batch-size 1000
```

Example output:

```json
{
  "mode": "arrow",
  "instance_name": "bge-small",
  "num_texts": 100000,
  "batch_size": 1000,
  "num_requests": 100,
  "total_duration_secs": 12.34,
  "throughput_per_sec": 8103.72,
  "successful": 100000,
  "failed": 0
}
```

The bench-client source (`src/bin/bench-client.rs`) demonstrates:
- Connecting to the gRPC multiplexer with and without TLS
- Creating Arrow IPC batches with LZ4 compression
- Sending `EmbedArrow` requests and parsing responses
- Concurrent request handling with Tokio
| Method | Endpoint | Description | Success | Error Codes |
|---|---|---|---|---|
| `GET` | `/health` | Health check | 200 | - |
| `GET` | `/metrics` | Prometheus metrics | 200 | - |
| `GET` | `/instances` | List all instances | 200 | - |
| `GET` | `/instances/{name}` | Get instance details | 200 | 404 `INSTANCE_NOT_FOUND` |
| `POST` | `/instances` | Create new instance | 201 | 409 `INSTANCE_EXISTS`, 422 `PORT_CONFLICT` |
| `DELETE` | `/instances/{name}` | Delete instance | 200 | 404 `INSTANCE_NOT_FOUND` |
| `POST` | `/instances/{name}/start` | Start instance | 200 | 404, 409 `ALREADY_RUNNING` |
| `POST` | `/instances/{name}/stop` | Stop instance | 200 | 404, 409 `NOT_RUNNING` |
| `POST` | `/instances/{name}/restart` | Restart instance | 200 | 404 |
| `GET` | `/instances/{name}/logs` | Get instance logs | 200 | 404 |
| `GET` | `/models` | List all known models | 200 | - |
| `POST` | `/models` | Register a model | 201 | - |
| `GET` | `/models/{id}` | Get model details | 200 | 404 `MODEL_NOT_FOUND` |
| `POST` | `/models/{id}/download` | Download model to cache | 200 | 409 `MODEL_BUSY`, 500 |
| `POST` | `/models/{id}/load` | Smoke test model loading | 200 | 409 `MODEL_BUSY`, 500 |
Error responses include a machine-readable `code` field:

```json
{"error": "Instance not found", "code": "INSTANCE_NOT_FOUND", "timestamp": "..."}
```

```bash
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-model",
    "model_id": "BAAI/bge-small-en-v1.5",
    "gpu_id": 0,
    "max_batch_tokens": 16384,
    "max_concurrent_requests": 512
  }'
```

Required Fields:
- `name` - Unique instance name
- `model_id` - HuggingFace model ID

Optional Fields:
- `port` - HTTP port (auto-assigned if omitted)
- `gpu_id` - GPU to pin the instance to (omit to use all GPUs)
- `max_batch_tokens` - Max tokens per batch (default: 16384)
- `max_concurrent_requests` - Max concurrent requests (default: 512)
- `pooling` - Pooling method (e.g., "splade" for sparse models)
The model registry tracks HuggingFace models and their status. Models are auto-discovered from the HF cache on startup.
```bash
# List all known models
curl http://localhost:9000/models

# Register a model (checks if already cached)
curl -X POST http://localhost:9000/models \
  -H "Content-Type: application/json" \
  -d '{"model_id": "BAAI/bge-small-en-v1.5"}'

# Download a model to cache
curl -X POST "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5/download"

# Smoke test model loading (loads on GPU 0, verifies, unloads)
curl -X POST "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5/load"

# Get model details (cache path, size, metadata, verification status)
curl "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5"
```

Model Status Flow:
- `available` - Model is registered but not downloaded
- `downloading` - Download in progress
- `downloaded` - Model is in the HF cache
- `loading` - Smoke test in progress
- `verified` - Smoke test passed, ready to use
- `failed` - Smoke test failed (check `verification_error`)
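The status flow above can be read as a small state machine. This is an illustrative sketch only, not the registry's actual implementation (which lives in the Rust source):

```python
# Allowed status transitions implied by the model status flow: register,
# download, smoke test, then verified (or failed if the smoke test fails).
TRANSITIONS = {
    "available":   {"downloading"},
    "downloading": {"downloaded"},
    "downloaded":  {"loading"},
    "loading":     {"verified", "failed"},
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())

assert can_transition("available", "downloading")
assert can_transition("loading", "verified")
assert not can_transition("available", "verified")  # must download and smoke test first
```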
Note: Model IDs contain `/`, which must be URL-encoded as `%2F` in paths.
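In scripts, Python's standard library handles this encoding directly:

```python
# Percent-encode the "/" in a HuggingFace model ID for use in a URL path.
# safe="" forces "/" to be encoded as %2F instead of being left as-is.
from urllib.parse import quote

model_id = "BAAI/bge-small-en-v1.5"
path = f"/models/{quote(model_id, safe='')}/download"
print(path)  # /models/BAAI%2Fbge-small-en-v1.5/download
```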
```bash
TEI_MANAGER_API_PORT=9000      # REST API port
TEI_MANAGER_GRPC_PORT=9001     # gRPC multiplexer port
TEI_MANAGER_STATE_FILE=/data/state.toml
TEI_BINARY_PATH=/usr/local/bin/text-embeddings-router
```

```toml
api_port = 9000
grpc_port = 9001
state_file = "/data/state.toml"
health_check_interval_secs = 30
max_instances = 10

# Pre-register models (checked against HF cache on startup)
models = [
  "BAAI/bge-small-en-v1.5",
  "sentence-transformers/all-MiniLM-L6-v2"
]

# Seed instances (auto-started on boot)
[[instances]]
name = "bge-small"
model_id = "BAAI/bge-small-en-v1.5"
gpu_id = 0
```

```bash
# GPU 0: Small model for low latency
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "fast", "model_id": "BAAI/bge-small-en-v1.5", "gpu_id": 0}'

# GPU 1: Large model for quality
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "quality", "model_id": "BAAI/bge-large-en-v1.5", "gpu_id": 1}'

# Route requests to either via gRPC
grpcurl -plaintext -d '{"target": {"instance_name": "fast"}, "request": {"inputs": "Quick query"}}' \
  localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed
grpcurl -plaintext -d '{"target": {"instance_name": "quality"}, "request": {"inputs": "Important document"}}' \
  localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed
```

```bash
# Create a SPLADE sparse-embedding instance
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{
    "name": "splade",
    "model_id": "naver/splade-cocondenser-ensembledistil",
    "pooling": "splade"
  }'

# Generate sparse embeddings
grpcurl -plaintext -d '{
  "target": {"instance_name": "splade"},
  "request": {"inputs": "Information retrieval"}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedSparse
```

```bash
# Install just: cargo install just
just --list          # Show all available commands

# Common workflows
just test            # Run unit tests
just check           # Format check + clippy + all tests
just coverage        # Generate HTML coverage report
just docker-build    # Build Docker image
just pre-commit      # Run before committing
```

- DESIGN.md - Architecture and design decisions
- docs/GRPC_MULTIPLEXER.md - Full gRPC API reference
- docs/DEPLOYMENT.md - Production deployment guide
- docs/MTLS.md - mTLS configuration
- Single host only - No clustering or multi-node coordination
- No request queuing - Requests exceeding TEI's `max_concurrent_requests` return errors immediately
- No per-tenant auth - mTLS authenticates connections, not individual requests
- Port range required - Each instance needs an HTTP port; plan your port range accordingly

Planned:
- Model-based routing (route by `model_id` instead of `instance_name`)
- HTTP embedding endpoint on the manager (avoid direct TEI access)
- Metrics-based instance recommendations
TEI Manager follows Semantic Versioning:
- MAJOR - Breaking changes to REST/gRPC APIs or config format
- MINOR - New features, backward-compatible
- PATCH - Bug fixes only
Docker tag format: `{manager_version}-tei-{tei_version}[-{variant}]`
The manager version tracks our API stability. The TEI version tracks the embedded TEI binary. We test against TEI's gRPC interface and will bump MINOR if TEI changes require manager updates.
Current stability:
- REST API: Stable since v0.4.0
- gRPC API: Stable since v0.3.0
- Config format: Stable since v0.1.0
Apache License 2.0 - see LICENSE for details.