
TEI Manager


Dynamic multi-instance manager for HuggingFace Text Embeddings Inference (TEI). Run multiple embedding models simultaneously with intelligent resource management, health monitoring, and automatic recovery.


Who Is This For?

TEI Manager is designed for teams running multiple embedding models on a single GPU host who want:

  • Unified API - One gRPC endpoint to route requests to any model
  • Simple operations - REST API for instance lifecycle, no orchestrator required
  • Production basics - Health checks, auto-restart, metrics, state persistence

Not a fit if you need: Multi-node clustering, request queuing, per-tenant quotas, or Kubernetes-native autoscaling. For those, consider Ray Serve, vLLM, or KServe.


Architecture

flowchart LR
    subgraph Clients
        C1[REST Client]
        C2[gRPC Client]
    end

    subgraph TEI Manager
        API[REST API<br/>:9000]
        MUX[gRPC Multiplexer<br/>:9001]
        HM[Health Monitor]
        ST[State Persistence]
    end

    subgraph TEI Instances
        T1[bge-small<br/>GPU 0 · :8080]
        T2[bge-large<br/>GPU 1 · :8081]
        T3[splade<br/>GPU 0 · :8082]
    end

    C1 --> API
    C2 --> MUX
    API --> T1 & T2 & T3
    MUX --> T1 & T2 & T3
    HM --> T1 & T2 & T3
    ST -.-> API

Request flow:

  1. Clients send embedding requests to the gRPC Multiplexer (port 9001)
  2. Multiplexer routes to the target instance based on instance_name
  3. TEI instance processes the request on its assigned GPU
  4. Response returns through the multiplexer to the client

Management flow:

  • REST API (port 9000) handles instance lifecycle (create/start/stop/delete)
  • Health Monitor checks each instance periodically, auto-restarts on failure
  • State Persistence saves instance configs to disk for crash recovery

Features

  • Dynamic Instance Management - Create, start, stop, restart, and delete TEI instances via REST API
  • Model Registry - Track, download, and verify HuggingFace models before deployment
  • Multi-GPU Support - Pin instances to specific GPUs or share across all available GPUs
  • gRPC Multiplexer - Unified streaming gRPC endpoint for routing requests to multiple instances
  • Arrow Batch Embeddings - High-throughput batch embedding via Arrow IPC with LZ4 compression
  • Rust Benchmark Client - Built-in gRPC client for benchmarking and integration examples
  • State Persistence - Automatic state saving with atomic writes and crash recovery
  • Health Monitoring - Continuous health checks with configurable auto-restart on failure
  • Prometheus Metrics - Built-in metrics export for monitoring instance lifecycle and operations
  • mTLS Authentication - Optional mutual TLS for secure gRPC connections

Docker Images

TEI Manager images are built on the TEI gRPC base images, which provide GPU-optimized kernels for embedding inference.

Tag format: {manager_version}-tei-{tei_version}[-{variant}]

See latest releases for current image tags.

Variant | Tag suffix | Base image | Target
--- | --- | --- | ---
Multi-arch | (none) | text-embeddings-inference:{tei}-grpc | Auto-detect GPU
CPU | -cpu | text-embeddings-inference:cpu-{tei}-grpc | No GPU required
Ada | -ada | text-embeddings-inference:89-{tei}-grpc | RTX 40xx, L4, L40, L40S
Hopper | -hopper | text-embeddings-inference:hopper-{tei}-grpc | H100, H200

Quick Start

Using Docker

# Pull the image for your GPU architecture (replace <version> with a tag from the latest release)
docker pull ghcr.io/nazq/tei-manager:<version>        # Multi-arch (auto-detect)
docker pull ghcr.io/nazq/tei-manager:<version>-cpu     # CPU-only (no GPU)
docker pull ghcr.io/nazq/tei-manager:<version>-ada     # Ada (RTX 40xx, L4, L40, L40S)
docker pull ghcr.io/nazq/tei-manager:<version>-hopper  # Hopper (H100, H200)

# Run with GPU support
docker run -d --gpus all \
  --name tei-manager \
  -p 9000:9000 \
  -p 9001:9001 \
  -p 8080-8089:8080-8089 \
  ghcr.io/nazq/tei-manager:<version>
# Create an embedding instance
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "bge-small", "model_id": "BAAI/bge-small-en-v1.5"}'

# Wait for instance to be ready (~30s for model download)
curl http://localhost:9000/instances/bge-small

# Generate embeddings via REST (direct to TEI)
curl -X POST http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello world"}'

Using gRPC with grpcurl

# Generate embeddings via gRPC multiplexer
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"},
  "request": {"inputs": "Hello world", "truncate": true, "normalize": true}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed

# Get instance info
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Info

# List available services
grpcurl -plaintext localhost:9001 list

gRPC API

The gRPC multiplexer provides a unified endpoint for routing embedding requests to any managed instance.

Available Methods

Method | Description
--- | ---
Embed | Generate dense embeddings for a single text
EmbedStream | Streaming dense embeddings
EmbedSparse | Generate sparse embeddings (SPLADE)
EmbedArrow | High-throughput batch dense embedding via Arrow IPC
EmbedSparseArrow | High-throughput batch sparse embedding via Arrow IPC
Rerank | Rerank documents by relevance
Tokenize | Tokenize text
Info | Get model information

Arrow Batch Embeddings

The EmbedArrow and EmbedSparseArrow endpoints enable high-throughput batch processing using Apache Arrow IPC format with LZ4 compression:

# Dense embeddings via Arrow IPC
grpcurl -plaintext -d '{
  "target": {"instance_name": "bge-small"},
  "arrow_ipc": "<base64-encoded-arrow-ipc>",
  "truncate": true,
  "normalize": true
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedArrow

# Sparse embeddings via Arrow IPC (SPLADE models)
grpcurl -plaintext -d '{
  "target": {"instance_name": "splade"},
  "arrow_ipc": "<base64-encoded-arrow-ipc>",
  "truncate": true
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedSparseArrow
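
The arrow_ipc field above carries a base64-encoded Arrow IPC stream. As a rough sketch (not the project's documented encoding), such a payload can be produced in Rust with the arrow and base64 crates; the "inputs" column name, the choice of the stream IPC format, and the crate features are assumptions — see the bench-client source for the authoritative construction.

// Sketch: build an LZ4-compressed Arrow IPC stream and base64-encode it.
// Assumes the `arrow` crate with IPC compression enabled and `base64` 0.21+.
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;
use base64::{engine::general_purpose::STANDARD, Engine as _};

fn encode_texts(texts: &[&str]) -> Result<String, Box<dyn std::error::Error>> {
    // Single Utf8 column; the column name "inputs" is an assumption.
    let schema = Arc::new(Schema::new(vec![Field::new("inputs", DataType::Utf8, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(StringArray::from(texts.to_vec()))],
    )?;

    // LZ4-compressed IPC stream, matching the compression the endpoints advertise.
    let options = IpcWriteOptions::default()
        .try_with_compression(Some(CompressionType::LZ4_FRAME))?;
    let mut buf = Vec::new();
    {
        let mut writer = StreamWriter::try_new_with_options(&mut buf, &schema, options)?;
        writer.write(&batch)?;
        writer.finish()?;
    }
    // Base64 so the bytes can be embedded in a JSON/grpcurl request body.
    Ok(STANDARD.encode(buf))
}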

Benefits:

  • Process thousands of texts in a single request
  • LZ4 compression reduces network overhead
  • Efficient memory layout for batch processing
  • Dense: Returns embeddings as Arrow FixedSizeList<Float32> for zero-copy access
  • Sparse: Returns embeddings as Arrow List<Struct<index:u32, value:f32>> for variable-length sparse vectors

Rust Benchmark Client

TEI Manager includes a built-in Rust benchmark client for testing throughput and latency. This also serves as a complete example for integrating with the gRPC API from Rust.

Installation

# Build from source
cargo build --release --bin bench-client

# Or run directly
cargo run --release --bin bench-client -- --help

Usage

# Standard mode: concurrent single-text requests
bench-client -e http://localhost:9001 -i bge-small \
  --mode standard --num-texts 10000 --batch-size 100

# Arrow mode: batched Arrow IPC requests (recommended for throughput)
bench-client -e http://localhost:9001 -i bge-small \
  --mode arrow --num-texts 100000 --batch-size 1000

# With mTLS
bench-client -e https://localhost:9001 -i bge-small \
  --cert client.pem --key client-key.pem --ca ca.pem \
  --mode arrow --num-texts 100000 --batch-size 1000

Example Output

{
  "mode": "arrow",
  "instance_name": "bge-small",
  "num_texts": 100000,
  "batch_size": 1000,
  "num_requests": 100,
  "total_duration_secs": 12.34,
  "throughput_per_sec": 8103.72,
  "successful": 100000,
  "failed": 0
}

Using as a Rust Library Example

The bench-client source (src/bin/bench-client.rs) demonstrates:

  • Connecting to the gRPC multiplexer with/without TLS
  • Creating Arrow IPC batches with LZ4 compression
  • Sending EmbedArrow requests and parsing responses
  • Concurrent request handling with Tokio
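
A minimal sketch of the connection setup is shown below, covering the plaintext and mTLS cases with tonic (assuming tonic's TLS feature is enabled; the certificate file names mirror the bench-client flags above). The generated TeiMultiplexer client and message types are omitted here — take them from the bench-client source or the generated protobuf code.

// Sketch: open a tonic channel to the multiplexer, plaintext or mTLS, before
// wrapping it in the generated TeiMultiplexer client (see src/bin/bench-client.rs).
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Identity};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Plaintext connection (matches the grpcurl -plaintext examples above).
    let plaintext = Channel::from_static("http://localhost:9001").connect().await?;

    // mTLS connection, mirroring the bench-client --cert/--key/--ca flags.
    let tls = ClientTlsConfig::new()
        .identity(Identity::from_pem(
            std::fs::read("client.pem")?,
            std::fs::read("client-key.pem")?,
        ))
        .ca_certificate(Certificate::from_pem(std::fs::read("ca.pem")?));
    let secure = Channel::from_static("https://localhost:9001")
        .tls_config(tls)?
        .connect()
        .await?;

    // Wrap either channel in the generated client, e.g. TeiMultiplexerClient::new(plaintext),
    // then call embed()/embed_arrow() with a target.instance_name as in the grpcurl examples.
    let _ = (plaintext, secure);
    Ok(())
}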

REST API

Endpoints

Method | Endpoint | Description | Success | Error Codes
--- | --- | --- | --- | ---
GET | /health | Health check | 200 | -
GET | /metrics | Prometheus metrics | 200 | -
GET | /instances | List all instances | 200 | -
GET | /instances/{name} | Get instance details | 200 | 404 INSTANCE_NOT_FOUND
POST | /instances | Create new instance | 201 | 409 INSTANCE_EXISTS, 422 PORT_CONFLICT
DELETE | /instances/{name} | Delete instance | 200 | 404 INSTANCE_NOT_FOUND
POST | /instances/{name}/start | Start instance | 200 | 404, 409 ALREADY_RUNNING
POST | /instances/{name}/stop | Stop instance | 200 | 404, 409 NOT_RUNNING
POST | /instances/{name}/restart | Restart instance | 200 | 404
GET | /instances/{name}/logs | Get instance logs | 200 | 404
GET | /models | List all known models | 200 | -
POST | /models | Register a model | 201 | -
GET | /models/{id} | Get model details | 200 | 404 MODEL_NOT_FOUND
POST | /models/{id}/download | Download model to cache | 200 | 409 MODEL_BUSY, 500
POST | /models/{id}/load | Smoke test model loading | 200 | 409 MODEL_BUSY, 500

Error responses include a machine-readable code field:

{"error": "Instance not found", "code": "INSTANCE_NOT_FOUND", "timestamp": "..."}

Create Instance

curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-model",
    "model_id": "BAAI/bge-small-en-v1.5",
    "gpu_id": 0,
    "max_batch_tokens": 16384,
    "max_concurrent_requests": 512
  }'

Required Fields:

  • name - Unique instance name
  • model_id - HuggingFace model ID

Optional Fields:

  • port - HTTP port (auto-assigned if omitted)
  • gpu_id - GPU to pin instance to (omit to use all GPUs)
  • max_batch_tokens - Max tokens per batch (default: 16384)
  • max_concurrent_requests - Max concurrent requests (default: 512)
  • pooling - Pooling method (e.g., "splade" for sparse models)

Model Registry

The model registry tracks HuggingFace models and their status. Models are auto-discovered from the HF cache on startup.

# List all known models
curl http://localhost:9000/models

# Register a model (checks if already cached)
curl -X POST http://localhost:9000/models \
  -H "Content-Type: application/json" \
  -d '{"model_id": "BAAI/bge-small-en-v1.5"}'

# Download a model to cache
curl -X POST "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5/download"

# Smoke test model loading (loads on GPU 0, verifies, unloads)
curl -X POST "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5/load"

# Get model details (cache path, size, metadata, verification status)
curl "http://localhost:9000/models/BAAI%2Fbge-small-en-v1.5"

Model Status Flow:

  • available - Model is registered but not downloaded
  • downloading - Download in progress
  • downloaded - Model is in HF cache
  • loading - Smoke test in progress
  • verified - Smoke test passed, ready to use
  • failed - Smoke test failed (check verification_error)

Note: Model IDs contain / which must be URL-encoded as %2F in paths.


Configuration

Environment Variables

TEI_MANAGER_API_PORT=9000           # REST API port
TEI_MANAGER_GRPC_PORT=9001          # gRPC multiplexer port
TEI_MANAGER_STATE_FILE=/data/state.toml
TEI_BINARY_PATH=/usr/local/bin/text-embeddings-router

Config File

api_port = 9000
grpc_port = 9001
state_file = "/data/state.toml"
health_check_interval_secs = 30
max_instances = 10

# Pre-register models (checked against HF cache on startup)
models = [
  "BAAI/bge-small-en-v1.5",
  "sentence-transformers/all-MiniLM-L6-v2"
]

# Seed instances (auto-started on boot)
[[instances]]
name = "bge-small"
model_id = "BAAI/bge-small-en-v1.5"
gpu_id = 0

Examples

Multi-GPU Setup

# GPU 0: Small model for low-latency
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "fast", "model_id": "BAAI/bge-small-en-v1.5", "gpu_id": 0}'

# GPU 1: Large model for quality
curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{"name": "quality", "model_id": "BAAI/bge-large-en-v1.5", "gpu_id": 1}'

# Route requests to either via gRPC
grpcurl -plaintext -d '{"target": {"instance_name": "fast"}, "request": {"inputs": "Quick query"}}' \
  localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed

grpcurl -plaintext -d '{"target": {"instance_name": "quality"}, "request": {"inputs": "Important document"}}' \
  localhost:9001 tei_multiplexer.v1.TeiMultiplexer/Embed

Sparse Embeddings (SPLADE)

curl -X POST http://localhost:9000/instances \
  -H "Content-Type: application/json" \
  -d '{
    "name": "splade",
    "model_id": "naver/splade-cocondenser-ensembledistil",
    "pooling": "splade"
  }'

# Generate sparse embeddings
grpcurl -plaintext -d '{
  "target": {"instance_name": "splade"},
  "request": {"inputs": "Information retrieval"}
}' localhost:9001 tei_multiplexer.v1.TeiMultiplexer/EmbedSparse

Development

# Install just: cargo install just
just --list              # Show all available commands

# Common workflows
just test                # Run unit tests
just check               # Format check + clippy + all tests
just coverage            # Generate HTML coverage report
just docker-build        # Build Docker image
just pre-commit          # Run before committing

Documentation


Known Limitations

  • Single host only - No clustering or multi-node coordination
  • No request queuing - Requests exceeding TEI's max_concurrent_requests return errors immediately
  • No per-tenant auth - mTLS authenticates connections, not individual requests
  • Port range required - Each instance needs an HTTP port; plan your port range accordingly

Future Directions

  • Model-based routing (route by model_id instead of instance_name)
  • HTTP embedding endpoint on manager (avoid direct TEI access)
  • Metrics-based instance recommendations

Versioning

TEI Manager follows Semantic Versioning:

  • MAJOR - Breaking changes to REST/gRPC APIs or config format
  • MINOR - New features, backward-compatible
  • PATCH - Bug fixes only

Docker tag format: {manager_version}-tei-{tei_version}[-{variant}], as described under Docker Images.

The manager version tracks our API stability. The TEI version tracks the embedded TEI binary. We test against TEI's gRPC interface and will bump MINOR if TEI changes require manager updates.

Current stability:

  • REST API: Stable since v0.4.0
  • gRPC API: Stable since v0.3.0
  • Config format: Stable since v0.1.0

License

Apache License 2.0 - see LICENSE for details.
