A lightweight, self-hosted embeddings microservice that mirrors the OpenAI POST /v1/embeddings API — but runs local Hugging Face text embedding models.
Use it as a drop-in embeddings backend for:
- semantic search & nearest-neighbor retrieval
- clustering / grouping
- duplicate detection
- offline indexing pipelines
Scope: text → vector only. No reranking, no training, no “compare” endpoints — just fast, predictable embeddings behind a stable API.
- OpenAI-compatible endpoint: `POST /v1/embeddings`
- Local inference with Hugging Face models (allowlist via `supported_models.txt`)
- Input formats: string(s) or token ID(s)
- Output formats: float arrays or base64 (`encoding_format`)
- Optional dimension truncation (`dimensions`)
- CPU and GPU Docker builds + docker-compose
- Deterministic pooling: mean pooling + L2 normalization
Build:

```bash
git clone https://github.com/kotylevskiy/embeddings.git
cd embeddings
docker compose up -d --build
```

Test:
```bash
curl -s http://localhost:11445/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"jinaai/jina-embeddings-v2-base-en","input":"Hi! This is a test"}'
```

The first call can take a while because the service needs to download the model from Hugging Face.
The API is designed to mirror OpenAI’s embeddings endpoint as closely as possible:
POST /v1/embeddings
Request:

- `model` — required, must be in `supported_models.txt`
- `input` — required; one of:
  - a string
  - a list of strings
  - a list of integer token IDs
  - a list of token ID arrays
- `encoding_format` — optional: `"float"` (default) or `"base64"`
- `dimensions` — optional: if set, embeddings are truncated to this size
- `user` — optional: accepted and ignored (kept for spec parity)

Response:

- `object`: `"list"`
- `data`: list of `{ object, index, embedding }`
- `model`: echoes the request model
- `usage`: `{ prompt_tokens, total_tokens }`
Token usage is computed using the Hugging Face tokenizer for the chosen model.
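As an illustration of the response shape, here is a hedged sketch that calls the endpoint with Python's `requests` and reads the fields documented above (port and model taken from the examples in this README):

```python
import requests

# Illustrative client-side call; field access follows the documented response shape.
resp = requests.post(
    "http://localhost:11445/v1/embeddings",
    json={
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "input": ["First text", "Second text"],
    },
).json()

assert resp["object"] == "list"
for item in resp["data"]:                  # each item: { object, index, embedding }
    print(item["index"], len(item["embedding"]))
print(resp["model"])                       # echoes the request model
print(resp["usage"]["prompt_tokens"], resp["usage"]["total_tokens"])
```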
- `dimensions` performs a simple truncation (no re-projection)
- If `encoding_format="base64"`, embeddings are encoded as base64-encoded float32
- Token ID inputs must match the model's tokenizer vocabulary
Embeddings only loads models listed in supported_models.txt.
Add new models by appending their Hugging Face IDs to that file.
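For example, an allowlist containing the two models used elsewhere in this README could look like the following (one Hugging Face ID per line is assumed):

```text
jinaai/jina-embeddings-v2-base-en
sentence-transformers/all-MiniLM-L6-v2
```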
The service uses a generic mean pooling strategy:
```
embedding = sum(last_hidden_state * attention_mask) / sum(attention_mask)
```
This produces a single vector per input. Normalization is always enabled.
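For reference, this is roughly what masked mean pooling plus L2 normalization looks like in PyTorch; a minimal sketch, not the service's exact implementation:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # number of real tokens
    embedding = summed / counts                                      # masked mean
    return torch.nn.functional.normalize(embedding, p=2, dim=1)      # L2 normalization
```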
⚠️ Security & production readiness noticeEmbeddings is provided as a development tool and is not production-ready by default:
- No authentication or authorization
- API docs (
/docs,/redoc,/openapi.json) are public- No rate limiting or abuse protection
- No security hardening or penetration testing
For public models, HF_TOKEN is optional. For gated/private models, you must set:
```bash
export HF_TOKEN=your_token
```

The token is used only to download model weights.
Embeddings runs on CPU-only systems, but larger models can be slow. For high throughput or large models, GPU is recommended.
Create a `.env` file with the environment variables you need (at minimum `EMBEDDINGS_PORT`, if you want to change the default port):

```
HF_TOKEN=your_token_if_needed
EMBEDDINGS_PORT=11445
```

Then run:
```bash
git clone https://github.com/your-org/embeddings.git
cd embeddings
docker compose up -d
```
⚠️ Docker GPU passthrough via NVIDIA Container Toolkit is supported only on Linux hosts.
Set PYTORCH_TAG in your .env file to use Dockerfile.gpu.
Choose a tag with a CUDA version supported by your NVIDIA driver.
Run the GPU stack with:
```bash
docker compose -f docker-compose.yaml -f docker-compose.gpu.yaml up -d --build
```

If you want lower VRAM usage, you can enable bitsandbytes quantization for any model:

```
EMBEDDINGS_BITSANDBYTES=8bit   # or 4bit
```

Notes:
- bitsandbytes only works on CUDA.
- `bitsandbytes` must be installed (it's included in `requirements.txt`).
jinaai/jina-embeddings-v3 can use flash-attn for faster attention on CUDA.
If flash-attn is not installed, you may see:
```
flash_attn is not installed. Using PyTorch native attention implementation.
```
To enable flash-attn in the GPU image, set this in .env before building:
```
INSTALL_FLASH_ATTN=1
```

Rebuild the GPU image after changing it.
⚠️ Warning: building flash-attn is time-consuming and highly resource-intensive.
```bash
curl -sS http://localhost:11445/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```

```bash
curl -sS http://localhost:11445/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "encoding_format": "base64",
    "input": [
      "First text",
      "Second text"
    ]
  }'
```

A few curl-based tests live in tests/. Run them against a running container:
```bash
./tests/test_simple.sh
./tests/test_base64.sh
./tests/test_batch.sh
```