Semantic search over the Mattermost community forum. Scrapes Discourse posts, embeds them with sentence-transformers or OpenAI, stores vectors in Qdrant, and serves a Streamlit search UI.
Streamlit — Python library that turns a Python script into an interactive web app. No HTML/CSS/JS needed. Used here as the search UI.
Alembic — Database migration tool for SQLAlchemy. Tracks schema changes over time (like git for your database). `alembic upgrade head` brings any environment to the current schema. Always run this after deploying a new version.
Qdrant — Vector database. Stores post embeddings and answers nearest-neighbour queries (semantic search). Runs as a Docker container alongside Postgres.
sentence-transformers — Python library for generating text embeddings locally using pre-trained models. Requires PyTorch, which makes the Docker image large (~1.2 GB).
Inference — running a trained model to produce outputs (embeddings). Happens twice: once in bulk when embedding all posts, and once per search query to embed the query text. The model weights never change — we use a pre-trained model as-is.
ANN (approximate nearest neighbour) — the algorithm Qdrant uses to find vectors closest to the query vector. Not model inference — pure mathematical search (cosine similarity). Fast on CPU, unaffected by the embedding model size.
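A minimal sketch of the similarity math in plain Python (not Qdrant's actual ANN implementation, which adds an index structure on top to avoid comparing against every vector):

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|) — range [-1, 1]; higher means semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" — the real ones have 384 dimensions
query = [0.1, 0.9, 0.2]
post_on_topic = [0.2, 0.8, 0.3]   # similar meaning -> similar direction
post_off_topic = [0.9, 0.1, 0.0]  # different meaning -> different direction

print(cosine_similarity(query, post_on_topic) > cosine_similarity(query, post_off_topic))  # True
```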
RAG (Retrieval Augmented Generation) — the pattern that powers Answer mode. Retrieve relevant forum posts via vector search, pass their text as context to an LLM, and let the LLM synthesize an answer grounded in that context alone. Because every claim is traceable to a source post, hallucination is greatly reduced (though not eliminated). See docs/how-rag-works.md for a detailed explanation.
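The prompt-assembly half of that pattern can be sketched in plain Python. The post dict shape and the instruction wording here are hypothetical, for illustration only; the actual retrieval step and LLM call are not shown:

```python
def build_rag_prompt(question: str, posts: list[dict]) -> str:
    """Pack retrieved forum posts into a grounding prompt for the LLM.

    `posts` stands in for the top-k vector-search hits; the
    'title'/'url'/'text' keys are an assumed shape.
    """
    context = "\n\n".join(
        f"[{i}] {p['title']} ({p['url']})\n{p['text']}"
        for i, p in enumerate(posts, start=1)
    )
    return (
        "Answer using ONLY the forum posts below, citing sources as [n]. "
        "If the posts do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "Why aren't email notifications sent?",
    [{"title": "SMTP setup", "url": "https://forum.example/t/1",
      "text": "Check that the SMTP port is reachable."}],
)
```

The "answer ONLY from the posts" instruction is what keeps the LLM's output grounded in retrieved sources.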
```
Discourse forum
      │
      ▼
run_scrape.py ──► PostgreSQL (topics, posts, categories)
      │
      ▼
run_embed.py ──► Qdrant (vector embeddings)
      │
      ▼
streamlit_app.py ──► https://mmforums.mattermosteng.online
      ├── Search mode: returns ranked forum post links
      └── Answer mode: RAG — retrieves posts → LLM generates answer
```
| File | Purpose |
|---|---|
| `docker-compose.yml` | Base services (Postgres, Qdrant) — no external ports exposed |
| `docker-compose.dev.yml` | Adds external ports for local access (Postgres: 5433, Qdrant: 6333) |
| `docker-compose.prod.yml` | Prod overrides: nginx, letsencrypt, restart policies, streamlit service |
Local dev: `docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d`
Prod: `docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build`
Prerequisites: Docker, Python 3.12+
```shell
# Start Postgres + Qdrant with local ports exposed
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Install package with dev extras
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run migrations
alembic upgrade head

# Run scraper
python scripts/run_scrape.py

# Run embedder
python scripts/run_embed.py

# Launch Streamlit
streamlit run app/streamlit_app.py
```

Copy .env.example to .env and adjust values before running locally.
In .env, use `DATABASE_URL=postgresql://mm:changeme@localhost:5433/mm_forum` (port 5433 for local dev).
- Hetzner Cloud account + API token
- SSH key registered in your Hetzner project
- A domain with DNS you control
- If your SSH key is in 1Password: enable the 1Password SSH agent (Settings → Developer → Use the SSH agent) and add this to `~/.ssh/config`:

```
Host *
  IdentityAgent "~/Library/Group Containers/2BUA8C4S2C.com.1password/t/agent.sock"
```
```shell
HCLOUD_TOKEN=<token> \
HCLOUD_SSH_KEY=<key-name-in-hetzner> \
DOMAIN=<your-domain> \
HCLOUD_LOCATION=fsn1 \
HCLOUD_SERVER_TYPE=cx33 \
bash deploy/provision_hetzner.sh
```

Note the server IPv4 from the output.
Add an A record: <your-domain> → <server-ipv4>
If using Cloudflare, set it to DNS only (grey cloud) — the orange proxy will break certbot. You can re-enable it after SSL is provisioned.
```shell
ssh root@<server-ipv4> 'cloud-init status --wait'
```

This waits for cloud-init to finish installing Docker, Docker Compose, certbot, and UFW (~60–90 s).
```shell
ssh root@<server-ipv4> 'certbot certonly --standalone -d <your-domain>'
```

Certbot sets up automatic renewal. Certs land in /etc/letsencrypt/live/<your-domain>/ and are bind-mounted read-only into the nginx container.
The repo is public so no auth is needed:
```shell
ssh root@<server-ipv4>
git clone https://github.com/pavelzeman/mm-forums-vector-db.git /home/deploy/projects/mm-forums-vector-db
chown -R deploy:deploy /home/deploy/projects
```

Create the .env file:

```shell
cp /home/deploy/projects/mm-forums-vector-db/.env.example \
   /home/deploy/projects/mm-forums-vector-db/.env
chmod 600 /home/deploy/projects/mm-forums-vector-db/.env
nano /home/deploy/projects/mm-forums-vector-db/.env
```

Set these values:
- `DOMAIN=<your-domain>`
- `POSTGRES_PASSWORD=<strong-random-password>` — generate with `openssl rand -base64 32`
- `DATABASE_URL=postgresql://mm:<password>@postgres:5432/mm_forum` — use port `5432` (not `5433`, which is local dev only)
- `OPENAI_API_KEY` if using OpenAI embeddings
```shell
su - deploy
cd ~/projects/mm-forums-vector-db
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build
```

The first build takes several minutes due to PyTorch (see Why is the first build slow?).
```shell
# Run as deploy user from ~/projects/mm-forums-vector-db

# Apply database schema
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit alembic upgrade head

# Scrape the forum (resumable if interrupted)
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/run_scrape.py

# Embed the posts
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/run_embed.py
```

The Streamlit search UI will return results once embedding is complete.
```shell
su - deploy
cd ~/projects/mm-forums-vector-db
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit alembic upgrade head
```

Every push to main runs tests (pytest with Postgres + Qdrant service containers).
A full build-and-deploy pipeline will be added when moving to AWS.
The Docker image is large (~1.2 GB) because sentence-transformers bundles:
- PyTorch (~800 MB) — the deep learning framework, includes CUDA binaries even on CPU-only servers
- Hugging Face Transformers (~200 MB)
- Model weights — `all-MiniLM-L6-v2` is ~90 MB
- NumPy, SciPy, tokenizers — ~100 MB
PyTorch is the main culprit. Subsequent builds are fast because Docker caches the layer.
Alternative: set EMBEDDING_MODEL=openai in .env to skip local inference entirely and
use the OpenAI API instead. The image will be much smaller and builds will be faster, but
embeddings have a per-token cost and require OPENAI_API_KEY.
When ready to move from Hetzner to AWS, use deploy/migrate_to_aws.sh. It requires SSH access to both servers from your laptop and uses no intermediate storage (S3, etc.).
```shell
HETZNER_HOST=46.224.111.133 \
AWS_HOST=<ec2-ip> \
AWS_USER=ec2-user \
bash deploy/migrate_to_aws.sh
```

What it does:

- Postgres — streams `pg_dump | pg_restore` directly Hetzner → laptop → AWS (no temp file)
- Qdrant — takes a snapshot on Hetzner, stages it locally, uploads and restores on AWS
- Prints a verification checklist before you flip DNS
AWS prerequisite: the stack must be running (docker compose up -d) with an empty database
before you run the script — it overwrites whatever is there.
Migration order:

1. Provision AWS infra and start the empty stack
2. Run `migrate_to_aws.sh`
3. Verify the app works on the AWS IP
4. Update the DNS A record to the AWS IP
5. Wait for the TTL to expire, then shut down the Hetzner server
The scraper stores raw post text in Postgres. The embedder reads those posts and converts each one into a vector — a list of numbers (384 of them for `all-MiniLM-L6-v2`) representing the semantic meaning of the text. Similar meaning = similar numbers = close together in vector space.
When you search, your query is converted to a vector the same way, and Qdrant finds the posts whose vectors are closest to it. This is why a query like "users not receiving email notifications after SMTP setup" finds relevant posts even if no post contains those exact words — it matches on meaning, not keywords.
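A toy illustration of why keyword matching misses such posts (plain Python; the example post text is invented):

```python
def keyword_overlap(query: str, post: str) -> int:
    """Count shared words — a stand-in for naive keyword search scoring."""
    return len(set(query.lower().split()) & set(post.lower().split()))

query = "users not receiving email notifications after SMTP setup"
post = "mail delivery fails once the outgoing server is configured"

# Zero shared words, yet the post is clearly relevant — embeddings of the
# two strings would still land close together in vector space.
print(keyword_overlap(query, post))  # 0
```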
The vectors are stored in Qdrant. Postgres keeps the raw text and metadata. The two are linked by `embedding_id` on each post row.
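That linkage can be sketched with plain dicts standing in for Qdrant hits and Postgres rows (both shapes are hypothetical, for illustration only):

```python
def join_hits_to_posts(hits: list[dict], rows: list[dict]) -> list[dict]:
    """Attach Postgres text/metadata to Qdrant search hits via embedding_id."""
    by_id = {row["embedding_id"]: row for row in rows}
    return [
        {**by_id[h["embedding_id"]], "score": h["score"]}
        for h in hits
        if h["embedding_id"] in by_id
    ]

# Qdrant returns ids + similarity scores; Postgres holds the text those ids point at.
hits = [{"embedding_id": "abc", "score": 0.91}]
rows = [{"embedding_id": "abc", "text": "Check the SMTP port.", "topic": "Email"}]

results = join_hits_to_posts(hits, rows)
print(results[0]["text"])   # Check the SMTP port.
print(results[0]["score"])  # 0.91
```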
Run the embedder after every scrape to keep the vector index current:

```shell
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/run_embed.py
```

Use these to verify the search is working after embedding completes. Open https://mmforums.mattermosteng.online and try them in order.
- how to install Mattermost
- reset password
- LDAP configuration
- mobile app notifications
- create a new channel
- users not receiving email notifications after SMTP setup
- difference between team and channel admin permissions
- migrate from Slack to Mattermost
- webhook payload format for incoming messages
- plugin not showing up after install
- server upgrade broke existing integrations and bots stopped responding
- high memory usage on self-hosted instance with many concurrent users
- guest accounts can see channels they shouldn't have access to
- how to archive old channels without losing message history for compliance
- custom emoji not syncing across cluster nodes
These are the real test. A keyword search would struggle; semantic search should surface relevant threads even when the exact words don't appear in any post.
- trade-offs between database connection pooling settings and Mattermost performance under load
- SAML SSO with Okta works for web but mobile app falls back to password auth
- configuring rate limiting to protect the API without breaking high-volume bot integrations
- recommended approach for zero-downtime Mattermost upgrades in a Kubernetes deployment
- audit log gaps when using read replicas — some user actions not appearing in compliance exports
```shell
# Scrape new posts (resumable)
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/run_scrape.py

# Embed any unembedded posts
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/run_embed.py

# Query from CLI
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec streamlit python scripts/query.py "your search query"
```