# Semantic Search Engine

A lightweight, extensible toolkit for preparing, indexing, and querying textual data for semantic search and downstream NLP tasks.
- Project Overview
- Architecture Diagram
- Directory Layout
- Prerequisites
- Installation
- Configuration
- Data Ingestion & Preparation
- Semantic Indexing (Milvus)
- Search API
- Conversational RAG (Chat) Service
- Management & Administrative Tools
- Testing & Evaluation
- Extending the Engine
- Contribution Guidelines
- License
## Project Overview

The Semantic Search Engine provides a full-stack solution for building semantic search applications:
| Layer | Purpose | Main Modules |
|---|---|---|
| Data Ingestion | Convert raw text or annotation exports into a uniform JSON/JSONL format; optionally chunk, clean, and detect language. | `any_text_to_json.py`, `doccano_converter.py` |
| Storage | Persist raw documents and metadata in PostgreSQL (relational) and Milvus (vector) databases. | `data.models`, `engine.models`, `semantic_search_engine/aws_handler.py` |
| Embedding & Reranking | Compute dense vector representations and optionally re-rank using cross-encoders. | `embedders_rerankers.py`, `engine.controllers.embedders_rerankers` |
| Search | Retrieve nearest-neighbor chunks with optional metadata filters, template-based constraints, and pagination. | `engine.controllers.search`, `data.controllers.relational_db` |
| RAG (Retrieval-Augmented Generation) | Combine retrieved snippets with LLMs (local or OpenAI) to generate context-aware answers. | `chat.controllers`, `engine.controllers.models` |
| API Layer | Expose all functionality via a clean, versioned REST API built on Django REST Framework. | `chat.api`, `data.api`, `engine.api`, `system.api` |
| Administration | User, organisation, and group management, plus collection lifecycle utilities. | `system.controllers`, `system.models` |
The repository is deliberately modular: you can use the data preparation scripts standalone, or run the full Django‑powered service for end‑to‑end pipelines.
## Architecture Diagram

```
+----------------------+      +---------------------+      +-------------------+
|  Raw Text / Doccano  | ---> |  Ingestion Scripts  | ---> |  PostgreSQL (RDB) |
+----------------------+      +---------------------+      +-------------------+
            |                            |
            v                            v
   +-------------------+       +-------------------+
   |  Chunking /       |       |  Metadata &       |
   |  Language Detect  |       |  Relations        |
   +-------------------+       +-------------------+
            |                            |
            v                            v
   +-------------------+       +-------------------+
   |  JSON/JSONL Files |       |  Document Models  |
   +-------------------+       +-------------------+
            |                            |
            +------------+---------------+
                         |
                         v
               +-------------------+
               |  Milvus (Vector)  |
               +-------------------+
                         |
                         v
               +-------------------+
               |  Search Service   |
               +-------------------+
                         |
                         v
               +-------------------+
               |     RAG (LLM)     |
               +-------------------+
                         |
                         v
               +-------------------+
               | REST API (Django) |
               +-------------------+
```
## Directory Layout

```
semantic-search-engine/
├── apps_sse/                     # Core apps for SSE (semantic search engine)
│   ├── admin/                    # Management scripts (add org, users, etc.)
│   ├── dataset/                  # Generators for synthetic QA datasets
│   ├── evaluator/                # Test harnesses and evaluation utilities
│   ├── installed/                # Bundled third-party utilities (denoiser, converters)
│   ├── add_files_from_dir.py     # CLI for uploading & indexing a directory
│   ├── index_collection_to_milvus.py
│   ├── index_to_milvus.sh
│   ├── index_to_postgresql.sh
│   └── semantic_search_app.py    # Interactive REPL for ad-hoc queries
├── chat/                         # Conversational (RAG) service
│   ├── models.py
│   ├── controllers.py
│   ├── api.py
│   └── urls.py
├── data/                         # Document, collection, and query-template models
│   ├── models.py
│   ├── controllers.py
│   ├── api.py
│   └── urls.py
├── engine/                       # Search & generative model orchestration
│   ├── models.py
│   ├── controllers/
│   │   ├── search.py
│   │   └── models.py
│   ├── api.py
│   └── urls.py
├── system/                       # Organisation & authentication
│   ├── models.py
│   ├── controllers.py
│   ├── api.py
│   └── urls.py
├── main/                         # Django project entry point
│   ├── settings.py
│   ├── urls.py
│   ├── wsgi.py
│   └── asgi.py
├── semantic_search_engine/       # Shared utilities (AWS handler, constants)
│   ├── __init__.py
│   ├── aws_handler.py
│   ├── constants.py
│   └── models.json               # Model metadata registry
├── doccano_converter.py          # Doccano → dataset conversion
├── any_text_to_json.py           # Directory → JSON/JSONL conversion
├── manage.py                     # Django management script
├── requirements.txt
├── README.md                     # <-- you are reading this file!
└── scripts/                      # Optional helper scripts for CI / deployment
```
## Prerequisites

| Component | Minimum Version | Why it matters |
|---|---|---|
| Python | 3.11 (rc1 acceptable) | All scripts and Django run on 3.11+. |
| pip | latest | To install dependencies from requirements.txt. |
| PostgreSQL | 12+ | Relational store for documents, users, collections. |
| Milvus | 2.3+ | Vector database for semantic embeddings. |
| Django | 4.x | Core web framework for the API layer. |
| boto3 | any | Required only if you enable AWS S3 storage. |
| radlab_data package | ≥ 0.2.0 | Provides DirectoryFileReader, TextUtils, and other utilities used by the conversion scripts. |
| CUDA (optional) | 11+ | If you want to run embeddings / LLMs on GPU. |
| OpenAI API key (optional) | – | Enables the OpenAI generative backend. |
| Deepl API key (optional) | – | Enables answer translation. |
All third‑party Python dependencies are listed in requirements.txt.
## Installation

```bash
git clone https://github.com/radlab-dev-group/semantic-search-engine.git
cd semantic-search-engine
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

> Tip: If you encounter a `pip` version conflict with system packages, set `export PIP_BREAK_SYSTEM_PACKAGES=1` before installing (as done in `full-install.sh`).
Create a database and a user (example for a local dev setup):

```sql
CREATE DATABASE semantic_search;
CREATE USER sse_user WITH PASSWORD 'sse_pass';
GRANT ALL PRIVILEGES ON DATABASE semantic_search TO sse_user;
```

Then apply the Django migrations:

```bash
python manage.py migrate
```

If you don't have a remote Milvus cluster, you can run a Docker container:
```bash
docker run -d --name milvus-standalone \
    -p 19530:19530 -p 19121:19121 \
    milvusdb/milvus:2.3.0
```

Make sure the Milvus connection details match the values in `configs/milvus_config.json`.
## Configuration

Configuration files live under the `configs/` directory. The most important ones are:
| File | Description |
|---|---|
| `configs/milvus_config.json` | Milvus connection, index settings, and embedding model references. |
| `configs/aws_config.json` | S3 credentials and bucket name (used by `AwsHandler`). |
| `configs/embedders.json` | Registry of available embedding models (paths, vector size, device). |
| `configs/rerankers.json` | Registry of cross-encoder reranker models. |
| `configs/generative-models.json` | Mapping of generative model names to API endpoints (local or OpenAI). |
| `configs/query-templates.json` | Definition of query-template grammars and per-template metadata filters. |
`main/settings.py` loads configuration through the `SystemSettingsHandler` (see `main/src/settings.py`). The handler reads values from:

- JSON config files (`django-config.json` by default).
- Environment variables (prefixed with `ENV_`).
You can override any setting by exporting the corresponding environment variable. For example:
```bash
export ENV_DEBUG=true
export ENV_DB_HOST=localhost
export ENV_USE_AWS=1
export ENV_USE_KC_AUTH=0
```

| Setting | Source | Default |
|---|---|---|
| `DEBUG` | `ENV_DEBUG` or `django-config.json` | `false` |
| `DATABASES` | `ENV_DB_*` or `django-config.json` | PostgreSQL config |
| `MAIN_API_URL` | Computed from the `api` section in config | `/api/` |
| `ALLOWED_HOSTS` | `ENV_ALLOWED_HOSTS` or config | `[]` |
| `MAIN_LOGGER` | Configured via the `logger` section | Python `logging` module |
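The override mechanism can be sketched in a few lines (a simplified illustration only — the function name and the boolean coercion below are assumptions, not the actual `SystemSettingsHandler` code):

```python
import json
import os

def load_setting(name, config, default=None):
    """Return a setting value: an ENV_<NAME> environment variable wins
    over the JSON config, which wins over the hard-coded default."""
    env_value = os.environ.get(f"ENV_{name}")
    if env_value is not None:
        lowered = env_value.lower()
        # Environment variables are strings; interpret common booleans.
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0"):
            return False
        return env_value
    return config.get(name, default)

# Example: the env var overrides the value loaded from django-config.json.
config = json.loads('{"DEBUG": false, "DB_HOST": "db.internal"}')
os.environ["ENV_DEBUG"] = "true"
os.environ.pop("ENV_DB_HOST", None)        # ensure no env override here
debug = load_setting("DEBUG", config)      # True (env override)
db_host = load_setting("DB_HOST", config)  # "db.internal" (JSON value)
```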
The `AwsHandler` expects a JSON file (default: `configs/aws_config.json`). Example:

```json
{
  "aws": {
    "REGION_NAME": "us-east-1",
    "ENDPOINT_URL": "https://s3.amazonaws.com",
    "ACCESS_KEY_ID": "<your-access-key>",
    "SECRET_ACCESS_KEY": "<your-secret>",
    "STORAGE_BUCKET_NAME": "my-semsearch-bucket"
  }
}
```

All required fields are validated on `AwsHandler` initialization; missing values raise an `AssertionError`.
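That validation step can be sketched as follows (a hypothetical reimplementation of the idea; the real `AwsHandler` may check fields differently):

```python
REQUIRED_AWS_KEYS = (
    "REGION_NAME",
    "ENDPOINT_URL",
    "ACCESS_KEY_ID",
    "SECRET_ACCESS_KEY",
    "STORAGE_BUCKET_NAME",
)

def validate_aws_config(config: dict) -> dict:
    """Assert that the 'aws' section exists and every required field
    is present and non-empty; return the validated section."""
    aws = config.get("aws")
    assert aws is not None, "missing top-level 'aws' section"
    for key in REQUIRED_AWS_KEYS:
        assert aws.get(key), f"missing or empty AWS config field: {key}"
    return aws
```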
`semantic_search_engine/models.json` holds model metadata used by the embedding and reranker loaders. A minimal entry looks like:

```json
{
  "denoiser": {
    "model_name": "radlab/polish-denoiser-t5-base",
    "model_path": "/opt/models/denoiser/t5-base",
    "device": "cuda:0"
  }
}
```

You can add new models by extending the JSON and updating `embedders.json` / `rerankers.json` accordingly.
| Variable | Meaning |
|---|---|
| `ENV_DEBUG` | Turn Django debug mode on/off. |
| `ENV_DB_NAME` / `ENV_DB_HOST` / `ENV_DB_PORT` / `ENV_DB_ENGINE` / `ENV_DB_USERNAME` / `ENV_DB_PASSWORD` | Override the PostgreSQL connection. |
| `ENV_USE_AWS` | Enable the `AwsHandler` S3 wrapper. |
| `ENV_USE_KC_AUTH` | Use Keycloak authentication (requires `rdl_authorization`). |
| `ENV_USE_OAUTH_V1_AUTH` / `ENV_USE_OAUTH_V2_AUTH` | Use OAuth v1/v2 authentication. |
| `ENV_USE_INTROSPECT` | Enable token introspection for OAuth. |
| `ENV_PUBLIC_API_AVAILABLE` | Expose public API endpoints without authentication. |
| `ENV_CELERY_BROKER_URL` | URL of the Celery broker (RabbitMQ, Redis, etc.). |
| `ENV_DEFAULT_LANGUAGE` | Default language for API responses (`pl` or `en`). |
| `OPENAI_API_KEY` | API key for OpenAI model calls. |
| `DEEPL_AUTH_KEY` | API key for the DeepL translation service. |
## Data Ingestion & Preparation

The engine expects documents to be stored in a normalized JSON/JSONL format. Two primary CLI utilities help you get there.
`any_text_to_json.py` walks a directory tree, reads plain-text files, and emits a single JSON document (or JSONL) containing:

- `filepath`, `relative_filepath`
- `category` (derived from folder hierarchy)
- `pages` → list of `{page_number, page_content, metadata}`
- Optional language detection, chunking, overlapping tokens, and denoising.
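A single entry in the output might therefore look like this (illustrative values only; the exact metadata keys are assumptions):

```json
{
  "filepath": "/data/raw/prawo/regulacje_2023.txt",
  "relative_filepath": "prawo/regulacje_2023.txt",
  "category": "prawo",
  "pages": [
    {
      "page_number": 1,
      "page_content": "…",
      "metadata": {"language": "pl"}
    }
  ]
}
```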
```bash
python any_text_to_json.py \
    -d /path/to/raw_texts \
    -o /tmp/dataset.json
```

| Flag | Description |
|---|---|
| `--proper-pages` | Merge all texts belonging to the same logical page before chunking. |
| `--merge-document-pages` | Collapse an entire document into a single page (useful for short docs). |
| `--clear-texts` | Run the radlab-data denoiser on each chunk. |
| `--split-to-max-tokens-in-chunk N` | Limit each chunk to N tokens (uses the tokenizer of the chosen embedder). |
| `--overlap-tokens-between-chunks M` | Overlap M tokens between successive chunks (must accompany `--split-to-max-tokens-in-chunk`). |
| `--check-text-language` | Detect language with `fasttext-langdetect` and store it in metadata. |
| `--processes K` | Parallelise the pipeline with K worker processes. |
Output format is automatically chosen: if the output file ends with `.json`, a single JSON object containing a `"documents"` list is written; otherwise a JSONL stream is produced (one document per line).
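The extension-based dispatch can be sketched as follows (a simplified illustration of the behavior, not the script's actual code):

```python
import json

def write_documents(documents, output_path):
    """Write a single '.json' file with a top-level "documents" list, or
    a JSONL stream (one JSON document per line) for any other extension."""
    with open(output_path, "w", encoding="utf-8") as handle:
        if output_path.endswith(".json"):
            json.dump({"documents": documents}, handle, ensure_ascii=False)
        else:
            for document in documents:
                handle.write(json.dumps(document, ensure_ascii=False) + "\n")
```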
`doccano_converter.py` transforms Doccano annotation exports (JSONL) into datasets ready for sequence classification or token classification training.
```bash
# Sequence classification dataset
python doccano_converter.py \
    -I /data/doccano/seq_export \
    -e .jsonl \
    --show-class-labels-histogram \
    -O ./prepared_datasets/seq/20231208 \
    --sequence-classification

# Token classification dataset
python doccano_converter.py \
    -I /data/doccano/token_export \
    -e .jsonl \
    --show-class-labels-histogram \
    -O ./prepared_datasets/token/20231208 \
    --token-classification
```

Key flags:
| Flag | Meaning |
|---|---|
| `-I` / `--input-dir` | Directory with Doccano export files. |
| `-e` / `--dataset-extension` | File extension of source files (`.jsonl` default). |
| `-O` / `--output-dir` | Destination directory for the prepared dataset. |
| `--sequence-classification` | Produce a dataset for text-level classification. |
| `--token-classification` | Produce a dataset for token-level labeling. |
| `--show-class-labels-histogram` | Print a histogram of class label frequencies (useful for sanity checks). |
| `--save-excel-annotations` | Export label statistics to an Excel file. |
| `--save-iob-standard` | Save the token-level dataset in IOB format. |
| `--mapping-file` | Optional CSV mapping from source label names to canonical names. |
The high-level CLI `add_files_from_dir.py` bundles the previous steps: it uploads a directory to the relational DB, runs optional preprocessing, and then triggers semantic indexing into Milvus.
```bash
python add_files_from_dir.py \
    -d /path/to/texts \
    -c MyCollection \
    --clear-texts \
    --merge-document-pages \
    --split-to-max-tokens-in-chunk=200 \
    --overlap-tokens-between-chunks=20 \
    --processes=10
```

Internally the command:

- Creates a temporary upload folder (or uses an existing `UploadedDocuments` record).
- Calls `DirectoryFileReader` → `RelationalDBController.add_uploaded_documents_to_db`.
- Instantiates `DBSemanticSearchController` for the target collection and indexes all resulting `DocumentPageText` objects.
## Semantic Indexing (Milvus)

The system supports dynamic model loading via the JSON registries in `configs/embedders.json` and `configs/rerankers.json`. Each entry contains:

```json
{
  "name": "sentence-transformers/all-MiniLM-L6-v2",
  "path": "/opt/models/all-MiniLM-L6-v2",
  "vector_size": 384,
  "device": "cuda:0"
}
```

- Embedders produce dense vectors for each text chunk.
- Rerankers (cross‑encoders) optionally re‑score the top‑K retrieved vectors using a second model, improving relevance.
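The rerank step can be illustrated with a minimal pure-Python sketch (the dot-product scoring below is only a stand-in for a real cross-encoder, which scores query/chunk text pairs directly):

```python
def rerank(query_vec, candidates, top_k=3):
    """Re-score retrieved chunks and return the top_k texts by the new
    score. `candidates` is a list of (chunk_text, chunk_vec) pairs; the
    dot product here stands in for a cross-encoder relevance score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = sorted(
        ((dot(query_vec, vec), text) for text, vec in candidates),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]
```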
| Script | Description |
|---|---|
| `index_collection_to_milvus.py` | Indexes all `DocumentPageText` objects from a relational collection into a Milvus collection. |
| `index_to_milvus.sh` | Convenience wrapper that sets the index type (HNSW or IVF_FLAT) and runs the Python script. |
| `index_to_postgresql.sh` | Uses `add_files_from_dir.py` to ingest raw files and then indexes them. |
All scripts accept the following common arguments:

- `--from-collection` – name of the relational collection containing the raw chunks.
- `--to-collection` – name of the Milvus collection that will be created.
- `--chunk-type` – the `text_chunk_type` field used to differentiate between plain, cleaned, or overlapped chunks.
- `--index-name` – one of the supported Milvus index types (`HNSW`, `IVF_FLAT`).
> Note: The Milvus handler automatically creates the collection if it does not exist, using the embedding dimensions defined in the model registry.
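For orientation, Milvus index definitions for the two supported types typically look like the dictionaries below (the metric and tuning values are illustrative defaults, not this project's configuration):

```python
# Typical Milvus index definitions. M, efConstruction and nlist are
# tuning knobs; the values below are illustrative only.
hnsw_index = {
    "index_type": "HNSW",
    "metric_type": "IP",  # inner product, common for normalized embeddings
    "params": {"M": 16, "efConstruction": 200},
}

ivf_flat_index = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",
    "params": {"nlist": 1024},
}
```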
## Search API

The REST API follows a versioned pattern (`/api/vX.Y/…`). All endpoints are defined in the Django `urls.py` files of each app.
| Method | URL (example) | Description |
|---|---|---|
| POST | `/api/new_collection/` | Create a new document collection (specify embedder, reranker, index type). |
| GET | `/api/collections/` | List all collections visible to the authenticated user. |
| POST | `/api/upload_and_index_files/` | Upload files (multipart) and trigger indexing. |
| POST | `/api/search_with_options/` | Perform a semantic search with rich filtering options. |
| POST | `/api/generative_answer/` | Generate a RAG answer for a previously stored query response. |
| GET | `/api/question_templates/` | List available query-template collections. |
| GET | `/api/filter_options/` | Retrieve mock filter metadata (URLs, categories) – useful for UI drop-downs. |
| GET | `/api/categories/` | List distinct categories present in a collection. |
| GET | `/api/documents/` | List document names within a collection (used for UI selectors). |
| GET | `/api/chats/` | Retrieve all chats for the current user. |
| POST | `/api/add_user_message/` | Append a user message to a chat, optionally run RAG and return the assistant reply. |
| POST | `/api/save_chat/` | Mark a chat as saved and optionally set it to read-only. |
| GET | `/api/get_chat_by_hash/` | Retrieve a saved chat by its hash (read-only). |
All endpoints require an `Authorization` header (`Token <jwt>`) unless `ENV_PUBLIC_API_AVAILABLE` is set.

The `search_with_options` endpoint expects a JSON payload with:
```json
{
  "collection_name": "MyCollection",
  "query_str": "Jakie są najnowsze regulacje podatkowe?",
  "options": {
    "categories": ["Prawo", "Finanse"],
    "documents": ["regulacje_2023.pdf"],
    "relative_paths": ["/2023/"],
    "templates": [1, 5],
    "only_template_documents": false,
    "max_results": 40,
    "rerank_results": true,
    "return_with_factored_fields": false,
    "relative_path_contains": ["https://gov.pl"],
    "metadata_filters": [
      {
        "operator": "eq",
        "field": {"metadata_json.type": "regulation"}
      }
    ],
    "use_and_operator": true
  }
}
```

Key concepts:

- Categories – map to the `category` field of `Document`.
- Templates – IDs of `QueryTemplate` objects; the system will filter documents that satisfy the template's `data_filter_expressions`.
- Metadata filters – low-level JSON-logic style filters (`in`, `eq`, `gt`, …) applied directly on the `metadata_json` column.
- `use_and_operator` – determines whether multiple filter lists are intersected (`true`) or unioned (`false`).
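How metadata filters combine under `use_and_operator` can be sketched in a few lines (a simplified model, not the engine's actual query builder; only the `eq`, `in`, and `gt` operators are shown, and fields are plain keys rather than `metadata_json.*` paths):

```python
OPERATORS = {
    "eq": lambda value, target: value == target,
    "in": lambda value, target: value in target,
    "gt": lambda value, target: value is not None and value > target,
}

def matches(metadata, filters, use_and_operator=True):
    """Check a document's metadata dict against filters shaped like
    {"operator": "eq", "field": {"type": "regulation"}}."""
    results = []
    for flt in filters:
        op = OPERATORS[flt["operator"]]
        for field, target in flt["field"].items():
            results.append(op(metadata.get(field), target))
    # AND intersects all conditions; OR keeps a document if any holds.
    return all(results) if use_and_operator else any(results)
```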
The response contains:

- `stats` – aggregated per-document hit counts, scores, and page statistics.
- `detailed_results` – list of matching chunks with left/right context.
- `structured_results` – optional structured output generated by a template that requests it.
- `template_prompts` – any system prompts extracted from the used templates (useful for LLM instruction).
Query templates are defined in `configs/query-templates.json`. They consist of:

- Grammar – token set and alphabet used to generate natural-language queries.
- Data connector – static key-value pairs that must match `Document.metadata_json`.
- Data filter expressions – dynamic Python expressions evaluated against document metadata (e.g., date range checks).
- Structured response flag – when `true`, the engine extracts specified fields from the document metadata and returns them alongside the text snippets.
The `QueryTemplateController` automatically validates a document against its grammar and filter expressions (`use_document_in_sse`), ensuring only relevant documents participate in the search.
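Evaluating a filter expression against a document's metadata can be sketched like this (a hypothetical illustration of the idea; the controller's real evaluation and sandboxing may differ, and restricting `__builtins__` is not a complete security boundary):

```python
def evaluate_filter_expression(expression, metadata):
    """Evaluate a Python filter expression with the document's metadata
    fields exposed as local names; builtins are disabled."""
    return bool(eval(expression, {"__builtins__": {}}, dict(metadata)))

doc = {"year": 2023, "category": "Prawo"}
accepted = evaluate_filter_expression(
    "year >= 2020 and category == 'Prawo'", doc
)
```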
## Conversational RAG (Chat) Service

The Chat app adds a conversational layer on top of the semantic search engine. It stores chat history, supports content supervision, and integrates with generative models.
1. Create a new chat – `POST /api/new_chat/` (optional collection and search options).
2. Send a user message – `POST /api/add_user_message/` with `chat_id`, `user_message`, and generation options.
3. The system processes the message:
   - Detects whether the message is a RAG query (`user_msg_is_rag_question` flag).
   - Optionally extracts an explicit question (`pytanie:`) and/or instruction.
   - Runs the semantic search using the collection linked to the chat (or the global default).
   - If `use_rag_supervisor` is enabled, the system creates a `RAGMessageState` linking the query, response, and generated answer.
   - If `use_content_supervisor` is enabled, URLs in the user text are fetched and their content added to the prompt.
4. Generate the assistant response – either via a local LLM endpoint or OpenAI, with optional translation via DeepL.
5. Persist messages – `Message` objects store the role (`user`, `assistant`, `system`), text, timestamps, and an optional `MessageState`.
All chats are tied to an `OrganisationUser` for permission checks. The `read_only` flag prevents further messages from being added.
Implemented in `ChatController._prepare_content_supervisor_state`. It uses `radlab_content_supervisor` to:

- Detect URLs (`URLRegexProcessor`).
- Fetch remote content (HTML) and store it in `ContentSupervisorState.www_content`.
- Append fetched snippets to the user message before sending it to the LLM.
This feature is useful for dynamic knowledge retrieval (e.g., fetching the latest news article).
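URL detection itself can be done with a simple regex (a stand-in for `URLRegexProcessor`; the pattern below is an assumption for illustration, not the library's actual pattern):

```python
import re

# Deliberately simple: an http(s) scheme followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def detect_urls(text):
    """Return every http(s) URL found in the user's message."""
    return URL_PATTERN.findall(text)

urls = detect_urls("See https://gov.pl/news and http://example.com/a?b=1 today")
```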
The `GenerativeModelController` orchestrates calls to:

- Local model APIs – defined in `configs/generative-models.json` with custom endpoints (`/generate`).
- OpenAI – via `OpenAIGenerativeController`.

Both paths support:

- Generation options (`top_k`, `temperature`, etc.).
- System prompt – a static instruction that can be injected from a query template.
- Answer translation – using DeepL if `translate_answer` is true and a key is present.
Generated answers are stored in `UserQueryResponseAnswer` and can be rated via the `SetRateForQueryResponseAnswer` endpoint.
## Management & Administrative Tools

All standard Django commands are available (`runserver`, `migrate`, `createsuperuser`). Additional project-specific commands are provided by the `apps_sse` scripts:
| Command | Description |
|---|---|
| `python manage.py runserver 0.0.0.0:8271` | Starts the development server (default port 8271 in `run-api.sh`). |
| `python manage.py migrate` | Applies database migrations (including the generated migration files under `chat/migrations`). |
| `python manage.py createsuperuser` | Creates a Django superuser for the admin UI (if `ENV_SHOW_ADMIN_WINDOW` is true). |
- `full-install.sh` – End-to-end installer that sets up the virtual environment, installs required packages, creates the database, runs migrations, prepares the Milvus collection, adds a default organisation/user, and loads query templates.
- `run-api.sh` – Convenience wrapper that sets common environment variables (CUDA device, data upload limits, API keys) and starts the Django dev server.
- `index_to_milvus.sh` & `index_to_postgresql.sh` – Quick wrappers around the Python indexing scripts for ad-hoc runs.
All scripts are idempotent where possible; they log progress to stdout and rely on the underlying Python modules for error handling.
## Testing & Evaluation

The evaluator package (`apps_sse/evaluator`) provides utilities for:

- Loading test configurations (JSON files defining test cases).
- Running semantic search with various parameters (max results, rerank, etc.).
- Generating answers using configured generative models.
- Computing metrics (BLEU, ROUGE) via `torchmetrics`.
The main entry point is `tests_loader.py`. It writes results to Excel (`.xlsx`) files for downstream analysis. Example usage:

```bash
python apps_sse/evaluator/src/tests_loader.py \
    -u alice \
    -c ./tests/config.json \
    -o results.xlsx \
    -p results_chunks.xlsx
```

The output contains:
- Aggregated statistics per collection, test case, and model.
- Human‑annotated answer comparisons (precision/recall style).
- Per‑chunk detailed results for error analysis.
## Extending the Engine

The codebase is deliberately plug-and-play:
- Add a new embedding model – append an entry to `configs/embedders.json` and place the model files where `model_path` points. No code changes required.
- Add a new reranker – same procedure with `configs/rerankers.json`.
- Create a custom query template – edit `configs/query-templates.json` (or add a new file and load it via `QueryTemplatesLoaderController`). Define `data_filter_expressions` using Python syntax; the engine evaluates them with `eval`.
- Expose a new API endpoint – create a view inheriting from `APIView`, add the route to the appropriate `urls.py`, and use the existing decorators (`required_params_exists`, `get_organisation_user`).
- Swap the LLM backend – implement a new subclass of `GenerativeModelControllerApi.LocalModelAPI` with custom request payload/response handling, then reference the new model name in `configs/generative-models.json`.
All extensions respect the existing permission model (token-based authentication) and will automatically appear in the OpenAPI/Swagger UI if `django-rest-swagger` is enabled.
## Contribution Guidelines

1. Fork the repository and create a feature branch (`git checkout -b feature/your-idea`).
2. Write tests for any new functionality (use the existing `apps_sse/evaluator` test harness as a reference).
3. Run the full test suite: `pytest` (install via `pip install pytest`).
4. Format code with `black` and lint with `flake8`.
5. Update documentation – add a new section to this README, or improve docstrings.
6. Submit a Pull Request targeting the `main` branch and ensure CI passes (if configured).
## License

This project is licensed under the Apache 2.0 License – see the `LICENSE` file for full details.
Happy coding! 🚀