# IntelliContent

An AI-driven semantic search system with multi-model comparison and hybrid retrieval capabilities.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Quick Start
- Complete Workflow
- Usage Examples
- Model Comparison
- Configuration
## Overview

This project implements an intelligent content retrieval system that:
- Embeds documents using state-of-the-art embedding models (BGE-base, MiniLM)
- Indexes vectors in Milvus vector database
- Searches semantically with hybrid retrieval (BM25 + Vector)
- Reranks results using cross-encoder models
- Compares models for empirical evaluation
**Default configuration:** BGE-base-en-v1.5 with the optimized chunking strategy.
## Features

- 🔍 Semantic Search: Vector similarity search with multiple embedding models
- 🔄 Hybrid Retrieval: Combines BM25 (keyword) and vector (semantic) search (see the fusion sketch after this list)
- 📊 Model Comparison: Systematic comparison between baseline and improved models
- 🎯 Reranking: Cross-encoder reranking for improved relevance
- 🧹 Smart Chunking: Optimized chunking with semantic boundary awareness
- 📈 Empirical Evaluation: Built-in evaluation framework
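The hybrid retrieval feature fuses two rankings into one. Below is a minimal sketch of weighted score fusion, assuming min-max normalization and the 0.7/0.3 weights shown later in this README; the function names are illustrative, not the actual API of `src/search.py`.

```python
# Illustrative sketch of BM25 + vector score fusion, not the project's API.
# Each retriever's scores are min-max normalized to [0, 1], then combined
# with a weighted sum (vector_weight / bm25_weight as used later in this README).

def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize a doc_id -> score mapping to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def fuse(vector_scores: dict[str, float], bm25_scores: dict[str, float],
         vector_weight: float = 0.7, bm25_weight: float = 0.3) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; a doc missing from one ranking scores 0 there."""
    v, b = normalize(vector_scores), normalize(bm25_scores)
    fused = {d: vector_weight * v.get(d, 0.0) + bm25_weight * b.get(d, 0.0)
             for d in set(v) | set(b)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```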
## Project Structure

```
IntelliContent/
├── data/
│   ├── hf_docs.json              # Raw documents
│   ├── chunks_bge.json           # BGE model chunks (default)
│   └── chunks_minilm.json        # MiniLM model chunks
├── src/
│   ├── models.py                 # Model configurations
│   ├── embeddings.py             # Embedding generation
│   ├── milvus_index.py           # Milvus collection management
│   ├── search.py                 # Unified search module
│   ├── reranker.py               # Reranking module
│   ├── bm25_retriever.py         # BM25 keyword retrieval
│   └── query_processor.py        # Query preprocessing
├── scripts/
│   ├── extract_hf_docs.py        # Extract documents from source
│   ├── chunk.py                  # Document chunking
│   ├── ingest.py                 # Index to Milvus
│   └── setup_nltk.py             # NLTK data setup
├── tests/
│   ├── test_accuracy.py          # Accuracy testing
│   └── test_model_comparison.py  # Model comparison tests
└── app.py                        # Simple CLI demo
```
## Installation

```bash
git clone https://github.com/NicoYuan1986/IntelliContent.git
cd IntelliContent
pip install -r requirements.txt
python scripts/setup_nltk.py
```

Or manually:

```python
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
```

Set environment variables:

```bash
export MILVUS_URI="your_milvus_uri"
export MILVUS_TOKEN="your_milvus_token"
```

Or create a .env file:

```
MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
```
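If you use a .env file, the variables still need to be loaded into the process environment. A common way to do this is python-dotenv; whether this project actually loads .env this way is an assumption:

```python
# Assumption: .env is loaded via python-dotenv (pip install python-dotenv).
# The project may read its configuration differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ
print(os.getenv("MILVUS_URI"))
```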
## Quick Start

```bash
# 1. Extract documents (if needed)
python scripts/extract_hf_docs.py

# 2. Generate chunks with BGE model (optimized strategy - default)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json

# 3. Index to Milvus (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge

# 4. Search
python app.py
```

## Complete Workflow

### Step 1: Extract Documents

This project uses HuggingFace Transformers documentation as the data source. You have two options for getting the data:
**Option 1:** Since the transformers repository is large, it is excluded from this repository. Clone it separately:

```bash
# Clone the transformers repository to the project root
git clone https://github.com/huggingface/transformers.git
```

Then run the extraction script:

```bash
python scripts/extract_hf_docs.py
```

**Option 2:** If data/hf_docs.json already exists in this repository, skip Step 1 and proceed directly to Step 2.

Output: `data/hf_docs.json` (containing ~580 processed documents)
### Step 2: Chunk Documents

Chunk documents for embedding. Default: BGE model with the optimized strategy.

```bash
# BGE model with optimized chunking (default, recommended)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json
```

Options:
- `--model`: `minilm` or `bge` (default: `bge`)
- `--strategy`: `simple` or `optimized` (default: `optimized`)
- `--output`: Output file path
What optimized strategy does:
- Semantic boundary-aware chunking (respects sentences/paragraphs)
- Enhanced text cleaning
- Quality filtering (removes low-quality chunks)
- Better context preservation
Output: `data/chunks_bge.json`
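For intuition, here is a rough sketch of sentence-boundary-aware chunking using the limits listed above (~450 tokens per chunk, overlap for context, 100-1500 character quality filter). It approximates model tokens with whitespace tokens and is not the project's actual `chunk.py`:

```python
# Illustrative approximation of boundary-aware chunking, not the project's chunk.py.
# Sentences are packed greedily up to ~450 whitespace tokens (standing in for
# model tokens), a few sentences are carried over as overlap, and a length
# filter drops low-quality chunks.
from nltk.tokenize import sent_tokenize

def chunk_text(text: str, max_tokens: int = 450, overlap_sentences: int = 2,
               min_chars: int = 100, max_chars: int = 1500) -> list[str]:
    sentences = sent_tokenize(text)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry context forward
            current_tokens = sum(len(s.split()) for s in current)
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    # Quality filter: drop too-short chunks, truncate overlong ones.
    return [c[:max_chars] for c in chunks if len(c) >= min_chars]
```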
### Step 3: Index to Milvus

Index chunks to the Milvus vector database. The chunk file is auto-detected based on the model.

```bash
# Index BGE model (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge
```

Options:
- `--model`: `minilm`, `bge`, or `all` (default: `bge`)
- `--data`: Manual chunk file path (optional, auto-detected if not provided)
- `--batch-size`: Batch size for embedding (default: 32)

What happens:
- Creates the Milvus collection `hf_docs_bge` (768 dimensions)
- Loads the BGE-base-en-v1.5 embedding model
- Generates embeddings for all chunks
- Inserts them into Milvus with metadata

Collections created:
- `hf_docs_bge` - BGE model (768-dim)
- `hf_docs_minilm` - MiniLM model (384-dim, if indexed)
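For orientation, here is a minimal sketch of the create-and-insert flow using pymilvus's MilvusClient and sentence-transformers. The field layout ("id"/"vector"/"text") and the embedding call are assumptions, not the project's actual `ingest.py`:

```python
# Minimal sketch of the ingest flow with pymilvus's MilvusClient.
# The schema (auto "id"/"vector" fields plus a "text" field) and the
# embedding model usage are assumptions, not the project's ingest.py.
import os
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient(uri=os.getenv("MILVUS_URI"), token=os.getenv("MILVUS_TOKEN"))
client.create_collection(collection_name="hf_docs_bge", dimension=768)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
chunks = ["example chunk one", "example chunk two"]
vectors = model.encode(chunks, batch_size=32, normalize_embeddings=True)

client.insert(
    collection_name="hf_docs_bge",
    data=[{"id": i, "vector": vec.tolist(), "text": txt}
          for i, (vec, txt) in enumerate(zip(vectors, chunks))],
)
```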
### Step 4: Search

Search using the indexed data:

```python
from src.search import search, compare_models
# Basic search (BGE model, default)
results = search("How to use transformers?", model_key="bge", top_k=5)
# Enhanced search (with hybrid retrieval)
results = search(
    "How to use transformers?",
    model_key="bge",
    top_k=5,
    use_hybrid=True,             # Enable BM25 + Vector
    use_query_expansion=True,    # Enable query expansion
    enhanced_reranker=True,      # Use enhanced reranker
)
# Model comparison
comparison = compare_models("How to use transformers?", top_k=5)
# Returns: {"minilm": [...], "bge": [...]}
```

Or run the CLI demo:

```bash
python app.py
python -m src.search
```

## Usage Examples

### Basic Search

```python
from src.search import search

query = "What is attention mechanism?"
results = search(query, model_key="bge", top_k=5)
for i, result in enumerate(results, 1):
print(f"[{i}] Score: {result['score']:.4f}")
print(f" {result['content'][:200]}...")from src.search import search
query = "BERT model architecture"
results = search(
    query,
    model_key="bge",
    top_k=5,
    top_k_retrieve=50,           # Retrieve more candidates
    top_k_rerank=10,             # Rerank top 10
    use_hybrid=True,             # BM25 + Vector
    use_query_expansion=True,    # Query expansion
    enhanced_reranker=True,      # Enhanced reranker
    vector_weight=0.7,           # Vector search weight
    bm25_weight=0.3,             # BM25 weight
)
for i, result in enumerate(results, 1):
print(f"[{i}] Hybrid Score: {result.get('hybrid_score', 0):.4f}")
print(f" Rerank Score: {result.get('rerank_score', 0):.4f}")
print(f" {result['content'][:200]}...")from src.search import compare_models
query = "How does fine-tuning work?"
comparison = compare_models(
    query,
    top_k=5,
    use_hybrid=True,
    use_reranker=True,
)
# Compare results
for model_key, results in comparison.items():
print(f"\n--- {model_key.upper()} Model ---")
for i, result in enumerate(results, 1):
print(f"[{i}] Score: {result['score']:.4f}")# Accuracy test
python -m tests.test_accuracy
# Model comparison test
python -m tests.test_model_comparison
```

## Model Comparison

The system supports comparing multiple embedding models:
| Model | Key | Dimension | Collection | Description |
|---|---|---|---|---|
| BGE-base | `bge` | 768 | `hf_docs_bge` | Default - Better accuracy |
| MiniLM-L6 | `minilm` | 384 | `hf_docs_minilm` | Baseline - Faster |
```bash
# Index both models for comparison
python scripts/chunk.py --model minilm --output data/chunks_minilm.json
python scripts/chunk.py --model bge --output data/chunks_bge.json
python scripts/ingest.py --model all
```

```python
from src.search import compare_models

results = compare_models("your query", top_k=5)
# Automatically searches both models and returns a comparison
```

## Configuration

### Model Configuration

Edit `src/models.py` to change model settings:
```python
MODEL_CONFIGS = {
    "bge": {
        "name": "BAAI/bge-base-en-v1.5",
        "dimension": 768,
        "max_length": 512,
        "collection_name": "hf_docs_bge",
        "description": "Improved: BGE-base-en-v1.5 (better accuracy, 768-dim)"
    }
}
```

### Chunking Configuration

Default (BGE + Optimized):
- Max tokens: 450 (safety margin for 512 max)
- Stride: 150 (overlap for context)
- Min chars: 100
- Max chars: 1500
### Search Configuration

Default search configuration:
- `top_k_retrieve`: 20 (candidates to retrieve)
- `top_k_rerank`: 3 (candidates to rerank)
- `top_k`: 3 (final results)
- `use_hybrid`: False (set True for BM25 + Vector)
- `use_reranker`: True
- `enhanced_reranker`: False (set True for the L-12 model; see the sketch below)
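Cross-encoder reranking scores each (query, passage) pair jointly rather than comparing precomputed vectors. A minimal sketch with sentence-transformers follows; the ms-marco model names (with the L-12 variant as the enhanced reranker) are assumptions about what `src/reranker.py` loads:

```python
# Sketch of cross-encoder reranking with sentence-transformers.
# The ms-marco model names are assumptions about what reranker.py loads;
# the "L-12" variant matches the enhanced_reranker hint above.
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], enhanced: bool = False, top_k: int = 3):
    name = ("cross-encoder/ms-marco-MiniLM-L-12-v2" if enhanced
            else "cross-encoder/ms-marco-MiniLM-L-6-v2")
    ce = CrossEncoder(name)
    scores = ce.predict([(query, p) for p in passages])  # one score per pair
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```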
## Testing

```bash
python -m tests.test_accuracy
python -m tests.test_model_comparison
```

```python
# Test embedding
from src.embeddings import load_embedding_model, embed_texts
model = load_embedding_model("BAAI/bge-base-en-v1.5")
embeddings = embed_texts(["test text"], model=model)
print(f"Embedding dimension: {len(embeddings[0])}") # Should be 768- Use BGE model for better accuracy (default)
- Enable hybrid search for keyword-heavy queries
- Use enhanced reranker for critical applications
- Adjust top_k_retrieve based on collection size
- Batch processing for large datasets
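To make these tips concrete, here is a search call tuned for a larger collection, using only the parameters documented above; the query string and values are illustrative:

```python
# Applying several of the tips above with the documented search parameters.
from src.search import search

results = search(
    "pipeline for token classification",
    model_key="bge",          # BGE for better accuracy (default)
    top_k=5,
    top_k_retrieve=100,       # larger candidate pool for a larger collection
    use_hybrid=True,          # helps keyword-heavy queries
    enhanced_reranker=True,   # enhanced reranker for critical applications
)
```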
## Troubleshooting

If NLTK data is missing, run the setup script again:

```bash
python scripts/setup_nltk.py
```

If Milvus connections fail, check the environment variables:

```bash
echo $MILVUS_URI
echo $MILVUS_TOKEN
```

If ingestion cannot find a chunk file, make sure to generate chunks first:

```bash
python scripts/chunk.py --model bge --output data/chunks_bge.json
```

If a collection has a dimension mismatch or stale data, delete the old collection and re-index:

```python
from src.milvus_index import get_client, get_collection_name
import os
client = get_client(os.getenv("MILVUS_URI"), os.getenv("MILVUS_TOKEN"))
collection_name = get_collection_name("bge")
if client.has_collection(collection_name):
    client.drop_collection(collection_name)
```
**Note:** This project uses BGE-base-en-v1.5 as the default model with the optimized chunking strategy for the best retrieval quality. For faster processing, use the MiniLM model with the simple chunking strategy.