Intelligent Content Retrieval and Recommendation System

An AI-driven semantic search system with multi-model comparison, semantic boundary-aware chunking, and RRF-based hybrid retrieval.

Overview

This project implements an intelligent content retrieval system that:

  • Embeds documents using state-of-the-art embedding models (BGE-base, MiniLM)
  • Indexes vectors in Milvus vector database
  • Searches semantically with hybrid retrieval (BM25 + Vector)
  • Reranks results using cross-encoder models
  • Compares models for empirical evaluation

Default Configuration: BGE-base-en-v1.5 model with optimized chunking strategy

Features

  • 🔍 Semantic Search: Vector similarity search with multiple embedding models
  • 🔄 Hybrid Retrieval: Combines BM25 (keyword) and vector (semantic) search via RRF (see the sketch below)
  • 📊 Model Comparison: Systematic comparison between baseline and improved models
  • 🎯 Reranking: Cross-encoder reranking for improved relevance
  • 🧹 Smart Chunking: Optimized chunking with semantic boundary awareness
  • 📈 Empirical Evaluation: Built-in evaluation framework
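
For context, reciprocal rank fusion (RRF), the fusion scheme named in the project description, merges two ranked lists by scoring each document as the sum of 1/(k + rank) over the lists it appears in. A minimal illustrative sketch, not the repository's implementation (the function name is hypothetical; the vector_weight/bm25_weight options shown later suggest the actual fusion can also weight the two lists):

def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Fuse two ranked lists of doc ids; an earlier rank contributes more."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists, so it tops the fused ranking:
print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']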

Project Structure

IntelliContent/
├── data/
│   ├── hf_docs.json               # Raw documents
│   ├── chunks_bge.json            # BGE model chunks (default)
│   └── chunks_minilm.json         # MiniLM model chunks
├── src/
│   ├── models.py                  # Model configurations
│   ├── embeddings.py              # Embedding generation
│   ├── milvus_index.py            # Milvus collection management
│   ├── search.py                  # Unified search module
│   ├── reranker.py                # Reranking module
│   ├── bm25_retriever.py          # BM25 keyword retrieval
│   └── query_processor.py         # Query preprocessing
├── scripts/
│   ├── extract_hf_docs.py         # Extract documents from source
│   ├── chunk.py                   # Document chunking
│   ├── ingest.py                  # Index to Milvus
│   └── setup_nltk.py              # NLTK data setup
├── tests/
│   ├── test_accuracy.py           # Accuracy testing
│   └── test_model_comparison.py   # Model comparison tests
└── app.py                         # Simple CLI demo

Installation

1. Clone the Repository

git clone https://github.com/NicoYuan1986/IntelliContent.git
cd IntelliContent

2. Install Dependencies

pip install -r requirements.txt

3. Setup NLTK Data (for BM25 retrieval)

python scripts/setup_nltk.py

Or manually:

import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

4. Configure Milvus

Set environment variables:

export MILVUS_URI="your_milvus_uri"
export MILVUS_TOKEN="your_milvus_token"

Or create a .env file:

MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
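
To sanity-check the connection before indexing, here is a minimal sketch assuming pymilvus is installed (the project's own helper is get_client in src/milvus_index.py, shown under Troubleshooting):

import os
from pymilvus import MilvusClient

client = MilvusClient(
    uri=os.getenv("MILVUS_URI"),
    token=os.getenv("MILVUS_TOKEN"),
)
print(client.list_collections())  # raises if the URI or token is wrong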

Quick Start

Complete Workflow (BGE + Optimized - Default)

# 1. Extract documents (if needed)
python scripts/extract_hf_docs.py

# 2. Generate chunks with BGE model (optimized strategy - default)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json

# 3. Index to Milvus (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge

# 4. Search
python app.py

Detailed Workflow

Step 1: Extract Documents

This project uses HuggingFace Transformers documentation as the data source. You have two options to get the data:

Option A: Clone from Source (Recommended for full reproduction)

Since the transformers repository is large, it is excluded from this repository. You can clone it separately:

# Clone the transformers repository to the project root
git clone https://github.com/huggingface/transformers.git

Then run the extraction script:

python scripts/extract_hf_docs.py

Option B: Use Provided Data

If data/hf_docs.json already exists in this repository, you can skip Step 1 and proceed directly to Step 2.

Output: data/hf_docs.json (containing ~580 processed documents)

Step 2: Generate Chunks

Chunk documents for embedding. The default is the BGE model with the optimized strategy.

# BGE model with optimized chunking (default, recommended)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json

Options:

  • --model: minilm or bge (default: bge)
  • --strategy: simple or optimized (default: optimized)
  • --output: Output file path

What the optimized strategy does:

  • Semantic boundary-aware chunking (respects sentences/paragraphs; see the sketch below)
  • Enhanced text cleaning
  • Quality filtering (removes low-quality chunks)
  • Better context preservation

Output: data/chunks_bge.json
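
To make "semantic boundary-aware" concrete, here is a simplified, character-based sketch of sentence-respecting chunking with overlap. It is illustrative only; the repository's chunk.py works in tokens (max 450, stride 150, per the Configuration section), and the function name is hypothetical:

import nltk  # needs the punkt_tab data from scripts/setup_nltk.py

def sentence_chunks(text, max_chars=1500, min_chars=100, overlap_sents=2):
    """Pack whole sentences into chunks, never splitting mid-sentence."""
    chunks, current = [], []
    for sent in nltk.sent_tokenize(text):
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # carry sentences forward as overlap
        current.append(sent)
    if current and len(" ".join(current)) >= min_chars:  # quality filter: drop tiny tails
        chunks.append(" ".join(current))
    return chunks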

Step 3: Index to Milvus

Index chunks into the Milvus vector database. The chunk file is auto-detected from the model name.

# Index BGE model (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge

Options:

  • --model: minilm, bge, or all (default: bge)
  • --data: Manual chunk file path (optional, auto-detected if not provided)
  • --batch-size: Batch size for embedding (default: 32)

What happens (see the sketch after this list):

  1. Creates Milvus collection hf_docs_bge (768 dimensions)
  2. Loads BGE-base-en-v1.5 embedding model
  3. Generates embeddings for all chunks
  4. Inserts into Milvus with metadata
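
In pymilvus terms, the sequence looks roughly like this simplified stand-in for scripts/ingest.py (assumes a quick-setup collection with dynamic fields; not the repository's exact code):

import os
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

texts = ["Transformers provides thousands of pretrained models."]  # stand-in chunks

client = MilvusClient(uri=os.getenv("MILVUS_URI"), token=os.getenv("MILVUS_TOKEN"))
if not client.has_collection("hf_docs_bge"):
    client.create_collection(collection_name="hf_docs_bge", dimension=768)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim embeddings
vectors = model.encode(texts, batch_size=32, normalize_embeddings=True)

client.insert(
    collection_name="hf_docs_bge",
    data=[{"id": i, "vector": v.tolist(), "text": t}
          for i, (v, t) in enumerate(zip(vectors, texts))],
)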

Collections created:

  • hf_docs_bge - BGE model (768-dim)
  • hf_docs_minilm - MiniLM model (384-dim, if indexed)

Step 4: Search

Search using the indexed data:

Python API

from src.search import search, compare_models

# Basic search (BGE model, default)
results = search("How to use transformers?", model_key="bge", top_k=5)

# Enhanced search (with hybrid retrieval)
results = search(
    "How to use transformers?",
    model_key="bge",
    top_k=5,
    use_hybrid=True,           # Enable BM25 + Vector
    use_query_expansion=True,  # Enable query expansion
    enhanced_reranker=True     # Use enhanced reranker
)

# Model comparison
comparison = compare_models("How to use transformers?", top_k=5)
# Returns: {"minilm": [...], "bge": [...]}

CLI Demo

python app.py

Interactive Search

python -m src.search

Usage Examples

Example 1: Basic Search

from src.search import search

query = "What is attention mechanism?"
results = search(query, model_key="bge", top_k=5)

for i, result in enumerate(results, 1):
    print(f"[{i}] Score: {result['score']:.4f}")
    print(f"    {result['content'][:200]}...")

Example 2: Enhanced Search with Hybrid Retrieval

from src.search import search

query = "BERT model architecture"
results = search(
    query,
    model_key="bge",
    top_k=5,
    top_k_retrieve=50,        # Retrieve more candidates
    top_k_rerank=10,          # Rerank top 10
    use_hybrid=True,          # BM25 + Vector
    use_query_expansion=True, # Query expansion
    enhanced_reranker=True,   # Enhanced reranker
    vector_weight=0.7,        # Vector search weight
    bm25_weight=0.3          # BM25 weight
)

for i, result in enumerate(results, 1):
    print(f"[{i}] Hybrid Score: {result.get('hybrid_score', 0):.4f}")
    print(f"    Rerank Score: {result.get('rerank_score', 0):.4f}")
    print(f"    {result['content'][:200]}...")

Example 3: Model Comparison

from src.search import compare_models

query = "How does fine-tuning work?"
comparison = compare_models(
    query,
    top_k=5,
    use_hybrid=True,
    use_reranker=True
)

# Compare results
for model_key, results in comparison.items():
    print(f"\n--- {model_key.upper()} Model ---")
    for i, result in enumerate(results, 1):
        print(f"[{i}] Score: {result['score']:.4f}")

Example 4: Run Tests

# Accuracy test
python -m tests.test_accuracy

# Model comparison test
python -m tests.test_model_comparison

Model Comparison

The system supports comparing multiple embedding models:

Model      Key      Dimension  Collection       Description
BGE-base   bge      768        hf_docs_bge      Default - Better accuracy
MiniLM-L6  minilm   384        hf_docs_minilm   Baseline - Faster

Indexing Multiple Models

# Index both models for comparison
python scripts/chunk.py --model minilm --output data/chunks_minilm.json
python scripts/chunk.py --model bge --output data/chunks_bge.json

python scripts/ingest.py --model all

Comparing Models

from src.search import compare_models

results = compare_models("your query", top_k=5)
# Automatically searches both models and returns comparison

Configuration

Model Configuration

Edit src/models.py to change model settings:

MODEL_CONFIGS = {
    "bge": {
        "name": "BAAI/bge-base-en-v1.5",
        "dimension": 768,
        "max_length": 512,
        "collection_name": "hf_docs_bge",
        "description": "Improved: BGE-base-en-v1.5 (better accuracy, 768-dim)"
    },
    # a "minilm" entry (384-dim, collection "hf_docs_minilm") follows the same shape
}

Chunking Parameters

Default (BGE + optimized; shown as a dict below):

  • Max tokens: 450 (headroom under the model's 512-token limit)
  • Stride: 150 (overlap for context)
  • Min chars: 100
  • Max chars: 1500
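
The same parameters as an illustrative dict (the key names are hypothetical, not the repository's):

CHUNK_PARAMS = {
    "max_tokens": 450,   # headroom under the 512-token model limit
    "stride": 150,       # token overlap between adjacent chunks
    "min_chars": 100,    # drop chunks shorter than this
    "max_chars": 1500,   # cap chunk length
}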

Search Parameters

Default search configuration (written out as a call below):

  • top_k_retrieve: 20 (candidates to retrieve)
  • top_k_rerank: 3 (candidates to rerank)
  • top_k: 3 (final results)
  • use_hybrid: False (set True for BM25 + Vector)
  • use_reranker: True
  • enhanced_reranker: False (set True for L-12 model)
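
Equivalently, a search call with every default written out (parameter names as used in the examples above):

from src.search import search

# Same behavior as search("your query"); defaults made explicit
results = search(
    "your query",
    model_key="bge",
    top_k_retrieve=20,
    top_k_rerank=3,
    top_k=3,
    use_hybrid=False,
    use_reranker=True,
    enhanced_reranker=False,
)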

Testing

Run Accuracy Tests

python -m tests.test_accuracy

Run Model Comparison

python -m tests.test_model_comparison

Test Individual Components

# Test embedding
from src.embeddings import load_embedding_model, embed_texts

model = load_embedding_model("BAAI/bge-base-en-v1.5")
embeddings = embed_texts(["test text"], model=model)
print(f"Embedding dimension: {len(embeddings[0])}")  # Should be 768

Performance Tips

  1. Use the BGE model for better accuracy (default)
  2. Enable hybrid search for keyword-heavy queries
  3. Use the enhanced reranker for critical applications
  4. Adjust top_k_retrieve based on collection size
  5. Use batch processing (--batch-size) for large datasets

Troubleshooting

Issue: NLTK resources not found

python scripts/setup_nltk.py

Issue: Milvus connection error

Check environment variables:

echo $MILVUS_URI
echo $MILVUS_TOKEN

Issue: Chunk file not found

Make sure to generate chunks first:

python scripts/chunk.py --model bge --output data/chunks_bge.json

Issue: Collection dimension mismatch

Delete old collection and re-index:

from src.milvus_index import get_client, get_collection_name
import os

# Connect with the same credentials used during ingestion
client = get_client(os.getenv("MILVUS_URI"), os.getenv("MILVUS_TOKEN"))
collection_name = get_collection_name("bge")
if client.has_collection(collection_name):
    client.drop_collection(collection_name)
# Then re-index: python scripts/ingest.py --model bge

Author

nico


Note: This project uses BGE-base-en-v1.5 as the default model with the optimized chunking strategy for the best retrieval quality. For faster processing, use the MiniLM model with the simple chunking strategy.
