Intelligent Content Retrieval and Recommendation System

An AI-driven semantic search system with multi-model comparison, semantic boundary-aware chunking, and RRF-based hybrid retrieval.

Overview

This project implements an intelligent content retrieval system that:

  • Embeds documents using state-of-the-art embedding models (BGE-base, MiniLM)
  • Indexes vectors in Milvus vector database
  • Searches semantically with hybrid retrieval (BM25 + Vector)
  • Reranks results using cross-encoder models
  • Compares models for empirical evaluation

Default Configuration: BGE-base-en-v1.5 model with optimized chunking strategy

Features

  • 🔍 Semantic Search: Vector similarity search with multiple embedding models
  • 🔄 Hybrid Retrieval: Combines BM25 (keyword) and vector (semantic) search via RRF (see the sketch below)
  • 📊 Model Comparison: Systematic comparison between baseline and improved models
  • 🎯 Reranking: Cross-encoder reranking for improved relevance
  • 🧹 Smart Chunking: Optimized chunking with semantic boundary awareness
  • 📈 Empirical Evaluation: Built-in evaluation framework
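
For context, reciprocal rank fusion (RRF), the fusion scheme named in the project description, merges two ranked lists by scoring each document as the sum of 1/(k + rank) over the lists it appears in. A minimal illustrative sketch, not the repository's implementation (the function name is hypothetical; the vector_weight/bm25_weight options shown later suggest the actual fusion can also weight the two lists):

def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Fuse two ranked lists of doc ids; an earlier rank contributes more."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists, so it tops the fused ranking:
print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))  # ['b', 'c', 'a', 'd']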

Project Structure

IntelliContent/
├── data/
│   ├── hf_docs.json               # Raw documents
│   ├── chunks_bge.json            # BGE model chunks (default)
│   └── chunks_minilm.json         # MiniLM model chunks
├── src/
│   ├── models.py                  # Model configurations
│   ├── embeddings.py              # Embedding generation
│   ├── milvus_index.py            # Milvus collection management
│   ├── search.py                  # Unified search module
│   ├── reranker.py                # Reranking module
│   ├── bm25_retriever.py          # BM25 keyword retrieval
│   └── query_processor.py         # Query preprocessing
├── scripts/
│   ├── extract_hf_docs.py         # Extract documents from source
│   ├── chunk.py                   # Document chunking
│   ├── ingest.py                  # Index to Milvus
│   └── setup_nltk.py              # NLTK data setup
├── tests/
│   ├── test_accuracy.py           # Accuracy testing
│   └── test_model_comparison.py   # Model comparison tests
└── app.py                         # Simple CLI demo

Installation

1. Clone the Repository

git clone https://github.com/NicoYuan1986/IntelliContent.git
cd IntelliContent

2. Install Dependencies

pip install -r requirements.txt

3. Setup NLTK Data (for BM25 retrieval)

python scripts/setup_nltk.py

Or manually:

import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

4. Configure Milvus

Set environment variables:

export MILVUS_URI="your_milvus_uri"
export MILVUS_TOKEN="your_milvus_token"

Or create a .env file:

MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
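
To sanity-check the connection before indexing, here is a minimal sketch assuming pymilvus is installed (the project's own helper is get_client in src/milvus_index.py, shown under Troubleshooting):

import os
from pymilvus import MilvusClient

client = MilvusClient(
    uri=os.getenv("MILVUS_URI"),
    token=os.getenv("MILVUS_TOKEN"),
)
print(client.list_collections())  # raises if the URI or token is wrong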

Quick Start

Complete Workflow (BGE + Optimized - Default)

# 1. Extract documents (if needed)
python scripts/extract_hf_docs.py

# 2. Generate chunks with BGE model (optimized strategy - default)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json

# 3. Index to Milvus (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge

# 4. Search
python app.py

Detailed Workflow

Step 1: Extract Documents

This project uses HuggingFace Transformers documentation as the data source. You have two options to get the data:

Option A: Clone from Source (Recommended for full reproduction)

Since the transformers repository is large, it is excluded from this repository. You can clone it separately:

# Clone the transformers repository to the project root
git clone https://github.com/huggingface/transformers.git

Then run the extraction script:

python scripts/extract_hf_docs.py

Option B: Use Provided Data

If data/hf_docs.json already exists in this repository, you can skip Step 1 and proceed directly to Step 2.

Output: data/hf_docs.json (containing ~580 processed documents)

Step 2: Generate Chunks

Chunk documents for embedding. The default is the BGE model with the optimized strategy.

# BGE model with optimized chunking (default, recommended)
python scripts/chunk.py --model bge --strategy optimized --output data/chunks_bge.json

Options:

  • --model: minilm or bge (default: bge)
  • --strategy: simple or optimized (default: optimized)
  • --output: Output file path

What the optimized strategy does:

  • Semantic boundary-aware chunking (respects sentences/paragraphs; see the sketch below)
  • Enhanced text cleaning
  • Quality filtering (removes low-quality chunks)
  • Better context preservation

Output: data/chunks_bge.json
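
To make "semantic boundary-aware" concrete, here is a simplified, character-based sketch of sentence-respecting chunking with overlap. It is illustrative only; the repository's chunk.py works in tokens (max 450, stride 150, per the Configuration section), and the function name is hypothetical:

import nltk  # needs the punkt_tab data from scripts/setup_nltk.py

def sentence_chunks(text, max_chars=1500, min_chars=100, overlap_sents=2):
    """Pack whole sentences into chunks, never splitting mid-sentence."""
    chunks, current = [], []
    for sent in nltk.sent_tokenize(text):
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # carry sentences forward as overlap
        current.append(sent)
    if current and len(" ".join(current)) >= min_chars:  # quality filter: drop tiny tails
        chunks.append(" ".join(current))
    return chunks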

Step 3: Index to Milvus

Index chunks into the Milvus vector database. The chunk file is auto-detected from the model name.

# Index BGE model (auto-detects data/chunks_bge.json)
python scripts/ingest.py --model bge

Options:

  • --model: minilm, bge, or all (default: bge)
  • --data: Manual chunk file path (optional, auto-detected if not provided)
  • --batch-size: Batch size for embedding (default: 32)

What happens (see the sketch after this list):

  1. Creates Milvus collection hf_docs_bge (768 dimensions)
  2. Loads BGE-base-en-v1.5 embedding model
  3. Generates embeddings for all chunks
  4. Inserts into Milvus with metadata
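
In pymilvus terms, the sequence looks roughly like this simplified stand-in for scripts/ingest.py (assumes a quick-setup collection with dynamic fields; not the repository's exact code):

import os
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

texts = ["Transformers provides thousands of pretrained models."]  # stand-in chunks

client = MilvusClient(uri=os.getenv("MILVUS_URI"), token=os.getenv("MILVUS_TOKEN"))
if not client.has_collection("hf_docs_bge"):
    client.create_collection(collection_name="hf_docs_bge", dimension=768)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim embeddings
vectors = model.encode(texts, batch_size=32, normalize_embeddings=True)

client.insert(
    collection_name="hf_docs_bge",
    data=[{"id": i, "vector": v.tolist(), "text": t}
          for i, (v, t) in enumerate(zip(vectors, texts))],
)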

Collections created:

  • hf_docs_bge - BGE model (768-dim)
  • hf_docs_minilm - MiniLM model (384-dim, if indexed)

Step 4: Search

Search using the indexed data:

Python API

from src.search import search, compare_models

# Basic search (BGE model, default)
results = search("How to use transformers?", model_key="bge", top_k=5)

# Enhanced search (with hybrid retrieval)
results = search(
    "How to use transformers?",
    model_key="bge",
    top_k=5,
    use_hybrid=True,           # Enable BM25 + Vector
    use_query_expansion=True,  # Enable query expansion
    enhanced_reranker=True     # Use enhanced reranker
)

# Model comparison
comparison = compare_models("How to use transformers?", top_k=5)
# Returns: {"minilm": [...], "bge": [...]}

CLI Demo

python app.py

Interactive Search

python -m src.search

Usage Examples

Example 1: Basic Search

from src.search import search

query = "What is attention mechanism?"
results = search(query, model_key="bge", top_k=5)

for i, result in enumerate(results, 1):
    print(f"[{i}] Score: {result['score']:.4f}")
    print(f"    {result['content'][:200]}...")

Example 2: Enhanced Search with Hybrid Retrieval

from src.search import search

query = "BERT model architecture"
results = search(
    query,
    model_key="bge",
    top_k=5,
    top_k_retrieve=50,        # Retrieve more candidates
    top_k_rerank=10,          # Rerank top 10
    use_hybrid=True,          # BM25 + Vector
    use_query_expansion=True, # Query expansion
    enhanced_reranker=True,   # Enhanced reranker
    vector_weight=0.7,        # Vector search weight
    bm25_weight=0.3          # BM25 weight
)

for i, result in enumerate(results, 1):
    print(f"[{i}] Hybrid Score: {result.get('hybrid_score', 0):.4f}")
    print(f"    Rerank Score: {result.get('rerank_score', 0):.4f}")
    print(f"    {result['content'][:200]}...")

Example 3: Model Comparison

from src.search import compare_models

query = "How does fine-tuning work?"
comparison = compare_models(
    query,
    top_k=5,
    use_hybrid=True,
    use_reranker=True
)

# Compare results
for model_key, results in comparison.items():
    print(f"\n--- {model_key.upper()} Model ---")
    for i, result in enumerate(results, 1):
        print(f"[{i}] Score: {result['score']:.4f}")

Example 4: Run Tests

# Accuracy test
python -m tests.test_accuracy

# Model comparison test
python -m tests.test_model_comparison

Model Comparison

The system supports comparing multiple embedding models:

Model      Key      Dimension  Collection       Description
BGE-base   bge      768        hf_docs_bge      Default - Better accuracy
MiniLM-L6  minilm   384        hf_docs_minilm   Baseline - Faster

Indexing Multiple Models

# Index both models for comparison
python scripts/chunk.py --model minilm --output data/chunks_minilm.json
python scripts/chunk.py --model bge --output data/chunks_bge.json

python scripts/ingest.py --model all

Comparing Models

from src.search import compare_models

results = compare_models("your query", top_k=5)
# Automatically searches both models and returns comparison

Configuration

Model Configuration

Edit src/models.py to change model settings:

MODEL_CONFIGS = {
    "bge": {
        "name": "BAAI/bge-base-en-v1.5",
        "dimension": 768,
        "max_length": 512,
        "collection_name": "hf_docs_bge",
        "description": "Improved: BGE-base-en-v1.5 (better accuracy, 768-dim)"
    },
    # a "minilm" entry (384-dim, collection "hf_docs_minilm") follows the same shape
}

Chunking Parameters

Default (BGE + optimized; shown as a dict below):

  • Max tokens: 450 (headroom under the model's 512-token limit)
  • Stride: 150 (overlap for context)
  • Min chars: 100
  • Max chars: 1500
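
The same parameters as an illustrative dict (the key names are hypothetical, not the repository's):

CHUNK_PARAMS = {
    "max_tokens": 450,   # headroom under the 512-token model limit
    "stride": 150,       # token overlap between adjacent chunks
    "min_chars": 100,    # drop chunks shorter than this
    "max_chars": 1500,   # cap chunk length
}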

Search Parameters

Default search configuration (written out as a call below):

  • top_k_retrieve: 20 (candidates to retrieve)
  • top_k_rerank: 3 (candidates to rerank)
  • top_k: 3 (final results)
  • use_hybrid: False (set True for BM25 + Vector)
  • use_reranker: True
  • enhanced_reranker: False (set True for L-12 model)
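
Equivalently, a search call with every default written out (parameter names as used in the examples above):

from src.search import search

# Same behavior as search("your query"); defaults made explicit
results = search(
    "your query",
    model_key="bge",
    top_k_retrieve=20,
    top_k_rerank=3,
    top_k=3,
    use_hybrid=False,
    use_reranker=True,
    enhanced_reranker=False,
)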

Testing

Run Accuracy Tests

python -m tests.test_accuracy

Run Model Comparison

python -m tests.test_model_comparison

Test Individual Components

# Test embedding
from src.embeddings import load_embedding_model, embed_texts

model = load_embedding_model("BAAI/bge-base-en-v1.5")
embeddings = embed_texts(["test text"], model=model)
print(f"Embedding dimension: {len(embeddings[0])}")  # Should be 768

Performance Tips

  1. Use the BGE model for better accuracy (default)
  2. Enable hybrid search for keyword-heavy queries
  3. Use the enhanced reranker for critical applications
  4. Adjust top_k_retrieve based on collection size
  5. Use batch processing (--batch-size) for large datasets

Troubleshooting

Issue: NLTK resources not found

python scripts/setup_nltk.py

Issue: Milvus connection error

Check environment variables:

echo $MILVUS_URI
echo $MILVUS_TOKEN

Issue: Chunk file not found

Make sure to generate chunks first:

python scripts/chunk.py --model bge --output data/chunks_bge.json

Issue: Collection dimension mismatch

Delete old collection and re-index:

from src.milvus_index import get_client, get_collection_name
import os

# Connect with the same credentials used during ingestion
client = get_client(os.getenv("MILVUS_URI"), os.getenv("MILVUS_TOKEN"))
collection_name = get_collection_name("bge")
if client.has_collection(collection_name):
    client.drop_collection(collection_name)
# Then re-index: python scripts/ingest.py --model bge

Author

nico


Note: This project uses BGE-base-en-v1.5 as the default model with the optimized chunking strategy for the best retrieval quality. For faster processing, use the MiniLM model with the simple chunking strategy.
