DocuQueryAI

LLM-Powered Intelligent Query–Retrieval Backend
Built for HackRx 6.0 – Bajaj Finserv's Annual Hackathon



📜 Overview

DocuQueryAI is a backend system that can process large documents (insurance policies, contracts, compliance documents) and answer natural language questions with high accuracy.

It uses semantic embeddings and LLM-based reasoning to retrieve relevant clauses and provide explainable decisions in JSON format.

Key domains:

  • 📄 Insurance
  • ⚖️ Legal
  • 🏢 HR
  • ✅ Compliance

⚙️ Features

  • 📥 PDF ingestion from a URL (DOCX and email support can be added easily)
  • ✂️ Smart text chunking with overlap for context preservation
  • 🔍 Semantic search with pgvector + SentenceTransformers
  • 🤖 LLM-powered answer generation via Groq API
  • 🧠 Explainable answers with traceable context
  • 🚀 Dockerized for quick deployment
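
The overlapping chunker listed above can be sketched in a few lines. This is an illustrative stand-in for the repository's parser logic (the function name and exact behavior are assumptions), using the `CHUNK_SIZE`, `CHUNK_OVERLAP`, and `MIN_CHUNK_LENGTH` settings described later:

```python
def chunk_text(text, chunk_size=500, overlap=100, min_length=50):
    """Split text into fixed-size windows that share `overlap` characters
    with their neighbour, dropping fragments shorter than `min_length`."""
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk to preserve context
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_length:
            chunks.append(chunk)
    return chunks
```

Because each chunk repeats the tail of the previous one, a clause that straddles a chunk boundary still appears whole in at least one chunk.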

🏗 Architecture

flowchart LR
    A[Client Request<br/>Questions + PDF URL] --> B[PDF Downloader]
    B --> C[Text Extractor<br/>PyPDF2]
    C --> D[Chunker<br/>Overlap + Min Length]
    D --> E[Embedding Generator<br/>SentenceTransformer]
    E --> F[pgvector Store<br/>PostgreSQL]
    G[User Query] --> H[Top-K Semantic Retrieval]
    H --> I[LLM Groq API]
    I --> J[Structured JSON Response]

🖥 Tech Stack

  • Language: Python 3.11
  • Framework: FastAPI
  • Vector DB: PostgreSQL + pgvector
  • Embeddings: SentenceTransformers (intfloat/e5-small-v2)
  • LLM: Groq API (configurable model)
  • Deployment: Docker, Uvicorn
  • Others: PyPDF2, psycopg2, dotenv

📂 Project Structure

.
├── api/
│   ├── main.py              # FastAPI app, routing, request handling
│   ├── parser.py            # PDF extraction & text chunking
│   ├── answer_generator.py  # LLM prompt building & Groq API calls
│   ├── db_vector_store.py   # PostgreSQL (pgvector) operations
│   ├── config.py            # Environment & config management
│   └── embeddings.py        # Standalone embedding generator
├── requirements.txt
├── Dockerfile
└── README.md

🚀 Quick Start

1️⃣ Clone the Repository

git clone https://github.com/yourusername/docuqueryai.git
cd docuqueryai

2️⃣ Environment Variables

Create a .env file in the api/ directory:

GROQ_API_KEY=your_groq_api_key
BEARER_TOKEN=your_auth_token
LLM_MODEL=llama3-8b-8192

# Database Config
DB_NAME=your_database_name
DB_USER=your_username
DB_PASSWORD=your_db_password
DB_HOST=localhost
DB_PORT=5432

# PDF Processing
CHUNK_SIZE=500
CHUNK_OVERLAP=100
MIN_CHUNK_LENGTH=50
TOP_K_CHUNKS=3
EMBEDDING_DIM=384
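
A loader for the PDF-processing settings above might look like the following stdlib-only sketch (the repository's config.py presumably uses python-dotenv; the function name and dict layout here are illustrative):

```python
import os

def load_pdf_settings(env=os.environ):
    """Read the PDF-processing variables from the environment,
    falling back to the defaults shown in the .env example."""
    return {
        "chunk_size": int(env.get("CHUNK_SIZE", 500)),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", 100)),
        "min_chunk_length": int(env.get("MIN_CHUNK_LENGTH", 50)),
        "top_k_chunks": int(env.get("TOP_K_CHUNKS", 3)),
        "embedding_dim": int(env.get("EMBEDDING_DIM", 384)),
    }
```

Note that `EMBEDDING_DIM=384` must match both the SentenceTransformer model's output size and the `vector(384)` column created in the next step.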

3️⃣ PostgreSQL Setup (Required)

Before running, you must:

1. Enable the pgvector extension (in pgAdmin or psql):

CREATE EXTENSION IF NOT EXISTS vector;

2. Create the embeddings table:

CREATE TABLE IF NOT EXISTS document_chunks (
    id SERIAL PRIMARY KEY,
    document_id TEXT,
    chunk_index INT,
    chunk_text TEXT,
    embedding vector(384) -- Dimension must match your model
);
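
At query time, retrieval against this table boils down to ranking stored embeddings by cosine distance (in SQL, pgvector's `<=>` operator, e.g. `ORDER BY embedding <=> %s LIMIT 3`). A pure-Python illustration of that ranking step (names are illustrative, not the repository's API):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec, chunks, k=3):
    """chunks: list of (chunk_text, embedding) pairs.
    Return the k chunk texts nearest to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[1]))
    return [text for text, _ in ranked[:k]]
```

The `TOP_K_CHUNKS` setting controls `k`: only the nearest chunks are passed to the LLM as context, which is what keeps the answers traceable to specific clauses.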

4️⃣ Install Dependencies

pip install -r requirements.txt

⚠️ Warning: the pinned library versions are only compatible with Python 3.11. Installing them inside a virtual environment is highly recommended.

Recommended approach:

# Create virtual environment
python3.11 -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

5️⃣ Run Locally

uvicorn main:app --reload --port 8000

6️⃣ Run with Docker

docker build -t docuqueryai .
docker run -p 10000:10000 --env-file api/.env docuqueryai

📋 API Usage

Health Check

GET /health

Response:

{
  "status": "healthy",
  "details": { 
    "pgvector": "healthy" 
  }
}

Run Query

POST /hackrx/run
Authorization: Bearer <BEARER_TOKEN>
Content-Type: application/json

Request Body:

{
  "documents": "https://example.com/policy.pdf",
  "questions": [
    "Does this policy cover knee surgery?",
    "What are the exclusions?"
  ]
}

Response:

{
  "answers": [
    "Yes, knee surgery is covered under clause 4.2 with pre-authorization.",
    "Exclusions include cosmetic surgery, pre-existing conditions..."
  ]
}
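
A minimal client for this endpoint can be built with the standard library alone. The sketch below only constructs the request (base URL and token are placeholders); sending it would be `urllib.request.urlopen(req)`:

```python
import json
import urllib.request

def build_run_request(base_url, token, pdf_url, questions):
    """Build an authorized POST request for /hackrx/run."""
    payload = json.dumps({"documents": pdf_url, "questions": questions})
    return urllib.request.Request(
        f"{base_url}/hackrx/run",
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The `answers` array in the response is positional: answer *i* corresponds to question *i* in the request body.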

🔮 Future Enhancements

  • Support for DOCX, email (.eml) parsing
  • Multi-document query support
  • Fine-tuned domain-specific LLM
  • Web-based frontend interface

📜 License

This project is licensed under the MIT License – see the LICENSE file for details.


🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
