Document Processor

A comprehensive document processing application that allows users to upload various document types, extract data, create embeddings, and interact with processed content using AI. The application features a robust backend with asynchronous job processing and a modern React frontend.

Demo

https://youtu.be/2bKFfopVvn8

Features

  • Multi-format Document Upload: Support for PDF, PNG, JPEG, SVG, CSV, and TXT files
  • Intelligent Data Extraction: Extract text and structured data from uploaded documents
  • Asynchronous Processing: Background job queue built on Redis and Bull
  • Vector Embeddings: Generate embeddings with OpenAI and store them in the Pinecone vector database
  • Interactive Q&A: Ask questions about your uploaded documents
  • Search: Semantic search across extracted document content
  • Data Visualization: View extracted data in a structured format
  • Queue Monitoring: Real-time job queue status and monitoring

Tech Stack

  • Backend: Node.js, Express.js with TypeScript
  • Frontend: React 18 with Material-UI (MUI) v5 and TypeScript
  • Job Queue: Redis + Bull for asynchronous background task processing
  • AI/ML: OpenAI GPT & Embeddings
  • Vector Database: Pinecone
  • Document Processing: PDF-parse, Tesseract.js (OCR), Sharp (image processing)
  • Database: MongoDB with Mongoose
  • Configuration: Centralized configuration management with environment variables
  • Type Safety: Full TypeScript implementation with strict type checking

Project Structure

document-processor/
├── app/
│   ├── backend/
│   │   ├── src/
│   │   │   ├── config/           # Configuration management
│   │   │   │   ├── appConfig.ts  # Centralized configuration service
│   │   │   │   └── database.ts   # Database connection setup
│   │   │   ├── models/           # Database models
│   │   │   │   └── Document.ts   # Document schema and model
│   │   │   ├── services/         # Business logic services
│   │   │   │   ├── documentProcessor.ts  # Document processing logic
│   │   │   │   ├── vectorService.ts      # Vector embeddings and search
│   │   │   │   ├── queryService.ts       # Document querying with AI
│   │   │   │   └── jobQueue.ts           # Background job processing
│   │   │   └── server.ts         # Express server with all routes
│   │   ├── .env.example          # Environment variables template
│   │   ├── eng.traineddata       # Tesseract OCR language data
│   │   └── package.json          # Backend dependencies
│   └── frontend/
│       ├── public/               # Static assets
│       ├── src/
│       │   ├── components/       # React UI components
│       │   │   ├── DocumentUpload.tsx    # File upload component
│       │   │   ├── DocumentList.tsx      # Document listing component
│       │   │   └── DocumentInteraction.tsx # Q&A and search interface
│       │   ├── services/         # API client services
│       │   │   └── documentService.ts    # API communication layer
│       │   ├── types/            # TypeScript type definitions
│       │   │   └── index.ts      # Shared interfaces and types
│       │   ├── App.tsx           # Main application component
│       │   └── index.tsx         # React app entry point
│       └── package.json          # Frontend dependencies
├── .gitignore                    # Git ignore patterns
└── README.md                     # This file

Setup

  1. Clone the repository

  2. Install backend dependencies:

    cd app/backend
    npm install
  3. Install frontend dependencies:

    cd ../frontend
    npm install
  4. Set up environment variables:

    cd ../backend
    cp .env.example .env

    Edit the .env file with your configuration.

  5. Required Services:

    • Redis: Required for job queue
    • MongoDB: For document storage
    • OpenAI API Key: For text embeddings and completions
    • Pinecone API Key: For vector similarity search
  6. Environment Variables:

    # Server Configuration
    PORT=3002
    NODE_ENV=development
    
    # Database Configuration
    MONGODB_URI=mongodb://localhost:27017/document-processor
    
    # Redis Configuration (for job queue)
    REDIS_HOST=localhost
    REDIS_PORT=6379
    
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key
    OPENAI_EMBEDDING_MODEL=text-embedding-3-small
    OPENAI_GPT_MODEL=gpt-4o
    # IMPORTANT: This dimension must match your Pinecone index dimension
    # text-embedding-3-small supports 512, 1024, 1536 dimensions
    # text-embedding-3-large supports 256, 1024, 3072 dimensions
    OPENAI_EMBEDDING_DIMENSIONS=1024
    
    # Pinecone Configuration
    PINECONE_API_KEY=your_pinecone_api_key
    PINECONE_ENVIRONMENT=your_pinecone_environment
    PINECONE_INDEX_NAME=document-processor-index
    
    # Upload Configuration
    MAX_FILE_SIZE=10485760  # 10MB
    UPLOAD_DIR=./uploads
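
The backend reads these variables through its centralized configuration service with fallback defaults. A minimal sketch in the spirit of appConfig.ts (field names and defaults here are illustrative assumptions, not the actual implementation):

```typescript
// Illustrative config loader with fallback defaults, in the spirit of
// appConfig.ts -- field names and defaults are assumptions, not the
// actual implementation.
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  mongodbUri: string;
  redisHost: string;
  redisPort: number;
  embeddingDimensions: number;
}

function loadConfig(env: Env): AppConfig {
  return {
    port: Number(env.PORT ?? 3002),
    mongodbUri: env.MONGODB_URI ?? "mongodb://localhost:27017/document-processor",
    redisHost: env.REDIS_HOST ?? "localhost",
    redisPort: Number(env.REDIS_PORT ?? 6379),
    embeddingDimensions: Number(env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024),
  };
}
```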
    

Development

Start Backend Server

cd app/backend
npm install
npm run dev

Start Frontend Development Server

cd app/frontend
npm install
npm start

Usage

  1. Start required services:

    # Start MongoDB (if not using Atlas)
    mongod
    
    # Start Redis
    redis-server
  2. Start the backend server:

    cd app/backend
    npm run dev

    Backend will run on http://localhost:3002

  3. Start the frontend development server:

    cd app/frontend
    npm start

    Frontend will run on http://localhost:3000 and proxy API requests to the backend

  4. Access the application at http://localhost:3000

  5. Document Processing Flow:

    • Upload documents through the web interface
    • Documents are automatically queued for processing using Bull job queue
    • Background workers process embeddings immediately when jobs are added
    • Monitor job queue status through the application
    • Search and interact with processed documents using AI-powered Q&A
  6. Architecture Notes:

    • Backend uses centralized configuration management with singleton pattern
    • All routes are defined in the main server.ts file
    • Job processing is handled asynchronously with Redis and Bull
    • Vector embeddings are stored in Pinecone for semantic search
    • Frontend uses Material-UI for consistent design system
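
The enqueue/worker flow can be illustrated with a minimal in-memory stand-in (the real jobQueue.ts uses Bull with Redis; all names here are illustrative):

```typescript
// In-memory stand-in for the Bull enqueue/worker flow: the handler runs as
// soon as a job is added, mirroring "workers process embeddings immediately".
// This illustrates the pattern only -- it is not the actual jobQueue.ts code.
type Job<T> = { id: number; data: T };

class SimpleQueue<T> {
  private nextId = 1;
  private handler?: (job: Job<T>) => void;

  // Equivalent of Bull's queue.process(...)
  process(handler: (job: Job<T>) => void): void {
    this.handler = handler;
  }

  // Equivalent of Bull's queue.add(...): the worker picks the job up immediately
  add(data: T): Job<T> {
    const job = { id: this.nextId++, data };
    this.handler?.(job);
    return job;
  }
}

// Usage mirroring the upload flow: enqueue a document id; the worker "processes" it
const embeddingQueue = new SimpleQueue<{ documentId: string }>();
const processed: string[] = [];
embeddingQueue.process((job) => {
  processed.push(job.data.documentId); // ...extract text, embed, upsert to Pinecone...
});
```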

API Endpoints

All API endpoints are defined in the main server.ts file and follow RESTful conventions.

Document Management

  • POST /api/upload - Upload and queue document for processing

    • Accepts: multipart/form-data with file field
    • Supported formats: PDF, PNG, JPEG, SVG, CSV, TXT
    • Returns: Document object with processing status and job ID
  • GET /api/documents - List all documents

    • Query params: ?status=processing|completed|failed
    • Returns: Array of document objects with metadata
  • GET /api/documents/:id - Get specific document details

    • Returns: Complete document object with extracted data and processing status
  • DELETE /api/documents/:id - Delete document and cleanup

    • Removes document, associated files, embeddings, and vector data

Search & Query

  • POST /api/search - Semantic search across all processed documents

    • Body: { "query": "search terms", "limit": 10 }
    • Uses vector similarity search with Pinecone
    • Returns: Array of matching document chunks with relevance scores
  • POST /api/ask - Ask questions about specific documents

    • Body: { "question": "Your question", "documentId": "optional" }
    • Uses OpenAI GPT with document context
    • Returns: AI-generated answer with confidence score and sources
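
The request bodies above can be captured as typed helpers, sketching what the frontend's documentService.ts layer might send (helper names are illustrative, not the actual API):

```typescript
// Typed request payloads for the search and ask endpoints, matching the
// bodies documented above. Helper names are illustrative, not the actual
// documentService.ts API.
interface SearchRequest { query: string; limit: number; }
interface AskRequest { question: string; documentId?: string; }

const API_BASE = "/api"; // the frontend dev server proxies this to the backend

function buildSearchRequest(query: string, limit = 10): { url: string; body: SearchRequest } {
  return { url: `${API_BASE}/search`, body: { query, limit } };
}

function buildAskRequest(question: string, documentId?: string): { url: string; body: AskRequest } {
  const body: AskRequest = documentId ? { question, documentId } : { question };
  return { url: `${API_BASE}/ask`, body };
}
```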

Supported File Types

The application supports multiple document formats with intelligent processing:

  • PDF: Text extraction using pdf-parse library
  • Images (PNG, JPEG, SVG): OCR text extraction using Tesseract.js with eng.traineddata
  • CSV: Structured data parsing with automatic table detection and visualization
  • TXT: Direct text processing with metadata extraction
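
The routing from file type to extraction strategy can be sketched as follows (illustrative; the actual documentProcessor.ts dispatch may differ):

```typescript
// Map a filename to the extraction strategy described above. The return
// labels are illustrative, not identifiers from documentProcessor.ts.
type Extractor = "pdf-parse" | "tesseract-ocr" | "csv-parse" | "plain-text";

function extractorFor(filename: string): Extractor {
  const ext = filename.toLowerCase().split(".").pop() ?? "";
  switch (ext) {
    case "pdf": return "pdf-parse";       // text extraction
    case "png":
    case "jpg":
    case "jpeg":
    case "svg": return "tesseract-ocr";   // OCR with eng.traineddata
    case "csv": return "csv-parse";       // structured table parsing
    case "txt": return "plain-text";      // direct text processing
    default: throw new Error(`Unsupported file type: .${ext}`);
  }
}
```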

Key Features & Technologies

Backend Architecture

  • Configuration Management: Centralized singleton configuration service with validation
  • Document Processing: Multi-format document processor with metadata extraction
  • Vector Search: OpenAI embeddings with Pinecone vector database for semantic search
  • Job Queue: Bull with Redis for asynchronous background processing; workers pick up jobs as soon as they are added
  • Database: MongoDB with Mongoose ODM for document storage
  • Type Safety: Full TypeScript implementation with strict typing

Frontend Architecture

  • React 18: Modern React with hooks and functional components
  • Material-UI v5: Comprehensive design system with emotion styling
  • TypeScript: Strict type checking with shared interfaces
  • Component Structure: Modular components for upload, listing, and interaction
  • API Integration: Centralized service layer for backend communication
  • State Management: React hooks for local state management

AI & ML Integration

  • OpenAI GPT-4o: Advanced question answering and document analysis via a two-step Q&A process
    • Step 1: OpenAI analyzes questions to extract key concepts and generate refined search queries
    • Step 2: Uses original question with enhanced context for final answer generation
  • OpenAI Embeddings: text-embedding-3-small for semantic understanding with configurable dimensions
    • Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match Pinecone index dimensions exactly (512, 1024, or 1536 for text-embedding-3-small)
  • Enhanced Vector Search: Score boosting algorithm with sigmoid transformation for better relevance
    • Applies sigmoid transformation: 1 / (1 + exp(-10 * (score - 0.5)))
    • Additional 20% boost for high-confidence matches (>0.7)
    • Filters low-relevance results (<0.1) early in the process
  • Optimized Chunking Strategy: Improved text segmentation for better semantic matching
    • Reduced chunk size to 500 tokens for more focused content
    • Increased overlap to 100 tokens for better context preservation
  • Token Management: Automatic context truncation to prevent OpenAI rate limit errors
    • Limits context to top 3 most relevant documents
    • Truncates each document to 2000 characters maximum
    • Maintains essential information while staying within API limits
  • Semantic Query Enhancement: Query refinement for improved search accuracy
    • Enhances queries with contextual information for better embedding matching
  • Pinecone: High-performance vector similarity search
    • Searches 3x more candidates initially, then filters and re-ranks results
  • Worker-driven Processing: Embedding jobs run as soon as they are queued, with no cron scheduling overhead

Configuration Notes

The application uses a robust configuration system that:

  • Validates required API keys on startup
  • Provides fallback defaults for development
  • Supports environment-specific configurations
  • Gracefully handles missing API keys with warnings
  • Uses singleton pattern for consistent configuration access
  • Embedding Dimensions: Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match your Pinecone index dimension exactly
    • text-embedding-3-small: supports 512, 1024, 1536 dimensions
    • text-embedding-3-large: supports 256, 1024, 3072 dimensions
    • Default: 1024 dimensions
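
The dimension constraint can be enforced at startup with a check along these lines (illustrative; the actual appConfig.ts validation may differ):

```typescript
// Startup check that OPENAI_EMBEDDING_DIMENSIONS is valid for the chosen
// model. The supported-dimension lists come from this README; the function
// name is illustrative, not the actual appConfig.ts API.
const SUPPORTED_DIMENSIONS: Record<string, number[]> = {
  "text-embedding-3-small": [512, 1024, 1536],
  "text-embedding-3-large": [256, 1024, 3072],
};

function validateEmbeddingDimensions(model: string, dims: number): void {
  const allowed = SUPPORTED_DIMENSIONS[model];
  if (!allowed) throw new Error(`Unknown embedding model: ${model}`);
  if (!allowed.includes(dims)) {
    throw new Error(
      `${model} does not support ${dims} dimensions (allowed: ${allowed.join(", ")})`
    );
  }
}
```

Remember that the same value must also match your Pinecone index dimension exactly.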

Troubleshooting

Common Issues

  1. Redis Connection: Ensure Redis server is running on the configured port
  2. MongoDB Connection: Check MongoDB URI and ensure database is accessible
  3. API Keys: Verify OpenAI and Pinecone API keys are valid and have sufficient credits
  4. File Upload: Check file size limits and supported formats
  5. Port Conflicts: Backend runs on 3002, frontend on 3000 by default

Development Tips

  • Use npm run type-check to validate TypeScript without compilation
  • Check application logs for detailed error information
  • Ensure Pinecone index dimensions match OPENAI_EMBEDDING_DIMENSIONS setting
  • Monitor token usage to avoid OpenAI rate limits (context is automatically truncated)