Document Processor

A comprehensive document processing application that allows users to upload various document types, extract data, create embeddings, and interact with processed content using AI. The application features a robust backend with asynchronous job processing and a modern React frontend.

Demo

https://youtu.be/2bKFfopVvn8

Features

  • Multi-format Document Upload: Support for PDF, PNG, JPEG, SVG, CSV, and TXT files
  • Intelligent Data Extraction: Extract text and structured data from uploaded documents
  • Asynchronous Processing: Background job queue built on Redis and Bull
  • Vector Embeddings: Generate embeddings with OpenAI and store them in the Pinecone vector database
  • Interactive Q&A: Ask questions about your uploaded documents
  • Search: Semantic search across extracted document content
  • Data Visualization: View extracted data in a structured format
  • Queue Monitoring: Real-time job queue status and monitoring

Tech Stack

  • Backend: Node.js, Express.js with TypeScript
  • Frontend: React 18 with Material-UI (MUI) v5 and TypeScript
  • Job Queue: Redis + Bull for asynchronous background task processing
  • AI/ML: OpenAI GPT & Embeddings
  • Vector Database: Pinecone
  • Document Processing: PDF-parse, Tesseract.js (OCR), Sharp (image processing)
  • Database: MongoDB with Mongoose
  • Configuration: Centralized configuration management with environment variables
  • Type Safety: Full TypeScript implementation with strict type checking

Project Structure

document-processor/
├── app/
│   ├── backend/
│   │   ├── src/
│   │   │   ├── config/           # Configuration management
│   │   │   │   ├── appConfig.ts  # Centralized configuration service
│   │   │   │   └── database.ts   # Database connection setup
│   │   │   ├── models/           # Database models
│   │   │   │   └── Document.ts   # Document schema and model
│   │   │   ├── services/         # Business logic services
│   │   │   │   ├── documentProcessor.ts  # Document processing logic
│   │   │   │   ├── vectorService.ts      # Vector embeddings and search
│   │   │   │   ├── queryService.ts       # Document querying with AI
│   │   │   │   └── jobQueue.ts           # Background job processing
│   │   │   └── server.ts         # Express server with all routes
│   │   ├── .env.example          # Environment variables template
│   │   ├── eng.traineddata       # Tesseract OCR language data
│   │   └── package.json          # Backend dependencies
│   └── frontend/
│       ├── public/               # Static assets
│       ├── src/
│       │   ├── components/       # React UI components
│       │   │   ├── DocumentUpload.tsx    # File upload component
│       │   │   ├── DocumentList.tsx      # Document listing component
│       │   │   └── DocumentInteraction.tsx # Q&A and search interface
│       │   ├── services/         # API client services
│       │   │   └── documentService.ts    # API communication layer
│       │   ├── types/            # TypeScript type definitions
│       │   │   └── index.ts      # Shared interfaces and types
│       │   ├── App.tsx           # Main application component
│       │   └── index.tsx         # React app entry point
│       └── package.json          # Frontend dependencies
├── .gitignore                    # Git ignore patterns
└── README.md                     # This file

Setup

  1. Clone the repository

  2. Install backend dependencies:

    cd app/backend
    npm install
  3. Install frontend dependencies:

    cd ../frontend
    npm install
  4. Set up environment variables:

    cd ../backend
    cp .env.example .env

    Edit the .env file with your configuration.

  5. Required Services:

    • Redis: Required for job queue
    • MongoDB: For document storage
    • OpenAI API Key: For text embeddings and completions
    • Pinecone API Key: For vector similarity search
  6. Environment Variables:

    # Server Configuration
    PORT=3002
    NODE_ENV=development
    
    # Database Configuration
    MONGODB_URI=mongodb://localhost:27017/document-processor
    
    # Redis Configuration (for job queue)
    REDIS_HOST=localhost
    REDIS_PORT=6379
    
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key
    OPENAI_EMBEDDING_MODEL=text-embedding-3-small
    OPENAI_GPT_MODEL=gpt-4o
    # IMPORTANT: This dimension must match your Pinecone index dimension
    # text-embedding-3-small supports 512, 1024, 1536 dimensions
    # text-embedding-3-large supports 256, 1024, 3072 dimensions
    OPENAI_EMBEDDING_DIMENSIONS=1024
    
    # Pinecone Configuration
    PINECONE_API_KEY=your_pinecone_api_key
    PINECONE_ENVIRONMENT=your_pinecone_environment
    PINECONE_INDEX_NAME=document-processor-index
    
    # Upload Configuration
    MAX_FILE_SIZE=10485760  # 10MB
    UPLOAD_DIR=./uploads
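
The backend reads these variables through its centralized configuration service with fallback defaults. A minimal sketch in the spirit of appConfig.ts (field names and defaults here are illustrative assumptions, not the actual implementation):

```typescript
// Illustrative config loader with fallback defaults, in the spirit of
// appConfig.ts -- field names and defaults are assumptions, not the
// actual implementation.
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  mongodbUri: string;
  redisHost: string;
  redisPort: number;
  embeddingDimensions: number;
}

function loadConfig(env: Env): AppConfig {
  return {
    port: Number(env.PORT ?? 3002),
    mongodbUri: env.MONGODB_URI ?? "mongodb://localhost:27017/document-processor",
    redisHost: env.REDIS_HOST ?? "localhost",
    redisPort: Number(env.REDIS_PORT ?? 6379),
    embeddingDimensions: Number(env.OPENAI_EMBEDDING_DIMENSIONS ?? 1024),
  };
}
```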
    

Development

Start Backend Server

cd app/backend
npm install
npm run dev

Start Frontend Development Server

cd app/frontend
npm install
npm start

Usage

  1. Start required services:

    # Start MongoDB (if not using Atlas)
    mongod
    
    # Start Redis
    redis-server
  2. Start the backend server:

    cd app/backend
    npm run dev

    Backend will run on http://localhost:3002

  3. Start the frontend development server:

    cd app/frontend
    npm start

    Frontend will run on http://localhost:3000 and proxy API requests to the backend

  4. Access the application at http://localhost:3000

  5. Document Processing Flow:

    • Upload documents through the web interface
    • Documents are automatically queued for processing using Bull job queue
    • Background workers process embeddings immediately when jobs are added
    • Monitor job queue status through the application
    • Search and interact with processed documents using AI-powered Q&A
  6. Architecture Notes:

    • Backend uses centralized configuration management with singleton pattern
    • All routes are defined in the main server.ts file
    • Job processing is handled asynchronously with Redis and Bull
    • Vector embeddings are stored in Pinecone for semantic search
    • Frontend uses Material-UI for consistent design system
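
The enqueue/worker flow can be illustrated with a minimal in-memory stand-in (the real jobQueue.ts uses Bull with Redis; all names here are illustrative):

```typescript
// In-memory stand-in for the Bull enqueue/worker flow: the handler runs as
// soon as a job is added, mirroring "workers process embeddings immediately".
// This illustrates the pattern only -- it is not the actual jobQueue.ts code.
type Job<T> = { id: number; data: T };

class SimpleQueue<T> {
  private nextId = 1;
  private handler?: (job: Job<T>) => void;

  // Equivalent of Bull's queue.process(...)
  process(handler: (job: Job<T>) => void): void {
    this.handler = handler;
  }

  // Equivalent of Bull's queue.add(...): the worker picks the job up immediately
  add(data: T): Job<T> {
    const job = { id: this.nextId++, data };
    this.handler?.(job);
    return job;
  }
}

// Usage mirroring the upload flow: enqueue a document id; the worker "processes" it
const embeddingQueue = new SimpleQueue<{ documentId: string }>();
const processed: string[] = [];
embeddingQueue.process((job) => {
  processed.push(job.data.documentId); // ...extract text, embed, upsert to Pinecone...
});
```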

API Endpoints

All API endpoints are defined in the main server.ts file and follow RESTful conventions.

Document Management

  • POST /api/upload - Upload and queue document for processing

    • Accepts: multipart/form-data with file field
    • Supported formats: PDF, PNG, JPEG, SVG, CSV, TXT
    • Returns: Document object with processing status and job ID
  • GET /api/documents - List all documents

    • Query params: ?status=processing|completed|failed
    • Returns: Array of document objects with metadata
  • GET /api/documents/:id - Get specific document details

    • Returns: Complete document object with extracted data and processing status
  • DELETE /api/documents/:id - Delete document and cleanup

    • Removes document, associated files, embeddings, and vector data

Search & Query

  • POST /api/search - Semantic search across all processed documents

    • Body: { "query": "search terms", "limit": 10 }
    • Uses vector similarity search with Pinecone
    • Returns: Array of matching document chunks with relevance scores
  • POST /api/ask - Ask questions about specific documents

    • Body: { "question": "Your question", "documentId": "optional" }
    • Uses OpenAI GPT with document context
    • Returns: AI-generated answer with confidence score and sources
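
The request bodies above can be captured as typed helpers, sketching what the frontend's documentService.ts layer might send (helper names are illustrative, not the actual API):

```typescript
// Typed request payloads for the search and ask endpoints, matching the
// bodies documented above. Helper names are illustrative, not the actual
// documentService.ts API.
interface SearchRequest { query: string; limit: number; }
interface AskRequest { question: string; documentId?: string; }

const API_BASE = "/api"; // the frontend dev server proxies this to the backend

function buildSearchRequest(query: string, limit = 10): { url: string; body: SearchRequest } {
  return { url: `${API_BASE}/search`, body: { query, limit } };
}

function buildAskRequest(question: string, documentId?: string): { url: string; body: AskRequest } {
  const body: AskRequest = documentId ? { question, documentId } : { question };
  return { url: `${API_BASE}/ask`, body };
}
```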

Supported File Types

The application supports multiple document formats with intelligent processing:

  • PDF: Text extraction using pdf-parse library
  • Images (PNG, JPEG, SVG): OCR text extraction using Tesseract.js with eng.traineddata
  • CSV: Structured data parsing with automatic table detection and visualization
  • TXT: Direct text processing with metadata extraction
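
The routing from file type to extraction strategy can be sketched as follows (illustrative; the actual documentProcessor.ts dispatch may differ):

```typescript
// Map a filename to the extraction strategy described above. The return
// labels are illustrative, not identifiers from documentProcessor.ts.
type Extractor = "pdf-parse" | "tesseract-ocr" | "csv-parse" | "plain-text";

function extractorFor(filename: string): Extractor {
  const ext = filename.toLowerCase().split(".").pop() ?? "";
  switch (ext) {
    case "pdf": return "pdf-parse";       // text extraction
    case "png":
    case "jpg":
    case "jpeg":
    case "svg": return "tesseract-ocr";   // OCR with eng.traineddata
    case "csv": return "csv-parse";       // structured table parsing
    case "txt": return "plain-text";      // direct text processing
    default: throw new Error(`Unsupported file type: .${ext}`);
  }
}
```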

Key Features & Technologies

Backend Architecture

  • Configuration Management: Centralized singleton configuration service with validation
  • Document Processing: Multi-format document processor with metadata extraction
  • Vector Search: OpenAI embeddings with Pinecone vector database for semantic search
  • Job Queue: Bull with Redis for asynchronous background processing; workers pick up jobs as soon as they are added
  • Database: MongoDB with Mongoose ODM for document storage
  • Type Safety: Full TypeScript implementation with strict typing

Frontend Architecture

  • React 18: Modern React with hooks and functional components
  • Material-UI v5: Comprehensive design system with emotion styling
  • TypeScript: Strict type checking with shared interfaces
  • Component Structure: Modular components for upload, listing, and interaction
  • API Integration: Centralized service layer for backend communication
  • State Management: React hooks for local state management

AI & ML Integration

  • OpenAI GPT-4o: Advanced question answering and document analysis via a two-step Q&A process
    • Step 1: OpenAI analyzes questions to extract key concepts and generate refined search queries
    • Step 2: Uses original question with enhanced context for final answer generation
  • OpenAI Embeddings: text-embedding-3-small for semantic understanding with configurable dimensions
    • Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match Pinecone index dimensions exactly (512, 1024, or 1536 for text-embedding-3-small)
  • Enhanced Vector Search: Score boosting algorithm with sigmoid transformation for better relevance
    • Applies sigmoid transformation: 1 / (1 + exp(-10 * (score - 0.5)))
    • Additional 20% boost for high-confidence matches (>0.7)
    • Filters low-relevance results (<0.1) early in the process
  • Optimized Chunking Strategy: Improved text segmentation for better semantic matching
    • Reduced chunk size to 500 tokens for more focused content
    • Increased overlap to 100 tokens for better context preservation
  • Token Management: Automatic context truncation to prevent OpenAI rate limit errors
    • Limits context to top 3 most relevant documents
    • Truncates each document to 2000 characters maximum
    • Maintains essential information while staying within API limits
  • Semantic Query Enhancement: Query refinement for improved search accuracy
    • Enhances queries with contextual information for better embedding matching
  • Pinecone: High-performance vector similarity search
    • Searches 3x more candidates initially, then filters and re-ranks results
  • Worker-driven Processing: Embedding jobs run as soon as they are queued, with no cron scheduling overhead

Configuration Notes

The application uses a robust configuration system that:

  • Validates required API keys on startup
  • Provides fallback defaults for development
  • Supports environment-specific configurations
  • Gracefully handles missing API keys with warnings
  • Uses singleton pattern for consistent configuration access
  • Embedding Dimensions: Configurable via OPENAI_EMBEDDING_DIMENSIONS environment variable
    • Must match your Pinecone index dimension exactly
    • text-embedding-3-small: supports 512, 1024, 1536 dimensions
    • text-embedding-3-large: supports 256, 1024, 3072 dimensions
    • Default: 1024 dimensions
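
The dimension constraint can be enforced at startup with a check along these lines (illustrative; the actual appConfig.ts validation may differ):

```typescript
// Startup check that OPENAI_EMBEDDING_DIMENSIONS is valid for the chosen
// model. The supported-dimension lists come from this README; the function
// name is illustrative, not the actual appConfig.ts API.
const SUPPORTED_DIMENSIONS: Record<string, number[]> = {
  "text-embedding-3-small": [512, 1024, 1536],
  "text-embedding-3-large": [256, 1024, 3072],
};

function validateEmbeddingDimensions(model: string, dims: number): void {
  const allowed = SUPPORTED_DIMENSIONS[model];
  if (!allowed) throw new Error(`Unknown embedding model: ${model}`);
  if (!allowed.includes(dims)) {
    throw new Error(
      `${model} does not support ${dims} dimensions (allowed: ${allowed.join(", ")})`
    );
  }
}
```

Remember that the same value must also match your Pinecone index dimension exactly.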

Troubleshooting

Common Issues

  1. Redis Connection: Ensure Redis server is running on the configured port
  2. MongoDB Connection: Check MongoDB URI and ensure database is accessible
  3. API Keys: Verify OpenAI and Pinecone API keys are valid and have sufficient credits
  4. File Upload: Check file size limits and supported formats
  5. Port Conflicts: Backend runs on 3002, frontend on 3000 by default

Development Tips

  • Use npm run type-check to validate TypeScript without compilation
  • Check application logs for detailed error information
  • Ensure Pinecone index dimensions match OPENAI_EMBEDDING_DIMENSIONS setting
  • Monitor token usage to avoid OpenAI rate limits (context is automatically truncated)