AI Document Processing

Intelligent document processing with semantic search and vector embeddings

Next.js React PostgreSQL License

Features • Getting Started • Database Setup • Usage • Contributing


📋 Overview

A production-ready document processing application that combines Next.js 16, PostgreSQL with pgvector, and Ollama embeddings to provide intelligent semantic search across your documents. Upload PDFs, extract text, generate embeddings, and query your documents using natural language.

✨ Features

  • 📄 PDF Processing - Upload, parse, and extract text from PDF documents
  • 🧠 Semantic Search - Vector-based similarity search using Ollama embeddings
  • 📊 PostgreSQL + pgvector - Scalable vector database with IVFFlat indexing
  • 🖥️ Modern UI - Drag-and-drop file upload with real-time progress
  • 📝 Text Chunking - Intelligent document segmentation for optimal retrieval
  • 🔍 CLI Tool - Interactive command-line interface for PDF chat
  • ⚡ No External Dependencies - Pure fetch API, no OpenAI SDK required
  • 💾 Metadata Tracking - Document metadata with timestamps and file info

πŸ—οΈ Architecture Overview

This project demonstrates a production-ready Next.js application with:

  • Next.js 16: App Router with React 19
  • PostgreSQL + pgvector: Vector database for semantic search
  • Ollama: Local LLM for embeddings (Snowflake Arctic Embed)
  • shadcn/ui: Modern UI components
  • Shared Components: Header, Footer, Landing for consistent UX

πŸ“ Project Structure

ai_document_processing/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   └── v1/
│   │   │       ├── parse/          # PDF upload and processing
│   │   │       └── search/         # Semantic search endpoint
│   │   ├── api-docs/               # Swagger API documentation
│   │   ├── configs/                # Configuration files
│   │   ├── scripts/                # CLI tools
│   │   ├── globals.css             # Global styles
│   │   ├── layout.js               # Root layout
│   │   └── page.js                 # Main upload/search page
│   ├── components/
│   │   ├── ui/                     # shadcn/ui components
│   │   ├── FileUploader.jsx        # Drag-and-drop uploader
│   │   ├── header.js               # Shared header with model selector
│   │   ├── footer.js               # Shared footer
│   │   └── landing.js              # Page wrapper component
│   └── lib/
│       ├── postgres.js             # Database connection
│       └── utils.js                # Utility functions
├── public/
│   ├── images/                     # Logo and assets
│   ├── manifest.json               # PWA manifest
│   └── sw.js                       # Service worker
├── .env.local                      # Environment variables
└── package.json

🚀 Getting Started

Prerequisites

  • Node.js 20.9 or later (required by Next.js 16)
  • PostgreSQL 14+ with pgvector extension
  • Ollama installed locally

Installation

# Clone the repository
git clone https://github.com/shawnmcrowley/ai_document_processing.git

# Navigate to project directory
cd ai_document_processing

# Install dependencies
npm install

# Install Ollama models
ollama pull snowflake-arctic-embed2
ollama pull llama3.2

# Run development server
npm run dev

Open http://localhost:3000 to view the application.

πŸ—„οΈ Database Setup

1. Install pgvector Extension

CREATE EXTENSION IF NOT EXISTS vector;

2. Create Documents Table and Index

-- Table: public.documents

-- DROP TABLE IF EXISTS public.documents;

CREATE SEQUENCE IF NOT EXISTS documents_id_seq;

CREATE TABLE IF NOT EXISTS public.documents
(
    id integer NOT NULL DEFAULT nextval('documents_id_seq'::regclass),
    filename text COLLATE pg_catalog."default" NOT NULL,
    content text COLLATE pg_catalog."default" NOT NULL,
    metadata jsonb NOT NULL,
    embedding vector(1024),
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT documents_pkey PRIMARY KEY (id)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.documents
    OWNER to postgres;
-- Index: idx_documents_embedding

-- DROP INDEX IF EXISTS public.idx_documents_embedding;

CREATE INDEX IF NOT EXISTS idx_documents_embedding
    ON public.documents USING ivfflat
    (embedding vector_cosine_ops)
    WITH (lists=100)
    TABLESPACE pg_default;

3. Create Document Chunks Table and Index

Note: document_chunks has a foreign key to documents, so the documents table must be created first.

-- Table: public.document_chunks

-- DROP TABLE IF EXISTS public.document_chunks;

CREATE SEQUENCE IF NOT EXISTS document_chunks_id_seq;

CREATE TABLE IF NOT EXISTS public.document_chunks
(
    id integer NOT NULL DEFAULT nextval('document_chunks_id_seq'::regclass),
    document_id integer,
    chunk_index integer NOT NULL,
    content text COLLATE pg_catalog."default" NOT NULL,
    embedding vector(1024) NOT NULL,
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT document_chunks_pkey PRIMARY KEY (id),
    CONSTRAINT document_chunks_document_id_fkey FOREIGN KEY (document_id)
        REFERENCES public.documents (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE CASCADE
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.document_chunks
    OWNER to postgres;
-- Index: idx_document_chunks_embedding

-- DROP INDEX IF EXISTS public.idx_document_chunks_embedding;

CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding
    ON public.document_chunks USING ivfflat
    (embedding vector_cosine_ops)
    WITH (lists=100)
    TABLESPACE pg_default;
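With the schema in place, application code can query it for nearest chunks. The sketch below (illustrative, not the repository's actual code) shows how a JavaScript embedding array can be serialized into pgvector's vector-literal format, and what a cosine-distance search over document_chunks might look like; <=> is pgvector's cosine distance operator, so 1 minus the distance gives a similarity score.

```javascript
// pgvector accepts vector literals as '[v1,v2,...]' strings, so a JS
// embedding array must be serialized before binding it as a parameter.
function toPgVector(embedding) {
  return `[${embedding.join(",")}]`;
}

// A cosine-distance search over document_chunks (query shape assumed
// from the schema above).
const searchSql = `
  SELECT document_id, chunk_index, content,
         1 - (embedding <=> $1::vector) AS similarity
  FROM document_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 5`;

console.log(toPgVector([0.1, 0.2, 0.3])); // "[0.1,0.2,0.3]"
```

The serialized literal would be passed as the `$1` parameter via the app's Postgres client.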

🔧 Configuration

Environment Variables

Create .env.local in the project root:

DATABASE_URL=postgresql://user:password@localhost:5432/dbname
OLLAMA_HOST=http://localhost:11434

Import Aliases

Configured in jsconfig.json:

{
  "compilerOptions": {
    "baseUrl": "src",
    "paths": {
      "@/app/*": ["app/*"],
      "@/components/*": ["components/*"],
      "@/lib/*": ["lib/*"]
    }
  }
}

PDF Parse Fix

If you encounter issues with pdf-parse:

  1. Open node_modules/pdf-parse/index.js
  2. Change line 6 from let isDebugMode = !module.parent; to let isDebugMode = false;
  3. Clear Next.js cache: rm -rf .next/cache

🎯 Key Features

Model Selection

  • Header Dropdown: Select between Llama 3.2 and Deep Coder 2
  • Global State: Model selection persists across upload and search

Document Upload

  • Drag-and-Drop: Upload PDFs with visual feedback
  • Processing: Automatic text extraction and chunking
  • Embeddings: Generate vector embeddings using Snowflake Arctic Embed
  • Storage: Save to PostgreSQL with metadata
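The chunking step above can be sketched as a fixed-size splitter with overlap. This is a minimal illustration, not the repository's actual chunker; the overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.

```javascript
// Split text into chunks of `chunkSize` characters; consecutive chunks
// share `overlap` characters so boundary content is never lost.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
    start += chunkSize - overlap;                // step forward, minus overlap
  }
  return chunks;
}
```

Each chunk would then be embedded individually and stored as one row in document_chunks with its chunk_index.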

Semantic Search

  • Natural Language: Query documents using plain English
  • Relevance Scores: View similarity percentages
  • Metadata: Inspect document details
  • Pagination: Scroll through results
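The relevance percentages shown in search results come from cosine similarity between the query embedding and each stored embedding. pgvector computes this in-database, but the underlying math is simple; a sketch for illustration:

```javascript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. Ranges from -1 to 1; 1 means identical direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const score = cosineSimilarity([1, 0, 1], [1, 1, 1]);
console.log(`${(score * 100).toFixed(1)}% match`); // 81.6% match
```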

Shared Components

  • Header: Logo, model selector, navigation (Get Started, API Docs)
  • Footer: Copyright information
  • Landing: Consistent page wrapper for all routes

API Documentation

  • Swagger UI: Interactive API docs at /api-docs
  • Endpoints: /api/v1/parse (upload), /api/v1/search (query)

CLI Tool

node src/app/scripts/index.js -f document.pdf

Interactive chat session:

You: What is this document about?
AI: [Streams response based on document content]
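Ollama's generate endpoint streams its reply as newline-delimited JSON, one object per token batch. A sketch of how the CLI might accumulate the streamed text (function name is illustrative, not the repository's actual code):

```javascript
// Each streamed line from Ollama's /api/generate is a JSON object; the
// text lives in `response`, and `done: true` marks the final message.
function collectStreamedResponse(ndjson) {
  let text = "";
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;        // skip blank lines between objects
    const msg = JSON.parse(line);
    if (msg.response) text += msg.response;
    if (msg.done) break;               // stream finished
  }
  return text;
}

const sample = '{"response":"Hello"}\n{"response":" world"}\n{"done":true}\n';
console.log(collectStreamedResponse(sample)); // "Hello world"
```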

🚀 Deployment

Building for Production

npm run build
npm start

Docker Deployment

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# Install all dependencies: next build needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Next.js team for the amazing framework
  • PostgreSQL and pgvector for vector database capabilities
  • Ollama for local LLM inference
  • Snowflake for Arctic Embed model
  • shadcn/ui for beautiful components

📧 Contact

Creator - Shawn M. Crowley

Project Link: https://github.com/shawnmcrowley/ai_document_processing


Made with ❀️ using Next.js 16 and PostgreSQL
