# AI Document Processing

Intelligent document processing with semantic search and vector embeddings

Features • Getting Started • Database Setup • Usage • Contributing
A production-ready document processing application that combines Next.js 16, PostgreSQL with pgvector, and Ollama embeddings to provide intelligent semantic search across your documents. Upload PDFs, extract text, generate embeddings, and query your documents using natural language.
## Features

- Multi-Format Support - PDF document processing
- Semantic Search - Vector-based similarity search using Ollama embeddings
- PostgreSQL + pgvector - Scalable vector database with IVFFlat indexing
- Modern UI - Drag-and-drop file upload with real-time progress
- Text Chunking - Intelligent document segmentation for optimal retrieval
- CLI Tool - Interactive command-line interface for PDF chat
- No External Dependencies - Pure fetch API, no OpenAI SDK required
- Metadata Tracking - Document metadata with timestamps and file info
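The "no external dependencies" point can be sketched with a plain `fetch` call to Ollama's `/api/embeddings` endpoint. The helper names below are illustrative, not the app's actual code; the endpoint path and request shape follow Ollama's documented REST API, and the model name matches the one this README installs:

```javascript
// Sketch: generate an embedding with plain fetch (no SDK).
// OLLAMA_HOST matches the env var used later in this README.
const OLLAMA_HOST = process.env.OLLAMA_HOST || "http://localhost:11434";

function buildEmbeddingRequest(model, prompt) {
  // Ollama's embeddings API takes { model, prompt } as JSON.
  return {
    url: `${OLLAMA_HOST}/api/embeddings`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt }),
    },
  };
}

async function embed(text) {
  const { url, options } = buildEmbeddingRequest("snowflake-arctic-embed2", text);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
  const { embedding } = await res.json(); // 1024-dim vector for arctic-embed2
  return embedding;
}
```
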
## Tech Stack

The application is built with:
- Next.js 16: App Router with React 19
- PostgreSQL + pgvector: Vector database for semantic search
- Ollama: Local LLM for embeddings (Snowflake Arctic Embed)
- shadcn/ui: Modern UI components
- Shared Components: Header, Footer, Landing for consistent UX
## Project Structure

```
ai_document_processing/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   └── v1/
│   │   │       ├── parse/       # PDF upload and processing
│   │   │       └── search/      # Semantic search endpoint
│   │   ├── api-docs/            # Swagger API documentation
│   │   ├── configs/             # Configuration files
│   │   ├── scripts/             # CLI tools
│   │   ├── globals.css          # Global styles
│   │   ├── layout.js            # Root layout
│   │   └── page.js              # Main upload/search page
│   ├── components/
│   │   ├── ui/                  # shadcn/ui components
│   │   ├── FileUploader.jsx     # Drag-and-drop uploader
│   │   ├── header.js            # Shared header with model selector
│   │   ├── footer.js            # Shared footer
│   │   └── landing.js           # Page wrapper component
│   └── lib/
│       ├── postgres.js          # Database connection
│       └── utils.js             # Utility functions
├── public/
│   ├── images/                  # Logo and assets
│   ├── manifest.json            # PWA manifest
│   └── sw.js                    # Service worker
├── .env.local                   # Environment variables
└── package.json
```
## Getting Started

### Prerequisites

- Node.js 18.17 or later
- PostgreSQL 14+ with pgvector extension
- Ollama installed locally
### Installation

```bash
# Clone the repository
git clone https://github.com/shawnmcrowley/ai_document_processing.git

# Navigate to project directory
cd ai_document_processing

# Install dependencies
npm install

# Install Ollama models
ollama pull snowflake-arctic-embed2
ollama pull llama3.2

# Run development server
npm run dev
```

Open http://localhost:3000 to view the application.
## Database Setup

Create the schema below. Note that `documents` must be created before `document_chunks`, since the latter's foreign key references it:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Sequences backing the id columns
CREATE SEQUENCE IF NOT EXISTS documents_id_seq;
CREATE SEQUENCE IF NOT EXISTS document_chunks_id_seq;

-- Table: public.documents
-- DROP TABLE IF EXISTS public.documents;

CREATE TABLE IF NOT EXISTS public.documents
(
    id integer NOT NULL DEFAULT nextval('documents_id_seq'::regclass),
    filename text COLLATE pg_catalog."default" NOT NULL,
    content text COLLATE pg_catalog."default" NOT NULL,
    metadata jsonb NOT NULL,
    embedding vector(1024),
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT documents_pkey PRIMARY KEY (id)
)
TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.documents
    OWNER to postgres;

-- Index: idx_documents_embedding
-- DROP INDEX IF EXISTS public.idx_documents_embedding;

CREATE INDEX IF NOT EXISTS idx_documents_embedding
    ON public.documents USING ivfflat
    (embedding vector_cosine_ops)
    WITH (lists=100)
    TABLESPACE pg_default;

-- Table: public.document_chunks
-- DROP TABLE IF EXISTS public.document_chunks;

CREATE TABLE IF NOT EXISTS public.document_chunks
(
    id integer NOT NULL DEFAULT nextval('document_chunks_id_seq'::regclass),
    document_id integer,
    chunk_index integer NOT NULL,
    content text COLLATE pg_catalog."default" NOT NULL,
    embedding vector(1024) NOT NULL,
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT document_chunks_pkey PRIMARY KEY (id),
    CONSTRAINT document_chunks_document_id_fkey FOREIGN KEY (document_id)
        REFERENCES public.documents (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE CASCADE
)
TABLESPACE pg_default;

ALTER TABLE IF EXISTS public.document_chunks
    OWNER to postgres;

-- Index: idx_document_chunks_embedding
-- DROP INDEX IF EXISTS public.idx_document_chunks_embedding;

CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding
    ON public.document_chunks USING ivfflat
    (embedding vector_cosine_ops)
    WITH (lists=100)
    TABLESPACE pg_default;
```

## Configuration

Create `.env.local` in the project root:
```bash
DATABASE_URL=postgresql://user:password@localhost:5432/dbname
OLLAMA_HOST=http://localhost:11434
```

Path aliases are configured in `jsconfig.json`:
```json
{
  "compilerOptions": {
    "baseUrl": "src",
    "paths": {
      "@/app/*": ["app/*"],
      "@/components/*": ["components/*"],
      "@/lib/*": ["lib/*"]
    }
  }
}
```

## Troubleshooting

If you encounter issues with `pdf-parse`:
1. Open `node_modules/pdf-parse/index.js`
2. Change line 6 from `let isDebugMode = !module.parent;` to `let isDebugMode = false;`
3. Clear the Next.js cache:

```bash
rm -rf .next/cache
```
## Usage

### Model Selection

- Header Dropdown: Select between Llama 3.2 and Deep Coder 2
- Global State: Model selection persists across upload and search
### Document Upload

- Drag-and-Drop: Upload PDFs with visual feedback
- Processing: Automatic text extraction and chunking
- Embeddings: Generate vector embeddings using Snowflake Arctic Embed
- Storage: Save to PostgreSQL with metadata
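The chunking step can be illustrated with a fixed-size chunker with overlap. This is a minimal sketch, not the app's actual segmentation logic, and the default sizes are assumptions:

```javascript
// Illustrative fixed-size chunker: slices `text` into windows of
// `chunkSize` characters, each overlapping the previous by `overlap`
// characters so sentences split at a boundary appear in both chunks.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const step = Math.max(1, chunkSize - overlap); // guard against non-advancing loops
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk is then embedded individually and stored as a row in `document_chunks` with its `chunk_index`.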
### Semantic Search

- Natural Language: Query documents using plain English
- Relevance Scores: View similarity percentages
- Metadata: Inspect document details
- Pagination: Scroll through results
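The relevance scores are based on cosine similarity, which pgvector's `vector_cosine_ops` exposes as a *distance* (the `<=>` operator returns `1 - similarity`). A plain-JS sketch of the math behind those percentages (function names are illustrative):

```javascript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Range is [-1, 1]; 1 means same direction.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// pgvector's `<=>` returns cosine distance; the UI percentage
// is the complementary similarity, scaled to 0-100.
function relevancePercent(cosineDistance) {
  return Math.round((1 - cosineDistance) * 100);
}
```
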
### Shared Components

- Header: Logo, model selector, navigation (Get Started, API Docs)
- Footer: Copyright information
- Landing: Consistent page wrapper for all routes
### API Documentation

- Swagger UI: Interactive API docs at `/api-docs`
- Endpoints: `/api/v1/parse` (upload), `/api/v1/search` (query)
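A hedged client-side sketch of calling the search endpoint. The path comes from this README, but the JSON request shape (`{ query, limit }`) and the helper name are assumptions, not the actual API contract — check the Swagger UI for the real schema:

```javascript
// Hypothetical helper that assembles a request for /api/v1/search.
// Separating "build" from "send" keeps the shape easy to inspect.
function buildSearchRequest(query, limit = 5) {
  return {
    url: "/api/v1/search",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query, limit }),
    },
  };
}

// Usage (in the browser or a Next.js client component):
// const { url, options } = buildSearchRequest("What is this document about?");
// const results = await fetch(url, options).then((r) => r.json());
```
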
### CLI Tool

Index a PDF and start an interactive chat session:

```bash
node src/app/scripts/index.js -f document.pdf
```

```
You: What is this document about?
AI: [Streams response based on document content]
```
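The streamed replies map onto Ollama's `/api/generate` API, which emits newline-delimited JSON objects (each with a `response` fragment and a `done` flag) when streaming is enabled. A minimal parser sketch — the helper name is illustrative, and the CLI's actual implementation may differ:

```javascript
// Collect an Ollama NDJSON stream into the full response text.
// Each line is a standalone JSON object per Ollama's streaming format.
function collectStream(ndjson) {
  let text = "";
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue;        // skip blank trailing lines
    const chunk = JSON.parse(line);
    if (chunk.response) text += chunk.response; // append the fragment
    if (chunk.done) break;             // final chunk signals completion
  }
  return text;
}
```
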
## Deployment

Build and run the production server:

```bash
npm run build
npm start
```

Or containerize with Docker:

```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```

## Contributing

Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Next.js team for the amazing framework
- PostgreSQL and pgvector for vector database capabilities
- Ollama for local LLM inference
- Snowflake for Arctic Embed model
- shadcn/ui for beautiful components
Project Link: https://github.com/shawnmcrowley/ai_document_processing