Investment RAG

An AI-powered app that analyzes any financial document (10-K filings, annual reports, quarterly reports, etc.) from any jurisdiction and provides insights based on your selected criteria.

Quick Start (Run Locally)

Prerequisites

You'll need accounts and API keys for:

Groq - Get free API key (for LLM)
Google AI - Get free API key (for embeddings)
Pinecone - Sign up free
Clerk - Sign up free
PostgreSQL - Use Vercel Postgres or Neon
Vercel Blob - Vercel Dashboard → Storage (for document uploads)
LangSmith (optional) - smith.langchain.com (for tracing/debugging RAG and agents)

Step 1: Install Dependencies

npm install --legacy-peer-deps

Step 2: Set Up Environment Variables

cp .env.example .env.local

Open .env.local and fill in your keys:

# Required
GROQ_API_KEY=...
GOOGLE_API_KEY=...
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=investment-rag
POSTGRES_URL=postgresql://...
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_...
CLERK_SECRET_KEY=sk_...
BLOB_READ_WRITE_TOKEN=...

# Optional: Clerk webhook sync
CLERK_WEBHOOK_SECRET=whsec_...
# Optional: LangSmith tracing
LANGCHAIN_API_KEY=...
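
A missing key tends to surface later as a vague runtime error. A minimal sketch of a startup check, assuming the required variable names listed above; the `missingEnvVars` helper is hypothetical and not part of the repository:

```typescript
// Hypothetical helper, not part of the repository.
// Names are taken from the "Required" list above.
const REQUIRED_ENV_VARS = [
  "GROQ_API_KEY",
  "GOOGLE_API_KEY",
  "PINECONE_API_KEY",
  "PINECONE_INDEX_NAME",
  "POSTGRES_URL",
  "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
  "CLERK_SECRET_KEY",
  "BLOB_READ_WRITE_TOKEN",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling `missingEnvVars(process.env)` from a setup script and printing the result gives a clear message before the app starts.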

Vercel Blob Storage (required for document uploads)

The app stores uploaded PDFs in Vercel Blob. You need a Blob store and BLOB_READ_WRITE_TOKEN:

Go to Vercel Dashboard → select your project (or create one) → Storage tab.
Click Create Database → choose Blob.
Name the store (e.g. investment-rag-blob), set access to Public (so document URLs work), then create.
After creation, Vercel adds BLOB_READ_WRITE_TOKEN to the project. For local dev, pull env vars:
```
vercel link    # link this repo to your Vercel project if needed
vercel env pull .env.local
```
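Alternatively, copy the token from Storage → your Blob store → Settings and set BLOB_READ_WRITE_TOKEN in .env.local by hand. This keeps the other keys you entered in Step 2 untouched, whereas vercel env pull writes the whole file from the variables stored in the Vercel project.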

Without this token, document uploads will fail.

LangChain / LangSmith (optional)

LangSmith provides tracing and debugging for the RAG pipeline and the analysis agent (LangGraph). It is useful during development but not required to run the app.

Sign up at smith.langchain.com.
Go to Settings → API Keys → Create API Key.
Copy the key and set it in .env.local:
```
LANGCHAIN_API_KEY=lsv2_...
```
Set LANGCHAIN_TRACING_V2=true to enable tracing. With tracing enabled and the key present, the LangChain SDK sends traces to LangSmith.

You can leave LANGCHAIN_API_KEY unset for a basic demo.

Step 3: Set Up Database

npm run db:push

Step 4: Initialize Pinecone Index

npm run init:pinecone

Step 5: Run the App

npm run dev

Open http://localhost:3000 in your browser.

Deploy to Production (Vercel)

Option 1: One-Click Deploy

Push your code to GitHub
Go to vercel.com/new
Import your repository
Add all environment variables from .env.local
Deploy

Option 2: CLI Deploy

npm i -g vercel
vercel login
vercel --prod

Post-Deployment

Set up Clerk Webhook: In the Clerk Dashboard, add the webhook endpoint https://your-domain.com/api/webhooks/clerk with the events user.created, user.updated, and user.deleted. Then set CLERK_WEBHOOK_SECRET to the signing secret Clerk shows for that endpoint.
Verify: Test document upload and analysis

How It Works

Overview

User uploads PDF → Parse & Chunk → Assign Categories → Generate Embeddings → Store
                                                                                ↓
User runs analysis ← LLM generates verdict ← Filter by Categories ← Retrieve chunks

Supported Document Types

The app analyzes any financial report, including:

SEC Filings: 10-K, 10-Q, 8-K (US)
Annual Reports: From any jurisdiction (India, UK, EU, etc.)
Quarterly Reports: Any format
Other Financial Documents: Investor presentations, earnings reports

Content Categories

Each chunk is automatically classified into one or more categories using keyword pattern matching:

| Category | What It Captures |
| --- | --- |
| `financial-performance` | Revenue, profit, margins, cash flow, balance sheet data |
| `risk-factors` | Business risks, uncertainties, threats, exposures |
| `business-operations` | Products, services, market position, operations |
| `management-governance` | Leadership, board, compensation, governance practices |
| `legal-regulatory` | Legal proceedings, compliance, regulations, patents |
| `strategy-outlook` | Growth plans, acquisitions, R&D, future initiatives |
| `general` | Content that doesn't fit specific categories |

Categories enable pre-filtering during retrieval—when analyzing financial health, the system prioritizes financial-performance chunks; for risk assessment, it prioritizes risk-factors chunks.
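
Keyword-based classification can be sketched as follows. The patterns are illustrative assumptions derived from the category table, not the repository's actual keyword list, and `classifyChunk` is a hypothetical name:

```typescript
type Category =
  | "financial-performance"
  | "risk-factors"
  | "business-operations"
  | "management-governance"
  | "legal-regulatory"
  | "strategy-outlook"
  | "general";

// Illustrative patterns only; the real classifier's keywords may differ.
// Each pattern matches word prefixes, so "risk" also matches "risks".
const CATEGORY_PATTERNS: Array<[Category, RegExp]> = [
  ["financial-performance", /\b(revenue|profit|margin|cash flow|balance sheet)/i],
  ["risk-factors", /\b(risk|uncertaint|threat|exposure)/i],
  ["business-operations", /\b(product|service|market position|operations)/i],
  ["management-governance", /\b(board|executive|compensation|governance)/i],
  ["legal-regulatory", /\b(legal proceeding|compliance|regulat|patent)/i],
  ["strategy-outlook", /\b(growth|acquisition|research and development|initiative)/i],
];

// A chunk receives every category whose pattern matches,
// or "general" when nothing matches.
function classifyChunk(text: string): Category[] {
  const matched = CATEGORY_PATTERNS
    .filter(([, pattern]) => pattern.test(text))
    .map(([category]) => category);
  return matched.length > 0 ? matched : ["general"];
}
```

Because a chunk can carry several categories, a paragraph about regulatory risk would be retrievable for both risk-focused and compliance-focused criteria.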

Key Components

1. Document Processing Pipeline

When you upload a PDF:

Parse: Extract text from PDF using pdf-parse
Detect Headings: Find document structure (any format, not 10-K specific)
Chunk: Split into pieces of roughly 1,500 tokens with heading-aware boundaries
Classify: Assign categories to each chunk using keyword patterns
Embed: Convert chunks to 768-dimension vectors using Gemini embeddings
Store: Save vectors + categories in Pinecone, full data in PostgreSQL
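
The Chunk step can be illustrated with a minimal sketch. It assumes headings have already been detected and that one token is roughly four characters, which is a rough approximation and not the tokenizer the app uses. All names here are hypothetical:

```typescript
interface Section {
  heading: string;
  body: string;
}

interface Chunk {
  heading: string;
  text: string;
}

// Rough assumption: about 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Splits each section on paragraph boundaries into chunks of roughly
// `maxTokens` tokens. A chunk never crosses a heading. A single paragraph
// longer than the limit is kept whole in this sketch.
function chunkSections(sections: Section[], maxTokens = 1500): Chunk[] {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const chunks: Chunk[] = [];

  for (const { heading, body } of sections) {
    let current = "";
    for (const paragraph of body.split(/\n{2,}/)) {
      const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
      if (candidate.length > maxChars && current) {
        // Adding this paragraph would exceed the limit: close the chunk.
        chunks.push({ heading, text: current });
        current = paragraph;
      } else {
        current = candidate;
      }
    }
    if (current.trim()) chunks.push({ heading, text: current });
  }
  return chunks;
}
```

Keeping the heading on every chunk means the classifier and the LLM both see which part of the report a passage came from.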

2. Two Databases (Why?)

| PostgreSQL | Pinecone |
| --- | --- |
| Stores structured data (users, documents, analysis results) | Stores vector embeddings |
| Good for complex queries & relationships | Optimized for fast similarity search |
| Source of truth for chunk text and categories | Finds semantically similar content |

Example: When searching, Pinecone finds chunks that are semantically similar to your query (even without exact keyword matches), restricted to the relevant categories. PostgreSQL stores the full text and metadata for those chunks.
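
The division of labor can be sketched as a two-step lookup: the vector store returns IDs and scores, and the relational rows supply the text. The shapes and the `hydrateMatches` name are assumptions for illustration, with a plain object standing in for a PostgreSQL query result:

```typescript
// Hypothetical shapes: a vector-search hit and a PostgreSQL chunk row.
interface VectorMatch {
  id: string;
  score: number;
}

interface ChunkRow {
  id: string;
  documentId: string;
  text: string;
  categories: string[];
}

// Attaches full chunk rows to vector hits, keeping the similarity ranking.
// Hits with no matching row (for example, a deleted document) are dropped.
function hydrateMatches(
  matches: VectorMatch[],
  rowsById: Record<string, ChunkRow>,
): Array<ChunkRow & { score: number }> {
  const hydrated: Array<ChunkRow & { score: number }> = [];
  for (const match of matches) {
    const row = rowsById[match.id];
    if (row) hydrated.push({ ...row, score: match.score });
  }
  return hydrated;
}
```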

3. Retrieval (RAG)

When analyzing a document:

Hybrid Search: Combines vector similarity + keyword matching
Category Filtering: Pre-filters chunks by relevant categories
LLM Analysis: Groq Llama 3.3 70B analyzes chunks against your criteria
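
A minimal sketch of hybrid scoring with category pre-filtering. It assumes cosine similarity for the vector part, simple term overlap for the keyword part, and a 0.7/0.3 weighting. The weights, field names, and function names are assumptions, not values taken from `rag.config.ts`:

```typescript
interface IndexedChunk {
  id: string;
  text: string;
  categories: string[];
  embedding: number[];
}

// Cosine similarity; assumes both vectors have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Fraction of query terms that appear in the chunk text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = terms.filter((term) => haystack.indexOf(term) !== -1).length;
  return hits / terms.length;
}

function hybridSearch(
  query: string,
  queryEmbedding: number[],
  chunks: IndexedChunk[],
  allowedCategories: string[],
  topK = 5,
  vectorWeight = 0.7, // assumed weighting
): Array<{ id: string; score: number }> {
  return chunks
    // Pre-filter: keep chunks tagged with at least one allowed category.
    .filter((c) => c.categories.some((cat) => allowedCategories.indexOf(cat) !== -1))
    .map((c) => ({
      id: c.id,
      score:
        vectorWeight * cosineSimilarity(queryEmbedding, c.embedding) +
        (1 - vectorWeight) * keywordScore(query, c.text),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Filtering before scoring keeps off-topic chunks out of the ranking entirely, so a high similarity score cannot pull in content from an unrelated category.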

4. Analysis Agent (LangGraph)

The analysis runs as a 3-step workflow:

Retrieve → Analyze → Synthesize

Retrieve: Get relevant chunks (filtered by category)
Analyze: LLM extracts insights per criterion
Synthesize: Combine into final verdict with confidence score
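
The three steps can be sketched as functions that pass a shared state object along, which is the general shape of a graph-style workflow. This sketch does not use the LangGraph API. The `Retriever` and `LLM` types are hypothetical stand-ins for the retrieval service and the Groq call, and the confidence formula (share of criteria backed by at least one chunk) is an assumption:

```typescript
interface Finding {
  criterion: string;
  insight: string;
  supported: boolean; // true when at least one chunk backed the insight
}

interface AnalysisState {
  criteria: string[];
  chunksByCriterion: Record<string, string[]>;
  findings: Finding[];
  verdict: string;
  confidence: number;
}

// Hypothetical stand-ins for the retrieval service and the LLM call.
type Retriever = (criterion: string) => Promise<string[]>;
type LLM = (prompt: string) => Promise<string>;

async function retrieveNode(state: AnalysisState, retriever: Retriever): Promise<AnalysisState> {
  const chunksByCriterion: Record<string, string[]> = {};
  for (const criterion of state.criteria) {
    chunksByCriterion[criterion] = await retriever(criterion);
  }
  return { ...state, chunksByCriterion };
}

async function analyzeNode(state: AnalysisState, llm: LLM): Promise<AnalysisState> {
  const findings: Finding[] = [];
  for (const criterion of state.criteria) {
    const chunks = state.chunksByCriterion[criterion] || [];
    const insight = chunks.length > 0
      ? await llm(`Assess "${criterion}" using:\n${chunks.join("\n---\n")}`)
      : "No relevant content found.";
    findings.push({ criterion, insight, supported: chunks.length > 0 });
  }
  return { ...state, findings };
}

// Assumed scoring: share of criteria backed by at least one chunk.
function synthesizeNode(state: AnalysisState): AnalysisState {
  const supported = state.findings.filter((f) => f.supported).length;
  const confidence = state.findings.length === 0 ? 0 : supported / state.findings.length;
  const verdict = state.findings.map((f) => `${f.criterion}: ${f.insight}`).join("\n");
  return { ...state, verdict, confidence };
}

async function runAnalysis(criteria: string[], retriever: Retriever, llm: LLM): Promise<AnalysisState> {
  const initial: AnalysisState = {
    criteria,
    chunksByCriterion: {},
    findings: [],
    verdict: "",
    confidence: 0,
  };
  const retrieved = await retrieveNode(initial, retriever);
  const analyzed = await analyzeNode(retrieved, llm);
  return synthesizeNode(analyzed);
}
```

Each node returns a new state instead of mutating its input, which makes the intermediate results easy to inspect in a tracing tool.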

Analysis Criteria

The system evaluates documents against these criteria:

| Criterion | Categories Used |
| --- | --- |
| Financial Health | `financial-performance` |
| Risk Assessment | `risk-factors`, `legal-regulatory` |
| Growth Potential | `strategy-outlook`, `business-operations` |
| Competitive Position | `business-operations`, `strategy-outlook` |
| Management Quality | `management-governance` |
| Regulatory Compliance | `legal-regulatory`, `risk-factors` |
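
The mapping could be expressed as a typed config object along these lines. The shape is an assumption about what `config/criteria.config.ts` might contain, not a copy of it, and `categoriesFor` is a hypothetical helper:

```typescript
type Category =
  | "financial-performance"
  | "risk-factors"
  | "business-operations"
  | "management-governance"
  | "legal-regulatory"
  | "strategy-outlook"
  | "general";

// Criterion-to-category mapping, transcribed from the table in this section.
const CRITERIA_CATEGORIES: Record<string, Category[]> = {
  "Financial Health": ["financial-performance"],
  "Risk Assessment": ["risk-factors", "legal-regulatory"],
  "Growth Potential": ["strategy-outlook", "business-operations"],
  "Competitive Position": ["business-operations", "strategy-outlook"],
  "Management Quality": ["management-governance"],
  "Regulatory Compliance": ["legal-regulatory", "risk-factors"],
};

// Returns the de-duplicated categories to filter on for the selected
// criteria, in first-seen order. Unknown criteria contribute nothing.
function categoriesFor(criteria: string[]): Category[] {
  const result: Category[] = [];
  for (const criterion of criteria) {
    for (const category of CRITERIA_CATEGORIES[criterion] || []) {
      if (result.indexOf(category) === -1) result.push(category);
    }
  }
  return result;
}
```

Selecting both Risk Assessment and Regulatory Compliance yields the same two categories once, so the retrieval filter does not grow with overlapping criteria.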

Project Structure

app/                    # Next.js pages & API routes
├── (auth)/            # Sign in/up pages (Clerk)
├── (dashboard)/       # Protected pages (dashboard, documents, analysis)
└── api/               # Backend endpoints

lib/
├── agents/            # LangGraph analysis workflow
│   └── nodes/         # Retrieve, analyze, synthesize nodes
├── db/                # Database schema (Drizzle ORM)
├── rag/
│   ├── chunking/      # Heading-aware document splitting
│   ├── embeddings/    # Gemini embedding generation
│   ├── metadata/      # Category classifier (keyword-based)
│   └── retrieval/     # Hybrid search with category filtering
├── parsers/           # PDF parsing, heading detection
├── services/          # Document processor, retrieval service
└── vectorstore/       # Pinecone operations

components/            # React UI components
config/
├── criteria.config.ts # Analysis criteria with category mappings
└── rag.config.ts      # Chunking, embedding, retrieval settings

Cost Per Analysis

| Component | Cost |
| --- | --- |
| Embeddings (Gemini, one-time per doc) | FREE ✨ |
| Analysis (Groq Llama 3.3 70B) | FREE ✨ |
| Total | $0.00 🎉 |

Free as long as usage stays within the Groq and Google AI free-tier limits (thousands of requests per day).

Available Scripts

| Command | Description |
| --- | --- |
| `npm run dev` | Start development server |
| `npm run build` | Build for production |
| `npm run db:push` | Push schema to database |
| `npm run db:studio` | Open Drizzle Studio (DB viewer) |
| `npm run init:pinecone` | Create Pinecone index |

Tech Stack

Frontend: Next.js 15, React 19, TailwindCSS, shadcn/ui
Auth: Clerk
Database: PostgreSQL (Drizzle ORM)
Vector DB: Pinecone
AI: LangChain, LangGraph, Groq (Llama 3.3 70B), Google Gemini (embeddings)
Deployment: Vercel

Troubleshooting

Dependencies won't install?

npm install --legacy-peer-deps

Database connection error?

Check that POSTGRES_URL is correct
Ensure your IP is allowlisted if you use an external database

Document stuck processing?

Check the terminal logs for errors
Verify that your API keys are valid and still have quota

Analysis fails?

Ensure the document has finished processing first
Check the Groq and Google AI rate limits

License

MIT
