Asaad47/pdf_search

PDF Semantic Search Tool

A command-line tool for semantic search over PDF documents. It finds relevant content across your PDF files from natural language queries, using HuggingFace embeddings and a Chroma vector database.

Features

  • 🔍 Semantic search across PDF documents
  • 📄 Interactive slide viewer for search results
  • 🎯 Configurable search parameters
  • 📊 Rich terminal interface with markdown support
  • 🔄 Persistent vector database for fast searches
  • 📱 PDF preview integration (macOS)

Screenshot

[Screenshot: the tool running in interactive mode with the search query "What is a k-mer?"]

Installation

  1. Clone the repository:
git clone https://github.com/Asaad47/pdf_search.git
cd pdf_search
  2. Install the required dependencies:
pip install -r requirements.txt

Configuration

The tool uses a config.yaml file for configuration. Here's the default structure:

# Can be specific files or glob patterns like "path/to/dir/*.pdf"
pdf_paths:
  - "/path/to/pdf1.pdf"
  - "/path/to/pdf2.pdf"
  - "/path/to/pdf3.pdf"

chroma_dir: "./chroma_db"
default_query: "What is Machine Learning?"
default_k_results: 5

Replace the entries under pdf_paths with your own PDF files or glob patterns.
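
The config comment mentions glob patterns. A minimal sketch of how such entries could be expanded into concrete file paths, using only the standard library. `expand_pdf_paths` is a hypothetical helper, not a function from this repository, and the actual scripts may resolve paths differently:

```python
import glob
import os


def expand_pdf_paths(patterns):
    """Expand literal paths and glob patterns into a sorted, de-duplicated list of PDFs.

    Hypothetical helper for illustration; not part of the repository.
    """
    found = set()
    for pattern in patterns:
        # glob.glob returns the path itself for a literal path that exists,
        # and an empty list for anything that matches nothing.
        for path in glob.glob(os.path.expanduser(pattern)):
            if path.lower().endswith(".pdf") and os.path.isfile(path):
                found.add(os.path.abspath(path))
    return sorted(found)
```

With this approach a missing file is skipped silently, so a real script would probably want to warn when a configured entry matches nothing.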

Usage

1. Creating the Search Database

Before searching, you need to create the vector database from your PDF files:

python create_db.py

2. Searching Documents

Basic search:

python search.py "your search query" -i

Advanced options:

python search.py "your search query" -k 10 -i

Command-line arguments:

  • -k: Number of results to return (default: 5)
  • -v: Verbose output (only applies in non-interactive mode)
  • -i: Interactive mode with slide viewer
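
The three flags above map naturally onto `argparse`. The sketch below is an assumption about how such a parser could look, not the code in search.py. In particular, making the query optional and falling back to `default_query` from config.yaml is a guess, and `build_parser` is a hypothetical name:

```python
import argparse


def build_parser(default_query="What is Machine Learning?", default_k=5):
    """Build a parser for the documented flags (hypothetical; defaults mirror config.yaml)."""
    parser = argparse.ArgumentParser(description="Semantic search over indexed PDFs")
    # Assumption: the query is optional and falls back to the configured default.
    parser.add_argument("query", nargs="?", default=default_query,
                        help="natural language search query")
    parser.add_argument("-k", type=int, default=default_k,
                        help="number of results to return")
    parser.add_argument("-v", action="store_true",
                        help="verbose output (non-interactive mode only)")
    parser.add_argument("-i", action="store_true",
                        help="interactive mode with slide viewer")
    return parser


# Equivalent to: python search.py "neural networks" -k 10 -i
args = build_parser().parse_args(["neural networks", "-k", "10", "-i"])
```

Here `args.k` is the integer 10 and `args.i` is true, while `args.v` stays false because the flag was not passed.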

Interactive Mode

When using interactive mode (-i flag), you can:

  • Press n to view next result
  • Press p to view previous result
  • Press o to open the PDF in Preview (macOS)
  • Press q to quit
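
The key bindings above amount to a small state machine over the result index. A minimal sketch, assuming navigation stops at the first and last result rather than wrapping around. Both function names are hypothetical, and the repository may implement this differently:

```python
import sys


def next_index(key, index, total):
    """Return the new result index for a key press, or None to quit."""
    if key == "q":
        return None
    if key == "n":
        return min(index + 1, total - 1)  # assumption: stop at the last result
    if key == "p":
        return max(index - 1, 0)  # assumption: stop at the first result
    return index  # any other key leaves the position unchanged


def preview_command(pdf_path, platform=sys.platform):
    """Build the command that opens a PDF in Preview, or None when not on macOS."""
    if platform != "darwin":
        return None
    return ["open", "-a", "Preview", pdf_path]
```

The list returned by `preview_command` can be passed to `subprocess.run`. Keeping command construction separate from execution makes the macOS-only behaviour easy to test on any platform.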

Example Usage

  1. Create the database:
python create_db.py
  2. Search with default settings:
python search.py "machine learning applications" -i
  3. Search with custom settings:
python search.py "neural networks" -k 10 -i

Project Structure

  • search.py: Main search script
  • create_db.py: Database creation script
  • config.yaml: Configuration file
  • requirements.txt: Python dependencies
  • chroma_db/: Vector database directory
  • search.sh: Shell script wrapper

Notes

  • The tool uses the all-MiniLM-L6-v2 model from HuggingFace for embeddings
  • PDF preview functionality is currently limited to macOS
  • The vector database is persistent and stored in the chroma_db directory
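
To illustrate what embedding-based search does conceptually, the sketch below ranks pages by cosine similarity to a query vector. It uses toy 3-dimensional vectors in place of the 384-dimensional embeddings that all-MiniLM-L6-v2 produces, and plain Python in place of Chroma, whose distance metric and indexing may differ. `cosine_similarity` and `top_k` are hypothetical names, not functions from this repository:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, page_vecs, k=5):
    """Return the k (page_id, score) pairs most similar to the query, best first."""
    scored = [(page_id, cosine_similarity(query_vec, vec))
              for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for page embeddings (made-up values, for illustration only).
pages = {"p1": [1.0, 0.0, 0.0], "p2": [0.0, 1.0, 0.0], "p3": [0.9, 0.1, 0.0]}
best = top_k([1.0, 0.0, 0.0], pages, k=2)
```

In this toy example `best` lists p1 first and p3 second, because their vectors point in nearly the same direction as the query. The `k` argument plays the same role as the -k flag.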

LLM Usage

This project was developed with the help of ChatGPT 4o and Cursor IDE in agentic mode.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.

Note: This project uses PyMuPDF which is also licensed under AGPL-3.0. Any modifications to this code must be open source and distributed under the same license.
