PDF Semantic Search Tool

A command-line tool for semantic search over PDF documents. It finds relevant content across your PDF files from natural language queries, using HuggingFace embeddings and a Chroma vector database.

Features

🔍 Semantic search across PDF documents
📄 Interactive slide viewer for search results
🎯 Configurable search parameters
📊 Rich terminal interface with markdown support
🔄 Persistent vector database for fast searches
📱 PDF preview integration (macOS)

Screenshot

[Screenshot: the tool running in interactive mode with the search query "What is a k-mer?"]

Installation

Clone the repository:

git clone https://github.com/Asaad47/pdf_search.git
cd pdf_search

Install the required dependencies:

pip install -r requirements.txt

Configuration

The tool uses a config.yaml file for configuration. Here's the default structure:

# Can be specific files or glob patterns like "path/to/dir/*.pdf"
pdf_paths:
  - "/path/to/pdf1.pdf"
  - "/path/to/pdf2.pdf"
  - "/path/to/pdf3.pdf"

chroma_dir: "./chroma_db"
default_query: "What is Machine Learning?"
default_k_results: 5

Edit pdf_paths to point to your own PDF files. Each entry can be a single file or a glob pattern.
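
As a rough illustration of how the pdf_paths entries could be resolved, the sketch below expands a mix of literal paths and glob patterns with Python's standard glob module. The function name expand_pdf_paths is hypothetical, and the sketch assumes config.yaml has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load). It is not taken from create_db.py and may differ from what that script does.

```python
import glob
import os


def expand_pdf_paths(patterns):
    """Expand literal paths and glob patterns into a sorted list of PDF files.

    Hypothetical helper for illustration; not taken from create_db.py.
    """
    found = set()
    for pattern in patterns:
        # glob.glob returns an existing literal path as a one-item list,
        # and an empty list if nothing matches.
        for path in glob.glob(os.path.expanduser(pattern)):
            if path.lower().endswith(".pdf") and os.path.isfile(path):
                found.add(os.path.abspath(path))
    return sorted(found)


# Assumed shape of the parsed config (same keys as the config.yaml example).
config = {
    "pdf_paths": ["/path/to/pdf1.pdf", "path/to/dir/*.pdf"],
    "chroma_dir": "./chroma_db",
    "default_k_results": 5,
}
pdf_files = expand_pdf_paths(config["pdf_paths"])
print(f"Found {len(pdf_files)} PDF file(s)")
```

Collecting paths in a set means a file matched by both a literal entry and a glob pattern is only listed once.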

Usage

1. Creating the Search Database

Before searching, you need to create the vector database from your PDF files:

python create_db.py

2. Searching Documents

Basic search:

python search.py "your search query" -i

Advanced options:

python search.py "your search query" -k 10 -i

Command-line arguments:

-k: Number of results to return (default: 5)
-v: Verbose output (non-interactive mode only)
-i: Interactive mode with slide viewer
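
For readers who want to see how the flags listed here fit together, the following is a minimal argparse sketch of the documented interface. It is an assumption about how such a parser could look, not the code in search.py. In particular, it treats the query as a required positional argument and hard-codes the default of 5 instead of reading default_k_results or default_query from config.yaml.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch, not search.py itself)."""
    parser = argparse.ArgumentParser(description="Semantic search over PDF documents")
    parser.add_argument("query", help="natural language search query")
    parser.add_argument("-k", type=int, default=5, help="number of results to return")
    parser.add_argument("-v", action="store_true", help="verbose output (non-interactive only)")
    parser.add_argument("-i", action="store_true", help="interactive mode with slide viewer")
    return parser


# Parse the same arguments as: python search.py "neural networks" -k 10 -i
args = build_parser().parse_args(["neural networks", "-k", "10", "-i"])
print(args.query, args.k, args.i, args.v)
```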

Interactive Mode

When using interactive mode (-i flag), you can:

Press n to view next result
Press p to view previous result
Press o to open the PDF in Preview (macOS)
Press q to quit
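
The key bindings above can be sketched as a small state update over the current result index. The helper names next_index and preview_command are hypothetical, and clamping at the first and last result (no wrap-around) is an assumption, since the behaviour at the ends of the list is not documented. The macOS open -a Preview command is a standard way to open a file in Preview, but whether search.py uses it is likewise an assumption.

```python
import sys


def next_index(key, index, total):
    """Return the result index after a keypress, clamped to [0, total - 1]."""
    if key == "n":
        return min(index + 1, total - 1)
    if key == "p":
        return max(index - 1, 0)
    return index  # any other key leaves the position unchanged


def preview_command(pdf_path):
    """Command that would open a PDF in Preview on macOS, or None elsewhere."""
    if sys.platform != "darwin":
        return None
    return ["open", "-a", "Preview", pdf_path]


# Simulate a session over 3 results: next, next, next (clamped), previous.
index = 0
for key in "nnnp":
    index = next_index(key, index, total=3)
print(index)  # prints 1
```

Keeping the navigation logic in a pure function makes it easy to test separately from the terminal input loop.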

Example Usage

Create the database:

python create_db.py

Search with default settings:

python search.py "machine learning applications" -i

Search with custom settings:

python search.py "neural networks" -k 10 -i

Project Structure

search.py: Main search script
create_db.py: Database creation script
config.yaml: Configuration file
requirements.txt: Python dependencies
chroma_db/: Vector database directory
search.sh: Shell script wrapper

Notes

The tool uses the all-MiniLM-L6-v2 model from HuggingFace for embeddings
PDF preview functionality is currently limited to macOS
The vector database is persistent and stored in the chroma_db directory
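
To give an intuition for what embedding-based search does, the sketch below ranks a few hand-made vectors by cosine similarity to a query vector and returns the top k. It is illustrative only: in the tool, the embeddings come from all-MiniLM-L6-v2 and the nearest-neighbour lookup is handled by Chroma, whose distance metric depends on how the collection is configured. The vectors, document names, and function names here are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
docs = {
    "slide_on_kmers": [0.9, 0.1, 0.0],
    "slide_on_alignment": [0.2, 0.8, 0.1],
    "slide_on_logistics": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, docs, k=2))
```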

LLM Usage

This project was developed with the help of ChatGPT 4o and Cursor IDE in Agentic mode.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.

Note: This project uses PyMuPDF, which is also licensed under AGPL-3.0. Any modifications to this code must be open source and distributed under the same license.
