A command-line tool for semantic search across PDF documents. It finds relevant content in your PDF files from natural language queries, using HuggingFace embeddings and the Chroma vector database.
- 🔍 Semantic search across PDF documents
- 📄 Interactive slide viewer for search results
- 🎯 Configurable search parameters
- 📊 Rich terminal interface with markdown support
- 🔄 Persistent vector database for fast searches
- 📱 PDF preview integration (macOS)
Running the tool in interactive mode with a search query "What is a k-mer?":
- Clone the repository:
git clone https://github.com/Asaad47/pdf_search.git
cd pdf_search
- Install the required dependencies:
pip install -r requirements.txt

The tool uses a config.yaml file for configuration. Here's the default structure:
# Can be specific files or glob patterns like "path/to/dir/*.pdf"
pdf_paths:
- "/path/to/pdf1.pdf"
- "/path/to/pdf2.pdf"
- "/path/to/pdf3.pdf"
chroma_dir: "./chroma_db"
default_query: "What is Machine Learning?"
default_k_results: 5

You can change pdf_paths to point to your own PDF files.
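Because pdf_paths may mix explicit files and glob patterns, the entries have to be expanded into concrete file paths before indexing. The sketch below shows one way this could be done with the standard library. `expand_pdf_paths` is a hypothetical helper, not a function from this repository, and create_db.py may handle patterns differently.

```python
import glob
import os


def expand_pdf_paths(patterns):
    """Expand files and glob patterns into a sorted, de-duplicated list of PDF paths.

    Hypothetical helper for illustration; not part of this repository.
    """
    found = set()
    for pattern in patterns:
        # glob.glob returns the path itself for a plain existing file
        # and an empty list when nothing matches.
        for path in glob.glob(pattern):
            if os.path.isfile(path) and path.lower().endswith(".pdf"):
                found.add(os.path.abspath(path))
    return sorted(found)


if __name__ == "__main__":
    # Same shape as the pdf_paths list in config.yaml.
    for path in expand_pdf_paths(["/path/to/pdf1.pdf", "path/to/dir/*.pdf"]):
        print(path)
```

Entries that match nothing are skipped silently here; a stricter version might warn about them instead.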
Before searching, you need to create the vector database from your PDF files:
python create_db.py

Basic search:
python search.py "your search query" -i

Advanced options:
python search.py "your search query" -k 10 -i

Command-line arguments:
- `-k`: Number of results to return (default: 5)
- `-v`: Verbose output mode (non-interactive mode only)
- `-i`: Interactive mode with slide viewer
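The flags above could be parsed with argparse along these lines. This is a sketch under assumptions, not the actual search.py: `build_parser` is a hypothetical name, the attribute names `verbose` and `interactive` are illustrative, and treating the query as optional with a fallback to default_query is a guess based on the config file.

```python
import argparse


def build_parser():
    """Build a parser for the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Semantic search over PDF files")
    # Assumption: the query is optional and falls back to default_query.
    parser.add_argument("query", nargs="?", default="What is Machine Learning?",
                        help="natural language search query")
    parser.add_argument("-k", type=int, default=5,
                        help="number of results to return")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose output (non-interactive mode only)")
    parser.add_argument("-i", dest="interactive", action="store_true",
                        help="interactive mode with slide viewer")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["neural networks", "-k", "10", "-i"])
    print(args.query, args.k, args.interactive, args.verbose)
```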
When using interactive mode (-i flag), you can:
- Press `n` to view the next result
- Press `p` to view the previous result
- Press `o` to open the PDF in Preview (macOS)
- Press `q` to quit
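These key bindings amount to a small state machine over the index of the current result. A minimal sketch, assuming navigation stops at the first and last result rather than wrapping around (the text does not say which). `handle_key` and `preview_command` are hypothetical names, and the real viewer may be structured differently.

```python
def handle_key(key, index, total):
    """Return (new_index, keep_running) for a key press in the viewer."""
    if key == "n":
        # Assumption: stop at the last result instead of wrapping.
        return min(index + 1, max(total - 1, 0)), True
    if key == "p":
        return max(index - 1, 0), True
    if key == "q":
        return index, False
    # Any other key, including "o", leaves the position unchanged.
    return index, True


def preview_command(pdf_path):
    """Command that opens a file in Preview on macOS via the `open` utility."""
    return ["open", "-a", "Preview", pdf_path]
```

Keeping the key handling in a pure function like this makes it easy to test without a terminal; the `o` key would pass the result of `preview_command` to something like `subprocess.run`.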
- Create the database:
python create_db.py
- Search with default settings:
python search.py "machine learning applications" -i
- Search with custom settings:
python search.py "neural networks" -k 10 -i

Project structure:
- `search.py`: Main search script
- `create_db.py`: Database creation script
- `config.yaml`: Configuration file
- `requirements.txt`: Python dependencies
- `chroma_db/`: Vector database directory
- `search.sh`: Shell script wrapper
- The tool uses the `all-MiniLM-L6-v2` model from HuggingFace for embeddings
- PDF preview functionality is currently limited to macOS
- The vector database is persistent and stored in the `chroma_db` directory
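For intuition about what happens at query time: the query is embedded into the same vector space as the PDF content, and the closest entries are returned. The sketch below ranks toy 3-dimensional vectors by cosine similarity in plain Python. It is illustrative only: the all-MiniLM-L6-v2 model produces 384-dimensional embeddings, Chroma performs the actual lookup, and the distance metric this tool configures is not stated here. `top_k` and the sample data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Made-up vectors standing in for real embeddings.
    chunks = [
        ("k-mers are substrings of length k", [0.9, 0.1, 0.0]),
        ("gradient descent updates weights", [0.0, 0.2, 0.9]),
        ("sequence assembly uses k-mer graphs", [0.8, 0.3, 0.1]),
    ]
    for text, score in top_k([1.0, 0.0, 0.0], chunks, k=2):
        print(f"{score:.3f}  {text}")
```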
This project was developed with the help of ChatGPT 4o and the Cursor IDE in agentic mode.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
Note: This project uses PyMuPDF which is also licensed under AGPL-3.0. Any modifications to this code must be open source and distributed under the same license.
