spagnoloe/pdfmemrag

🧠 PDF Embedding RAG API

This project implements a Retrieval-Augmented Generation (RAG) system that creates vector embeddings for each page of uploaded PDF documents. You can interact with the API to embed custom queries, search the indexed content, and generate contextual responses based on retrieved sources.

As implemented today, the LLM backing the system is an open-source transformer from HuggingFace, run locally. This means the responses are slow (we are running the model on CPU because my laptop does not have a GPU 🥲) and not very elaborate, because the model is very small.

We also include the code to use either a Cohere or an OpenAI model, but those require a valid API key.
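The backend choice could be driven by which API key is set in the environment. The function name and the fallback order below are assumptions for illustration, not the repo's actual code:

```python
import os

def choose_backend() -> str:
    """Pick an LLM backend based on which API keys are available.

    Hypothetical sketch: the real app's selection logic may differ.
    """
    if os.environ.get("COHERE_API_KEY"):
        return "cohere"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    # No key set: fall back to the local HuggingFace model.
    return "huggingface-local"

# Make the demo deterministic regardless of the host environment.
os.environ.pop("COHERE_API_KEY", None)
os.environ.pop("OPENAI_API_KEY", None)
print(choose_backend())  # huggingface-local
```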


🚀 Getting Started

Follow the steps below to set up and run the application:

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Install required dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

# The repo includes a pre-commit hook, so it is recommended to install it
pre-commit install

# Create a cache directory
mkdir -p cache

# Add Cohere key as environment variable
export COHERE_API_KEY=<YOUR_COHERE_API_KEY>

# Run the app
python app.py

NOTE: On your first run the tool will take some time to embed all the PDF files inside the documents/ directory. The embeddings are stored as JSON inside cache/, which makes subsequent launches of the API much faster. You can also force a refresh of all the embeddings by launching the API with:

# Re-embeds all documents
python app.py --full-refresh
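The caching behaviour can be sketched as follows. This is illustrative only: the one-JSON-file-per-page-hash layout and the `embed_page` stand-in are assumptions, not the repo's actual cache format or embedder:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def embed_page(text: str) -> list[float]:
    """Stand-in embedder (the real app uses a local HuggingFace model)."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def cached_embedding(cache_dir: Path, text: str, full_refresh: bool = False) -> list[float]:
    """Return the embedding for one page, reusing a JSON cache file when present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists() and not full_refresh:  # cache hit: skip re-embedding
        return json.loads(path.read_text())
    vec = embed_page(text)                  # cache miss (or --full-refresh)
    path.write_text(json.dumps(vec))
    return vec

demo_dir = Path(tempfile.mkdtemp())
first = cached_embedding(demo_dir, "page one text")
second = cached_embedding(demo_dir, "page one text")  # served from the cache
```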

📡 API Endpoints

🔢 Get Corpus Size

Returns the number of embedded pages in the current corpus.

curl http://localhost:8000/corpus_size

🔍 Search Corpus

Performs a similarity search across the embedded PDF pages.

curl \
    -X POST http://localhost:8000/search \
    -H 'Content-Type: application/json' \
    -d '{"question": "What is a goon?", "max_results": 3}'
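The core of a similarity search like this is ranking page embeddings by cosine similarity to the query embedding. The sketch below shows that idea with NumPy; the real app additionally indexes the embeddings with MiniBatchKMeans (see Next Steps):

```python
import numpy as np

def search(query_vec, page_vecs, max_results=3):
    """Rank embedded pages by cosine similarity to the query.

    Returns (page_index, similarity) pairs, best match first.
    """
    page_vecs = np.asarray(page_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Cosine similarity: dot product over the product of norms.
    sims = page_vecs @ q / (np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:max_results]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]
```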

🗣️ Generate Response

curl \
    -X POST http://localhost:8000/response \
    -H 'Content-Type: application/json' \
    -d '{"question":"What is a goon?", "max_sources": 3}'

Example Response (with Cohere LLM):

{"response":"A \"goon\" is a term used to describe certain job roles that are considered to have no social value and a negative impact on society. These include telemarketers, corporate lawyers, bank lobbyists, and marketing professionals. Goons are often hired to perform manipulative and aggressive tasks, such as making deceptive advertisements or engaging in deceptive public relations practices.","sources":[{"page":54,"pdf":"Bullshit-Jobs-A-Theory-David-Graeber.pdf"},{"page":78,"pdf":"Bullshit-Jobs-A-Theory-David-Graeber.pdf"},{"page":53,"pdf":"Bullshit-Jobs-A-Theory-David-Graeber.pdf"}]}
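Behind an endpoint like this, the retrieved pages are typically stitched into a grounded prompt for the LLM. The template and the "text" field on each source below are assumptions about what the service might pass to the model, not the repo's actual prompt:

```python
def build_prompt(question: str, sources: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved page snippets.

    Hypothetical sketch: the real prompt template will differ.
    """
    # One labelled snippet per retrieved source, e.g. "[file.pdf p.54] ...".
    context = "\n\n".join(
        f'[{s["pdf"]} p.{s["page"]}] {s["text"]}' for s in sources
    )
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```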

🧪 Running Tests

At the moment, only the functions related to embeddings include unit tests, built with pytest.

You can run them as follows (assuming you have already set up the environment):

# Run all unit tests
pytest
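A unit test for the embedding helpers might look like the sketch below. `embed_page` is a stand-in defined inline, since the repo's real function names and module paths will differ:

```python
# test_embeddings.py -- pytest-style unit tests (pytest collects any
# top-level function whose name starts with "test_").

def embed_page(text: str) -> list[float]:
    """Dummy embedder standing in for the real embedding function."""
    return [float(len(text)), 0.0]

def test_embed_page_returns_floats():
    vec = embed_page("hello")
    assert isinstance(vec, list)
    assert all(isinstance(x, float) for x in vec)

def test_embed_page_is_deterministic():
    assert embed_page("same input") == embed_page("same input")
```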

📁 Notes

  • Each PDF page is embedded individually.
  • Embeddings are cached for efficiency (unless --full-refresh is used).

🔭 Next Steps

  1. [Must have]: Add unit tests for the other modules and the API endpoints.
  2. [Must have]: Add try-except error catching in all endpoints and services.
  3. [Must have]: Make it work on a computer with better specs, so that we can use a GPU, for example, which would speed up the LLM responses significantly.
  4. [Should have]: Experiment with different ways to chunk the documents. We currently split by page, which is certainly very naive.
  5. [Should have]: Dockerise the application and add some sort of authentication.
  6. [Should have]: Add a UI, with a functionality to upload documents.
  7. [Nice to have]: Improve the search functionality by indexing the embeddings more efficiently than with MiniBatchKMeans. Consider using nearest-neighbour indexing. At some point, this would require migrating from NumPy to TensorFlow, which might overlap with (2).
