This project demonstrates a document retrieval and vectorization workflow based on Haystack, FlagEmbedding, HNSWLib, and BM25.
- `RAG_workflow/parse.py`: Parses and splits PDF documents and writes the content to a JSON file.
- `RAG_workflow/embedding.py`: Vectorizes the document content and saves the embedding vectors to a file.
- `RAG_workflow/HNSW_retrieve.py`: Performs hybrid retrieval and recall using HNSW and BM25.
Install dependencies with:
pip install -r requirements.txt
- PDF Parsing
  - In `parse.py`, set `PATH_TO_YOUR_PDF_DIRECTORY` to your PDF folder path.
  - Update the output JSON path (`PATH_TO_YOUR_JSON`).
  - Run: `python PDF-RAG/parse.py`
  - The generated JSON file is used in the next step, embedding.
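To make the parsing output more concrete, here is a minimal sketch of splitting extracted text into overlapping chunks and serializing them as JSON. It is not the code in `parse.py`: the function names, the character-based splitting, and the record fields (`source`, `chunk_id`, `content`) are all assumptions chosen for illustration, and the actual script may split differently (for example by sentence or by page).

```python
import json


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (hypothetical strategy)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(doc_name, text, chunk_size=200, overlap=50):
    """Wrap each chunk in a dict; the field names are illustrative only."""
    chunks = split_into_chunks(text, chunk_size, overlap)
    return [
        {"source": doc_name, "chunk_id": i, "content": chunk}
        for i, chunk in enumerate(chunks)
    ]


def records_to_json(records):
    """Serialize records; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.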
- JSON Vectorization
  - In `embedding.py`, update the input JSON path (`PATH_TO_YOUR_JSON.json`) and the output embedding file path (`PATH_TO_YOUR_EMBEDDING.npy`).
  - Run: `python PDF-RAG/embedding.py`
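The vectorization step can be pictured as "read records, embed their text, save a matrix". The sketch below shows that shape with NumPy only. `toy_embed` is a deterministic stand-in, not a real embedding model; in the actual workflow a FlagEmbedding model would produce the vectors. The record field `content` and all function names are assumptions for illustration.

```python
import hashlib
import os
import tempfile

import numpy as np


def toy_embed(texts, dim=8):
    """Stand-in embedder: deterministic pseudo-random unit vectors per text.

    This only mimics the output shape of a real model; the vectors carry
    no semantic meaning.
    """
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "big")
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(dim)
        vectors.append(v / np.linalg.norm(v))  # unit length
    return np.asarray(vectors, dtype=np.float32)


def embed_records(records, embed_fn=toy_embed):
    """Embed the 'content' field of each record (field name is hypothetical)."""
    texts = [record["content"] for record in records]
    return embed_fn(texts)


def save_embeddings(embeddings, path):
    """Write the (n_chunks, dim) matrix to a .npy file."""
    np.save(path, embeddings)


def load_embeddings(path):
    """Read the matrix back for the retrieval step."""
    return np.load(path)
```

Row `i` of the saved matrix corresponds to record `i` in the JSON file, so the two files need to be regenerated together whenever the parsing step changes.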
- Retrieval and Recall
  - In `HNSW_retrieve.py`, update the embedding and JSON paths (`PATH_TO_YOUR_JSON.json`).
  - Run: `python PDF-RAG/HNSW_retrieve.py`
  - The script prints the index construction and retrieval times for both HNSW and BM25, along with the merged retrieval results.
  - Adjust the HNSW and BM25 parameters according to the descriptions to get the desired results.
  - In `hnswlib.Index()`, use `space='l2'` for squared L2, `'ip'` for inner product, and `'cosine'` for cosine similarity.
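To clarify what the three `space` options measure, the following pure-Python sketch writes out the distance formulas as described in the hnswlib documentation: squared L2, one minus the inner product, and one minus the cosine similarity. The function names are illustrative and are not part of hnswlib; smaller values mean "closer" in all three cases.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance, as used by space='l2'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """Inner-product distance, 1 - <a, b>, as used by space='ip'."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(a, b), as used by space='cosine'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm
```

If the embeddings are normalized to unit length, all three options rank neighbors identically, because squared L2 then equals `2 - 2 * dot`. With unnormalized vectors, `'ip'` favors long vectors, which is usually not what is wanted for text similarity.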
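The retrieval step reports merged results from HNSW and BM25. How `HNSW_retrieve.py` merges them is not described here, so the sketch below shows one common option, reciprocal rank fusion (RRF), purely as an assumption. The function name and the constant `k=60` are illustrative choices; inputs are ranked lists of chunk ids, best first.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked id lists into one list of (doc_id, score) pairs.

    Each document scores the sum over rankings of 1 / (k + rank), with
    ranks starting at 1. Documents missing from a ranking add nothing.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


# Example: ids returned by a vector search and by a keyword search.
hnsw_ids = [3, 1, 2]
bm25_ids = [1, 4, 3]
merged = reciprocal_rank_fusion([hnsw_ids, bm25_ids])
```

RRF uses only rank positions, so it avoids having to put HNSW distances and BM25 scores on a common scale. A weighted sum of normalized scores is an alternative when one retriever should count for more than the other.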
- numpy
- scikit-learn
- hnswlib
- rank_bm25
- FlagEmbedding
- haystack
- haystack-integrations
Feel free to open an issue if you encounter any problems!