This project demonstrates a simple RAG pipeline using Python and SpaCy. It retrieves relevant documents from a local dataset based on a question, computes concept-based vector similarities, and generates a context prompt for answering questions.
- Tokenizes documents and questions using SpaCy.
- Filters tokens by part-of-speech (POS).
- Computes concept frequency vectors per document.
- Computes cosine similarity between a question and documents.
- Returns top relevant documents to construct a context prompt.
- Calls an LLM to generate answers based on the retrieved context.
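The retrieval steps above can be sketched in plain Python. Here, pre-tokenized word lists stand in for spaCy's POS-filtered tokens, and names like concept_vector are illustrative, not taken from mini_rag.py:

```python
import math
from collections import Counter

def concept_vector(tokens, vocabulary):
    """Raw frequency vector over a shared vocabulary (one slot per concept)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors; 0.0 if either is empty."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Pre-tokenized, lowercased documents (spaCy + POS filtering would produce these).
docs = {
    "doc1": ["cat", "sit", "mat", "cat"],
    "doc2": ["dog", "run", "park"],
}
question = ["cat", "mat"]

vocabulary = sorted(set(question).union(*docs.values()))
q_vec = concept_vector(question, vocabulary)
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(q_vec, concept_vector(docs[d], vocabulary)),
    reverse=True,
)
print(ranked)  # most similar document first
```

In the real script, the top-ranked documents above the similarity threshold would then be concatenated into the context prompt.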
- Python 3.9+
- SpaCy
- spaCy model en_core_web_md
- OpenAI Python client (or equivalent for LLM access)
Install dependencies with:
pip install spacy openai
python -m spacy download en_core_web_md
├── datas/ # Folder containing your text documents
├── mini_rag.py # Main script
├── LICENSE.md # License
├── same-but-with-langchain.py # mini_rag.py but with Langchain
└── README.md

- Place your documents (plain text) in the datas/ folder.
- Update the question in the script, or modify the script to take user input.
- Ensure your LLM API client (client) is properly initialized.
- Run the script:
python mini_rag.py

- The script prints the prompt, including the top relevant documents and the question.
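The LLM call might look like the following sketch, assuming a chat-completion style client such as openai.OpenAI(); the call_llm signature and model name here are guesses, not the project's actual code:

```python
def call_llm(client, prompt, max_tokens=256):
    """Send the RAG prompt to a chat-completion client and return the answer text.

    `client` is expected to be an initialized OpenAI-compatible client,
    e.g. openai.OpenAI(), which reads OPENAI_API_KEY from the environment.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

If the client is not initialized (e.g. the API key is missing), this call raises before any request is sent, which is why the usage steps above ask you to check it first.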
- DATA_PATH: Path to your documents folder.
- POS_FILTER: List of token types to include (default: nouns, verbs, adjectives, etc.).
- THRESHOLD_SIMILARITY: Minimum similarity score for a document to be included.
- SPACY_MODEL: spaCy model used for tokenization and embeddings.
- TOP_K: Maximum number of documents to include in the prompt (default: 3).
- MAX_TOKENS: Maximum number of tokens for LLM output (default set in the call_llm function).
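As a sketch, the configuration block at the top of mini_rag.py might look like this. All values are illustrative except TOP_K, whose default of 3 the list above states:

```python
# Illustrative configuration; only TOP_K's default (3) is stated in this README.
DATA_PATH = "datas/"                   # folder containing plain-text documents
POS_FILTER = ["NOUN", "VERB", "ADJ"]   # hypothetical spaCy coarse POS tags to keep
THRESHOLD_SIMILARITY = 0.3             # hypothetical minimum cosine similarity
SPACY_MODEL = "en_core_web_md"         # model for tokenization and embeddings
TOP_K = 3                              # max documents included in the prompt
MAX_TOKENS = 256                       # hypothetical cap on LLM output tokens
```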
- Tokenization is case-insensitive.
- Currently uses raw frequency vectors; consider TF-IDF for better weighting.
- Only the top TOP_K most similar documents are included in the prompt.
- Large documents may need truncation for prompt limits.
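The TF-IDF suggestion above could be implemented roughly as follows; a minimal sketch with smoothed IDF, using helper names that are not from the project:

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Turn per-document token lists into TF-IDF weight dicts (smoothed IDF)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in tf.items()
        })
    return vectors

vecs = tfidf_vectors([["cat", "cat", "mat"], ["dog", "park"], ["cat", "dog"]])
```

Compared to raw frequencies, a term appearing in many documents (here "cat") is down-weighted relative to a rarer term ("mat"), which usually sharpens the ranking.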
same-but-with-langchain.py demonstrates the same workflow as mini_rag.py, but implemented with LangChain.
- Python 3.12+
- Install dependencies:

pip3 install langchain-core langchain-community langchain-openai spacy faiss-cpu
python3 -m spacy download en_core_web_md  # if using spaCy embeddings
MIT License