This project explores an approach that combines RAG (Retrieval-Augmented Generation) and LLMs to generate structured summaries of research papers. The goal is to automatically extract the most relevant passages and organize the summary into the following sections:
- Introduction
- Context
- Results
- Conclusion
- Relevance
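Since the summary sections double as retrieval targets, one way to represent them is a mapping from section names to retrieval queries. The query phrasings below are illustrative assumptions, not the project's actual prompts:

```python
# Hypothetical mapping from summary sections to retrieval queries.
# Section names come from the README; the query wording is an assumption.
SECTION_QUERIES = {
    "Introduction": "What problem does the paper address and why?",
    "Context": "What prior work and background does the paper build on?",
    "Results": "What are the main experimental results and findings?",
    "Conclusion": "What conclusions do the authors draw?",
    "Relevance": "Why are these findings significant for the field?",
}
```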
1. **Paper Vectorization**: The document is divided into chunks (smaller segments) with overlap to ensure context retention.
2. **Information Retrieval**: The system retrieves the most relevant excerpts based on predefined topics.
3. **Summary Generation**: LLaMA 3 processes the retrieved excerpts and generates a structured, concise summary.
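The chunking step above can be sketched as a simple fixed-size splitter with overlap. This is a minimal illustration, not the project's actual implementation (which may split on tokens or sentences rather than characters):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap`
    characters so context at chunk boundaries is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk would then be embedded and stored in ChromaDB for the retrieval step.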
The main script can be executed as follows:
```shell
python main.py --pdf_file data/sample.pdf --output_file output/summa.txt --database_dir chroma_db
```

| Parameter | Description |
|---|---|
| `-pdf, --pdf_file` | Path to the input PDF file (required). |
| `-o, --output_file` | Path to save the generated summary (default: `./summa.txt`). |
| `-db, --database_dir` | Directory where ChromaDB will be stored (default: `./chroma_db`). |
- Enhance section extraction by vectorizing chunks per section.
- Optimize LLM response time.
- Extract images from papers to enrich summaries.