This project extracts structured information from Request for Proposal (RFP) documents in PDF and HTML formats. Using large language models (LLMs) and vector search, it parses, interprets, and organizes RFP details into a predefined JSON structure.
- Supports multiple document formats: PDF and HTML
- Extracts structured data into JSON
- Uses LLMs (ChatGroq) with embeddings for context-aware information extraction
- Automatically marks missing or unspecified fields as "Not specified"
- Saves one JSON output file per RFP folder
The program extracts the following fields:
- Bid Number
- Title
- Due Date
- Bid Submission Type
- Term of Bid
- Pre Bid Meeting
- Installation
- Bid Bond Requirement
- Delivery Date
- Payment Terms
- Any Additional Documentation Required
- MFG for Registration
- Contract or Cooperative to use
- Model_no
- Part_no
- Product
- contact_info (dict: Name, Email, Phone, Address)
- company_name
- Bid Summary
- Product Specification
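As a rough illustration of the target structure, the fields above could be represented as a Python template like the one below. The names `FIELDS`, `CONTACT_KEYS`, `NOT_SPECIFIED`, and `empty_record` are hypothetical and not taken from the project's code; the "Not specified" default follows the convention this README describes for missing data.

```python
import json

# Field names copied from the list above; the constant names are illustrative.
FIELDS = [
    "Bid Number", "Title", "Due Date", "Bid Submission Type", "Term of Bid",
    "Pre Bid Meeting", "Installation", "Bid Bond Requirement", "Delivery Date",
    "Payment Terms", "Any Additional Documentation Required",
    "MFG for Registration", "Contract or Cooperative to use", "Model_no",
    "Part_no", "Product", "contact_info", "company_name", "Bid Summary",
    "Product Specification",
]
CONTACT_KEYS = ["Name", "Email", "Phone", "Address"]
NOT_SPECIFIED = "Not specified"


def empty_record():
    """Return a record with every field present and marked as not specified."""
    record = {field: NOT_SPECIFIED for field in FIELDS}
    # contact_info is the only nested field: a dict with four keys.
    record["contact_info"] = {key: NOT_SPECIFIED for key in CONTACT_KEYS}
    return record


if __name__ == "__main__":
    print(json.dumps(empty_record(), indent=2))
```

Starting from a fully populated template like this is one simple way to guarantee that every key appears in the output even when the documents say nothing about it.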
git clone https://github.com/j-jerusha/assignment.git
cd assignment

Create and activate a virtual environment on Linux/macOS:
python3 -m venv .venv
source .venv/bin/activate

On Windows:
python3 -m venv .venv
.venv\Scripts\activate

If you’re using uv:
uv sync

Or using pip:
pip install -r requirements.txt

Create a .env file with the following:
GROQ_API_KEY=<your_groq_api_key>
HF_TOKEN=<HF_access_token>

- Organize RFP documents into folders, e.g.:
assignment/
├─ Bid1/
│ ├─ document1.pdf
│ └─ document2.html
├─ Bid2/
│ └─ document3.pdf
- Run the script:
python main.py
- Extracted JSON files will be saved in the output/ folder:
output/
├─ Bid1_extracted.json
├─ Bid2_extracted.json
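The folder layout and output naming shown above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper names `collect_rfp_folders` and `output_path` are hypothetical, and the sketch assumes only `.pdf` and `.html` files directly inside each folder are of interest.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".html"}


def collect_rfp_folders(root):
    """Map each sub-folder name to its sorted list of PDF/HTML files."""
    folders = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        docs = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
        # Folders without any supported document (e.g. output/) are skipped.
        if docs:
            folders[folder.name] = docs
    return folders


def output_path(folder_name, output_dir="output"):
    """Build the per-folder output file name, e.g. output/Bid1_extracted.json."""
    return Path(output_dir) / f"{folder_name}_extracted.json"
```

With the example tree above, `collect_rfp_folders("assignment")` would return entries for `Bid1` and `Bid2`, and `output_path("Bid1")` would point at `output/Bid1_extracted.json`.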
- Load Documents: PDFs and HTML files are loaded using PyPDFLoader and UnstructuredHTMLLoader.
- Vector Store Creation: Documents are embedded using HuggingFaceEmbeddings and stored in FAISS for retrieval.
- Context Retrieval: Relevant document sections are fetched with a retriever.
- LLM Extraction: ChatGroq generates structured JSON based on the predefined fields.
- JSON Output: Ensures all fields are present; missing data is marked as "Not specified".
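The final JSON Output step could look roughly like the sketch below, which parses the model's reply and fills in any field that is absent or empty. The function name `normalize_reply` is hypothetical, and the assumption that the model may wrap its JSON in extra text or a code fence is mine, not something this README states.

```python
import json

NOT_SPECIFIED = "Not specified"


def normalize_reply(reply_text, fields):
    """Parse a model reply as JSON and ensure every expected field is present."""
    text = reply_text.strip()
    # Keep only the outermost {...} span, in case the reply has surrounding text.
    start, end = text.find("{"), text.rfind("}")
    try:
        data = json.loads(text[start:end + 1]) if start != -1 and end > start else {}
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for field in fields:
        value = data.get(field)
        # Treat absent, null, and empty values alike.
        result[field] = value if value not in (None, "", [], {}) else NOT_SPECIFIED
    return result


if __name__ == "__main__":
    reply = '{"Title": "Laptops", "Due Date": ""}'
    print(normalize_reply(reply, ["Bid Number", "Title", "Due Date"]))
```

Falling back to an empty dict on a parse failure means a malformed reply still yields a complete record, with every field marked "Not specified", rather than an exception.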
- Python 3.10+
- python-dotenv
- langchain_community
- langchain_huggingface
- langchain_groq
- FAISS (CPU version recommended)
- Missing fields are written as "Not specified" rather than omitted.
- Ensure API keys (GROQ) are correctly set in .env.
- Can be extended to additional document formats with minimal changes.
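A quick way to confirm the keys are set before running the script is a small check like the one below. It is only a sketch: the helper `missing_keys` is hypothetical, and it assumes the `.env` values have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import os

# Variable names taken from the .env example in this README.
REQUIRED_KEYS = ("GROQ_API_KEY", "HF_TOKEN")


def missing_keys(env=None):
    """Return the required variables that are absent or blank in env."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    # Passing a plain dict makes the check easy to try without touching os.environ.
    print(missing_keys({"GROQ_API_KEY": "abc"}))  # prints ['HF_TOKEN']
```

Calling `missing_keys()` with no argument checks the real environment, so an empty list means both variables are available to the script.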
J Jerusha