RFP Data Extraction Project

Objective

This project extracts structured information from Request for Proposal (RFP) documents in PDF and HTML formats. It uses large language models (LLMs) and vector search to parse and interpret RFP details and organize them into a predefined JSON structure.

Features

Supports multiple document formats: PDF and HTML
Extracts structured data into JSON
Uses LLMs (ChatGroq) with embeddings for context-aware information extraction
Marks missing or unspecified fields as "Not specified" automatically
Saves one JSON output file per RFP folder

Predefined Fields

The program extracts the following fields:

Bid Number
Title
Due Date
Bid Submission Type
Term of Bid
Pre Bid Meeting
Installation
Bid Bond Requirement
Delivery Date
Payment Terms
Any Additional Documentation Required
MFG for Registration
Contract or Cooperative to use
Model_no
Part_no
Product
contact_info (dict: Name, Email, Phone, Address)
company_name
Bid Summary
Product Specification
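
The field list above can be treated as a template that extracted data is checked against. The following is a minimal sketch under stated assumptions, not the project's actual code: the names FIELDS, CONTACT_KEYS, PLACEHOLDER, and normalize are hypothetical, and the sample values are made up. The "Not specified" placeholder is the one this README describes for missing data.

```python
import json

# Field names copied from the list above; the constant names are illustrative.
FIELDS = [
    "Bid Number", "Title", "Due Date", "Bid Submission Type", "Term of Bid",
    "Pre Bid Meeting", "Installation", "Bid Bond Requirement", "Delivery Date",
    "Payment Terms", "Any Additional Documentation Required",
    "MFG for Registration", "Contract or Cooperative to use", "Model_no",
    "Part_no", "Product", "contact_info", "company_name", "Bid Summary",
    "Product Specification",
]
CONTACT_KEYS = ["Name", "Email", "Phone", "Address"]
PLACEHOLDER = "Not specified"


def normalize(extracted: dict) -> dict:
    """Return a dict with every predefined field, filling gaps with the placeholder."""
    result = {}
    for field in FIELDS:
        if field == "contact_info":
            # contact_info is a nested dict; anything else is treated as missing.
            raw = extracted.get(field)
            contact = raw if isinstance(raw, dict) else {}
            result[field] = {k: contact.get(k) or PLACEHOLDER for k in CONTACT_KEYS}
        else:
            # Absent keys and empty values both fall back to the placeholder.
            result[field] = extracted.get(field) or PLACEHOLDER
    return result


# Made-up partial result, for illustration only.
partial = {"Bid Number": "RFP-001", "contact_info": {"Email": "buyer@example.com"}}
print(json.dumps(normalize(partial), indent=2))
```

Keeping the field names in one list means the prompt and the output check can share a single source of truth.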

Setup Instructions

1. Clone the repository

git clone https://github.com/j-jerusha/assignment.git
cd assignment

2. Create and activate a virtual environment

For macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

For Windows

python -m venv .venv
.venv\Scripts\activate

3. Install dependencies

If you’re using uv:

uv sync

Or using pip:

pip install -r requirements.txt

4. Set environment variables

Create a .env file with the following:

GROQ_API_KEY=<your_groq_api_key>
HF_TOKEN=<HF_access_token>
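
A quick check that both variables are set can catch configuration mistakes before any API call is made. This is a hedged sketch: it assumes python-dotenv's load_dotenv (the package is listed under Dependencies), and the helper names missing_env_vars and load_and_check are hypothetical rather than part of main.py.

```python
import os
from typing import Mapping, Sequence

REQUIRED_VARS = ("GROQ_API_KEY", "HF_TOKEN")


def missing_env_vars(env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS) -> list:
    """Return the names of required variables that are absent or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def load_and_check() -> None:
    """Load .env into the process environment, then fail early if a key is missing."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    # Searches for a .env file; variables already set are not overridden by default.
    load_dotenv()
    missing = missing_env_vars(os.environ)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Calling load_and_check() at the top of a script would stop with a clear message instead of an authentication error later on.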

Usage

Organize RFP documents into folders, e.g.:

assignment/
├─ Bid1/
│  ├─ document1.pdf
│  └─ document2.html
├─ Bid2/
│  └─ document3.pdf

Run the script:

python main.py

Extracted JSON files will be saved in the output/ folder:

output/
├─ Bid1_extracted.json
├─ Bid2_extracted.json
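
The folder layout and output naming shown above can be sketched with pathlib. This is an illustration only, assuming each RFP folder sits directly under the project root; find_rfp_folders and output_path are hypothetical names, not functions taken from main.py.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".html"}


def find_rfp_folders(root: Path) -> dict:
    """Map each subfolder name to its sorted list of PDF/HTML files."""
    folders = {}
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue
        docs = sorted(
            p for p in child.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
        if docs:  # skip folders without RFP documents, such as output/
            folders[child.name] = docs
    return folders


def output_path(folder_name: str, output_dir: Path) -> Path:
    """Build the output file path, e.g. output/Bid1_extracted.json."""
    return output_dir / f"{folder_name}_extracted.json"
```

For the layout above, find_rfp_folders(Path("assignment")) would yield entries for Bid1 and Bid2, and output_path("Bid1", Path("output")) gives output/Bid1_extracted.json.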

How It Works

Load Documents: PDFs and HTML files are loaded using PyPDFLoader and UnstructuredHTMLLoader.
Vector Store Creation: Documents are embedded using HuggingFaceEmbeddings and stored in FAISS for retrieval.
Context Retrieval: Relevant document sections are fetched with a retriever.
LLM Extraction: ChatGroq generates structured JSON based on the predefined fields.
JSON Output: Every predefined field is included in the output; missing data is marked as "Not specified".
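
The steps above can be sketched end to end as follows. This is a rough outline under stated assumptions, not the code in main.py: the function names, the prompt wording, the embedding model, the Groq model name, and the retriever's k value are all assumptions chosen for illustration. Third-party imports are placed inside the functions that need them so the pure helpers can be used on their own.

```python
import json
from pathlib import Path


def load_documents(folder: Path) -> list:
    """Load every PDF and HTML file in a folder into LangChain documents."""
    from langchain_community.document_loaders import PyPDFLoader, UnstructuredHTMLLoader

    docs = []
    for path in sorted(folder.iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            docs.extend(PyPDFLoader(str(path)).load())
        elif suffix == ".html":
            docs.extend(UnstructuredHTMLLoader(str(path)).load())
    return docs


def build_prompt(fields: list, context: str) -> str:
    """Ask for a JSON object with the given keys, based on the retrieved context."""
    return (
        "Extract the following fields from the RFP text and reply with a JSON "
        'object only. Use "Not specified" for anything the text does not state.\n'
        f"Fields: {', '.join(fields)}\n\nRFP text:\n{context}"
    )


def parse_json_reply(reply: str) -> dict:
    """Parse the text between the first '{' and the last '}'; return {} on failure."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def extract_fields(folder: Path, fields: list) -> dict:
    """Run load -> embed -> retrieve -> LLM for one RFP folder."""
    from langchain_community.vectorstores import FAISS
    from langchain_groq import ChatGroq
    from langchain_huggingface import HuggingFaceEmbeddings

    docs = load_documents(folder)
    # Assumed embedding model; the README does not name one.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    store = FAISS.from_documents(docs, embeddings)
    retriever = store.as_retriever(search_kwargs={"k": 6})  # k=6 is an assumption
    hits = retriever.invoke("RFP details: " + ", ".join(fields))
    context = "\n\n".join(doc.page_content for doc in hits)
    # Assumed Groq model name; ChatGroq reads GROQ_API_KEY from the environment.
    llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
    reply = llm.invoke(build_prompt(fields, context))
    return parse_json_reply(reply.content)
```

Parsing the reply defensively matters because a chat model may wrap the JSON in extra prose; returning an empty dict lets a later step fill every field with "Not specified".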

Dependencies

Python 3.10+
python-dotenv
langchain_community
langchain_huggingface
langchain_groq
faiss-cpu (the CPU build of FAISS is recommended)

Notes

Missing fields do not cause errors; they are written to the output as "Not specified".
Ensure GROQ_API_KEY and HF_TOKEN are set correctly in .env.
The project can be extended to additional document formats with minimal changes.

Author

J Jerusha
