WordDocParser

A multimodal parsing and retrieval-QA pipeline. It ingests one or more .docx or .pdf files, converts every paragraph, table, image, and nested structure into clean Markdown plus an AST, and embeds visual-text chunks with ColQwen2 (from HuggingFace). Questions are answered using Gemini's text embeddings (from Vertex AI) together with the ColQwen2 visual embeddings.

What’s New

Upgrade	Details
Unified OCR micro‑service	A single Google Colab notebook now handles all OCR tasks—plain text, tables, and images—via Qwen2-VL + DetrForSegmentation for layout detection, with local EasyOCR as fallback.
Multi‑document support	`parse_all.py` walks an input folder, auto‑selects PDFParser or DocumentParser per file, and preserves a `(doc_id, page, section_path)` trail for citations.
Concurrency first	Async HTTP + batch image processing (`--batch_size`) keeps large corpora fast.
Section‑aware citations	Every AST node carries its parent section + page, so LangChain can pinpoint sources in answers.
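
The citation trail described in the table can be pictured with a small sketch. This is not the repository's actual node classes: `CitationTrail`, `SketchNode`, and `format_citation` are hypothetical names used for illustration, and only the `(doc_id, page, section_path)` triple comes from this README.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CitationTrail:
    """Hypothetical container for the (doc_id, page, section_path) trail."""
    doc_id: str
    page: int
    section_path: Tuple[str, ...]  # headings from outermost to innermost


@dataclass
class SketchNode:
    """Hypothetical stand-in for an AST node such as ParagraphNode."""
    text: str
    trail: CitationTrail


def format_citation(node: SketchNode) -> str:
    """Render a node's trail as a human-readable source string."""
    sections = " > ".join(node.trail.section_path) or "(no section)"
    return f"{node.trail.doc_id}, p. {node.trail.page}, {sections}"


node = SketchNode(
    text="Revenue grew in Q3.",
    trail=CitationTrail("report.pdf", 4, ("Results", "Quarterly revenue")),
)
print(format_citation(node))
```

Keeping the trail on every node means a retrieved chunk can be cited without looking anything up in a separate index.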

Repository Layout

src/
├─ parsers/            # PDFParser, DocumentParser, OCR helpers
├─ custom_ast/         # ParagraphNode, SectionNode, VisionImageNode, …
├─ langchain_qa/       # Vector‑store builder

parse_all.py           # Entry point

Setup Instructions

Clone the repository:

git clone <repo-url-here>
cd WordDocParser

Create a .env file:
- Copy .env.example to .env and fill in your own values

Add your documents:
- Make sure all the documents you want to parse are present in the input_folder/ directory

OCR Google Colab Setup

The parser uses an OCR extraction microservice hosted on Google Colab.

Open this notebook:
📎 OCR + ColQwen2 Colab Notebook
Click Runtime → Run all to start the FastAPI server.
Confirm that the /extract-tables and /infer_blocks endpoints are up before starting the parser.
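
One way to confirm the endpoints is to read the OpenAPI schema that a FastAPI app serves at /openapi.json by default. The sketch below assumes the notebook leaves that default enabled and that the second endpoint is mounted at `/infer_blocks` (the leading slash is an assumption); `fetch_openapi` and `missing_endpoints` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request
from typing import Iterable, List

# Endpoint paths as named in this README; the exact form of
# "/infer_blocks" is an assumption.
REQUIRED_PATHS = ("/extract-tables", "/infer_blocks")


def fetch_openapi(base_url: str, timeout: float = 10.0) -> dict:
    """Download the OpenAPI schema that FastAPI serves by default."""
    url = base_url.rstrip("/") + "/openapi.json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def missing_endpoints(schema: dict,
                      required: Iterable[str] = REQUIRED_PATHS) -> List[str]:
    """Return the required paths that the schema does not list."""
    available = schema.get("paths", {})
    return [path for path in required if path not in available]


# Usage (needs the Colab server to be running):
#   schema = fetch_openapi("<your-ngrok-url>")
#   print(missing_endpoints(schema) or "all endpoints are up")
```

An empty list from `missing_endpoints` means both routes are registered; it does not prove the models behind them have finished loading.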

Run the Parser

To begin parsing your files, run the following command, replacing the ColQwen URL with the one for your running Colab instance. Replace the URL in the PDFParser class as well.

python parse_all.py --input_folder "./input_folder" --output_dir "./output_colpali" --colqwen_url "https://a2c9-34-10-63-157.ngrok-free.app"
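
For orientation, the per-file parser selection that `parse_all.py` performs might look roughly like this. It is a sketch under assumptions, not the script's real code: the extension-to-parser mapping, the recursive walk, and the `plan_parsing` helper are all hypothetical, and parser names are returned as strings rather than instantiated.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Assumed mapping from file extension to the parser class named in this README.
PARSER_BY_SUFFIX: Dict[str, str] = {
    ".pdf": "PDFParser",
    ".docx": "DocumentParser",
}


def plan_parsing(input_folder: str) -> List[Tuple[Path, str]]:
    """Pair every supported file under input_folder with a parser name."""
    plan = []
    for path in sorted(Path(input_folder).rglob("*")):
        parser = PARSER_BY_SUFFIX.get(path.suffix.lower())
        if path.is_file() and parser is not None:
            plan.append((path, parser))
    return plan


# Usage:
#   for path, parser in plan_parsing("./input_folder"):
#       print(f"{path} -> {parser}")
```

Files with other extensions are skipped silently here; a real run may prefer to log them.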

Server Google Colab Setup

To interact with the parsed content:

Run the following notebook to start the server. For now, once parsing finishes you have to zip the persistance_dir_vlm directory and upload it in the notebook cell indicated. After it is unzipped, rename it as the notebook specifies. The remaining steps follow from the notebook itself.

📎 Server Notebook
Once the server is up, use Postman or curl to send questions to the server.
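
The zip step before the upload can be scripted. The sketch below assumes the persistence directory sits wherever you point it; its real location depends on your parser run, and `zip_persistence_dir` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def zip_persistence_dir(persist_dir: str) -> Path:
    """Create <persist_dir>.zip next to the directory and return its path."""
    src = Path(persist_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    archive = shutil.make_archive(
        base_name=str(src),        # archive path without the .zip suffix
        format="zip",
        root_dir=str(src.parent),  # paths in the archive are relative to this
        base_dir=src.name,         # keep the top-level folder inside the zip
    )
    return Path(archive)


# Usage (path is an assumption):
#   print(zip_persistence_dir("./persistance_dir_vlm"))
```

Keeping the top-level folder inside the archive means unzipping in the notebook yields a single directory that can then be renamed.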

API Request Format

Send a POST request to the running server with the following JSON structure:

{
  "question": "this is a question?"
}
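
If you prefer a script to Postman or curl, the same request can be sent from Python's standard library. This is a minimal sketch: the server URL and route are not given in this README, so `server_url` is a placeholder you must supply, and `build_question_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_question_request(server_url: str, question: str) -> urllib.request.Request:
    """Build a POST request carrying the JSON body shown above."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        server_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(server_url: str, question: str, timeout: float = 60.0) -> str:
    """Send the question and return the raw response body as text."""
    request = build_question_request(server_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (URL is a placeholder):
#   print(ask("<server-url>", "this is a question?"))
```

The response is returned as raw text because its schema is not documented here.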
