rajrupa04/WordDocParser

WordDocParser

A multimodal parsing and retrieval‑QA pipeline. It ingests one or more .docx or .pdf files, converts every paragraph, table, image, and nested structure into clean Markdown plus an AST, and embeds visual‑text chunks with ColQwen‑VL. Questions are then answered using Gemini's text embeddings (from Vertex AI) together with ColQwen2's visual embeddings (from HuggingFace).


What’s New

  • Unified OCR micro‑service: A single Google Colab notebook now handles all OCR tasks (plain text, tables, and images) using Qwen2-VL plus DetrForSegmentation for layout detection, with local EasyOCR as a fallback.
  • Multi‑document support: parse_all.py walks an input folder, auto‑selects PDFParser or DocumentParser for each file, and preserves a (doc_id, page, section_path) trail for citations.
  • Concurrency first: Async HTTP and batch image processing (--batch_size) keep large corpora fast.
  • Section‑aware citations: Every AST node carries its parent section and page, so LangChain can pinpoint sources in answers.
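The per-file parser selection described above can be pictured with a minimal sketch. This is not the code in parse_all.py: the names `PARSER_BY_SUFFIX`, `select_parser`, and `plan_parsing` are hypothetical, and using the file stem as `doc_id` is an assumption made only for illustration.

```python
from pathlib import Path

# Hypothetical mapping; the real parse_all.py may select parsers differently.
PARSER_BY_SUFFIX = {".pdf": "PDFParser", ".docx": "DocumentParser"}


def select_parser(path):
    """Return the parser name for a file, or None if the type is unsupported."""
    return PARSER_BY_SUFFIX.get(Path(path).suffix.lower())


def plan_parsing(input_folder):
    """Walk the folder and pair each supported file with a parser name.

    Assumption: doc_id is taken from the file stem.
    """
    plan = []
    for path in sorted(Path(input_folder).rglob("*")):
        parser = select_parser(path)
        if path.is_file() and parser is not None:
            plan.append({"doc_id": path.stem, "path": str(path), "parser": parser})
    return plan
```

Calling `plan_parsing("./input_folder")` would return one entry per .pdf or .docx file and skip everything else.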

Repository Layout

src/
├─ parsers/            # PDFParser, DocumentParser, OCR helpers
├─ custom_ast/         # ParagraphNode, SectionNode, VisionImageNode, …
├─ langchain_qa/       # Vector‑store builder

parse_all.py           # Entry point: parsing starts here
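To illustrate how a node in custom_ast/ could carry the (doc_id, page, section_path) trail used for citations, here is a rough sketch. The class names come from the layout above, but every field and method shown is an assumption; the actual definitions in the repository may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SectionNode:
    # Hypothetical fields: a title and an optional parent section.
    title: str
    parent: Optional["SectionNode"] = None

    def path(self) -> List[str]:
        """Titles from the root section down to this one."""
        titles = []
        node = self
        while node is not None:
            titles.append(node.title)
            node = node.parent
        return titles[::-1]


@dataclass
class ParagraphNode:
    # Hypothetical fields mirroring the citation trail.
    text: str
    doc_id: str
    page: int
    section: SectionNode

    def citation(self):
        """Return (doc_id, page, section_path) for this paragraph."""
        return (self.doc_id, self.page, " > ".join(self.section.path()))
```

Keeping a parent link on each section lets any leaf node rebuild its full section path without a separate lookup table.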


Setup Instructions

  1. Clone the repository:

    git clone <repo-url-here>
    cd WordDocParser
  2. Create a .env file and add your documents:

    • Copy .env.example to .env
    • Make sure all the documents you want to parse are present in the input_folder/ directory

OCR Google Colab Setup

The parser uses an OCR extraction microservice hosted on Google Colab.

  1. Open this notebook:
    📎 OCR + ColQwen2 Colab Notebook

  2. Click Runtime → Run all to start the FastAPI server.

  3. Confirm that the /extract-tables and /infer_blocks endpoints are up before starting the parser.
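One way to confirm that the endpoints are registered is to read the schema that FastAPI serves at /openapi.json by default. The sketch below assumes the notebook has not disabled that route and that the second endpoint is mounted as `/infer_blocks` with a leading slash; the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: both endpoints are mounted at these exact paths.
REQUIRED_PATHS = ["/extract-tables", "/infer_blocks"]


def missing_endpoints(openapi_schema, required=REQUIRED_PATHS):
    """Return the required paths that the OpenAPI schema does not list."""
    available = set(openapi_schema.get("paths", {}))
    return [path for path in required if path not in available]


def check_server(base_url, timeout=10):
    """Fetch the FastAPI schema from base_url and report missing endpoints."""
    url = base_url.rstrip("/") + "/openapi.json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        schema = json.load(response)
    return missing_endpoints(schema)
```

An empty list from `check_server(<your ngrok URL>)` means both paths are registered; it does not prove that the models behind them have finished loading.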


Run the Parser

To parse your files, run the command below, replacing the --colqwen_url value with the URL of your own running ColQwen server. Make sure you update the same URL in the PDFParser class too.

python parse_all.py --input_folder "./input_folder" --output_dir "./output_colpali" --colqwen_url "https://a2c9-34-10-63-157.ngrok-free.app"

Server Google Colab Setup

To interact with the parsed content:

  1. Run the notebook below to start the server. For now, once parsing finishes you have to zip the persistance_dir_vlm directory and upload it in the cell the notebook specifies. After it is unzipped, rename it as the notebook instructs, then continue with the remaining cells.

    📎 Server Notebook

  2. Once the server is up, use Postman or curl to send questions to the server.


API Request Format

Send a POST request to the running server with the following JSON structure:

{
  "question": "this is a question?"
}
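As an alternative to Postman or curl, the same request can be sent from Python. This is a minimal sketch using only the standard library. The server URL and route are not given in this README, so `server_url` is a placeholder you must supply, and treating the response as JSON is an assumption.

```python
import json
import urllib.request


def build_question_request(server_url, question):
    """Build a POST request carrying {"question": ...} as a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        server_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(server_url, question, timeout=60):
    """Send the question and decode the reply (assumed to be JSON)."""
    request = build_question_request(server_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending makes it easy to inspect the payload before any network call is made.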
