A multimodal parsing and retrieval-QA pipeline that ingests one or more .docx or .pdf files, converts every paragraph, table, image, and nested structure into clean Markdown plus an AST, and embeds visual-text chunks with ColQwen2. Answers are served by combining Gemini's text embeddings (from Vertex AI) with ColQwen2's visual embeddings (from Hugging Face).
| Upgrade | Details |
|---|---|
| Unified OCR micro‑service | A single Google Colab notebook now handles all OCR tasks—plain text, tables, and images—via Qwen2-VL + DetrForSegmentation for layout detection, with local EasyOCR as fallback. |
| Multi‑document support | parse_all.py walks an input folder, auto‑selects PDFParser or DocumentParser per file, and preserves a (doc_id, page, section_path) trail for citations. |
| Concurrency first | Async HTTP + batch image processing (--batch_size) keeps large corpora fast. |
| Section‑aware citations | Every AST node carries its parent section + page, so LangChain can pinpoint sources in answers. |
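The multi-document row above can be pictured with a small sketch. It assumes that parser selection is driven by file extension and that the file stem serves as `doc_id`; `select_parser_name` and `walk_inputs` are hypothetical helpers written for illustration, not the actual contents of `parse_all.py`. Only the class names `PDFParser` and `DocumentParser` come from this README.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the parser class names
# mentioned in this README.
PARSER_BY_SUFFIX = {".pdf": "PDFParser", ".docx": "DocumentParser"}


def select_parser_name(path):
    """Return the parser name for a file, or None if the type is unsupported."""
    return PARSER_BY_SUFFIX.get(Path(path).suffix.lower())


def walk_inputs(input_folder):
    """Yield (doc_id, path, parser_name) for every supported file, sorted by path.

    Using the file stem as doc_id is an assumption made for this sketch.
    """
    for path in sorted(Path(input_folder).rglob("*")):
        parser_name = select_parser_name(path)
        if path.is_file() and parser_name is not None:
            yield path.stem, path, parser_name
```

Unsupported files are skipped silently here; a real run might prefer to log them so that missing documents are easy to spot.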
src/
├─ parsers/ # PDFParser, DocumentParser, OCR helpers
├─ custom_ast/ # ParagraphNode, SectionNode, VisionImageNode, …
├─ langchain_qa/ # Vector‑store builder
parse_all.py # Logic initiates here
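To illustrate how nodes in `custom_ast/` could carry a section trail for citations, here is a minimal sketch. The class names `ParagraphNode` and `SectionNode` appear in this README, but the fields (`text`, `page`, `parent`, `children`) and the helpers `section_path` and `citation` are assumptions and may not match the real classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParagraphNode:
    """Minimal stand-in for an AST leaf; the field names are assumptions."""
    text: str
    page: Optional[int] = None
    # Excluded from repr/compare to avoid walking the parent/child cycle.
    parent: Optional["SectionNode"] = field(default=None, repr=False, compare=False)


@dataclass
class SectionNode(ParagraphNode):
    """A section whose `text` is its title and which owns child nodes."""
    children: List[ParagraphNode] = field(default_factory=list)

    def add(self, child):
        """Attach a child and record this section as its parent."""
        child.parent = self
        self.children.append(child)
        return child


def section_path(node):
    """Collect section titles from the root down to the node's parent section."""
    titles = []
    current = node.parent
    while current is not None:
        titles.append(current.text)
        current = current.parent
    return list(reversed(titles))


def citation(doc_id, node):
    """Build a (doc_id, page, section_path) tuple for a node."""
    return (doc_id, node.page, " > ".join(section_path(node)))
```

Storing a parent reference on every node keeps the citation lookup independent of how the tree was traversed, at the cost of a reference cycle between parents and children.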
- Clone the repository: `git clone <repo-url-here>`, then `cd WordDocParser`.
- Create a `.env` file:
  - Copy from `.env.example`
  - Make sure all the documents you want to parse are present in the `input_folder/` directory
The parser uses an OCR extraction microservice hosted on Google Colab.
- Open this notebook: 📎 OCR + ColQwen2 Colab Notebook
- Click Runtime → Run all to start the FastAPI server.
- Confirm that the `/extract-tables` and `infer_blocks` endpoints are up before starting the parser.
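One possible way to confirm the endpoints from a script is to read the OpenAPI schema that FastAPI serves at `/openapi.json` by default. This is a sketch under assumptions: the notebook may have disabled that route, and the leading slash on `/infer_blocks` is a guess. The helper names `fetch_openapi` and `missing_endpoints` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# The leading slash on "/infer_blocks" is an assumption.
REQUIRED_PATHS = ("/extract-tables", "/infer_blocks")


def fetch_openapi(base_url, timeout=10.0):
    """Download the schema FastAPI exposes at /openapi.json by default."""
    request = Request(base_url.rstrip("/") + "/openapi.json")
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def missing_endpoints(schema, required=REQUIRED_PATHS):
    """Return the required paths that the schema does not list."""
    available = schema.get("paths", {})
    return [path for path in required if path not in available]
```

Calling `missing_endpoints(fetch_openapi(url))` with the ngrok URL printed by the notebook should return an empty list when both routes are registered.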
To begin parsing your files, run the following command, replacing the ColQwen URL as needed. Make sure you replace the URL in the `PDFParser` class too.

`python parse_all.py --input_folder "./input_folder" --output_dir "./output_colpali" --colqwen_url "https://a2c9-34-10-63-157.ngrok-free.app"`

To interact with the parsed content:
- Run the following notebook to start the server. For now, once parsing finishes, you have to zip `persistance_dir_vlm` and upload it in the specified cell of the notebook. Once it is unzipped, rename it as specified there. The remaining steps follow from the notebook.
- Once the server is up, use Postman or `curl` to send questions to it.
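The zipping step in the list above can be scripted. The following is a minimal sketch using the standard library; `zip_directory` is a hypothetical helper, and where `persistance_dir_vlm` ends up on disk depends on your run, so pass the path explicitly.

```python
import shutil
from pathlib import Path


def zip_directory(directory):
    """Create <directory>.zip next to the directory and return the archive path.

    The directory itself is kept as the top-level entry, so unzipping
    recreates a folder with the same name.
    """
    directory = Path(directory).resolve()
    if not directory.is_dir():
        raise NotADirectoryError(directory)
    archive = shutil.make_archive(
        base_name=str(directory),  # ".zip" is appended automatically
        format="zip",
        root_dir=str(directory.parent),
        base_dir=directory.name,
    )
    return Path(archive)
```

For example, `zip_directory("persistance_dir_vlm")` would write `persistance_dir_vlm.zip` alongside the folder, ready to upload to the notebook.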
Send a POST request to the running server with the following JSON structure:
{
"question": "this is a question?"
}
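As an alternative to Postman or `curl`, the same request can be sent from Python. This sketch assumes the server replies with JSON, and the route `/ask` is only a placeholder because this README does not name the endpoint; replace it with the path shown in the notebook. `build_question_request` and `ask` are hypothetical helpers.

```python
import json
from urllib.request import Request, urlopen


def build_question_request(server_url, question, path="/ask"):
    """Build a JSON POST request; "/ask" is a placeholder route, not the real one."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        server_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(server_url, question, path="/ask", timeout=60.0):
    """Send the question and return the decoded JSON response (assumed to be JSON)."""
    request = build_question_request(server_url, question, path)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.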