Word Specifications Parser

A Python-based tool to parse structured Word documents containing product specifications and convert them to CSV format.

Overview

The parser reads Word documents that follow a specific hierarchical layout of product information and exports the data to CSV files. It identifies each field from formatting cues such as text color, numbering patterns, and specific keywords.

The system consists of:

Word Parser Module (word_parser.py): A command-line tool for parsing Word documents
Converter Application (converter_app.py): A GUI-based application for batch processing multiple files
Main Script (main.py): A wrapper script to run either the parser or the converter

Installation

Prerequisites

Python 3.7 or higher
Required Python packages:
- python-docx - For parsing Word documents
- tkinter - For the GUI application (part of the Python standard library; some Linux distributions package it separately, e.g. as python3-tk)

Install Dependencies

pip install python-docx

Usage

Option 1: GUI Application

The easiest way to use the tool is through the GUI application:

python main.py
# OR
python main.py --gui

This will open the converter application where you can:

Add one or more Word files for processing
Select an output directory for the CSV files
Enable debug mode for additional information
Start the conversion process
View the conversion log

Option 2: Command-Line Parser

For scripting or automation, you can use the command-line parser directly:

python main.py --cli
# OR
python word_parser.py -i input.docx -o output.csv [--debug]

Command-line options:

-i, --input: Input Word document file path (default: specifications_catalog.docx)
-o, --output: Output CSV file path (default: specifications_catalog.csv)
--debug: Print debug information during parsing
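
The options above map naturally onto `argparse`. The following is a minimal sketch of how such a command line could be declared, not the actual code in word_parser.py; the function name `build_arg_parser` and the help strings are assumptions made for illustration, while the flags and defaults are the ones listed above.

```python
import argparse


def build_arg_parser():
    """Declare the documented options (hypothetical helper, not from word_parser.py)."""
    parser = argparse.ArgumentParser(
        description="Parse a structured Word document and export it to CSV."
    )
    parser.add_argument(
        "-i", "--input",
        default="specifications_catalog.docx",
        help="Input Word document file path",
    )
    parser.add_argument(
        "-o", "--output",
        default="specifications_catalog.csv",
        help="Output CSV file path",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Print debug information during parsing",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_arg_parser().parse_args(["-i", "catalog.docx", "--debug"])
print(args.input, args.output, args.debug)
```

Passing an explicit list to `parse_args` keeps the sketch easy to try from an interactive session; a real entry point would normally call `parse_args()` with no arguments.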

Data Structure

The parser extracts the following hierarchical data from the Word document:

Group_title: Top-level category in UPPERCASE (e.g., "MECHANISCHE SLOTEN")
Subgroup_title: Second level with number format "00.00.00" (e.g., "00.00.00 Mechanische éénpuntsloten")
Item_title_NL: Specific product category with number and brand (e.g., "00.00.00 Standaard klavierslot... |FH| st Litto")
Description_NL: Detailed text description of the product category
LongDescription: Specific product description in purple text
Item_Number: Reference code (e.g., "A13E1")
Brand: Brand name extracted from Item_title_NL (e.g., "Litto")
Measuring_State: Special format text (e.g., "|FH| st")
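
To make the field list concrete, here is a hedged sketch of how Brand and Measuring_State might be split off an Item_title_NL and how one row could be written to CSV. It assumes the CSV columns carry the field names above in the order listed, and that the brand is whatever follows "|FH| st" on the title line; `split_item_title`, `rows_to_csv`, and the regular expression are hypothetical and may differ from what word_parser.py actually does.

```python
import csv
import io
import re

# Assumed column order: the field names in the order listed above.
COLUMNS = [
    "Group_title", "Subgroup_title", "Item_title_NL", "Description_NL",
    "LongDescription", "Item_Number", "Brand", "Measuring_State",
]

# Assumed pattern: "|FH| st" followed by the brand at the end of the title.
MEASURE_BRAND_RE = re.compile(r"(\|FH\|\s*st)\s+(\S.*)$")


def split_item_title(title):
    """Return (measuring_state, brand), or two empty strings if no match."""
    match = MEASURE_BRAND_RE.search(title)
    if match is None:
        return "", ""
    return match.group(1), match.group(2).strip()


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; missing fields become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


title = "00.00.00 Standaard klavierslot... |FH| st Litto"
state, brand = split_item_title(title)
print(rows_to_csv([{
    "Group_title": "MECHANISCHE SLOTEN",
    "Item_title_NL": title,
    "Item_Number": "A13E1",
    "Brand": brand,
    "Measuring_State": state,
}]))
```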

Document Format Requirements

For proper parsing, the Word document should follow this structure:

Group_title: All capital letters, no numbering
Subgroup_title: Starts with a number in the "00.00.00" format and does not contain "|FH|"
Item_title_NL: Starts with a number in the "00.00.00" format and contains "|FH| st" followed by the brand name
Description_NL: Regular text below the Item_title_NL
LongDescription: Purple text just above an Item_Number
Item_Number: Contains "Referentie : " followed by a code and "of equivalent"
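
The rules above can be read as a small decision procedure over paragraph text. The sketch below is one possible interpretation, assuming "00.00.00" means three two-digit groups separated by dots and that the purple check is done elsewhere and passed in as a flag; `classify_paragraph`, `extract_reference`, and the order of the checks are illustrative assumptions rather than the parser's actual logic.

```python
import re

# Assumption: "00.00.00" stands for any three two-digit groups.
NUMBER_RE = re.compile(r"^\d{2}\.\d{2}\.\d{2}\b")
# Assumption: the reference code is the first token after "Referentie :".
REFERENCE_RE = re.compile(r"Referentie\s*:\s*(\S+)")


def extract_reference(text):
    """Return the code after 'Referentie :', or None if absent."""
    match = REFERENCE_RE.search(text)
    return match.group(1) if match else None


def classify_paragraph(text, is_purple=False):
    """Map one paragraph to a field name following the rules above."""
    text = text.strip()
    if not text:
        return "empty"
    if extract_reference(text) is not None:
        return "item_number"
    if NUMBER_RE.match(text):
        # Numbered lines are item titles only when they carry "|FH|".
        return "item_title" if "|FH|" in text else "subgroup_title"
    if is_purple:
        return "long_description"
    if text.isupper():
        return "group_title"
    return "description"


for sample in [
    "MECHANISCHE SLOTEN",
    "00.00.00 Mechanische éénpuntsloten",
    "00.00.00 Standaard klavierslot... |FH| st Litto",
    "Referentie : A13E1 of equivalent",
]:
    print(classify_paragraph(sample), "<-", sample)
```

Checking the reference line first and the uppercase rule last keeps short all-caps codes from being mistaken for group titles; a document with different conventions may need a different order.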

Troubleshooting

Common Issues

No Purple Text Detected:
- The RGB values used to detect purple may not match the shade in your document. Adjust them in the is_purple_text function in word_parser.py.
Parsing Structure Incorrect:
- If the document structure varies slightly, you may need to adjust the patterns in the parsing functions
Brand or Measuring State Not Extracted:
- Check if the format matches the expected pattern "|FH| st Brand"
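
When purple text is not detected, it can help to see which colors the document actually uses and to compare with a tolerance instead of an exact value. The sketch below is not the is_purple_text function from word_parser.py; `is_purple_rgb`, `distinct_run_colors`, the reference value (112, 48, 160) and the tolerance are all assumptions for illustration. It relies on python-docx exposing a run's explicit color as `run.font.color.rgb`, which is a 3-tuple of integers or `None` when the run has no explicit RGB color (for example when the color is inherited from a style).

```python
def is_purple_rgb(rgb, reference=(112, 48, 160), tolerance=40):
    """Return True if each channel is within `tolerance` of the reference.

    `reference` is an assumed purple shade; replace it with a value
    reported by distinct_run_colors() for your own document.
    """
    if rgb is None:
        return False
    return all(abs(c - r) <= tolerance for c, r in zip(rgb, reference))


def distinct_run_colors(document):
    """Collect the explicit RGB colors used by runs in a python-docx Document."""
    colors = set()
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            rgb = run.font.color.rgb
            if rgb is not None:
                colors.add(tuple(rgb))
    return colors


# Usage with python-docx installed (file name is a placeholder):
#   from docx import Document
#   print(distinct_run_colors(Document("specifications_catalog.docx")))

print(is_purple_rgb((112, 48, 160)))
print(is_purple_rgb((0, 0, 0)))
```

If `distinct_run_colors` returns no purple-like value even though the text looks purple in Word, the color is probably applied through a style or theme, and a check on explicit run colors alone will not find it.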

License

This software is distributed under the MIT license.

Contributors

Your Name - Initial development
