Word Specifications Parser

A Python-based tool to parse structured Word documents containing product specifications and convert them to CSV format.

Overview

The tool parses Word documents that follow a specific hierarchical layout of product information and exports the data to CSV files. It identifies each field from formatting cues such as text color, numbering patterns, and specific keywords.

The system consists of:

  1. Word Parser Module (word_parser.py): A command-line tool for parsing Word documents
  2. Converter Application (converter_app.py): A GUI-based application for batch processing multiple files
  3. Main Script (main.py): A wrapper script to run either the parser or the converter

Installation

Prerequisites

  • Python 3.7 or higher
  • Required Python packages:
    • python-docx - For parsing Word documents
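    • tkinter - For the GUI application (part of the Python standard library, so it is not installed with pip; some Linux distributions package it separately, e.g. python3-tk on Debian/Ubuntu)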

Install Dependencies

pip install python-docx

Usage

Option 1: GUI Application

The easiest way to use the tool is through the GUI application:

python main.py
# OR
python main.py --gui

This will open the converter application where you can:

  • Add one or more Word files for processing
  • Select an output directory for the CSV files
  • Enable debug mode for additional information
  • Start the conversion process
  • View the conversion log

Option 2: Command-Line Parser

For scripting or automation, you can use the command-line parser directly:

python main.py --cli
# OR
python word_parser.py -i input.docx -o output.csv [--debug]

Command-line options:

  • -i, --input: Input Word document file path (default: specifications_catalog.docx)
  • -o, --output: Output CSV file path (default: specifications_catalog.csv)
  • --debug: Print debug information during parsing
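
The options above map naturally onto Python's argparse module. The following is a minimal sketch of such a parser, not the actual code in word_parser.py (which is not shown here); the function name build_arg_parser is hypothetical, while the flags and defaults are the ones listed above.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Parse a Word specifications document and export it to CSV."
    )
    parser.add_argument(
        "-i", "--input",
        default="specifications_catalog.docx",
        help="Input Word document file path",
    )
    parser.add_argument(
        "-o", "--output",
        default="specifications_catalog.csv",
        help="Output CSV file path",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Print debug information during parsing",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_arg_parser().parse_args(["-i", "catalog.docx", "--debug"])
print(args.input, args.output, args.debug)
```

Omitting -o in the example call shows the documented default for the output path being applied.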

Data Structure

The parser extracts the following hierarchical data from the Word document:

  1. Group_title: Top-level category in UPPERCASE (e.g., "MECHANISCHE SLOTEN")
  2. Subgroup_title: Second level with number format "00.00.00" (e.g., "00.00.00 Mechanische éénpuntsloten")
  3. Item_title_NL: Specific product category with number and brand (e.g., "00.00.00 Standaard klavierslot... |FH| st Litto")
  4. Description_NL: Detailed text description of the product category
  5. LongDescription: Specific product description in purple text
  6. Item_Number: Reference code (e.g., "A13E1")
  7. Brand: Brand name extracted from Item_title_NL (e.g., "Litto")
  8. Measuring_State: Special format text (e.g., "|FH| st")
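
To illustrate how these eight fields could end up as CSV columns, here is a small sketch using the standard csv module. The column order, the helper name rows_to_csv, and the choice to leave missing fields empty are assumptions made for illustration; the real output layout is whatever word_parser.py produces.

```python
import csv
import io

# Column names come from the field list above; their order is an assumption.
FIELDNAMES = [
    "Group_title", "Subgroup_title", "Item_title_NL", "Description_NL",
    "LongDescription", "Item_Number", "Brand", "Measuring_State",
]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example row built from the sample values quoted in the field list.
example_row = {
    "Group_title": "MECHANISCHE SLOTEN",
    "Subgroup_title": "00.00.00 Mechanische éénpuntsloten",
    "Item_Number": "A13E1",
    "Brand": "Litto",
    "Measuring_State": "|FH| st",
}
print(rows_to_csv([example_row]))
```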

Document Format Requirements

For proper parsing, the Word document should follow this structure:

  • Group_title: All capital letters, no numbering
  • Subgroup_title: Starts with "00.00.00" but doesn't contain "|FH|"
  • Item_title_NL: Starts with "00.00.00" and contains "|FH| st" followed by the brand name
  • Description_NL: Regular text below the Item_title_NL
  • LongDescription: Purple text just above an Item_Number
  • Item_Number: Contains "Referentie : " followed by a code and "of equivalent"
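
The text-based rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the parser's actual logic: the function name classify_line and the regular expressions are hypothetical, and "00.00.00" is assumed to stand for any three two-digit groups separated by dots. Description_NL and LongDescription depend on position and text color rather than on the text itself, so they are left out.

```python
import re

# Assumption: "00.00.00" means three two-digit groups, e.g. "12.34.56".
NUMBER_RE = re.compile(r"^\d{2}\.\d{2}\.\d{2}\b")
# Measuring state "|FH| st" followed by the brand name at the end of the line.
ITEM_RE = re.compile(r"(\|FH\|\s*st)\s+(.+)$")
# "Referentie : <code> of equivalent"
REFERENCE_RE = re.compile(r"Referentie\s*:\s*(\S+)\s+of equivalent")


def classify_line(text):
    """Return (kind, details) for the text of one paragraph."""
    text = text.strip()
    reference = REFERENCE_RE.search(text)
    if reference:
        return "Item_Number", {"Item_Number": reference.group(1)}
    if NUMBER_RE.match(text):
        if "|FH|" in text:
            details = {}
            item = ITEM_RE.search(text)
            if item:
                details = {
                    "Measuring_State": item.group(1),
                    "Brand": item.group(2).strip(),
                }
            return "Item_title_NL", details
        return "Subgroup_title", {}
    # All capitals, at least one letter, and no leading number.
    if text == text.upper() and any(ch.isalpha() for ch in text):
        return "Group_title", {}
    return "other", {}


for sample in (
    "MECHANISCHE SLOTEN",
    "00.00.00 Mechanische éénpuntsloten",
    "00.00.00 Standaard klavierslot... |FH| st Litto",
    "Referentie : A13E1 of equivalent",
):
    print(classify_line(sample))
```

Checking the reference pattern first keeps an Item_Number line from being mistaken for a group title when the surrounding text happens to be uppercase.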

Troubleshooting

Common Issues

  1. No Purple Text Detected:
    • The exact RGB values for purple may need adjustment. Modify the is_purple_text function in word_parser.py.
  2. Parsing Structure Incorrect:
    • If the document structure varies slightly, you may need to adjust the patterns in the parsing functions.
  3. Brand or Measuring State Not Extracted:
    • Check that the Item_title_NL line matches the expected pattern "|FH| st Brand".
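
For the purple-text issue, one way to make color detection less brittle is to compare each RGB channel against a reference value with a tolerance. The sketch below is not the is_purple_text function from word_parser.py (its implementation is not shown here); the name is_purple_rgb, the reference color (112, 48, 160), and the default tolerance are all assumptions to adjust to the shade used in your documents.

```python
def is_purple_rgb(rgb, target=(112, 48, 160), tolerance=40):
    """Return True if every channel of rgb is within tolerance of target.

    rgb is a (red, green, blue) triple of 0-255 integers, or None when the
    text carries no explicit color.
    """
    if rgb is None:
        return False
    return all(abs(channel - ref) <= tolerance for channel, ref in zip(rgb, target))


# Hypothetical color values, as they might be read from a run's font color.
for color in ((112, 48, 160), (128, 0, 128), (0, 0, 0), None):
    print(color, is_purple_rgb(color))
```

With python-docx, the triple would typically come from run.font.color.rgb, which is None when the run has no directly assigned RGB color (for example, when the color is inherited from a style). Widening the tolerance accepts more shades at the cost of more false positives.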

License

This software is distributed under the MIT license.

Contributors

  • Your Name - Initial development