A Python-based tool that provides bidirectional conversion between Word documents containing product specifications and CSV files.
This system allows you to:
- Parse Word documents with a specific hierarchical structure of product information and export the data to CSV files
- Convert CSV files back to Word documents that maintain the original structure and formatting
The system consists of:
- Word Parser Module (word_parser.py): Parses Word documents to CSV
- CSV to Word Converter (csv_to_word.py): Converts CSV files back to Word format
- GUI Applications for both conversion directions
- Main Script (main.py): A wrapper script to run any of the modules
- Python 3.7 or higher
- Required Python packages:
  - python-docx: For working with Word documents
  - tkinter: For the GUI applications (usually comes with Python)
pip install python-docx
The main script allows you to choose the conversion direction and interface:
python main.py [--mode {word2csv,csv2word}] [--gui | --cli]
Arguments:
- --mode: Select conversion direction (word2csv or csv2word). Default is word2csv.
- --gui: Launch the GUI application (default)
- --cli: Run the command-line interface
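The argument handling described above could be set up along these lines. This is only a sketch of one way to get the documented behavior with argparse; the function name build_parser and the interface attribute are assumptions for illustration, and the real main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented main.py options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "--mode",
        choices=["word2csv", "csv2word"],
        default="word2csv",
        help="conversion direction (default: word2csv)",
    )
    # --gui and --cli exclude each other; both write to the same attribute.
    interface = parser.add_mutually_exclusive_group()
    interface.add_argument(
        "--gui", dest="interface", action="store_const", const="gui",
        help="launch the GUI application (default)",
    )
    interface.add_argument(
        "--cli", dest="interface", action="store_const", const="cli",
        help="run the command-line interface",
    )
    parser.set_defaults(interface="gui")
    return parser


# Explicit argument lists are used here so the sketch does not read sys.argv.
default_args = build_parser().parse_args([])
cli_args = build_parser().parse_args(["--mode", "csv2word", "--cli"])
print(default_args.mode, default_args.interface)
print(cli_args.mode, cli_args.interface)
```

Using a mutually exclusive group means passing both --gui and --cli is rejected with a usage error rather than silently picking one.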
python main.py --mode word2csv
# OR
python main.py # Default is word2csv
This will open the converter application where you can:
- Add one or more Word files for processing
- Select an output directory for the CSV files
- Start the conversion process
- View the conversion log
python main.py --mode word2csv --cli
# OR
python word_parser.py -i input.docx -o output.csv [--debug]
Command-line options:
- -i, --input: Input Word document file path
- -o, --output: Output CSV file path
- --debug: Print debug information during parsing
python main.py --mode csv2word
This will open the converter application where you can:
- Add one or more CSV files for processing
- Select an output directory for the Word files
- Start the conversion process
- View the conversion log
python main.py --mode csv2word --cli
# OR
python csv_to_word.py -i input.csv -o output.docx
Command-line options:
- -i, --input: Input CSV file path
- -o, --output: Output Word document file path (optional)
The converter works with the following hierarchical data structure:
- Group_title: Top-level category in UPPERCASE (e.g., "MECHANISCHE SLOTEN")
- Subgroup_title: Second level with number format "00.00.00" (e.g., "00.00.00 Mechanische éénpuntsloten")
- Item_title_NL: Specific product category with number and brand (e.g., "00.00.00 Standaard klavierslot... |FH| st Litto")
- Description_NL: Detailed text description of the product category
- LongDescription: Specific product description in purple text with bullet points
- Item_Number: Reference code (e.g., "A13E1")
- Brand: Brand name extracted from Item_title_NL (e.g., "Litto")
- Measuring_State: Special format text (e.g., "|FH| st")
For proper parsing, the Word document should follow this structure:
- Group_title: All capital letters, blue color
- Subgroup_title: Starts with "00.00.00", blue color
- Item_title_NL: Starts with "00.00.00" and contains "|FH| st" followed by the brand name
- Description_NL: Regular text below the Item_title_NL
- LongDescription: Purple text with bullet points just above an Item_Number
- Item_Number: Contains "REFERENTIE : " followed by a code and "OF EQUIVALENT"
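The text-based rules above can be illustrated with a few regular expressions. The sketch below is not the code in word_parser.py: the classify function and the pattern names are hypothetical, it looks only at paragraph text (the blue and purple color checks are left out), and it assumes the measuring state always appears literally as "|FH| st".

```python
import re

# Hypothetical patterns derived from the formatting rules listed above.
NUMBER_PREFIX = re.compile(r"^\d{2}\.\d{2}\.\d{2}\s+")
ITEM_TITLE = re.compile(r"^(\d{2}\.\d{2}\.\d{2})\s+(.*?)\s*(\|FH\|\s*st)\s+(.+)$")
ITEM_NUMBER = re.compile(r"REFERENTIE\s*:\s*(\S+)\s+OF EQUIVALENT")


def classify(text):
    """Return (kind, extracted_fields) for one paragraph of text."""
    text = text.strip()
    match = ITEM_NUMBER.search(text)
    if match:
        return "Item_Number", {"Item_Number": match.group(1)}
    match = ITEM_TITLE.match(text)
    if match:
        return "Item_title_NL", {
            "Measuring_State": match.group(3),
            "Brand": match.group(4).strip(),
        }
    if NUMBER_PREFIX.match(text):
        return "Subgroup_title", {}
    if text and text.isupper():
        return "Group_title", {}
    # Description_NL and LongDescription need position and color information,
    # so plain text alone cannot tell them apart.
    return "Text", {}


for line in [
    "MECHANISCHE SLOTEN",
    "00.00.00 Mechanische éénpuntsloten",
    "00.00.00 Standaard klavierslot... |FH| st Litto",
    "REFERENTIE : A13E1 OF EQUIVALENT",
]:
    print(classify(line))
```

The reference-code check runs first because a "REFERENTIE" line is itself all uppercase and would otherwise be mistaken for a Group_title.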
The CSV file uses:
- UTF-8 encoding with BOM (for Excel compatibility)
- Semicolon (;) delimiter
- Quoted fields for handling special characters
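A file with these properties can be written and read back using only Python's standard csv module, as sketched below. The column order, the helper names write_rows and read_rows, and the choice of csv.QUOTE_ALL are assumptions; the converter itself may quote only where needed. The description values are made-up placeholders.

```python
import csv
import os
import tempfile

# Assumed column order, taken from the data structure listed earlier.
FIELDNAMES = [
    "Group_title", "Subgroup_title", "Item_title_NL", "Description_NL",
    "LongDescription", "Item_Number", "Brand", "Measuring_State",
]


def write_rows(path, rows):
    # "utf-8-sig" writes the BOM that Excel uses to detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=FIELDNAMES, delimiter=";", quoting=csv.QUOTE_ALL
        )
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    # "utf-8-sig" also strips the BOM again when reading.
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return list(csv.DictReader(fh, delimiter=";"))


sample = {
    "Group_title": "MECHANISCHE SLOTEN",
    "Subgroup_title": "00.00.00 Mechanische éénpuntsloten",
    "Item_title_NL": "00.00.00 Standaard klavierslot... |FH| st Litto",
    "Description_NL": "Placeholder description; contains a semicolon",
    "LongDescription": "• first point\n• second point",
    "Item_Number": "A13E1",
    "Brand": "Litto",
    "Measuring_State": "|FH| st",
}

tmp_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_rows(tmp_path, [sample])
round_trip = read_rows(tmp_path)
print(round_trip[0]["LongDescription"])
```

Quoting is what lets the semicolon in the description and the line break between bullet points survive the round trip.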
- Color Detection Problems:
  - If purple text isn't being detected, check the RGB values in the is_purple_text function
  - The default purple values are RGB(112, 48, 160)
- Structure Recognition Issues:
  - Ensure your Word document follows the expected structure
  - Check that bullet points use standard characters (•, -, etc.)
- Excel Compatibility:
  - If the CSV doesn't open properly in Excel, ensure it's using UTF-8-BOM encoding and semicolon delimiter
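For the color detection problem above, one common cause is a document whose purple differs slightly from RGB(112, 48, 160). The sketch below shows a tolerance-based comparison; the helper is_near_purple and its tolerance parameter are hypothetical and are not the is_purple_text function in word_parser.py, whose exact logic is not shown here. With python-docx, the value to test would typically come from run.font.color.rgb, which can be None when no explicit RGB color is set.

```python
# Default purple documented for this tool.
DEFAULT_PURPLE = (112, 48, 160)


def is_near_purple(rgb, target=DEFAULT_PURPLE, tolerance=0):
    """Return True if each channel of rgb is within tolerance of target.

    rgb is a sequence of three ints (red, green, blue), or None when the
    text has no explicit RGB color.
    """
    if rgb is None:
        return False
    return all(abs(channel - wanted) <= tolerance
               for channel, wanted in zip(rgb, target))


print(is_near_purple((112, 48, 160)))                # exact match
print(is_near_purple((110, 50, 158), tolerance=5))   # slightly different shade
print(is_near_purple((0, 0, 255), tolerance=5))      # blue, not purple
```

Printing the RGB value of an undetected run and comparing it with the default is usually the quickest way to see whether the values need adjusting.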
This software is distributed under the MIT license.