A simple script to parse a dictionary file in MS Word format and produce an Excel spreadsheet with structured data
- Install Python 3
- Install the `lxml` and `html5lib` modules for Python 3
  - On Ubuntu Linux, run `sudo apt install python3-lxml python3-html5lib`
- Get the input files in `.mht` format (in Word, save as "Single File Web Page (*.mht)")
- Put the input files in their own folder (for these instructions, let's say the folder is called "Dictionary")
- Run `python3 htmlconvert.py Dictionary/*.mht`
  - This produces a `.txt` file for each `.mht` file
- Run `python3 convert.py Dictionary/*.txt`
  - This produces a `.csv` file and a `.sfm` file for each `.txt` file
- Import the `.csv` files into Excel, or whatever else you need done with them
- Import the `.sfm` files into FieldWorks
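
The two conversion steps rely on the shell to expand `*.mht` and `*.txt`, which some shells (for example the Windows command prompt) do not do. As a rough sketch, not part of this project, a small driver could expand the patterns in Python and run both scripts in order. The function names `matching_files`, `build_command` and `run_pipeline` are hypothetical, and the sketch assumes that `htmlconvert.py` and `convert.py` sit in the current directory and accept a list of file paths as arguments, as in the commands above.

```python
import glob
import os
import subprocess
import sys


def matching_files(folder, extension):
    """Return the sorted paths in `folder` that end with `extension`."""
    return sorted(glob.glob(os.path.join(folder, "*" + extension)))


def build_command(script, folder, extension):
    """Build the command line for one step, or None if no input files match."""
    files = matching_files(folder, extension)
    if not files:
        return None
    # Assumption: the script takes the input file paths as plain arguments.
    return [sys.executable, script] + files


def run_pipeline(folder):
    """Run step 1 (.mht -> .txt), then step 2 (.txt -> .csv and .sfm)."""
    steps = (("htmlconvert.py", ".mht"), ("convert.py", ".txt"))
    for script, extension in steps:
        # Files are looked up just before each step, so step 2 sees
        # whatever .txt files step 1 has written into the folder.
        command = build_command(script, folder, extension)
        if command is None:
            raise SystemExit("No %s files found in %s" % (extension, folder))
        subprocess.run(command, check=True)  # stop if a step fails
```

Calling `run_pipeline("Dictionary")` from the folder that contains the two scripts would then be equivalent to running the two commands by hand.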