Python Code: DocxToMarkdown.py
This Python program converts Word .docx files into markdown .md files. It can process transcripts generated by Microsoft Teams, removing the image links inside the transcripts as well as backslashes. The immediate application is to use the cleaned-up markdown files as input to AI applications such as OpenAI Studio / APIs.
Put all your .docx files in one directory. The program prompts for the following paths (example inputs are shown after each prompt):
1. Enter the input directory path for .docx files: C:\data\docx_files_dir
2. Enter the output directory path for RAW .md files: C:\data\md_raw_files_dir
3. Enter the output directory path for CLEAN .md files: C:\data\md_clean_files_dir (if CLEAN_MD = True in .env)
The final output .md files are stored in the CLEAN output directory (the third path above). You can modify the function lineContainsPattern(line) to fit your own requirements.
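As an illustration of the cleaning step, here is a minimal sketch of what a line filter like lineContainsPattern(line) could look like. The regular expression, the helper clean_markdown, and the choice to drop whole lines are assumptions made for this example; the actual logic in DocxToMarkdown.py may differ.

```python
import re

# Assumed pattern: a Markdown image link such as ![alt](media/image1.png).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def lineContainsPattern(line):
    """Return True if the line should be dropped from the CLEAN output."""
    return bool(IMAGE_LINK.search(line))


def clean_markdown(text):
    """Drop lines matching lineContainsPattern, then strip all backslashes.

    clean_markdown is a hypothetical helper, not a function from the script.
    """
    kept = [line for line in text.splitlines() if not lineContainsPattern(line)]
    return "\n".join(kept).replace("\\", "")
```

To remove other kinds of lines (for example, timestamps in a transcript), you would extend the check inside lineContainsPattern rather than the surrounding loop.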
Review requirements.txt for the packages the program depends on.
Review env_sample.txt for environment variable definitions. Copy it as .env and edit to fit your own needs. You only need to set this one: CLEAN_MD = True
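How the script loads .env is not described here (it may rely on a library such as python-dotenv). The standard-library sketch below only illustrates how a line like CLEAN_MD = True can be interpreted as a boolean flag; read_env_flag is a hypothetical helper, and the accepted truthy spellings are an assumption.

```python
def read_env_flag(env_text, name, default=False):
    """Return the boolean value of `name` found in .env-style text."""
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Assumed truthy spellings; quotes around the value are ignored.
            return value.strip().strip("'\"").lower() in {"true", "1", "yes"}
    return default
```

For example, read_env_flag("CLEAN_MD = True", "CLEAN_MD") evaluates to True, while a file that does not mention CLEAN_MD falls back to the default.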
Python Code: MarkdownToDocx.py
Same requirements.txt.
No environment variable needed.
This Python program converts markdown .md files into Word .docx files.
Put all your .md files in one directory. The program prompts for the following paths (example inputs are shown after each prompt):
1. Enter the input file path: C:\data\md_input_files_dir
2. Enter the output file path: C:\data\output_files_dir
The final output .docx files are stored in the output path (the second path above).
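The directory-to-directory workflow can be sketched as a mapping from each input file to its output path. This is only an illustration under assumptions: plan_conversions is a hypothetical helper, the output name is assumed to reuse the input file's stem, and the actual conversion call is left out because the library used by MarkdownToDocx.py is not stated here.

```python
from pathlib import Path


def plan_conversions(input_dir, output_dir, src_ext=".md", dst_ext=".docx"):
    """Pair each source file in input_dir with its target path in output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    return [
        (src, out_dir / (src.stem + dst_ext))
        for src in sorted(in_dir.iterdir())
        # Only regular files with the expected extension are converted.
        if src.is_file() and src.suffix.lower() == src_ext
    ]
```

Swapping the two extension arguments gives the same plan for the .docx to .md direction.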
DocxToDocxRemoveImageLinks.py removes image links and backslashes. No environment variables needed.
DocxToDocxTextReplacement.py replaces a pair of old text phrases with new text. Review env_sample.txt for the environment variable definitions, then copy it as .env and edit it to fit your needs.
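The replacement itself can be pictured as applying old/new pairs to text in order. The sketch below works on a plain string and passes the pairs in directly, since the environment variable names defined in env_sample.txt are not listed here; replace_pairs is a hypothetical helper, and the real script operates on .docx content rather than a single string.

```python
def replace_pairs(text, pairs):
    """Apply each (old, new) replacement to text, in the order given."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text
```

Because the pairs are applied sequentially, a later pair can also match text introduced by an earlier one, so the order matters when phrases overlap.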