Turn messy, flattened, or partially broken equations inside a .docx into clean Word equations without rewriting the rest of the document.
This repository is designed for Codex / ChatGPT skill workflows:
- extract every Word equation into a JSON manifest,
- let the model fill polished repaired_latex values,
- write the repaired formulas back into a new .docx.
This tool is useful when a Word document already contains equation objects, but they show problems like these:
- g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12} rendered as flat text instead of a real fraction
- \Sigma, roots, limits, or matrix dimensions that lost their Word structure
- merged identifiers such as grevind, Qk, Scorei, or Rannual
- formula-heavy reports where the layout is correct but equations need cleanup
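As an illustration, the flat fraction and some of the merged identifiers listed above might be repaired into LaTeX roughly as follows. The subscript splits are assumptions made for this sketch; the intended notation has to be inferred from the surrounding document.

```latex
% flat text: g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12}
g_{ad,t} = \frac{Ad_t - Ad_{t-12}}{Ad_{t-12}}

% merged identifiers Qk, Scorei, Rannual (assumed splits)
Q_k \qquad Score_i \qquad R_{annual}
```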
It is especially helpful for:
- quant research reports
- ML / finance / optimization documents
- bilingual Chinese-English Word reports
- "bad source DOCX -> polished destination DOCX" repair tasks
.
├── SKILL.md # Instructions Codex / GPT should follow
├── README.md # Human-facing documentation
├── package.json # Node dependencies for LaTeX -> OMML conversion
├── requirements.txt # Python dependency list
├── docx_formula_polisher/
│   ├── __init__.py
│   ├── core.py # DOCX extraction / repair / comparison logic
│   └── latex_bridge.py # Python -> Node conversion bridge
├── scripts/
│   ├── extract_docx_formulas.py # DOCX -> manifest
│   ├── apply_formula_repairs.py # manifest -> repaired DOCX
│   ├── compare_docx_math.py # compare two DOCX files by formula order
│   └── latex_to_omml.mjs # LaTeX -> MathML -> OMML
└── references/
    └── repair-rules.md # Repair heuristics and notation rules
- Python 3.10+
- Node.js 18.13+
- npm 8+
git clone <your-repo-url>
cd docx-formula-polisher
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
npm install
python scripts/extract_docx_formulas.py \
--input /path/to/bad.docx \
--manifest /path/to/formulas.json
This produces a JSON manifest with:
- source_text: flattened text extracted from the current Word equation
- suspicion_reasons: heuristic flags for formulas that likely need repair
- context: local paragraph / table context to help the model infer intent
- repaired_latex: the field your model should fill
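To make the manifest fields concrete, here is a minimal sketch that picks out the entries still waiting for a repair. The four field names come from the list above; the top-level layout (a plain list of entries), the flag strings, and the sample data are assumptions for illustration, not real tool output.

```python
import json


def pending_repairs(entries):
    """Return entries flagged as suspicious that still lack repaired_latex."""
    return [
        e for e in entries
        if e.get("suspicion_reasons") and not (e.get("repaired_latex") or "").strip()
    ]


# Hypothetical manifest content; the real file layout may differ.
sample = json.loads("""
[
  {"source_text": "g=(a-b)/b", "suspicion_reasons": ["flat_fraction"],
   "context": "growth rate", "repaired_latex": ""},
  {"source_text": "x+y", "suspicion_reasons": [],
   "context": "sum", "repaired_latex": ""},
  {"source_text": "Qk", "suspicion_reasons": ["merged_identifier"],
   "context": "quantile", "repaired_latex": "Q_k"}
]
""")

todo = pending_repairs(sample)
print([e["source_text"] for e in todo])  # ['g=(a-b)/b']
```

A check like this can be run before the write-back step to confirm that no flagged formula was left empty.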
Open SKILL.md, then give the model a prompt like:
Use the docx-formula-polisher skill.
Read /path/to/formulas.json.
Fill repaired_latex for formulas that are structurally broken or obviously malformed.
Keep notation close to the source document and preserve surrounding layout.
Do not add markdown fences or $...$ delimiters.
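Models sometimes wrap answers in fences or dollar signs despite the last instruction in that prompt. The helper below is a hypothetical post-processing sketch (it is not part of this repository) that strips such wrappers from a repaired_latex value before it is written back.

```python
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable


def clean_repaired_latex(value: str) -> str:
    """Strip a surrounding markdown fence and $...$ / $$...$$ delimiters."""
    text = value.strip()
    # Remove a fence of the form: fence + optional language tag, newline, body, fence.
    fenced = re.fullmatch(
        FENCE + r"[^\n`]*\n(.*?)\n?[ \t]*" + FENCE, text, flags=re.DOTALL
    )
    if fenced:
        text = fenced.group(1).strip()
    # Remove one pair of delimiters, but only if they wrap the whole string.
    for delim in ("$$", "$"):
        inner = text[len(delim):-len(delim)]
        if (
            len(text) > 2 * len(delim)
            and text.startswith(delim)
            and text.endswith(delim)
            and delim not in inner
        ):
            text = inner.strip()
            break
    return text


print(clean_repaired_latex("$x^2$"))                      # x^2
print(clean_repaired_latex(r"$$\frac{a}{b}$$"))           # \frac{a}{b}
print(clean_repaired_latex(FENCE + "latex\nQ_k\n" + FENCE))  # Q_k
```

Strings that contain several separate math spans, such as `$a$ + $b$`, are left untouched because the delimiters do not wrap a single expression.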
python scripts/apply_formula_repairs.py \
--input /path/to/bad.docx \
--manifest /path/to/formulas.json \
--output /path/to/fixed.docx
Only export formulas with obvious structural issues:
python scripts/extract_docx_formulas.py \
--input /path/to/bad.docx \
--only-suspicious
Compare formulas between two documents by order:
python scripts/compare_docx_math.py \
--left /path/to/fixed.docx \
--right /path/to/reference.docx
Check the Node converter script syntax:
npm run check:node
The repair flow is intentionally split into two layers:
- Python reads and rewrites the .docx package directly.
- Node converts LaTeX -> MathML -> OMML so Word receives real equation objects.
That means the tool preserves:
- tables
- paragraph styles
- Chinese / English prose
- section structure
- non-formula content
while only replacing the targeted equation nodes.
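The Python layer can work this way because a .docx file is a zip package whose main part, word/document.xml, stores equations as m:oMath elements. The sketch below shows one way to locate those nodes with the standard library. It is an illustration under that assumption, not the logic in core.py, and the in-memory package is a stripped-down stand-in rather than a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OMML namespace used by Word equation elements (m:oMath).
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def equation_texts(docx_bytes: bytes) -> list[str]:
    """Return the flattened text of each m:oMath node in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        root = ET.fromstring(pkg.read("word/document.xml"))
    texts = []
    for omath in root.iter(f"{{{M_NS}}}oMath"):
        # m:t elements hold the literal characters of each math run.
        texts.append("".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t")))
    return texts


# Build a minimal stand-in package containing one paragraph with one equation.
document_xml = (
    f'<w:document xmlns:w="{W_NS}" xmlns:m="{M_NS}"><w:body><w:p>'
    "<w:r><w:t>Growth is </w:t></w:r>"
    "<m:oMath><m:r><m:t>g=(a-b)/b</m:t></m:r></m:oMath>"
    "</w:p></w:body></w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as pkg:
    pkg.writestr("word/document.xml", document_xml)

print(equation_texts(buf.getvalue()))  # ['g=(a-b)/b']
```

Only the m:oMath subtree is read here; the surrounding w:r prose run is never touched, which is the property that lets a repair replace equations while leaving the rest of the document alone.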
- Best for formulas that already exist as Word equation objects (m:oMath) inside the DOCX.
- It does not OCR formula images.
- It does not automatically decide the mathematically perfect formula in ambiguous cases; the model still needs to fill repaired_latex carefully.
- Some complex OMML structures may still look "suspicious" to the heuristic checker even after successful repair.
If you want to install it as a local Codex skill, copy this folder into your Codex skills directory and keep SKILL.md at the repository root.
Typical usage pattern:
- mention or select the skill,
- run extract_docx_formulas.py,
- let the model edit the manifest,
- run apply_formula_repairs.py.
Public-repo hygiene recommendations:
- keep node_modules/ out of git
- keep user documents and generated manifests out of git
- choose a license before publishing publicly
- add sample fixtures only if you are comfortable making them public
This repository uses:
- temml for LaTeX to MathML
- mathml2omml for MathML to OMML