DOCX Formula Polisher

Turn messy, flattened, or partially broken equations inside a .docx into clean Word equations without rewriting the rest of the document.

This repository is designed for Codex / ChatGPT skill workflows:

1. extract every Word equation into a JSON manifest,
2. let the model fill polished repaired_latex values,
3. write the repaired formulas back into a new .docx.

What It Solves

This tool is useful when a Word document already contains equation objects, but they look like this:

g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12} rendered as flat text instead of a real fraction
\Sigma, roots, limits, or matrix dimensions that lost their Word structure
merged identifiers such as grevind, Qk, Scorei, or Rannual
formula-heavy reports where the layout is correct but equations need cleanup
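
For the first case above, one plausible repaired_latex value is sketched here. This is an illustration only: the right notation depends on the source document, and the value is written without $...$ delimiters, as the manifest expects.

```latex
g_{ad,t} = \frac{Ad_t - Ad_{t-12}}{Ad_{t-12}}
```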

It is especially helpful for:

quant research reports
ML / finance / optimization documents
bilingual Chinese-English Word reports
"bad source DOCX -> polished destination DOCX" repair tasks

Repository Layout

.
├── SKILL.md                         # Instructions Codex / GPT should follow
├── README.md                        # Human-facing documentation
├── package.json                     # Node dependencies for LaTeX -> OMML conversion
├── requirements.txt                 # Python dependency list
├── docx_formula_polisher/
│   ├── __init__.py
│   ├── core.py                      # DOCX extraction / repair / comparison logic
│   └── latex_bridge.py              # Python -> Node conversion bridge
├── scripts/
│   ├── extract_docx_formulas.py     # DOCX -> manifest
│   ├── apply_formula_repairs.py     # manifest -> repaired DOCX
│   ├── compare_docx_math.py         # compare two DOCX files by formula order
│   └── latex_to_omml.mjs            # LaTeX -> MathML -> OMML
└── references/
    └── repair-rules.md              # Repair heuristics and notation rules

Requirements

Python 3.10+
Node.js 18.13+
npm 8+

Installation

git clone <your-repo-url>
cd docx-formula-polisher
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
npm install

Quick Start

1. Extract formulas from a bad DOCX

python scripts/extract_docx_formulas.py \
  --input /path/to/bad.docx \
  --manifest /path/to/formulas.json

This produces a JSON manifest with:

source_text: flattened text extracted from the current Word equation
suspicion_reasons: heuristic flags for formulas that likely need repair
context: local paragraph / table context to help the model infer intent
repaired_latex: the field your model should fill
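
To make the manifest fields concrete, here is a minimal sketch that lists the entries still waiting for a repair. The four field names come from the list above; the list-of-objects layout, the "id" key, and the "flat_fraction" flag name are assumptions made for illustration, so check a real manifest before relying on them.

```python
import json

# Hypothetical manifest content. Only the four field names are taken from
# this README; the layout, the "id" key and the flag name are assumptions.
SAMPLE_MANIFEST = """
[
  {"id": 0,
   "source_text": "g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12}",
   "suspicion_reasons": ["flat_fraction"],
   "context": "Ad growth rate",
   "repaired_latex": ""},
  {"id": 1,
   "source_text": "x+y",
   "suspicion_reasons": [],
   "context": "simple sum",
   "repaired_latex": ""}
]
"""


def pending_repairs(entries):
    """Return entries that are flagged as suspicious but not yet repaired."""
    return [
        entry for entry in entries
        if entry.get("suspicion_reasons")
        and not (entry.get("repaired_latex") or "").strip()
    ]


entries = json.loads(SAMPLE_MANIFEST)
for entry in pending_repairs(entries):
    print(entry["id"], entry["source_text"])
```

A filter like this can help decide which entries to hand to the model first.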

2. Ask Codex / GPT to repair the manifest

Open SKILL.md, then give the model a prompt like:

Use the docx-formula-polisher skill.
Read /path/to/formulas.json.
Fill repaired_latex for formulas that are structurally broken or obviously malformed.
Keep notation close to the source document and preserve surrounding layout.
Do not add markdown fences or $...$ delimiters.
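
Models sometimes wrap their answers despite the last instruction in that prompt. The following sketch checks a repaired_latex value for markdown fences and unescaped dollar signs before it is applied. The function name latex_wrapper_problems is hypothetical and not part of this repository.

```python
import re

FENCE = "`" * 3  # markdown code fence marker, built here to avoid a literal fence
UNESCAPED_DOLLAR = re.compile(r"(?<!\\)\$")  # a "$" not preceded by a backslash


def latex_wrapper_problems(latex):
    """Return a list of wrapper problems found in a repaired_latex value."""
    problems = []
    text = latex.strip()
    if FENCE in text:
        problems.append("markdown fence")
    if UNESCAPED_DOLLAR.search(text):
        problems.append("dollar delimiters")
    return problems


print(latex_wrapper_problems(r"\frac{a}{b}"))
print(latex_wrapper_problems(r"$\frac{a}{b}$"))
```

An empty list means the value has neither wrapper; anything else is worth fixing in the manifest before running the apply step.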

3. Apply repairs back into a new DOCX

python scripts/apply_formula_repairs.py \
  --input /path/to/bad.docx \
  --manifest /path/to/formulas.json \
  --output /path/to/fixed.docx

Optional Commands

Export only formulas with obvious structural issues:

python scripts/extract_docx_formulas.py \
  --input /path/to/bad.docx \
  --only-suspicious

Compare formulas between two documents by order:

python scripts/compare_docx_math.py \
  --left /path/to/fixed.docx \
  --right /path/to/reference.docx
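
Comparing "by order" means the first formula on the left is paired with the first on the right, and so on. The sketch below shows that idea on plain lists of formula strings. It is not the logic of compare_docx_math.py, whose internals are not documented here, and compare_by_order is a hypothetical name.

```python
from itertools import zip_longest


def compare_by_order(left, right):
    """Pair formulas by position and return (index, left, right) mismatches.

    A missing formula on either side shows up as None.
    """
    mismatches = []
    for index, (a, b) in enumerate(zip_longest(left, right)):
        if a != b:
            mismatches.append((index, a, b))
    return mismatches


fixed = [r"\frac{a}{b}", "x+y"]
reference = [r"\frac{a}{b}", "x-y", "z"]
for index, a, b in compare_by_order(fixed, reference):
    print(index, a, b)
```

Positional pairing is simple, but one inserted or deleted formula shifts every later pair, so a long run of mismatches often points to a single missing equation.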

Check the Node converter script syntax:

npm run check:node

How the Pipeline Works

The repair flow is intentionally split into two layers:

Python reads and rewrites the .docx package directly.
Node converts LaTeX -> MathML -> OMML so Word receives real equation objects.

That means the tool preserves:

tables
paragraph styles
Chinese / English prose
section structure
non-formula content

while only replacing the targeted equation nodes.
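
To illustrate what reading the .docx package directly can look like, here is a minimal sketch that opens a DOCX as a ZIP archive and counts the m:oMath elements in word/document.xml using only the standard library. It is not the code in core.py. The count_equations helper is hypothetical, and the in-memory document is a stripped-down stand-in for a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces for Word equations (m:) and WordprocessingML (w:).
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_equations(docx_file):
    """Count m:oMath elements in the main document part of a .docx."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return sum(1 for _ in root.iter(f"{{{M_NS}}}oMath"))


# Build a tiny in-memory package with two equations so the sketch runs
# without a real document. A real .docx contains more parts than this.
document_xml = (
    f'<w:document xmlns:w="{W_NS}" xmlns:m="{M_NS}"><w:body>'
    "<w:p><m:oMath><m:r><m:t>x</m:t></m:r></m:oMath></w:p>"
    "<w:p><m:oMath><m:r><m:t>y</m:t></m:r></m:oMath></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", document_xml)
buffer.seek(0)

demo_count = count_equations(buffer)
print(demo_count)
```

The same function accepts a file path, for example count_equations("/path/to/bad.docx"). Because only the equation nodes are located and touched, the rest of the XML can be written back unchanged.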

Current Limitations

Best for formulas that already exist as Word equation objects (m:oMath) inside the DOCX.
It does not OCR formula images.
It does not automatically decide the mathematically correct formula in ambiguous cases; the model still needs to fill repaired_latex carefully.
Some complex OMML structures may still look "suspicious" to the heuristic checker even after successful repair.

Making It a Codex Skill

If you want to install it as a local Codex skill, copy this folder into your Codex skills directory and keep SKILL.md at the repository root.

Typical usage pattern:

1. mention or select the skill,
2. run extract_docx_formulas.py,
3. let the model edit the manifest,
4. run apply_formula_repairs.py.

Development Notes

Public-repo hygiene recommendations:

keep node_modules/ out of git
keep user documents and generated manifests out of git
choose a license before publishing publicly
add sample fixtures only if you are comfortable making them public

Acknowledgements

This repository uses:

temml for LaTeX to MathML
mathml2omml for MathML to OMML
