chalengr/docx-formula-polisher-skill

DOCX Formula Polisher

Turn messy, flattened, or partially broken equations inside a .docx into clean Word equations without rewriting the rest of the document.

This repository is designed for Codex / ChatGPT skill workflows:

  1. extract every Word equation into a JSON manifest,
  2. let the model fill polished repaired_latex values,
  3. write the repaired formulas back into a new .docx.

What It Solves

This tool is useful when a Word document already contains equation objects, but they show problems like these:

  • g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12} rendered as flat text instead of a real fraction
  • \Sigma, roots, limits, or matrix dimensions that lost their Word structure
  • merged identifiers such as grevind, Qk, Scorei, or Rannual
  • formula-heavy reports where the layout is correct but equations need cleanup
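
As an illustration of the first case, the flat text and one possible repaired value are shown below. This is only a sketch: the right notation depends on the source document, and the repaired line is an example of what a repaired_latex value might contain, not output taken from the tool.

```latex
% flat text as it appears in the broken Word equation
g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12}

% a possible repaired_latex value with a real fraction
g_{ad,t} = \frac{Ad_t - Ad_{t-12}}{Ad_{t-12}}
```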

It is especially helpful for:

  • quant research reports
  • ML / finance / optimization documents
  • bilingual Chinese-English Word reports
  • "bad source DOCX -> polished destination DOCX" repair tasks

Repository Layout

.
├── SKILL.md                         # Instructions Codex / GPT should follow
├── README.md                        # Human-facing documentation
├── package.json                     # Node dependencies for LaTeX -> OMML conversion
├── requirements.txt                 # Python dependency list
├── docx_formula_polisher/
│   ├── __init__.py
│   ├── core.py                      # DOCX extraction / repair / comparison logic
│   └── latex_bridge.py              # Python -> Node conversion bridge
├── scripts/
│   ├── extract_docx_formulas.py     # DOCX -> manifest
│   ├── apply_formula_repairs.py     # manifest -> repaired DOCX
│   ├── compare_docx_math.py         # compare two DOCX files by formula order
│   └── latex_to_omml.mjs            # LaTeX -> MathML -> OMML
└── references/
    └── repair-rules.md              # Repair heuristics and notation rules

Requirements

  • Python 3.10+
  • Node.js 18.13+
  • npm 8+

Installation

git clone <your-repo-url>
cd docx-formula-polisher-skill
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
npm install

Quick Start

1. Extract formulas from a bad DOCX

python scripts/extract_docx_formulas.py \
  --input /path/to/bad.docx \
  --manifest /path/to/formulas.json

This produces a JSON manifest with:

  • source_text: flattened text extracted from the current Word equation
  • suspicion_reasons: heuristic flags for formulas that likely need repair
  • context: local paragraph / table context to help the model infer intent
  • repaired_latex: the field your model should fill
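
To make the field list concrete, here is a minimal sketch of what a manifest might look like and how pending entries could be listed. Only the four field names come from the list above. The "formulas" wrapper, the "index" key, the "flat_fraction" flag name, and the sample context string are assumptions made for illustration, so check a real manifest before relying on this shape.

```python
import json

# Hypothetical manifest: the four documented fields are real, while the
# "formulas" wrapper, "index" key, and flag names are assumed for this sketch.
SAMPLE_MANIFEST = json.loads("""
{
  "formulas": [
    {
      "index": 0,
      "source_text": "g_{ad,t}=(Ad_t-Ad_{t-12})/Ad_{t-12}",
      "suspicion_reasons": ["flat_fraction"],
      "context": "sample surrounding paragraph text",
      "repaired_latex": ""
    },
    {
      "index": 1,
      "source_text": "x+y",
      "suspicion_reasons": [],
      "context": "",
      "repaired_latex": ""
    }
  ]
}
""")


def pending_repairs(manifest):
    """Return entries that were flagged but have no repaired_latex yet."""
    return [
        entry
        for entry in manifest["formulas"]
        if entry["suspicion_reasons"] and not entry["repaired_latex"].strip()
    ]


if __name__ == "__main__":
    for entry in pending_repairs(SAMPLE_MANIFEST):
        print(entry["index"], entry["source_text"])
```

A check like this can be run before and after the model edits the manifest to see which flagged formulas are still unfilled.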

2. Ask Codex / GPT to repair the manifest

Open SKILL.md, then give the model a prompt like:

Use the docx-formula-polisher skill.
Read /path/to/formulas.json.
Fill repaired_latex for formulas that are structurally broken or obviously malformed.
Keep notation close to the source document and preserve surrounding layout.
Do not add markdown fences or $...$ delimiters.
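
The last line of the prompt can also be checked mechanically after the model returns. The helper below is a hypothetical sketch, not part of this repository: it flags repaired_latex values that are still wrapped in markdown fences or dollar delimiters.

```python
def has_forbidden_wrapper(latex: str) -> bool:
    """Return True if the value is fenced or wrapped in $...$ / $$...$$."""
    text = latex.strip()
    # Markdown code fences at either end of the value.
    if text.startswith("```") or text.endswith("```"):
        return True
    # Inline or display math delimiters around the whole value.
    return len(text) >= 2 and text.startswith("$") and text.endswith("$")


if __name__ == "__main__":
    for value in [r"\frac{a}{b}", r"$\frac{a}{b}$", "```latex\nx^2\n```"]:
        print(repr(value), has_forbidden_wrapper(value))
```

Values that fail the check can be sent back to the model or stripped by hand before the apply step.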

3. Apply repairs back into a new DOCX

python scripts/apply_formula_repairs.py \
  --input /path/to/bad.docx \
  --manifest /path/to/formulas.json \
  --output /path/to/fixed.docx

Optional Commands

Only export formulas with obvious structural issues:

python scripts/extract_docx_formulas.py \
  --input /path/to/bad.docx \
  --only-suspicious

Compare formulas between two documents by order:

python scripts/compare_docx_math.py \
  --left /path/to/fixed.docx \
  --right /path/to/reference.docx
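
Comparing "by order" means the first formula on the left is paired with the first on the right, and so on. The sketch below shows that idea on plain lists of strings. It is an assumption about the general approach, not the actual logic of compare_docx_math.py, and the function name is made up for the example.

```python
from itertools import zip_longest
from typing import Optional


def compare_by_order(
    left: list[str], right: list[str]
) -> list[tuple[int, Optional[str], Optional[str]]]:
    """Pair formulas by position and return the positions where they differ."""
    mismatches = []
    # zip_longest pads the shorter list with None, so a missing formula
    # shows up as a mismatch instead of being silently dropped.
    for index, (a, b) in enumerate(zip_longest(left, right)):
        if a != b:
            mismatches.append((index, a, b))
    return mismatches


if __name__ == "__main__":
    print(compare_by_order(["a/b", "x"], ["a/b", "y", "z"]))
```

Positional pairing is simple, but one inserted or deleted equation shifts every later pair, so a long run of mismatches usually points to a count difference rather than many bad repairs.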

Check the Node converter script syntax:

npm run check:node

How the Pipeline Works

The repair flow is intentionally split into two layers:

  • Python reads and rewrites the .docx package directly.
  • Node converts LaTeX -> MathML -> OMML so Word receives real equation objects.

That means the tool preserves:

  • tables
  • paragraph styles
  • Chinese / English prose
  • section structure
  • non-formula content

while only replacing the targeted equation nodes.
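
The Python layer can work this way because a .docx is a ZIP package whose main body lives in word/document.xml, with equations stored as m:oMath elements. The following is a minimal sketch of locating those nodes with the standard library only. It is not the code in core.py, and the tiny_docx helper builds a stripped-down package purely for the demo (a real DOCX contains more parts).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces for Office math (m:) and WordprocessingML (w:).
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def flattened_equations(docx_bytes: bytes) -> list[str]:
    """Return the flattened text of each m:oMath in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in math.iter(f"{{{M_NS}}}t"))
        for math in root.iter(f"{{{M_NS}}}oMath")
    ]


def tiny_docx() -> bytes:
    """Build a minimal in-memory package with one equation (demo only)."""
    document = (
        f'<w:document xmlns:w="{W_NS}" xmlns:m="{M_NS}"><w:body><w:p>'
        "<w:r><w:t>Growth is </w:t></w:r>"
        "<m:oMath><m:r><m:t>a/</m:t></m:r><m:r><m:t>b</m:t></m:r></m:oMath>"
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    print(flattened_equations(tiny_docx()))
```

Because only the m:oMath subtrees are touched, the surrounding w:p and w:r elements (and therefore styles, tables, and prose) can be written back unchanged.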

Current Limitations

  • Best for formulas that already exist as Word equation objects (m:oMath) inside the DOCX.
  • It does not OCR formula images.
  • It does not decide on its own which formula is mathematically correct in ambiguous cases; the model still needs to fill repaired_latex carefully.
  • Some complex OMML structures may still look "suspicious" to the heuristic checker even after successful repair.

Making It a Codex Skill

If you want to install it as a local Codex skill, copy this folder into your Codex skills directory and keep SKILL.md at the repository root.

Typical usage pattern:

  1. mention or select the skill,
  2. run extract_docx_formulas.py,
  3. let the model edit the manifest,
  4. run apply_formula_repairs.py.

Development Notes

Public-repo hygiene recommendations:

  • keep node_modules/ out of git
  • keep user documents and generated manifests out of git
  • choose a license before publishing publicly
  • add sample fixtures only if you are comfortable making them public

Acknowledgements

This repository uses:
