A .docx file is a ZIP archive containing XML files.
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | Use docx-js - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
Legacy .doc files must be converted before editing:
python scripts/office/soffice.py --headless --convert-to docx document.doc# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf pageTo produce a clean document with all tracked changes accepted (requires LibreOffice):
python scripts/accept_changes.py input.docx output.docxGenerate .docx files with JavaScript, then validate. Install: npm install -g docx
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
TabStopType, TabStopPosition, Column, SectionType,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak } = require('docx');
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
python scripts/office/validate.py doc.docx// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
properties: {
page: {
size: {
width: 12240, // 8.5 inches in DXA
height: 15840 // 11 inches in DXA
},
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
}
},
children: [/* content */]
}]Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
size: {
width: 12240, // Pass SHORT edge as width
height: 15840, // Pass LONG edge as height
orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)Use Arial as the default font (universally supported). Keep titles black for readability.
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
paragraphStyles: [
// IMPORTANT: Use exact IDs to override built-in styles
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
]
},
sections: [{
children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
]
}]
});// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] }) // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD
// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{
children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 },
children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 },
children: [new TextRun("Numbered item")] }),
]
}]
});
// ⚠️ Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
rows: [
new TableRow({
children: [
new TableCell({
borders,
width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
margins: { top: 80, bottom: 80, left: 120, right: 120 },
children: [new Paragraph({ children: [new TextRun("Cell")] })]
})
]
})
]
})Width rules:
- Always use
WidthType.DXA— neverWidthType.PERCENTAGE(incompatible with Google Docs) - Table width must equal the sum of
columnWidths - Cell
widthmust match correspondingcolumnWidth - Cell
marginsare internal padding - they reduce content area, not add to cell width - US Letter with 1" margins: content width = 9360 DXA
// CRITICAL: type parameter is REQUIRED
new Paragraph({
children: [new ImageRun({
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" } // All three required
})]
})// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })
// Or:
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })// External link
new Paragraph({
children: [new ExternalHyperlink({
children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
link: "https://example.com",
})]
})
// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
anchor: "chapter1",
})]})const doc = new Document({
footnotes: {
1: { children: [new Paragraph("Source: Annual Report 2024")] },
2: { children: [new Paragraph("See appendix for methodology")] },
},
sections: [{
children: [new Paragraph({
children: [
new TextRun("Revenue grew 15%"),
new FootnoteReferenceRun(1),
new TextRun(" using adjusted metrics"),
new FootnoteReferenceRun(2),
],
})]
}]
});// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
children: [
new TextRun("Company Name"),
new TextRun("\tJanuary 2025"),
],
tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})
// Dot leader (e.g., TOC-style)
new Paragraph({
children: [
new TextRun("Introduction"),
new TextRun({ children: [
new PositionalTab({
alignment: PositionalTabAlignment.RIGHT,
relativeTo: PositionalTabRelativeTo.MARGIN,
leader: PositionalTabLeader.DOT,
}),
"3",
]}),
],
})// Equal-width columns
sections: [{
properties: {
column: {
count: 2, // number of columns
space: 720, // gap between columns in DXA (720 = 0.5 inch)
equalWidth: true,
separate: true, // vertical line between columns
},
},
children: [/* content flows naturally across columns */]
}]
// Custom-width columns (equalWidth must be false)
sections: [{
properties: {
column: {
equalWidth: false,
children: [
new Column({ width: 5400, space: 720 }),
new Column({ width: 3240 }),
],
},
},
children: [/* content */]
}]Force a column break with a new section using type: SectionType.NEXT_COLUMN.
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })sections: [{
properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } },
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: [/* content */]
}]- Set page size explicitly — docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- Landscape: pass portrait dimensions — docx-js swaps width/height internally
- Never use
\n— use separate Paragraph elements - Never use unicode bullets — use
LevelFormat.BULLETwith numbering config - PageBreak must be in Paragraph — standalone creates invalid XML
- ImageRun requires
type— always specify png/jpg/etc - Always set table
widthwith DXA — never useWidthType.PERCENTAGE - Tables need dual widths —
columnWidthsarray AND cellwidth, both must match - Always add cell margins — use
margins: { top: 80, bottom: 80, left: 120, right: 120 }for readable padding - Use
ShadingType.CLEAR— never SOLID for table shading - Never use tables as dividers/rules — cells have minimum height and render as empty boxes (including in headers/footers); use
border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }on a Paragraph instead; for two-column footers, use tab stops not tables - TOC requires HeadingLevel only — no custom styles on heading paragraphs
- Override built-in styles — use exact IDs: "Heading1", "Heading2", etc.
- Include
outlineLevel— required for TOC (0 for H1, 1 for H2, etc.)
Follow all 3 steps in order.
python scripts/office/unpack.py document.docx unpacked/Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities. Use --merge-runs false to skip run merging.
Edit files in unpacked/word/.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts.
CRITICAL: Use smart quotes for new content:
<w:t>Here’s a quote: “Hello”</w:t>| Entity | Character |
|---|---|
‘ |
' (left single) |
’ |
' (right single / apostrophe) |
“ |
" (left double) |
” |
" (right double) |
Adding comments: Use comment.py to handle boilerplate (text must be pre-escaped XML):
python scripts/comment.py unpacked/ 0 "Comment text with & and ’"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"Then add markers to document.xml (see Comments in XML Reference).
python scripts/office/pack.py unpacked/ output.docx --original document.docxValidates with auto-repair, condenses XML, creates DOCX. Use --validate false to skip.
Auto-repair will fix: durableId >= 0x7FFFFFFF; missing xml:space="preserve" on <w:t> with whitespace.
Auto-repair won't fix: Malformed XML, invalid nesting, missing relationships, schema violations.
- Replace entire
<w:r>elements when adding tracked changes — don't inject tags inside a run - Preserve
<w:rPr>formatting — copy original run's formatting block into tracked change runs
- Element order in
<w:pPr>:<w:pStyle>,<w:numPr>,<w:spacing>,<w:ind>,<w:jc>,<w:rPr>last - Whitespace: Add
xml:space="preserve"to<w:t>with leading/trailing spaces - RSIDs: Must be 8-digit hex (e.g.,
00AB1234)
Insertion:
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>Deletion:
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>Inside <w:del>: Use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits:
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="..."><w:r><w:delText>30</w:delText></w:r></w:del>
<w:ins w:id="2" w:author="Claude" w:date="..."><w:r><w:t>60</w:t></w:r></w:ins>
<w:r><w:t> days.</w:t></w:r>Deleting entire paragraphs/list items — when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>. Without it, accepting changes leaves an empty paragraph/list item:
<w:p>
<w:pPr>
<w:numPr>...</w:numPr> <!-- list numbering if present -->
<w:rPr>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
</w:del>
</w:p>Rejecting another author's insertion:
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Claude" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>Restoring another author's deletion:
<w:del w:author="Jane" w:id="5"><w:r><w:delText>deleted text</w:delText></w:r></w:del>
<w:ins w:author="Claude" w:id="10"><w:r><w:t>deleted text</w:t></w:r></w:ins>CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
<!-- Single comment -->
<w:commentRangeStart w:id="0"/>
<w:r><w:t> text </w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
<w:commentRangeStart w:id="1"/>
<w:r><w:t>text</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>- Add image file to
word/media/ - Add relationship to
word/_rels/document.xml.rels - Add content type to
[Content_Types].xml - Reference with
<w:drawing>/<wp:inline>in document.xml (EMUs: 914400 = 1 inch)
- pandoc: Text extraction
- docx:
npm install -g docx(new documents) - LibreOffice: PDF/image conversion via
scripts/office/soffice.py - Poppler:
pdftoppmfor images