Commit e74c328
feat(convert): improve HTML-to-Markdown conversion quality (#79)
## What
Major improvements to HTML→Markdown conversion — fix broken links, add
tables, images, ordered lists, definition lists, and expand entity
support.
## Why
Current conversion quality is too low for agents that need to understand
page structure. Broken links and missing tables lose critical
information.
## How
- **Links**: Track link text position in output buffer; on `</a>` wrap
collected text in `[text](href)`. Empty text uses autolink `<href>`
format.
- **Tables**: Collect cells into rows, render as markdown table with `|`
separators and header separator row.
- **Images**: Emit `![alt](src)` from `<img>` tags.
- **Ordered lists**: Use stack of `(is_ordered, counter)` tuples.
Ordered items get `1.`, `2.`, etc.
- **Definition lists**: `<dt>` → `**term**`, `<dd>` → `: definition`
- **Entities**: Expanded from ~10 to 40+ named entities (trade, bull,
hellip, smart quotes, currency, arrows, fractions)
- **Whitespace**: `clean_whitespace()` now preserves indentation after
newlines for proper nested list rendering.
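
For reviewers, a minimal sketch of the link handling described in the first bullet. The function name `close_link` and its signature are hypothetical and chosen for illustration only. It assumes the converter records the output buffer length when `<a>` opens and passes it back as `start` on `</a>`. The code in this commit may differ.

```rust
/// Hypothetical helper: called on `</a>`.
/// `start` is the byte length of `out` recorded when the opening `<a>` was seen,
/// so everything from `start` onward is the collected link text.
fn close_link(out: &mut String, start: usize, href: &str) {
    // Move the collected link text out of the buffer.
    let collected = out.split_off(start);
    let text = collected.trim();
    if text.is_empty() {
        // No visible text: fall back to the autolink form `<href>`.
        out.push('<');
        out.push_str(href);
        out.push('>');
    } else {
        // Wrap the collected text as `[text](href)`.
        out.push('[');
        out.push_str(text);
        out.push_str("](");
        out.push_str(href);
        out.push(')');
    }
}
```

Recording a buffer offset instead of collecting link text in a separate string means nested inline formatting inside the link is emitted through the normal path and simply ends up inside the brackets.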
No new external dependencies — all custom implementation.
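
As an illustration of the table rendering described under **Tables**, here is a rough sketch using only the standard library. `render_table` is a hypothetical name, and treating the first collected row as the header is an assumption, not necessarily what this commit does.

```rust
/// Hypothetical helper: render collected rows as a Markdown table.
/// Assumes the first row is the header; short rows are padded with empty cells.
fn render_table(rows: &[Vec<String>]) -> String {
    // Use the widest row so every line has the same number of columns.
    let cols = rows.iter().map(Vec::len).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        out.push('|');
        for c in 0..cols {
            let cell = row.get(c).map(String::as_str).unwrap_or("");
            out.push(' ');
            out.push_str(cell.trim());
            out.push_str(" |");
        }
        out.push('\n');
        if i == 0 {
            // Header separator row directly after the first row.
            out.push('|');
            for _ in 0..cols {
                out.push_str(" --- |");
            }
            out.push('\n');
        }
    }
    out
}
```

A real implementation would likely also need to escape `|` inside cell text, which this sketch leaves out.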
## Risk
- Medium — changes core conversion behavior
- All existing tests updated and passing
- 12 new tests for links, tables, images, ordered lists, entities,
definition lists
### Checklist
- [x] Unit tests passed (all 229)
- [x] Clippy clean
- [x] Docs build clean
- [x] No new dependencies
Closes #73

1 parent 4162557 commit e74c328
1 file changed
Lines changed: 313 additions & 55 deletions