Commit e74c328

feat(convert): improve HTML-to-Markdown conversion quality (#79)
## What

Major improvements to HTML→Markdown conversion: fix broken links, add tables, images, ordered lists, definition lists, and expand entity support.

## Why

Current conversion quality is too low for agents that need to understand page structure. Broken links and missing tables lose critical information.

## How

- **Links**: Track link text position in output buffer; on `</a>` wrap collected text in `[text](href)`. Empty text uses autolink `<href>` format.
- **Tables**: Collect cells into rows, render as markdown table with `|` separators and header separator row.
- **Images**: Emit `![alt](src)` from `<img>` tags.
- **Ordered lists**: Use stack of `(is_ordered, counter)` tuples. Ordered items get `1.`, `2.`, etc.
- **Definition lists**: `<dt>` → `**term**`, `<dd>` → `: definition`
- **Entities**: Expanded from ~10 to 40+ named entities (trade, bull, hellip, smart quotes, currency, arrows, fractions)
- **Whitespace**: `clean_whitespace()` now preserves indentation after newlines for proper nested list rendering.

No new external dependencies; all custom implementation.

## Risk

- Medium: changes core conversion behavior
- All existing tests updated and passing
- 12 new tests for links, tables, images, ordered lists, entities, definition lists

### Checklist

- [x] Unit tests passed (all 229)
- [x] Clippy clean
- [x] Docs build clean
- [x] No new dependencies

Closes #73
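For reviewers who want a concrete picture of the link, table, and ordered-list handling described under "How", here is a minimal sketch. It is not the code from this commit: the helper names `close_link`, `render_table`, and `list_marker` are hypothetical, and the two-space nesting indent and `---` separator cells are assumptions chosen for illustration.

```rust
/// Hypothetical helper: on `</a>`, wrap the text written since `text_start`
/// (the buffer length recorded at `<a>`) in Markdown link syntax.
fn close_link(out: &mut String, text_start: usize, href: &str) {
    // Remove the collected link text from the end of the output buffer.
    let collected = out.split_off(text_start);
    let text = collected.trim();
    if text.is_empty() {
        // Empty link text falls back to the autolink form `<href>`.
        out.push('<');
        out.push_str(href);
        out.push('>');
    } else {
        out.push_str(&format!("[{}]({})", text, href));
    }
}

/// Hypothetical helper: render collected rows as a Markdown table.
/// The first row is treated as the header (an assumption).
fn render_table(rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    let cols = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    if cols == 0 {
        return out;
    }
    for (i, row) in rows.iter().enumerate() {
        out.push('|');
        for c in 0..cols {
            // Pad short rows with empty cells so every row has `cols` cells.
            let cell = row.get(c).map(String::as_str).unwrap_or("");
            out.push(' ');
            out.push_str(cell);
            out.push_str(" |");
        }
        out.push('\n');
        if i == 0 {
            // Header separator row directly after the first row.
            out.push('|');
            for _ in 0..cols {
                out.push_str(" --- |");
            }
            out.push('\n');
        }
    }
    out
}

/// Hypothetical helper: produce the marker for the next `<li>` using a
/// stack of `(is_ordered, counter)` tuples, one entry per open list.
fn list_marker(stack: &mut Vec<(bool, usize)>) -> String {
    // Assumed convention: two spaces of indent per nesting level.
    let indent = "  ".repeat(stack.len().saturating_sub(1));
    match stack.last_mut() {
        Some((true, counter)) => {
            *counter += 1;
            format!("{}{}. ", indent, counter)
        }
        _ => format!("{}- ", indent),
    }
}
```

In this sketch the parser would record `out.len()` when it sees `<a>`, push `(true, 0)` or `(false, 0)` onto the stack on `<ol>` or `<ul>`, and pop on the matching close tag. Keeping the counter on the stack means a nested list does not disturb the numbering of its parent.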
1 parent 4162557 commit e74c328

1 file changed

Lines changed: 313 additions & 55 deletions
