
Fix "tables extracted from web pages contain extra spaces" #598

Merged
drunkpig merged 1 commit into ccprocessor:dev from ideaflow:dev
Nov 27, 2025

Conversation

@ideaflow
Collaborator

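In the last line of simplify, the pretty_print argument of etree.tostring was previously True, which made original_html contain extra line breaks and indentation. If a page contains tables or similar content, the extracted tables then also contain extra spaces. To align the output with the ground truth extracted directly from labeled_html, this PR changes pretty_print to False so that the extra whitespace is no longer introduced.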
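The effect can be reproduced outside the project with a small sketch. It assumes lxml is installed and uses a made-up table snippet and a hypothetical `serialize` helper. It does not reproduce the actual simplify code, and it uses the default XML serialization method, which may differ from the arguments simplify passes.

```python
from lxml import etree

# Hypothetical table markup with no whitespace between tags.
SOURCE = "<table><tr><td>a</td><td>b</td></tr></table>"


def serialize(markup: str, pretty: bool) -> str:
    """Parse markup and serialize it back with the given pretty_print setting."""
    root = etree.fromstring(markup)
    return etree.tostring(root, pretty_print=pretty, encoding="unicode")


pretty = serialize(SOURCE, pretty=True)
compact = serialize(SOURCE, pretty=False)

# Re-parse each result and look at the text sitting directly inside <tr>.
# With pretty_print=True the serializer has inserted whitespace-only text there.
pretty_tr_text = etree.fromstring(pretty).find("tr").text
compact_tr_text = etree.fromstring(compact).find("tr").text

print(repr(pretty))
print(repr(compact))
print(repr(pretty_tr_text))
print(repr(compact_tr_text))
```

With pretty_print=False the markup round-trips unchanged, so any later table extraction sees the same text nodes as the input. With pretty_print=True the inserted newlines and indentation become real text nodes that downstream extraction can pick up.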

@codecov

codecov bot commented Nov 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@@            Coverage Diff             @@
##              dev     #598      +/-   ##
==========================================
- Coverage   90.97%   90.70%   -0.28%     
==========================================
  Files         102      105       +3     
  Lines        8890     9272     +382     
==========================================
+ Hits         8088     8410     +322     
- Misses        802      862      +60     
Files with missing lines Coverage Δ
...it/main_html_parser/simplify_html/simplify_html.py 83.47% <100.00%> (+0.45%) ⬆️

... and 10 files with indirect coverage changes


@drunkpig drunkpig merged commit a2814e3 into ccprocessor:dev Nov 27, 2025
4 checks passed


2 participants