fix: use str.split() for accurate word count in PruningContentFilter #1838

Open
karesansui-u wants to merge 1 commit into unclecode:develop from karesansui-u:fix/word-count-content-filter
Conversation

@karesansui-u

Summary

_compute_composite_score() uses text.count(" ") + 1 to count words, which overcounts whenever the text contains consecutive spaces. HTML-extracted text from get_text(strip=True) commonly contains runs of spaces between inline elements, and each extra space inflates the count by one. This makes min_word_threshold checks too lenient, allowing short or noisy nodes to survive pruning.

Before (line 742):

word_count = text.count(" ") + 1

After:

word_count = len(text.split())

The same file already uses len(text.split()) for the identical purpose at line 268 and line 302, so this change also restores internal consistency.
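To make the effect concrete, here is a minimal sketch (hypothetical helper, not the actual crawl4ai code) of how the word count feeds a threshold check in a pruning filter:

```python
def passes_word_threshold(text: str, min_word_threshold: int) -> bool:
    # Old formula: text.count(" ") + 1 -- inflated by consecutive spaces
    # New formula: len(text.split()) -- collapses any run of whitespace
    word_count = len(text.split())
    return word_count >= min_word_threshold

# "ok   ok" has 2 real words; the old formula would report 4 and let it pass
print(passes_word_threshold("ok   ok", 3))   # False
print(passes_word_threshold("", 1))          # False: empty text is pruned
```

With the old formula, "ok   ok" would have counted as 4 words and cleared a threshold of 3.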

Edge case

text.count(" ") + 1 returns 1 for an empty string, while len("".split()) correctly returns 0. This means empty-text nodes that should be removed by min_word_threshold >= 1 currently slip through.

Changed files

  • crawl4ai/content_filter_strategy.py — 1 line in _compute_composite_score()

Testing

Verified that str.split() handles consecutive spaces, tabs, and empty strings correctly:

>>> "hello     world".count(" ") + 1   # overcounts
6
>>> len("hello     world".split())       # correct
2
>>> "".count(" ") + 1                    # wrong for empty
1
>>> len("".split())                      # correct
0

text.count(" ") + 1 overcounts words when consecutive spaces are
present, which is common in HTML-extracted text from get_text(strip=True).
This causes min_word_threshold checks to be too lenient, allowing
short/noisy content to pass through the filter.

The same file already uses len(text.split()) for the same purpose
at lines 268 and 302.
