fix: use str.split() for accurate word count in PruningContentFilter by karesansui-u · Pull Request #1838 · unclecode/crawl4ai

karesansui-u · 2026-03-16T16:18:57Z

Summary

_compute_composite_score() counts words with text.count(" ") + 1, which overcounts whenever the text contains consecutive spaces. Text extracted from HTML with get_text(strip=True) commonly contains runs of spaces between inline elements, and each extra space inflates the count by one. As a result, min_word_threshold checks are too lenient and short or noisy nodes survive pruning.

Before (line 742):

word_count = text.count(" ") + 1

After:

word_count = len(text.split())

The same file already uses len(text.split()) for the identical purpose at line 268 and line 302, so this change also restores internal consistency.
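
To make the effect on the threshold concrete, here is a minimal sketch comparing the two counting approaches. The helper names, the sample string, and the threshold value are hypothetical and chosen only for illustration; passes_threshold() is a stand-in, not the actual check inside _compute_composite_score().

```python
def count_by_spaces(text: str) -> int:
    """Previous approach: number of space characters plus one."""
    return text.count(" ") + 1


def count_by_split(text: str) -> int:
    """Proposed approach: number of whitespace-delimited tokens."""
    return len(text.split())


def passes_threshold(word_count: int, min_word_threshold: int) -> bool:
    # Hypothetical stand-in for the filter's threshold check.
    return word_count >= min_word_threshold


# Three tokens separated by runs of 3 and 4 spaces.
noisy = "Read" + " " * 3 + "more" + " " * 4 + "here"
MIN_WORDS = 5  # assumed threshold, for illustration only

old_count = count_by_spaces(noisy)  # 7 spaces + 1 = 8
new_count = count_by_split(noisy)   # 3 tokens

print(old_count, passes_threshold(old_count, MIN_WORDS))  # 8 True
print(new_count, passes_threshold(new_count, MIN_WORDS))  # 3 False
```

With the space-based count, a three-word fragment clears a five-word threshold; with str.split() it does not.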

Edge case

text.count(" ") + 1 returns 1 for an empty string, while len("".split()) correctly returns 0. This means empty-text nodes that should be removed by min_word_threshold >= 1 currently slip through.

Changed files

crawl4ai/content_filter_strategy.py — 1 line in _compute_composite_score()

Testing

Verified that str.split() handles consecutive spaces, tabs, and empty strings correctly:

>>> "hello     world".count(" ") + 1   # overcounts
6
>>> len("hello     world".split())       # correct
2
>>> "".count(" ") + 1                    # wrong for empty
1
>>> len("".split())                      # correct
0
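
The session above covers consecutive spaces and the empty string. The tab case, along with newlines and whitespace-only input, could be checked with a small table-driven sketch like the following. The case list is illustrative and not part of the PR. Note that the space-based count undercounts when words are separated by tabs or newlines, since it only looks at the space character.

```python
# (input text, expected number of words)
cases = [
    ("hello world", 2),
    ("hello     world", 2),  # consecutive spaces
    ("hello\tworld", 2),     # tab separator
    ("hello\nworld", 2),     # newline separator
    ("  hello world  ", 2),  # leading and trailing spaces
    ("   ", 0),              # whitespace only
    ("", 0),                 # empty string
]

# str.split() with no arguments splits on any run of whitespace
# and drops empty tokens, so it matches every expected value.
split_counts = [len(text.split()) for text, _ in cases]
assert split_counts == [expected for _, expected in cases]

# The space-based count, for comparison.
space_counts = [text.count(" ") + 1 for text, _ in cases]
print(space_counts)  # [2, 6, 1, 1, 6, 4, 1]
```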


karesansui-u wants to merge 1 commit into unclecode:develop from karesansui-u:fix/word-count-content-filter