Description
crawl4ai version
6.2, 6.3
Expected Behavior
Hi. Maybe I am doing something wrong, but:
I have a crawler that uses a BM25 content filter with an applicable query and bm25_threshold=1.5.
In certain circumstances it returns multiple copies of the same paragraph. I expect to get only one copy; as it stands, I end up feeding multiple copies to the RAG db.
I believe this may happen when the crawler crawls a page that contains multiple articles, where scrolling down loads the next article, and so on.
Current Behavior
I am getting output like this:
Success! Processing content...
Using fit_markdown, length: 8432
====================================================================================================
result.markdown.fit_markdown
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
- Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector. - Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
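Until the cause is found, the exact repeats could be dropped before the text reaches the RAG db. The following is only a minimal sketch of that idea: `dedupe_lines` and its `min_chars` cutoff are hypothetical names for illustration, not part of crawl4ai, and it assumes the duplicates are identical lines once whitespace and case are normalized.

```python
import re


def dedupe_lines(markdown: str, min_chars: int = 40) -> str:
    """Keep only the first occurrence of each long line.

    Lines shorter than `min_chars` (list markers, rules, short headings)
    are always kept, because they can legitimately repeat.
    """
    seen = set()
    kept = []
    for line in markdown.splitlines():
        # Compare on a normalized key so spacing/case differences do not matter.
        key = re.sub(r"\s+", " ", line).strip().lower()
        if len(key) >= min_chars:
            if key in seen:
                continue  # exact repeat of an earlier long line: drop it
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)


paragraph = "The latest House Price Index for May has been released."
sample = "\n".join([paragraph, "- item", paragraph, "- item", paragraph])
cleaned = dedupe_lines(sample)
print(cleaned)
```

This only catches exact repeats. In the output above, the first copy has the newsletter line on a separate line while the later copies have it fused onto the paragraph, so those two variants would both survive; catching them would need some fuzzy comparison.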
Is this reproducible?
Yes
Inputs Causing the Bug
- crawling this url https://www.tipranks.com/news/company-announcements/uk-house-prices-drop-market-implications-unveiled
Code below --
BM25 filter
DefaultMarkdownGenerator
CacheMode: BYPASS
Steps to Reproduce
Run the code.
Code snippets
# Imports needed by the snippets below
import asyncio
import os
from typing import List

import psutil
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

excluded_tags = [
    "nav", "aside", "footer", "header", "form", "noscript", "iframe",
    "script", "style", "link", "input", "button", "i",
    # Additional navigation and menu related tags
    "menu", "menuitem", "toolbar", "breadcrumb",
    # Social media and sharing
    "social", "share", "follow",
    # Ads and tracking
    "ad", "ads", "advert", "advertisement", "tracking", "analytics",
    # Comments and user interaction (often noisy)
    "comment", "comments", "reply", "replies",
]


# Function to convert the list of excluded tags to a CSS selector
def generate_css_selector(tags):
    # Join each tag with :not() and chain them
    return '*{}'.format(''.join(f':not({tag})' for tag in tags))


# Generate the CSS selector
css_selector = generate_css_selector(excluded_tags)

crawler_config_additions = {
    'word_count_threshold': 100,  # Higher threshold to focus on substantial content blocks
    # Enhanced CSS selector to exclude more navigation elements
    'css_selector': css_selector,
    # More aggressive exclusions for cleaner content
    'exclude_external_links': True,
    'exclude_social_media_links': True,
    'excluded_selector': "#ads,.tracker,.ad,.adsbygoogle,.adwords,.adwordsbygoogle,nav,aside,footer,header,.menu,.navigation,.breadcrumb,.social,.share",
    # Content focus
    'only_text': True,
    'remove_forms': True,
    # Target main content areas specifically
    'target_elements': ['article', 'main', '.content', '.post', '.entry', '.article-content', '.post-content'],
    # Media filtering
    'image_score_threshold': 6,
    'exclude_external_images': True,
    # Block entire domains
    'exclude_domains': ["adtrackers.com", "spammynews.org", "ads.com", "trackers.io"],
}
async def crawl_urls(urls: List[str]):
    """
    Sequential crawler (no longer parallel) using the working simple approach.
    Keeps all existing features: query mapping, BM25 filtering, memory tracking.
    """
    global url_query_mapping
    try:
        if url_query_mapping:
            print(f"Using URL Query Mapping: {url_query_mapping}")
    except NameError:
        # The mapping was never defined at module level; fall back to the default query
        url_query_mapping = None
    if not isinstance(urls, list):
        urls = [urls]
    print("\n=== Sequential Crawling with Query Mapping + Memory Check ===")
    all_results = []
    # Memory tracking
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/237.84.2.178 Safari/537.36",
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    success_count = 0
    fail_count = 0
    log_memory(prefix="Starting: ")
    # Sequential crawling (no parallel issues)
    for i, url in enumerate(urls):
        print(f"\n=== Crawling {i+1}/{len(urls)}: {url} ===")
        try:
            # Get query for this URL
            query = url_query_mapping.get(url) if url_query_mapping else "what is the current state of the housing markets in the US"
            print(f"Using query: {query}")
            # Choose filter based on whether we have a specific query
            if url_query_mapping and url in url_query_mapping:
                content_filter = BM25ContentFilter(
                    user_query=query,
                    bm25_threshold=1.5  # Working threshold from simple crawler
                )
                print("Using BM25 filter with specific query")
            else:
                content_filter = BM25ContentFilter(
                    user_query=query,
                    bm25_threshold=1.5
                )
                print("Using BM25 filter with default query")
            # Create markdown generator
            md_generator = DefaultMarkdownGenerator(content_filter=content_filter)
            # Simple, clean config like working simple crawler
            config = CrawlerRunConfig(
                #excluded_tags=["nav", "footer", "header", "comments", "comment", "comments-section", "script", "style"],
                #target_elements=['div.available-content', 'article', 'main', '.post-content', '.content', '.entry-content'],
                #exclude_external_links=True,
                #only_text=True,
                markdown_generator=md_generator,
                cache_mode=CacheMode.BYPASS,
                **crawler_config_additions,
            )
            async with AsyncWebCrawler() as crawler:
                result = await crawler.arun(
                    url=url,
                    config=config,
                    browser_config=browser_config
                )
                if result.success:
                    success_count += 1
                    print("✓ Success! Processing content...")
                    # Handle both types of markdown objects (like working simple crawler)
                    if hasattr(result.markdown, 'fit_markdown'):
                        content = result.markdown.fit_markdown
                        print(f"Using fit_markdown, length: {len(content)}")
                        print(result.markdown.fit_markdown)
                    else:
                        content = str(result.markdown)
                        print(f"Using string markdown, length: {len(content)}")
                        print(f"First 1000 chars: {content[:1000]}...")
                    all_results.append((url, content))
                else:
                    fail_count += 1
                    error_msg = result.error_message if hasattr(result, 'error_message') else 'Unknown error'
                    print(f"✗ Failed: {error_msg}")
                    all_results.append((url, f"Failed: {error_msg}"))
        except Exception as e:
            fail_count += 1
            print(f"✗ Exception crawling {url}: {e}")
            all_results.append((url, f"Error: {e}"))
        # Log memory after each URL
        log_memory(prefix=f"After URL {i+1}: ")
        # Small delay to be respectful
        if i < len(urls) - 1:  # Don't delay after last URL
            await asyncio.sleep(1)
    print("\nSummary:")
    print(f" - Successfully crawled: {success_count}")
    print(f" - Failed: {fail_count}")
    # Final memory log
    log_memory(prefix="Final: ")
    print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")
    # Return only the content strings, in input order
    all_results = [r[1] for r in all_results]
    return all_results
if __name__ == "__main__":
    #asyncio.run(main())
    urls = [
        "https://jscottdigital.com/investment-real-estate-website-blog-ideas-that-attract/",
        "https://www.tipranks.com/news/company-announcements/uk-house-prices-drop-market-implications-unveiled",
    ]
    # The snippet originally called crawl_parallel(), which is not defined here;
    # crawl_urls() above is the function that exists in this snippet.
    results = asyncio.run(crawl_urls(urls))
    print(f"results: {results}")
OS
macOS
Python version
python 3.12
Browser
default chrome?
Browser version
command line only
Error logs & Screenshots (if applicable)
No response
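One detail in the code snippets that may be worth checking: `generate_css_selector` produces a selector of the form `*:not(nav):not(aside)...`, which in CSS matches every element whose tag is not in the list, including `html`, `body` and every wrapper around a paragraph. The sketch below only illustrates that point with the standard library. It rebuilds the selector for a shortened tag list and counts how many enclosing elements of a sample paragraph would match; the sample HTML and the `AncestorCounter` class are made up for illustration. Whether crawl4ai emits the content of each matched element, and so whether this explains the repeated paragraphs, is an assumption that has not been verified here.

```python
from html.parser import HTMLParser


def generate_css_selector(tags):
    # Same construction as in the report: chain one :not() per tag
    return "*" + "".join(f":not({tag})" for tag in tags)


class AncestorCounter(HTMLParser):
    """For each text node, count the enclosing elements whose tag is not excluded."""

    def __init__(self, excluded):
        super().__init__()
        self.excluded = set(excluded)
        self.stack = []   # currently open tags, outermost first
        self.counts = {}  # text -> number of enclosing elements the selector would match

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.counts[text] = sum(1 for t in self.stack if t not in self.excluded)


excluded = ["nav", "aside", "footer"]  # shortened list, for illustration only
html = (
    "<html><body><main><article><div>"
    "<p>House prices fell.</p>"
    "</div></article></main></body></html>"
)

selector = generate_css_selector(excluded)
counter = AncestorCounter(excluded)
counter.feed(html)
matches = counter.counts["House prices fell."]

print(selector)
print(f"elements containing the paragraph that match the selector: {matches}")
```

If that turns out to be the cause, dropping `css_selector` from `crawler_config_additions` and relying on `excluded_selector` and `target_elements` alone would be a quick way to test it.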