[Bug]: Crawler returns multiple copies of same text #1213

@steelliberty

Description

crawl4ai version

6.2, 6.3

Expected Behavior

Hi. Maybe I am doing something wrong, but I have a crawler that uses a BM25 filter with an applicable query and bm25_threshold set to 1.5.
In certain circumstances it returns multiple copies of the same paragraphs. I expect to get only one copy of each. As a result, I end up feeding duplicate copies to the RAG db.

I believe this may happen when the crawler crawls a page that holds multiple articles, where scrolling down loads the next article, and so on.
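
Until the duplication is fixed upstream, one possible stopgap is to filter repeated lines out of the returned markdown before it reaches the RAG db. The sketch below is not part of crawl4ai; `dedupe_blocks` is a hypothetical helper name, and it assumes that two lines count as duplicates when they match after list markers, case, and extra whitespace are ignored.

```python
import re


def dedupe_blocks(markdown: str) -> str:
    """Drop repeated lines, keeping the first occurrence and the original order."""
    seen = set()
    kept = []
    for line in markdown.splitlines():
        # Normalize for comparison only: strip list markers, collapse whitespace,
        # and lowercase, so "  • text" and "text" compare equal.
        key = re.sub(r"\s+", " ", line.strip().lstrip("•*- ").strip()).lower()
        if not key:
            # Keep a single blank line as a separator, never several in a row.
            if kept and kept[-1] != "":
                kept.append("")
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = "Para A\n\n  • Promo\n    Para A\n  • Promo\n    Para A\n"
    print(dedupe_blocks(sample))
```

In the crawler below this would be applied to `result.markdown.fit_markdown` before appending to `all_results`. It hides the symptom only; the crawl itself still produces the repeated text.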

Current Behavior

I am getting output like this:

Success! Processing content...
Using fit_markdown, length: 8432

====================================================================================================
result.markdown.fit_markdown

The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.

  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
    The latest House Price Index for May has been released, revealing a surprising downturn in the UK housing market. The index showed a decline of 0.4% month-over-month, falling short of the anticipated 0.4% increase. This marks a significant drop from the previous month’s 0.3% rise, indicating a potential cooling in the housing sector.
  • Receive undervalued, market resilient stocks right to your inboxwith TipRanks' Smart Value Newsletter
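
To quantify output like the above, a small diagnostic can count how often each substantial line occurs. This is a sketch under assumptions: `count_repeats` is a hypothetical helper, not a crawl4ai function, and the 40-character minimum is an arbitrary cutoff chosen to ignore short lines.

```python
from collections import Counter


def count_repeats(markdown: str, min_len: int = 40) -> dict[str, int]:
    """Return each sufficiently long line that occurs more than once, with its count."""
    lines = [ln.strip() for ln in markdown.splitlines()]
    counts = Counter(ln for ln in lines if len(ln) >= min_len)
    return {text: n for text, n in counts.items() if n > 1}


if __name__ == "__main__":
    paragraph = "x" * 50
    sample = f"{paragraph}\n  {paragraph}\nshort\nshort\n"
    for text, n in count_repeats(sample).items():
        print(n, text[:60])
```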

Is this reproducible?

Yes

Inputs Causing the Bug

- Crawling this URL: https://www.tipranks.com/news/company-announcements/uk-house-prices-drop-market-implications-unveiled
- Code below, which uses:
  - BM25 filter
  - DefaultMarkdownGenerator
  - CacheMode: BYPASS

Steps to Reproduce

Run the code.

Code snippets

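import asyncio
import os
from typing import List

import psutil
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

excluded_tags = [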
    "nav", "aside", "footer", "header", "form", "noscript", "iframe", 
    "script", "style", "link", "input", "button", "i",
    # Additional navigation and menu related tags
    "menu", "menuitem", "toolbar", "breadcrumb",
    # Social media and sharing
    "social", "share", "follow",
    # Ads and tracking
    "ad", "ads", "advert", "advertisement", "tracking", "analytics",
    # Comments and user interaction (often noisy)
    "comment", "comments", "reply", "replies",
]

# Function to convert the list of excluded tags to a CSS selector
def generate_css_selector(tags):
    # Join each tag with :not() and chain them
    return '*{}'.format(''.join(f':not({tag})' for tag in tags))

# Generate the CSS selector
css_selector = generate_css_selector(excluded_tags)

crawler_config_additions = {
    'word_count_threshold': 100,  # Higher threshold to focus on substantial content blocks
    
    # Enhanced CSS selector to exclude more navigation elements
    'css_selector': css_selector,
    
    # More aggressive exclusions for cleaner content
    'exclude_external_links': True,
    'exclude_social_media_links': True,
    'excluded_selector': "#ads,.tracker,.ad,.adsbygoogle,.adwords,.adwordsbygoogle,nav,aside,footer,header,.menu,.navigation,.breadcrumb,.social,.share",
    
    # Content focus
    'only_text': True,
    'remove_forms': True,
    
    # Target main content areas specifically
    'target_elements': ['article', 'main', '.content', '.post', '.entry', '.article-content', '.post-content'],
    
    # Media filtering
    'image_score_threshold': 6,
    'exclude_external_images': True,
    
    # Block entire domains
    'exclude_domains': ["adtrackers.com", "spammynews.org", "ads.com", "trackers.io"],
}

async def crawl_urls(urls: List[str]):
    """
    Sequential crawler (no longer parallel) using the working simple approach.
    Keeps all existing features: query mapping, BM25 filtering, memory tracking.
    """
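    global url_query_mapping
    try:
        if url_query_mapping:
            print(f"Using URL Query Mapping: {url_query_mapping}")
    except NameError:
        # The mapping is optional; fall back to the default query if it was never defined
        url_query_mapping = None

    # Accept a single URL as well as a list of URLs
    if not isinstance(urls, list):
        urls = [urls]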
        
    print("\n=== Sequential Crawling with Query Mapping + Memory Check ===")
    all_results = []
    
    # Memory tracking
    peak_memory = 0
    process = psutil.Process(os.getpid())
    
    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")
    
    # Browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/237.84.2.178 Safari/537.36",
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    
    success_count = 0
    fail_count = 0
    
    log_memory(prefix="Starting: ")
    
    # Sequential crawling (no parallel issues)
    for i, url in enumerate(urls):
        print(f"\n=== Crawling {i+1}/{len(urls)}: {url} ===")
        
        try:
            # Get query for this URL
            query = url_query_mapping.get(url) if url_query_mapping else "what is the current state of the housing markets in the US"
            print(f"Using query: {query}")
            
            # Choose filter based on whether we have a specific query
            if url_query_mapping and url in url_query_mapping:
                content_filter = BM25ContentFilter(
                    user_query=query,
                    bm25_threshold=1.5  # Working threshold from simple crawler
                )
                print(f"Using BM25 filter with specific query")
            else:
                content_filter = BM25ContentFilter(
                    user_query=query,
                    bm25_threshold=1.5
                )
                print(f"Using BM25 filter with default query")
            
            # Create markdown generator
            md_generator = DefaultMarkdownGenerator(content_filter=content_filter)
            
            # Simple, clean config like working simple crawler
            config = CrawlerRunConfig(
                #excluded_tags=["nav", "footer", "header", "comments", "comment", "comments-section", "script", "style"],
                #target_elements=['div.available-content', 'article', 'main', '.post-content', '.content', '.entry-content'],
                #exclude_external_links=True,
                #only_text=True,
                markdown_generator=md_generator,
                cache_mode=CacheMode.BYPASS,
                **crawler_config_additions,
            )

            
            # BrowserConfig is passed to the crawler itself; arun() takes only the run config
            async with AsyncWebCrawler(config=browser_config) as crawler:
                result = await crawler.arun(
                    url=url,
                    config=config,
                )
                
                if result.success:
                    success_count += 1
                    print(f"✓ Success! Processing content...")
                    
                    # Handle both types of markdown objects (like working simple crawler)
                    if hasattr(result.markdown, 'fit_markdown'):
                        content = result.markdown.fit_markdown
                        print(f"Using fit_markdown, length: {len(content)}")
                        print(result.markdown.fit_markdown)
                    else:
                        content = str(result.markdown)
                        print(f"Using string markdown, length: {len(content)}")
                    
                    print(f"First 1000 chars: {content[:1000]}...")
                    all_results.append((url, content))
                    
                else:
                    fail_count += 1
                    error_msg = result.error_message if hasattr(result, 'error_message') else 'Unknown error'
                    print(f"✗ Failed: {error_msg}")
                    all_results.append((url, f"Failed: {error_msg}"))
                    
        except Exception as e:
            fail_count += 1
            print(f"✗ Exception crawling {url}: {e}")
            all_results.append((url, f"Error: {e}"))
        
        # Log memory after each URL
        log_memory(prefix=f"After URL {i+1}: ")
        
        # Small delay to be respectful
        if i < len(urls) - 1:  # Don't delay after last URL
            await asyncio.sleep(1)
    
    print(f"\nSummary:")
    print(f"  - Successfully crawled: {success_count}")
    print(f"  - Failed: {fail_count}")
    
    # Final memory log
    log_memory(prefix="Final: ")
    print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")
    all_results = [r[1] for r in all_results]
    return all_results

if __name__ == "__main__":
    #asyncio.run(main())
    urls = []
    results = asyncio.run(crawl_urls(["https://jscottdigital.com/investment-real-estate-website-blog-ideas-that-attract/",'https://www.tipranks.com/news/company-announcements/uk-house-prices-drop-market-implications-unveiled']))
    print(f"results: {results}")
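
For reference, the `generate_css_selector` helper in the snippet above chains one `:not()` per tag onto the universal selector. The standalone copy below only makes the resulting string visible for a shortened, illustrative tag list. As a matter of plain CSS semantics, such a selector matches every element whose tag is not listed, so an ancestor and its descendants can both match. Whether that contributes to the repeated text reported here is an open question that this issue does not establish.

```python
def generate_css_selector(tags):
    # Same logic as the helper in the report: chain one :not() per tag onto "*"
    return '*{}'.format(''.join(f':not({tag})' for tag in tags))


# Shortened tag list, for illustration only
css = generate_css_selector(["nav", "aside"])
print(css)
```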

OS

macOS

Python version

Python 3.12

Browser

default chrome?

Browser version

command line only

Error logs & Screenshots (if applicable)

No response

Metadata

Assignees

No one assigned

Labels

⚙ Done: Bug fix, enhancement, FR that's completed pending release
🐞 Bug: Something isn't working
📌 Root caused: identified the root cause of bug
