Fix the deprecation warning raised by bs4, as suggested by the warning itself:
```
.../bs4/element.py:2253: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.
return self.find_all(
```
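The fix is a one-keyword rename: pass `string=` where `text=` was used. A minimal sketch, assuming a bs4 version (4.13+) that deprecates the old keyword:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p><p>world</p>", "html.parser")

# Old (emits DeprecationWarning):  soup.find_all(text="hello")
# New: 'string' matches the same navigable strings without the warning.
matches = soup.find_all(string="hello")
print([str(m) for m in matches])
```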
- Fix dict-to-ProxyConfig conversion in BrowserConfig and CrawlerRunConfig
- Fix JSON serialization of ProxyConfig objects in to_dict methods
- Fix context proxy to use ProxySettings instead of a plain dict
- Resolves proxy authentication issues
- Prevents full HTML content from being passed as the URL to extraction strategies
- Added unit tests to verify raw-HTML and regular-URL processing

Fix: Wrong URL variable used for extraction of raw HTML
The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant.
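The skip logic can be sketched with a toy helper (illustrative only, not Crawl4AI's `remove_empty_elements_fast()`; this sketch also ignores tail text for brevity):

```python
import xml.etree.ElementTree as ET

# Tags where whitespace is significant and must never be collapsed.
WS_SIGNIFICANT = {"pre", "code", "textarea"}

def strip_empty_spans(elem, in_code=False):
    """Remove whitespace-only <span> elements, except inside pre/code
    blocks where dropping them would fuse tokens ('import torch' ->
    'importtorch')."""
    in_code = in_code or elem.tag in WS_SIGNIFICANT
    for child in list(elem):
        strip_empty_spans(child, in_code)
        if (not in_code and child.tag == "span" and len(child) == 0
                and not (child.text or "").strip()):
            elem.remove(child)

root = ET.fromstring(
    "<div><span> </span><pre>import<span> </span>torch</pre></div>"
)
strip_empty_spans(root)
out = ET.tostring(root, encoding="unicode")
print(out)  # the bare span is gone; the span inside <pre> survives
```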
… and clean up mock data. ref #1621
Fix: Wrong URL variable used for extraction of raw html
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
- Extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and the Docker API handlers
- Expose the backoff knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides
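The schedule these delay/attempt/factor knobs control is plain exponential backoff; a generic sketch of the math (illustrative function, not the library's API):

```python
def backoff_delays(base_delay=1.0, attempts=5, factor=2.0):
    """Exponential-backoff schedule: each retry waits factor times
    longer than the previous one, starting from base_delay."""
    return [base_delay * factor ** i for i in range(attempts)]

print(backoff_delays(0.5, 4, 2.0))  # → [0.5, 1.0, 2.0, 4.0]
```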
…d test for delayed redirects. ref #1268
Fix BrowserConfig proxy_config serialization
Make LLM backoff configurable end-to-end
…nce-filter [Fix]: Docker server does not decode ContentRelevanceFilter
Update docs/examples to use the current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache-mode descriptions in arun.md

From PR #1770 by @maksimzayats
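The migrations above can be collected into one before/after sketch (a config fragment, not runnable as-is; assumes current crawl4ai imports):

```python
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig,
)
from crawl4ai.extraction_strategy import LLMExtractionStrategy

browser_cfg = BrowserConfig(
    proxy_config={"server": "http://proxy:8080"},   # was: proxy=...
)
run_cfg = CrawlerRunConfig(                          # was: CrawlerConfig
    cache_mode=CacheMode.BYPASS,                     # was: bypass_cache=True
    extraction_strategy=LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini"),  # was: bare provider kwargs
    ),
)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.markdown.fit_markdown)          # was: result.fit_markdown
```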
…1734, #1290, #1668)

Bug fixes:
- Verify redirect targets are alive before returning from the URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into the dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for the /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions: checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipelines in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)
…1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query-parameter key casing in normalize_url
- #1374: Close the NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run inside a running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make the Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot-stitching format from BMP to PNG
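The two normalization fixes (#1520, #1489) boil down to not over-canonicalizing: a stdlib sketch of the intended behavior (illustrative function, not Crawl4AI's normalize_url):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase only the scheme and host; keep the path's trailing
    slash (RFC 3986: /docs and /docs/ are distinct resources) and
    leave query-parameter key casing untouched."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # do NOT strip a trailing slash
        parts.query,         # preserve query key casing
        "",                  # drop the fragment for dedup purposes
    ))

print(normalize_url("HTTPS://Example.COM/docs/?Page=2"))
# → https://example.com/docs/?Page=2
```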
Full regression suite covering all major Crawl4AI subsystems:
- Core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- Content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- Extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- Deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- Browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- Config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- Utilities (extract_xml_data, cache modes, content hashing)
- Edge cases (empty pages, malformed HTML, Unicode, concurrent crawls, error recovery)

Also adds a /c4ai-check slash command for testing changes against the suite.
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async-for support
- #1666: Kill the process group on cleanup to prevent zombie child processes; add a fallback for Docker environments without lsof installed
- Close #1472 (redirect chain already fixed), #1480 (links already normalized), #1679 (duplicate of #1509)
… match, PDF deserialization
- antibot_detector: add <pre> to the content-elements regex; detect browser-wrapped JSON in _looks_like_data() so httpbin-style responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix patterns (/docs/*) instead of matching against the full URL, which always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to ALLOWED_DESERIALIZE_TYPES so the /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add an 86-test suite covering all three fixes with adversarial and edge cases
take_screenshot() ignored the scan_full_page config flag: tall pages always got a full-page screenshot even when scan_full_page=False. Now passes scan_full_page through to take_screenshot() and uses viewport-only capture when it is False. Includes 16 tests (8 unit + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Recognized for identifying and confirming the PDFContentScrapingStrategy deserialization fix (#1815).
BM25ContentFilter.filter_content() returned duplicate text chunks when the same content appeared in multiple DOM elements. Added exact-text deduplication after threshold filtering, keeping the first occurrence in document order. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
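The dedup step can be sketched in a few lines (illustrative helper, not BM25ContentFilter's internals): hash each chunk's exact text and keep only the first occurrence, so document order is preserved.

```python
def dedupe_chunks(chunks):
    """Exact-text deduplication after threshold filtering: the first
    occurrence in document order wins, later duplicates are dropped."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = chunk.strip()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

print(dedupe_chunks(["intro text", "nav links", "intro text"]))
# → ['intro text', 'nav links']
```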
…full-page-1750 fix: screenshot respects scan_full_page=False (#1750)
…password (#1611, #1817)
- #1611: The /llm GET endpoint hardcoded the server's LLM_PROVIDER. Added optional provider, temperature, and base_url query params with fallback to the server config, consistent with the /md and /llm/job endpoints.
- #1817: The Redis connection used the non-existent config["redis"]["uri"]. Now builds the URL from the host/port/password/db/ssl config fields, with REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD environment-variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
#1370, #1818, #1509, #1762)
- #1370: Freeze element dimensions via CSS before the viewport resize in take_screenshot_scroller() to prevent responsive reflow on Elementor sites; restore the original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation to abort pending loads; move event-listener cleanup outside the session_id guard so listeners don't accumulate across reuses.
- #1509: Bypass the dispatcher in arun_many() when deep_crawl_strategy is set: call arun() directly per URL so the DeepCrawlDecorator can invoke the strategy (the dispatcher crashes on a List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace @app.get() with starlette.routing.Route() for the SSE handler. The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send) internally, which conflicts with Starlette's middleware wrapping. Also update CONTRIBUTORS.md for PR #1829.
Add the official Redis apt repository and pin redis-server to 7.2.7, which patches the Lua use-after-free vulnerability. The REDIS_VERSION build arg allows an override.
css_selector was skipped in _scrap(): only target_elements was applied. Now css_selector filters the DOM first, then target_elements narrows within that selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-1484 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
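The intended order of operations reads as a two-stage filter; a sketch with BeautifulSoup (hypothetical helper and names, not Crawl4AI's _scrap):

```python
from bs4 import BeautifulSoup

def select_scope(html, css_selector=None, target_elements=None):
    """Stage 1: css_selector prunes the DOM down to a scope.
    Stage 2: target_elements narrows *within* that scope only."""
    soup = BeautifulSoup(html, "html.parser")
    scope = soup.select(css_selector) if css_selector else [soup]
    if not target_elements:
        return scope
    return [el for root in scope
               for sel in target_elements
               for el in root.select(sel)]

html = "<main><p>keep</p></main><aside><p>skip</p></aside>"
hits = select_scope(html, css_selector="main", target_elements=["p"])
print([el.get_text() for el in hits])  # → ['keep']
```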
…#1754) The monitor's update_timeline(), get_health_summary(), and get_browser_list() all acquired the crawler pool's global LOCK to read pool stats. That same lock is held during slow browser start/close operations (get_crawler, janitor, close_all), so the monitor could block indefinitely and the pod would become unresponsive after sustained crawling. Replaced all three lock acquisitions in monitor.py with a lock-free get_pool_snapshot() in crawler_pool.py that returns shallow dict copies. Under CPython's GIL, dict.copy() and len() are atomic, so this is safe for read-only monitoring with at most slightly stale counts.
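The lock-free read path can be sketched as follows (module state names are illustrative, not crawler_pool.py's actual layout):

```python
import threading

# Hypothetical pool state mirroring the shape described above.
POOL = {}                  # browser signature -> crawler instance
LAST_USED = {}             # browser signature -> last-use timestamp
LOCK = threading.Lock()    # held across slow browser start/close

def get_pool_snapshot():
    """Lock-free read for monitoring: shallow dict copies only.
    Under CPython's GIL, dict.copy() and len() are atomic, so readers
    never block on LOCK and at worst see slightly stale counts."""
    return {
        "crawlers": POOL.copy(),
        "last_used": LAST_USED.copy(),
        "count": len(POOL),
    }
```

The trade-off is explicit: the monitor tolerates a snapshot that is a few operations stale in exchange for never contending with browser start/close.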
Bump version to 0.8.5 across all references (Dockerfile, README, Docker README, blog index, __version__.py). Add release notes, a blog post, a demo verification script (13 real-crawl tests), and a releases-directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep-crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
Summary
- __version__.py (version bump)
- docs/blog/release-v0.8.5.md, docs/md_v2/blog/releases/v0.8.5.md (release notes)
- docs/releases_review/demo_v0.8.5.py (demo verification script)

Key highlights in v0.8.5
- Config defaults API (set_defaults/get_defaults/reset_defaults)
- avoid_ads/avoid_css resource filtering

Issues fixed
#462, #880, #943, #1031, #1077, #1183, #1213, #1251, #1281, #1290, #1296, #1308, #1354, #1364, #1370, #1374, #1424, #1435, #1463, #1484, #1487, #1489, #1494, #1503, #1509, #1512, #1520, #1553, #1594, #1601, #1606, #1611, #1622, #1635, #1640, #1658, #1666, #1667, #1668, #1671, #1682, #1686, #1711, #1715, #1716, #1721, #1730, #1731, #1746, #1750, #1751, #1754, #1758, #1762, #1768, #1770, #1776, #1782, #1783, #1784, #1786, #1788, #1789, #1790, #1792, #1793, #1794, #1795, #1796, #1797, #1801, #1803, #1804, #1805, #1815, #1817, #1818, #1824
Test plan
- python docs/releases_review/demo_v0.8.5.py (13 end-to-end tests)
- docker buildx build -t crawl4ai-local:latest --load .