Release v0.8.5 #1836

Open
ntohidi wants to merge 229 commits into main from release/v0.8.5

Conversation


@ntohidi ntohidi commented Mar 16, 2026

Summary

  • Version bump to 0.8.5 across Dockerfile, README, Docker README, blog index, __version__.py
  • Release notes and blog post added (docs/blog/release-v0.8.5.md, docs/md_v2/blog/releases/v0.8.5.md)
  • Demo verification script with 13 real-crawl tests (docs/releases_review/demo_v0.8.5.py)

Key highlights in v0.8.5

  • Anti-bot detection with 3-tier proxy escalation
  • Shadow DOM flattening
  • Deep crawl cancellation
  • Config defaults API (set_defaults / get_defaults / reset_defaults)
  • Source/sibling selector in JSON extraction
  • Consent popup removal (40+ CMP platforms)
  • avoid_ads / avoid_css resource filtering
  • Browser recycling & memory-saving mode
  • GFM table compliance
  • Critical security fixes (RCE via deserialization, Redis CVE-2025-49844)
  • 60+ bug fixes
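
The config defaults API named in the highlights can be pictured with a minimal, self-contained sketch. The function names (`set_defaults` / `get_defaults` / `reset_defaults`) come from the release notes, but the signatures and module-level storage here are illustrative assumptions, not Crawl4AI's actual implementation:

```python
# Minimal sketch of a set_defaults / get_defaults / reset_defaults pattern.
# Names follow the release highlights; signatures and storage are assumptions.
_DEFAULTS: dict = {}

def set_defaults(**kwargs) -> None:
    """Merge keyword arguments into the shared defaults."""
    _DEFAULTS.update(kwargs)

def get_defaults() -> dict:
    """Return a copy so callers cannot mutate the shared state."""
    return dict(_DEFAULTS)

def reset_defaults() -> None:
    """Restore the registry to its pristine (empty) state."""
    _DEFAULTS.clear()
```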

Issues fixed

#462, #880, #943, #1031, #1077, #1183, #1213, #1251, #1281, #1290, #1296, #1308, #1354, #1364, #1370, #1374, #1424, #1435, #1463, #1484, #1487, #1489, #1494, #1503, #1509, #1512, #1520, #1553, #1594, #1601, #1606, #1611, #1622, #1635, #1640, #1658, #1666, #1667, #1668, #1671, #1682, #1686, #1711, #1715, #1716, #1721, #1730, #1731, #1746, #1750, #1751, #1754, #1758, #1762, #1768, #1770, #1776, #1782, #1783, #1784, #1786, #1788, #1789, #1790, #1792, #1793, #1794, #1795, #1796, #1797, #1801, #1803, #1804, #1805, #1815, #1817, #1818, #1824

Test plan

  • Run python docs/releases_review/demo_v0.8.5.py — 13 end-to-end tests
  • Verify Docker build: docker buildx build -t crawl4ai-local:latest --load .
  • Spot-check release notes formatting

RoyLeviLangware and others added 30 commits May 6, 2025 11:44
Fix the deprecation warning reported by bs4, using the replacement it suggests:
```
  .../bs4/element.py:2253: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.
    return self.find_all(
```
- Fix dict-to-ProxyConfig conversion in BrowserConfig and CrawlerRunConfig
- Fix JSON serialization of ProxyConfig objects in to_dict methods
- Fix context proxy to use ProxySettings instead of plain dict
- Resolves proxy authentication issues
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
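
The fix above can be sketched with a stdlib-only toy version: skip whitespace-only elements that have a `<pre>` or `<code>` ancestor. This ElementTree-based function (which ignores tail-text subtleties) is an illustration of the idea, not the actual `remove_empty_elements_fast()` code:

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"pre", "code"}

def remove_empty_spans(root: ET.Element) -> ET.Element:
    """Drop whitespace-only, childless <span> elements -- except inside
    <pre>/<code>, where whitespace is significant ("import torch" must
    not collapse into "importtorch")."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    def inside_code(el):
        while el is not None:
            if el.tag in CODE_TAGS:
                return True
            el = parents.get(el)
        return False

    for span in list(root.iter("span")):
        if not (span.text or "").strip() and len(span) == 0 and not inside_code(span):
            parents[span].remove(span)
    return root
```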
Fix: Wrong URL variable used for extraction of raw html
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
Fix BrowserConfig proxy_config serialization
Make LLM backoff configurable end-to-end
…nce-filter

[Fix]: Docker server does not decode ContentRelevanceFilter
unclecode and others added 30 commits March 7, 2026 06:15
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
…1734, #1290, #1668)

Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
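
The effect of the ensure-ascii toggle mirrors the standard library's `json.dumps` behavior; a quick illustration of the two modes (plain stdlib, not the actual CLI wiring):

```python
import json

payload = {"title": "日本語のページ"}

escaped = json.dumps(payload)                        # non-ASCII escaped to \uXXXX
preserved = json.dumps(payload, ensure_ascii=False)  # characters kept verbatim
```

Both forms round-trip to the same object; the flag only changes how the output is written.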
…1354, #880, #1031, #1251, #1758)

- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
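
The two URL-normalization fixes (#1520, #1489) both come down to not touching parts of the URL that RFC 3986 treats as significant. A hypothetical `normalize_url` sketch showing the intended behavior, not Crawl4AI's actual function:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase only the scheme and host; leave the path (including any
    trailing slash) and the query string (including key casing) untouched.
    Illustrative sketch of the RFC 3986 behavior fixed in #1520/#1489."""
    p = urlsplit(url)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path, p.query, p.fragment))
```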
Full regression suite covering all major Crawl4AI subsystems:
- core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- utilities (extract_xml_data, cache modes, content hashing)
- edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery)

Also adds /c4ai-check slash command for testing changes against the suite.
Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758
Regression tests: 291 tests covering all major subsystems
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
  add lsof fallback for Docker environments without lsof installed
- Close #1472 (redirect chain already fixed), #1480 (links already
  normalized), #1679 (duplicate of #1509)
… match, PDF deserialization

- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
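
The deep-crawl filter fix above can be sketched in a few lines: route path-only patterns through `urlparse().path` and full-URL prefixes against the whole URL. This is a hypothetical matcher illustrating the distinction, not the library's actual filter code:

```python
from urllib.parse import urlparse

def prefix_match(url: str, pattern: str) -> bool:
    """Match '/docs/*'-style patterns against the URL *path* (matching them
    against the full URL always failed), while full-URL prefix patterns
    still match against the whole URL."""
    prefix = pattern.rstrip("*")
    target = urlparse(url).path if pattern.startswith("/") else url
    return target.startswith(prefix)
```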
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Recognized for identifying and confirming the PDFContentScrapingStrategy
deserialization fix (#1815).
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
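
The dedup step described above is a standard first-occurrence filter; a minimal sketch of the idea (hypothetical helper, not the actual `BM25ContentFilter` code):

```python
def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-text duplicates, keeping the first occurrence in
    document order -- the post-threshold dedup step described above."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```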
…full-page-1750

fix: screenshot respects scan_full_page=False (#1750)
…password (#1611, #1817)

- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
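
Building the Redis URL from discrete config fields can be sketched as below. The field and environment-variable names follow the commit description, but the exact defaults and scheme handling here are assumptions, not the server's actual code:

```python
import os

def build_redis_url(cfg: dict) -> str:
    """Assemble a redis:// URL from host/port/password/db/ssl config fields,
    with REDIS_HOST / REDIS_PORT / REDIS_PASSWORD environment overrides.
    Sketch only; defaults and scheme handling are assumptions."""
    host = os.environ.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = os.environ.get("REDIS_PORT", str(cfg.get("port", 6379)))
    password = os.environ.get("REDIS_PASSWORD", cfg.get("password", ""))
    scheme = "rediss" if cfg.get("ssl") else "redis"
    auth = f":{password}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{cfg.get('db', 0)}"
```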
#1370, #1818, #1509, #1762)

- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-1484

fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
…is-config-1611-1817

fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
…#1754)

The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.

Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
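
The lock-free snapshot approach can be sketched as follows; this is an illustration of the technique under the stated GIL assumption, not the actual `crawler_pool.py` code:

```python
def get_pool_snapshot(pool: dict) -> dict:
    """Lock-free read of per-browser stats: take a shallow copy of the
    top-level dict (atomic under CPython's GIL), then copy each stats
    dict so the monitor reads consistent values without blocking on the
    pool's global lock. Counts may be slightly stale, never corrupt."""
    top = dict(pool)
    return {name: stats.copy() for name, stats in top.items()}
```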
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).

Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches