Release v0.8.5 #1836

Open
ntohidi wants to merge 229 commits into main from release/v0.8.5

Conversation


@ntohidi ntohidi commented Mar 16, 2026

Summary

  • Version bump to 0.8.5 across Dockerfile, README, Docker README, blog index, __version__.py
  • Release notes and blog post added (docs/blog/release-v0.8.5.md, docs/md_v2/blog/releases/v0.8.5.md)
  • Demo verification script with 13 real-crawl tests (docs/releases_review/demo_v0.8.5.py)

Key highlights in v0.8.5

  • Anti-bot detection with 3-tier proxy escalation
  • Shadow DOM flattening
  • Deep crawl cancellation
  • Config defaults API (set_defaults / get_defaults / reset_defaults)
  • Source/sibling selector in JSON extraction
  • Consent popup removal (40+ CMP platforms)
  • avoid_ads / avoid_css resource filtering
  • Browser recycling & memory-saving mode
  • GFM table compliance
  • Critical security fixes (RCE via deserialization, Redis CVE-2025-49844)
  • 60+ bug fixes
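
The config defaults API named in the highlights can be pictured with a minimal, self-contained sketch. The function names (`set_defaults` / `get_defaults` / `reset_defaults`) come from the release notes, but the signatures and module-level storage here are illustrative assumptions, not Crawl4AI's actual implementation:

```python
# Minimal sketch of a set_defaults / get_defaults / reset_defaults pattern.
# Names follow the release highlights; signatures and storage are assumptions.
_DEFAULTS: dict = {}

def set_defaults(**kwargs) -> None:
    """Merge keyword arguments into the shared defaults."""
    _DEFAULTS.update(kwargs)

def get_defaults() -> dict:
    """Return a copy so callers cannot mutate the shared state."""
    return dict(_DEFAULTS)

def reset_defaults() -> None:
    """Restore the registry to its pristine (empty) state."""
    _DEFAULTS.clear()
```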

Issues fixed

#462, #880, #943, #1031, #1077, #1183, #1213, #1251, #1281, #1290, #1296, #1308, #1354, #1364, #1370, #1374, #1424, #1435, #1463, #1484, #1487, #1489, #1494, #1503, #1509, #1512, #1520, #1553, #1594, #1601, #1606, #1611, #1622, #1635, #1640, #1658, #1666, #1667, #1668, #1671, #1682, #1686, #1711, #1715, #1716, #1721, #1730, #1731, #1746, #1750, #1751, #1754, #1758, #1762, #1768, #1770, #1776, #1782, #1783, #1784, #1786, #1788, #1789, #1790, #1792, #1793, #1794, #1795, #1796, #1797, #1801, #1803, #1804, #1805, #1815, #1817, #1818, #1824

Test plan

  • Run python docs/releases_review/demo_v0.8.5.py — 13 end-to-end tests
  • Verify Docker build: docker buildx build -t crawl4ai-local:latest --load .
  • Spot-check release notes formatting

RoyLeviLangware and others added 30 commits May 6, 2025 11:44
Fix the deprecation warning reported by bs4, using the replacement it suggests:
```
  .../bs4/element.py:2253: DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead.
    return self.find_all(
```
- Fix dict-to-ProxyConfig conversion in BrowserConfig and CrawlerRunConfig
- Fix JSON serialization of ProxyConfig objects in to_dict methods
- Fix context proxy to use ProxySettings instead of plain dict
- Resolves proxy authentication issues
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
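
The fix above can be sketched with a stdlib-only toy version: skip whitespace-only elements that have a `<pre>` or `<code>` ancestor. This ElementTree-based function (which ignores tail-text subtleties) is an illustration of the idea, not the actual `remove_empty_elements_fast()` code:

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"pre", "code"}

def remove_empty_spans(root: ET.Element) -> ET.Element:
    """Drop whitespace-only, childless <span> elements -- except inside
    <pre>/<code>, where whitespace is significant ("import torch" must
    not collapse into "importtorch")."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    def inside_code(el):
        while el is not None:
            if el.tag in CODE_TAGS:
                return True
            el = parents.get(el)
        return False

    for span in list(root.iter("span")):
        if not (span.text or "").strip() and len(span) == 0 and not inside_code(span):
            parents[span].remove(span)
    return root
```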
Fix: Wrong URL variable used for extraction of raw html
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
Fix BrowserConfig proxy_config serialization
Make LLM backoff configurable end-to-end
…nce-filter

[Fix]: Docker server does not decode ContentRelevanceFilter
unclecode and others added 30 commits March 7, 2026 06:15
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
…1734, #1290, #1668)

Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
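
The effect of the ensure-ascii toggle mirrors the standard library's `json.dumps` behavior; a quick illustration of the two modes (plain stdlib, not the actual CLI wiring):

```python
import json

payload = {"title": "日本語のページ"}

escaped = json.dumps(payload)                        # non-ASCII escaped to \uXXXX
preserved = json.dumps(payload, ensure_ascii=False)  # characters kept verbatim
```

Both forms round-trip to the same object; the flag only changes how the output is written.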
…1354, #880, #1031, #1251, #1758)

- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
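
The two URL-normalization fixes (#1520, #1489) both come down to not touching parts of the URL that RFC 3986 treats as significant. A hypothetical `normalize_url` sketch showing the intended behavior, not Crawl4AI's actual function:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase only the scheme and host; leave the path (including any
    trailing slash) and the query string (including key casing) untouched.
    Illustrative sketch of the RFC 3986 behavior fixed in #1520/#1489."""
    p = urlsplit(url)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path, p.query, p.fragment))
```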
Full regression suite covering all major Crawl4AI subsystems:
- core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- utilities (extract_xml_data, cache modes, content hashing)
- edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery)

Also adds /c4ai-check slash command for testing changes against the suite.
Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758
Regression tests: 291 tests covering all major subsystems
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
  add lsof fallback for Docker environments without lsof installed
- Close #1472 (redirect chain already fixed), #1480 (links already
  normalized), #1679 (duplicate of #1509)
… match, PDF deserialization

- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
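
The deep-crawl filter fix above can be sketched in a few lines: route path-only patterns through `urlparse().path` and full-URL prefixes against the whole URL. This is a hypothetical matcher illustrating the distinction, not the library's actual filter code:

```python
from urllib.parse import urlparse

def prefix_match(url: str, pattern: str) -> bool:
    """Match '/docs/*'-style patterns against the URL *path* (matching them
    against the full URL always failed), while full-URL prefix patterns
    still match against the whole URL."""
    prefix = pattern.rstrip("*")
    target = urlparse(url).path if pattern.startswith("/") else url
    return target.startswith(prefix)
```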
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Recognized for identifying and confirming the PDFContentScrapingStrategy
deserialization fix (#1815).
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
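
The dedup step described above is a standard first-occurrence filter; a minimal sketch of the idea (hypothetical helper, not the actual `BM25ContentFilter` code):

```python
def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-text duplicates, keeping the first occurrence in
    document order -- the post-threshold dedup step described above."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```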
…full-page-1750

fix: screenshot respects scan_full_page=False (#1750)
…password (#1611, #1817)

- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
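
Building the Redis URL from discrete config fields can be sketched as below. The field and environment-variable names follow the commit description, but the exact defaults and scheme handling here are assumptions, not the server's actual code:

```python
import os

def build_redis_url(cfg: dict) -> str:
    """Assemble a redis:// URL from host/port/password/db/ssl config fields,
    with REDIS_HOST / REDIS_PORT / REDIS_PASSWORD environment overrides.
    Sketch only; defaults and scheme handling are assumptions."""
    host = os.environ.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = os.environ.get("REDIS_PORT", str(cfg.get("port", 6379)))
    password = os.environ.get("REDIS_PASSWORD", cfg.get("password", ""))
    scheme = "rediss" if cfg.get("ssl") else "redis"
    auth = f":{password}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{cfg.get('db', 0)}"
```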
#1370, #1818, #1509, #1762)

- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-1484

fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
…is-config-1611-1817

fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
…#1754)

The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.

Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
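
The lock-free snapshot approach can be sketched as follows; this is an illustration of the technique under the stated GIL assumption, not the actual `crawler_pool.py` code:

```python
def get_pool_snapshot(pool: dict) -> dict:
    """Lock-free read of per-browser stats: take a shallow copy of the
    top-level dict (atomic under CPython's GIL), then copy each stats
    dict so the monitor reads consistent values without blocking on the
    pool's global lock. Counts may be slightly stale, never corrupt."""
    top = dict(pool)
    return {name: stats.copy() for name, stats in top.items()}
```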
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).

Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches