Open
229 commits
9a0585c
fix bs4 warning on text kwarg - switch to string
RoyLeviLangware May 6, 2025
1d6efb6
Fix proxy authentication ERR_INVALID_AUTH_CREDENTIALS
garyluky Jul 8, 2025
83b323f
fix VersionManager not using CRAWL4_AI_BASE_DIRECTORY
vladmandic Jul 12, 2025
6d3444b
In obtaining cleaned_html, the tag "script" needs to be processed sep…
nnxiong Aug 5, 2025
660d701
In obtaining cleaned_html, the tag "script" needs to be processed sep…
nnxiong Aug 5, 2025
367190f
Merge branch 'unclecode:main' into patch-versionmanager
vladmandic Aug 31, 2025
b54c200
feat: make device_scale_factor configurable via BrowserConfig
TristanDonze Sep 1, 2025
edd0b57
Fix: Use correct URL variable for raw HTML extraction (#1116)
rbushri Aug 28, 2025
c2c4d42
Fix #1181: Preserve whitespace in code blocks during HTML scraping
ntohidi Nov 17, 2025
eca04b0
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
Ahmed-Tawfik94 Nov 18, 2025
7771ed3
Merge branch 'develop' into fix/wrong_url_raw
rbushri Nov 24, 2025
84bfea8
Fix EmbeddingStrategy: Uncomment response handling for the variations…
ntohidi Nov 25, 2025
94c8a83
Merge pull request #1447 from rbushri/fix/wrong_url_raw
ntohidi Nov 25, 2025
b36c6da
Fix: permission issues with .cache/url_seeder and other runtime cache…
ntohidi Nov 25, 2025
a0c5f0f
fix: ensure BrowserConfig.to_dict serializes proxy_config
SohamKukreti Nov 26, 2025
dcb77c9
Merge pull request #1623 from unclecode/fix/deprecated_pydantic
ntohidi Nov 27, 2025
7a133e2
feat: make LLM backoff configurable end-to-end
SohamKukreti Nov 28, 2025
33a3cc3
reproduced AttributeError from #1642
murphycw Dec 1, 2025
6ec6bc4
pass timeout parameter to docker client request
murphycw Dec 1, 2025
eb76df2
added missing deep crawling objects to init
murphycw Dec 1, 2025
e95e8e1
generalized query in ContentRelevanceFilter to be a str or list
murphycw Dec 1, 2025
3a8f829
import modules from enhanceable deserialization
murphycw Dec 1, 2025
6893094
parameterized tests
murphycw Dec 1, 2025
07ccf13
Fix: capture current page URL to reflect JavaScript navigation and ad…
ntohidi Dec 2, 2025
afc31e1
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Dec 2, 2025
d06c39e
Merge pull request #1641 from unclecode/fix/serialize-proxy-config
ntohidi Dec 2, 2025
f32cfc6
Merge pull request #1645 from unclecode/fix/configurable-backoff
ntohidi Dec 2, 2025
df4d87e
refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
ntohidi Dec 3, 2025
5a8fb57
Merge pull request #1648 from christopher-w-murphy/fix/content-releva…
ntohidi Dec 3, 2025
220a224
When using --deep-crawl, output all pages, not just the first one.
christian-oudard Dec 10, 2025
306ddcb
Merge branch 'main' into develop
ntohidi Dec 11, 2025
8ae908b
Add browser_context_id and target_id parameters to BrowserConfig
unclecode Dec 13, 2025
66941a5
Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server…
unclecode Dec 13, 2025
d22825e
Fix: add cdp_cleanup_on_close to from_kwargs
unclecode Dec 13, 2025
b2e4a1f
Fix: find context by target_id for concurrent CDP connections
unclecode Dec 13, 2025
c1e485e
Fix: use target_id to find correct page in get_page
unclecode Dec 13, 2025
8014805
Fix: use CDP to find context by browserContextId for concurrent sessions
unclecode Dec 13, 2025
6185d3c
Revert context matching attempts - Playwright cannot see CDP-created …
unclecode Dec 13, 2025
55eb968
Add create_isolated_context flag for concurrent CDP crawls
unclecode Dec 13, 2025
ecedb61
Add context caching to create_isolated_context branch
unclecode Dec 13, 2025
d10ca38
Add init_scripts support to BrowserConfig for pre-page-load JS injection
unclecode Dec 14, 2025
02acad1
Fix CDP connection handling: support WS URLs and proper cleanup
unclecode Dec 18, 2025
f6b29a8
Update gitignore
unclecode Dec 21, 2025
48426f7
Some debugging for caching
unclecode Dec 21, 2025
444cb14
Add _generate_screenshot_from_html for raw: and file:// URLs
unclecode Dec 22, 2025
67e03d6
Add PDF and MHTML support for raw: and file:// URLs
unclecode Dec 22, 2025
31ebf37
Add crash recovery for deep crawl strategies
unclecode Dec 22, 2025
624e341
Fix: HTTP strategy raw: URL parsing truncates at # character
unclecode Dec 24, 2025
3937efc
Add base_url parameter to CrawlerRunConfig for raw HTML processing
unclecode Dec 24, 2025
fde4e9f
Add prefetch mode for two-phase deep crawling
unclecode Dec 25, 2025
9e7f5aa
Updates on proxy rotation and proxy configuration
unclecode Dec 26, 2025
a43256b
Add proxy support to HTTP crawler strategy
unclecode Dec 26, 2025
2550f3d
Add browser pipeline support for raw:/file:// URLs
unclecode Dec 27, 2025
3d78001
Add smart TTL cache for sitemap URL seeder
unclecode Dec 30, 2025
db61ab8
Update URL seeder docs with smart TTL cache parameters
unclecode Dec 30, 2025
0d3f9e6
Add MEMORY.md to gitignore
unclecode Dec 30, 2025
6b2dca7
Docs: Add multi-sample schema generation section
unclecode Jan 4, 2026
cee79a8
feat: add force viewport screenshot
theredrad Jan 6, 2026
f24396c
Fix critical RCE and LFI vulnerabilities in Docker API deployment
unclecode Jan 12, 2026
acfab80
Enhance authentication flow by implementing JWT token retrieval and a…
ntohidi Jan 12, 2026
122b4fe
Add release notes for v0.7.9, detailing breaking changes, security fi…
ntohidi Jan 12, 2026
530cde3
Add release notes for v0.8.0, detailing breaking changes, security fi…
unclecode Jan 12, 2026
315eae9
Add examples for deep crawl crash recovery and prefetch mode in docum…
ntohidi Jan 14, 2026
a00da65
Add async agenerate_schema method for schema generation
unclecode Jan 16, 2026
6090629
Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility
unclecode Jan 16, 2026
624dfe7
fix: Replace tf-playwright-stealth with playwright-stealth dependency
YuriNachos Jan 17, 2026
2a04fc3
fix: Allow local embeddings by removing OpenAI fallback in EmbeddingS…
YuriNachos Jan 17, 2026
37ff85f
fix: Add docstring to MCP tool 'md' endpoint
YuriNachos Jan 17, 2026
ef8f0c6
fix: Include GoogleSearchCrawler script.js in package distribution
YuriNachos Jan 17, 2026
232f007
fix: Initialize default logger in AsyncPlaywrightCrawlerStrategy
YuriNachos Jan 17, 2026
2016d66
fix: Respect <base> tag for relative link resolution in html2text
YuriNachos Jan 17, 2026
857b1ed
Merge branch 'main' into develop
ntohidi Jan 19, 2026
418bfcf
Fix redirected_url containing raw HTML content for raw: URLs
unclecode Jan 20, 2026
fe1c1cb
Fix #1686: Use dynamic version from crawl4ai package in health endpoint
jose-blockchain Jan 20, 2026
9123f65
Fix #1686: Use dynamic version from crawl4ai package in health endpoint
jose-blockchain Jan 20, 2026
c9a271a
Merge branch 'fix/1686-docker-health-version' of https://github.com/j…
jose-blockchain Jan 20, 2026
f6897d1
Add cancellation support for deep crawl strategies
unclecode Jan 22, 2026
1e2b7fe
Add documentation and example for deep crawl cancellation
unclecode Jan 22, 2026
fbfbc69
Fix deep crawl cancellation example to use DFS for precise control
unclecode Jan 22, 2026
777d087
Update security contact emails in SECURITY.md
ntohidi Jan 22, 2026
b0b3ca1
Refactor extraction strategy internals and improve error handling
unclecode Jan 24, 2026
2d5e530
Add support for parallel URL processing in extraction utilities
unclecode Jan 24, 2026
79ebfce
Refactor HTML block delimiter to use config constant
unclecode Jan 24, 2026
94e19a4
Enhance browser profile management capabilities
unclecode Jan 24, 2026
ef226f5
Add: Cloud CLI module for profile management
unclecode Jan 25, 2026
18d2ef4
Fix: Disable cookie encryption for portable profiles
unclecode Jan 26, 2026
21e6c41
Fix: Keep storage_state.json in profile shrink
unclecode Jan 26, 2026
656b938
Merge branch 'main' into develop
unclecode Jan 27, 2026
9b52c14
Fix page reuse race condition when create_isolated_context=False
unclecode Jan 28, 2026
0a17fe8
Improve page tracking with global CDP endpoint-based tracking
unclecode Jan 28, 2026
911bbce
Fix agenerate_schema() JSON parsing for Anthropic models
unclecode Jan 29, 2026
034bddf
Merge pull request #1733 from jose-blockchain/fix/1686-docker-health-…
ntohidi Jan 29, 2026
ad5ebf1
Merge pull request #1718 from YuriNachos/fix/issue-1704-default-logger
ntohidi Jan 29, 2026
0104db6
Fix critical RCE via deserialization and eval() in /crawl endpoint
unclecode Jan 30, 2026
694ba44
Added fix for URL Seeder forcing Common Crawl index in case of a "sit…
ChiragBellara Jan 30, 2026
19b9140
Improve CDP connection handling
unclecode Jan 31, 2026
13a4148
Add set_defaults/get_defaults/reset_defaults to config classes
unclecode Jan 31, 2026
55a2cc8
Document set_defaults/get_defaults/reset_defaults in config guides
unclecode Jan 31, 2026
e19492a
Merge PR #1694: feat: add force viewport screenshot
unclecode Feb 1, 2026
5be0d2d
Add contributor and docs for force_viewport_screenshot feature
unclecode Feb 1, 2026
7c5933e
Merge PR #1746: Fix sitemap-only URL seeding avoiding Common Crawl calls
unclecode Feb 1, 2026
ee717dc
Add contributor for PR #1746 and fix test pytest marker
unclecode Feb 1, 2026
43738c9
Fix can_process_url() to receive normalized URL in deep crawl strategies
unclecode Feb 1, 2026
ccab926
Merge PR #1714: Replace tf-playwright-stealth with playwright-stealth
unclecode Feb 1, 2026
c39e796
Merge PR #1721: Fix <base> tag ignored in html2text relative link res…
unclecode Feb 1, 2026
9172581
Merge PR #1719: Include GoogleSearchCrawler script.js in package dist…
unclecode Feb 1, 2026
5cd0648
Merge PR #1717: Allow local embeddings by removing OpenAI fallback
unclecode Feb 1, 2026
dc4ae73
Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
unclecode Feb 1, 2026
37995d4
Merge PR #1667: Fix deep-crawl CLI outputting only the first page
unclecode Feb 1, 2026
0f83b05
Merge PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY …
unclecode Feb 1, 2026
a244e4d
Merge PR #1364: Fix script tag removal losing adjacent text in cleane…
unclecode Feb 1, 2026
312cef8
Fix PR #1296: restore .crawl4ai subfolder in VersionManager path
unclecode Feb 1, 2026
a56dd07
Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
unclecode Feb 1, 2026
98aea2f
Merge PR #1077: Fix bs4 deprecation warning (text -> string)
unclecode Feb 1, 2026
980dc73
Merge PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
unclecode Feb 1, 2026
bb523b6
Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
unclecode Feb 1, 2026
c790231
Fix browser context memory leak — signature shrink + LRU eviction (#943)
unclecode Feb 1, 2026
ffd3fac
Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
unclecode Feb 2, 2026
b962699
Add contributors from PRs #973, #1073, #931
unclecode Feb 2, 2026
0bfcf08
Add contributors from PRs #1133, #729
unclecode Feb 2, 2026
4e56f3e
Add contributing guide and update mkdocs navigation for community res…
ntohidi Feb 3, 2026
c046918
Add memory-saving mode, browser recycling, and CDP leak fixes
unclecode Feb 4, 2026
3401dd1
Fix browser recycling under high concurrency — version-based approach
unclecode Feb 5, 2026
719e83e
Update PR todolist — refresh open PRs, add 6 new, classify
unclecode Feb 6, 2026
0aacafe
Merge PR #1463: Add configurable device_scale_factor for screenshot q…
unclecode Feb 6, 2026
37a49c5
Merge PR #1435: Add redirected_status_code to CrawlResult
unclecode Feb 6, 2026
fbc5281
Add tests, docs, and contributors for PRs #1463 and #1435
unclecode Feb 6, 2026
44b8afb
Improve schema generation prompt for sibling-based layouts
unclecode Feb 10, 2026
3fc7730
Add remove_consent_popups flag and fix from_kwargs dict deserialization
unclecode Feb 11, 2026
1a24ac7
Refactor from_kwargs to respect set_defaults and use __init__ defaults
unclecode Feb 11, 2026
112f44a
Fix proxy auth for persistent browser contexts
unclecode Feb 12, 2026
fdd9897
Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
unclecode Feb 13, 2026
72b546c
Add anti-bot detection, retry, and fallback system
unclecode Feb 14, 2026
8752072
Unify proxy_config to accept list, add crawl_stats tracking
unclecode Feb 14, 2026
8795539
Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
unclecode Feb 14, 2026
d028a88
Make proxy_config a property so direct assignment also normalizes
unclecode Feb 14, 2026
45d8e14
Fix proxy escalation: don't re-raise on first proxy exception when ch…
unclecode Feb 15, 2026
ccd24aa
Fix fallback fetch: run when all proxies crash, skip re-check, never …
unclecode Feb 15, 2026
cfa7308
fix: resolve AttributeError in FilterChain.add_filter by handling tup…
nitesh-77 Feb 16, 2026
4298e26
fix: run blocking chardet.detect in thread executor #1751
nitesh-77 Feb 16, 2026
094242d
Fix total_score not calculated for links that fail head extraction
AtharvaJaiswal005 Feb 16, 2026
87f57f1
Fix return in finally block silently suppressing exceptions
Otman404 Feb 17, 2026
d267c65
Add source (sibling selector) support to JSON extraction strategies
unclecode Feb 17, 2026
4fb02f8
Warn LLM against hashed/generated CSS class names in schema prompts
unclecode Feb 17, 2026
6ea0e38
Re-raise exceptions in MemoryAdaptiveDispatcher.run_urls after logging
Otman404 Feb 18, 2026
c70ab31
fix: add leading/trailing pipes to GFM tables (pad_tables=False)
PatD42 Feb 18, 2026
8576331
Add Shadow DOM flattening and reorder js_code execution pipeline
unclecode Feb 18, 2026
c9cb016
Add token usage tracking to generate_schema / agenerate_schema
unclecode Feb 18, 2026
13048a1
Add Tier 3 structural integrity check to anti-bot detector
unclecode Feb 18, 2026
2060c7e
Fix browser recycling deadlock under sustained concurrent load (#1640)
unclecode Feb 19, 2026
94a77ee
Move test_repro_1640.py to tests/browser/
unclecode Feb 19, 2026
8df3541
Skip anti-bot checks and fallback for raw: URLs
unclecode Feb 19, 2026
c854e2b
Fix simulate_user destroying page content via ArrowDown keypress
unclecode Feb 19, 2026
7226f8f
Extend try/finally to cover all post-get_page setup code (#1640)
unclecode Feb 20, 2026
254ef05
Fix anti-bot detection for large SPA block pages (403/503)
unclecode Feb 20, 2026
0e9b677
Fix MCP bridge httpx timeout: add configurable timeout parameter
claude Feb 23, 2026
7435a16
Merge pull request #1771 from hafezparast/claude/check-fork-sync-S9SSz
ntohidi Feb 23, 2026
57be8b8
Merge pull request #1759 from nitesh-77/fix/filterchain-tuple-attribu…
ntohidi Feb 24, 2026
731388c
Merge pull request #1760 from nitesh-77/fix/async-chardet-block
ntohidi Feb 24, 2026
1a9f68d
Fix cascading context crash from duplicate add_init_script (#1768)
unclecode Feb 24, 2026
5b815c2
Fix redirect URL mismatch in head data merging
AtharvaJaiswal005 Feb 24, 2026
cbd36b7
Add stats dashboard page for LP summit
unclecode Feb 24, 2026
c4cdc02
Merge pull request #1761 from AtharvaJaiswal005/fix/total-score-missi…
ntohidi Feb 25, 2026
4f9cc08
Merge pull request #1764 from PatD42/fix/table-gfm-pipes
ntohidi Feb 25, 2026
cd81e3c
Fix scroll_delay ignored in take_screenshot_scroller for full-page sc…
Ahmed-Tawfik94 Feb 25, 2026
9cfeb46
Document scroll_delay parameter for full-page screenshot crawling
Ahmed-Tawfik94 Feb 25, 2026
d419199
Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scr…
ntohidi Feb 25, 2026
8d35d17
Merge pull request #1722 from YuriNachos/fix/issue-1652-md-docstring
ntohidi Feb 25, 2026
c0912f7
feat: add avoid_ads/avoid_css resource filtering and pool release lif…
unclecode Feb 25, 2026
8f2c2e1
docs: add mzyfree to contributors for PR #1689
unclecode Feb 25, 2026
a4cc0a9
feat: add separate query_llm_config for adaptive crawler query expans…
unclecode Feb 25, 2026
0a45c10
feat: add separate query_llm_config for adaptive crawler query expans…
unclecode Feb 25, 2026
500d047
fix: preserve class and id attributes in cleaned_html
Br1an67 Mar 1, 2026
2048862
fix: strip port from URL domain in is_external_url comparison
Br1an67 Mar 1, 2026
b138c94
fix: guard against None LLM content and propagate finish_reason
Br1an67 Mar 1, 2026
669b466
fix: handle nested brackets and parentheses in LINK_PATTERN regex
Br1an67 Mar 1, 2026
0d151eb
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …
ntohidi Mar 2, 2026
0273b27
Fix MediaItem crash on non-numeric width values (e.g. "100%", "auto")
ntohidi Mar 2, 2026
71a6526
fix(docker): narrow from_serializable_dict to ignore plain data dicts…
SohamKukreti Mar 6, 2026
3795910
fix: add score_threshold support to BestFirstCrawlingStrategy
Mar 7, 2026
78434ea
fix: prevent AdaptiveCrawler from crawling external domains
Mar 7, 2026
8a677a9
Merge PR #1805: fix: prevent AdaptiveCrawler from crawling external d…
unclecode Mar 7, 2026
fdb3f8f
Merge PR #1763: fix: return in finally block silently suppressing exc…
unclecode Mar 7, 2026
b008671
Merge PR #1803: fix from_serializable_dict to ignore plain data dicts…
unclecode Mar 7, 2026
d458890
Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803
unclecode Mar 7, 2026
bd0f6e1
fix: strip markdown fences in force_json_response path (LLM extraction)
unclecode Mar 7, 2026
9ec2969
Merge PR #1790: fix: handle nested brackets and parentheses in LINK_P…
unclecode Mar 7, 2026
ff2ea34
Merge PR #1804: feat: add score_threshold support to BestFirstCrawlin…
unclecode Mar 7, 2026
4bde952
Update CONTRIBUTORS for PRs #1787, #1790, #1804
unclecode Mar 7, 2026
122be00
Merge PR #1782: fix: preserve class and id attributes in cleaned_html
unclecode Mar 7, 2026
5f65d2d
Merge PR #1788: fix: guard against None LLM content and propagate fin…
unclecode Mar 7, 2026
93f2f03
Merge PR #1783: fix: strip port from URL domain in is_external_url co…
unclecode Mar 7, 2026
814bc4d
Update CONTRIBUTORS for PRs #1782, #1788, #1783, #1179
unclecode Mar 7, 2026
72cc17c
docs: fix docstring param name crawler_config -> config (#1494)
unclecode Mar 7, 2026
5601861
docs: add missing CacheMode import in quickstart example (#1715)
unclecode Mar 7, 2026
e6c2a65
docs: fix return type annotations to use RunManyReturn (#1716)
unclecode Mar 7, 2026
d6a8f57
docs: fix css_selector type from list to string in examples (#1308)
unclecode Mar 7, 2026
91330ef
fix: add explicit utf-8 encoding to CLI file output (#1789)
unclecode Mar 7, 2026
c73aa27
fix: make link_preview_timeout configurable in AdaptiveConfig (#1793)
unclecode Mar 7, 2026
d229bee
fix: add wait_for_images option to screenshot endpoint (#1792)
unclecode Mar 7, 2026
1029815
fix: add Windows support for crawler monitor keyboard input (#1794)
unclecode Mar 7, 2026
e47e810
fix: handle UnicodeEncodeError in URL seeder and strip zero-width cha…
unclecode Mar 7, 2026
761664d
fix: add TTL expiry for Redis task data to prevent memory growth (#1730)
unclecode Mar 7, 2026
db98aef
Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793,…
unclecode Mar 7, 2026
31d0de2
Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list
unclecode Mar 7, 2026
04e83aa
docs: modernize deprecated API usage across shipped docs (#1770)
unclecode Mar 7, 2026
3704758
Update CONTRIBUTORS for PR #1770
unclecode Mar 7, 2026
697c2b2
fix: add newline before opening code fence in html2text (#462)
unclecode Mar 7, 2026
11ed854
Update CONTRIBUTORS for PR #462
unclecode Mar 7, 2026
7c0cc3e
fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798,…
unclecode Mar 7, 2026
0c9e3c4
Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved)
unclecode Mar 7, 2026
3a75dd3
fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183,…
unclecode Mar 7, 2026
d788c28
test: add comprehensive regression test suite (291 tests)
unclecode Mar 8, 2026
a7e6da0
Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite
unclecode Mar 8, 2026
55956a8
fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues
unclecode Mar 8, 2026
11b4576
fix: anti-bot false positive on browser JSON, URLPatternFilter prefix…
unclecode Mar 9, 2026
6efbffe
fix: screenshot respects scan_full_page=False (#1750)
hafezparast Mar 12, 2026
35034f5
docs: add hafezparast to CONTRIBUTORS.md
unclecode Mar 12, 2026
57b0d09
fix: deduplicate BM25ContentFilter output (#1213) (#1824)
hafezparast Mar 12, 2026
d907e16
Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-…
ntohidi Mar 12, 2026
480d938
fix: /llm per-request provider override, Redis config from host/port/…
hafezparast Mar 12, 2026
3f481e9
fix: screenshot distortion, deep crawl timeout/arun_many, CLI encodin…
hafezparast Mar 12, 2026
a73bc1c
fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594)
unclecode Mar 12, 2026
bf1158a
fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671)
unclecode Mar 12, 2026
8de83a3
fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
hafezparast Mar 12, 2026
6e42995
Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw…
ntohidi Mar 13, 2026
648f36b
Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-red…
ntohidi Mar 13, 2026
f6ab207
fix: remove shared LOCK contention in monitor to prevent pod deadlock…
ntohidi Mar 13, 2026
bb6406a
release: Crawl4AI v0.8.5
ntohidi Mar 16, 2026
89 changes: 89 additions & 0 deletions .claude/commands/c4ai-check.md
@@ -0,0 +1,89 @@
---
description: "Test current changes with adversarial tests, then run full regression suite"
arguments:
- name: changes
description: "Description of what changed (e.g. 'fixed URL normalization to preserve trailing slashes')"
required: true
---

# Crawl4AI Change Verification (c4ai-check)

You are verifying that recent code changes work correctly AND haven't broken anything else. This is a two-phase process.

**Input:** $ARGUMENTS

## PHASE 1: Adversarial Testing of Current Changes

Based on the change description above:

1. **Understand the change**: Read the relevant files that were modified. Use `git diff` to see exactly what changed.

2. **Write targeted adversarial tests**: Create a temporary test file at `tests/regression/test_tmp_changes.py` that HEAVILY tests the specific changes:
- Normal cases (does it work as intended?)
- Edge cases (boundary values, empty inputs, None, huge inputs)
- Regression cases (does the OLD bug still occur? it shouldn't)
- Interaction cases (does it break anything it touches?)
- Adversarial cases (weird inputs that could expose issues)
- At least 10-15 focused tests per change area

Rules for the temp test file:
- Use `@pytest.mark.asyncio` for async tests
- Use real browser crawling where needed (`async with AsyncWebCrawler()`)
- Use the `local_server` fixture from conftest.py when needed
- NO mocking - test real behavior
- Each test must have a clear docstring explaining what it verifies

3. **Run the targeted tests**:
```bash
.venv/bin/python -m pytest tests/regression/test_tmp_changes.py -v --tb=short
```

4. **Report results**: Show pass/fail summary. If any fail, investigate and determine if it's a real bug in the changes or a test issue. Fix the tests if needed, fix the code if there's a real bug.

## PHASE 2: Full Regression Suite

After Phase 1 passes:

1. **Run the full regression suite** (skip network tests for speed):
```bash
.venv/bin/python -m pytest tests/regression/ -v -m "not network" --tb=short -q
```

2. **Analyze failures**: For any failures:
- Determine if the failure is caused by the current changes (REGRESSION) or pre-existing
- Regressions are blockers - report them clearly
- Pre-existing failures should be noted but don't block

3. **Clean up**: Delete the temporary test file:
```bash
rm tests/regression/test_tmp_changes.py
```
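
The regression-versus-pre-existing split in step 2 can be sketched as a set comparison. This is a minimal illustration, not part of the command file or of Crawl4AI: it assumes you have the failing test IDs from one run on the base branch and one run with the changes applied. The function name `classify_failures` and the test IDs are hypothetical.

```python
def classify_failures(baseline_failed, current_failed):
    """Split current failures into regressions and pre-existing failures.

    Both arguments are iterables of test IDs (the IDs below are made up).
    """
    baseline = set(baseline_failed)
    current = set(current_failed)
    return {
        # Fails now but passed on the base branch: caused by the changes.
        "regressions": sorted(current - baseline),
        # Failed on the base branch too: noted, but not a blocker.
        "pre_existing": sorted(current & baseline),
    }


report = classify_failures(
    baseline_failed=["test_proxy.py::test_auth"],
    current_failed=["test_proxy.py::test_auth", "test_urls.py::test_trailing_slash"],
)
print(report)
```

Comparing against a base-branch run is one possible way to make the "caused by the current changes" judgment; the command file itself leaves the method open.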

## PHASE 3: Report

Present a clear summary:

```
## c4ai-check Results

**Changes tested:** [brief description]

### Phase 1: Targeted Tests
- Tests written: X
- Passed: X / Failed: X
- [List any issues found]

### Phase 2: Regression Suite
- Total: X passed, X failed, X skipped
- Regressions caused by changes: [None / list]
- Pre-existing issues: [None / list]

### Verdict: PASS / FAIL
[If FAIL, explain what needs fixing]
```

IMPORTANT:
- Always delete `test_tmp_changes.py` when done, even if tests fail
- A PASS verdict means: all targeted tests pass AND no new regressions in the suite
- A FAIL verdict means: either targeted tests found bugs OR changes caused regressions
- Be honest about failures - don't hide issues
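
The cleanup and verdict rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from the repository: `run_then_cleanup` and `verdict` are hypothetical helpers, and the demo substitutes a trivial failing subprocess for the real pytest invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verdict(targeted_failures, regressions):
    """PASS only if no targeted test failed and no new regression appeared."""
    return "PASS" if targeted_failures == 0 and not regressions else "FAIL"


def run_then_cleanup(tmp_test, command):
    """Run a command, then delete the temp test file even if the run fails."""
    try:
        return subprocess.run(command, check=False).returncode
    finally:
        # Runs on success, on a non-zero exit, and on exceptions alike.
        tmp_test.unlink(missing_ok=True)


# Demo: a stand-in command that exits with status 1, as a failing run would.
tmp = Path(tempfile.mkdtemp()) / "test_tmp_changes.py"
tmp.write_text("def test_placeholder():\n    assert True\n")
code = run_then_cleanup(tmp, [sys.executable, "-c", "raise SystemExit(1)"])
print(code, tmp.exists(), verdict(targeted_failures=1, regressions=[]))
```

Putting the deletion in `finally` is what makes the "even if tests fail" rule hold without relying on the caller to remember it.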
151 changes: 151 additions & 0 deletions .context/PR-TODOLIST.md
@@ -0,0 +1,151 @@
# PR Review Todolist

> Last updated: 2026-03-07 | Total open PRs: 6

---

## Remaining Open PRs (6)

### Bug Fixes (2)

| PR | Author | Description | Notes |
|----|--------|-------------|-------|
| #1207 | moncapitaine | Fix streaming error handling | Old PR, likely needs rebase |
| #462 | jtanningbed | Fix: Add newline before pre codeblock start in html2text. 1-line fix | Very old, may still apply |

### Docs/Maintenance (2)

| PR | Author | Description | Notes |
|----|--------|-------------|-------|
| #1756 | VasiliyRad | Added AG2 community integration example and Quickstart pointer | Community example |
| #1533 | unclecode | Add Claude Code GitHub Workflow | Owner's PR, CI |

### Skipped (owner PRs)

| PR | Author | Description |
|----|--------|-------------|
| #1533 | unclecode | Add Claude Code GitHub Workflow |
| #1124 | unclecode | Add VNC streaming support |

---

## Previously Closed PRs (won't merge)

| PR | Author | Description | Reason |
|----|--------|-------------|--------|
| #999 | loliw | Regex-based filters for deep crawling | URLPatternFilter already supports regex |
| #1180 | kunalmanelkar | CallbackURLFilter for deep crawling | Breaks sync apply() interface |
| #1425 | denrusio | OpenRouter API support | litellm handles openrouter/ natively |
| #1702 | YxmMyth | CSS background image extraction | Too invasive for niche feature |
| #1707 | dillonledoux | Crawl-delay from robots.txt | Too complex for non-standard directive |
| #1729 | hoi | External Redis support | Docker infra - maintainer territory |
| #1592 | Ahmed-Tawfik94 | CDP page leaks and race conditions | Superseded by develop page lifecycle system |

## Previously Closed PRs (from old todolist)

| PR | Author | Original Description | What happened |
|----|--------|---------------------|---------------|
| #1572 | Ahmed-Tawfik94 | Fix CDP setting with managed browser | Closed |
| #1234 | AdarsHH30 | Fix TypeError when keep_data_attributes=False | Closed |
| #1211 | Praneeth1-O-1 | Fix: safely create new page if no page exists | Closed |
| #1200 | fischerdr | Bugfix browser manager session handling | Closed |
| #1106 | devxpain | Fix: Adapt to CrawlerMonitor constructor change | Closed |
| #1081 | Joorrit | Fix deep crawl scorer logic was inverted | Closed |
| #1065 | mccullya | Fix: Update deprecated Groq models | Closed |
| #1059 | Aaron2516 | Fix wrong proxy config type in proxy demo example | Closed |
| #1058 | Aaron2516 | Fix dict-type proxy_config not handled properly | Closed |
| #983 | umerkhan95 | Fix memory leak and empty responses in streaming mode | Closed |
| #948 | GeorgeVince | Fix summarize_page.py example | Closed |
| #1689 | mzyfree | Docker: optimize concurrency performance | Closed (contributor acknowledged) |
| #1706 | vikas-gits-good | Fix arun_many not working with DeepCrawlStrategy | Closed |
| #1683 | Vaccarini-Lorenzo | Implement double config for AdaptiveCrawler | Closed |
| #1674 | blentz | Add output pagination/control for MCP endpoints | Closed |
| #1650 | KennyStryker | Add support for Vertex AI in LLM Extraction Strategy | Closed |
| #1580 | arpagon | Add Azure OpenAI configuration support | Closed |
| #1417 | NickMandylas | Add CDP headers support for remote browser auth | Closed |
| #1255 | itsskofficial | Fix JsonCssSelector to handle adjacent sibling CSS selectors | Closed |
| #1245 | mukul-atomicwork | Feature: GitHub releases integration | Closed |
| #1238 | yerik515 | Fix ManagedBrowser constructor and Windows encoding issues | Closed |
| #1220 | dcieslak19973 | Allow OPENAI_BASE_URL for LLM base_url | Closed |
| #901 | gbe3hunna | CrawlResult model: add pydantic fields and descriptions | Closed |
| #800 | atomlong | ensure_ascii=False for json.dumps | Closed |
| #799 | atomlong | Allow setting base_url for LLM extraction strategy in CLI | Closed |
| #741 | atomlong | Add config option to control Content-Security-Policy header | Closed |
| #723 | alexandreolives | Optional close page after screenshot | Closed |
| #681 | ksallee | JS execution should happen after waiting | Closed |
| #416 | dar0xt | Add keep-aria-label-attribute option | Closed |
| #332 | nelzomal | Add remove_invisible_texts method to crawler strategy | Closed |
| #312 | AndreaFrancis | Add save to HuggingFace support | Closed |
| #1488 | AkosLukacs | Fix syntax error in README JSON example | Closed |
| #1483 | NiclasLindqvist | Update README.md with latest docker image | Closed |
| #1416 | adityaagre | Fix missing bracket in README code block | Closed |
| #1272 | zhenjunMa | Fix get title bug in amazon example | Closed |
| #1263 | vvanglro | Fix: consistent with sdk behavior | Closed |
| #1225 | albertkim | Fix docker deployment guide URL | Closed |
| #1223 | dowithless | Docs: add links to other language versions of README | Closed |
| #1159 | lbeziaud | Fix cleanup warning when no process on debug port | Closed |
| #1098 | B-X-Y | Docs: fix outdated links to Docker guide | Closed |
| #1093 | Aaron2516 | Docs: Fixed incorrect elapsed calculation | Closed |
| #967 | prajjwalnag | Update README.md | Closed |
| #671 | SteveAlphaVantage | Update README.md | Closed |
| #605 | mochamadsatria | Fix typo in docker-deployment.md filename | Closed |
| #335 | amanagarwal042 | Add Documentation for Monitoring with OpenTelemetry | Closed |
| #1722 | YuriNachos | Add missing docstring to MCP md endpoint | Merged directly |

---

## Resolved This Session (batch 5)

| PR | Author | Description | Date |
|----|--------|-------------|------|
| #1622 | Ahmed-Tawfik94 | fix: verify redirect targets in URL seeder | 2026-03-07 |
| #1786 | Br1an67 | fix: wire mean_delay/max_range into dispatcher | 2026-03-07 |
| #1796 | Br1an67 | fix: DOMParser in process_iframes | 2026-03-07 |
| #1795 | Br1an67 | fix: require api_token for /token endpoint | 2026-03-07 |
| #1798 | SohamKukreti | fix: deep-crawl streaming mirrors Python library | 2026-03-07 |
| #1734 | pgoslatara | chore: update GitHub Actions versions | 2026-03-07 |
| #1290 | 130347665 | feat: type-list pipeline in JSON extraction | 2026-03-07 |
| #1668 | microHoffman | feat: --json-ensure-ascii CLI flag | 2026-03-07 |

## Resolved (batch 4)

| PR | Author | Description | Date |
|----|--------|-------------|------|
| #1494 | AkosLukacs | docs: fix docstring param name crawler_config -> config | 2026-03-07 |
| #1715 | YuriNachos | docs: add missing CacheMode import in quickstart | 2026-03-07 |
| #1716 | YuriNachos | docs: fix return types to RunManyReturn | 2026-03-07 |
| #1308 | dominicx | docs: fix css_selector type from list to string | 2026-03-07 |
| #1789 | Br1an67 | fix: UTF-8 encoding for CLI file output | 2026-03-07 |
| #1793 | Br1an67 | fix: configurable link_preview_timeout in AdaptiveConfig | 2026-03-07 |
| #1792 | Br1an67 | fix: wait_for_images on screenshot endpoint | 2026-03-07 |
| #1794 | Br1an67 | fix: cross-platform terminal input in CrawlerMonitor | 2026-03-07 |
| #1784 | Br1an67 | fix: UnicodeEncodeError in URL seeder + zero-width chars | 2026-03-07 |
| #1730 | hoi | fix: add TTL expiry for Redis task data | 2026-03-07 |

## Previously Resolved (batches 1-3)

| PR | Author | Description | Date |
|----|--------|-------------|------|
| #1805 | nightcityblade | fix: prevent AdaptiveCrawler from crawling external domains | 2026-03-07 |
| #1763 | Otman404 | fix: return in finally block silently suppressing exceptions | 2026-03-07 |
| #1803 | SohamKukreti | fix: from_serializable_dict ignoring plain data dicts | 2026-03-07 |
| #1804 | nightcityblade | feat: add score_threshold to BestFirstCrawlingStrategy | 2026-03-07 |
| #1790 | Br1an67 | fix: handle nested brackets in LINK_PATTERN regex | 2026-03-07 |
| #1787 | Br1an67 | fix: strip markdown fences in LLM JSON responses | 2026-03-07 |
| #1782 | Br1an67 | fix: preserve class/id in cleaned_html | 2026-03-07 |
| #1788 | Br1an67 | fix: guard against None LLM content | 2026-03-07 |
| #1783 | Br1an67 | fix: strip port from domain in is_external_url | 2026-03-07 |
| #1179 | phamngocquy | fix: raw HTML URL token leak | 2026-03-07 |
| #1694 | theredrad | feat: add force viewport screenshot | 2026-02-01 |
| #1746 | ChiragBellara | fix: avoid Common Crawl calls for sitemap-only seeding | 2026-02-01 |
| #1714 | YuriNachos | fix: replace tf-playwright-stealth with playwright-stealth | 2026-02-01 |
| #1721 | YuriNachos | fix: respect base tag for relative link resolution | 2026-02-01 |
| #1719 | YuriNachos | fix: include GoogleSearchCrawler script.js in package | 2026-02-01 |
| #1717 | YuriNachos | fix: allow local embeddings by removing OpenAI fallback | 2026-02-01 |
| #1667 | christian-oudard | fix: deep-crawl CLI outputting only first page | 2026-02-01 |
| #1296 | vladmandic | fix: VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY | 2026-02-01 |
| #1364 | nnxiong | fix: script tag removal losing adjacent text | 2026-02-01 |
| #1077 | RoyLeviLangware | fix: bs4 deprecation warning (text -> string) | 2026-02-01 |
| #1281 | garyluky | fix: proxy auth ERR_INVALID_AUTH_CREDENTIALS | 2026-02-01 |
| #1463 | TristanDonze | feat: device_scale_factor for screenshot quality | 2026-02-06 |
| #1435 | charlaie | feat: redirected_status_code in CrawlResult | 2026-02-06 |
8 changes: 4 additions & 4 deletions .github/workflows/docker-release.yml
@@ -31,7 +31,7 @@ jobs:
df -h

- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@v6

- name: Extract version from release or tag
id: get_version
@@ -58,16 +58,16 @@ jobs:
echo "Semantic versions - Major: $MAJOR, Minor: $MINOR"

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
uses: docker/setup-buildx-action@v4

- name: Log in to Docker Hub
uses: docker/login-action@v3
uses: docker/login-action@v4
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}

- name: Build and push Docker images
uses: docker/build-push-action@v5
uses: docker/build-push-action@v6
with:
context: .
push: true
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -13,10 +13,10 @@ jobs:

steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@v6

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: '3.12'

4 changes: 4 additions & 0 deletions .gitignore
@@ -286,6 +286,7 @@ docs/apps/linkdin/samples/insights/*

scripts/
!scripts/gen-sbom.sh
!scripts/update_stats.py


# Databse files
@@ -298,3 +299,6 @@ scripts/
*.rdb
*.ldb
MEMORY.md

# Handoff files
HANDOFF-*.md