English | 简体中文
This document summarizes the process, technical architecture, and key implementation details of developing the Xiaohongshu (RedNote) crawler system using Antigravity (Agentic AI). It aims to provide a learning reference for future developers.
This project is an automated data collection system for the Xiaohongshu platform. The core objective is to collect notes and comments under specific keywords (e.g., "Makeup", "Skincare") and provide structured data support for subsequent analysis.
- Language: Python 3.10+
- Browser Automation: DrissionPage (ChromiumPage mode), chosen over Selenium/Playwright for its anti-detection advantages and low-level browser control.
- Database: SQLite + SQLModel (ORM)
- Frontend: (Optional) HTML/JS Dashboard for monitoring.
The system consists of the following modules:
- Crawler Core (`crawler/xhs_crawler.py`): Responsible for controlling browser behavior, executing search, pagination, detail scraping, Cookie management, etc.
- Data Models (`database/models.py`): Defines the database structure for `Note` and `Comment` (a sketch follows this list).
- Manager (`crawler/crawler_manager.py`): (Logic Layer) Schedules crawler tasks and handles exception retries.
- External Processor: Data can be exported or provided to independent analysis modules via API.
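As a rough illustration, the two models might look like the following SQLModel sketch. Only the class names `Note` and `Comment` come from the project; every field name here is an assumption, not the actual schema.

```python
from datetime import datetime
from typing import Optional

from sqlmodel import Field, SQLModel


class Note(SQLModel, table=True):
    """One scraped note; fields other than the class name are illustrative."""
    id: Optional[int] = Field(default=None, primary_key=True)
    note_id: str = Field(index=True, unique=True)  # platform-side note ID
    title: str
    content: str
    publish_time: Optional[datetime] = None


class Comment(SQLModel, table=True):
    """A level-1 or level-2 comment attached to a note."""
    id: Optional[int] = Field(default=None, primary_key=True)
    note_id: str = Field(index=True, foreign_key="note.note_id")
    parent_id: Optional[str] = None  # set for level-2 (sub-)comments
    content: str
```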
`XHSCrawler` implements basic request control:
- Basic Jitter: `_sleep_with_jitter` implements simple random waiting to avoid fixed-frequency request patterns (a sketch follows this list).
- Timeout & Retry: Explicit timeouts are set for network requests and key element lookups to prevent the crawler from hanging indefinitely.
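A minimal sketch of what `_sleep_with_jitter` could look like. The method name comes from the project; the parameters and the uniform distribution are illustrative assumptions.

```python
import random
import time


class XHSCrawler:
    def _sleep_with_jitter(self, base: float = 2.0, spread: float = 1.5) -> None:
        """Sleep for `base` plus a random jitter so requests never fire at a fixed rate.

        `base` and `spread` are assumed defaults for illustration only.
        """
        time.sleep(base + random.uniform(0.0, spread))
```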
- Strict Login Check: Mandatorily checks for avatar/username elements to prevent false positives (guest mode).
- Login Wall Handling: Automatically detects `302 Redirect` or modal popups and raises exceptions for manual intervention.
- Search & Pagination: Supports infinite scrolling (see the sketch after this list).
- Detail Parsing: Extracts high-res images, content, publish time, etc.
- Sub-comments: Handles the "Sibling Node Trap" mentioned earlier to ensure level-2 comments are expanded and captured.
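The infinite-scroll pagination might look roughly like this with DrissionPage. The URL, CSS selector, and scroll count are illustrative assumptions, not the project's actual code:

```python
import time

from DrissionPage import ChromiumPage

page = ChromiumPage()
page.get('https://www.xiaohongshu.com/search_result?keyword=Makeup')  # hypothetical URL

seen_links = set()
for _ in range(10):  # bounded scroll rounds instead of an endless loop
    for card in page.eles('css:section.note-item'):  # hypothetical selector
        link = card.ele('tag:a', timeout=1)
        if link and link.link:
            seen_links.add(link.link)
    page.scroll.to_bottom()  # trigger the next batch of lazy-loaded cards
    time.sleep(1.5)          # give the feed time to render
```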
This project is a typical result of an AI-Native development workflow. The User (Human) and AI Agent (Antigravity) pair-programmed extensively:
- Agentic Mode:
  - Antigravity didn't just answer questions but took over tasks as an "independent developer". It proactively created `task.md` to plan progress and maintained `implementation_plan.md` for design.
  - For example, during the crawler architecture refactoring, the AI proactively proposed "Scheme A" (Inheritance) and automatically executed the file moves, code cleanup, and dependency updates.
- Debug Loop:
  - When CSS selectors failed, Antigravity wrote disposable `debug_xhs.py` scripts to capture the current page's HTML dump for analysis and fixed the selectors, instead of asking the user to test manually over and over.
  - When solving the sub-comment scraping challenge, the AI suggested printing `outerHTML` to observe minute DOM structure changes (which led to the discovery of the sibling nodes), as illustrated below.
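For illustration, the `outerHTML` trick might look like this with DrissionPage; the selector is hypothetical:

```python
# Print an element's outer HTML and that of its next sibling to spot
# subtle DOM changes (e.g., sub-comments rendered as sibling nodes).
comment = page.ele('css:div.comment-item', timeout=3)  # hypothetical selector
if comment:
    print(comment.html)        # outer HTML of the matched element
    sibling = comment.next()   # the adjacent sibling node, if any
    if sibling:
        print(sibling.html)
```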
- Docs as Code:
  - All documentation (including this one) was drafted, translated (EN/CN), and maintained by the AI based on code changes. The AI ensured documentation stayed synchronized with the code implementation (e.g., removing descriptions of obsolete features).
- Challenge: The crawler sometimes got stuck in a "Not Logged In" state even though the Cookie was actually valid.
  - Fix: Checking Cookie expiration alone wasn't enough; strict DOM element checks were necessary. We introduced `_detect_security_restriction` to identify account-specific risk-control pages (see the sketch below).
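A minimal sketch of a strict DOM-based login check, assuming DrissionPage's element API; the selectors are hypothetical and `_detect_security_restriction`'s internals are not shown:

```python
def is_logged_in(page) -> bool:
    """Return True only if a logged-in user element is actually rendered.

    A valid cookie alone is not proof of login: guest mode can carry a
    cookie yet still show the login wall, so we require an avatar or
    username element in the DOM. Selectors below are hypothetical.
    """
    avatar = page.ele('css:.user-side-bar .avatar', timeout=3)
    username = page.ele('css:.user-side-bar .name', timeout=1)
    return bool(avatar or username)
```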
- Challenge: `DrissionPage` element location failed in some environments.
  - Fix: Mixed use of CSS selectors and XPath, adding Shadow DOM penetration where applicable (see the sketch below).
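The fallback pattern might look like this; all selectors and the shadow-host tag are illustrative assumptions:

```python
def find_note_card(page):
    """Try CSS first, then XPath, then penetrate a shadow root.

    Locators use DrissionPage's 'css:' / 'x:' prefix syntax; every
    selector here is hypothetical.
    """
    for locator in ('css:section.note-item',
                    'x://section[contains(@class, "note")]'):
        ele = page.ele(locator, timeout=2)
        if ele:
            return ele
    host = page.ele('css:feed-container', timeout=2)  # hypothetical shadow host
    if host and host.shadow_root:
        return host.shadow_root.ele('css:section.note-item', timeout=2)
    return None
```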
For developers wishing to extend this project:
- Debug First: When you hit anti-scraping measures or parsing errors, write a small `debug_*.py` script to reproduce the issue first (a sketch follows this list). Don't guess blindly in the main flow.
- Raw Data Storage: It is recommended to save scraped data directly into SQLite/JSON (Raw Data) without excessive cleaning or business processing. This maximizes the preservation of original information for future use.
- Logging: Keep detailed `loguru` logs, especially of network requests and state changes; they are critical for troubleshooting "ghost bugs" (sporadic issues).
- Respect Rules: Strictly control scrape frequency (sleep intervals) to avoid putting pressure on the target site and getting your IP banned.
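As an example of the Debug First tip, a disposable dump script in the spirit of `debug_xhs.py` can be very small; the target URL and output path here are illustrative:

```python
"""Disposable debug script: dump the live page HTML for offline selector work."""
from DrissionPage import ChromiumPage

page = ChromiumPage()                            # launches or attaches to Chromium
page.get('https://www.xiaohongshu.com/explore')  # hypothetical target URL

with open('page_dump.html', 'w', encoding='utf-8') as f:
    f.write(page.html)                           # rendered HTML, not raw source
print(f'Dumped {len(page.html)} characters to page_dump.html')
```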