Labels: enhancement (New feature or request)
Summary
Currently, crawly is designed to efficiently crawl and scrape static web pages while adhering to robots.txt rules.
However, many modern websites generate content dynamically with JavaScript, which the current crawler cannot capture.
Enhancement proposal
This feature request proposes integrating support for a web driver (such as Selenium) or a headless browser (such as Puppeteer) to enable crawling and rendering of content generated dynamically with JavaScript.
Goals
- Dynamic content rendering: use a web driver to fully render pages before scraping, so dynamically generated content is captured;
- Integration with existing architecture: seamlessly connect web driver capabilities into the current Crawler and CrawlerBuilder setup;
- Respect existing configurations: ensure that the rendering process adheres to existing settings such as robots.txt, rate limits, and depth limits.
Implementation suggestions
Option 1: Selenium Web Driver
- Utilize Selenium WebDriver to control a browser and render dynamic content;
- Integrate with Rust using crates such as fantoccini or thirtyfour.
Option 2: Headless browsers
- Use headless browsers like Puppeteer or Playwright for improved performance in rendering and scraping dynamic content;
- This might involve creating Rust bindings or using existing native integrations.
Proposed API changes
Introduce a new setting in CrawlerBuilder to enable dynamic content rendering:
let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // new configuration
    .build()?;

Example usage
Demonstrate how users would take advantage of the new feature in their projects:
use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_dynamic_rendering(true)
        .build()?;

    let results = crawler.start("https://example-dynamic.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Expected benefits
- Expanded reach: ability to scrape modern sites that heavily rely on client-side JavaScript;
- Flexibility: users can choose to enable or disable dynamic rendering based on their needs, keeping crawly lightweight for static sites.
Additional context:
- This feature would require additional dependencies and might have implications on crawling speed and resource usage;
- It is essential to handle the added complexity in error handling and debugging when integrating third-party web drivers/browsers.
Tracking activity:
- Research and choose web driver solution (Selenium, Puppeteer, Playwright, etc.);
- Prototype the integration with a basic dynamic page;
- Implement API changes and configuration options;
- Develop tests and documentation for the new feature;
- Solicit feedback from the community and refine the implementation.