
Ability to connect a web driver to also fetch and render dynamic page content (JavaScript) #1

@Emulator000

Description


Summary

Currently, crawly is designed to efficiently crawl and scrape static web pages while adhering to robots.txt rules.
However, many modern websites generate their content dynamically with JavaScript, which the current crawler cannot capture.

Enhancement proposal

This feature request aims to integrate support for a web driver (such as Selenium or headless browsers like Puppeteer) to enable the crawling and rendering of dynamic content created with JavaScript.

Goals

  1. Dynamic content rendering: use a web driver to fully render pages before scraping, so that JavaScript-generated content is captured;
  2. Integration with existing architecture: plug web driver capabilities seamlessly into the current Crawler and CrawlerBuilder setup;
  3. Respect existing configurations: ensure the rendering process still honors robots.txt rules, rate limits, and depth limits.
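One way to meet goal 2 without disturbing the existing pipeline would be to hide the fetch step behind a trait, so the crawler swaps in a rendering fetcher when dynamic rendering is enabled. A minimal sketch with hypothetical names (`PageFetcher`, `StaticFetcher`, `RenderingFetcher`) and stubbed bodies; none of this is existing crawly API:

```rust
// Hypothetical abstraction: the crawler only depends on `PageFetcher`,
// so static and rendering fetchers are interchangeable.
trait PageFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>>;
}

// Plain HTTP fetch, as crawly does today (stubbed for illustration).
struct StaticFetcher;

impl PageFetcher for StaticFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>> {
        Ok(format!("<html><!-- static body of {url} --></html>"))
    }
}

// Would drive a WebDriver/headless browser and return the post-JavaScript
// DOM; stubbed here so the sketch stands alone.
struct RenderingFetcher;

impl PageFetcher for RenderingFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>> {
        Ok(format!("<html><!-- rendered body of {url} --></html>"))
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In the real crate this flag would come from CrawlerBuilder.
    let dynamic_rendering = true;
    let fetcher: Box<dyn PageFetcher> = if dynamic_rendering {
        Box::new(RenderingFetcher)
    } else {
        Box::new(StaticFetcher)
    };
    let body = fetcher.fetch("https://example.com")?;
    println!("{body}");
    Ok(())
}
```

Because the rest of the pipeline (robots.txt checks, rate limiting, depth tracking) sits above the fetcher, goal 3 would hold for both implementations automatically.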

Implementation suggestions

Option 1: Selenium Web Driver

  • Connect to a Selenium-compatible WebDriver endpoint to drive a real browser and retrieve the fully rendered DOM;
  • In Rust this could be done through an existing WebDriver client crate (for example, thirtyfour) rather than new hand-written bindings.

Option 2: Headless browsers

  • Use headless browsers like Puppeteer or Playwright for improved performance in rendering and scraping dynamic content;
  • This might involve creating Rust bindings or using existing native integrations.
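For the headless-browser route, native Rust integrations already exist. A sketch of what the rendering step might look like, assuming the `headless_chrome` crate (my choice for illustration, not a decision of this issue); it requires a local Chrome/Chromium install:

```rust
use anyhow::Result;
use headless_chrome::Browser;

// Fetch the post-JavaScript DOM of a page through a local headless Chrome.
// Sketch of the rendering step only; a real integration would reuse the
// browser across requests and respect crawly's rate and depth limits.
fn render_page(url: &str) -> Result<String> {
    let browser = Browser::default()?; // launches a headless Chrome process
    let tab = browser.new_tab()?;
    tab.navigate_to(url)?;
    tab.wait_until_navigated()?; // wait for the page (and its scripts) to load
    Ok(tab.get_content()?) // serialized DOM after JavaScript execution
}

fn main() -> Result<()> {
    let html = render_page("https://example-dynamic.com")?;
    println!("rendered {} bytes", html.len());
    Ok(())
}
```

Launching one browser process per crawl (rather than per page) would keep the resource overhead bounded.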

Proposed API changes

Introduce a new setting in CrawlerBuilder to enable dynamic content rendering:

let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // New configuration
    .build()?;

Example usage

An example of how users would take advantage of the new feature in their projects:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_dynamic_rendering(true)
        .build()?;

    let results = crawler.start("https://example-dynamic.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Expected benefits

  • Expanded reach: ability to scrape modern sites that heavily rely on client-side JavaScript;
  • Flexibility: users can choose to enable or disable dynamic rendering based on their needs, keeping crawly lightweight for static sites.

Additional context:

  • This feature would require additional dependencies and would have implications for crawling speed and resource usage;
  • Integrating third-party web drivers/browsers adds complexity to error handling and debugging that must be handled carefully.
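On the error-handling point, driver failures (browser not installed, render timeouts, navigation errors) are distinct from HTTP failures and could get their own variants so callers can react differently. A hypothetical sketch; `RenderError` and its variants are illustrative, not an existing crawly type:

```rust
use std::fmt;

// Hypothetical error type for the rendering path, kept separate from
// ordinary HTTP errors so callers can retry, skip, or abort appropriately.
#[derive(Debug)]
enum RenderError {
    DriverUnavailable(String),                // WebDriver/browser process missing
    Timeout { url: String, after_secs: u64 }, // page never finished rendering
    Navigation(String),                       // browser-side navigation failure
}

impl fmt::Display for RenderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RenderError::DriverUnavailable(msg) => write!(f, "web driver unavailable: {msg}"),
            RenderError::Timeout { url, after_secs } => {
                write!(f, "rendering {url} timed out after {after_secs}s")
            }
            RenderError::Navigation(msg) => write!(f, "navigation failed: {msg}"),
        }
    }
}

impl std::error::Error for RenderError {}

fn main() {
    let err = RenderError::Timeout {
        url: "https://example-dynamic.com".into(),
        after_secs: 30,
    };
    println!("{err}"); // rendering https://example-dynamic.com timed out after 30s
}
```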

Tracking activity:

  • Research and choose web driver solution (Selenium, Puppeteer, Playwright, etc.);
  • Prototype the integration with a basic dynamic page;
  • Implement API changes and configuration options;
  • Develop tests and documentation for the new feature;
  • Solicit feedback from the community and refine the implementation.

Metadata

Labels

enhancement (New feature or request)
