
Ability to connect a web driver to also fetch and render dynamic page content (JavaScript) #1

@Emulator000

Description


Summary

Currently, crawly is designed to efficiently crawl and scrape static web pages while adhering to robots.txt rules.
However, many modern websites generate their content dynamically with JavaScript, which the current crawler cannot capture.

Enhancement proposal

This feature request aims to integrate support for a web driver (such as Selenium or headless browsers like Puppeteer) to enable the crawling and rendering of dynamic content created with JavaScript.

Goals

  1. Dynamic content rendering: use a web driver to fully render pages before scraping, so that JavaScript-generated content is captured;
  2. Integration with existing architecture: plug web driver capabilities seamlessly into the current Crawler and CrawlerBuilder setup;
  3. Respect existing configurations: ensure the rendering process still honors robots.txt rules, rate limits, and depth limits.
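One way to meet goal 2 without disturbing the existing pipeline would be to hide the fetch step behind a trait, so the crawler swaps in a rendering fetcher when dynamic rendering is enabled. A minimal sketch with hypothetical names (`PageFetcher`, `StaticFetcher`, `RenderingFetcher`) and stubbed bodies; none of this is existing crawly API:

```rust
// Hypothetical abstraction: the crawler only depends on `PageFetcher`,
// so static and rendering fetchers are interchangeable.
trait PageFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>>;
}

// Plain HTTP fetch, as crawly does today (stubbed for illustration).
struct StaticFetcher;

impl PageFetcher for StaticFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>> {
        Ok(format!("<html><!-- static body of {url} --></html>"))
    }
}

// Would drive a WebDriver/headless browser and return the post-JavaScript
// DOM; stubbed here so the sketch stands alone.
struct RenderingFetcher;

impl PageFetcher for RenderingFetcher {
    fn fetch(&self, url: &str) -> Result<String, Box<dyn std::error::Error>> {
        Ok(format!("<html><!-- rendered body of {url} --></html>"))
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In the real crate this flag would come from CrawlerBuilder.
    let dynamic_rendering = true;
    let fetcher: Box<dyn PageFetcher> = if dynamic_rendering {
        Box::new(RenderingFetcher)
    } else {
        Box::new(StaticFetcher)
    };
    let body = fetcher.fetch("https://example.com")?;
    println!("{body}");
    Ok(())
}
```

Because the rest of the pipeline (robots.txt checks, rate limiting, depth tracking) sits above the fetcher, goal 3 would hold for both implementations automatically.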

Implementation suggestions

Option 1: Selenium Web Driver

  • Connect to a Selenium-compatible WebDriver endpoint to drive a real browser and retrieve the fully rendered DOM;
  • In Rust this could be done through an existing WebDriver client crate (for example, thirtyfour) rather than new hand-written bindings.

Option 2: Headless browsers

  • Use headless browsers like Puppeteer or Playwright for improved performance in rendering and scraping dynamic content;
  • This might involve creating Rust bindings or using existing native integrations.
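For the headless-browser route, native Rust integrations already exist. A sketch of what the rendering step might look like, assuming the `headless_chrome` crate (my choice for illustration, not a decision of this issue); it requires a local Chrome/Chromium install:

```rust
use anyhow::Result;
use headless_chrome::Browser;

// Fetch the post-JavaScript DOM of a page through a local headless Chrome.
// Sketch of the rendering step only; a real integration would reuse the
// browser across requests and respect crawly's rate and depth limits.
fn render_page(url: &str) -> Result<String> {
    let browser = Browser::default()?; // launches a headless Chrome process
    let tab = browser.new_tab()?;
    tab.navigate_to(url)?;
    tab.wait_until_navigated()?; // wait for the page (and its scripts) to load
    Ok(tab.get_content()?) // serialized DOM after JavaScript execution
}

fn main() -> Result<()> {
    let html = render_page("https://example-dynamic.com")?;
    println!("rendered {} bytes", html.len());
    Ok(())
}
```

Launching one browser process per crawl (rather than per page) would keep the resource overhead bounded.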

Proposed API changes

Introduce a new setting in CrawlerBuilder to enable dynamic content rendering:

let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // New configuration
    .build()?;

Example usage

An example of how users would take advantage of the new feature in their projects:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_dynamic_rendering(true)
        .build()?;

    let results = crawler.start("https://example-dynamic.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Expected benefits

  • Expanded reach: ability to scrape modern sites that heavily rely on client-side JavaScript;
  • Flexibility: users can choose to enable or disable dynamic rendering based on their needs, keeping crawly lightweight for static sites.

Additional context:

  • This feature would require additional dependencies and would have implications for crawling speed and resource usage;
  • Integrating third-party web drivers/browsers adds complexity to error handling and debugging that must be handled carefully.
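On the error-handling point, driver failures (browser not installed, render timeouts, navigation errors) are distinct from HTTP failures and could get their own variants so callers can react differently. A hypothetical sketch; `RenderError` and its variants are illustrative, not an existing crawly type:

```rust
use std::fmt;

// Hypothetical error type for the rendering path, kept separate from
// ordinary HTTP errors so callers can retry, skip, or abort appropriately.
#[derive(Debug)]
enum RenderError {
    DriverUnavailable(String),                // WebDriver/browser process missing
    Timeout { url: String, after_secs: u64 }, // page never finished rendering
    Navigation(String),                       // browser-side navigation failure
}

impl fmt::Display for RenderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RenderError::DriverUnavailable(msg) => write!(f, "web driver unavailable: {msg}"),
            RenderError::Timeout { url, after_secs } => {
                write!(f, "rendering {url} timed out after {after_secs}s")
            }
            RenderError::Navigation(msg) => write!(f, "navigation failed: {msg}"),
        }
    }
}

impl std::error::Error for RenderError {}

fn main() {
    let err = RenderError::Timeout {
        url: "https://example-dynamic.com".into(),
        after_secs: 30,
    };
    println!("{err}"); // rendering https://example-dynamic.com timed out after 30s
}
```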

Tracking activity:

  • Research and choose web driver solution (Selenium, Puppeteer, Playwright, etc.);
  • Prototype the integration with a basic dynamic page;
  • Implement API changes and configuration options;
  • Develop tests and documentation for the new feature;
  • Solicit feedback from the community and refine the implementation.

Metadata

Labels

enhancement (New feature or request)
