96 changes: 77 additions & 19 deletions README.md
@@ -1,21 +1,26 @@
# Documentation Crawler and Converter v.0.3
# Documentation Crawler and Converter v1.0.0

This tool crawls a documentation website and converts the pages into a single Markdown document. It intelligently removes common sections that appear across multiple pages to avoid duplication, including them once at the end of the document.

**Version 1.0.0** introduces significant improvements, including support for JavaScript-rendered pages using Playwright and a fully asynchronous implementation.

## Features

- **JavaScript Rendering**: Utilizes Playwright to accurately render pages that rely on JavaScript, ensuring complete and up-to-date content capture.
- Crawls documentation websites and combines pages into a single Markdown file.
- Removes common sections that appear across many pages, including them once at the beginning.
- Customizable threshold for similarity.
- Removes common sections that appear across many pages, including them once at the end of the document.
- Customizable threshold for similarity to control deduplication sensitivity (see the sketch after this list).
- Configurable selectors to remove specific elements from pages.
- Supports robots.txt compliance with an option to ignore it.
- **NEW in v0.3.3**: Ability to skip URLs based on ignore-paths both pre-fetch (before requesting content) and post-fetch (after redirects).
## New in v1.0.0

- **JavaScript rendering**: waits for the page to stabilize before scraping.
- **Asynchronous operation**: fully asynchronous crawling improves performance and scalability.
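
The common-section removal above can be pictured with MinHash signatures; `datasketch` is listed in `requirements.txt`, so the sketch below is a plausible illustration of the idea, not the package's actual implementation (the function name and whitespace tokenization are assumptions for the example):

```python
from datasketch import MinHash  # datasketch is pinned in requirements.txt

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens (illustrative tokenization)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# Two pages that share a navigation block but differ in body text.
nav = "Home Docs API Reference Blog Sign in"
page_a = nav + " Installing the client library step by step"
page_b = nav + " Configuring authentication for the client"

# Estimated Jaccard similarity; content scoring above the
# --similarity-threshold (default 0.6) is the kind of repeated
# material the crawler emits once instead of on every page.
print(f"estimated similarity: {minhash_of(page_a).jaccard(minhash_of(page_b)):.2f}")
```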

## Installation

### Prerequisites

- **Python 3.6 or higher** is required.
- **Python 3.7 or higher** is required.
- (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.

### 1. Installing the Package with `pip`
@@ -49,11 +54,13 @@ It is recommended to use a virtual environment to isolate the package and its de
2. **Activate the virtual environment**:

- On **macOS/Linux**:

```bash
source venv/bin/activate
```

- On **Windows**:

```bash
.\venv\Scripts\activate
```
@@ -66,15 +73,25 @@ It is recommended to use a virtual environment to isolate the package and its de

This ensures that all dependencies are installed within the virtual environment.

### 4. Installing from PyPI
### 4. Installing Playwright Browsers

After installing the package, you need to install the necessary Playwright browser binaries:

```bash
playwright install
```

This command downloads the required browser binaries (Chromium, Firefox, and WebKit) used by Playwright for rendering pages.
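
As a rough sketch of what "waiting for the page to stabilize" can look like with Playwright's async API (an illustration of the idea behind the polling interval, not the package's actual internals; the function name and the 30-poll cap are assumptions):

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_rendered_html(url: str, interval_ms: int = 1000) -> str:
    """Render a JavaScript page, polling until the DOM stops changing."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        previous = await page.content()
        for _ in range(30):  # safety cap so a constantly-mutating page cannot loop forever
            await page.wait_for_timeout(interval_ms)  # pause between polls
            current = await page.content()
            if current == previous:  # unchanged since the last poll -> treat as stable
                break
            previous = current
        await browser.close()
        return previous

if __name__ == "__main__":
    html = asyncio.run(fetch_rendered_html("https://example.com/docs/"))
```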

### 5. Installing from PyPI

Once the package is published on PyPI, you can install it directly using:

```bash
pip install libcrawler
```

### 5. Upgrading the Package
### 6. Upgrading the Package

To upgrade the package to the latest version, use:

@@ -84,7 +101,7 @@ pip install --upgrade libcrawler

This will upgrade the package to the newest version available.

### 6. Verifying the Installation
### 7. Verifying the Installation

You can verify that the package has been installed correctly by running:

@@ -102,7 +119,7 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]

### Arguments

- `BASE_URL`: The base URL of the documentation site (e.g., https://example.com).
- `BASE_URL`: The base URL of the documentation site (e.g., _https://example.com_).
- `STARTING_POINT`: The starting path of the documentation (e.g., /docs/).

### Optional Arguments
@@ -117,29 +134,33 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]
- `--ignore-paths PATH [PATH ...]`: List of URL paths to skip during crawling, either before or after fetching content.
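- `--interval INTERVAL`: Time step in milliseconds used when waiting for the DOM to stabilize (default: 1000).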
- `--user-agent USER_AGENT`: Specify a custom User-Agent string (which will be harmonized with any additional headers).
- `--headers-file FILE`: Path to a JSON file containing optional headers. Only one of `--headers-file` or `--headers-json` can be used.
- `--headers-json JSON` (JSON string): Optional headers as JSON
- `--headers-json JSON` (JSON string): Optional headers as JSON (see the Supplying Custom Headers example below).

### Examples

#### Basic Usage

```bash
crawl-docs https://example.com /docs/ -o output.md
```

#### Adjusting Thresholds

```bash
crawl-docs https://example.com /docs/ -o output.md \
--similarity-threshold 0.7 \
--delay-range 0.3
```

#### Specifying Extra Selectors to Remove

```bash
crawl-docs https://example.com /docs/ -o output.md \
--remove-selectors ".sidebar" ".ad-banner"
```

#### Limiting to Specific Paths

```bash
crawl-docs https://example.com / -o output.md \
--allowed-paths "/docs/" "/api/"
@@ -148,24 +169,61 @@ crawl-docs https://example.com / -o output.md \
#### Skipping URLs with Ignore Paths

```bash
crawl-docs https://example.com /docs/ -o output.md \
--ignore-paths "/old/" "/legacy/"
```
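
#### Supplying Custom Headers

Headers can be passed inline as a JSON string (or from a file via `--headers-file`). A hypothetical example; the header names and token are placeholders:

```bash
crawl-docs https://example.com /docs/ -o output.md \
    --headers-json '{"Authorization": "Bearer <token>", "Accept-Language": "en-US"}'
```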

### Dependencies
## Dependencies

- **Python 3.7 or higher**
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [Playwright](https://playwright.dev/python/docs/intro) for headless browser automation and JavaScript rendering.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
- Additional dependencies are listed in `requirements.txt`.

### Installing Dependencies

- Python 3.6 or higher
- BeautifulSoup4
- datasketch
- requests
- markdownify
After setting up your environment, install all required dependencies using:

Install dependencies using:
```bash
pip install -r requirements.txt
```

**Note**: Ensure you have installed the Playwright browsers by running `playwright install` as mentioned in the Installation section.

## License

This project is licensed under the LGPLv3.
This project is licensed under the LGPLv3. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please follow these steps to contribute:

1. **Fork the repository** on GitHub.
2. **Clone your fork** to your local machine:

   ```bash
   git clone https://github.com/your-username/libcrawler.git
   ```

3. **Create a new branch** for your feature or bugfix:

   ```bash
   git checkout -b feature-name
   ```

4. **Make your changes** and **commit** them with clear messages:

   ```bash
   git commit -m "Add feature X"
   ```

5. **Push** your changes to your fork:

   ```bash
   git push origin feature-name
   ```

6. **Open a Pull Request** on the original repository describing your changes.

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

## Acknowledgements

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
- [Playwright](https://playwright.dev/) for headless browser automation.
- [Markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -8,7 +8,7 @@ description = "A tool to crawl documentation and convert to Markdown."
authors = [
{ name="Robert Collins", email="roberto.tomas.cuentas@gmail.com" }
]
requires-python = ">=3.6"
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,4 +1,6 @@
aiofiles~=24.1.0
beautifulsoup4~=4.12.3
datasketch~=1.6.5
markdownify~=0.13.1
playwright~=1.49.1
Requests~=2.32.3
Empty file added src/libcrawler/__init__.py
Empty file.
8 changes: 5 additions & 3 deletions src/libcrawler/__main__.py
@@ -1,3 +1,4 @@
import asyncio
import argparse
import json
from urllib.parse import urljoin
@@ -18,6 +19,7 @@ def main():
help='Delay between requests in seconds.')
parser.add_argument('--delay-range', type=float, default=0.5,
help='Range for random delay variation.')
parser.add_argument('--interval', type=int, help='Time step used when waiting for the DOM to stabilize, in milliseconds (default: 1000 ms)')
parser.add_argument('--remove-selectors', nargs='*',
help='Additional CSS selectors to remove from pages.')
parser.add_argument('--similarity-threshold', type=float, default=0.6,
@@ -55,7 +57,7 @@ def main():
start_url = urljoin(args.base_url, args.starting_point)

# Adjust crawl_and_convert call to handle ignore-paths and optional headers
crawl_and_convert(
asyncio.run(crawl_and_convert(
start_url=start_url,
base_url=args.base_url,
output_filename=args.output,
@@ -68,8 +70,8 @@
similarity_threshold=args.similarity_threshold,
allowed_paths=args.allowed_paths,
ignore_paths=args.ignore_paths # Pass the ignore-paths argument
)
))


if __name__ == '__main__':
main()
main()