WebScraper-Old

Deprecation Notice: This project is deprecated. Please use the improved version of the scraper, WebScraper, instead.

A Python-based web scraping tool designed to extract and convert HTML content into LaTeX format for seamless integration into documents.

Installation

  1. Clone the repository:

    git clone https://github.com/kgruiz/WebScraper-Old.git

  2. Navigate to the project directory:

    cd WebScraper-Old

  3. Install the required dependencies:

    pip install requests beautifulsoup4 tqdm pypandoc weasyprint
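
After installing, it can help to confirm that the packages are importable before running any of the scripts. The following is a minimal sketch and not part of this repository: the helper names `missing_modules` and `pandoc_available` are hypothetical, and the module list assumes the usual import names for the packages above (`beautifulsoup4` is imported as `bs4`). Because pypandoc wraps the pandoc executable, the sketch also looks for `pandoc` on the PATH. This is only a rough check, since pandoc may be installed somewhere else.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above (assumed; beautifulsoup4 -> bs4).
REQUIRED_MODULES = ["requests", "bs4", "tqdm", "pypandoc", "weasyprint"]


def missing_modules(modules):
    """Return the names in `modules` that Python cannot locate for import."""
    # find_spec locates a top-level module without importing it.
    return [name for name in modules if importlib.util.find_spec(name) is None]


def pandoc_available():
    """Return True if a `pandoc` executable is found on the PATH."""
    return shutil.which("pandoc") is not None


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("All required Python modules were found.")
    if not pandoc_available():
        print("pandoc was not found on PATH; pypandoc may not be able to convert files.")
```

If any module is reported missing, rerunning the `pip install` command above in the same Python environment is the first thing to try.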

Usage

  1. Convert a single HTML file to LaTeX:

    python HTMLtoLatex.py path/to/input.html

  2. Download web pages as PDFs:

    python Downloader.py urlList.json

  3. Flatten directory structure:

    python Scraper.py
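
To illustrate what flattening a directory structure means in step 3, here is a minimal sketch using only the standard library. It is not the code in `Scraper.py`: the function name `flatten_directory`, its parameters, and the underscore naming scheme are all assumptions made for illustration. It copies every file under a root folder into a single destination folder, building each new file name from the file's relative path so that files with the same name in different subfolders are kept apart.

```python
import shutil
from pathlib import Path


def flatten_directory(root: Path, dest: Path) -> list[Path]:
    """Copy every file under `root` into `dest`, leaving no subdirectories.

    Hypothetical helper: the underscore naming scheme is an assumption,
    not necessarily what Scraper.py does.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # Collect the file list before copying so the walk is not affected by new files.
    sources = sorted(p for p in root.rglob("*") if p.is_file())
    for src in sources:
        # docs/api/ref.html -> docs_api_ref.html
        relative = src.relative_to(root)
        target = dest / "_".join(relative.parts)
        shutil.copy2(src, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

For example, `flatten_directory(Path("site"), Path("flat"))` would copy `site/docs/index.html` to `flat/docs_index.html`. The destination should sit outside the root folder. Directory names that already contain underscores can still produce colliding file names, so a real tool may need a stricter naming scheme.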

About

A Python-based web scraping tool that extracts HTML content and converts it into LaTeX format, with additional features for downloading web pages as PDFs and flattening directory structures.
