Welcome to the AI Web Scraper project! This project is designed to scrape web content, clean it, and parse it using AI techniques. The project leverages Streamlit for the user interface, Selenium for web scraping, and various Python libraries for content processing.
This project was created by Ansh Jain. It provides a simple and efficient way to scrape web content and process it using AI techniques. The project was inspired by a tutorial from Tech With Tim, which made the implementation process easy to understand.
- Web scraping using Selenium
- Content extraction and cleaning
- Parsing content with AI
- User-friendly interface with Streamlit
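The extraction-and-cleaning step can be sketched with BeautifulSoup (a common companion to Selenium in scraping stacks). Note this is an illustrative sketch: the function name `clean_body_content` is an assumption, not necessarily the project's actual API.

```python
from bs4 import BeautifulSoup

def clean_body_content(html: str) -> str:
    """Strip markup and non-content tags from scraped HTML, keeping readable text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style tags, which carry no readable content
    for tag in soup(["script", "style"]):
        tag.extract()
    # Extract text, then drop blank lines and surrounding whitespace
    text = soup.get_text(separator="\n")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```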
Follow these steps to set up the project on your local machine:
1. **Clone the Repository:**

   ```bash
   git clone https://github.com/jansh7784/AI-Web-Scrapper.git
   cd AI-Web-Scrapper
   ```
2. **Create a Virtual Environment:**

   ```bash
   python -m venv venv
   ```

3. **Activate the Virtual Environment:**

   - On Windows:

     ```bash
     .\venv\Scripts\activate
     ```

   - On macOS/Linux:

     ```bash
     source venv/bin/activate
     ```

4. **Install Dependencies:**

   ```bash
   pip install -r requirements.txt
   ```
Follow these steps to run the project:
1. **Run the Streamlit Application:**

   ```bash
   streamlit run main.py
   ```

2. **Open the Application in Your Browser:**

   - The application will open automatically in your default web browser. If it doesn't, navigate to `http://localhost:8501`.

3. **Enter the Website URL:**

   - Enter the URL of the website you want to scrape in the input field and click the "Scrape Website" button.

4. **View the Results:**

   - The scraped content, cleaned content, and parsed content will be displayed on the web interface.
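Between cleaning and display, scrapers of this kind typically split the cleaned text into model-sized chunks and build a prompt for each one before sending it to a language model. The helpers below are a hedged sketch of that step; the names `split_into_chunks` and `build_prompt` are illustrative, and the actual model call is not shown.

```python
def split_into_chunks(text, max_chars=6000):
    """Split cleaned page text into chunks small enough for a model prompt."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def build_prompt(chunk, instruction):
    """Combine one text chunk with the user's parsing instruction."""
    return (
        "You are extracting information from scraped web text.\n"
        f"Instruction: {instruction}\n"
        f"Text:\n{chunk}"
    )
```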
For local development and testing, you can use ChromeDriver instead of Bright Data. ChromeDriver is a standalone server that implements the WebDriver protocol for Chrome. Here are the steps to set it up:
1. **Download ChromeDriver:**

   - Download the ChromeDriver executable from the official site and place it in a directory of your choice.

2. **Update scrape.py:**

   - Modify the `scrape_website` function to use ChromeDriver locally:

     ```python
     from selenium import webdriver
     from selenium.webdriver.chrome.service import Service as ChromeService
     from selenium.webdriver.chrome.options import Options

     def scrape_website(url):
         # Point Selenium at the local ChromeDriver executable
         chrome_service = ChromeService(executable_path='path/to/chromedriver')
         chrome_options = Options()
         driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
         try:
             driver.get(url)
             return driver.page_source
         finally:
             # Always close the browser, even if the page load fails
             driver.quit()
     ```

     Replace `'path/to/chromedriver'` with the actual path to your ChromeDriver executable.
This project was created by Ansh Jain. Special thanks to Tech With Tim for the tutorial that made this implementation easy to understand.
Connect with me on LinkedIn.
This project is licensed under the MIT License. See the LICENSE file for details.