Modular web scraper and ML pipeline for extracting data from websites, storing it in a database, and predicting item values using Decision Tree Regression. Features interactive CLI, dynamic schemas, and configurable scraping.

General-Purpose Web Scraper and ML Predictor

This project is a modular web scraping and machine learning pipeline designed to extract item data from structured websites, store it in a database, and interactively predict item values using a trained ML model.

Features

  • Paginated web scraping using BeautifulSoup
  • Configurable selectors for categories and sub-items
  • Data storage with dynamic table creation
  • Filtering and cleaning of scraped data
  • Interactive CLI for:
    • Adding data to database
    • Fetching data for machine learning
    • Predicting values using Decision Tree Regression
  • General-purpose schema for categorical and numeric features
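The selector-driven scraping idea above can be sketched as follows. This is a minimal illustration, not the project's actual code: the selector names, HTML, and `parse_page` function are all made up for the example; only the use of BeautifulSoup (which is in requirements.txt) matches the project.

```python
# Hypothetical sketch: configurable CSS selectors applied to a page's HTML.
# Selector keys, the sample HTML, and parse_page are illustrative only.
from bs4 import BeautifulSoup

SELECTORS = {"item": "div.item", "name": "span.name", "price": "span.price"}

def parse_page(html, selectors=SELECTORS):
    """Extract (name, price) pairs from one page of item HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(selectors["item"]):
        name = item.select_one(selectors["name"]).get_text(strip=True)
        price = item.select_one(selectors["price"]).get_text(strip=True)
        rows.append((name, price))
    return rows

page = """
<div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">4.50</span></div>
"""
print(parse_page(page))  # [('Widget', '9.99'), ('Gadget', '4.50')]
```

For paginated scraping, the same parser would be called in a loop over page URLs until a page yields no items.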

Installation

  1. Clone the repository:

     git clone https://github.com/your-username/dt_predictor.git
     cd dt_predictor

  2. Install dependencies (Python 3.7 or higher is required):

     pip install -r requirements.txt

How to Run the Project

To launch the scraping and machine learning pipeline:

python app.py

You will be prompted to:

  • Add data (scrape & insert into database)
  • Fetch data (load from database into ML pipeline)
  • Predict values interactively

Example Workflow

  1. Choose (a) to scrape and insert data.
  2. Choose (f) to fetch data and prepare features and labels.
  3. Predict values based on user input using the trained model.
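The menu dispatch behind these choices can be sketched like this. It is an illustrative assumption, not app.py's actual code: the handler names, the "p" and "q" options, and the loop structure are invented for the example; only the (a) and (f) choices come from the workflow above.

```python
# Hypothetical sketch of an interactive menu loop; the real prompts and
# handlers in app.py may differ. "p" (predict) and "q" (quit) are assumed.
def run_cli(choice_stream, handlers):
    """Dispatch single-letter menu choices to handlers until 'q'."""
    for choice in choice_stream:
        if choice == "q":
            return "quit"
        action = handlers.get(choice)
        if action is None:
            print(f"Unknown option: {choice}")
        else:
            action()

log = []
handlers = {
    "a": lambda: log.append("scrape & insert"),   # Add data
    "f": lambda: log.append("fetch & prepare"),   # Fetch data
    "p": lambda: log.append("predict"),           # Predict values
}
run_cli(["a", "f", "p", "q"], handlers)
print(log)  # ['scrape & insert', 'fetch & prepare', 'predict']
```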

Machine Learning Model

  • Utilizes DecisionTreeRegressor from scikit-learn
  • Categorical features handled with LabelEncoder
  • Supports dynamic schemas for categorical and numeric features
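A minimal sketch of this modeling approach, under stated assumptions: the column names, sample data, and exact feature layout are invented for the example; only the use of scikit-learn's LabelEncoder and DecisionTreeRegressor matches the project.

```python
# Illustrative sketch: encode a categorical column with LabelEncoder,
# then fit a DecisionTreeRegressor on [category_code, numeric_feature].
# All data and column choices here are hypothetical.
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

categories = ["book", "toy", "book", "toy"]
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 20.0, 30.0, 40.0]

encoder = LabelEncoder()
cat_codes = encoder.fit_transform(categories)  # e.g. book -> 0, toy -> 1

X = [[code, size] for code, size in zip(cat_codes, sizes)]
model = DecisionTreeRegressor(random_state=0)
model.fit(X, prices)

# An unconstrained tree memorizes distinct training rows, so querying a
# known (category, size) pair returns that row's training price.
pred = model.predict([[encoder.transform(["book"])[0], 3.0]])[0]
print(pred)  # 30.0
```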

Project Structure

.
├── app.py                      # Main pipeline entry point
├── database/                   # Database-related modules
│   ├── __init__.py
│   ├── create_database_tables.py
│   ├── crud.py
│   └── database_connection.py
├── scraper/                    # Web scraping components
│   ├── __init__.py
│   ├── construct_url.py
│   ├── scraper_utils.py
│   └── selector.py
├── src/                        # Core ML and data processing pipeline
│   ├── __init__.py
│   ├── data_pipeline.py
│   ├── prediction.py
│   └── utils.py
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── LICENSE                     # Project license
└── .gitignore                  # Git ignored files
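The "dynamic table creation" handled by the database/ modules can be sketched with the standard-library sqlite3 module. This is an assumption for illustration only: the project's actual database backend, table names, and schema helpers are not specified here, and `create_table` is a hypothetical function.

```python
# Hypothetical sketch of dynamic table creation: a table's columns come
# from a runtime schema dict rather than being hard-coded. sqlite3 is
# used purely for illustration; the project's backend may differ.
import sqlite3

def create_table(conn, table, columns):
    """Create a table whose columns are defined by a schema dict."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")

conn = sqlite3.connect(":memory:")
schema = {"name": "TEXT", "category": "TEXT", "price": "REAL"}
create_table(conn, "items", schema)
conn.execute("INSERT INTO items VALUES (?, ?, ?)", ("Widget", "toy", 9.99))
row = conn.execute("SELECT name, price FROM items").fetchone()
print(row)  # ('Widget', 9.99)
```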

Requirements

  • Python 3.7+
  • requests
  • beautifulsoup4
  • scikit-learn

Disclaimer

This tool is intended for educational and research purposes only. Users are responsible for ensuring that their use of this tool complies with all applicable laws and the terms of service of any websites being scraped.


Author

Developed by Nima Daryabar


License

This project is licensed under the MIT License. See the LICENSE file for details.
