Modular web scraper and ML pipeline for extracting data from websites, storing it in a database, and predicting item values using Decision Tree Regression. Features interactive CLI, dynamic schemas, and configurable scraping.

General-Purpose Web Scraper and ML Predictor

This project is a modular web scraping and machine learning pipeline designed to extract item data from structured websites, store it in a database, and interactively predict item values using a trained ML model.

Features

  • Paginated web scraping using BeautifulSoup
  • Configurable selectors for categories and sub-items
  • Data storage with dynamic table creation
  • Filtering and cleaning of scraped data
  • Interactive CLI for:
    • Adding data to database
    • Fetching data for machine learning
    • Predicting values using Decision Tree Regression
  • General-purpose schema for categorical and numeric features
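The selector-driven scraping idea above can be sketched as follows. This is a minimal illustration, not the project's actual code: the selector names, HTML, and `parse_page` function are all made up for the example; only the use of BeautifulSoup (which is in requirements.txt) matches the project.

```python
# Hypothetical sketch: configurable CSS selectors applied to a page's HTML.
# Selector keys, the sample HTML, and parse_page are illustrative only.
from bs4 import BeautifulSoup

SELECTORS = {"item": "div.item", "name": "span.name", "price": "span.price"}

def parse_page(html, selectors=SELECTORS):
    """Extract (name, price) pairs from one page of item HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(selectors["item"]):
        name = item.select_one(selectors["name"]).get_text(strip=True)
        price = item.select_one(selectors["price"]).get_text(strip=True)
        rows.append((name, price))
    return rows

page = """
<div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">4.50</span></div>
"""
print(parse_page(page))  # [('Widget', '9.99'), ('Gadget', '4.50')]
```

For paginated scraping, the same parser would be called in a loop over page URLs until a page yields no items.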

Installation

  1. Clone the repository:

     git clone https://github.com/your-username/dt_predictor.git
     cd dt_predictor

  2. Install dependencies (Python 3.7 or higher is required):

     pip install -r requirements.txt

How to Run the Project

To launch the scraping and machine learning pipeline:

python app.py

You will be prompted to:

  • Add data (scrape & insert into database)
  • Fetch data (load from database into ML pipeline)
  • Predict values interactively

Example Workflow

  1. Choose (a) to scrape and insert data.
  2. Choose (f) to fetch data and prepare features and labels.
  3. Predict values based on user input using the trained model.
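The menu dispatch behind these choices can be sketched like this. It is an illustrative assumption, not app.py's actual code: the handler names, the "p" and "q" options, and the loop structure are invented for the example; only the (a) and (f) choices come from the workflow above.

```python
# Hypothetical sketch of an interactive menu loop; the real prompts and
# handlers in app.py may differ. "p" (predict) and "q" (quit) are assumed.
def run_cli(choice_stream, handlers):
    """Dispatch single-letter menu choices to handlers until 'q'."""
    for choice in choice_stream:
        if choice == "q":
            return "quit"
        action = handlers.get(choice)
        if action is None:
            print(f"Unknown option: {choice}")
        else:
            action()

log = []
handlers = {
    "a": lambda: log.append("scrape & insert"),   # Add data
    "f": lambda: log.append("fetch & prepare"),   # Fetch data
    "p": lambda: log.append("predict"),           # Predict values
}
run_cli(["a", "f", "p", "q"], handlers)
print(log)  # ['scrape & insert', 'fetch & prepare', 'predict']
```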

Machine Learning Model

  • Utilizes DecisionTreeRegressor from scikit-learn
  • Categorical features handled with LabelEncoder
  • Supports dynamic schemas for categorical and numeric features
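A minimal sketch of this modeling approach, under stated assumptions: the column names, sample data, and exact feature layout are invented for the example; only the use of scikit-learn's LabelEncoder and DecisionTreeRegressor matches the project.

```python
# Illustrative sketch: encode a categorical column with LabelEncoder,
# then fit a DecisionTreeRegressor on [category_code, numeric_feature].
# All data and column choices here are hypothetical.
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

categories = ["book", "toy", "book", "toy"]
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [10.0, 20.0, 30.0, 40.0]

encoder = LabelEncoder()
cat_codes = encoder.fit_transform(categories)  # e.g. book -> 0, toy -> 1

X = [[code, size] for code, size in zip(cat_codes, sizes)]
model = DecisionTreeRegressor(random_state=0)
model.fit(X, prices)

# An unconstrained tree memorizes distinct training rows, so querying a
# known (category, size) pair returns that row's training price.
pred = model.predict([[encoder.transform(["book"])[0], 3.0]])[0]
print(pred)  # 30.0
```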

Project Structure

.
├── app.py                      # Main pipeline entry point
├── database/                   # Database-related modules
│   ├── __init__.py
│   ├── create_database_tables.py
│   ├── crud.py
│   └── database_connection.py
├── scraper/                    # Web scraping components
│   ├── __init__.py
│   ├── construct_url.py
│   ├── scraper_utils.py
│   └── selector.py
├── src/                        # Core ML and data processing pipeline
│   ├── __init__.py
│   ├── data_pipeline.py
│   ├── prediction.py
│   └── utils.py
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── LICENSE                     # Project license
└── .gitignore                  # Git ignored files
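The "dynamic table creation" handled by the database/ modules can be sketched with the standard-library sqlite3 module. This is an assumption for illustration only: the project's actual database backend, table names, and schema helpers are not specified here, and `create_table` is a hypothetical function.

```python
# Hypothetical sketch of dynamic table creation: a table's columns come
# from a runtime schema dict rather than being hard-coded. sqlite3 is
# used purely for illustration; the project's backend may differ.
import sqlite3

def create_table(conn, table, columns):
    """Create a table whose columns are defined by a schema dict."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")

conn = sqlite3.connect(":memory:")
schema = {"name": "TEXT", "category": "TEXT", "price": "REAL"}
create_table(conn, "items", schema)
conn.execute("INSERT INTO items VALUES (?, ?, ?)", ("Widget", "toy", 9.99))
row = conn.execute("SELECT name, price FROM items").fetchone()
print(row)  # ('Widget', 9.99)
```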

Requirements

  • Python 3.7+
  • requests
  • beautifulsoup4
  • scikit-learn

Disclaimer

This tool is intended for educational and research purposes only. Users are responsible for ensuring that their use of this tool complies with all applicable laws and the terms of service of any websites being scraped.


Author

Developed by Nima Daryabar


License

This project is licensed under the MIT License. See the LICENSE file for details.
