This project is a modular web scraping and machine learning pipeline designed to extract item data from structured websites, store it in a database, and interactively predict item values using a trained ML model.
- Paginated web scraping using BeautifulSoup
- Configurable selectors for categories and sub-items
- Data storage with dynamic table creation
- Filtering and cleaning of scraped data
- Interactive CLI for:
  - Adding data to the database
  - Fetching data for machine learning
  - Predicting values using Decision Tree Regression
- General-purpose schema for categorical and numeric features
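The dynamic table creation and general-purpose schema listed above could look roughly like the sketch below. It is not taken from `database/create_database_tables.py`: it assumes SQLite via the standard-library `sqlite3` module (the project's actual database engine is not stated here), and the names `create_table`, `insert_item`, `quote`, and the `items` table with its columns are hypothetical.

```python
import sqlite3


def quote(name):
    """Quote an SQL identifier, rejecting anything that is not a plain name."""
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def create_table(conn, table, categorical, numeric):
    """Create a table with a TEXT column per categorical feature
    and a REAL column per numeric feature (hypothetical helper)."""
    columns = ["id INTEGER PRIMARY KEY AUTOINCREMENT"]
    columns += [f"{quote(c)} TEXT" for c in categorical]
    columns += [f"{quote(n)} REAL" for n in numeric]
    column_sql = ", ".join(columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quote(table)} ({column_sql})")


def insert_item(conn, table, item):
    """Insert one scraped item, given as a {column: value} dict."""
    columns = list(item)
    column_sql = ", ".join(quote(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO {quote(table)} ({column_sql}) VALUES ({placeholders})",
        [item[c] for c in columns],
    )


# Example with an in-memory database and made-up feature names.
conn = sqlite3.connect(":memory:")
create_table(conn, "items", categorical=["brand"], numeric=["mileage", "price"])
insert_item(conn, "items", {"brand": "acme", "mileage": 1200, "price": 9500})
row = conn.execute("SELECT brand, mileage, price FROM items").fetchone()
```

Values are passed as bound parameters, while table and column names are validated and quoted, since identifiers cannot be bound as parameters in SQL.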
- Clone the repository:

      git clone https://github.com/your-username/dt_predictor.git
      cd dt_predictor

- Install dependencies (ensure Python 3.7 or higher is installed):

      pip install -r requirements.txt

To launch the scraping and machine learning pipeline:

    python app.py

You will be prompted to:
- Add data (scrape & insert into database)
- Fetch data (load from database into ML pipeline)
- Predict values interactively
- Choose `(a)` to scrape and insert data.
- Choose `(f)` to fetch data and prepare features and labels.
- Predict values based on user input using the trained model.
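The menu choices above can be dispatched with a small lookup table. The following is only a sketch of that idea, not the code in `app.py`: the function names `add_data`, `fetch_data`, `handle_choice`, and `main` are hypothetical, and the two actions are placeholders that return a message instead of scraping or querying anything.

```python
def add_data():
    # Placeholder: the real pipeline would scrape and insert rows here.
    return "scraped and inserted"


def fetch_data():
    # Placeholder: the real pipeline would load rows and build features/labels.
    return "features and labels prepared"


# Map each menu key to the function that handles it.
ACTIONS = {"a": add_data, "f": fetch_data}


def handle_choice(choice):
    """Normalize the user's input and run the matching action."""
    action = ACTIONS.get(choice.strip().lower())
    if action is None:
        return f"Unknown option {choice!r}; expected one of: a, f"
    return action()


def main():
    """Interactive entry point; call main() to try it from a terminal."""
    choice = input("(a)dd data or (f)etch data? ")
    print(handle_choice(choice))
```

Keeping the input handling in `handle_choice` separate from the `input()` call makes the menu logic easy to test without a terminal.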
- Utilizes `DecisionTreeRegressor` from `scikit-learn`
- Categorical features handled with `LabelEncoder`
- Supports dynamic schemas for categorical and numeric features
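As a rough illustration of how these pieces fit together, the sketch below encodes categorical columns with `LabelEncoder` and fits a `DecisionTreeRegressor`. It is not the code in `src/prediction.py`; the rows, the feature names (`brand`, `condition`, `mileage`), and the `encode` helper are invented for the example.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

# Made-up rows for illustration: (brand, condition, mileage, price)
rows = [
    ("acme", "new", 1000, 9000.0),
    ("acme", "used", 50000, 4000.0),
    ("bolt", "new", 2000, 12000.0),
    ("bolt", "used", 60000, 5000.0),
]

# Hypothetical schema: the first two columns are categorical.
categorical = ["brand", "condition"]

# Fit one LabelEncoder per categorical column so each keeps its own mapping.
encoders = {
    name: LabelEncoder().fit([row[i] for row in rows])
    for i, name in enumerate(categorical)
}


def encode(brand, condition, mileage):
    """Turn one raw item into a numeric feature vector."""
    return [
        encoders["brand"].transform([brand])[0],
        encoders["condition"].transform([condition])[0],
        mileage,
    ]


X = [encode(*row[:3]) for row in rows]  # features
y = [row[3] for row in rows]            # labels (the value to predict)

model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Predict the value for a new item described by the user.
prediction = model.predict([encode("bolt", "new", 2000)])[0]
```

Note that `LabelEncoder.transform` raises a `ValueError` for a category it did not see during fitting, so interactive input should be checked against the known categories before encoding.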
.
├── app.py # Main pipeline entry point
├── database/ # Database-related modules
│ ├── __init__.py
│ ├── create_database_tables.py
│ ├── crud.py
│ └── database_connection.py
├── scraper/ # Web scraping components
│ ├── __init__.py
│ ├── construct_url.py
│ ├── scraper_utils.py
│ └── selector.py
├── src/ # Core ML and data processing pipeline
│ ├── __init__.py
│ ├── data_pipeline.py
│ ├── prediction.py
│ └── utils.py
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── LICENSE # Project license
└── .gitignore # Git ignored files
- Python 3.7+
- requests
- beautifulsoup4
- scikit-learn
This tool is intended for educational and research purposes only. Users are responsible for ensuring that their use of this tool complies with all applicable laws and the terms of service of any websites being scraped.
Developed by Nima Daryabar
This project is licensed under the MIT License. See the LICENSE file for details.