This is a failed project, kept only so that team members can investigate further; it cannot fully scrape the information from ComBase.

HedgehogsGX/ComBase-Scraper

ComBase Scraper

A simple ComBase data scraper with an English interface.

Quick Start

  1. Install dependencies:

pip install -r config/requirements.txt

  2. Run the scraper:

Single Thread (Simple):

python simple_scraper.py

Parallel (10 Threads - Faster):

python parallel_scraper.py

  3. Press Ctrl+C to stop safely

Features

  • Parallel Processing: 10 threads, for up to roughly a 10x speedup over the single-threaded scraper
  • Search Delay: 2-minute wait after search before scraping starts
  • Deduplication: Removes duplicated food-description parts from organism names
  • Thread-Safe: Real-time progress tracking across all threads
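
The parallel and thread-safe features listed above could be sketched roughly as follows. This is a minimal illustration, not the code in parallel_scraper.py: `fetch_record`, `Progress`, and `scrape_all` are hypothetical names, and the fetch step is a placeholder rather than a real ComBase request. Only the thread count of 10 comes from this README.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 10  # thread count stated in the feature list


class Progress:
    """Counter shared by all workers; the lock keeps increments consistent."""

    def __init__(self, total):
        self.total = total
        self.done = 0
        self._lock = threading.Lock()

    def advance(self):
        with self._lock:
            self.done += 1
            return self.done


def fetch_record(record_id):
    # Hypothetical placeholder: real code would request and parse a page here.
    return {"id": record_id, "name": f"organism-{record_id}"}


def scrape_all(record_ids, fetch=fetch_record, workers=NUM_THREADS):
    progress = Progress(len(record_ids))

    def work(record_id):
        record = fetch(record_id)
        count = progress.advance()
        print(f"\rscraped {count}/{progress.total}", end="", flush=True)
        return record

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as record_ids
        return list(pool.map(work, record_ids))
```

Threads help here only because scraping is dominated by waiting on the network, so a 10x speedup is an upper bound rather than a guarantee.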

Output

  • Data saved to data/ directory
  • Each file contains 1,000 records
  • Complete organism names with ID, name, and food description
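
One way the 1,000-records-per-file layout could be produced is sketched below. The JSON format, the `records_0001.json` naming, and the `write_chunks` helper are assumptions made for illustration; the README only states the data/ directory, the 1,000-record file size, and the ID, name, and food description fields.

```python
import json
from pathlib import Path

RECORDS_PER_FILE = 1000  # file size stated in the Output section


def write_chunks(records, out_dir="data", chunk_size=RECORDS_PER_FILE):
    """Write records to numbered JSON files holding chunk_size records each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        # Hypothetical file naming: records_0001.json, records_0002.json, ...
        path = out / f"records_{start // chunk_size + 1:04d}.json"
        path.write_text(
            json.dumps(chunk, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        paths.append(path)
    return paths
```

With this scheme the last file holds whatever remains, so it may contain fewer than 1,000 records.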
