Crawl-InstantGaming

Introduction

This repository, launched in 2024, crawls game data from instant-gaming.

The plain pipeline can be described as follows:

    # get_game_card(url): crawl the basic information about the game card,
    # including `name`, `description`, `image_url` and some of its `categories`.
    # download_image(image_url): download the image of the game.

    from collections import deque

    url_list = deque([...])  # Queue of all game URLs that need to be crawled.

    while url_list:
        target_url = url_list.popleft()  # Take the oldest URL first (FIFO).
        game_card = get_game_card(target_url)
        download_image(game_card["image_url"])
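As a rough illustration of what `get_game_card()` might do once a page has been fetched, here is a minimal sketch that builds a card from HTML. It assumes the page exposes Open Graph `<meta>` tags (`og:title`, `og:description`, `og:image`); that is an assumption for illustration, not a description of the real instant-gaming markup or of this repository's parser. The helper name `parse_game_card` is hypothetical, and `categories` are left out because their markup is unknown.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.meta[prop] = attrs["content"]


def parse_game_card(html: str) -> dict:
    """Build a game card from page HTML.

    Assumption: the page carries Open Graph tags. Missing tags yield None.
    """
    collector = _MetaCollector()
    collector.feed(html)
    return {
        "name": collector.meta.get("og:title"),
        "description": collector.meta.get("og:description"),
        "image_url": collector.meta.get("og:image"),
    }


# Hand-written sample page, not real site output.
sample = """
<html><head>
<meta property="og:title" content="Example Game">
<meta property="og:description" content="A short description.">
<meta property="og:image" content="https://example.com/cover.jpg">
</head></html>
"""
card = parse_game_card(sample)
print(card["name"], card["image_url"])
```

Keeping the parsing step separate from the network request makes it possible to test the parser against saved HTML without hitting the website.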

The more advanced pipeline adds an API that treats Google Drive as cloud storage and stores the crawled data there. More specifically, I set up a timer so that the plain pipeline runs for h hours and then rests for k hours. During the rest period, the API is called to upload the crawled data.
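The run/rest schedule described above could be sketched roughly as follows. This is not the repository's actual timer: `run_with_rest`, `crawl_one` and `upload` are hypothetical names, and the Google Drive call is represented only by the `upload` callback. The clock and sleep functions are injectable so the loop can be exercised without waiting for hours.

```python
import time
from collections import deque


def run_with_rest(urls, crawl_one, upload, run_seconds, rest_seconds,
                  clock=time.monotonic, sleep=time.sleep):
    """Alternate between crawling for run_seconds and resting for rest_seconds.

    crawl_one(url) and upload() are hypothetical callbacks: the first crawls
    one game page, the second pushes the crawled data to cloud storage.
    """
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    queue = deque(urls)
    while queue:
        deadline = clock() + run_seconds
        # Work phase: crawl until the time budget is spent or the queue is empty.
        while queue and clock() < deadline:
            crawl_one(queue.popleft())
        # Rest phase: upload what was crawled, then wait out the remaining rest.
        rest_start = clock()
        upload()
        if queue:
            remaining = rest_seconds - (clock() - rest_start)
            if remaining > 0:
                sleep(remaining)
```

With h and k given in hours, a call would pass `run_seconds=h * 3600` and `rest_seconds=k * 3600`. The final rest is skipped once the queue is empty, so the process exits right after the last upload.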

Usage

Download the repository

    git clone https://github.com/MinLee0210/Crawl-Instantgaming.git
    cd Crawl-Instantgaming
    pip install -r requirements.txt

Run the project

    python main.py

NOTE: At the moment, the project can only run on a local machine. Executing the app in the cloud to scrape the information in real time would be a good improvement.

Suggested projects

As an AI enthusiast, I believe the data can be used for:

Game classification via description.
Game search via description (using RAG systems).
Game search via image (using CLIP).

Comments

Relies too much on the basics. It could use a queue that supports real-time scraping without prior knowledge of the number of pages on the website. It works, but it can be better.
Poor code quality.
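One way to approach the queue idea in the first comment is a crawl frontier that discovers pages as it goes, so the number of pages never has to be known up front. The sketch below is only an illustration under stated assumptions: `crawl_frontier` and `fetch_links` are hypothetical names, `fetch_links(url)` stands in for whatever code fetches a page and extracts its links, and the toy `site` dict replaces the real website.

```python
from collections import deque


def crawl_frontier(seed_url, fetch_links, max_pages=1000):
    """Breadth-first crawl that enqueues newly discovered URLs.

    fetch_links(url) is a hypothetical callback returning the URLs found on
    that page. max_pages is a safety cap, not a known page count.
    """
    frontier = deque([seed_url])
    seen = {seed_url}  # URLs already queued, to avoid crawling them twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy site: each listing page links to its games and to the next page.
site = {
    "/page/1": ["/game/a", "/page/2"],
    "/page/2": ["/game/b", "/page/3", "/page/1"],
    "/page/3": ["/game/c"],
}
order = crawl_frontier("/page/1", lambda url: site.get(url, []))
print(order)
```

The crawl stops on its own when no page yields an unseen link, which is what removes the need to hard-code the page count.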

Reference

https://medium.com/kariyertech/web-crawling-general-perspective-713971e9c659
