Skip to content

NCTU-SYNC/sync-crawler

Repository files navigation

SYNC Crawler

Description

This is a crawler for the SYNC project. It crawls news articles from different news websites and stores them in a database.

Installation

uv sync --frozen

You can also install with the following flags:

  • --no-dev: Optional. If you don't want to install dev dependencies.
  • --extra database: To install dependencies for storing news in MongoDB and Qdrant. You might need GPUs to execute embedding models.
  • --extra migration: To install dependencies for migrating data from MongoDB to Qdrant. You might need GPUs to execute embedding models.

Environment Variables

The following environment variables are only needed if you want to store crawled data in the database. The core crawler functionality doesn't require these variables. They should be defined in a .env file:

Variable Description
MONGO_URL MongoDB connection string
MONGO_DATABASE MongoDB database name
MONGO_COLLECTION MongoDB collection name for storing news articles
QDRANT_HOST Hostname or IP address of the Qdrant vector database
QDRANT_PORT Port number for the Qdrant server
QDRANT_COLLECTION Qdrant collection name for storing news embeddings

Configuration

The following configuration is only needed for database operations. The core crawler can function without this configuration.

The crawler uses a TOML configuration file located at configs/config.toml with the following options:

Qdrant Configuration

[qdrant]
embedding_model = 'moka-ai/m3e-base'  # The embedding model used for vectorizing news content

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages