SYNC Crawler

Description

SYNC Crawler is the news crawler for the SYNC project. It collects news articles from multiple news websites and can optionally store them in a database.

Installation

uv sync --frozen

The following optional flags can be added to the command:

--no-dev: Skip the development dependencies.
--extra database: Install the dependencies for storing news in MongoDB and Qdrant. Running the embedding models may require a GPU.
--extra migration: Install the dependencies for migrating data from MongoDB to Qdrant. Running the embedding models may require a GPU.

Environment Variables

These environment variables are needed only if you want to store crawled data in the database; the core crawler does not require them. Define them in a .env file:

Variable	Description
MONGO_URL	MongoDB connection string
MONGO_DATABASE	MongoDB database name
MONGO_COLLECTION	MongoDB collection name for storing news articles
QDRANT_HOST	Hostname or IP address of the Qdrant vector database
QDRANT_PORT	Port number for the Qdrant server
QDRANT_COLLECTION	Qdrant collection name for storing news embeddings
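
As a rough illustration of how these variables might be consumed, the sketch below reads the six names from the table and fails early if any is missing. The variable names come from the table above; the DatabaseSettings class and the load_database_settings function are hypothetical helpers written for this example and are not part of sync_crawler. It also assumes QDRANT_PORT holds an integer.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Names taken from the table above; everything else here is illustrative.
REQUIRED_VARS = (
    "MONGO_URL",
    "MONGO_DATABASE",
    "MONGO_COLLECTION",
    "QDRANT_HOST",
    "QDRANT_PORT",
    "QDRANT_COLLECTION",
)


@dataclass(frozen=True)
class DatabaseSettings:
    """Hypothetical container for the settings (not part of sync_crawler)."""

    mongo_url: str
    mongo_database: str
    mongo_collection: str
    qdrant_host: str
    qdrant_port: int
    qdrant_collection: str


def load_database_settings(env: Mapping[str, str] = os.environ) -> DatabaseSettings:
    """Read the six variables from `env`, raising KeyError if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return DatabaseSettings(
        mongo_url=env["MONGO_URL"],
        mongo_database=env["MONGO_DATABASE"],
        mongo_collection=env["MONGO_COLLECTION"],
        qdrant_host=env["QDRANT_HOST"],
        qdrant_port=int(env["QDRANT_PORT"]),  # assumed to be an integer port
        qdrant_collection=env["QDRANT_COLLECTION"],
    )
```

Calling load_database_settings() with no argument reads from the process environment, so the values in the .env file need to be loaded into it first. How the project itself does that is not shown here; a loader such as python-dotenv would be one option.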

Configuration

The following configuration is only needed for database operations. The core crawler can function without this configuration.

The crawler uses a TOML configuration file located at configs/config.toml with the following options:

Qdrant Configuration

[qdrant]
embedding_model = 'moka-ai/m3e-base'  # The embedding model used for vectorizing news content
