This project is designed to collect and manage VTuber-related dialogue data, supporting data isolation for multiple live stream rooms.
Current Architecture:
- Data Synchronization: JSON data generated by the crawlers is synchronized to the local environment via file transfer (e.g., SCP; see the example after this list).
- Data Processing: Local scripts read data files and use Pydantic models for validation and cleaning.
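For example, the synchronization step might look like the following (the hostname and remote path are hypothetical; the local target is the data/raw/danmaku directory shown in the project tree below):

scp crawler@example-host:/srv/crawler/output/*.jsonl data/raw/danmaku/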
This project uses uv for package management. Python version >=3.12 is required.
Install dependencies:
uv sync
Install PyTorch (with CUDA support):
Select the appropriate CUDA version for your graphics card driver (the example below is for CUDA 12.4):
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Requirements:
- Python 3.12+
- Pydantic: Used for data structure definition and validation.
- PyTorch: Deep learning framework (CUDA support recommended).
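To verify that the CUDA-enabled build works, a quick sanity check can be run (torch.cuda.is_available() is part of the standard PyTorch API):

uv run python -c "import torch; print(torch.cuda.is_available())"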
- Run tests:
uv run pytest
- Add a dependency:
uv add <package_name>
The core of the project defines two data models (located in src/neuro_sama/models/):
- Stream: Live stream metadata (ID, title, streamer, timestamp, etc.).
- Dialogue: Dialogue data (question, answer, timestamp, confidence, etc.).
These models are used to validate the JSON data format synchronized from the crawler.
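As an illustration, here is a minimal sketch of what these models might look like (field names and types are assumptions inferred from the descriptions above; the authoritative definitions live in src/neuro_sama/models/stream.py and src/neuro_sama/models/dialogue.py):

from datetime import datetime
from pydantic import BaseModel, Field

class Stream(BaseModel):
    # Live stream metadata (hypothetical field set).
    id: str
    title: str
    streamer: str
    timestamp: datetime

class Dialogue(BaseModel):
    # One question/answer pair (hypothetical field set);
    # confidence is assumed to be a score in [0.0, 1.0].
    question: str
    answer: str
    timestamp: datetime
    confidence: float = Field(ge=0.0, le=1.0)

Validating one line of synchronized JSONL data is then a single call, e.g. Dialogue.model_validate(json.loads(line)).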
Neuro-sama
├─ config.py
├─ data
│ ├─ cleaned
│ │ └─ 7589012_pend.jsonl
│ ├─ events
│ │ ├─ alignment
│ │ └─ spam
│ └─ raw
│ ├─ audio
│ └─ danmaku
│ └─ 7589012.jsonl
├─ DEV_LOG.md
├─ Dockerfile
├─ main.py
├─ PROJECT_STRUCTURE.md
├─ pyproject.toml
├─ README.md
├─ ROADMAP.md
├─ src
│ └─ neuro_sama
│ ├─ models
│ │ ├─ dialogue.py
│ │ ├─ stream.py
│ │ └─ __init__.py
│ ├─ parser
│ │ ├─ parse_jsonl.py
│ │ ├─ screen_spam.py
│ │ └─ __init__.py
│ └─ __init__.py
├─ test
│ ├─ test_models.py
│ ├─ test_parser.py
│ └─ test_spam.py
├─ TESTING.md
└─ uv.lock