44 commits
d87717f
feat: replace Whisper with Smart-Turn VAD, preserve diarization
NikiPshg Aug 3, 2025
8cbfa10
feat(transcription): add new ASR models + ROVER integration
NikiPshg Aug 9, 2025
2cc427c
feat(speech-processing): major pipeline overhaul
NikiPshg Sep 18, 2025
6272fca
feat: optimizations and updates from pyannote 3.1 to pyannote/speaker…
NikiPshg Oct 5, 2025
29311ee
feat: removed support for launching via separate bash scripts
NikiPshg Oct 5, 2025
24f5a51
feat(libs): add NISQA and Smart-VAD libraries as local dependencies
NikiPshg Oct 5, 2025
7f45964
feat: auto-download models on startup and add batching to music detector
NikiPshg Oct 6, 2025
0665296
feat(docs): minimum instructions for use
NikiPshg Oct 6, 2025
7e7a57e
docs: minimum instructions for use
NikiPshg Oct 6, 2025
549a4e9
fix: correct LM download
NikiPshg Oct 8, 2025
f3f370f
fix(diarization): switching to speaker-diarization-community-1
NikiPshg Oct 8, 2025
f6c817f
fix(accent): fixed cuda support
NikiPshg Oct 9, 2025
61eddd1
feat: addded ability to run collate without preprocess block
NikiPshg Oct 24, 2025
f8c111f
feat: replaced external links with mtuci-related ones
NikiPshg Oct 24, 2025
36f78c8
feat: update transcribation, separation for optimization
NikiPshg Nov 20, 2025
ace522e
feat: removing Nisqa from libs, update transcription(tone)
NikiPshg Dec 8, 2025
f7853e5
feat: upgrade to gigaamv3
NikiPshg Dec 8, 2025
319901f
feat: dinamic consensus_num
NikiPshg Dec 11, 2025
894dbe6
feat: update imports, fix transcription block
NikiPshg Dec 15, 2025
1dff6bb
feat: add volume normalization and deletion of files that do not go t…
NikiPshg Dec 26, 2025
ea2ed34
docs: update readme
NikiPshg Dec 26, 2025
5ce8462
docs: update readme
NikiPshg Dec 26, 2025
0e593c1
feat(silence):add silence detect, update read me
NikiPshg Jan 18, 2026
f099c44
feat: update order separation_yaml
NikiPshg Jan 31, 2026
c9e6e9b
feat: change transription backend
Feb 18, 2026
36e599c
feat: fix bugs transcription
Feb 21, 2026
2cee2d6
feat: update nisqa work
Feb 22, 2026
b7f406e
feat: update trt work and readme
Feb 23, 2026
0421371
feat: add distillMos and fix nisqa bug
NikiPshg Feb 24, 2026
9762a3e
add benchmarking
SLENSER0 Mar 9, 2026
ad32884
feat: upload sortformer and remove pyannote,nisqa,silero
NikiPshg Mar 9, 2026
db49b76
feat: update csv update in distillmos
NikiPshg Mar 10, 2026
03aaeae
feat: auto-load sortformer
NikiPshg Mar 10, 2026
46669a5
feat: update config
NikiPshg Mar 10, 2026
d32995a
fear: update transctipton trt cache path, clear repo
NikiPshg Mar 10, 2026
22ae752
feat: update SmartVAD 3.0 --> 3.2
NikiPshg Mar 10, 2026
0701a6e
Merge pull request #9 from SLENSER0/benchmarking
NikiPshg Mar 19, 2026
678dd27
docs: update sortformer readme
NikiPshg Mar 19, 2026
c71b74e
feat: add webdataset support
NikiPshg Mar 19, 2026
f77f1e5
feat: add trt for ruaccent
NikiPshg Mar 19, 2026
ad2a3ec
feat: add trt accent for each cuda
NikiPshg Mar 19, 2026
414d370
docs: update readme files
NikiPshg Mar 20, 2026
f7e6a63
fix: fix benchmarking code
NikiPshg Mar 20, 2026
a308fb9
docs: update read me
NikiPshg Mar 25, 2026
7 changes: 5 additions & 2 deletions .gitignore
@@ -1,4 +1,5 @@
*.csv
.torch_hub/*
*.mp3
*.wav
test.py
@@ -7,7 +8,8 @@ chromadb
cl_*.py
*.ipynb
*.bin

temp/*
models/*
.env
__pycache__/
*.py[cod]
@@ -70,4 +72,5 @@ ipython_config.py
.pdm-build/

__pypackages__/
venv/*
venv/*
cache
281 changes: 55 additions & 226 deletions README.md
@@ -1,278 +1,107 @@
# Balalaika Pipeline

A complete production-ready pipeline for processing podcast audio data, from download to feature extraction.
End-to-end speech data processing: ingest, segmentation, quality filtering, multi-model ASR with ROVER, punctuation, lexical stress, G2P, and export to Parquet / WebDataset.

---
Works with Yandex Music podcasts out of the box, or **your own corpus** if you follow the expected layout (see [Preparing your dataset](docs/preparing.md)).

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Installation](#installation)
3. [Data Preparation](#data-preparation)
- [Quick Setup (Default Parameters)](#quick-setup)
- [Custom Metadata Download](#custom-metadata-download)
4. [Running the Pipeline](#running-the-pipeline)
- [Basic Scenario (Local Processing)](#basic-scenario-local-processing)
5. [Configuration](#configuration)
6. [Environment Variables](#environment-variables)
7. [Models](#models)
8. [Citation](#citation)
9. [Acknowledgments](#acknowledgments)
**Pre-built processed datasets** (segmented, filtered, annotated) are published on Hugging Face: **[Balalaika Dataset — MTUCI collection](https://huggingface.co/collections/MTUCI/balalaika-dataset)**.

---

## Prerequisites
## Quick Start

Ensure you have the following tools installed on your system:
### Prerequisites

```bash
sudo apt update && sudo apt install -y ffmpeg
wget -qO- https://astral.sh/uv/install.sh | sh
```


---

## Installation

Clone the repository and set up the environment:
### Installation

```bash
git clone https://github.com/mtuciru/balalaika
cd balalaika
# Use this if you want to annotate/modify the dataset
bash create_dev_env.sh
# Use this if you only want to use the pre-annotated dataset
bash create_user_env.sh
bash create_dev_env.sh # full stack for running the pipeline
# or
bash create_user_env.sh # consume pre-built datasets only
```

---

## Data Preparation

### Quick Setup (Default Parameters)

To download and prepare the dataset with default settings, choose one of the preconfigured dataset sizes:

* **100-hour dataset**
```bash
bash use_meta_100h.sh
```
### Basic setup

* **500-hour dataset**
```bash
bash use_meta_500h.sh
```
1. Create `.env`:

* **1000-hour dataset**
```bash
bash use_meta_1000h.sh
```

* **2000-hour dataset**
```bash
bash use_meta_2000h.sh
```

All metadata can also be downloaded from [Hugging Face – MTUCI](https://huggingface.co/MTUCI).

### Custom Metadata Download

If you already have generated metadata files (`balalaika.parquet` and `balalaika.pkl`), place them in the project root and run:

```bash
bash use_meta.sh
```

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

---

## Running the Pipeline
2. Edit `configs/config.yaml`: set absolute paths (`podcasts_path`, model files under `models/`, etc.).


### Basic Scenario (Local Processing)


This scenario will:

1. Download datasets
2. Split audio into semantic chunks
3. Transcribe all segments
4. Perform speaker segmentation
5. Apply phonemization

To execute locally, run:
3. Run stages (see [Usage Guide](docs/guide.md)). Sequential wrapper:

```bash
bash base.sh configs/config.yaml
```

All output metadata will be saved in `podcasts/result.csv`.

---

## Configuration

The main configuration file is located at `configs/config.yaml`. This file is organized into several sections, each corresponding to a specific stage of the podcast processing pipeline. Below is a detailed explanation of the key parameters within each section.

---

### Global Parameters
Note: `base.sh` may ship with early stages commented out; uncomment the ones you need.

* `podcasts_path`: Specifies the **absolute path** to the directory where all downloaded podcast files are stored and where subsequent stages (preprocessing, separation, transcription, etc.) read their input and save their output.
---

### `download` Section
## Documentation

This section controls how podcast episodes are downloaded.
- **[Preparing your dataset](docs/preparing.md)** — HF collection vs. local pipeline, folder layout, models, config.
- **[Usage Guide](docs/guide.md)** — stages, artifacts, per-step commands.
- **[example/README.md](example/README.md)** — loading the WebDataset with Hugging Face `datasets`.

* `podcasts_path`: (As explained above) The directory where downloaded podcasts will be saved.
* `episodes_limit`: This sets a **limit on the number of episodes** to download from a single podcast playlist.
* `num_workers`: Specifies the **number of parallel processes** to use for downloading. A higher number can speed up downloads but will consume more system resources.
* `podcasts_urls_file`: This parameter points to the **path of a `.pkl` file** that contains a list of podcast URLs to be downloaded.
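Put together, the `download` block of `configs/config.yaml` might look like the following sketch (key names follow the bullets above; the values are illustrative, not shipped defaults):

```yaml
download:
  podcasts_path: /data/podcasts            # absolute path
  episodes_limit: 50                       # per-playlist cap
  num_workers: 4                           # parallel download processes
  podcasts_urls_file: /data/balalaika.pkl  # list of podcast URLs
```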
Per-module notes live under `src/*/README.md` (aligned with `configs/config.yaml`).

---

### `preprocess` Section

This section handles the initial processing of downloaded audio files, such as chopping them into smaller segments.
## Pipeline overview

* `podcasts_path`: (As explained above) The directory containing the raw downloaded podcasts that need to be preprocessed.
* `duration`: Defines the **maximum length in seconds** for each audio sample (segment).
* `num_workers`: Specifies the **number of parallel processes** to use during preprocessing.
* `whisper_model`: Specifies the **name or path of the Faster-Whisper compatible model** to be used for initial audio processing.
* `compute_type`: Determines the **computation type** for the Whisper model, affecting performance and memory usage.
* `beam_size`: The **beam width** used in the Whisper model's beam-search decoding; larger values can improve accuracy at the cost of speed and memory.
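The chopping step controlled by `duration` can be pictured with a minimal sketch: given a file's total length and the configured cap, emit consecutive fixed-size windows. The real pipeline refines boundaries with VAD / Smart Turn; the function name here is illustrative only:

```python
def chunk_windows(total_sec: float, max_dur: float) -> list[tuple[float, float]]:
    """Split [0, total_sec) into consecutive windows of at most max_dur seconds."""
    windows = []
    start = 0.0
    while start < total_sec:
        end = min(start + max_dur, total_sec)
        windows.append((start, end))
        start = end
    return windows

print(chunk_windows(25.0, 10.0))  # → [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```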
1. **Download** — optional episode fetch.
2. **Preprocess** — **Sortformer (ONNX)** diarization, single-speaker selection, **Smart Turn** boundary refinement, chunking + `balalaika.csv`; long source files removed after chunking; **crest-factor** filtering (`crest_factor` written to CSV, bad files deleted and their CSV rows removed); **EBU R128-style** loudness normalization (see `preprocess_yaml.sh` order).
3. **Separation** — **music detection** (WavLM-based): `music_prob` written to CSV, clips above threshold deleted and their CSV rows removed; **DistillMOS** → `DistillMOS` column in `balalaika.csv`.
4. **Transcription** — **[onnx-asr](https://github.com/istupakov/onnx-asr)** (ONNX Runtime / optional TensorRT), **ROVER** consensus, optional word-level `.tst`.
5. **Punctuation** — RUPunct.
6. **Accents** — ruAccent (e.g. `turbo3.1`).
7. **Phonemization** — **TryIParu** `G2PModel` → `*_rover_phonemes.txt`.
8. **Collate / export** — `balalaika.parquet` and WebDataset shards via `src/collate_yamls.sh`.
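The ROVER consensus in step 4 can be illustrated with a toy word-level majority vote. Real ROVER first aligns hypotheses into a word transition network via dynamic programming; this sketch assumes the hypotheses are already aligned word-for-word:

```python
from collections import Counter

def majority_vote(hypotheses: list[list[str]]) -> list[str]:
    """Pick the most frequent word at each aligned position across hypotheses."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Three aligned ASR hypotheses; the middle word disagrees once.
hyps = [
    "привет как дела".split(),
    "привет так дела".split(),
    "привет как дела".split(),
]
print(" ".join(majority_vote(hyps)))  # → привет как дела
```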

---

### `separation` Section

This section calculates quality metrics for each audio file.

* `podcasts_path`: (As explained above) The directory where the chopped podcasts (from the `preprocess` stage) are located.
* `num_workers`: The **number of parallel processes** to use for audio separation.
* `nisqa_config`: Specifies the **path to the NISQA configuration file**.
* `one_speaker`: A **boolean flag** (`True`/`False`) that, when enabled, processes only those recordings expected to contain a single speaker.
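One cheap per-clip metric of the kind computed along this pipeline is the crest factor (peak amplitude over RMS, usually reported in dB), which the new preprocess stage writes to the CSV. A stdlib-only sketch, assuming float samples in `[-1, 1]`:

```python
import math

def crest_factor_db(samples: list[float]) -> float:
    """Crest factor: peak amplitude over RMS, in dB.
    Unusually high values suggest clicks or isolated spikes."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(peak / rms)

# A pure sine wave has a crest factor of sqrt(2) ≈ 3.01 dB.
sine = [math.sin(2 * math.pi * k / 100) for k in range(100)]
print(round(crest_factor_db(sine), 2))  # → 3.01
```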

---

### `transcription` Section

This section is responsible for converting audio into text.

* `podcasts_path`: (As explained above) The directory containing the processed audio files ready for transcription.
* `model_name`: Specifies the **type of automatic speech recognition (ASR) model** to use. Options typically include `"ctc"` or `"rnnt"`.
* `num_workers`: The **number of parallel processes per GPU** to use for transcription.
* `with_timestamps`: A **boolean flag** (`True`/`False`) that, when enabled, generates timestamps for each word or segment. **Timestamps are only supported with the CTC model.**
* `lm_path`: Specifies the **path to a language model file (`.bin`)**. A language model can improve transcription accuracy by providing contextual information.
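An illustrative shape for this block of `configs/config.yaml` (values are examples matching the bullets above, not shipped defaults):

```yaml
transcription:
  podcasts_path: /data/podcasts
  model_name: ctc            # "ctc" or "rnnt"
  num_workers: 2             # parallel processes per GPU
  with_timestamps: true      # CTC only
  lm_path: /data/models/lm.bin
```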

---

### `punctuation` Section

This section focuses on adding proper punctuation to the transcribed text.

* `podcasts_path`: (As explained above) The directory where the transcribed text files are located.
* `model_name`: Specifies the **name of the RUPunct model** to be used for punctuation restoration.
* `num_workers`: The **number of parallel processes per GPU** to use for punctuation.
---

### `accent` Section

This section restores lexical stress (accent) marks in the transcribed text.

* `podcasts_path`: (As explained above) The directory containing the relevant podcast files.
* `num_workers`: The **number of parallel processes per GPU** to use for accent processing.
* `model_name`: Specifies the **name of the ruAccent model** to be used.

---

### `phonemizer` Section

This section is responsible for converting text into phonetic representations (phonemes).

* `podcasts_path`: (As explained above) The directory where the text files (from transcription and punctuation stages) are located.
* `num_workers`: The **number of parallel processes per GPU** to use for phonemization.
---

### `classification` Section

This section relates to global speaker clustering.

* `podcasts_path`: (As explained above) The directory containing the podcast files relevant for classification.
* `num_workers`: The **number of parallel processes per GPU** to use for classification.
* `threshold`: This is the **speaker classification confidence threshold**. Values typically range from `0.6` to `0.9`. A higher threshold means the model needs to be more confident in its classification to assign a label.
* `model_path`: Specifies the **path to the pretrained speaker classification model** in `.pt` format.
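The `threshold` acts on embedding similarity: two segments are attributed to the same speaker only if their embeddings are similar enough. A toy cosine-similarity check (the actual model, metric, and clustering procedure may differ):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_speaker(emb_a: list[float], emb_b: list[float], threshold: float = 0.75) -> bool:
    return cosine(emb_a, emb_b) >= threshold

print(same_speaker([1.0, 0.0], [0.9, 0.1]))  # near-parallel → True
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal → False
```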
---

### Execution Scripts

Each processing script (`*_yaml.sh` and `*_args.sh`) offers flexibility in how parameters are provided:

* `*_yaml.sh`: These scripts read all necessary parameters directly from the main `config.yaml` file, ensuring consistency across different stages.
* `*_args.sh`: These scripts allow for hardcoded arguments directly within the shell script itself, which can be useful for quick tests or specific overrides without modifying the main configuration file.

## Environment Variables

Create a `.env` file in the project root with the following:

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

* `HF_TOKEN`: Required for speaker count estimation.
* `YANDEX_KEY`: Required for dataset downloads.
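If you want to read this `.env` from your own scripts without `python-dotenv`, a minimal stdlib sketch is enough for simple `KEY=value` lines (real `.env` syntax has more edge cases — quoting, export prefixes — that this ignores):

```python
import os

def load_env(path: str) -> None:
    """Load KEY=value lines from a .env-style file into os.environ.
    Skips blanks, comments, and lines without '='; existing vars win."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env(".env")`, the tokens are available as `os.environ["HF_TOKEN"]` and `os.environ["YANDEX_KEY"]`.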

---

## Important Notes

- All scripts must be executed from the **project root directory**.
- Paths in the config file must be **absolute**.
- The processing scripts (punctuation, accents, yofication) should be run **sequentially**.
- You’ll need:
- Yandex Music API key ([How to get one](https://yandex-music.readthedocs.io/en/main/token.html))
- Hugging Face token

## Models

Place all required models under the `models/` directory with the following structure:
## Citation

```
models/
├── vosblink_resnet/ # Speaker classification model
│ └── ...
└── nisqa_s.tar # Audio quality assessment model
```

```bibtex
@article{borodin2025datacentric,
title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
author={Borodin, Kirill and Vasiliev, Nikita and Kudryavtsev, Vasiliy and Maslov, Maxim and Gorodnichev, Mikhail and Rogov, Oleg and Mkrtchian, Grach},
journal={arXiv preprint arXiv:2507.13563},
year={2025}
}
```

Supported models:

- [NISQA](https://github.com/deepvk/NISQA-s) – Audio quality assessment.
- [GigaAM](https://github.com/salute-developers/GigaAM) – ASR.
- [ruAccent](https://github.com/Den4ikAI/ruaccent) – Accent restoration.
- [RUPunct](https://huggingface.co/RUPunct/RUPunct_big) – Punctuation restoration.
- [VoxBlink ResNet](https://github.com/wenet-e2e/wespeaker) – Speaker classification.
- [TryIPaG2P](https://github.com/NikiPshg/TryIPaG2P) – Phonemization.
- [Speaker Diarization](https://github.com/pyannote/pyannote-audio) – Speaker diarization.
- [Whisper](https://github.com/SYSTRAN/faster-whisper) – ASR + segmentation
**Paper**: [arXiv:2507.13563](https://arxiv.org/abs/2507.13563)
**DOI**: [10.48550/arXiv.2507.13563](https://doi.org/10.48550/arXiv.2507.13563)

---

## Citation
## Models & tooling

If you use this pipeline in your research or production, please cite:
| Piece | Role |
|--------|------|
| **Sortformer** (ONNX) | streaming diarization, single-speaker slices |
| **[Smart Turn](https://github.com/pipecat-ai/smart-turn)** (`smart-turn-v3.0.onnx`) | end-of-speech / turn boundaries |
| **Music detector** (`music_detection.safetensors`) | drop music-heavy chunks |
| **DistillMOS** | predicted MOS in `balalaika.csv` |
| **[onnx-asr](https://github.com/istupakov/onnx-asr)** | GigaAM v3 CTC/RNNT, Vosk, T-one, Parakeet, Canary, Whisper, … |
| **[RUPunct](https://huggingface.co/RUPunct/RUPunct_big)** | punctuation |
| **[ruAccent](https://github.com/Den4ikAI/ruaccent)** | stress marks |
| **TryIParu** (`tryiparu`) | grapheme → IPA |

---

## References and Acknowledgements

Thanks to all the developers and contributors who made this project possible.

<a href="https://github.com/mtuciru/balalaika/graphs/contributors">
<img src="https://contrib.rocks/image?repo=mtuciru/balalaika" />
</a>

## License

See [LICENSE](LICENSE).