44 commits
d87717f
feat: replace Whisper with Smart-Turn VAD, preserve diarization
NikiPshg Aug 3, 2025
8cbfa10
feat(transcription): add new ASR models + ROVER integration
NikiPshg Aug 9, 2025
2cc427c
feat(speech-processing): major pipeline overhaul
NikiPshg Sep 18, 2025
6272fca
feat: optimizations and updates from pyannote 3.1 to pyannote/speaker…
NikiPshg Oct 5, 2025
29311ee
feat: removed support for launching via separate bash scripts
NikiPshg Oct 5, 2025
24f5a51
feat(libs): add NISQA and Smart-VAD libraries as local dependencies
NikiPshg Oct 5, 2025
7f45964
feat: auto-download models on startup and add batching to music detector
NikiPshg Oct 6, 2025
0665296
feat(docs): minimum instructions for use
NikiPshg Oct 6, 2025
7e7a57e
docs: minimum instructions for use
NikiPshg Oct 6, 2025
549a4e9
fix: correct LM download
NikiPshg Oct 8, 2025
f3f370f
fix(diarization): switching to speaker-diarization-community-1
NikiPshg Oct 8, 2025
f6c817f
fix(accent): fixed cuda support
NikiPshg Oct 9, 2025
61eddd1
feat: addded ability to run collate without preprocess block
NikiPshg Oct 24, 2025
f8c111f
feat: replaced external links with mtuci-related ones
NikiPshg Oct 24, 2025
36f78c8
feat: update transcribation, separation for optimization
NikiPshg Nov 20, 2025
ace522e
feat: removing Nisqa from libs, update transcription(tone)
NikiPshg Dec 8, 2025
f7853e5
feat: upgrade to gigaamv3
NikiPshg Dec 8, 2025
319901f
feat: dinamic consensus_num
NikiPshg Dec 11, 2025
894dbe6
feat: update imports, fix transcription block
NikiPshg Dec 15, 2025
1dff6bb
feat: add volume normalization and deletion of files that do not go t…
NikiPshg Dec 26, 2025
ea2ed34
docs: update readme
NikiPshg Dec 26, 2025
5ce8462
docs: update readme
NikiPshg Dec 26, 2025
0e593c1
feat(silence):add silence detect, update read me
NikiPshg Jan 18, 2026
f099c44
feat: update order separation_yaml
NikiPshg Jan 31, 2026
c9e6e9b
feat: change transription backend
Feb 18, 2026
36e599c
feat: fix bugs transcription
Feb 21, 2026
2cee2d6
feat: update nisqa work
Feb 22, 2026
b7f406e
feat: update trt work and readme
Feb 23, 2026
0421371
feat: add distillMos and fix nisqa bug
NikiPshg Feb 24, 2026
9762a3e
add benchmarking
SLENSER0 Mar 9, 2026
ad32884
feat: upload sortformer and remove pyannote,nisqa,silero
NikiPshg Mar 9, 2026
db49b76
feat: update csv update in distillmos
NikiPshg Mar 10, 2026
03aaeae
feat: auto-load sortformer
NikiPshg Mar 10, 2026
46669a5
feat: update config
NikiPshg Mar 10, 2026
d32995a
fear: update transctipton trt cache path, clear repo
NikiPshg Mar 10, 2026
22ae752
feat: update SmartVAD 3.0 --> 3.2
NikiPshg Mar 10, 2026
0701a6e
Merge pull request #9 from SLENSER0/benchmarking
NikiPshg Mar 19, 2026
678dd27
docs: update sortformer readme
NikiPshg Mar 19, 2026
c71b74e
feat: add webdataset support
NikiPshg Mar 19, 2026
f77f1e5
feat: add trt for ruaccent
NikiPshg Mar 19, 2026
ad2a3ec
feat: add trt accent for each cuda
NikiPshg Mar 19, 2026
414d370
docs: update readme files
NikiPshg Mar 20, 2026
f7e6a63
fix: fix benchmarking code
NikiPshg Mar 20, 2026
a308fb9
docs: update read me
NikiPshg Mar 25, 2026
7 changes: 5 additions & 2 deletions .gitignore
@@ -1,4 +1,5 @@
*.csv
.torch_hub/*
*.mp3
*.wav
test.py
@@ -7,7 +8,8 @@ chromadb
cl_*.py
*.ipynb
*.bin

temp/*
models/*
.env
__pycache__/
*.py[cod]
@@ -70,4 +72,5 @@ ipython_config.py
.pdm-build/

__pypackages__/
venv/*
venv/*
cache
281 changes: 55 additions & 226 deletions README.md
@@ -1,278 +1,107 @@
# Balalaika Pipeline

A complete production-ready pipeline for processing podcast audio data, from download to feature extraction.
End-to-end speech data processing: ingest, segmentation, quality filtering, multi-model ASR with ROVER, punctuation, lexical stress, G2P, and export to Parquet / WebDataset.

---
Works with Yandex Music podcasts out of the box, or **your own corpus** if you follow the expected layout (see [Preparing your dataset](docs/preparing.md)).

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Installation](#installation)
3. [Data Preparation](#data-preparation)
- [Quick Setup (Default Parameters)](#quick-setup)
- [Custom Metadata Download](#custom-metadata-download)
4. [Running the Pipeline](#running-the-pipeline)
- [Basic Scenario (Local Processing)](#basic-scenario-local-processing)
5. [Configuration](#configuration)
6. [Environment Variables](#environment-variables)
7. [Models](#models)
8. [Citation](#citation)
9. [Acknowledgments](#acknowledgments)
**Pre-built processed datasets** (segmented, filtered, annotated) are published on Hugging Face: **[Balalaika Dataset — MTUCI collection](https://huggingface.co/collections/MTUCI/balalaika-dataset)**.

---

## Prerequisites
## Quick Start

Ensure you have the following tools installed on your system:
### Prerequisites

```bash
sudo apt update && sudo apt install -y ffmpeg
wget -qO- https://astral.sh/uv/install.sh | sh
```


---

## Installation

Clone the repository and set up the environment:
### Installation

```bash
git clone https://github.com/mtuciru/balalaika
cd balalaika
# Use this if you want to annotate/modify the dataset
bash create_dev_env.sh
# Use this if you only want to use the pre-annotated dataset
bash create_user_env.sh
bash create_dev_env.sh # full stack for running the pipeline
# or
bash create_user_env.sh # consume pre-built datasets only
```

---

## Data Preparation

### Quick Setup (Default Parameters)

To download and prepare the dataset with default settings, choose one of the preconfigured dataset sizes:

* **100-hour dataset**
```bash
bash use_meta_100h.sh
```
### Basic setup

* **500-hour dataset**
```bash
bash use_meta_500h.sh
```
1. Create `.env`:

* **1000-hour dataset**
```bash
bash use_meta_1000h.sh
```

* **2000-hour dataset**
```bash
bash use_meta_2000h.sh
```

All metadata can also be downloaded from [Hugging Face – MTUCI](https://huggingface.co/MTUCI).

### Custom Metadata Download

If you already have generated metadata files (`balalaika.parquet` and `balalaika.pkl`), place them in the project root and run:

```bash
bash use_meta.sh
```

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

---

## Running the Pipeline
2. Edit `configs/config.yaml`: set absolute paths (`podcasts_path`, model files under `models/`, etc.).


### Basic Scenario (Local Processing)


This scenario will:

1. Download datasets
2. Split audio into semantic chunks
3. Transcribe all segments
4. Perform speaker segmentation
5. Apply phonemization

To execute locally, run:
3. Run stages (see [Usage Guide](docs/guide.md)). Sequential wrapper:

```bash
bash base.sh configs/config.yaml
```

All output metadata will be saved in `podcasts/result.csv`.

---

## Configuration

The main configuration file is located at `configs/config.yaml`. This file is organized into several sections, each corresponding to a specific stage of the podcast processing pipeline. Below is a detailed explanation of the key parameters within each section.

---

### Global Parameters
Note: `base.sh` may ship with early stages commented out; uncomment the ones you need.

* `podcasts_path`: Specifies the **absolute path** to the directory where all downloaded podcast files are stored and where subsequent stages (preprocessing, separation, transcription, etc.) read their input and save their output.
---

### `download` Section
## Documentation

This section controls how podcast episodes are downloaded.
- **[Preparing your dataset](docs/preparing.md)** — HF collection vs. local pipeline, folder layout, models, config.
- **[Usage Guide](docs/guide.md)** — stages, artifacts, per-step commands.
- **[example/README.md](example/README.md)** — loading the WebDataset with Hugging Face `datasets`.

* `podcasts_path`: (As explained above) The directory where downloaded podcasts will be saved.
* `episodes_limit`: This sets a **limit on the number of episodes** to download from a single podcast playlist.
* `num_workers`: Specifies the **number of parallel processes** to use for downloading. A higher number can speed up downloads but will consume more system resources.
* `podcasts_urls_file`: This parameter points to the **path of a `.pkl` file** that contains a list of podcast URLs to be downloaded.
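Put together, the `download` block of `configs/config.yaml` might look like the following sketch (key names follow the bullets above; the values are illustrative, not shipped defaults):

```yaml
download:
  podcasts_path: /data/podcasts            # absolute path
  episodes_limit: 50                       # per-playlist cap
  num_workers: 4                           # parallel download processes
  podcasts_urls_file: /data/balalaika.pkl  # list of podcast URLs
```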
Per-module notes live under `src/*/README.md` (aligned with `configs/config.yaml`).

---

### `preprocess` Section

This section handles the initial processing of downloaded audio files, such as chopping them into smaller segments.
## Pipeline overview

* `podcasts_path`: (As explained above) The directory containing the raw downloaded podcasts that need to be preprocessed.
* `duration`: Defines the **maximum length in seconds** for each audio sample (segment).
* `num_workers`: Specifies the **number of parallel processes** to use during preprocessing.
* `whisper_model`: Specifies the **name or path of the Faster-Whisper compatible model** to be used for initial audio processing.
* `compute_type`: Determines the **computation type** for the Whisper model, affecting performance and memory usage.
* `beam_size`: The **beam width** used in the Whisper model's beam-search decoding; larger values can improve accuracy at the cost of speed and memory.
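The chopping step controlled by `duration` can be pictured with a minimal sketch: given a file's total length and the configured cap, emit consecutive fixed-size windows. The real pipeline refines boundaries with VAD / Smart Turn; the function name here is illustrative only:

```python
def chunk_windows(total_sec: float, max_dur: float) -> list[tuple[float, float]]:
    """Split [0, total_sec) into consecutive windows of at most max_dur seconds."""
    windows = []
    start = 0.0
    while start < total_sec:
        end = min(start + max_dur, total_sec)
        windows.append((start, end))
        start = end
    return windows

print(chunk_windows(25.0, 10.0))  # → [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```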
1. **Download** — optional episode fetch.
2. **Preprocess** — **Sortformer (ONNX)** diarization, single-speaker selection, **Smart Turn** boundary refinement, chunking + `balalaika.csv`; long source files removed after chunking; **crest-factor** filtering (`crest_factor` written to CSV, bad files deleted and their CSV rows removed); **EBU R128-style** loudness normalization (see `preprocess_yaml.sh` order).
3. **Separation** — **music detection** (WavLM-based): `music_prob` written to CSV, clips above threshold deleted and their CSV rows removed; **DistillMOS** → `DistillMOS` column in `balalaika.csv`.
4. **Transcription** — **[onnx-asr](https://github.com/istupakov/onnx-asr)** (ONNX Runtime / optional TensorRT), **ROVER** consensus, optional word-level `.tst`.
5. **Punctuation** — RUPunct.
6. **Accents** — ruAccent (e.g. `turbo3.1`).
7. **Phonemization** — **TryIParu** `G2PModel` → `*_rover_phonemes.txt`.
8. **Collate / export** — `balalaika.parquet` and WebDataset shards via `src/collate_yamls.sh`.
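The ROVER consensus in step 4 can be illustrated with a toy word-level majority vote. Real ROVER first aligns hypotheses into a word transition network via dynamic programming; this sketch assumes the hypotheses are already aligned word-for-word:

```python
from collections import Counter

def majority_vote(hypotheses: list[list[str]]) -> list[str]:
    """Pick the most frequent word at each aligned position across hypotheses."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Three aligned ASR hypotheses; the middle word disagrees once.
hyps = [
    "привет как дела".split(),
    "привет так дела".split(),
    "привет как дела".split(),
]
print(" ".join(majority_vote(hyps)))  # → привет как дела
```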

---

### `separation` Section

This section calculates quality metrics for each audio file.

* `podcasts_path`: (As explained above) The directory where the chopped podcasts (from the `preprocess` stage) are located.
* `num_workers`: The **number of parallel processes** to use for audio separation.
* `nisqa_config`: Specifies the **path to the NISQA configuration file**.
* `one_speaker`: A **boolean flag** (`True`/`False`) that, when enabled, processes only those recordings expected to contain a single speaker.
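One cheap per-clip metric of the kind computed along this pipeline is the crest factor (peak amplitude over RMS, usually reported in dB), which the new preprocess stage writes to the CSV. A stdlib-only sketch, assuming float samples in `[-1, 1]`:

```python
import math

def crest_factor_db(samples: list[float]) -> float:
    """Crest factor: peak amplitude over RMS, in dB.
    Unusually high values suggest clicks or isolated spikes."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(peak / rms)

# A pure sine wave has a crest factor of sqrt(2) ≈ 3.01 dB.
sine = [math.sin(2 * math.pi * k / 100) for k in range(100)]
print(round(crest_factor_db(sine), 2))  # → 3.01
```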

---

### `transcription` Section

This section is responsible for converting audio into text.

* `podcasts_path`: (As explained above) The directory containing the processed audio files ready for transcription.
* `model_name`: Specifies the **type of automatic speech recognition (ASR) model** to use. Options typically include `"ctc"` or `"rnnt"`.
* `num_workers`: The **number of parallel processes per GPU** to use for transcription.
* `with_timestamps`: A **boolean flag** (`True`/`False`) that, when enabled, generates timestamps for each word or segment. **Timestamps are only supported with the CTC model.**
* `lm_path`: Specifies the **path to a language model file (`.bin`)**. A language model can improve transcription accuracy by providing contextual information.
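An illustrative shape for this block of `configs/config.yaml` (values are examples matching the bullets above, not shipped defaults):

```yaml
transcription:
  podcasts_path: /data/podcasts
  model_name: ctc            # "ctc" or "rnnt"
  num_workers: 2             # parallel processes per GPU
  with_timestamps: true      # CTC only
  lm_path: /data/models/lm.bin
```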

---

### `punctuation` Section

This section focuses on adding proper punctuation to the transcribed text.

* `podcasts_path`: (As explained above) The directory where the transcribed text files are located.
* `model_name`: Specifies the **name of the RUPunct model** to be used for punctuation restoration.
* `num_workers`: The **number of parallel processes per GPU** to use for punctuation.
---

### `accent` Section

This section restores lexical stress (accent) marks in the transcribed text.

* `podcasts_path`: (As explained above) The directory containing the relevant podcast files.
* `num_workers`: The **number of parallel processes per GPU** to use for accent processing.
* `model_name`: Specifies the **name of the ruAccent model** to be used.

---

### `phonemizer` Section

This section is responsible for converting text into phonetic representations (phonemes).

* `podcasts_path`: (As explained above) The directory where the text files (from transcription and punctuation stages) are located.
* `num_workers`: The **number of parallel processes per GPU** to use for phonemization.
---

### `classification` Section

This section relates to global speaker clustering.

* `podcasts_path`: (As explained above) The directory containing the podcast files relevant for classification.
* `num_workers`: The **number of parallel processes per GPU** to use for classification.
* `threshold`: This is the **speaker classification confidence threshold**. Values typically range from `0.6` to `0.9`. A higher threshold means the model needs to be more confident in its classification to assign a label.
* `model_path`: Specifies the **path to the pretrained speaker classification model** in `.pt` format.
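The `threshold` acts on embedding similarity: two segments are attributed to the same speaker only if their embeddings are similar enough. A toy cosine-similarity check (the actual model, metric, and clustering procedure may differ):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def same_speaker(emb_a: list[float], emb_b: list[float], threshold: float = 0.75) -> bool:
    return cosine(emb_a, emb_b) >= threshold

print(same_speaker([1.0, 0.0], [0.9, 0.1]))  # near-parallel → True
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal → False
```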
---

### Execution Scripts

Each processing script (`*_yaml.sh` and `*_args.sh`) offers flexibility in how parameters are provided:

* `*_yaml.sh`: These scripts read all necessary parameters directly from the main `config.yaml` file, ensuring consistency across different stages.
* `*_args.sh`: These scripts allow for hardcoded arguments directly within the shell script itself, which can be useful for quick tests or specific overrides without modifying the main configuration file.

## Environment Variables

Create a `.env` file in the project root with the following:

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

* `HF_TOKEN`: Required for speaker count estimation.
* `YANDEX_KEY`: Required for dataset downloads.
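If you want to read this `.env` from your own scripts without `python-dotenv`, a minimal stdlib sketch is enough for simple `KEY=value` lines (real `.env` syntax has more edge cases — quoting, export prefixes — that this ignores):

```python
import os

def load_env(path: str) -> None:
    """Load KEY=value lines from a .env-style file into os.environ.
    Skips blanks, comments, and lines without '='; existing vars win."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env(".env")`, the tokens are available as `os.environ["HF_TOKEN"]` and `os.environ["YANDEX_KEY"]`.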

---

## Important Notes

- All scripts must be executed from the **project root directory**.
- Paths in the config file must be **absolute**.
- The processing scripts (punctuation, accents, yofication) should be run **sequentially**.
- You’ll need:
- Yandex Music API key ([How to get one](https://yandex-music.readthedocs.io/en/main/token.html))
- Hugging Face token

## Models

Place all required models under the `models/` directory with the following structure:
## Citation

```
models/
├── vosblink_resnet/ # Speaker classification model
│ └── ...
└── nisqa_s.tar # Audio quality assessment model
```

```bibtex
@article{borodin2025datacentric,
title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
author={Borodin, Kirill and Vasiliev, Nikita and Kudryavtsev, Vasiliy and Maslov, Maxim and Gorodnichev, Mikhail and Rogov, Oleg and Mkrtchian, Grach},
journal={arXiv preprint arXiv:2507.13563},
year={2025}
}
```

Supported models:

- [NISQA](https://github.com/deepvk/NISQA-s) – Audio quality assessment.
- [GigaAM](https://github.com/salute-developers/GigaAM) – ASR.
- [ruAccent](https://github.com/Den4ikAI/ruaccent) – Accent restoration.
- [RUPunct](https://huggingface.co/RUPunct/RUPunct_big) – Punctuation restoration.
- [VoxBlink ResNet](https://github.com/wenet-e2e/wespeaker) – Speaker classification.
- [TryIPaG2P](https://github.com/NikiPshg/TryIPaG2P) – Phonemization.
- [Speaker Diarization](https://github.com/pyannote/pyannote-audio) – Speaker diarization.
- [Whisper](https://github.com/SYSTRAN/faster-whisper) – ASR + segmentation
**Paper**: [arXiv:2507.13563](https://arxiv.org/abs/2507.13563)
**DOI**: [10.48550/arXiv.2507.13563](https://doi.org/10.48550/arXiv.2507.13563)

---

## Citation
## Models & tooling

If you use this pipeline in your research or production, please cite:
| Piece | Role |
|--------|------|
| **Sortformer** (ONNX) | streaming diarization, single-speaker slices |
| **[Smart Turn](https://github.com/pipecat-ai/smart-turn)** (`smart-turn-v3.0.onnx`) | end-of-speech / turn boundaries |
| **Music detector** (`music_detection.safetensors`) | drop music-heavy chunks |
| **DistillMOS** | predicted MOS in `balalaika.csv` |
| **[onnx-asr](https://github.com/istupakov/onnx-asr)** | GigaAM v3 CTC/RNNT, Vosk, T-one, Parakeet, Canary, Whisper, … |
| **[RUPunct](https://huggingface.co/RUPunct/RUPunct_big)** | punctuation |
| **[ruAccent](https://github.com/Den4ikAI/ruaccent)** | stress marks |
| **TryIParu** (`tryiparu`) | grapheme → IPA |

---

## References and Acknowledgements

Thanks to all the developers and contributors who made this project possible.

<a href="https://github.com/mtuciru/balalaika/graphs/contributors">
<img src="https://contrib.rocks/image?repo=mtuciru/balalaika" />
</a>

## License

See [LICENSE](LICENSE).