Shouting Voice Detection

Frame-level shouting detection for music or speech audio using librosa for feature extraction and PyTorch Lightning for training. The project targets Apple Silicon (MPS) but works on CPU-only setups as well.

Features

  • 16 kHz resampling and mel-spectrogram features via ShoutingVoiceFrameDataset.
  • Lightweight CNN LightningModule (ShoutingVoiceFrameCNN) with BCE loss and accuracy logging.
  • Training CLI (app/train.py) with configurable frame settings, hyperparameters, and deterministic splits.
  • Inference CLI (app/predict.py) that loads checkpoints and exports [time, probability] arrays.
  • Visualization CLI (app/visualize.py) overlaying shouting spans on the waveform.
  • Unit tests covering dataset behavior and CNN forward/training steps.

Project Layout

.
├─ app/
│  ├─ data/                # sample WAV + labels
│  ├─ model/
│  │  ├─ dataset.py        # ShoutingVoiceFrameDataset implementation
│  │  └─ model.py          # ShoutingVoiceFrameCNN LightningModule
│  ├─ train.py             # training entry point
│  ├─ predict.py           # checkpoint-driven inference
│  └─ visualize.py         # waveform + prediction overlay
├─ tests/                  # pytest suite
├─ docker/                 # Dockerfile and container assets
├─ environment.yml         # conda environment (Apple Silicon-friendly)
├─ requirements.txt        # pip-based dependency lock
├─ Makefile                # setup/format/lint/test helpers
├─ IMPLEMENTATION_PLAN.md  # progress checklist
└─ AGENTS.md               # contributor guidelines
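
The heart of the project is the LightningModule in app/model/model.py. For orientation, here is a minimal sketch of what a frame-level classifier of this shape can look like; layer sizes and names are illustrative, not the repository's exact implementation:

import torch
import torch.nn as nn
import pytorch_lightning as pl

class FrameCNNSketch(pl.LightningModule):
    """Illustrative frame classifier: mel-spectrogram patch in, shouting logit out."""

    def __init__(self, n_mels: int = 64, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the time/frequency axes
            nn.Flatten(),
            nn.Linear(32, 1),         # one shouting logit per frame
        )
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, x):
        # x: (batch, 1, n_mels, time_steps)
        return self.net(x).squeeze(-1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.loss_fn(logits, y.float())
        acc = ((torch.sigmoid(logits) > 0.5) == y.bool()).float().mean()
        self.log("train_loss", loss)
        self.log("train_acc", acc)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)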

Environment Setup

Option 1: Conda (Recommended for Apple Silicon)

git clone <repo-url>
cd ShoutingVoiceDetection
conda env create -f environment.yml
conda activate svd
python -m pip install --upgrade pip

Option 2: Virtualenv via Makefile

git clone <repo-url>
cd ShoutingVoiceDetection
make setup
source .venv/bin/activate  # after setup completes

Both paths install PyTorch, PyTorch Lightning, librosa, matplotlib, pytest, black, and ruff. The Makefile targets:

  • make format: black app tests
  • make lint: ruff check app tests
  • make test: pytest (PYTHONPATH configured via pytest.ini)

Option 3: Docker (Reproducible Everywhere)

The repo ships with docker/Dockerfile, which creates a slim CPU-only image that already contains Python, system audio libraries, and every dependency from requirements.txt. Use it when you want a guaranteed-clean environment or to run training in CI without managing conda.

Build once from the repo root:

docker build -t svd:cpu -f docker/Dockerfile .

Kick off training inside the container (all flags pass through to app.train):

docker run --rm svd:cpu --max_epochs=5 --batch_size=8

Need an interactive shell for debugging? Override the entrypoint:

docker run -it --entrypoint /bin/bash svd:cpu
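
Nothing written inside the container survives the run, so mount the data and log directories from the host if you want artifacts to persist. The container paths below assume the image sets WORKDIR /app; adjust them to match the actual Dockerfile:

# container paths assume WORKDIR /app; adjust to your Dockerfile
docker run --rm \
  -v "$(pwd)/app/data:/app/app/data" \
  -v "$(pwd)/lightning_logs:/app/lightning_logs" \
  svd:cpu --max_epochs=5 --batch_size=8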

Data Requirements

Place short WAV clips under app/data/audio/ and create app/data/labels.csv with:

file,start,end,label
example.wav,0.0,2.0,shouting
example.wav,2.0,4.0,non_vocal

Times are in seconds; audio is resampled to 16 kHz before feature extraction. Labels accept string values such as shouting and non_vocal, or numeric 0/1.
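
Internally, the dataset has to turn these labeled spans into per-frame targets. One common rule marks a frame positive when more than half of it overlaps a shouting span; the sketch below implements that rule and is illustrative only, since the repository's exact mapping may differ:

import csv

def frame_labels(csv_path, wav_name, clip_dur, frame_dur=1.0, hop_dur=0.5):
    """Per-frame 0/1 targets from span labels, majority-overlap rule."""
    spans = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["file"] == wav_name and row["label"] in ("shouting", "1"):
                spans.append((float(row["start"]), float(row["end"])))
    labels = []
    t = 0.0
    while t + frame_dur <= clip_dur:
        # total overlap between this frame and all shouting spans
        overlap = sum(max(0.0, min(t + frame_dur, e) - max(t, s)) for s, e in spans)
        labels.append(1 if overlap > frame_dur / 2 else 0)
        t += hop_dur
    return labels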

Dataset Diagnostics

  • Inspect class balance with the same frame settings you use for training:
    python -m app.utils.report_class_balance \
      --labels_csv app/data/labels.csv \
      --audio_dir app/data/audio \
      --frame_duration 1.0 \
      --hop_duration 0.5
    This prints how many positive vs. negative frames exist overall and per file.
  • Visualize the ground-truth spans from labels.csv on top of the waveform:
    python -m app.utils.plot_labels \
      --audio app/data/audio/example.wav \
      --labels_csv app/data/labels.csv \
      --output outputs/example_labels.png
    The plot highlights shouting intervals (red) and non-vocal intervals (green).

Training

Run on CPU or Apple MPS:

python -m app.train \
  --labels_csv app/data/labels.csv \
  --audio_dir app/data/audio \
  --batch_size 4 \
  --max_epochs 5 \
  --frame_duration 1.0 \
  --hop_duration 0.5

Key flags:

  • --default_root_dir <dir> (Lightning) if you want checkpoints somewhere other than lightning_logs/svd/.
  • --sample_rate, --n_mels, --n_fft, --spec_hop_length to tweak the feature extractor (see the sketch after this list).
  • --num_workers for DataLoaders (set >0 when running outside notebooks).
  • --log_dir to relocate TensorBoard events and checkpoints (default lightning_logs).
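
The feature flags map directly onto librosa's mel-spectrogram parameters. For reference, here is a sketch of the equivalent extraction for a single frame; the librosa calls are standard, but the numeric values are placeholders rather than the project's defaults:

import librosa
import numpy as np

y, sr = librosa.load("app/data/audio/example.wav", sr=16000)  # --sample_rate
frame = y[: sr * 1]  # one 1.0 s frame (--frame_duration)
mel = librosa.feature.melspectrogram(
    y=frame, sr=sr, n_fft=1024, hop_length=256, n_mels=64
)  # --n_fft, --spec_hop_length, --n_mels
log_mel = librosa.power_to_db(mel, ref=np.max)  # model input for this frame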

Lightning checkpoints land under lightning_logs/svd/.../checkpoints/epoch=*-step=*.ckpt. Copy or symlink a checkpoint to a stable location (e.g., checkpoints/last.ckpt) for inference.

Visualize Training with TensorBoard

TensorBoard is included in requirements.txt/environment.yml. After any training run, Lightning writes logs under lightning_logs/svd/. Launch TensorBoard from the repo root:

tensorboard --logdir lightning_logs --port 6006

Open http://localhost:6006 to inspect loss curves, metrics, and learning-rate schedules across runs.

Inference

Generate frame probabilities for any WAV:

python -m app.predict \
  checkpoints/last.ckpt \
  app/data/audio/example.wav \
  --output outputs/example_preds.npy \
  --frame_duration 1.0 \
  --hop_duration 0.5 \
  --threshold 0.6

Outputs a NumPy array with shape (num_frames, 2) containing [start_time_sec, probability].
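
The array is straightforward to post-process. For example, this sketch thresholds the probabilities and merges consecutive positive frames into [start, end] spans (the frame duration must match the --frame_duration used at prediction time):

import numpy as np

preds = np.load("outputs/example_preds.npy")  # (num_frames, 2): [start_sec, prob]
frame = 1.0  # must match --frame_duration
spans, current = [], None
for start, prob in preds:
    if prob >= 0.6:
        if current is None:
            current = [start, start + frame]   # open a new span
        else:
            current[1] = start + frame         # extend the current span
    elif current is not None:
        spans.append(tuple(current))
        current = None
if current is not None:
    spans.append(tuple(current))
print(spans)  # e.g. [(0.0, 2.0), ...]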

Visualization

Overlay shouting spans on the waveform using the saved predictions:

python -m app.visualize \
  --audio app/data/audio/example.wav \
  --predictions outputs/example_preds.npy \
  --threshold 0.6 \
  --output outputs/example_plot.png

If --output is omitted, the plot displays interactively.
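
app/visualize.py covers the common case, but the overlay itself is only a few lines of matplotlib if you want to customize it. A sketch, reusing spans in the [start, end] format produced by the merging example above:

import os
import librosa
import matplotlib.pyplot as plt
import numpy as np

spans = [(0.0, 2.0)]  # example; use the merged spans from the sketch above
y, sr = librosa.load("app/data/audio/example.wav", sr=16000)
t = np.arange(len(y)) / sr
fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(t, y, linewidth=0.5)
for start, end in spans:
    ax.axvspan(start, end, color="red", alpha=0.3)  # shade predicted shouting
ax.set_xlabel("time (s)")
os.makedirs("outputs", exist_ok=True)
plt.savefig("outputs/example_plot.png", dpi=150)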

Example Output

(Figure: shouting-voice visualization, with predicted shouting spans highlighted on the waveform.)

Testing & Quality

  • pytest tests/model -q validates dataset and model components.
  • make format / make lint keep code style consistent (black + ruff).
  • For coverage-oriented runs: pytest --cov=app --cov-report=term-missing.

Implementation Progress

Track ongoing work in IMPLEMENTATION_PLAN.md. Major milestones already complete:

  1. Environment setup (conda + Makefile).
  2. Repository skeleton and sample data.
  3. Dataset/model implementations with unit tests.
  4. Training CLI and smoke test.
  5. Inference + visualization pipeline.

Remaining tasks include README screenshots/examples and CI hooks.

Contributing

See AGENTS.md for contributor expectations:

  • Use 4-space indentation, snake_case, PascalCase classes.
  • Run make format lint test before opening a PR.
  • Keep commits scoped (feat:, fix:, etc.) and link issues with Closes #<id>.
  • Do not commit large audio datasets or secrets; store them outside git-tracked paths.

Troubleshooting

  • ModuleNotFoundError: app → ensure you run commands via python -m app.train or set PYTHONPATH=$(pwd).
  • MPS/Metal errors → rerun with --accelerator cpu or set PYTORCH_ENABLE_MPS_FALLBACK=1.
  • librosa import issues → confirm the active environment is the one you created via conda/Make.

For more background, refer to shouting_voice_detection_tutorial.md, which mirrors the end-to-end workflow described above. Happy experimenting!
