Pavlov

Unified multimodal embeddings with geometric modality conditioning.

A single shared embedding matrix is transformed by modality-specific learned rotation matrices (parameterized via the Lie algebra so(d)). Cross-modal reconstruction forces the shared space to capture modality-invariant structure, inspired by Pavlovian associative conditioning.

Architecture

Encode:  Input_m ──> Encoder_m(x) ──> W ──> R_m^T ──> z_m (latent)
Decode:  z ──> W_dec ──> Decoder_m ──> Reconstruction

Same-modal:    z_v ──> W_dec ──> Decoder_v ──> Recon_v
Cross-modal:   z_v ──> W_dec ──> Decoder_a ──> CrossRecon_a  (vision → audio)

W: Shared encode projection (embed_dim -> latent_dim), learned jointly across modalities
W_dec: Shared decode projection (latent_dim -> embed_dim), separate learned layer
R_v, R_a: Modality-specific rotation matrices in SO(d), parameterized via skew-symmetric matrices in the Lie algebra so(d) and mapped to SO(d) via the matrix exponential (d = latent_dim, since the rotation is applied after W)
Training objective: Same-modal reconstruction + cross-modal reconstruction + contrastive alignment + optional rotation orthogonality regularization
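
The rotation parameterization and the encode/decode paths above can be sketched in NumPy. This is an illustration under assumptions, not the repository's implementation: the helper names (`skew`, `rotation_from_params`, `encode`, `decode`), the sizes, and the random linear stand-ins for the decoders are all hypothetical, and the matrix exponential is computed through an eigendecomposition rather than a library routine.

```python
import numpy as np

def skew(params, d):
    """Build a skew-symmetric matrix (A.T == -A) from d*(d-1)/2 free parameters."""
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = params
    return A - A.T

def rotation_from_params(params, d):
    """Map A in so(d) to R = exp(A) in SO(d).

    1j*A is Hermitian, so eigh gives 1j*A = V diag(w) V^H and therefore
    exp(A) = V diag(exp(-1j*w)) V^H, which is real up to round-off.
    """
    w, V = np.linalg.eigh(1j * skew(params, d))
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

rng = np.random.default_rng(0)
embed_dim, latent_dim = 8, 4                       # illustrative sizes only
n_params = latent_dim * (latent_dim - 1) // 2

W = rng.normal(size=(embed_dim, latent_dim))       # shared encode projection
W_dec = rng.normal(size=(latent_dim, embed_dim))   # shared decode projection
R = {m: rotation_from_params(rng.normal(size=n_params), latent_dim)
     for m in ("vision", "audio")}
# Random linear maps stand in for the real Decoder_v / Decoder_a networks.
decoders = {"vision": rng.normal(size=(embed_dim, 28 * 28)),
            "audio": rng.normal(size=(embed_dim, 112 * 112))}

def encode(h, modality):
    """Encoder output h (batch, embed_dim) -> latent z (batch, latent_dim).

    Row-vector form of z = R_m^T (W^T h): z = h @ W @ R_m.
    """
    return h @ W @ R[modality]

def decode(z, target_modality):
    """Latent z -> W_dec -> decoder of the target modality."""
    return (z @ W_dec) @ decoders[target_modality]

h_v = rng.normal(size=(2, embed_dim))              # pretend Encoder_v output
z_v = encode(h_v, "vision")
recon_v = decode(z_v, "vision")                    # same-modal path
cross_a = decode(z_v, "audio")                     # cross-modal path (vision -> audio)
```

Because each R_m built this way is orthogonal with determinant 1, the rotation changes the orientation of a latent vector but not its norm.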

Datasets

Pavlov supports multiple audio-visual datasets of increasing difficulty. The dataset is selected via Hydra config.

| Dataset | Classes | Samples | Vision Input | Audio Input | SOTA accuracy |
| --- | --- | --- | --- | --- | --- |
| AV-MNIST (default) | 10 digits | ~70K | 1x28x28 grayscale (MNIST) | 1x112x112 mel spectrogram (FSDD) | ~99% |
| CREMA-D | 6 emotions | ~7.4K | 1x64x64 grayscale (face frame) | 1x112x112 mel spectrogram | ~80% |

AV-MNIST pairs handwritten digit images with spoken digit spectrograms from the Free Spoken Digit Dataset. It serves as the proof-of-concept benchmark.

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) pairs facial video frames with spoken audio from 91 actors expressing 6 emotions (anger, disgust, fear, happiness, sadness, neutral). It is a significantly harder benchmark where both modalities carry meaningful but different emotional signals. Splits are done by actor to prevent data leakage.
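
To make the actor-based split concrete, here is a minimal sketch that assigns whole actors to a split so that no actor appears in more than one. It assumes CREMA-D-style filenames whose first underscore-separated field is the actor ID; the function name, the split fractions, and the seed are hypothetical and may differ from what pavlov/data/cremad.py does.

```python
import random
from collections import defaultdict

def split_by_actor(filenames, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole actors to train/val/test so no actor spans two splits."""
    by_actor = defaultdict(list)
    for name in filenames:
        actor_id = name.split("_")[0]      # assumed "<actor>_<sentence>_<emotion>_<level>" naming
        by_actor[actor_id].append(name)

    actors = sorted(by_actor)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(actors)
    n_test = max(1, round(len(actors) * test_frac))
    n_val = max(1, round(len(actors) * val_frac))
    groups = {
        "test": actors[:n_test],
        "val": actors[n_test:n_test + n_val],
        "train": actors[n_test + n_val:],
    }
    return {split: [f for a in ids for f in by_actor[a]]
            for split, ids in groups.items()}

# Synthetic filenames for illustration: 10 actors, 3 clips each.
files = [f"{1001 + a}_IEO_{emo}_XX.wav"
         for a in range(10) for emo in ("ANG", "HAP", "SAD")]
splits = split_by_actor(files)
```

Splitting at the clip level instead would let the same face and voice appear in both training and test data, which inflates test accuracy.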

Installation

git clone https://github.com/<your-org>/pavlov.git
cd pavlov
pip install -e ".[dev]"

# Optional: W&B logging support
pip install -e ".[wandb]"

Quick Start

1. Download data

# Download AV-MNIST (default)
pavlov-download

# Download CREMA-D (~7.5 GB — requires git-lfs + opencv-python)
# Install git-lfs first: brew install git-lfs && git lfs install (macOS)
pavlov-download data=cremad

2. Train

# AV-MNIST with default config (TensorBoard logging)
pavlov-train

# CREMA-D with the cremad experiment preset
pavlov-train experiment=cremad

# Or select data and model configs individually
pavlov-train data=cremad model=cremad

# Override any config value from the command line
pavlov-train training.max_epochs=50 model.latent_dim=64

# With W&B logging
pavlov-train training.logger=wandb training.wandb.project=my-project

3. Evaluate

# Evaluate an AV-MNIST checkpoint
pavlov-eval +checkpoint=path/to/checkpoint.ckpt

# Evaluate a CREMA-D checkpoint
pavlov-eval data=cremad model=cremad +checkpoint=path/to/checkpoint.ckpt

# Full diagnostic suite: samples, audio, rotation matrices, t-SNE, per-digit metrics
pavlov-eval "+checkpoint=path/to/checkpoint.ckpt" +log_tensorboard=true

# Include missing-modality robustness evaluation
pavlov-eval "+checkpoint=path/to/checkpoint.ckpt" +log_tensorboard=true +eval_missing_modality=true
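
In the commands above, plain `key=value` arguments override values that already exist in the config, while the `+` prefix is Hydra's syntax for adding a key the config does not define (such as `checkpoint`). The sketch below mimics that behavior on a nested dict to make the semantics concrete. `apply_overrides` is a hypothetical helper, not part of Hydra or Pavlov, the starting config is made up, and real Hydra additionally handles config groups such as `data=cremad`, value typing, and more.

```python
def apply_overrides(cfg, overrides):
    """Apply simplified Hydra-style overrides to a nested dict, in place.

    Values stay strings here; real Hydra also parses types.
    """
    for item in overrides:
        key, value = item.split("=", 1)
        adding = key.startswith("+")
        *parents, leaf = key.lstrip("+").split(".")
        node = cfg
        for part in parents:
            # "+a.b=c" may create intermediate nodes; "a.b=c" requires them to exist
            node = node.setdefault(part, {}) if adding else node[part]
        if adding and leaf in node:
            raise KeyError(f"{key!r}: key already exists, drop the '+'")
        if not adding and leaf not in node:
            raise KeyError(f"{key!r}: key not in config, use '+' to add it")
        node[leaf] = value
    return cfg

# Made-up starting config for illustration.
cfg = {"training": {"max_epochs": "100"}, "model": {"latent_dim": "32"}}
apply_overrides(cfg, [
    "training.max_epochs=50",
    "model.latent_dim=64",
    "+checkpoint=path/to/checkpoint.ckpt",
])
```

Passing `checkpoint=...` without the `+` would fail in this sketch, which mirrors why the evaluation commands write `+checkpoint=...`.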

4. Visualize embeddings

pavlov-viz +checkpoint=path/to/checkpoint.ckpt +save_path=embeddings.png

# Also log sample reconstructions to TensorBoard
pavlov-viz "+checkpoint=path/to/checkpoint.ckpt" +save_path=embeddings.png +log_tensorboard=true

5. View TensorBoard

Training outputs (logs, checkpoints, Hydra config) are written to outputs/YYYY-MM-DD/HH-MM-SS/ per run. During validation, the following are logged: sample reconstructions (images + audio), rotation matrix heatmaps, modality alignment metrics, and t-SNE (every 5 epochs). With pavlov-eval +log_tensorboard=true, the full diagnostic suite is logged. To view TensorBoard:

tensorboard --logdir outputs/
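
Because run directories are named by date and time, they sort chronologically as plain strings. A minimal sketch for locating the checkpoints of the most recent run follows, assuming checkpoints are saved as `.ckpt` files somewhere below the run directory. The helper name is hypothetical, and this is not necessarily how pavlov-demo picks its checkpoint.

```python
from pathlib import Path

def latest_run_checkpoints(root="outputs"):
    """Return .ckpt files from the newest outputs/YYYY-MM-DD/HH-MM-SS/ run that has any."""
    runs = sorted(
        (p for p in Path(root).glob("*/*") if p.is_dir()),
        key=lambda p: (p.parent.name, p.name),   # (date, time) sorts chronologically
        reverse=True,
    )
    for run in runs:
        ckpts = sorted(run.rglob("*.ckpt"))
        if ckpts:
            return ckpts
    return []
```

The returned paths can be passed to the `+checkpoint=...` argument of pavlov-eval or pavlov-viz.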

6. Interactive demo

Launch a Gradio app to explore same-modal and cross-modal reconstructions interactively:

# Install the demo extra
pip install -e ".[demo]"

# Auto-selects the best checkpoint from outputs/
pavlov-demo

# Or specify a checkpoint explicitly
pavlov-demo --checkpoint path/to/checkpoint.ckpt

# Create a public shareable link
pavlov-demo --share

The app has two tabs: Vision Input and Audio Input. Click "New Sample" to pick a random test example and see the original input alongside its same-modal reconstruction and cross-modal reconstruction (e.g., image to synthesized audio spectrogram, or audio to reconstructed image).

Configuration

Pavlov uses Hydra for configuration management. All configs live in configs/.

configs/
  config.yaml               # Top-level defaults (data=avmnist, model=default)
  model/
    default.yaml             # AV-MNIST model architecture
    cremad.yaml              # CREMA-D model architecture (64x64 vision input)
  data/
    avmnist.yaml             # AV-MNIST dataset settings
    cremad.yaml              # CREMA-D dataset settings
  training/default.yaml      # Training hyperparameters + loss weights
  experiment/
    baseline.yaml            # AV-MNIST baseline experiment preset
    cremad.yaml              # CREMA-D experiment preset

Override examples

# Change learning rate and batch size
pavlov-train training.lr=5e-4 data.batch_size=128

# Change embedding dimensions
pavlov-train model.embed_dim=512 model.latent_dim=256

# Adjust contrastive loss weight
pavlov-train training.loss_weights.contrastive=1.0

# Disable orthogonality regularization
pavlov-train training.loss_weights.orthogonality=0.0
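
The `training.loss_weights.*` overrides above scale individual terms of the training objective. A rough NumPy sketch of how such a weighted sum could be formed follows. The specific forms chosen here (a symmetric InfoNCE loss for contrastive alignment, a Frobenius-norm penalty for orthogonality), the temperature, and the weight keys other than `contrastive` and `orthogonality` are assumptions, not the definitions in pavlov/losses.py.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: row i of z_a should match row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature

    def diag_cross_entropy(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_prob)))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def orthogonality_penalty(R):
    """||R^T R - I||_F^2, which is zero exactly when R is orthogonal."""
    return float(np.sum((R.T @ R - np.eye(R.shape[0])) ** 2))

def total_loss(terms, weights):
    """Weighted sum of named loss terms; a missing weight defaults to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())

# Example: setting the orthogonality weight to 0.0 removes that term entirely.
terms = {"same_modal": 1.0, "cross_modal": 2.0, "contrastive": 0.5, "orthogonality": 4.0}
loss = total_loss(terms, {"contrastive": 1.0, "orthogonality": 0.0})
```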

Project Structure

pavlov/
  models/
    pavlov_model.py      # Core model: shared embeddings + rotation conditioning
    lightning_module.py   # PyTorch Lightning training wrapper
    encoders.py          # Modality-specific encoders (parameterized by input size)
    decoders.py          # Modality-specific decoders (parameterized by output size)
  data/
    __init__.py          # Dataset factory: build_datamodule(cfg)
    avmnist.py           # AV-MNIST data module (MNIST images + spoken digits)
    cremad.py            # CREMA-D data module (facial frames + emotion audio)
    download.py          # AV-MNIST download and preparation pipeline
  utils/
    rotation.py          # Lie algebra so(d) -> SO(d) rotation utilities
  losses.py              # Reconstruction, contrastive, and orthogonality losses
  evaluation/
    cross_modal_transfer.py  # Train on modality A, test on modality B
    retrieval.py             # Cross-modal retrieval (Recall@K, MAP)
    visualization.py         # t-SNE embedding plots
  cli/
    train.py             # Training entry point
    eval.py              # Evaluation entry point
    viz.py               # Visualization entry point
    download_data.py     # Data download script (supports all datasets)
    demo.py              # Interactive Gradio demo
configs/
  ...                    # Hydra configuration files

Evaluation Protocol

Cross-modal transfer: Train a linear classifier on vision embeddings, test on audio embeddings (and vice versa). High transfer accuracy indicates modality-invariant representations.
Cross-modal retrieval: Given a query from one modality, retrieve nearest neighbors in the other modality. Measured by Recall@K and Mean Average Precision.
t-SNE visualization: Embeddings colored by class with different markers per modality. Well-aligned modalities show overlapping clusters.
Missing-modality robustness: Evaluate performance when one modality is absent at test time.
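
The first two protocols can be sketched in NumPy as follows. This is a simplified illustration, not the code in pavlov/evaluation/: the function names are hypothetical, the linear classifier is a least-squares fit on one-hot targets, and retrieval treats a gallery item as relevant when it shares the query's class label, which is an assumption about how relevance is defined.

```python
import numpy as np

def transfer_accuracy(z_train, y_train, z_test, y_test, n_classes):
    """Fit a least-squares linear classifier on one modality, score it on the other."""
    X = np.hstack([z_train, np.ones((len(z_train), 1))])     # append bias column
    Y = np.eye(n_classes)[y_train]                            # one-hot targets
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    X_test = np.hstack([z_test, np.ones((len(z_test), 1))])
    return float(((X_test @ coef).argmax(axis=1) == y_test).mean())

def retrieval_metrics(queries, gallery, query_labels, gallery_labels, ks=(1, 5)):
    """Cosine-similarity retrieval from one modality into the other.

    Returns Recall@K (fraction of queries with a relevant item in the top K)
    and mean average precision over the full ranking.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    order = np.argsort(-(q @ g.T), axis=1)                    # best match first
    relevant = gallery_labels[order] == query_labels[:, None]

    recall = {k: float(relevant[:, :k].any(axis=1).mean()) for k in ks}

    ranks = np.arange(1, gallery.shape[0] + 1)
    precision_at_hit = np.cumsum(relevant, axis=1) / ranks * relevant
    n_relevant = np.maximum(relevant.sum(axis=1), 1)          # avoid division by zero
    mean_ap = float((precision_at_hit.sum(axis=1) / n_relevant).mean())
    return recall, mean_ap
```

For cross-modal transfer, `z_train` would hold vision embeddings and `z_test` audio embeddings (or the reverse); for retrieval, `queries` and `gallery` come from different modalities.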

Adding a New Dataset

Create a data module in pavlov/data/ following the pattern of avmnist.py or cremad.py. Your DataModule must return batches as dicts with "vision", "audio", and "label" keys.

Register it in pavlov/data/__init__.py:

elif dataset_name == "my_dataset":
    from pavlov.data.my_dataset import MyDataModule
    return MyDataModule(**cfg.data)

Create configs in configs/data/my_dataset.yaml and configs/model/my_dataset.yaml. Set vision_encoder.input_size and vision_decoder.output_size to match your vision input resolution.
Add a download branch for the new dataset to pavlov/cli/download_data.py.
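
As a rough illustration of the batch format required above, the sketch below builds a toy paired dataset and collates samples into a dict with "vision", "audio", and "label" keys. It uses NumPy only so that it runs without PyTorch; the class name, the `collate` helper, and the random data are hypothetical, and a real data module would follow the pattern of avmnist.py or cremad.py instead.

```python
import numpy as np

class PairedToyDataset:
    """Map-style dataset (__len__ + __getitem__) yielding paired samples as dicts."""

    def __init__(self, n=16, vision_shape=(1, 28, 28), audio_shape=(1, 112, 112),
                 n_classes=10, seed=0):
        rng = np.random.default_rng(seed)
        self.vision = rng.random((n, *vision_shape), dtype=np.float32)
        self.audio = rng.random((n, *audio_shape), dtype=np.float32)
        self.label = rng.integers(0, n_classes, size=n)

    def __len__(self):
        return len(self.label)

    def __getitem__(self, i):
        return {"vision": self.vision[i], "audio": self.audio[i], "label": self.label[i]}

def collate(samples):
    """Stack each key across samples to form one batch dict."""
    return {key: np.stack([s[key] for s in samples]) for key in samples[0]}

dataset = PairedToyDataset()
batch = collate([dataset[i] for i in range(4)])
```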

Adding a New Modality

Create an encoder in pavlov/models/encoders.py:

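class MyEncoder(ModalityEncoder):
    def __init__(self, input_dim, hidden_dims, embed_dim):
        super().__init__()
        # Build your encoder layers and assign them to self.net.
        # Placeholder MLP (assumes `import torch.nn as nn`): input_dim -> hidden_dims -> embed_dim
        dims = [input_dim, *hidden_dims]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.net = nn.Sequential(*layers, nn.Linear(dims[-1], embed_dim))

    def forward(self, x):
        # Return (batch, embed_dim) tensor
        return self.net(x)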

Create a decoder in pavlov/models/decoders.py:

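class MyDecoder(ModalityDecoder):
    def __init__(self, embed_dim, hidden_dims, output_dim):
        super().__init__()
        # Build your decoder layers and assign them to self.net.
        # Placeholder MLP (assumes `import torch.nn as nn`): embed_dim -> hidden_dims -> output_dim
        dims = [embed_dim, *hidden_dims]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.net = nn.Sequential(*layers, nn.Linear(dims[-1], output_dim))

    def forward(self, z):
        # Return reconstructed input
        return self.net(z)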

Register the encoder/decoder in the build_encoder and build_decoder factory functions.

Add the config for the new modality to configs/model/default.yaml:

my_modality_encoder:
  hidden_dims: [64, 128]
my_modality_decoder:
  hidden_dims: [128, 64]

Update the modalities list in configs/config.yaml:
```
modalities: [vision, audio, my_modality]
```
Update your DataModule to include the new modality in the batch dict.

License

MIT
