SpeechLLM

Repository adapted from https://github.com/skit-ai/SpeechLLM/tree/main

This repository proposes the training and testing of speech LLMs for speaker characterization and summarization, adapted for the JHU-CLSP grid.

Options available right now

Datasets

The following datasets have been adapted and can be used for training, with the following fields:

| dataset | train split | dev split | test split | fields |
| --- | --- | --- | --- | --- |
| Crema-D | train | dev | test | 6 cat. emotions, gender, transcript |
| Common Voice EN v11 | train | dev | test | nationality, age (decade), gender, accent (16 nationalities), transcript |
| IEMOCAP | ses01-03 | ses04 | ses05 | 4 cat. emotions, gender |
| Librispeech | train-clean-100, train-clean-360, train-other-500 | dev-clean, dev-other | test-clean, test-other | gender, transcript |
| MSP podcast | train | validation | test | 8 cat. emotions, gender |
| Switchboard | train | validation | test | transcript, summary |
| AMI | train | validation | test | summary |
| ICSI | train | validation | test | summary |
| VoxCeleb1 | dev | test | test | gender, accent (from nationality) |
| VoxCeleb2-AE | dev | test | test | gender, age, accent (from nationality) |
| WSJ0 | si_tr_s | si_dt_05 | si_et_05 | gender, transcript |

Most datasets use the original splits. The VoxCeleb datasets use the same split for validation and test, as the models are trained to optimize the validation summary loss anyway.
To add a new CSV, the necessary functions are in local/data_csv.
If you want access to the data, please copy the contents of the folder /home/tthebau1/EDART/SpeechLLM/data/* into your own data/ folder.
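
For example, on the CLSP grid, copying the prepared dataset CSVs into your own clone could look like this (the source path is the one given above; data/ is your local data folder):

```bash
# Copy the prepared dataset CSVs into your own data/ folder (CLSP grid path from above).
mkdir -p data
cp -r /home/tthebau1/EDART/SpeechLLM/data/* data/
```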

Architectures

In general, all parameters and their default values can be adjusted in the get_model_config() function in utils.py. The base system currently uses the following components (an example command combining these flags is shown after the list):

  • A WavLM Base Plus feature encoder, with 768-dimensional output features. It can be replaced by any Hugging Face encoder by modifying the parameters:

    • --encoder 'microsoft/wavlm-base-plus' (currently accepts facebook/hubert-xlarge-ll60k, microsoft/wavlm-large, microsoft/wavlm-base-plus, MFCC; the list can be expanded in models/encoder.py)
    • --encoder-dim 768 to adjust the desired output dimension.
  • A windowed meanpooling layer, with a ratio --meanpool 5

  • A CNN connector, which uses the following parameters:

    • --connector 'cnn' for the type of connector. More types and architectures can be added in the models/connector.py file.
    • --connector-k 2 for the stride
    • --connector-layers 2 for the number of layers in the case of an MLP
    • --connector-dim 1024 for the output dimension of features
  • An LLM. It currently uses TinyLlama; you can change it by adjusting the parameter:

    • --llm 'TinyLlama-1.1B-Chat-v1.0'
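
As an illustration, the architecture flags above can be combined in a single invocation. The train.py entry point below is only an assumption for readability; in practice these flags are set inside the sbatch scripts under launch/ (see "Launching a training").

```bash
# Hypothetical invocation: train.py stands in for the actual entry point / sbatch script.
# The flag values are the documented defaults listed above.
python train.py \
    --encoder 'microsoft/wavlm-base-plus' --encoder-dim 768 \
    --meanpool 5 \
    --connector 'cnn' --connector-k 2 --connector-layers 2 --connector-dim 1024 \
    --llm 'TinyLlama-1.1B-Chat-v1.0'
```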

Parameters

For Training

There is one learning rate for the LoRA adapters of the LLM and the connector (--lr 0.0001), and one learning rate for the feature extractor (--encoder-lr 0.000001, by default 100 times lower than the base lr).
If --no-lora is passed, the LLM is frozen.
If --ft-encoder is passed, the encoder is fine-tuned.
If --use-text is passed, transcripts are added as inputs when available, with probability --prob-text during training.
If --no-audio is passed, only the transcripts are used; no encoder or connector is initialized or used.
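
As a sketch, these training flags combine as follows (the train.py entry point and the 0.5 text probability are assumptions, not repository defaults):

```bash
# Hypothetical example: fine-tune the encoder and mix in transcripts half of the time.
# train.py stands in for the actual entry point / sbatch script.
python train.py \
    --lr 0.0001 --encoder-lr 0.000001 \
    --ft-encoder \
    --use-text --prob-text 0.5
```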

Configurations

Training configurations are defined in the folder config.

Data config

In the config/data/ folder, you can find the dataset configuration files. They can be passed as an argument, e.g. --use-config summarize_switchboard.json. They define which datasets are used for training, validation, and testing, and which tasks are used for each.

Connector config

In the config/model/ folder, you can find the connector configuration files. They can be passed as an argument, e.g. --connector cnn_str1.2.1. They define which connector is used, which parts of the encoder to fine-tune, the meanpooling parameters, etc.
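
Both kinds of configuration can be passed together, for example (entry point assumed as before, file names taken from the two sections above):

```bash
# Hypothetical example combining a data config and a connector config.
python train.py \
    --use-config summarize_switchboard.json \
    --connector cnn_str1.2.1
```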


Training runs for up to --total-training-epoch epochs, and the top 3 models are saved in checkpoints/. Use --epoch-to-test to test a specific epoch.
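
For instance, capping training and then testing a specific checkpoint might look like this (train.py, test.py, and the epoch numbers are placeholders; the real invocations live in the launch/ scripts):

```bash
# Hypothetical example: cap training at 20 epochs; the top 3 checkpoints are saved in checkpoints/.
python train.py --total-training-epoch 20 --nickname wavlm_cnn_tinyllama
# Evaluate one specific saved epoch.
python test.py --epoch-to-test 15 --nickname wavlm_cnn_tinyllama
```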

Naming

--group is used by Wandb to put the experiment in a given group.
--nickname is used to differentiate experiments and models with similar architectures but different configurations.

Installation and running an experiment

Installation

  • Conda: a conda environment is provided in environment.yml, use
conda env create -f environment.yml
  • Pip: a pip requirements file is provided in requirements.txt, use
pip install -r requirements.txt

Launching a training

To train a network, use

sbatch launch/$expe_series/train/$your_script.sh

To test it, use

sbatch launch/$expe_series/test/$your_script.sh

Reproducibility

ASRU 2025 article: Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

The experiments with a simple linear layer for speaker characterization are available in launch/ASRU2025, allowing partial reproduction of the article Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM. Please cite it if you use those experiments:

@article{thebaud2025enhancing,
  title={Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM},
  author={Thebaud, Thomas and Lu, Yen-Ju and Wiesner, Matthew and Viechnicki, Peter and Dehak, Najim},
  journal={arXiv preprint arXiv:2508.04795},
  year={2025}
}

[Work in progress] TASLP article: SumSLM: a Speech-Aware Language Model for Long-Form Conversations Summarization

The experiments with a CNN connector for audio summarization are available in launch/TASLP_experiments_clean.


Contact

If you have any questions, please contact Thomas Thebaud on Slack, or use tthebau1@jhu.edu.
