nicolvisser/ZeroSyl

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Spoken language models (pure speech language models) learn language directly from unlabeled speech. No text is used anywhere in the pipeline.

ZeroSyl is a simple method for extracting syllable-like units from a WavLM Large model. Unlike previous work, it does not require training a complex boundary detector.

Installation

For full functionality (including CLI):

pip install "zerosyl[cli]"

For base functionality (in other pipelines):

pip install zerosyl

Requires:

  • python >=3.11.0,<3.15 (last tested up to 3.14.2)
  • torch >=2.4.1,<3.0 (last tested up to 2.10.0)

Basic usage

For continuous embeddings:

import torch
from zerosyl import ZeroSylContinuous

model = ZeroSylContinuous.from_remote()
wav = torch.randn(1, 16000)  # placeholder waveform: 1 s of noise, assuming 16 kHz audio
starts, ends, embeddings = model.encode(wav)

For cluster IDs:

import torch
from zerosyl import ZeroSylDiscrete

model = ZeroSylDiscrete.from_remote()
wav = torch.randn(1, 16000)  # placeholder waveform: 1 s of noise, assuming 16 kHz audio
starts, ends, ids = model.encode(wav)

For language modeling units:

import torch
from zerosyl import ZeroSylCollapsed

model = ZeroSylCollapsed.from_remote()
wav = torch.randn(1, 16000)  # placeholder waveform: 1 s of noise, assuming 16 kHz audio
starts, ends, ids = model.encode(wav)

Batch encode

To encode large datasets, use the CLI tool and specify a batch size:

zerosyl encode --batch-size 16 --help

Language model

Our LanguageModel is the OPT-125M model. Refer to the OPT documentation in the transformers library for more functionality including control over generation.

import torch
from zerosyl import LanguageModel

lm = LanguageModel.from_remote()

# probe likelihoods of two unit sequences
brick = torch.tensor([9116, 9115, 3045, 9115])
blick = torch.tensor([9116, 9115, 5041, 9115])
print(lm.loglikelihoods([brick, blick]))

# unconditional generation
print(lm.generate(max_length=10))
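The brick/blick probe above presumably contrasts a real word with a similar-sounding non-word. One common way to summarize many such comparisons is pairwise accuracy: the fraction of pairs in which the real item receives the higher log-likelihood. The sketch below illustrates that idea only. The function name `pairwise_accuracy` and the scores are made up for illustration, and this README does not state that `zerosyl evaluate` computes its metrics this way.

```python
def pairwise_accuracy(pairs):
    """Fraction of (real_score, fake_score) pairs where the real item scores higher."""
    correct = sum(1 for real, fake in pairs if real > fake)
    return correct / len(pairs)


# Made-up log-likelihoods for illustration only; not outputs of the actual model.
scores = [(-12.3, -15.8), (-9.1, -8.7), (-20.4, -22.0), (-5.5, -7.2)]
accuracy = pairwise_accuracy(scores)
print(f"accuracy: {accuracy:.2f}")
```

In practice the scores would come from a call such as `lm.loglikelihoods([real, fake])` on each pair of unit sequences.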

Evaluation

Evaluation scripts are also packaged into the CLI tool.

zerosyl evaluate --help

Method

If you are interested in the method, there are several places to start:

  1. Read our paper.
  2. Work through the explainer.ipynb notebook.
  3. Look at the core module zerosyl/zerosyl.py, which houses:
    • ZeroSylContinuous - a wrapper around WavLM that adds the boundary detection and mean-pooling logic.
    • ZeroSylDiscrete - a wrapper around ZeroSylContinuous that adds K-means discretization.
    • ZeroSylCollapsed - a wrapper around ZeroSylDiscrete that adds silence handling.
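The mean-pooling and K-means steps named above can be sketched on toy data. This is a minimal, dependency-free illustration and not the code in zerosyl/zerosyl.py. The helper names, the half-open `[start, end)` segment convention, and the Euclidean distance are all assumptions, and boundary detection and silence handling are left out.

```python
from math import dist


def mean_pool_segments(frames, starts, ends):
    """Average the frame vectors inside each [start, end) segment (assumed convention)."""
    pooled = []
    for start, end in zip(starts, ends):
        segment = frames[start:end]  # assumed non-empty
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


def nearest_centroid_ids(embeddings, centroids):
    """Assign each embedding the index of its closest centroid (Euclidean, assumed)."""
    return [
        min(range(len(centroids)), key=lambda k: dist(emb, centroids[k]))
        for emb in embeddings
    ]


# Toy data: six 2-dimensional "frames" split into two hypothetical segments.
frames = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3], [4.0, 4.0], [4.2, 3.8], [3.8, 4.2]]
starts, ends = [0, 3], [3, 6]
centroids = [[4.0, 4.0], [0.0, 0.0]]  # made-up codebook with two entries

embeddings = mean_pool_segments(frames, starts, ends)
ids = nearest_centroid_ids(embeddings, centroids)
print(ids)
```

Each segment collapses to a single vector, and each vector is then replaced by the index of its nearest codebook entry, which mirrors the continuous-to-discrete layering of the wrappers.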

More information on reproducing the paper's results is available in notes/ and reproduce.py.

Checkpoints
