Spoken language models (pure speech language models) learn language directly from unlabeled speech. No text is used anywhere in the pipeline.
ZeroSyl is a simple method for extracting syllable-like units from a WavLM Large model, without requiring the training of a complex boundary detector as in previous works.
For full functionality (including the CLI):

```shell
pip install zerosyl[cli]
```

For base functionality (for use in other pipelines):

```shell
pip install zerosyl
```

Requires:

- python >=3.11.0,<3.15 (last tested up to 3.14.2)
- torch >=2.4.1,<3.0 (last tested up to 2.10.0)
For continuous embeddings:

```python
import torch

from zerosyl import ZeroSylContinuous

model = ZeroSylContinuous.from_remote()
wav = torch.randn(1, 16000)  # one second of 16 kHz audio
starts, ends, embeddings = model.encode(wav)
```

For cluster IDs:
```python
import torch

from zerosyl import ZeroSylDiscrete

model = ZeroSylDiscrete.from_remote()
wav = torch.randn(1, 16000)
starts, ends, ids = model.encode(wav)
```

For language modeling units:
```python
import torch

from zerosyl import ZeroSylCollapsed

model = ZeroSylCollapsed.from_remote()
wav = torch.randn(1, 16000)
starts, ends, ids = model.encode(wav)
```

To encode large datasets, use the CLI tool and specify a batch size:

```shell
zerosyl encode --batch-size 16 --help
```

Our `LanguageModel` is the OPT-125M model.
Refer to the OPT documentation in the transformers library for more functionality, including control over generation.

```python
import torch

from zerosyl import LanguageModel

lm = LanguageModel.from_remote()

# probe likelihoods of two token sequences
brick = torch.tensor([9116, 9115, 3045, 9115])
blick = torch.tensor([9116, 9115, 5041, 9115])
print(lm.loglikelihoods([brick, blick]))

# unconditional generation
print(lm.generate(max_length=10))
```

Evaluation scripts are also packaged into the CLI tool:
```shell
zerosyl evaluate --help
```

For those interested in the method, there are several places to start:
- Read our paper.
- Work through the `explainer.ipynb` notebook.
- Look at the core module `zerosyl/zerosyl.py`, which houses:
  - `ZeroSylContinuous` - a wrapper around `WavLM` that adds the boundary detection and mean-pooling logic.
  - `ZeroSylDiscrete` - a wrapper around `ZeroSylContinuous` that adds K-means discretization.
  - `ZeroSylCollapsed` - a wrapper around `ZeroSylDiscrete` that adds silence handling.
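The layering can be sketched as follows. This is a hypothetical illustration of the wrapper pattern only, not the actual implementation: the stub classes, fake boundaries, and Euclidean nearest-centroid assignment (the real codebook uses spherical K-means) are all stand-ins.

```python
import torch

# Hypothetical sketch of the layered wrapper design; the real classes live
# in zerosyl/zerosyl.py and differ in their internals.

class StubContinuous:
    """Stand-in for ZeroSylContinuous: emits segment boundaries and embeddings."""
    def encode(self, wav):
        # pretend three syllable-like segments were detected
        starts = torch.tensor([0, 40, 80])
        ends = torch.tensor([40, 80, 120])
        embeddings = torch.randn(3, 1024)  # mean-pooled frame features
        return starts, ends, embeddings

class StubDiscrete:
    """Stand-in for ZeroSylDiscrete: wraps a continuous encoder and adds
    nearest-centroid discretization (Euclidean here for brevity)."""
    def __init__(self, continuous, centroids):
        self.continuous = continuous
        self.centroids = centroids  # (K, dim) cluster centres
    def encode(self, wav):
        starts, ends, emb = self.continuous.encode(wav)
        ids = torch.cdist(emb, self.centroids).argmin(dim=1)
        return starts, ends, ids

wav = torch.randn(1, 16000)
discrete = StubDiscrete(StubContinuous(), torch.randn(16, 1024))
starts, ends, ids = discrete.encode(wav)
print(ids.shape)  # torch.Size([3])
```

Each layer only consumes the output of the one below it, which is why the discrete and collapsed variants can reuse the continuous encoder unchanged.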
There is also more information on reproducing the paper's results inside `notes/` and `reproduce.py`.
- WavLM Large
- Spherical K-means centroids (K=10000, trained on `ZeroSylContinuous` embeddings from 100h of LibriSpeech train-clean-100)
- Codebook silences (a boolean tensor of shape `[10000,]` with `True` if the centroid represents silence)
- OPT-125M language models trained on `ZeroSylCollapsed` tokens
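As an illustration, a `[10000,]` boolean silence mask like the one above could be applied to a sequence of cluster IDs as sketched below. This is hypothetical: the actual silence handling is done inside `ZeroSylCollapsed` and may differ (e.g. collapsing silence runs rather than deleting IDs), and treating cluster 9115 as silence is a made-up example.

```python
import torch

# Hypothetical use of a [K,] boolean silence mask to drop silent clusters
# from a discrete ID sequence. Cluster 9115 as silence is invented for the
# example; the real mask ships with the pretrained artifacts.
K = 10000
silences = torch.zeros(K, dtype=torch.bool)
silences[9115] = True

ids = torch.tensor([9116, 9115, 3045, 9115])
speech_only = ids[~silences[ids]]  # keep IDs whose centroid is not silent
print(speech_only)  # tensor([9116, 3045])
```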