Spoken language models (pure speech language models) learn language directly from unlabeled speech. No text is used anywhere in the pipeline.
ZeroSyl is a simple method for extracting syllable-like units from a WavLM Large model, without requiring the training of a complex boundary detector as in previous works.
For full functionality (including the CLI):

```shell
pip install zerosyl[cli]
```

For base functionality (for use in other pipelines):

```shell
pip install zerosyl
```

Requires:

- python >=3.11.0,<3.15 (last tested up to 3.14.2)
- torch >=2.4.1,<3.0 (last tested up to 2.10.0)
For continuous embeddings:

```python
import torch

from zerosyl import ZeroSylContinuous

model = ZeroSylContinuous.from_remote()
wav = torch.randn(1, 16000)  # one second of 16 kHz audio
starts, ends, embeddings = model.encode(wav)
```

For cluster IDs:
```python
import torch

from zerosyl import ZeroSylDiscrete

model = ZeroSylDiscrete.from_remote()
wav = torch.randn(1, 16000)
starts, ends, ids = model.encode(wav)
```

For language modeling units:
```python
import torch

from zerosyl import ZeroSylCollapsed

model = ZeroSylCollapsed.from_remote()
wav = torch.randn(1, 16000)
starts, ends, ids = model.encode(wav)
```

To encode large datasets, use the CLI tool and specify a batch size:

```shell
zerosyl encode --batch-size 16 --help
```

Our `LanguageModel` is the OPT-125M model.
Refer to the OPT documentation in the transformers library for more functionality, including control over generation.

```python
import torch

from zerosyl import LanguageModel

lm = LanguageModel.from_remote()

# probe likelihoods of two token sequences
brick = torch.tensor([9116, 9115, 3045, 9115])
blick = torch.tensor([9116, 9115, 5041, 9115])
print(lm.loglikelihoods([brick, blick]))

# unconditional generation
print(lm.generate(max_length=10))
```

Evaluation scripts are also packaged into the CLI tool:
```shell
zerosyl evaluate --help
```

For those interested in the method, there are several places to start:
- Read our paper.
- Work through the `explainer.ipynb` notebook.
- Look at the core module `zerosyl/zerosyl.py`, which houses:
  - `ZeroSylContinuous` - a wrapper around `WavLM` that adds the boundary detection and mean-pooling logic.
  - `ZeroSylDiscrete` - a wrapper around `ZeroSylContinuous` that adds K-means discretization.
  - `ZeroSylCollapsed` - a wrapper around `ZeroSylDiscrete` that adds silence handling.
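The layering can be sketched as follows. This is a hypothetical illustration of the wrapper pattern only, not the actual implementation: the stub classes, fake boundaries, and Euclidean nearest-centroid assignment (the real codebook uses spherical K-means) are all stand-ins.

```python
import torch

# Hypothetical sketch of the layered wrapper design; the real classes live
# in zerosyl/zerosyl.py and differ in their internals.

class StubContinuous:
    """Stand-in for ZeroSylContinuous: emits segment boundaries and embeddings."""
    def encode(self, wav):
        # pretend three syllable-like segments were detected
        starts = torch.tensor([0, 40, 80])
        ends = torch.tensor([40, 80, 120])
        embeddings = torch.randn(3, 1024)  # mean-pooled frame features
        return starts, ends, embeddings

class StubDiscrete:
    """Stand-in for ZeroSylDiscrete: wraps a continuous encoder and adds
    nearest-centroid discretization (Euclidean here for brevity)."""
    def __init__(self, continuous, centroids):
        self.continuous = continuous
        self.centroids = centroids  # (K, dim) cluster centres
    def encode(self, wav):
        starts, ends, emb = self.continuous.encode(wav)
        ids = torch.cdist(emb, self.centroids).argmin(dim=1)
        return starts, ends, ids

wav = torch.randn(1, 16000)
discrete = StubDiscrete(StubContinuous(), torch.randn(16, 1024))
starts, ends, ids = discrete.encode(wav)
print(ids.shape)  # torch.Size([3])
```

Each layer only consumes the output of the one below it, which is why the discrete and collapsed variants can reuse the continuous encoder unchanged.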
There is also more information on reproducing the paper's results inside `notes/` and `reproduce.py`.
- WavLM Large
- Spherical K-means centroids (K=10000, trained on `ZeroSylContinuous` embeddings from 100h of LibriSpeech train-clean-100)
- Codebook silences (a boolean tensor of shape `[10000,]` with `True` if the centroid represents silence)
- OPT-125M language models trained on `ZeroSylCollapsed` tokens
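As an illustration, a `[10000,]` boolean silence mask like the one above could be applied to a sequence of cluster IDs as sketched below. This is hypothetical: the actual silence handling is done inside `ZeroSylCollapsed` and may differ (e.g. collapsing silence runs rather than deleting IDs), and treating cluster 9115 as silence is a made-up example.

```python
import torch

# Hypothetical use of a [K,] boolean silence mask to drop silent clusters
# from a discrete ID sequence. Cluster 9115 as silence is invented for the
# example; the real mask ships with the pretrained artifacts.
K = 10000
silences = torch.zeros(K, dtype=torch.bool)
silences[9115] = True

ids = torch.tensor([9116, 9115, 3045, 9115])
speech_only = ids[~silences[ids]]  # keep IDs whose centroid is not silent
print(speech_only)  # tensor([9116, 3045])
```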