TalkBank/utterance-tokenizer

utterance-tokenizer

A NER-like BERT model, trainable and trained on TalkBank data, that performs utterance tokenization. It adds punctuation and tokenization to ASR (automatic speech recognition) output, as one step in the pipeline that turns raw audio into usable transcripts for TalkBank.
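
To illustrate what "NER-like" means here, the following is a minimal sketch of how per-word labels from a token-classification model could be turned into utterances. The label scheme (`O`, `.`, `?`, `!`) and the function name `segment_utterances` are assumptions made for illustration only; they are not taken from this repository, and the trained model's actual label set may differ.

```python
from typing import List

# Hypothetical label scheme (an assumption, not the repository's own):
#   "O" -> the word continues the current utterance
#   ".", "?", "!" -> the word ends an utterance with that terminator
END_LABELS = {".", "?", "!"}


def segment_utterances(words: List[str], labels: List[str]) -> List[str]:
    """Group words into utterances using one predicted label per word."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    utterances: List[str] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        current.append(word)
        if label in END_LABELS:
            # Close the utterance and append its terminator as a separate token.
            utterances.append(" ".join(current) + " " + label)
            current = []

    # Words after the last predicted boundary are kept without a terminator.
    if current:
        utterances.append(" ".join(current))
    return utterances


# Unpunctuated ASR-style input with one (hypothetical) label per word.
words = "so I went to the store did you go too".split()
labels = ["O", "O", "O", "O", "O", ".", "O", "O", "O", "?"]

for utterance in segment_utterances(words, labels):
    print(utterance)
```

In this sketch the model's only job is to predict one label per word; segmentation and punctuation then follow mechanically from those labels, which is the same framing used for named-entity recognition.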

Utterance tokenization is notably different from sentence tokenization; for the definition of a speech utterance, refer to the CHAT spec.

Refer to the batchalign repository for how to use a trained model.
