Hi team,
I evaluated the ASR model across multiple datasets and observed that spoken punctuation (e.g., "next line", "next paragraph", "comma") is often generated inconsistently or incompletely in the output.
Observed Behavior
The model frequently produces malformed or partial punctuation tokens, such as:
{next} line}
{next line
next line}
xt line}
{next} para}graph}
These outputs indicate that the punctuation tokens are not being generated or closed properly, leading to broken or unusable text.
Expected Behavior
Spoken punctuation tokens should be generated consistently and completely, for example:
{next line}
{next paragraph}
What I Tried
1. Evaluated across multiple datasets → issue persists
2. Tested with different tokenizers and transformers versions → no improvement
3. Current versions: tokenizers=0.22.2; transformers=5.3.0
Additional Context / Hypothesis
This may require:
- Investigation into how the model handles spoken punctuation tokens internally, and/or
- A post-processing layer to normalize/fix malformed tokens.
Access to the language model text or ARPA file would also help debug and potentially mitigate this issue (I have raised a separate request for this).
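To illustrate the post-processing idea, here is a minimal sketch of a normalizer. It assumes a small, known vocabulary of spoken-punctuation phrases and that braces are only ever used for these tokens; it drops stray braces and re-wraps each canonical phrase. Truncated fragments like "xt line}" would need fuzzy matching and are not handled here. Function and variable names are hypothetical, not part of any existing pipeline.

```python
import re

# Assumed spoken-punctuation vocabulary; extend as needed.
CANONICAL = ["next line", "next paragraph", "comma"]

def normalize_spoken_punct(text: str) -> str:
    """Best-effort repair of malformed spoken-punctuation tokens.

    Strips all braces (assumed to appear only around these tokens),
    then re-wraps each canonical phrase in a single {...} pair.
    """
    flat = text.replace("{", "").replace("}", "")
    for phrase in CANONICAL:
        # Word boundaries avoid matching inside longer words.
        flat = re.sub(r"\b" + re.escape(phrase) + r"\b",
                      "{" + phrase + "}", flat)
    return flat

# Examples of malformed outputs being repaired:
# "{next} line}"        -> "{next line}"
# "{next} para}graph}"  -> "{next paragraph}"
```

This obviously treats the symptom rather than the cause, but it could serve as a stopgap while the token-generation issue is investigated.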
Request
- Could you please confirm if this is a known issue?
- Is it possible to share the LM text / ARPA to help debug and improve handling of these cases?
Thanks!