MedASR: Clarification Needed on Handling of Brace Tokens and Preprocessing Rules for Fine-Tuning & Decoding


_I raised a similar question on the [forum](https://discuss.ai.google.dev/t/medasr-clarification-needed-on-handling-of-brace-tokens-and-preprocessing-rules-for-fine-tuning-decoding/116107) earlier. I am sharing it here as well in case someone from the community or the team can help clarify._


Hi Team,

I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:

`{period}`, `{comma}`, `{slash}`, `{next line}`, `{question mark}`, etc.

These tokens appear to represent spoken punctuation or formatting markers.

From the current preprocessing and evaluation logic, I have a few questions that I would like to clarify:

**1. Expected List of Supported Brace Tokens**

Is there an official or recommended list of brace-based tokens (e.g., `{period}`, `{comma}`, `{slash}`) that the model is trained to recognize? It would be very helpful if you could provide:
- A predefined list of all such tokens supported by the tokenizer
- Guidance on whether this list is fixed or customizable
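
For context, here is a minimal sketch of how the brace tokens in a set of transcripts could be enumerated in the absence of an official list. The regex and the `count_brace_tokens` helper are my own assumptions for illustration and are not part of the MedASR evaluation pipeline:

```python
import re
from collections import Counter

# Assumption: a brace token is any non-empty text between a single pair of
# braces, e.g. "{period}" or "{next line}". Nested braces are not handled.
BRACE_TOKEN = re.compile(r"\{([^{}]+)\}")


def count_brace_tokens(transcripts):
    """Count how often each brace token occurs across the given transcripts."""
    counts = Counter()
    for text in transcripts:
        # Lowercase and strip so "{Period}" and "{ period }" are counted together.
        counts.update(m.group(1).strip().lower() for m in BRACE_TOKEN.finditer(text))
    return counts


if __name__ == "__main__":
    sample = ["no fever {period} {next line} cough {comma} chills {period}"]
    for token, n in count_brace_tokens(sample).most_common():
        print(f"{{{token}}}: {n}")
```

A list collected this way only reflects the data at hand, which is why an authoritative list from the team would be preferable.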

________________________________________________

**2. Preprocessing Rules for Fine-Tuning**

For fine-tuning purposes:
- Should these brace tokens be preserved as-is in training text?
- Should they be removed before tokenization?
- Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?

Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:

- How to handle {...} tokens in ground truth text
- Whether they should be normalized, removed, or kept intact
- Any tokenizer-specific considerations
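
To make the three options concrete, the sketch below shows what "kept intact", "removed", and "normalized" could each mean for a ground-truth transcript. The `handle_brace_tokens` function and the `SYMBOLS` mapping are hypothetical and only illustrate the question; I do not know which of these, if any, matches what the model expects:

```python
import re

# Assumption: a brace token is text inside one pair of braces; any whitespace
# directly before it is captured too, so a symbol can attach to the previous word.
BRACE_TOKEN = re.compile(r"\s*\{([^{}]+)\}")

# Hypothetical token-to-symbol mapping, used only for the "normalize" option.
SYMBOLS = {"period": ".", "comma": ",", "question mark": "?"}


def handle_brace_tokens(text, mode="keep"):
    """Apply one of three candidate strategies to the brace tokens in text."""
    if mode == "keep":
        # Leave the transcript untouched, e.g. "... fever {period}".
        return text
    if mode == "remove":
        # Drop every brace token, then collapse any leftover whitespace.
        return re.sub(r"\s+", " ", BRACE_TOKEN.sub("", text)).strip()
    if mode == "normalize":
        # Replace known tokens with their symbol; unknown tokens stay as-is.
        return BRACE_TOKEN.sub(
            lambda m: SYMBOLS.get(m.group(1).strip().lower(), m.group(0)), text
        )
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    sample = "no fever {period} patient denies cough {comma} chills {period}"
    for mode in ("keep", "remove", "normalize"):
        print(f"{mode}: {handle_brace_tokens(sample, mode)}")
```

Which of these should be applied to the reference text also affects how WER is computed, so guidance on the intended option for both fine-tuning and evaluation would help.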
