MedASR: Clarification Needed on Handling of Brace Tokens and Preprocessing Rules for Fine-Tuning & Decoding #14

@csetanmayjain

I raised a similar question on the forum earlier and am sharing it here as well, in case someone from the community or the team can help clarify.

Hi Team,

I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:

{period}, {comma}, {slash}, {next line}, {question mark}, etc.

These tokens appear to represent spoken punctuation or formatting markers.

From the current preprocessing and evaluation logic, I have a few questions that I would like to clarify:

1. Expected List of Supported Brace Tokens

Is there an official or recommended list of brace-based tokens (e.g., {period}, {comma}, {slash}, etc.) that the model is trained to recognize? It would be very helpful if you could provide:

  • A predefined list of all such tokens supported by the tokenizer
  • Guidance on whether this list is fixed or customizable
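Until an official list is available, one way to see which brace tokens actually occur in a dataset is to inventory them directly from the transcripts. The sketch below is only illustrative: the sample transcripts are made up, and `count_brace_tokens` is a hypothetical helper, not part of the MedASR evaluation pipeline.

```python
import re
from collections import Counter

# Matches a brace-delimited token such as {period} or {next line}.
BRACE_TOKEN = re.compile(r"\{([^{}]+)\}")


def count_brace_tokens(transcripts):
    """Count each distinct brace token across an iterable of transcript strings."""
    counts = Counter()
    for text in transcripts:
        for name in BRACE_TOKEN.findall(text):
            # Normalize case and surrounding whitespace so {Period} and {period} merge.
            counts[name.strip().lower()] += 1
    return counts


# Made-up sample transcripts, for illustration only.
samples = [
    "patient denies chest pain {period} {next line}",
    "any history of asthma {question mark}",
    "blood pressure 120 {slash} 80 {period}",
]

brace_counts = count_brace_tokens(samples)
for name, n in brace_counts.most_common():
    print(f"{{{name}}}: {n}")
```

The resulting inventory could then be compared against whatever list the team confirms as supported.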

2. Preprocessing Rules for Fine-Tuning

For fine-tuning purposes:

  • Should these brace tokens be preserved as-is in training text?
  • Should they be removed before tokenization?
  • Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?

Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:

  • How to handle {...} tokens in ground truth text
  • Whether they should be normalized, removed, or kept intact
  • Any tokenizer-specific considerations
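
To make the three options concrete (keep intact, remove, or normalize), here is a minimal sketch of what each would do to a ground-truth transcript. The `PUNCT_MAP` mapping and the function names are my own assumptions for illustration; which of these behaviours (if any) matches how the model was trained is exactly what I am asking.

```python
import re

BRACE_TOKEN = re.compile(r"\{([^{}]+)\}")

# Hypothetical mapping from spoken-punctuation tokens to written symbols.
# The real supported set is unknown and is the subject of this question.
PUNCT_MAP = {
    "period": ".",
    "comma": ",",
    "slash": "/",
    "question mark": "?",
    "next line": "\n",
}


def keep_tokens(text):
    """Option A: leave brace tokens in the training text unchanged."""
    return text


def remove_tokens(text):
    """Option B: drop every brace token and collapse the leftover spaces."""
    stripped = BRACE_TOKEN.sub(" ", text)
    return re.sub(r"[ \t]+", " ", stripped).strip()


def normalize_tokens(text):
    """Option C: replace known brace tokens with written punctuation.

    Tokens missing from PUNCT_MAP are left as-is.
    """
    def repl(match):
        name = match.group(1).strip().lower()
        return PUNCT_MAP.get(name, match.group(0))

    out = BRACE_TOKEN.sub(repl, text)
    out = re.sub(r"[ \t]+([.,?])", r"\1", out)   # attach . , ? to the previous word
    out = re.sub(r"[ \t]*/[ \t]*", "/", out)     # no spaces around a slash
    out = re.sub(r"[ \t]*\n[ \t]*", "\n", out)   # trim spaces around line breaks
    return out.strip()


example = "bp 120 {slash} 80 {comma} stable {period} {next line} any pain {question mark}"
print(repr(keep_tokens(example)))
print(repr(remove_tokens(example)))
print(repr(normalize_tokens(example)))
```

Each option yields a different reference string, so the choice affects both the fine-tuning targets and any WER computed against them.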
