I had raised a similar question on the forum earlier; sharing it here as well in case someone from the community or team can help clarify.
Hi Team,
I am currently benchmarking and experimenting with the google/medasr model using the provided evaluation pipeline. During this process, I noticed that many transcripts in the medical ASR domain contain special tokens enclosed in braces, such as:
{period}, {comma}, {slash}, {next line}, {question mark}, etc.
These tokens appear to represent spoken punctuation or formatting markers.
Based on the current preprocessing and evaluation logic, I have a few questions I would like to clarify:
1. Expected List of Supported Brace Tokens
Is there an official or recommended list of brace-based tokens (e.g., {period}, {comma}, {slash}, etc.) that the model is trained to recognize? It would be very helpful if you could provide:
- A predefined list of all such tokens supported by the tokenizer
- Guidance on whether this list is fixed or customizable
2. Preprocessing Rules for Fine-Tuning
For fine-tuning purposes:
- Should these brace tokens be preserved as-is in training text?
- Should they be removed before tokenization?
- Are there recommended preprocessing rules specifically for medical ASR transcripts containing such tokens?
Could you please provide the recommended preprocessing pipeline for fine-tuning, particularly:
- How to handle {...} tokens in ground truth text
- Whether they should be normalized, removed, or kept intact
- Any tokenizer-specific considerations
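For context, here is the kind of normalization step I am currently experimenting with. Note that this is only a sketch under my own assumptions: the token-to-symbol mapping below is hypothetical (the actual supported inventory is exactly what I am asking about), and `normalize_brace_tokens` is my own helper, not part of any official pipeline.

```python
import re

# Hypothetical mapping of brace tokens to punctuation; the real list of
# tokens the model was trained on is what this question asks about.
BRACE_TOKEN_MAP = {
    "{period}": ".",
    "{comma}": ",",
    "{slash}": "/",
    "{question mark}": "?",
    "{next line}": "\n",
}

# Alternation over all escaped tokens, so multi-word tokens like
# "{question mark}" are matched as a single unit.
_TOKEN_PATTERN = re.compile("|".join(re.escape(t) for t in BRACE_TOKEN_MAP))


def normalize_brace_tokens(text: str) -> str:
    """Replace spoken-punctuation brace tokens with their symbols."""
    out = _TOKEN_PATTERN.sub(lambda m: BRACE_TOKEN_MAP[m.group(0)], text)
    # Drop the space the speaker left before the spoken punctuation.
    return re.sub(r" +([.,?/])", r"\1", out)


print(normalize_brace_tokens(
    "patient stable {period} follow up in two weeks {question mark}"
))
# → patient stable. follow up in two weeks?
```

Whether this replacement is the right choice for fine-tuning (versus keeping the tokens intact so the model learns to emit them) is the core of my question above.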