Hi team, I have a few questions about the multilingual models and training setup:
1. Tokenizer updates for the multilingual release
Did the multilingual version introduce a newly trained/updated tokenizer compared to the Chinese-only (or earlier) releases?
If yes, could you share what changed (e.g., vocab size, training method like BPE/SentencePiece, added languages, normalization rules)?
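To make this question concrete, here is how I was planning to compare the two tokenizers myself, assuming both releases ship a SentencePiece model file; the file paths and the sample sentence below are placeholders, not the actual names in this repo:

```python
import sentencepiece as spm

# Placeholder paths; the actual tokenizer file names/locations in the releases may differ.
OLD_MODEL = "chinese_release/tokenizer.model"
NEW_MODEL = "multilingual_release/tokenizer.model"

old_sp = spm.SentencePieceProcessor(model_file=OLD_MODEL)
new_sp = spm.SentencePieceProcessor(model_file=NEW_MODEL)

# Compare vocabulary sizes.
print("old vocab size:", old_sp.get_piece_size())
print("new vocab size:", new_sp.get_piece_size())

# Compare how the same Chinese sentence is segmented by each tokenizer.
sample = "今天天气怎么样"
print("old pieces:", old_sp.encode(sample, out_type=str))
print("new pieces:", new_sp.encode(sample, out_type=str))
```

If the segmentation of Chinese text differs noticeably between the two, that would help explain what I describe in question 3.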
2. The "~100k-hour scale" data and which training stages use it
In the repository/materials, you mention data at the ~100,000-hour scale. Is my understanding correct that the encoder is not trained from scratch on the full 100k hours, and that the large-scale data is mainly used during the SFT and contextual SFT stages?
3. Multilingual model performance on Chinese & data distribution
When I tested the multilingual model, the Chinese recognition quality seemed worse than expected (it often drops/misses characters). Is this potentially related to the data mixture ratio or tokenizer differences?
Do you have any guidance on the language distribution (approximate proportions) within the ~100k-hour training data, or the sampling strategy used during training?
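For context on what I mean by "sampling strategy", this is the temperature-based language sampling rule I had in mind (p_l ∝ (n_l / N)^(1/T)); the hour counts below are made-up placeholders, not your actual distribution:

```python
# Temperature-based sampling over languages: with T = 1 sampling follows the raw
# data proportions; larger T flattens the distribution and up-weights
# low-resource languages at the cost of the dominant ones.
hours = {"zh": 60000, "en": 30000, "ja": 6000, "ko": 4000}  # placeholder numbers

def sampling_probs(hours_per_lang, temperature):
    total = sum(hours_per_lang.values())
    weights = {lang: (h / total) ** (1.0 / temperature)
               for lang, h in hours_per_lang.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

print(sampling_probs(hours, temperature=1.0))  # proportional to raw hours
print(sampling_probs(hours, temperature=3.0))  # flattened toward uniform
```

Knowing roughly which regime your training used would help me interpret the Chinese results I'm seeing.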
4. Fine-tuning for a Philippine language (data requirements)
If I want to fine-tune the model for a Philippine language, roughly how much supervised speech-text data would you recommend as a starting point?
Are there any practical thresholds you've observed (e.g., <10h for quick adaptation, ~50–100h for solid gains), and do you recommend LoRA/adapter-style tuning vs. full fine-tuning?
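To clarify what I mean by adapter-style tuning, below is a minimal LoRA sketch using the Hugging Face peft library on a toy module; the module, projection names, and hyperparameters are placeholders of my own and may not match this repo's architecture:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy stand-in for an attention block; the real submodule names here may differ.
class TinyAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.q_proj(x) + self.v_proj(x)

model = TinyAttention()

# Attach low-rank adapters only to the named projections; base weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA parameters should be trainable
```

My intuition is that something like this would be the safer starting point with limited Philippine-language data, but I'd appreciate your recommendation.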
