Hi team, I have a few questions about the multilingual models and training setup:
1. Tokenizer updates for the multilingual release
Did the multilingual version introduce a newly trained/updated tokenizer compared to the Chinese-only (or earlier) releases?
If yes, could you share what changed (e.g., vocab size, training method like BPE/SentencePiece, added languages, normalization rules)?
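To make this question concrete, here is how I was planning to compare the two tokenizers myself, assuming both releases ship a SentencePiece model file; the file paths and the sample sentence below are placeholders, not the actual names in this repo:

```python
import sentencepiece as spm

# Placeholder paths; the actual tokenizer file names/locations in the releases may differ.
OLD_MODEL = "chinese_release/tokenizer.model"
NEW_MODEL = "multilingual_release/tokenizer.model"

old_sp = spm.SentencePieceProcessor(model_file=OLD_MODEL)
new_sp = spm.SentencePieceProcessor(model_file=NEW_MODEL)

# Compare vocabulary sizes.
print("old vocab size:", old_sp.get_piece_size())
print("new vocab size:", new_sp.get_piece_size())

# Compare how the same Chinese sentence is segmented by each tokenizer.
sample = "今天天气怎么样"
print("old pieces:", old_sp.encode(sample, out_type=str))
print("new pieces:", new_sp.encode(sample, out_type=str))
```

If the segmentation of Chinese text differs noticeably between the two, that would help explain what I describe in question 3.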
2. The "~100k-hour scale" data and which training stages use it
In the repository/materials, you mention data at the ~100,000-hour scale. Is my understanding correct that the encoder is not trained from scratch on the full 100k hours, and that the large-scale data is mainly used during the SFT and contextual SFT stages?
3. Multilingual model performance on Chinese & data distribution
When I tested the multilingual model, the Chinese recognition quality seemed worse than expected (it often drops/misses characters). Is this potentially related to the data mixture ratio or tokenizer differences?
Do you have any guidance on the language distribution (approximate proportions) within the ~100k-hour training data, or the sampling strategy used during training?
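For context on what I mean by "sampling strategy", this is the temperature-based language sampling rule I had in mind (p_l ∝ (n_l / N)^(1/T)); the hour counts below are made-up placeholders, not your actual distribution:

```python
# Temperature-based sampling over languages: with T = 1 sampling follows the raw
# data proportions; larger T flattens the distribution and up-weights
# low-resource languages at the cost of the dominant ones.
hours = {"zh": 60000, "en": 30000, "ja": 6000, "ko": 4000}  # placeholder numbers

def sampling_probs(hours_per_lang, temperature):
    total = sum(hours_per_lang.values())
    weights = {lang: (h / total) ** (1.0 / temperature)
               for lang, h in hours_per_lang.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

print(sampling_probs(hours, temperature=1.0))  # proportional to raw hours
print(sampling_probs(hours, temperature=3.0))  # flattened toward uniform
```

Knowing roughly which regime your training used would help me interpret the Chinese results I'm seeing.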
4. Fine-tuning for a Philippine language (data requirements)
If I want to fine-tune the model for a Philippine language, roughly how much supervised speech-text data would you recommend as a starting point?
Are there any practical thresholds you've observed (e.g., <10h for quick adaptation, ~50–100h for solid gains), and do you recommend LoRA/adapter-style tuning vs. full fine-tuning?
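To clarify what I mean by adapter-style tuning, below is a minimal LoRA sketch using the Hugging Face peft library on a toy module; the module, projection names, and hyperparameters are placeholders of my own and may not match this repo's architecture:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy stand-in for an attention block; the real submodule names here may differ.
class TinyAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.q_proj(x) + self.v_proj(x)

model = TinyAttention()

# Attach low-rank adapters only to the named projections; base weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA parameters should be trainable
```

My intuition is that something like this would be the safer starting point with limited Philippine-language data, but I'd appreciate your recommendation.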
