a few questions about the multilingual models #50

@russell-shu

Hi team, I have a few questions about the multilingual models and training setup:

1. Tokenizer updates for the multilingual release

Did the multilingual version introduce a newly trained/updated tokenizer compared to the Chinese-only (or earlier) releases?
If yes, could you share what changed (e.g., vocab size, training method like BPE/SentencePiece, added languages, normalization rules)?
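
To make the question concrete, below is a minimal SentencePiece sketch of the kind of tokenizer retraining I have in mind; the input file, vocab size, coverage, and normalization rule are placeholders I made up, not your actual settings.

```python
# Hypothetical sketch of a multilingual tokenizer retrain (all values are assumptions).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_transcripts.txt",   # placeholder: pooled transcripts from all languages
    model_prefix="tokenizer_multilingual",
    model_type="bpe",                       # or "unigram"; this is part of what I'm asking about
    vocab_size=16000,                       # placeholder; the Chinese-only release may use a different size
    character_coverage=0.9995,              # high coverage so rare CJK characters are kept
    normalization_rule_name="nmt_nfkc",     # example normalization rule
)

sp = spm.SentencePieceProcessor(model_file="tokenizer_multilingual.model")
print(sp.encode("你好,客服", out_type=str))  # sanity-check segmentation on Chinese text
```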

2. “100k-hour scale” data and what stages use it

In the repository materials, you mention data at the ~100,000-hour scale. Is my understanding correct that the encoder is not trained from scratch on the full 100k hours, and that this large-scale data is mainly used in the SFT and contextual SFT stages?

3. Multilingual model performance on Chinese & data distribution

When I tested the multilingual model, Chinese recognition quality seemed worse than expected (it often drops or misses characters). Could this be related to the data mixture ratio or to tokenizer differences?
Do you have any guidance on the language distribution (approximate proportions) within the ~100k-hour training data, or the sampling strategy used during training?
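
By “sampling strategy” I mean something like the exponent-based rebalancing sketched below, which many multilingual models use; the hour counts here are made-up placeholders, since the real distribution is exactly what I'm asking about.

```python
# Illustrative rebalancing of a multilingual mixture; the hour counts are invented.
import numpy as np

hours = {"zh": 60_000, "en": 30_000, "other": 10_000}  # placeholder split, not your data
alpha = 0.5  # alpha = 1 keeps natural proportions; smaller alpha upsamples low-resource languages

p = np.array(list(hours.values()), dtype=float)
p /= p.sum()                 # natural proportions
q = p ** alpha
q /= q.sum()                 # rebalanced sampling probabilities

for lang, nat, samp in zip(hours, p, q):
    print(f"{lang}: natural {nat:.1%} -> sampled {samp:.1%}")
```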

4. Fine-tuning for a Philippine language (data requirements)

If I want to fine-tune the model for a Philippine language, roughly how much supervised speech-text data would you recommend as a starting point?
Are there any practical thresholds you’ve observed (e.g., <10 h for quick adaptation, ~50–100 h for solid gains), and do you recommend LoRA/adapter-style tuning or full fine-tuning?
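
If it helps frame the last question, this is the kind of LoRA/adapter-style setup I would try first, sketched with Hugging Face PEFT; the base checkpoint and target module names are placeholders, not your model's actual layer names.

```python
# Minimal LoRA sketch with Hugging Face PEFT; checkpoint and module names are placeholders.
import torch
from transformers import AutoModelForSpeechSeq2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small",               # placeholder base model, used here only for illustration
    torch_dtype=torch.float16,
)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension; a common starting point
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust to the model's layer names
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of weights end up trainable
```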

Attachment: 客服_女.wav
