question about qwen3 finetune #189

@Hannibal046

Hi, thanks for the quick support for the qwen3-embedding model!

I noticed a discrepancy in the query prefix format. In the official post, the instruction and the query are concatenated with no space after the colon. In the Tevatron example, however, the prefix ends with a space after the colon:

  --query_prefix "Find a relevant scientific paper abstract to support or reject the claim. Query: " \

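This seemingly minor difference changes the tokenization. Without the space, the first word of the query is tokenized as 'What'; with the space, it becomes 'ĠWhat' (the token that carries a leading space):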

# Without space after colon
>>> print(tokenizer.tokenize("Answer the query\nQuery:What is the capital of China?"))
['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'What', 'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']

# With space after colon
>>> print(tokenizer.tokenize("Answer the query\nQuery: What is the capital of China?"))
['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'ĠWhat', 'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']
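
To make the difference easy to spot on other prefixes, here is a minimal sketch that reports the first position where two token lists diverge. The helper name `first_token_difference` is hypothetical (it is not part of Tevatron or transformers), and the two lists are copied from the outputs above rather than recomputed.

```python
from typing import List, Optional, Tuple


def first_token_difference(
    tokens_a: List[str], tokens_b: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, token_a, token_b) for the first mismatch, or None if equal.

    Hypothetical helper for illustration only.
    """
    for i, (tok_a, tok_b) in enumerate(zip(tokens_a, tokens_b)):
        if tok_a != tok_b:
            return i, tok_a, tok_b
    if len(tokens_a) != len(tokens_b):
        # One list is a strict prefix of the other; report the first extra token.
        i = min(len(tokens_a), len(tokens_b))
        tok_a = tokens_a[i] if i < len(tokens_a) else None
        tok_b = tokens_b[i] if i < len(tokens_b) else None
        return i, tok_a, tok_b
    return None


# Token lists copied from the tokenizer outputs shown above.
no_space = ['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'What',
            'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']
with_space = ['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'ĠWhat',
              'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']

print(first_token_difference(no_space, with_space))
# (6, 'What', 'ĠWhat')
```

With a real tokenizer, the same helper could be called as `first_token_difference(tokenizer.tokenize(a), tokenizer.tokenize(b))`, assuming `tokenizer` is the same object used in the snippets above.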
