Hi, thanks for the quick support for the qwen3-embedding model!
I noticed a discrepancy regarding the query prefix format. In the official post, the instruction and query are concatenated without a space. However, in the Tevatron example, there's a space after the colon:
--query_prefix "Find a relevant scientific paper abstract to support or reject the claim. Query: " \
This seemingly minor difference changes the tokenization: without the space the first query token is 'What', while with the space it is 'ĠWhat':
# Without space after colon
>>> print(tokenizer.tokenize("Answer the query\nQuery:What is the capital of China?"))
['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'What', 'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']
# With space after colon
>>> print(tokenizer.tokenize("Answer the query\nQuery: What is the capital of China?"))
['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'ĠWhat', 'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']
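To pin down exactly where the two tokenizations diverge, a small helper like the one below could be used. This is only a minimal sketch: `first_divergence` is a hypothetical helper name (not part of Tevatron or transformers), and the two token lists are copied from the output above rather than recomputed with the tokenizer.

```python
def first_divergence(tokens_a, tokens_b):
    """Return the index of the first position where two token lists differ.

    Returns None if the lists are identical. Hypothetical helper for
    illustration only.
    """
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    # One list is a strict prefix of the other.
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None


# Token lists copied verbatim from the tokenizer output above.
no_space = ['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'What',
            'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']
with_space = ['Answer', 'Ġthe', 'Ġquery', 'Ċ', 'Query', ':', 'ĠWhat',
              'Ġis', 'Ġthe', 'Ġcapital', 'Ġof', 'ĠChina', '?']

idx = first_divergence(no_space, with_space)
print(idx, no_space[idx], with_space[idx])  # 6 What ĠWhat
```

Both sequences have the same length, so the only difference is the single token right after the colon.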