[data, model] feat: support Qwen3-VL textual token-based time encoding #297
base: main
Conversation
Code Review
This pull request introduces support for Qwen3-VL's textual token-based time encoding. The changes are well-structured, adding a new chat template, a data processor, and optimizing video loading. However, I've identified a critical issue with a duplicate function definition that will cause the new logic to be ignored, and a couple of high-severity issues related to potential crashes and unexpected side effects. Please review the detailed comments.
```python
    return video_inputs, audio_inputs


def fetch_videos_metadata(videos: List[VideoInput], **kwargs):
```
Can we merge this function into fetch_videos? It seems returning metadata will be necessary, so fetch_videos could return videos, video_metadata, audios, audio_metadata.
I have updated the logic to support returning the full set of 4 values (video, video_metadata, audio, audio_metadata). However, I decided to keep fetch_videos_metadata separate from fetch_videos.
This design ensures isolation: fetch_videos_metadata is for the new pipeline requiring full metadata, while fetch_videos falls back to the original behavior (returning 2 values) to ensure backward compatibility for other tasks.
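The split described in this reply can be sketched as follows. This is a minimal illustration of the design choice, not the PR's actual code; the loader `_load_video` and the metadata field names are hypothetical stand-ins.

```python
from typing import Any, Dict, List, Tuple

VideoInput = Any  # placeholder alias for the project's real video input type


def _load_video(video: VideoInput) -> Tuple[Any, Dict[str, Any]]:
    # Hypothetical loader: a real implementation would decode frames and
    # collect metadata such as fps and total frame count from the decoder.
    frames = video
    metadata = {"fps": 2.0, "total_num_frames": 16}
    return frames, metadata


def fetch_videos_metadata(videos: List[VideoInput], **kwargs):
    """New pipeline: returns the full 4-tuple including metadata."""
    video_inputs, video_metadata = [], []
    for v in videos:
        frames, meta = _load_video(v)
        video_inputs.append(frames)
        video_metadata.append(meta)
    audio_inputs, audio_metadata = [], []  # audio extraction omitted here
    return video_inputs, video_metadata, audio_inputs, audio_metadata


def fetch_videos(videos: List[VideoInput], **kwargs):
    """Original behavior: 2 values, kept for backward compatibility."""
    video_inputs, _, audio_inputs, _ = fetch_videos_metadata(videos, **kwargs)
    return video_inputs, audio_inputs
```

Keeping the 2-value `fetch_videos` as a thin wrapper means existing call sites stay untouched while the new pipeline consumes the 4-tuple.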
```python
        final_fps_inputs.append(processed_fps)

    return final_video_inputs, final_fps_inputs
```
Since returns[1] is metadata, the second return value should be a dict.
The return value was actually already a List[Dict], but the previous variable name final_fps_inputs was indeed misleading (it sounded like a list of floats).
I have renamed the variables in the latest commit to explicitly reflect that they are lists of metadata dictionaries, which should clear up the confusion.
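To illustrate the point of the rename: the second return value is a list of per-video metadata dicts, not a list of raw fps floats. The field names below are assumptions for illustration, not necessarily the exact keys used in the PR.

```python
# One metadata dict per video; keys shown here are illustrative.
final_video_metadata = [
    {"fps": 2.0, "frames_indices": [0, 4, 8, 12], "total_num_frames": 16},
    {"fps": 1.0, "frames_indices": [0, 8], "total_num_frames": 16},
]

# Downstream code reads per-video fields by key instead of assuming floats:
fps_values = [meta["fps"] for meta in final_video_metadata]
```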
```python
    return [tokenized_example]


def process_sample_qwen3_vl(
```
@piyifan123 Please check whether there's any conflict with the original process_sample_qwen3_vl. If not, can we use this new function?
This updates multimodal_chat_template, vlm_data_process, and video_utils to support Qwen3-VL's new time encoding strategy. Signed-off-by: liugengyuan <liugengyuan@bytedance.com>
Force-pushed from 2c3a978 to 2e71e70.
1. Remove duplicate definition of 'process_sample_qwen3_vl'. 2. Add try-except block for video token processing safety. 3. Fix logic bug: move 'encode_messages' inside conditional blocks to prevent crashes when processing images.
1. Support returning full 4-tuple (video, video_meta, audio, audio_meta) while keeping 'fetch_videos' and 'fetch_videos_metadata' isolated for backward compatibility. 2. Rename ambiguous variables (e.g., 'final_fps_inputs') to explicitly reflect they are lists of metadata dictionaries.
Overview
This PR integrates the specific data processing and modeling logic required for Qwen3-VL, specifically focusing on its unique textual token-based time encoding strategy.
Unlike previous model versions, Qwen3-VL requires explicit time tags (e.g., <t 1.5 seconds>) inserted into the context before visual tokens. This PR implements the official timestamp calculation logic, adds a dedicated data processor for Qwen3-VL video samples, and optimizes the underlying video loading utilities for better memory efficiency during training.

Design & Code Changes
The changes are primarily concentrated in the data processing pipeline and chat template adapter:
- multimodal_chat_template.py:
  - Adds Qwen3VLChatTemplate, inheriting from Qwen2VLTemplate.
  - Adds _calculate_timestamps to replicate the official strategy: padding frame indices, converting to seconds, and averaging based on temporal_patch_size=2.
  - Updates encode_messages to dynamically inject <t {timestamp} seconds> tags and <|vision_start|>/<|vision_end|> tokens into the input stream.
- vlm_data_process.py:
  - Adds process_sample_qwen3_vl to handle the specific requirements of Qwen3-VL video inputs.
  - Builds the conversations field from question and answer keys if the standard conversation format is missing.
  - frames_indices.
- video_utils.py:

Data Format
The new data processor (process_sample_qwen3_vl_video_r1) expects input samples to follow this schema: