[data, model] feat: support Qwen3-VL textual token-based time encoding #297
base: main
Conversation
Code Review
This pull request introduces support for Qwen3-VL's textual token-based time encoding. The changes are well-structured, adding a new chat template, a data processor, and optimizing video loading. However, I've identified a critical issue with a duplicate function definition that will cause the new logic to be ignored, and a couple of high-severity issues related to potential crashes and unexpected side effects. Please review the detailed comments.
```python
    return video_inputs, audio_inputs


def fetch_videos_metadata(videos: List[VideoInput], **kwargs):
```
Can we merge this function into fetch_videos? It seems returning metadata will be necessary, so fetch_videos could return videos, video_metadata, audios, audio_metadata.
I have updated the logic to support returning the full set of 4 values (video, video_metadata, audio, audio_metadata). However, I decided to keep fetch_videos_metadata separate from fetch_videos.
This design ensures isolation: fetch_videos_metadata is for the new pipeline requiring full metadata, while fetch_videos falls back to the original behavior (returning 2 values) to ensure backward compatibility for other tasks.
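The split described in this reply can be sketched as follows. This is a minimal illustration of the design choice, not the PR's actual code; the loader `_load_video` and the metadata field names are hypothetical stand-ins.

```python
from typing import Any, Dict, List, Tuple

VideoInput = Any  # placeholder alias for the project's real video input type


def _load_video(video: VideoInput) -> Tuple[Any, Dict[str, Any]]:
    # Hypothetical loader: a real implementation would decode frames and
    # collect metadata such as fps and total frame count from the decoder.
    frames = video
    metadata = {"fps": 2.0, "total_num_frames": 16}
    return frames, metadata


def fetch_videos_metadata(videos: List[VideoInput], **kwargs):
    """New pipeline: returns the full 4-tuple including metadata."""
    video_inputs, video_metadata = [], []
    for v in videos:
        frames, meta = _load_video(v)
        video_inputs.append(frames)
        video_metadata.append(meta)
    audio_inputs, audio_metadata = [], []  # audio extraction omitted here
    return video_inputs, video_metadata, audio_inputs, audio_metadata


def fetch_videos(videos: List[VideoInput], **kwargs):
    """Original behavior: 2 values, kept for backward compatibility."""
    video_inputs, _, audio_inputs, _ = fetch_videos_metadata(videos, **kwargs)
    return video_inputs, audio_inputs
```

Keeping the 2-value `fetch_videos` as a thin wrapper means existing call sites stay untouched while the new pipeline consumes the 4-tuple.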
```python
        final_fps_inputs.append(processed_fps)

    return final_video_inputs, final_fps_inputs
```
Since returns[1] is metadata, the second return value should be a dict.
The return value was actually already a List[Dict], but the previous variable name final_fps_inputs was indeed misleading (it sounded like a list of floats).
I have renamed the variables in the latest commit to explicitly reflect that they are lists of metadata dictionaries, which should clear up the confusion.
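To illustrate the point of the rename: the second return value is a list of per-video metadata dicts, not a list of raw fps floats. The field names below are assumptions for illustration, not necessarily the exact keys used in the PR.

```python
# One metadata dict per video; keys shown here are illustrative.
final_video_metadata = [
    {"fps": 2.0, "frames_indices": [0, 4, 8, 12], "total_num_frames": 16},
    {"fps": 1.0, "frames_indices": [0, 8], "total_num_frames": 16},
]

# Downstream code reads per-video fields by key instead of assuming floats:
fps_values = [meta["fps"] for meta in final_video_metadata]
```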
```python
    return [tokenized_example]


def process_sample_qwen3_vl(
```
@piyifan123 Please check whether there's any conflict with the original process_sample_qwen3_vl. If not, can we use this new function?
This updates multimodal_chat_template, vlm_data_process, and video_utils to support Qwen3-VL's new time encoding strategy. Signed-off-by: liugengyuan <liugengyuan@bytedance.com>
Force-pushed from 2c3a978 to 2e71e70.
1. Remove duplicate definition of 'process_sample_qwen3_vl'. 2. Add try-except block for video token processing safety. 3. Fix logic bug: move 'encode_messages' inside conditional blocks to prevent crashes when processing images.
1. Support returning full 4-tuple (video, video_meta, audio, audio_meta) while keeping 'fetch_videos' and 'fetch_videos_metadata' isolated for backward compatibility. 2. Rename ambiguous variables (e.g., 'final_fps_inputs') to explicitly reflect they are lists of metadata dictionaries.
Overview
This PR integrates the specific data processing and modeling logic required for Qwen3-VL, specifically focusing on its unique textual token-based time encoding strategy.
Unlike previous model versions, Qwen3-VL requires explicit time tags (e.g., <t 1.5 seconds>) inserted into the context before visual tokens. This PR implements the official timestamp calculation logic, adds a dedicated data processor for Qwen3-VL video samples, and optimizes the underlying video loading utilities for better memory efficiency during training.

Design & Code Changes
The changes are primarily concentrated in the data processing pipeline and chat template adapter:
- multimodal_chat_template.py:
  - Adds Qwen3VLChatTemplate, inheriting from Qwen2VLTemplate.
  - Adds _calculate_timestamps to replicate the official strategy: padding frame indices, converting to seconds, and averaging based on temporal_patch_size=2.
  - Updates encode_messages to dynamically inject <t {timestamp} seconds> tags and <|vision_start|>/<|vision_end|> tokens into the input stream.
- vlm_data_process.py:
  - Adds process_sample_qwen3_vl to handle the specific requirements of Qwen3-VL video inputs.
  - Builds the conversations field from question and answer keys if the standard conversation format is missing.
  - frames_indices.
- video_utils.py:

Data Format
The new data processor (process_sample_qwen3_vl_video_r1) expects input samples to follow this schema: