
Conversation

@GarrickBot

Overview

This PR integrates the data processing and modeling logic required for Qwen3-VL, focusing on its textual, token-based time encoding strategy.

Unlike previous model versions, Qwen3-VL requires explicit time tags (e.g., <t 1.5 seconds>) inserted into the context before visual tokens. This PR implements the official timestamp calculation logic, adds a dedicated data processor for Qwen3-VL video samples, and optimizes the underlying video loading utilities for better memory efficiency during training.

Design & Code Changes

The changes are primarily concentrated in the data processing pipeline and chat template adapter:

  • multimodal_chat_template.py:

    • New Class: Added Qwen3VLChatTemplate inheriting from Qwen2VLTemplate.
    • Timestamp Logic: Implemented _calculate_timestamps to replicate the official strategy: padding frame indices, converting to seconds, and averaging based on temporal_patch_size=2 (see the first sketch after this list).
    • Message Encoding: Updated encode_messages to dynamically inject <t {timestamp} seconds> tags and <|vision_start|>/<|vision_end|> tokens into the input stream.
  • vlm_data_process.py:

    • New Processor: Added process_sample_qwen3_vl to handle the specific requirements of Qwen3-VL video inputs.
    • Data Adapter: Added logic to automatically construct the conversations field from question and answer keys if the standard conversation format is missing (a sketch follows the data format example below).
    • Metadata Fix: Updated video metadata handling to correctly generate and assign frames_indices.
  • video_utils.py:

    • Workflow Optimization: Refactored load_video_from_bytes_list to directly accept pre-stored frame byte lists as input, bypassing online frame extraction (decoding) and significantly reducing memory overhead (see the second sketch after this list).
    • Metadata Adaptation: Updated the function to correctly handle input VideoMetadata and ensure the output includes the modified VideoMetadata (reflecting changes in FPS or frame counts) for downstream processing.
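
A minimal sketch of the timestamp strategy described above, assuming the helper receives the sampled frame indices plus the source FPS; the exact signature and the tag format around the vision tokens are inferred from this description, not copied from the merged code:

from typing import List

def _calculate_timestamps(frame_indices: List[int], fps: float,
                          temporal_patch_size: int = 2) -> List[float]:
    # Pad the frame indices so they divide evenly into temporal patches.
    remainder = len(frame_indices) % temporal_patch_size
    if remainder:
        frame_indices = frame_indices + [frame_indices[-1]] * (temporal_patch_size - remainder)
    # Convert indices to seconds, then average each group of `temporal_patch_size`
    # frames so a single timestamp describes one temporal patch.
    seconds = [idx / fps for idx in frame_indices]
    return [
        sum(seconds[i:i + temporal_patch_size]) / temporal_patch_size
        for i in range(0, len(seconds), temporal_patch_size)
    ]

# encode_messages would then render one textual tag ahead of each patch's visual tokens,
# e.g. "<t 1.5 seconds><|vision_start|> ... <|vision_end|>".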
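
And a rough sketch of the refactored loading path in video_utils.py, assuming each video arrives as a list of already-encoded frame images (e.g. JPEG bytes) sharing one resolution, and that metadata is carried as a plain dict; the real VideoMetadata type and decoding backend may differ:

import io
from typing import Dict, List, Tuple

import numpy as np
from PIL import Image

def load_video_from_bytes_list(frame_bytes: List[bytes], fps: float) -> Tuple[np.ndarray, Dict]:
    # Frames were already extracted offline, so no video-container decoding is needed;
    # only the individual images are decoded here.
    frames = np.stack(
        [np.asarray(Image.open(io.BytesIO(b)).convert("RGB")) for b in frame_bytes]
    )
    metadata = {
        "fps": fps,
        "total_num_frames": len(frame_bytes),
        "frames_indices": list(range(len(frame_bytes))),
    }
    return frames, metadata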

Data Format

The new data processor (process_sample_qwen3_vl) expects input samples to follow this schema:

sample = {
    # 1. 'conversations' is required (or auto-constructed from question/answer)
    "conversations": [
        {
            "role": "user", 
            "content": [
                {"type": "video"}, 
                {"type": "text", "text": "Describe this video."}
            ]
        },
        {"role": "assistant", "content": "This is a video of..."}
    ],
    
    # 2. 'fps' must be a list containing the float FPS value
    "fps": [2.0],
    
    # 3. 'videos' must be a list of raw byte frames
    "videos": [
        [b'\xff\xd8...', b'\xff\xd8...', ...] # List[bytes]
    ]
}
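
For samples that ship flat question/answer fields instead of conversations, the auto-construction mentioned in the Design section could look roughly like this (a sketch; only question, answer, and conversations are taken from the schema above, the helper name is hypothetical):

def ensure_conversations(sample: dict) -> dict:
    # Sketch of the fallback: build the standard `conversations` field from
    # flat question/answer keys when it is missing from the sample.
    if "conversations" not in sample:
        sample["conversations"] = [
            {
                "role": "user",
                "content": [
                    {"type": "video"},
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {"role": "assistant", "content": sample["answer"]},
        ]
    return sample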

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Qwen3-VL's textual token-based time encoding. The changes are well-structured, adding a new chat template, a data processor, and optimizing video loading. However, I've identified a critical issue with a duplicate function definition that will cause the new logic to be ignored, and a couple of high-severity issues related to potential crashes and unexpected side effects. Please review the detailed comments.

@Coach257 Coach257 self-requested a review December 26, 2025 04:23
]
return video_inputs, audio_inputs

def fetch_videos_metadata(videos: List[VideoInput], **kwargs):
Collaborator


Can we merge this function into fetch_videos? It seems returning metadata will be necessary, so fetch_videos could return videos, video_metadata, audios, audio_metadata.

Author

@GarrickBot GarrickBot Dec 29, 2025


I have updated the logic to support returning the full set of 4 values (video, video_metadata, audio, audio_metadata). However, I decided to keep fetch_videos_metadata separate from fetch_videos.
This keeps the two paths isolated: fetch_videos_metadata serves the new pipeline that requires full metadata, while fetch_videos keeps the original behavior (returning 2 values) for backward compatibility with other tasks.
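
One hypothetical arrangement of that split (the real implementations may be fully independent; _decode_video and every name beyond fetch_videos and fetch_videos_metadata are stand-ins):

from typing import Any, Dict, List, Tuple

def _decode_video(video: Any, **kwargs) -> Tuple[Any, Dict]:
    # Stand-in for the real per-video decoding helper: returns frames plus a metadata dict.
    return video, {"fps": kwargs.get("fps", 2.0)}

def fetch_videos_metadata(videos: List[Any], **kwargs) -> Tuple[list, List[Dict], list, List[Dict]]:
    # New pipeline: return decoded frames plus per-video metadata dicts
    # (audio handled analogously; left empty here for brevity).
    video_inputs, video_metadata = [], []
    for video in videos:
        frames, meta = _decode_video(video, **kwargs)
        video_inputs.append(frames)
        video_metadata.append(meta)
    return video_inputs, video_metadata, [], []

def fetch_videos(videos: List[Any], **kwargs) -> Tuple[list, list]:
    # Backward-compatible entry point: existing callers still receive exactly 2 values.
    video_inputs, _, audio_inputs, _ = fetch_videos_metadata(videos, **kwargs)
    return video_inputs, audio_inputs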

final_fps_inputs.append(processed_fps)


return final_video_inputs, final_fps_inputs
Collaborator


As returns[1] is metadata, the second return value should be a dict.

Author


The return value was actually already a List[Dict], but the previous variable name final_fps_inputs was indeed misleading (it sounded like a list of floats).
I have renamed the variables in the latest commit to explicitly reflect that they are lists of metadata dictionaries, which should clear up the confusion.


return [tokenized_example]

def process_sample_qwen3_vl(
Collaborator


@piyifan123 Please check if there's any conflict with the original process_sample_qwen3_vl. If not, can we use this new function?

@Coach257 Coach257 requested a review from piyifan123 December 26, 2025 06:16
This updates multimodal_chat_template, vlm_data_process, and video_utils to support Qwen3-VL's new time encoding strategy.

Signed-off-by: liugengyuan <liugengyuan@bytedance.com>
@GarrickBot GarrickBot force-pushed the lgy/qwen3-vl-video-processing-change branch from 2c3a978 to 2e71e70 on December 28, 2025 18:29
1. Remove duplicate definition of 'process_sample_qwen3_vl'.
2. Add try-except block for video token processing safety.
3. Fix logic bug: move 'encode_messages' inside conditional blocks to prevent crashes when processing images.
1. Support returning full 4-tuple (video, video_meta, audio, audio_meta) while keeping 'fetch_videos' and 'fetch_videos_metadata' isolated for backward compatibility.
2. Rename ambiguous variables (e.g., 'final_fps_inputs') to explicitly reflect they are lists of metadata dictionaries.
@Coach257 Coach257 added the enhancement New feature or request label Dec 29, 2025