support transformers multi-modal grpo #131

Merged

tastelikefeet merged 3 commits into modelscope:main · Apr 1, 2026
Conversation
Contributor
Code Review
This pull request introduces multimodal GRPO training capabilities, featuring a new training script for the Qwen3.5 VL model on the CLEVR dataset. Core framework enhancements include a new MultiModalAccuracyReward, a CLEVRProcessor, and a refactored vLLM sampler that better supports multimodal data and standard message formats. Review feedback identifies a logic error in the training loop where metrics are reset prematurely, as well as opportunities to optimize reward function instantiation and fix a potential typo in the reward computation.
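To make the review concrete, here is a minimal sketch of what an accuracy reward like the `MultiModalAccuracyReward` mentioned above might look like. The class name comes from the PR; the internals (the `<answer>...</answer>` tag convention and the call signature) are assumptions, not the PR's actual implementation.

```python
import re


class MultiModalAccuracyReward:
    """Reward 1.0 when the model's final answer matches the ground truth, else 0.0.

    A minimal sketch: assumes the completion wraps its answer in
    <answer>...</answer> tags, a common convention in GRPO-style setups.
    """

    ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

    def __call__(self, completions, ground_truths):
        rewards = []
        for completion, truth in zip(completions, ground_truths):
            match = self.ANSWER_RE.search(completion)
            if match is None:
                rewards.append(0.0)  # unparseable output gets no reward
                continue
            answer = match.group(1).strip().lower()
            rewards.append(1.0 if answer == str(truth).strip().lower() else 0.0)
        return rewards
```

For a CLEVR-style counting question, `MultiModalAccuracyReward()(["I count <answer>3</answer>"], ["3"])` would yield `[1.0]`.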
Resolved conflicts in template/base.py:
- _check_max_length: adopt upstream's cleaner _truncate_feature approach
- _process_mm_messages: merge our List[Dict] content support into upstream's refactored method structure

Made-with: Cursor
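The truncation policy referenced in `_check_max_length` can be sketched roughly as follows. The function name and key set here are assumptions for illustration; the point is that only token-sequence fields are clipped, while multimodal tensors pass through untouched.

```python
def truncate_sequence_fields(feature: dict, max_length: int) -> dict:
    """Sketch of length-capping that only truncates sequence fields.

    Token-aligned fields (input_ids, labels, attention_mask) are clipped to
    max_length; multimodal tensors such as pixel_values or image_grid_thw are
    left intact, since slicing them by token count would corrupt them.
    """
    SEQUENCE_KEYS = ("input_ids", "labels", "attention_mask")
    return {
        key: (value[:max_length] if key in SEQUENCE_KEYS else value)
        for key, value in feature.items()
    }
```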
tastelikefeet approved these changes · Apr 1, 2026
Summary

- Support `content: List[Dict]` in the template pipeline
- `_check_max_length` only truncates `input_ids`/`labels`, keeping multimodal tensors intact
- Strip `None` keys from content blocks to prevent Jinja template misparsing
- `multi_modal_data`/`mm_processor_kwargs` pass-through in the vLLM sampling pipeline
- Forward `encode()` kwargs in `LazyDataset` for proper `add_generation_prompt` support
- New: `CLEVRProcessor`, `MultiModalAccuracyReward`, and a multimodal GRPO demo

Changed files

- `template/base.py` — `_build_mm_messages` supports both `List[Dict]` and legacy str content; `_apply_chat_template` strips null keys; `_check_max_length` only truncates sequence fields
- `sampler/vllm_sampler/vllm_engine.py` — `sample()` accepts a `multi_modal_data` dict directly
- `sampler/vllm_sampler/vllm_sampler.py` — extract `multi_modal_data` from message content blocks for vLLM; add `mm_processor_kwargs` forwarding
- `dataset/lazy_dataset.py` — forward `encode()` kwargs to `batch_encode`
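The `vllm_sampler` change — extracting `multi_modal_data` from `List[Dict]` content blocks — could look roughly like this. The helper name and the exact block schema (OpenAI-style `type`/`image`/`text` keys) are assumptions for illustration, not the PR's actual code.

```python
def split_mm_messages(messages):
    """Hypothetical helper: pull images out of List[Dict] content blocks into a
    vLLM-style multi_modal_data dict, flattening each message back to plain
    string content for the chat template."""
    images = []
    text_messages = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):  # legacy plain-string content
            text_messages.append(message)
            continue
        text_parts = []
        for block in content:
            if block.get("type") == "image":
                images.append(block["image"])
            elif block.get("type") == "text":
                text_parts.append(block["text"])
        text_messages.append({**message, "content": "".join(text_parts)})
    multi_modal_data = {"image": images} if images else None
    return text_messages, multi_modal_data
```

The returned `multi_modal_data` dict would then be forwarded alongside the templated prompt, matching the `sample()` signature change in `vllm_engine.py`.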