
Conversation

@mnoukhov (Contributor) commented Jan 9, 2026

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on expanding and refining the training infrastructure for reinforcement learning from human feedback (RLHF) models, specifically for mathematical reasoning tasks. It updates core Ray placement strategies, streamlines debug configurations, and significantly overhauls the OLMo 7B training script. Crucially, it introduces a comprehensive set of new training and debugging scripts for various Qwen models, enabling broader experimentation and development within the RLZero Math domain.

Highlights

  • Core Training Logic Update: The Ray placement group strategy in open_instruct/grpo_fast.py has been updated from 'STRICT_SPREAD' to 'SPREAD' for potentially more flexible resource allocation.
  • Debug Script Refinement: The grpo_fast.sh debug script now utilizes uv run --active, switches the default model to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, and streamlines VLLM configuration by removing several parameters and explicitly setting vllm_num_engines.
  • OLMo 7B Training Configuration Updates: The 7b_rlzero_math.sh script for OLMo 7B training has been significantly updated with revised local evaluation datasets, new evaluation tasks, a standardized Beaker image, an increased number of nodes, and adjusted GRPO training parameters such as async_steps, num_samples_per_prompt_rollout, response_length, pack_length, total_episodes, and vllm_num_engines. It also now enables evaluation on step 0 and disables masking of truncated completions.
  • New Qwen Model Training Scripts: Five new shell scripts have been introduced under scripts/train/qwen/ to support training and debugging for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, Qwen/Qwen3-1.7B-Base, and Qwen/Qwen3-4B-Base models on the RLZero Math dataset, including local and Beaker-based configurations. A usage sketch follows this list.
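
As a usage illustration only, here is a hedged sketch of how one of these scripts might be invoked. The script path and default image come from this PR; the custom image tag below is a made-up placeholder, and the forwarding of extra arguments relies on the "$@" at the end of the script's grpo_fast.py command.

# Launch with the script's default Beaker image ("nathanl/open_instruct_auto"):
bash scripts/train/qwen/1.5b_rlzero_math.sh

# If the first argument starts with your Beaker user name (as reported by
# `beaker account whoami`), the script treats it as a custom image and shifts it off;
# any remaining arguments are appended to the grpo_fast.py command via "$@".
bash scripts/train/qwen/1.5b_rlzero_math.sh "<your-beaker-user>/my_custom_image"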


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request updates several training and debug scripts. The changes involve modifying hyperparameters, updating model paths, and adding new scripts for different model sizes and configurations. My review focuses on improving the correctness and maintainability of these scripts. I've identified a potentially incorrect command-line flag that could cause script failures, a recurring typo in comments that might lead to confusion, and an instance of code duplication that could impact future maintenance.

export VLLM_DISABLE_COMPILE_CACHE=1
export VLLM_USE_V1=1
- uv run python open_instruct/grpo_fast.py \
+ uv run --active python open_instruct/grpo_fast.py \

Severity: high

The --active flag for uv run appears to be incorrect. According to the uv documentation, uv run does not have an --active flag. This will likely cause the script to fail. Please remove it or replace it with the correct flag if you intended something else.

Suggested change:
- uv run --active python open_instruct/grpo_fast.py \
+ uv run python open_instruct/grpo_fast.py \

LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 32"
LOCAL_EVAL_SPLITS="train"

uv run --active open_instruct/grpo_fast.py \

Severity: high

The --active flag for uv run appears to be incorrect. According to the uv documentation, uv run does not have an --active flag. This will likely cause the script to fail. Please remove it or replace it with the correct flag if you intended something else.

Suggested change:
- uv run --active open_instruct/grpo_fast.py \
+ uv run open_instruct/grpo_fast.py \

shift
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion. This same issue is present in other new scripts in this PR.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

Comment on lines +1 to +84
#!/bin/bash

EXP_NAME="qwen1.5distill_rlzero_math"
MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
DATASETS="allenai/Dolci-RLZero-Math-7B 1.0"

LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 32"
LOCAL_EVAL_SPLITS="train train"

EVALS="aime:zs_cot_r1::pass_at_32_2024_rlzero,aime:zs_cot_r1::pass_at_32_2025_rlzero"

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME
if [[ "$1" == "$BEAKER_USER"* ]]; then
BEAKER_IMAGE="$1"
shift
fi

cluster=ai2/augusta
uv run mason.py \
--task_name ${EXP_NAME} \
--cluster ${cluster} \
--workspace ai2/olmo-instruct \
--priority normal \
--pure_docker_mode \
--image ${BEAKER_IMAGE} \
--preemptible \
--num_nodes 1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env VLLM_ATTENTION_BACKEND="FLASH_ATTN" \
--gpus 4 \
--budget ai2/oe-adapt \
-- source configs/beaker_configs/ray_node_setup.sh \
\&\& uv run open_instruct/grpo_fast.py \
--exp_name ${EXP_NAME} \
--beta 0.0 \
--async_steps 8 \
--inflight_updates \
--no_resampling_pass_rate 0.875 \
--truncated_importance_sampling_ratio_cap 2.0 \
--advantage_normalization_type centered \
--active_sampling \
--num_samples_per_prompt_rollout 8 \
--num_unique_prompts_rollout 32 \
--num_mini_batches 1 \
--learning_rate 1e-6 \
--per_device_train_batch_size 1 \
--kl_estimator 2 \
--dataset_mixer_list $DATASETS \
--dataset_mixer_list_splits train \
--dataset_mixer_eval_list $LOCAL_EVALS \
--dataset_mixer_eval_list_splits $LOCAL_EVAL_SPLITS \
--max_prompt_token_length 2048 \
--response_length 8192 \
--pack_length 18432 \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template_name olmo_thinker_rlzero \
--non_stop_penalty False \
--temperature 1.0 \
--total_episodes 256000 \
--deepspeed_stage 2 \
--num_learners_per_node 1 \
--vllm_num_engines 3 \
--vllm_tensor_parallel_size 1 \
--lr_scheduler_type constant \
--apply_verifiable_reward true \
--seed 1 \
--local_eval_every 25 \
--save_freq 100 \
--checkpoint_state_freq 100 \
--gradient_checkpointing \
--with_tracking \
--vllm_enable_prefix_caching \
--clip_higher 0.272 \
--mask_truncated_completions False \
--oe_eval_max_length 32768 \
--try_launch_beaker_eval_jobs_on_weka True \
--eval_priority normal \
--eval_on_step_0 True \
--oe_eval_tasks $EVALS \
--load_ref_policy False \
--oe_eval_gpu_multiplier 2 $@

Severity: medium

This script appears to be an exact copy of scripts/train/qwen/1.5b_rlzero_math.sh. Having identical files creates a maintenance burden, as changes will need to be applied in both places. Consider consolidating them into a single script that can be parameterized for debug vs. full runs, or using a shared configuration file.
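
For illustration only, a minimal sketch of the consolidation suggested above, assuming a single shared launcher with a mode switch; the file name, the MODE variable, and the debug-mode values are hypothetical and not part of this PR:

#!/bin/bash
# Hypothetical shared launcher replacing the two identical scripts.
# MODE selects a quick smoke test ("debug") or the full run ("full").
MODE="${1:-full}"
if [[ $# -gt 0 ]]; then shift; fi

EXP_NAME="qwen1.5distill_rlzero_math"
MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

if [[ "$MODE" == "debug" ]]; then
    TOTAL_EPISODES=1000      # hypothetical small budget for a smoke test
    PRIORITY=low             # hypothetical lower Beaker priority for debug runs
else
    TOTAL_EPISODES=256000    # value used by the full script in this PR
    PRIORITY=normal
fi

echo "exp=${EXP_NAME} model=${MODEL_NAME_OR_PATH} mode=${MODE} episodes=${TOTAL_EPISODES} priority=${PRIORITY}"
# The shared mason.py / grpo_fast.py invocation would go here, passing
# --total_episodes ${TOTAL_EPISODES} and --priority ${PRIORITY} and forwarding "$@".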

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER
