
Conversation

@mnoukhov (Contributor) commented Jan 9, 2026

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on expanding and refining the training infrastructure for reinforcement learning from human feedback (RLHF) models, specifically for mathematical reasoning tasks. It updates core Ray placement strategies, streamlines debug configurations, and significantly overhauls the OLMo 7B training script. Crucially, it introduces a comprehensive set of new training and debugging scripts for various Qwen models, enabling broader experimentation and development within the RLZero Math domain.

Highlights

  • Core Training Logic Update: The Ray placement group strategy in open_instruct/grpo_fast.py has been updated from 'STRICT_SPREAD' to 'SPREAD' for potentially more flexible resource allocation.
  • Debug Script Refinement: The grpo_fast.sh debug script now utilizes uv run --active, switches the default model to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, and streamlines VLLM configuration by removing several parameters and explicitly setting vllm_num_engines.
  • OLMo 7B Training Configuration Updates: The 7b_rlzero_math.sh script for OLMo 7B training has been significantly updated with revised local evaluation datasets, new evaluation tasks, a standardized Beaker image, an increased number of nodes, and adjusted GRPO training parameters such as async_steps, num_samples_per_prompt_rollout, response_length, pack_length, total_episodes, and vllm_num_engines. It also now enables evaluation on step 0 and disables masking of truncated completions.
  • New Qwen Model Training Scripts: Five new shell scripts have been introduced under scripts/train/qwen/ to support training and debugging for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, Qwen/Qwen3-1.7B-Base, and Qwen/Qwen3-4B-Base models on the RLZero Math dataset, including local and Beaker-based configurations. A usage sketch follows this list.
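
As a usage illustration only, here is a hedged sketch of how one of these scripts might be invoked. The script path and default image come from this PR; the custom image tag below is a made-up placeholder, and the forwarding of extra arguments relies on the "$@" at the end of the script's grpo_fast.py command.

# Launch with the script's default Beaker image ("nathanl/open_instruct_auto"):
bash scripts/train/qwen/1.5b_rlzero_math.sh

# If the first argument starts with your Beaker user name (as reported by
# `beaker account whoami`), the script treats it as a custom image and shifts it off;
# any remaining arguments are appended to the grpo_fast.py command via "$@".
bash scripts/train/qwen/1.5b_rlzero_math.sh "<your-beaker-user>/my_custom_image"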


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request updates several training and debug scripts. The changes involve modifying hyperparameters, updating model paths, and adding new scripts for different model sizes and configurations. My review focuses on improving the correctness and maintainability of these scripts. I've identified a potentially incorrect command-line flag that could cause script failures, a recurring typo in comments that might lead to confusion, and an instance of code duplication that could impact future maintenance.

export VLLM_DISABLE_COMPILE_CACHE=1
export VLLM_USE_V1=1
- uv run python open_instruct/grpo_fast.py \
+ uv run --active python open_instruct/grpo_fast.py \

Severity: high

The --active flag for uv run appears to be incorrect. According to the uv documentation, uv run does not have an --active flag. This will likely cause the script to fail. Please remove it or replace it with the correct flag if you intended something else.

Suggested change:
- uv run --active python open_instruct/grpo_fast.py \
+ uv run python open_instruct/grpo_fast.py \

LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 32"
LOCAL_EVAL_SPLITS="train"

uv run --active open_instruct/grpo_fast.py \

Severity: high

The --active flag for uv run appears to be incorrect. According to the uv documentation, uv run does not have an --active flag. This will likely cause the script to fail. Please remove it or replace it with the correct flag if you intended something else.

Suggested change:
- uv run --active open_instruct/grpo_fast.py \
+ uv run open_instruct/grpo_fast.py \

shift
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion. This same issue is present in other new scripts in this PR.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

Comment on lines +1 to +84
#!/bin/bash

EXP_NAME="qwen1.5distill_rlzero_math"
MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
DATASETS="allenai/Dolci-RLZero-Math-7B 1.0"

LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 32"
LOCAL_EVAL_SPLITS="train train"

EVALS="aime:zs_cot_r1::pass_at_32_2024_rlzero,aime:zs_cot_r1::pass_at_32_2025_rlzero"

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME
if [[ "$1" == "$BEAKER_USER"* ]]; then
BEAKER_IMAGE="$1"
shift
fi

cluster=ai2/augusta
uv run mason.py \
--task_name ${EXP_NAME} \
--cluster ${cluster} \
--workspace ai2/olmo-instruct \
--priority normal \
--pure_docker_mode \
--image ${BEAKER_IMAGE} \
--preemptible \
--num_nodes 1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env VLLM_ATTENTION_BACKEND="FLASH_ATTN" \
--gpus 4 \
--budget ai2/oe-adapt \
-- source configs/beaker_configs/ray_node_setup.sh \
\&\& uv run open_instruct/grpo_fast.py \
--exp_name ${EXP_NAME} \
--beta 0.0 \
--async_steps 8 \
--inflight_updates \
--no_resampling_pass_rate 0.875 \
--truncated_importance_sampling_ratio_cap 2.0 \
--advantage_normalization_type centered \
--active_sampling \
--num_samples_per_prompt_rollout 8 \
--num_unique_prompts_rollout 32 \
--num_mini_batches 1 \
--learning_rate 1e-6 \
--per_device_train_batch_size 1 \
--kl_estimator 2 \
--dataset_mixer_list $DATASETS \
--dataset_mixer_list_splits train \
--dataset_mixer_eval_list $LOCAL_EVALS \
--dataset_mixer_eval_list_splits $LOCAL_EVAL_SPLITS \
--max_prompt_token_length 2048 \
--response_length 8192 \
--pack_length 18432 \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--chat_template_name olmo_thinker_rlzero \
--non_stop_penalty False \
--temperature 1.0 \
--total_episodes 256000 \
--deepspeed_stage 2 \
--num_learners_per_node 1 \
--vllm_num_engines 3 \
--vllm_tensor_parallel_size 1 \
--lr_scheduler_type constant \
--apply_verifiable_reward true \
--seed 1 \
--local_eval_every 25 \
--save_freq 100 \
--checkpoint_state_freq 100 \
--gradient_checkpointing \
--with_tracking \
--vllm_enable_prefix_caching \
--clip_higher 0.272 \
--mask_truncated_completions False \
--oe_eval_max_length 32768 \
--try_launch_beaker_eval_jobs_on_weka True \
--eval_priority normal \
--eval_on_step_0 True \
--oe_eval_tasks $EVALS \
--load_ref_policy False \
--oe_eval_gpu_multiplier 2 $@

Severity: medium

This script appears to be an exact copy of scripts/train/qwen/1.5b_rlzero_math.sh. Having identical files creates a maintenance burden, as changes will need to be applied in both places. Consider consolidating them into a single script that can be parameterized for debug vs. full runs, or using a shared configuration file.
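
For illustration only, a minimal sketch of the consolidation suggested above, assuming a single shared launcher with a mode switch; the file name, the MODE variable, and the debug-mode values are hypothetical and not part of this PR:

#!/bin/bash
# Hypothetical shared launcher replacing the two identical scripts.
# MODE selects a quick smoke test ("debug") or the full run ("full").
MODE="${1:-full}"
if [[ $# -gt 0 ]]; then shift; fi

EXP_NAME="qwen1.5distill_rlzero_math"
MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

if [[ "$MODE" == "debug" ]]; then
    TOTAL_EPISODES=1000      # hypothetical small budget for a smoke test
    PRIORITY=low             # hypothetical lower Beaker priority for debug runs
else
    TOTAL_EPISODES=256000    # value used by the full script in this PR
    PRIORITY=normal
fi

echo "exp=${EXP_NAME} model=${MODEL_NAME_OR_PATH} mode=${MODE} episodes=${TOTAL_EPISODES} priority=${PRIORITY}"
# The shared mason.py / grpo_fast.py invocation would go here, passing
# --total_episodes ${TOTAL_EPISODES} and --priority ${PRIORITY} and forwarding "$@".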

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER

BEAKER_USER=$(beaker account whoami --format json | jq -r '.[0].name')
BEAKER_IMAGE="nathanl/open_instruct_auto"

# Check if the first argument starts with the value of $BEAKER_NAME

Severity: medium

The comment refers to $BEAKER_NAME, but this variable is not defined in the script. The condition below it uses $BEAKER_USER. The comment should be updated to refer to $BEAKER_USER for consistency and to avoid confusion.

Suggested change:
- # Check if the first argument starts with the value of $BEAKER_NAME
+ # Check if the first argument starts with the value of $BEAKER_USER
