
fix: resolve three issues encountered during RL training #47

Open
Nyquist24 wants to merge 1 commit into aiming-lab:main from Nyquist24:fix/rl-training-edge-cases

Conversation

@Nyquist24

Summary

I've been running MetaClaw in RL mode (Kimi-K2.5 + GPT-5.2 as PRM). During this time I hit three independent issues that caused silent training failures: no crash, no obvious error, just the model not learning or the process hanging. It took me a while to trace each one, so I'm bundling the fixes here in case others run into the same thing.

  • Fix compute_advantages() returning all-zero advantages for single-sample batches (std=0 edge case)
  • Add timeout to PRMScorer._query_once() to prevent training loop from hanging on unresponsive API
  • Remove hardcoded placeholder API key in SkillEvolver so missing OPENAI_API_KEY raises immediately

Problem

I ran into three separate issues while using MetaClaw for RL training on my personal setup (Kimi-K2.5 + GPT-5.2 as PRM judge, batch_size=4):

1. Single-sample batch produces zero gradient

When I set rl.batch_size to 1 for quick iteration during debugging, I noticed the model weights weren't updating at all. After digging into it, I found that compute_advantages() normalises rewards as
(r - mean) / (std + eps), but with a single sample, std is 0 and every advantage becomes ~0 regardless of the actual reward. The training step runs, datums are formatted, forward_backward_async is
called, but the model learns nothing. The same thing happens if a batch has identical rewards (e.g. all +1).

Fix: When std < 1e-8, fall back to raw (centred) rewards instead of dividing by near-zero std. For single-sample batches, return the reward directly.
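
A minimal standalone sketch of the fallback logic (the real compute_advantages() in metaclaw/data_formatter.py works on the repo's datum structures; this version just illustrates the branching):

```python
import math

def compute_advantages(rewards, eps=1e-8):
    """Normalise rewards into advantages, with a zero-std fallback.

    The old form, (r - mean) / (std + eps), collapses every advantage
    to ~0 when std is 0 (single sample or identical rewards).
    """
    n = len(rewards)
    if n == 1:
        # Single-sample batch: return the reward directly.
        return list(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std < 1e-8:
        # Identical rewards: fall back to raw (centred) rewards
        # instead of dividing by a near-zero std.
        return [r - mean for r in rewards]
    return [(r - mean) / (std + eps) for r in rewards]
```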

2. PRM scorer hangs indefinitely on slow API

My PRM endpoint (proxied through a corporate gateway) occasionally hangs for 5+ minutes without responding. When this happens, _query_once blocks inside asyncio.to_thread() forever, since there is no timeout.
The entire training loop freezes because evaluate() awaits all votes via asyncio.gather, and one hanging vote blocks everything. I had to manually kill the process twice before I traced it to this.

Fix: Wrap the asyncio.to_thread call with asyncio.wait_for(timeout=self.timeout). Default timeout is 120s, configurable via the new timeout parameter. On timeout, the vote returns None (same as
any other failure) and majority voting proceeds with the remaining votes.
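
The wrapping looks roughly like this (a sketch: the synchronous request helper _blocking_request and its return value are stand-ins, only the asyncio.wait_for / asyncio.to_thread pattern mirrors the fix):

```python
import asyncio

class PRMScorer:
    def __init__(self, timeout: float = 120.0):
        # New parameter from this PR; defaults to 120s.
        self.timeout = timeout

    def _blocking_request(self, prompt: str):
        # Placeholder for the real synchronous HTTP call to the PRM.
        return {"score": 1.0}

    async def _query_once(self, prompt: str):
        try:
            # Bound the blocking call so one stuck request cannot
            # freeze the asyncio.gather over all votes.
            return await asyncio.wait_for(
                asyncio.to_thread(self._blocking_request, prompt),
                timeout=self.timeout,
            )
        except asyncio.TimeoutError:
            # Treat a timeout like any other failed vote: majority
            # voting proceeds with the remaining votes.
            return None
```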

3. SkillEvolver silently uses a hardcoded placeholder API key

When I first enabled skill evolution, I forgot to set OPENAI_API_KEY. Instead of getting a clear error, the evolver used a hardcoded fallback key ("aB7cD9eF2gH5iJ8kL1mN4oP6qR3sT0uV") and sent real
requests to the API, which returned cryptic 401 errors buried in the logs. It took me a while to realise the issue was just a missing env var. The if not api_key check on line 90 never triggers because
the default is non-empty.

Fix: Change the default to "" so the existing EnvironmentError on line 91 fires immediately with a clear message.
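
For reference, the corrected lookup reduces to this pattern (a sketch; the exact variable names around lines 90-91 of skill_evolver.py may differ):

```python
import os

def load_api_key() -> str:
    """Fail fast when OPENAI_API_KEY is missing.

    The default is now "" (previously a hardcoded placeholder key),
    so the `if not api_key` check actually triggers.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        raise EnvironmentError(
            "OPENAI_API_KEY is not set; SkillEvolver requires a valid key."
        )
    return api_key
```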

Changes

  • metaclaw/data_formatter.py β€” handle zero-std edge case in compute_advantages()
  • metaclaw/prm_scorer.py β€” add configurable timeout to PRMScorer._query_once()
  • metaclaw/skill_evolver.py β€” remove hardcoded placeholder API key fallback

Test

  • Verified single-sample batch now produces non-zero advantages locally
  • Simulated a PRM timeout by pointing prm_url at a non-responsive endpoint: the training loop no longer hangs; it logs a timeout warning and continues
  • Confirmed missing OPENAI_API_KEY now raises EnvironmentError immediately on SkillEvolver init

1. compute_advantages returns all-zero advantages for single-sample
   batches (std=0 collapses the normalisation); fix by falling back
   to raw rewards

2. PRMScorer._query_once has no timeout, so a hanging API endpoint
   freezes the entire training loop indefinitely; fix by wrapping
   the call in asyncio.wait_for with a configurable timeout
   (default 120s)

3. SkillEvolver falls back to a hardcoded placeholder API key
   instead of raising early when OPENAI_API_KEY is not set; fix by
   making the default empty so the existing check fires