
fix: resolve three issues encountered during RL training #47

Open
Nyquist24 wants to merge 1 commit into aiming-lab:main from Nyquist24:fix/rl-training-edge-cases

Conversation

@Nyquist24

Summary

I've been running MetaClaw in RL mode (Kimi-K2.5 + GPT-5.2 as PRM). During this time I hit three independent issues that caused silent training failures: no crash, no obvious error, just the model not learning or the process hanging. It took me a while to trace each one, so I'm bundling the fixes here in case others run into the same thing.

  • Fix compute_advantages() returning all-zero advantages for single-sample batches (std=0 edge case)
  • Add timeout to PRMScorer._query_once() to prevent training loop from hanging on unresponsive API
  • Remove hardcoded placeholder API key in SkillEvolver so missing OPENAI_API_KEY raises immediately

Problem

I ran into three separate issues while using MetaClaw for RL training on my personal setup (Kimi-K2.5 + GPT-5.2 as PRM judge, batch_size=4):

1. Single-sample batch produces zero gradient

When I set rl.batch_size to 1 for quick iteration during debugging, I noticed the model weights weren't updating at all. After digging into it, I found that compute_advantages() normalises rewards as
(r - mean) / (std + eps), but with a single sample, std is 0 and every advantage becomes ~0 regardless of the actual reward. The training step runs, datums are formatted, forward_backward_async is
called, but the model learns nothing. The same thing happens if a batch has identical rewards (e.g. all +1).

Fix: When std < 1e-8, fall back to raw (centred) rewards instead of dividing by near-zero std. For single-sample batches, return the reward directly.
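
A minimal standalone sketch of the fallback logic (the real compute_advantages() in metaclaw/data_formatter.py works on the repo's datum structures; this version just illustrates the branching):

```python
import math

def compute_advantages(rewards, eps=1e-8):
    """Normalise rewards into advantages, with a zero-std fallback.

    The old form, (r - mean) / (std + eps), collapses every advantage
    to ~0 when std is 0 (single sample or identical rewards).
    """
    n = len(rewards)
    if n == 1:
        # Single-sample batch: return the reward directly.
        return list(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std < 1e-8:
        # Identical rewards: fall back to raw (centred) rewards
        # instead of dividing by a near-zero std.
        return [r - mean for r in rewards]
    return [(r - mean) / (std + eps) for r in rewards]
```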

2. PRM scorer hangs indefinitely on slow API

My PRM endpoint (proxied through a corporate gateway) occasionally hangs for 5+ minutes without responding. When this happens, _query_once blocks inside asyncio.to_thread() forever, since there is no timeout.
The entire training loop freezes because evaluate() awaits all votes via asyncio.gather, and one hanging vote blocks everything. I had to manually kill the process twice before I traced it to this.

Fix: Wrap the asyncio.to_thread call with asyncio.wait_for(timeout=self.timeout). Default timeout is 120s, configurable via the new timeout parameter. On timeout, the vote returns None (same as
any other failure) and majority voting proceeds with the remaining votes.
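
The wrapping looks roughly like this (a sketch: the synchronous request helper _blocking_request and its return value are stand-ins, only the asyncio.wait_for / asyncio.to_thread pattern mirrors the fix):

```python
import asyncio

class PRMScorer:
    def __init__(self, timeout: float = 120.0):
        # New parameter from this PR; defaults to 120s.
        self.timeout = timeout

    def _blocking_request(self, prompt: str):
        # Placeholder for the real synchronous HTTP call to the PRM.
        return {"score": 1.0}

    async def _query_once(self, prompt: str):
        try:
            # Bound the blocking call so one stuck request cannot
            # freeze the asyncio.gather over all votes.
            return await asyncio.wait_for(
                asyncio.to_thread(self._blocking_request, prompt),
                timeout=self.timeout,
            )
        except asyncio.TimeoutError:
            # Treat a timeout like any other failed vote: majority
            # voting proceeds with the remaining votes.
            return None
```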

3. SkillEvolver silently uses a hardcoded placeholder API key

When I first enabled skill evolution, I forgot to set OPENAI_API_KEY. Instead of getting a clear error, the evolver used a hardcoded fallback key ("aB7cD9eF2gH5iJ8kL1mN4oP6qR3sT0uV") and sent real
requests to the API, which returned cryptic 401 errors buried in the logs. It took me a while to realise the issue was just a missing env var. The if not api_key check on line 90 never triggers because
the default is non-empty.

Fix: Change the default to "" so the existing EnvironmentError on line 91 fires immediately with a clear message.
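
For reference, the corrected lookup reduces to this pattern (a sketch; the exact variable names around lines 90-91 of skill_evolver.py may differ):

```python
import os

def load_api_key() -> str:
    """Fail fast when OPENAI_API_KEY is missing.

    The default is now "" (previously a hardcoded placeholder key),
    so the `if not api_key` check actually triggers.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        raise EnvironmentError(
            "OPENAI_API_KEY is not set; SkillEvolver requires a valid key."
        )
    return api_key
```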

Changes

  • metaclaw/data_formatter.py β€” handle zero-std edge case in compute_advantages()
  • metaclaw/prm_scorer.py β€” add configurable timeout to PRMScorer._query_once()
  • metaclaw/skill_evolver.py β€” remove hardcoded placeholder API key fallback

Test

  • Verified single-sample batch now produces non-zero advantages locally
  • Simulated a PRM timeout by pointing prm_url at a non-responsive endpoint: the training loop no longer hangs; it logs a timeout warning and continues
  • Confirmed missing OPENAI_API_KEY now raises EnvironmentError immediately on SkillEvolver init

1. compute_advantages returns all-zero advantages for single-sample
   batches (std=0 collapses the normalisation); fix by falling back
   to raw rewards

2. PRMScorer._query_once has no timeout, so a hanging API endpoint
   freezes the entire training loop indefinitely; fix by wrapping
   the call in asyncio.wait_for with a configurable timeout
   (default 120s)

3. SkillEvolver falls back to a hardcoded placeholder API key
   instead of raising early when OPENAI_API_KEY is not set; fix by
   making the default empty so the existing check fires