@romannekrasovaillm

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Related Issues

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Code refactor (no functional changes)
  • Build/CI/CD related changes
  • Other (please describe):

🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | |
| Short Description | |
| Category | |
| Dataset Needed? | |
| External Deps | |
| Environment Variables | |
| Compute Footprint Estimate | |

🧪 Zero-Training Test Results

Details

W&B Link:

Examples of the Environment scoring a good example and a bad example:


✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars are required, did you add them to the .env.example in the repo root?

claude and others added 30 commits January 10, 2026 05:53
- Add OllamaServer class with native /api/chat endpoint for logprobs
- Add 'ollama' server type to ServerManager and ServerBaseline
- Create code_agent_traces environment for generating agent traces
- Include test script for validating the Ollama pipeline
- Add README with usage instructions

The Ollama integration uses the native API instead of OpenAI-compatible
endpoint to properly extract logprobs for RL training.
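
Roughly, the native-endpoint call might look like the sketch below; the logprobs option and response fields are assumptions that depend on the Ollama version and the OllamaServer class in this PR, not a confirmed API.

```python
import requests

# Hedged sketch: query Ollama's native /api/chat endpoint directly.
# The logprobs flag and response shape are assumptions, not confirmed API.
def chat_with_logprobs(model, messages, base_url="http://localhost:11434"):
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "logprobs": True,  # assumption: server returns per-token logprobs when set
    }
    resp = requests.post(f"{base_url}/api/chat", json=payload, timeout=120)
    resp.raise_for_status()
    data = resp.json()
    return data["message"]["content"], data.get("logprobs")
```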
- Configure default to use Ollama Cloud (https://ollama.com) with DeepSeek V3.2
- Add local_executor.py for sandboxed code execution without Modal
- Update agent_trace_env to fallback to local executor when Modal unavailable
- Update test script with local executor tests and correct defaults
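
A local fallback executor along these lines is one plausible shape; the function name and limits below are illustrative, not the actual local_executor.py interface.

```python
import subprocess
import sys
import tempfile

# Illustrative local fallback: run candidate code in a separate Python process
# with a wall-clock timeout (not the actual local_executor.py API).
def run_locally(code: str, timeout: float = 10.0) -> dict:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores user site-packages
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "returncode": -1}
```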
Implement interleaved reasoning structure for agent traces:
- PLANNING: Problem analysis and approach design
- ACTION: Code generation with Python solution
- REFLECTION: Result review and iteration decision

New files:
- structured_agent_env.py: Full Atropos environment with structured reasoning
- run_structured_pipeline.py: Standalone script for testing the pipeline

The agent iterates up to max_iterations times until it succeeds or the limit is reached.
Each trace captures the full reasoning chain for RL training.
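
The iteration loop described above could be structured roughly as follows; generate(), execute(), and the dictionary fields are placeholders rather than the real structured_agent_env.py interfaces.

```python
# Placeholder sketch of the PLANNING -> ACTION -> REFLECTION loop;
# generate() and execute() stand in for the real model and executor calls.
def run_structured_trace(problem, generate, execute, max_iterations=3):
    trace, success = [], False
    for _ in range(max_iterations):
        plan = generate("PLANNING: analyze the problem and design an approach.\n" + problem)
        code = generate("ACTION: write a Python solution.\nPlan:\n" + plan)
        result = execute(code)
        reflection = generate("REFLECTION: review the result and decide whether to iterate.\n" + str(result))
        trace.append({"plan": plan, "code": code, "result": result, "reflection": reflection})
        if result.get("passed"):  # assumed field produced by the executor
            success = True
            break
    return {"trace": trace, "success": success, "iterations": len(trace)}
```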
Re-write the file with correct Python indentation that was broken
during copy-paste operations.
- Add detailed test result output showing expected vs actual values
- Add adversarial test cases for edge cases:
  - two_sum: negative numbers, duplicates
  - is_palindrome: punctuation-only, case sensitivity
  - max_subarray: all negative arrays
- Show test pass/fail counts in execution status
- Add test summary in final output
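
The added edge cases might look something like this; the exact values shipped in the environment may differ.

```python
# Illustrative adversarial cases in the spirit of the commit above;
# the values actually used by the environment may differ.
ADVERSARIAL_TESTS = {
    "two_sum": [
        {"args": ([-3, 4, 3, 90], 0), "expected": [0, 2]},  # negative numbers
        {"args": ([3, 3], 6), "expected": [0, 1]},          # duplicates
    ],
    "is_palindrome": [
        {"args": (".,!?",), "expected": True},              # punctuation-only input
        {"args": ("Noon",), "expected": True},              # case sensitivity
    ],
    "max_subarray": [
        {"args": ([-5, -1, -8],), "expected": -1},          # all-negative array
    ],
}
```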
Unlike the structured Planning-Action-Reflection pipeline, this agent
interleaves reasoning with code generation:

- [THINK] marker before each code block
- [WAIT] for catching bugs during reasoning
- [VERIFY] for tracing through solution
- Preserves indentation when parsing code blocks
- Falls back to markdown code blocks if needed

This architecture catches bugs during generation, not after.
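
The parsing idea can be sketched as follows: extract marked code blocks without re-indenting them, and fall back to fenced markdown blocks when no markers are present. The marker and regex details are illustrative.

```python
import re

# Illustrative parser: pull code out of [CODE]...[/CODE] blocks without
# touching leading whitespace, falling back to markdown fences if needed.
CODE_BLOCK = re.compile(r"\[CODE\]\n?(.*?)\[/CODE\]", re.DOTALL)
MD_BLOCK = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code_blocks(text: str) -> list[str]:
    blocks = [m.group(1) for m in CODE_BLOCK.finditer(text)]
    if not blocks:  # fallback: markdown code fences
        blocks = [m.group(1) for m in MD_BLOCK.finditer(text)]
    # Only trim trailing whitespace so indentation survives intact.
    return [b.rstrip() for b in blocks]
```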
Full Atropos environment integration featuring:
- Extends BaseEnv with InterleavedCodeEnvConfig
- Interleaved reasoning with [THINK]/[CODE]/[VERIFY] markers
- Local code execution with test verification
- Reward calculation based on test pass rate + structure bonuses
- Supports HumanEval dataset or built-in problems
- WandB metrics for think_count, verify_rate, accuracy
- CLI support: serve, process, evaluate commands

Usage:
  python interleaved_code_env.py serve --config config.yaml
  python interleaved_code_env.py process --env.total_steps 100
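
The reward combination mentioned above (test pass rate plus structure bonuses) could be shaped roughly like this; the bonus weights are invented for illustration and are not the environment's actual values.

```python
# Illustrative reward shaping: base reward from test pass rate plus small
# bonuses for using the reasoning structure. Weights here are invented.
def compute_reward(passed: int, total: int, think_count: int, has_verify: bool) -> float:
    pass_rate = passed / total if total else 0.0
    structure_bonus = 0.0
    if think_count >= 2:   # reward granular interleaved thinking
        structure_bonus += 0.1
    if has_verify:         # reward an explicit verification step
        structure_bonus += 0.05
    return min(1.0, pass_rate + structure_bonus)
```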
Two modes now available:
1. trace_generator.py - Standalone JSONL generation for fine-tuning
2. interleaved_code_env.py - Atropos RL environment

Trace generator features:
- Built-in coding problems (8 LeetCode-style)
- Configurable num traces, temperature
- --only-success filter for quality data
- --chat-format for simple fine-tuning format
- Full trace with [THINK]/[CODE]/[VERIFY] steps

Updated README with usage for both modes.
- Add --force-interleave flag for multi-turn conversation approach
- Model is prompted one step at a time, forced to stop after [/CODE]
- Re-prompts ensure granular reasoning instead of monolithic responses
- Add FORCED_INTERLEAVE_SYSTEM prompt for strict one-step output
- Track code block count to measure interleaving quality
- Update README with new mode documentation

Fixes issue where DeepSeek-v3.2 outputs Plan→Code→Verify instead of
granular interleaved Think→Code→Think→Code→...→Verify format.
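
Very roughly, forcing one step at a time amounts to stopping generation at the closing code tag and re-prompting; the sketch below assumes an OpenAI-compatible chat client and placeholder prompts, not the script's actual code.

```python
# Rough sketch of the forced-interleave loop: stop generation at [/CODE]
# and re-prompt so each turn contains a single think/code step.
# Assumes an OpenAI-compatible client; prompts are placeholders.
def forced_interleave(client, model, messages, max_steps=8):
    code_block_count = 0
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages, stop=["[/CODE]"],
        )
        chunk = resp.choices[0].message.content or ""
        if "[CODE]" in chunk:
            chunk += "\n[/CODE]"   # the stop string is not returned, so close the block
            code_block_count += 1  # used to measure interleaving quality
        messages.append({"role": "assistant", "content": chunk})
        if "[VERIFY]" in chunk:    # model signalled it is finished
            break
        messages.append({"role": "user", "content": "Continue with the next single step."})
    return messages, code_block_count
```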
- New trace_generator_tools.py: multi-turn generation with code execution
- Model receives [RESULT]/[ERROR] feedback and can iterate to fix bugs
- Creates richer training data with error-recovery patterns
- Tracks code_iterations and had_errors metrics
- Supports --training-format for single-message output
- Update README with documentation for all three modes

Key differences from marker-based approach:
- Tool-based: Think→Code→Result→Think→Fix→Result→Verify
- Marker-based: Think→Code→Verify (no execution feedback)
- Add stop sequences ["[RESULT]", "[ERROR]"] to prevent model from
  generating fake execution results
- Add _strip_hallucinated_results() to remove any that slip through
- Update system prompt to explicitly forbid [RESULT]/[ERROR] output
- Clarify that these markers come from SYSTEM only
- Update initial prompt with clear "STOP after [/CODE]" instruction

Fixes issue where DeepSeek would generate entire conversation including
fake test results instead of waiting for actual code execution.
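
A sketch of the two guards: stop sequences so generation halts before a fabricated result, plus a post-hoc strip for anything that still slips through. The helper name mirrors the commit, but its body here is an assumption.

```python
import re

STOP_SEQUENCES = ["[RESULT]", "[ERROR]"]  # halt generation before fake execution output

# Assumed shape of the helper named in the commit: cut the response at the
# first model-generated [RESULT]/[ERROR], since those come from the SYSTEM only.
def _strip_hallucinated_results(text: str) -> str:
    return re.split(r"\[(?:RESULT|ERROR)\]", text, maxsplit=1)[0].rstrip()
```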
New trace_generator_interleaved_tools.py combines BOTH dimensions:
- TRUE interleaving: Think→Code→Think→Code (1-3 lines per block)
- REAL tool execution: [RESULT]/[ERROR] feedback from actual code runs

Key features:
- Forces one step at a time via strict stop sequences
- Accumulates code blocks incrementally
- Executes when function looks complete
- Allows fix attempts on failures
- --only-ideal flag to keep only best traces
- Quality metrics: think_count, code_block_count, is_ideal

This produces the optimal training data with granular reasoning
AND real execution feedback, solving the orthogonal dimensions:

           Interleaved
               ↑
    [forced] ──┼── [THIS] ★ IDEAL
               │
    ───────────┼────────── Tool Use
               │
    [default] ─┼── [tools.py]
               ↓
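
The quality gate behind --only-ideal might look roughly like this; the thresholds are invented for illustration.

```python
# Illustrative --only-ideal filter: keep traces that show granular
# interleaving AND real execution feedback. Thresholds are invented.
def is_ideal(trace: dict) -> bool:
    return (
        trace.get("success", False)
        and trace.get("think_count", 0) >= 3        # granular reasoning
        and trace.get("code_block_count", 0) >= 2   # incremental code blocks
        and trace.get("tool_result_count", 0) >= 1  # at least one real execution
    )
```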
Replace [THINK], [CODE], [VERIFY], [RESULT], [ERROR] with XML format:
- <think>...</think>
- <code>...</code>
- <verify>...</verify>
- <result>...</result>
- <error>...</error>

Updated in all three trace generators:
- trace_generator.py
- trace_generator_tools.py
- trace_generator_interleaved_tools.py

Also updated:
- System prompts and examples
- Parsing regex patterns
- Output formatting
- Stop sequences
- README documentation

XML tags are cleaner, more standard, and easier to parse.
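
After the migration, the parsing side presumably reduces to plain XML-style tag patterns; an illustrative helper (the generators' actual regexes may differ):

```python
import re

# Illustrative patterns after the marker -> XML migration.
TAG_RE = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("think", "code", "verify", "result", "error")
}

def extract_tag(text: str, name: str) -> list[str]:
    return [m.group(1).strip("\n") for m in TAG_RE[name].finditer(text)]
```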
New trace_generator_inline_tools.py uses RL-style inline tool calls
inside <think> blocks, matching the pattern from tool_use_interleaved_thinking.py:
- Tool calls via <tool_call>{JSON}</tool_call> inside open <think> block
- System executes tool and injects <tool_response>{JSON}</tool_response>
- Model continues reasoning after seeing execution results
- Supports --only-ideal, --only-success, --training-format flags
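
The inline pattern can be sketched as: detect a <tool_call> while the <think> block is still open, run it, and append a <tool_response> so the model keeps reasoning. The JSON field name and executor below are assumptions, not the generator's exact schema.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Sketch of the inline-tool step: execute the call found inside the open
# <think> block and return a <tool_response> to inject. The "code" field
# and run_code() executor are assumptions.
def handle_inline_tool_call(assistant_text: str, run_code):
    m = TOOL_CALL_RE.search(assistant_text)
    if not m:
        return None
    call = json.loads(m.group(1))
    result = run_code(call.get("code", ""))
    return "<tool_response>" + json.dumps(result) + "</tool_response>"
```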
claude and others added 8 commits January 11, 2026 10:49
… sequences

- _extract_tests_from_prompt: no longer returns ellipsis placeholder, also extracts from docstrings
- parse_tool_call: new _fix_json_newlines() to escape literal newlines in JSON strings
- Removed stop sequences that were cutting off code mid-generation
- Better detection of tool_call tags in response
- Method 1: Direct JSON parse
- Method 2: Fix literal newlines in strings
- Method 3: Extract code value directly from malformed JSON
- Added debug logging to parse_tool_call
- Handle \r escape sequences
When tool_response contains an error or partial test failure,
inject continuation hints to encourage model to keep reasoning:
- On error: "I see there's an error: <msg>\nLet me analyze and fix..."
- On partial: "I got X/Y tests passing. Let me fix the failing cases."
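
The injected hints might be built along these lines; the wording follows the commit, while the surrounding plumbing is assumed.

```python
# Build the continuation hint injected after a tool_response; wording follows
# the commit above, the result-dict fields are assumptions.
def continuation_hint(result: dict):
    if result.get("error"):
        return f"I see there's an error: {result['error']}\nLet me analyze and fix..."
    passed, total = result.get("passed", 0), result.get("total", 0)
    if total and passed < total:
        return f"I got {passed}/{total} tests passing. Let me fix the failing cases."
    return None
```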
Added clear guidance with correct/incorrect examples:
- Use SINGLE escape (\n) for newlines in code
- Never use double (\\n) or quadruple (\\\\n)
- Keep JSON on single line
Previous parser failed on 8.5% of tool_calls. New approach:
- Strategy 1: Direct json.loads() for valid JSON
- Strategy 2: Simple newline fix (\n → \\n) for real newlines
- Strategy 3: Regex extraction for docstrings and complex cases

Expected improvement: 26.5% → 99.5% parse success rate
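
The three strategies could be layered roughly like this; only the general shape is implied by the commits above, and the details are illustrative.

```python
import json
import re

# Layered parse of the JSON inside <tool_call>...</tool_call>, mirroring the
# three strategies listed above; details are illustrative.
def parse_tool_call(raw: str):
    # Strategy 1: the JSON is already valid.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 2: escape real newlines/carriage returns (assumes the JSON is
    # otherwise on a single line, as the prompt instructs).
    fixed = raw.replace("\r", "\\r").replace("\n", "\\n")
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        pass
    # Strategy 3: pull the "code" value straight out of the malformed JSON.
    m = re.search(r'"code"\s*:\s*"(.*)"\s*\}?\s*$', raw, re.DOTALL)
    if m:
        return {"code": m.group(1).encode().decode("unicode_escape")}
    return None
```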