@romannekrasovaillm

PR Type

  • RL Environment PR - Complete Environment Snapshot & Zero-Training sections
  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Related Issues

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Code refactor (no functional changes)
  • Build/CI/CD related changes
  • Other (please describe):

🔖 Environment Snapshot

| Field | Your Entry |
| --- | --- |
| Environment Name | |
| Short Description | |
| Category | |
| Dataset Needed? | |
| External Deps | |
| Environment Variables | |
| Compute Footprint Estimate | |

🧪 Zero-Training Test Results

Details

W&B Link:

Examples of the Environment scoring a good example and a bad example:


✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars are required, did you add them to the .env.example in the repo root?

claude and others added 30 commits January 10, 2026 05:53
- Add OllamaServer class with native /api/chat endpoint for logprobs
- Add 'ollama' server type to ServerManager and ServerBaseline
- Create code_agent_traces environment for generating agent traces
- Include test script for validating the Ollama pipeline
- Add README with usage instructions

The Ollama integration uses the native API instead of OpenAI-compatible
endpoint to properly extract logprobs for RL training.
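
Roughly, the native-endpoint call might look like the sketch below; the logprobs option and response fields are assumptions that depend on the Ollama version and the OllamaServer class in this PR, not a confirmed API.

```python
import requests

# Hedged sketch: query Ollama's native /api/chat endpoint directly.
# The logprobs flag and response shape are assumptions, not confirmed API.
def chat_with_logprobs(model, messages, base_url="http://localhost:11434"):
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "logprobs": True,  # assumption: server returns per-token logprobs when set
    }
    resp = requests.post(f"{base_url}/api/chat", json=payload, timeout=120)
    resp.raise_for_status()
    data = resp.json()
    return data["message"]["content"], data.get("logprobs")
```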
- Configure default to use Ollama Cloud (https://ollama.com) with DeepSeek V3.2
- Add local_executor.py for sandboxed code execution without Modal
- Update agent_trace_env to fallback to local executor when Modal unavailable
- Update test script with local executor tests and correct defaults
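
A local fallback executor along these lines is one plausible shape; the function name and limits below are illustrative, not the actual local_executor.py interface.

```python
import subprocess
import sys
import tempfile

# Illustrative local fallback: run candidate code in a separate Python process
# with a wall-clock timeout (not the actual local_executor.py API).
def run_locally(code: str, timeout: float = 10.0) -> dict:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores user site-packages
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "returncode": -1}
```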
Implement interleaved reasoning structure for agent traces:
- PLANNING: Problem analysis and approach design
- ACTION: Code generation with Python solution
- REFLECTION: Result review and iteration decision

New files:
- structured_agent_env.py: Full Atropos environment with structured reasoning
- run_structured_pipeline.py: Standalone script for testing the pipeline

The agent iterates up to max_iterations times until it succeeds or the limit is reached.
Each trace captures the full reasoning chain for RL training.
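
The iteration loop described above could be structured roughly as follows; generate(), execute(), and the dictionary fields are placeholders rather than the real structured_agent_env.py interfaces.

```python
# Placeholder sketch of the PLANNING -> ACTION -> REFLECTION loop;
# generate() and execute() stand in for the real model and executor calls.
def run_structured_trace(problem, generate, execute, max_iterations=3):
    trace, success = [], False
    for _ in range(max_iterations):
        plan = generate("PLANNING: analyze the problem and design an approach.\n" + problem)
        code = generate("ACTION: write a Python solution.\nPlan:\n" + plan)
        result = execute(code)
        reflection = generate("REFLECTION: review the result and decide whether to iterate.\n" + str(result))
        trace.append({"plan": plan, "code": code, "result": result, "reflection": reflection})
        if result.get("passed"):  # assumed field produced by the executor
            success = True
            break
    return {"trace": trace, "success": success, "iterations": len(trace)}
```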
Re-write the file with correct Python indentation that was broken
during copy-paste operations.
- Add detailed test result output showing expected vs actual values
- Add adversarial test cases for edge cases:
  - two_sum: negative numbers, duplicates
  - is_palindrome: punctuation-only, case sensitivity
  - max_subarray: all negative arrays
- Show test pass/fail counts in execution status
- Add test summary in final output
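
The added edge cases might look something like this; the exact values shipped in the environment may differ.

```python
# Illustrative adversarial cases in the spirit of the commit above;
# the values actually used by the environment may differ.
ADVERSARIAL_TESTS = {
    "two_sum": [
        {"args": ([-3, 4, 3, 90], 0), "expected": [0, 2]},  # negative numbers
        {"args": ([3, 3], 6), "expected": [0, 1]},          # duplicates
    ],
    "is_palindrome": [
        {"args": (".,!?",), "expected": True},              # punctuation-only input
        {"args": ("Noon",), "expected": True},              # case sensitivity
    ],
    "max_subarray": [
        {"args": ([-5, -1, -8],), "expected": -1},          # all-negative array
    ],
}
```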
Unlike the structured Planning-Action-Reflection pipeline, this agent
interleaves reasoning with code generation:

- [THINK] marker before each code block
- [WAIT] for catching bugs during reasoning
- [VERIFY] for tracing through solution
- Preserves indentation when parsing code blocks
- Falls back to markdown code blocks if needed

This architecture catches bugs during generation, not after.
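
The parsing idea can be sketched as follows: extract marked code blocks without re-indenting them, and fall back to fenced markdown blocks when no markers are present. The marker and regex details are illustrative.

```python
import re

# Illustrative parser: pull code out of [CODE]...[/CODE] blocks without
# touching leading whitespace, falling back to markdown fences if needed.
CODE_BLOCK = re.compile(r"\[CODE\]\n?(.*?)\[/CODE\]", re.DOTALL)
MD_BLOCK = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code_blocks(text: str) -> list[str]:
    blocks = [m.group(1) for m in CODE_BLOCK.finditer(text)]
    if not blocks:  # fallback: markdown code fences
        blocks = [m.group(1) for m in MD_BLOCK.finditer(text)]
    # Only trim trailing whitespace so indentation survives intact.
    return [b.rstrip() for b in blocks]
```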
Full Atropos environment integration featuring:
- Extends BaseEnv with InterleavedCodeEnvConfig
- Interleaved reasoning with [THINK]/[CODE]/[VERIFY] markers
- Local code execution with test verification
- Reward calculation based on test pass rate + structure bonuses
- Supports HumanEval dataset or built-in problems
- WandB metrics for think_count, verify_rate, accuracy
- CLI support: serve, process, evaluate commands

Usage:
  python interleaved_code_env.py serve --config config.yaml
  python interleaved_code_env.py process --env.total_steps 100
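
The reward combination mentioned above (test pass rate plus structure bonuses) could be shaped roughly like this; the bonus weights are invented for illustration and are not the environment's actual values.

```python
# Illustrative reward shaping: base reward from test pass rate plus small
# bonuses for using the reasoning structure. Weights here are invented.
def compute_reward(passed: int, total: int, think_count: int, has_verify: bool) -> float:
    pass_rate = passed / total if total else 0.0
    structure_bonus = 0.0
    if think_count >= 2:   # reward granular interleaved thinking
        structure_bonus += 0.1
    if has_verify:         # reward an explicit verification step
        structure_bonus += 0.05
    return min(1.0, pass_rate + structure_bonus)
```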
Two modes now available:
1. trace_generator.py - Standalone JSONL generation for fine-tuning
2. interleaved_code_env.py - Atropos RL environment

Trace generator features:
- Built-in coding problems (8 LeetCode-style)
- Configurable num traces, temperature
- --only-success filter for quality data
- --chat-format for simple fine-tuning format
- Full trace with [THINK]/[CODE]/[VERIFY] steps

Updated README with usage for both modes.
- Add --force-interleave flag for multi-turn conversation approach
- Model is prompted one step at a time, forced to stop after [/CODE]
- Re-prompts ensure granular reasoning instead of monolithic responses
- Add FORCED_INTERLEAVE_SYSTEM prompt for strict one-step output
- Track code block count to measure interleaving quality
- Update README with new mode documentation

Fixes issue where DeepSeek-v3.2 outputs Plan→Code→Verify instead of
granular interleaved Think→Code→Think→Code→...→Verify format.
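
Very roughly, forcing one step at a time amounts to stopping generation at the closing code tag and re-prompting; the sketch below assumes an OpenAI-compatible chat client and placeholder prompts, not the script's actual code.

```python
# Rough sketch of the forced-interleave loop: stop generation at [/CODE]
# and re-prompt so each turn contains a single think/code step.
# Assumes an OpenAI-compatible client; prompts are placeholders.
def forced_interleave(client, model, messages, max_steps=8):
    code_block_count = 0
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages, stop=["[/CODE]"],
        )
        chunk = resp.choices[0].message.content or ""
        if "[CODE]" in chunk:
            chunk += "\n[/CODE]"   # the stop string is not returned, so close the block
            code_block_count += 1  # used to measure interleaving quality
        messages.append({"role": "assistant", "content": chunk})
        if "[VERIFY]" in chunk:    # model signalled it is finished
            break
        messages.append({"role": "user", "content": "Continue with the next single step."})
    return messages, code_block_count
```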
- New trace_generator_tools.py: multi-turn generation with code execution
- Model receives [RESULT]/[ERROR] feedback and can iterate to fix bugs
- Creates richer training data with error-recovery patterns
- Tracks code_iterations and had_errors metrics
- Supports --training-format for single-message output
- Update README with documentation for all three modes

Key differences from marker-based approach:
- Tool-based: Think→Code→Result→Think→Fix→Result→Verify
- Marker-based: Think→Code→Verify (no execution feedback)
- Add stop sequences ["[RESULT]", "[ERROR]"] to prevent model from
  generating fake execution results
- Add _strip_hallucinated_results() to remove any that slip through
- Update system prompt to explicitly forbid [RESULT]/[ERROR] output
- Clarify that these markers come from SYSTEM only
- Update initial prompt with clear "STOP after [/CODE]" instruction

Fixes issue where DeepSeek would generate entire conversation including
fake test results instead of waiting for actual code execution.
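
A sketch of the two guards: stop sequences so generation halts before a fabricated result, plus a post-hoc strip for anything that still slips through. The helper name mirrors the commit, but its body here is an assumption.

```python
import re

STOP_SEQUENCES = ["[RESULT]", "[ERROR]"]  # halt generation before fake execution output

# Assumed shape of the helper named in the commit: cut the response at the
# first model-generated [RESULT]/[ERROR], since those come from the SYSTEM only.
def _strip_hallucinated_results(text: str) -> str:
    return re.split(r"\[(?:RESULT|ERROR)\]", text, maxsplit=1)[0].rstrip()
```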
New trace_generator_interleaved_tools.py combines BOTH dimensions:
- TRUE interleaving: Think→Code→Think→Code (1-3 lines per block)
- REAL tool execution: [RESULT]/[ERROR] feedback from actual code runs

Key features:
- Forces one step at a time via strict stop sequences
- Accumulates code blocks incrementally
- Executes when function looks complete
- Allows fix attempts on failures
- --only-ideal flag to keep only best traces
- Quality metrics: think_count, code_block_count, is_ideal

This produces the optimal training data with granular reasoning
AND real execution feedback, solving the orthogonal dimensions:

           Interleaved
               ↑
    [forced] ──┼── [THIS] ★ IDEAL
               │
    ───────────┼────────── Tool Use
               │
    [default] ─┼── [tools.py]
               ↓
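
The quality gate behind --only-ideal might look roughly like this; the thresholds are invented for illustration.

```python
# Illustrative --only-ideal filter: keep traces that show granular
# interleaving AND real execution feedback. Thresholds are invented.
def is_ideal(trace: dict) -> bool:
    return (
        trace.get("success", False)
        and trace.get("think_count", 0) >= 3        # granular reasoning
        and trace.get("code_block_count", 0) >= 2   # incremental code blocks
        and trace.get("tool_result_count", 0) >= 1  # at least one real execution
    )
```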
Replace [THINK], [CODE], [VERIFY], [RESULT], [ERROR] with XML format:
- <think>...</think>
- <code>...</code>
- <verify>...</verify>
- <result>...</result>
- <error>...</error>

Updated in all three trace generators:
- trace_generator.py
- trace_generator_tools.py
- trace_generator_interleaved_tools.py

Also updated:
- System prompts and examples
- Parsing regex patterns
- Output formatting
- Stop sequences
- README documentation

XML tags are cleaner, more standard, and easier to parse.
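
After the migration, the parsing side presumably reduces to plain XML-style tag patterns; an illustrative helper (the generators' actual regexes may differ):

```python
import re

# Illustrative patterns after the marker -> XML migration.
TAG_RE = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("think", "code", "verify", "result", "error")
}

def extract_tag(text: str, name: str) -> list[str]:
    return [m.group(1).strip("\n") for m in TAG_RE[name].finditer(text)]
```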
New trace_generator_inline_tools.py uses RL-style inline tool calls
inside <think> blocks, matching the pattern from tool_use_interleaved_thinking.py:
- Tool calls via <tool_call>{JSON}</tool_call> inside open <think> block
- System executes tool and injects <tool_response>{JSON}</tool_response>
- Model continues reasoning after seeing execution results
- Supports --only-ideal, --only-success, --training-format flags
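
The inline pattern can be sketched as: detect a <tool_call> while the <think> block is still open, run it, and append a <tool_response> so the model keeps reasoning. The JSON field name and executor below are assumptions, not the generator's exact schema.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Sketch of the inline-tool step: execute the call found inside the open
# <think> block and return a <tool_response> to inject. The "code" field
# and run_code() executor are assumptions.
def handle_inline_tool_call(assistant_text: str, run_code):
    m = TOOL_CALL_RE.search(assistant_text)
    if not m:
        return None
    call = json.loads(m.group(1))
    result = run_code(call.get("code", ""))
    return "<tool_response>" + json.dumps(result) + "</tool_response>"
```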
claude and others added 8 commits January 11, 2026 10:49
… sequences

- _extract_tests_from_prompt: no longer returns ellipsis placeholder, also extracts from docstrings
- parse_tool_call: new _fix_json_newlines() to escape literal newlines in JSON strings
- Removed stop sequences that were cutting off code mid-generation
- Better detection of tool_call tags in response
- Method 1: Direct JSON parse
- Method 2: Fix literal newlines in strings
- Method 3: Extract code value directly from malformed JSON
- Added debug logging to parse_tool_call
- Handle \r escape sequences
When tool_response contains an error or partial test failure,
inject continuation hints to encourage model to keep reasoning:
- On error: "I see there's an error: <msg>\nLet me analyze and fix..."
- On partial: "I got X/Y tests passing. Let me fix the failing cases."
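
The injected hints might be built along these lines; the wording follows the commit, while the surrounding plumbing is assumed.

```python
# Build the continuation hint injected after a tool_response; wording follows
# the commit above, the result-dict fields are assumptions.
def continuation_hint(result: dict):
    if result.get("error"):
        return f"I see there's an error: {result['error']}\nLet me analyze and fix..."
    passed, total = result.get("passed", 0), result.get("total", 0)
    if total and passed < total:
        return f"I got {passed}/{total} tests passing. Let me fix the failing cases."
    return None
```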
Added clear guidance with correct/incorrect examples:
- Use SINGLE escape (\n) for newlines in code
- Never use double (\\n) or quadruple (\\\\n)
- Keep JSON on single line
Previous parser failed on 8.5% of tool_calls. New approach:
- Strategy 1: Direct json.loads() for valid JSON
- Strategy 2: Simple newline fix (\n → \\n) for real newlines
- Strategy 3: Regex extraction for docstrings and complex cases

Expected improvement: 26.5% → 99.5% parse success rate
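
The three strategies could be layered roughly like this; only the general shape is implied by the commits above, and the details are illustrative.

```python
import json
import re

# Layered parse of the JSON inside <tool_call>...</tool_call>, mirroring the
# three strategies listed above; details are illustrative.
def parse_tool_call(raw: str):
    # Strategy 1: the JSON is already valid.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 2: escape real newlines/carriage returns (assumes the JSON is
    # otherwise on a single line, as the prompt instructs).
    fixed = raw.replace("\r", "\\r").replace("\n", "\\n")
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        pass
    # Strategy 3: pull the "code" value straight out of the malformed JSON.
    m = re.search(r'"code"\s*:\s*"(.*)"\s*\}?\s*$', raw, re.DOTALL)
    if m:
        return {"code": m.group(1).encode().decode("unicode_escape")}
    return None
```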