Claude/add ollama logprobs support eu qxg #312
Open
romannekrasovaillm wants to merge 38 commits into NousResearch:main from romannekrasovaillm:claude/add-ollama-logprobs-support-EuQxg
+8,710 −1
Conversation
- Add OllamaServer class with native /api/chat endpoint for logprobs
- Add 'ollama' server type to ServerManager and ServerBaseline
- Create code_agent_traces environment for generating agent traces
- Include test script for validating the Ollama pipeline
- Add README with usage instructions

The Ollama integration uses the native API instead of the OpenAI-compatible endpoint to properly extract logprobs for RL training.
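A minimal sketch of such a call against the native endpoint, assuming the server accepts a `logprobs` flag and returns per-token log probabilities in its JSON response; the exact request and response field names are an assumption, not confirmed Ollama API:

```python
import requests

def chat_with_logprobs(model: str, messages: list[dict],
                       base_url: str = "http://localhost:11434") -> dict:
    """Call Ollama's native /api/chat (not the OpenAI-compatible /v1 route)."""
    resp = requests.post(
        f"{base_url}/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False,
            "logprobs": True,  # assumed request flag; see commit description
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()  # expected to carry the message plus token logprobs
```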
- Configure default to use Ollama Cloud (https://ollama.com) with DeepSeek V3.2
- Add local_executor.py for sandboxed code execution without Modal
- Update agent_trace_env to fall back to the local executor when Modal is unavailable
- Update test script with local executor tests and correct defaults
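A possible shape for a local_executor-style sandbox, assuming a subprocess with a timeout stands in for Modal; the function name and return schema here are illustrative, not the PR's actual interface:

```python
import os
import subprocess
import sys
import tempfile

def run_locally(code: str, timeout: float = 10.0) -> dict:
    """Write the code to a temp file and run it with the current interpreter."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "success": proc.returncode == 0}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "success": False}
    finally:
        os.unlink(path)  # always clean up the temp file
```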
Implement interleaved reasoning structure for agent traces:
- PLANNING: problem analysis and approach design
- ACTION: code generation with a Python solution
- REFLECTION: result review and iteration decision

New files:
- structured_agent_env.py: full Atropos environment with structured reasoning
- run_structured_pipeline.py: standalone script for testing the pipeline

The agent iterates up to max_iterations times until it succeeds or hits the limit. Each trace captures the full reasoning chain for RL training.
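An illustrative skeleton of that loop; the `generate` and `execute` callables are hypothetical stand-ins for the environment's LLM and executor calls:

```python
from typing import Callable

def run_trace(problem: str,
              generate: Callable[[str, str], str],
              execute: Callable[[str], dict],
              max_iterations: int = 3) -> list[dict]:
    """PLANNING -> ACTION -> REFLECTION, repeated until success or the limit."""
    trace: list[dict] = []
    for _ in range(max_iterations):
        plan = generate("planning", problem)        # PLANNING step
        code = generate("action", plan)             # ACTION step
        result = execute(code)                      # run the candidate solution
        reflection = generate("reflection", str(result))  # REFLECTION step
        trace += [
            {"step": "planning", "content": plan},
            {"step": "action", "content": code},
            {"step": "reflection", "content": reflection},
        ]
        if result.get("success"):
            break                                   # stop early on success
    return trace
```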
Rewrite the file with correct Python indentation, which was broken during a copy-paste operation.
- Add detailed test result output showing expected vs. actual values
- Add adversarial test cases for edge cases:
  - two_sum: negative numbers, duplicates
  - is_palindrome: punctuation-only strings, case sensitivity
  - max_subarray: all-negative arrays
- Show test pass/fail counts in execution status
- Add test summary in final output
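The adversarial cases above, expressed as illustrative test data (the dict layout is hypothetical, not the PR's actual schema):

```python
# Edge cases assume LeetCode-style semantics: is_palindrome ignores
# non-alphanumeric characters and case; max_subarray returns the best sum.
ADVERSARIAL_CASES = {
    "two_sum": [
        {"input": ([-3, 4, 3, 90], 0), "expected": [0, 2]},   # negative numbers
        {"input": ([3, 3], 6), "expected": [0, 1]},           # duplicates
    ],
    "is_palindrome": [
        {"input": (".,!?",), "expected": True},               # punctuation only
        {"input": ("Aa",), "expected": True},                 # case sensitivity
    ],
    "max_subarray": [
        {"input": ([-5, -2, -8],), "expected": -2},           # all negative
    ],
}
```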
Unlike the structured Planning-Action-Reflection pipeline, this agent interleaves reasoning with code generation:
- [THINK] marker before each code block
- [WAIT] for catching bugs during reasoning
- [VERIFY] for tracing through the solution
- Preserves indentation when parsing code blocks
- Falls back to markdown code blocks if needed

This architecture catches bugs during generation, not after.
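A sketch of the marker parsing described above: extract [CODE] bodies without re-indenting them, and fall back to markdown fences; the helper itself is illustrative, not the PR's code:

```python
import re

CODE_BLOCK = re.compile(r"\[CODE\]\n?(.*?)\[/CODE\]", re.DOTALL)
MD_BLOCK = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code_blocks(text: str) -> list[str]:
    blocks = CODE_BLOCK.findall(text)
    if not blocks:                       # fall back to markdown fences
        blocks = MD_BLOCK.findall(text)
    # strip only the outer newlines so Python indentation is preserved
    return [b.strip("\n") for b in blocks]
```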
Full Atropos environment integration featuring:
- Extends BaseEnv with InterleavedCodeEnvConfig
- Interleaved reasoning with [THINK]/[CODE]/[VERIFY] markers
- Local code execution with test verification
- Reward calculation based on test pass rate plus structure bonuses (sketched below)
- Supports the HumanEval dataset or built-in problems
- WandB metrics for think_count, verify_rate, and accuracy
- CLI support: serve, process, and evaluate commands

Usage:
python interleaved_code_env.py serve --config config.yaml
python interleaved_code_env.py process --env.total_steps 100
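One plausible form of that reward, with assumed weights: test pass rate plus small bonuses for interleaved thinking and an explicit verify step. The bonus values are illustrative, not the PR's actual numbers:

```python
def compute_reward(passed: int, total: int,
                   think_count: int, has_verify: bool) -> float:
    pass_rate = passed / total if total else 0.0
    structure_bonus = 0.0
    if think_count >= 2:        # reward interleaved thinking
        structure_bonus += 0.1
    if has_verify:              # reward an explicit verification step
        structure_bonus += 0.1
    return min(1.0, pass_rate + structure_bonus)
```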
Two modes are now available:
1. trace_generator.py: standalone JSONL generation for fine-tuning
2. interleaved_code_env.py: Atropos RL environment

Trace generator features:
- Built-in coding problems (8 LeetCode-style)
- Configurable number of traces and temperature
- --only-success filter for quality data
- --chat-format for a simple fine-tuning format
- Full trace with [THINK]/[CODE]/[VERIFY] steps

Updated README with usage for both modes.
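An illustrative JSONL line in --chat-format; the field names are hypothetical, not the generator's confirmed schema:

```python
import json

record = {
    "problem": "two_sum",
    "messages": [
        {"role": "user", "content": "Implement two_sum(nums, target)."},
        {"role": "assistant",
         "content": "[THINK] Use a value-to-index map. [CODE]...[/CODE] "
                    "[VERIFY] All tests pass."},
    ],
    "success": True,
}
print(json.dumps(record))  # trace_generator-style output: one object per line
```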
- Add --force-interleave flag for a multi-turn conversation approach
- Model is prompted one step at a time and forced to stop after [/CODE]
- Re-prompts ensure granular reasoning instead of monolithic responses
- Add FORCED_INTERLEAVE_SYSTEM prompt for strict one-step output
- Track code block count to measure interleaving quality
- Update README with new mode documentation

Fixes an issue where DeepSeek-v3.2 outputs Plan→Code→Verify instead of the granular interleaved Think→Code→Think→Code→...→Verify format.
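A sketch of the forced-interleave loop under these assumptions: an OpenAI-compatible `client`, an abbreviated stand-in for the FORCED_INTERLEAVE_SYSTEM prompt, and a stop sequence at [/CODE]:

```python
FORCED_INTERLEAVE_SYSTEM = "Output exactly one [THINK] + [CODE] step, then stop."

def forced_interleave(client, model: str, problem: str, max_steps: int = 12) -> list:
    messages = [{"role": "system", "content": FORCED_INTERLEAVE_SYSTEM},
                {"role": "user", "content": problem}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=model, messages=messages,
            stop=["[/CODE]"],                 # halt after each code block
        )
        # re-append the marker consumed by the stop sequence
        step = resp.choices[0].message.content + "[/CODE]"
        messages.append({"role": "assistant", "content": step})
        if "[VERIFY]" in step:                # the model signaled completion
            break
        messages.append({"role": "user",
                         "content": "Continue with the next step."})
    return messages
```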
- New trace_generator_tools.py: multi-turn generation with code execution
- Model receives [RESULT]/[ERROR] feedback and can iterate to fix bugs
- Creates richer training data with error-recovery patterns
- Tracks code_iterations and had_errors metrics
- Supports --training-format for single-message output
- Update README with documentation for all three modes

Key differences from the marker-based approach:
- Tool-based: Think→Code→Result→Think→Fix→Result→Verify
- Marker-based: Think→Code→Verify (no execution feedback)
- Add stop sequences ["[RESULT]", "[ERROR]"] to prevent the model from generating fake execution results
- Add _strip_hallucinated_results() to remove any that slip through
- Update system prompt to explicitly forbid [RESULT]/[ERROR] output
- Clarify that these markers come from the SYSTEM only
- Update initial prompt with a clear "STOP after [/CODE]" instruction

Fixes an issue where DeepSeek would generate the entire conversation, including fake test results, instead of waiting for actual code execution.
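A minimal sketch of the stripping step; the regex is an approximation of the PR's cleanup, not verbatim code:

```python
import re

# match a model-emitted [RESULT]/[ERROR] block up to its closing tag or EOF
FAKE_RESULT = re.compile(r"\[(RESULT|ERROR)\].*?(\[/\1\]|$)", re.DOTALL)

def _strip_hallucinated_results(text: str) -> str:
    """Drop execution-result blocks the model invented; only the system
    harness is allowed to emit [RESULT]/[ERROR]."""
    return FAKE_RESULT.sub("", text).rstrip()
```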
New trace_generator_interleaved_tools.py combines BOTH dimensions:
- TRUE interleaving: Think→Code→Think→Code (1-3 lines per block)
- REAL tool execution: [RESULT]/[ERROR] feedback from actual code runs
Key features:
- Forces one step at a time via strict stop sequences
- Accumulates code blocks incrementally
- Executes when the function looks complete (a heuristic sketched below the diagram)
- Allows fix attempts on failures
- --only-ideal flag to keep only best traces
- Quality metrics: think_count, code_block_count, is_ideal
This produces the optimal training data, with granular reasoning AND real execution feedback, covering both of the orthogonal dimensions:
Interleaved
↑
[forced] ──┼── [THIS] ★ IDEAL
│
───────────┼────────── Tool Use
│
[default] ─┼── [tools.py]
↓
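One way the "executes when the function looks complete" heuristic could work, sketched with Python's ast module; the PR's actual rule may differ:

```python
import ast

def looks_complete(accumulated_code: str) -> bool:
    """Accumulated 1-3 line blocks are runnable once they parse as a whole
    and contain a function definition with a return statement."""
    try:
        tree = ast.parse(accumulated_code)
    except SyntaxError:
        return False                      # still mid-function; keep accumulating
    has_def = any(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
    has_return = any(isinstance(n, ast.Return) for n in ast.walk(tree))
    return has_def and has_return
```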
Replace [THINK], [CODE], [VERIFY], [RESULT], [ERROR] with an XML format:
- <think>...</think>
- <code>...</code>
- <verify>...</verify>
- <result>...</result>
- <error>...</error>

Updated in all three trace generators:
- trace_generator.py
- trace_generator_tools.py
- trace_generator_interleaved_tools.py

Also updated:
- System prompts and examples
- Parsing regex patterns
- Output formatting
- Stop sequences
- README documentation

XML tags are cleaner, more standard, and easier to parse.
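A sketch of tag extraction after the migration; one pattern serves all five tags, which is part of why XML parses more uniformly than the bracket markers:

```python
import re

def extract_tag(text: str, tag: str) -> list[str]:
    """Return the bodies of all <tag>...</tag> spans in the trace."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)

# e.g. extract_tag(trace, "code") replaces the old [CODE]...[/CODE] parsing
```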
New trace_generator_inline_tools.py uses RL-style inline tool calls
inside <think> blocks, matching the pattern from tool_use_interleaved_thinking.py:
- Tool calls via <tool_call>{JSON}</tool_call> inside open <think> block
- System executes tool and injects <tool_response>{JSON}</tool_response>
- Model continues reasoning after seeing execution results
- Supports --only-ideal, --only-success, --training-format flags
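A sketch of that protocol; `generate` and `execute` are hypothetical callables, and the generator is assumed to stop after each closing </tool_call>:

```python
import json
import re
from typing import Callable

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def inline_tool_loop(generate: Callable[[str], str],
                     execute: Callable[[str], dict],
                     prompt: str, max_rounds: int = 8) -> str:
    transcript = prompt
    for _ in range(max_rounds):
        chunk = generate(transcript)          # model stops after </tool_call>
        transcript += chunk
        match = TOOL_CALL.search(chunk)
        if not match:
            break                             # no tool call; reasoning is done
        call = json.loads(match.group(1))
        result = execute(call["code"])        # system runs the proposed code
        # inject the execution result so the model continues reasoning on it
        transcript += f"<tool_response>{json.dumps(result)}</tool_response>"
    return transcript
```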
… sequences
- _extract_tests_from_prompt: no longer returns an ellipsis placeholder; also extracts tests from docstrings
- parse_tool_call: new _fix_json_newlines() to escape literal newlines in JSON strings
- Removed stop sequences that were cutting off code mid-generation
- Better detection of tool_call tags in the response
Three parsing methods for tool_call JSON:
- Method 1: direct JSON parse
- Method 2: fix literal newlines in strings
- Method 3: extract the code value directly from malformed JSON
Also:
- Added debug logging to parse_tool_call
- Handle \r escape sequences
When a tool_response contains an error or a partial test failure, inject continuation hints to encourage the model to keep reasoning:
- On error: "I see there's an error: <msg>\nLet me analyze and fix..."
- On partial failure: "I got X/Y tests passing. Let me fix the failing cases."
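A minimal sketch of the hint injection; the message wording follows the commit text, while the function and result schema are illustrative:

```python
def continuation_hint(result: dict) -> str | None:
    if result.get("error"):
        return (f"I see there's an error: {result['error']}\n"
                "Let me analyze and fix...")
    passed, total = result.get("passed", 0), result.get("total", 0)
    if total and passed < total:
        return f"I got {passed}/{total} tests passing. Let me fix the failing cases."
    return None  # all tests passed; no hint needed
```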
Added clear guidance with correct/incorrect examples:
- Use a SINGLE escape (\n) for newlines in code
- Never use double (\\n) or quadruple (\\\\n) escapes
- Keep JSON on a single line
The previous parser failed on 8.5% of tool_calls. New approach:
- Strategy 1: direct json.loads() for valid JSON
- Strategy 2: simple newline fix (\n → \\n) for real newlines
- Strategy 3: regex extraction for docstrings and complex cases

Expected improvement: 26.5% → 99.5% parse success rate.
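A sketch of the three-strategy parser; the regex and fallback order approximate the description above rather than the PR's exact code:

```python
import json
import re

def parse_tool_call(raw: str) -> dict | None:
    # Strategy 1: the JSON is already valid
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Strategy 2: escape real newlines/carriage returns inside the payload
    # (assumes the JSON itself sits on one line, as the prompt instructs)
    try:
        return json.loads(raw.replace("\r", "\\r").replace("\n", "\\n"))
    except json.JSONDecodeError:
        pass
    # Strategy 3: pull the "code" value straight out of the malformed JSON
    match = re.search(r'"code"\s*:\s*"(.*)"\s*}?\s*$', raw, re.DOTALL)
    if match:
        code = match.group(1).encode().decode("unicode_escape")
        return {"code": code}
    return None  # unparseable; caller logs and skips this tool_call
```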
PR Type
📝 General Information
Description
Related Issues
Type of Change
🔖 Environment Snapshot
🧪 Zero-Training Test Results
Details
W&B Link:
Examples of the Environment scoring a good example and a bad example:
✅ Developer & Reviewer Checklist