Skip to content

Conversation

@justaddcoffee
Copy link
Contributor

Summary

Adds a configurable disable_shell_tool option to CodexCoder that prevents filesystem access via bash commands, enabling fair MCP-only evaluations.

Note: This PR is based on the fix-gemini-cli-invocation branch which contains all previous Codex CLI fixes (MCP support, proper invocation, 401 fix, etc.). This should be merged after that branch is merged, or both can be merged together.

Changes

  • Added disable_shell_tool attribute to CodexCoder class
  • When enabled, Codex runs with --disable shell_tool flag instead of --dangerously-bypass-approvals-and-sandbox
  • Updated create_coder() and run_single_eval() to pass coder-specific options from YAML config

Usage

coders:
  codex:
    disable_shell_tool: true  # MCP-only mode

Motivation

During MCP literature evaluations, we discovered Codex was using shell commands (rg, sed, find) to access the filesystem and read expected answers from test case files. This made the evaluation results invalid.

With disable_shell_tool: true, Codex can only use MCP tools to retrieve information, ensuring a fair comparison with other agents.

🤖 Generated with Claude Code

When disable_shell_tool is set in the coder config, Codex runs with
--disable shell_tool flag, preventing filesystem access via bash commands.
This enables fair MCP-only evaluations where Codex can only use MCP tools.

Usage in eval config:
  coders:
    codex:
      disable_shell_tool: true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude
Copy link
Contributor

claude bot commented Dec 15, 2025

Claude finished @justaddcoffee's task —— View job


Code Review Complete ✅

Reviewed PR #39: Add disable_shell_tool option to CodexCoder for MCP-only evaluation

Todo List:

  • Read and analyze modified files
  • Review code quality and best practices
  • Check for potential bugs or issues
  • Assess performance considerations
  • Evaluate security concerns
  • Review test coverage needs
  • Provide comprehensive feedback

Review Summary

This PR adds a configurable disable_shell_tool option to CodexCoder for MCP-only evaluation. The implementation is well-designed and addresses a legitimate need for fair evaluation comparisons by preventing filesystem access through shell commands.

Code Quality & Best Practices ✅

Strengths:

  • Clean, focused implementation with minimal surface area
  • Follows existing codebase patterns and conventions
  • Good documentation in class docstring (src/metacoder/coders/codex.py:45-50)
  • Proper logging with informative messages (src/metacoder/coders/codex.py:163)
  • Configuration through YAML is intuitive and consistent

Areas for Improvement:

  • Consider adding type hints for the disable_shell_tool attribute to be explicit:
    disable_shell_tool: bool = Field(default=False, description="Disable shell/bash access, MCP-only mode")

Potential Bugs & Issues ⚠️

Minor Issues:

  1. Flag validation: The code assumes the --disable shell_tool flag exists in the codex CLI but doesn't validate this. Consider adding a comment about minimum required codex CLI version.

  2. Error handling: If the --disable shell_tool flag isn't supported by the installed codex version, the error won't be very informative. The current error handling in src/metacoder/coders/codex.py:190 would catch this but might be unclear.

Security Concerns ✅

Positive Security Impact:

  • The feature actually improves security by providing a way to restrict filesystem access
  • Using --disable shell_tool instead of --dangerously-bypass-approvals-and-sandbox is more secure
  • No new security vulnerabilities introduced

Performance Considerations ✅

Performance Impact:

  • Positive impact: Disabling shell tools should improve performance by preventing filesystem operations
  • No performance regressions introduced
  • Command building logic is efficient (src/metacoder/coders/codex.py:156-167)

Implementation Quality ✅

Well-structured changes:

  1. CodexCoder class (src/metacoder/coders/codex.py:55):

    • Clean attribute addition
    • Proper documentation
    • Follows existing patterns
  2. Command building logic (src/metacoder/coders/codex.py:156-167):

    • Clear conditional logic
    • Good separation between MCP-only and default modes
    • Informative logging
  3. Configuration passing (src/metacoder/evals/runner.py:187-194):

    • Robust option handling with hasattr check
    • Proper warning for unknown options
    • Good logging

Test Coverage 🔶

Missing test coverage (recommendation):

  • Unit tests for the disable_shell_tool flag functionality
  • Integration tests to verify MCP-only mode works as expected
  • Test cases for both enabled/disabled states

Suggested test additions:

def test_codex_disable_shell_tool_flag():
    """Test that disable_shell_tool affects command building."""
    coder = CodexCoder(workdir="test", disable_shell_tool=True)
    # Mock the command execution and verify --disable shell_tool is used

def test_codex_coder_options_integration():
    """Test coder options are passed correctly from runner."""
    # Test create_coder with coder_options containing disable_shell_tool

Architecture Alignment ✅

The changes align well with the existing architecture:

  • Follows the established pattern of coder-specific configuration
  • Integrates properly with the evaluation framework
  • Maintains backward compatibility

Minor Suggestions

  1. Documentation: Consider adding an example in CLAUDE.md showing the disable_shell_tool option
  2. Validation: Add validation to ensure disable_shell_tool is boolean in CodexCoder
  3. Logging: Consider DEBUG level for the "Set coder option" message in runner.py:192

Overall Assessment

✅ This is a well-implemented feature that:

  • Addresses a legitimate evaluation fairness concern
  • Implements the solution cleanly with minimal code changes
  • Follows established patterns and conventions
  • Improves security by providing a restricted mode
  • Maintains backward compatibility

Recommendation: Approve with minor suggestions for test coverage and documentation improvements.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants