Summary

This PR integrates the benchmarking code from Aditya's fork.

Key changes:

  • Fixed eval_runner.py to patch the version module, resolving import errors when the benchmarks code tries to access the non-existent vendor/software-agent-sdk path (see the sketch after this list)
  • Added EVAL_INTEGRATION.md with a practical quick-start guide
  • Added example LLM config file for vLLM setup
  • Added troubleshooting section for common issues encountered during setup
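
A minimal sketch of the patching approach used by eval_runner.py, assuming the benchmarks package imports an SDK version module from vendor/software-agent-sdk; the module and attribute names below are illustrative, not the exact ones in the script:

import sys
import types
from pathlib import Path

# Illustrative layout: the real eval_runner.py resolves these paths itself.
REPO_ROOT = Path(__file__).resolve().parent.parent
BENCHMARKS_DIR = REPO_ROOT / "benchmarks"

# Make the benchmarks submodule importable at runtime instead of installing it.
sys.path.insert(0, str(BENCHMARKS_DIR))

# Register a stub for the version module that the benchmarks code expects to
# find under vendor/software-agent-sdk, pointing it at the parent repo's SDK SHA.
version_stub = types.ModuleType("version")
version_stub.SDK_SHA = "<parent-repo-sdk-sha>"  # placeholder, not a real value
sys.modules["version"] = version_stub

# Only after patching can the benchmarks entry point be imported safely.
# from benchmarks.agentic_code_search import runner  # illustrative import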

How to Run and Verify

1. Start vLLM with tool calling enabled

uv run vllm serve Qwen/Qwen3-4B \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
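
Optionally, confirm the server is up before moving on. vLLM exposes an OpenAI-compatible API, so listing the served models is a quick sanity check (stdlib only; the port matches the serve command above):

import json
import urllib.request

# Query vLLM's OpenAI-compatible model listing.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

print([m["id"] for m in models["data"]])  # should include "Qwen/Qwen3-4B"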

2. Create an LLM config, or use configs/eval_llm_config_example.json

mkdir -p configs
cat > configs/llm_config.json << 'EOF'
{
  "model": "openai/Qwen/Qwen3-4B",
  "api_key": "dummy",
  "base_url": "http://localhost:8000/v1",
  "temperature": 0.0
}
EOF
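
A quick way to confirm the config parses and points at the local server before kicking off a run (field names follow the example config above):

import json

with open("configs/llm_config.json") as f:
    cfg = json.load(f)

# The model name keeps the litellm "openai/" provider prefix; base_url targets vLLM.
assert cfg["model"].startswith("openai/"), cfg["model"]
assert cfg["base_url"].endswith("/v1"), cfg["base_url"]
print(cfg["model"], "->", cfg["base_url"])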

3. Run the evaluation on a single instance

./scripts/run_eval.sh \
    --dataset_file benchmarks/gt_location.jsonl \
    --llm-config-path configs/llm_config.json \
    --system_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/system_prompt.j2 \
    --user_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/file_module_short.j2 \
    --tools terminal \
    --max-iterations 10 \
    --num-workers 1 \
    --output-dir ./agentic_code_search_outputs \
    --n-limit 1 \
    --workspace_base_dir /tmp/testbed/

4. Verify results

cat ./agentic_code_search_outputs/agentic_code_search_gt_location/openai/Qwen/Qwen3-4B_sdk_*/output.jsonl | jq '.test_result.reward'

The expected output shows F1 scores for file-, module-, and entity-level localization:

{
  "file_reward": 0.5,
  "module_reward": 0.5,
  "entity_reward": 0.4
}
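
For intuition, each reward here is an F1-style overlap between predicted and gold locations. The sketch below shows the shape of such a metric only; it is not the exact implementation from the benchmarks repo:

def f1_reward(predicted: set[str], gold: set[str]) -> float:
    """Set-overlap F1 between predicted and gold localizations (files, modules, or entities)."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one of two predicted files matches one of two gold files -> F1 = 0.5.
print(f1_reward({"a.py", "b.py"}, {"a.py", "c.py"}))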

Key Files

  • scripts/eval_runner.py: Main entry point that patches the version module and imports the benchmarks code
  • scripts/run_eval.sh: Shell wrapper for uv run
  • docs/EVAL_INTEGRATION.md: Full documentation with quick start, implementation details, and troubleshooting
  • configs/eval_llm_config_example.json: Example LLM config for vLLM
  • benchmarks/benchmarks/agentic_code_search/: Evaluation code from the benchmarks submodule

- Add adityasoni9998/benchmarks as git submodule (agentic_code_search branch)
- Add eval_runner.py that uses sys.path to import benchmarks at runtime
- Add run_eval.sh wrapper script for running evaluations
- Add minimal deps (jinja2, pandas, tqdm, lmnr) needed for benchmarks

This allows running agentic_code_search evaluations using the benchmarks
repo while keeping our existing SDK and training setup intact.
- Fix version module patching in eval_runner.py to use the parent repo's SDK SHA
  instead of benchmarks/vendor/, which doesn't exist in our setup
- Restructure EVAL_INTEGRATION.md with practical quick start guide
- Add example LLM config file for vLLM setup
- Add troubleshooting section for common issues (litellm provider prefix,
  vLLM tool calling flags, stale output)
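
As an illustration of the litellm provider-prefix issue mentioned above: litellm picks its client from the model string's prefix, so a local vLLM endpoint is addressed through the openai/ provider. The values below mirror the example config; this is a sketch, not code from the PR:

import litellm

# Without the "openai/" prefix, litellm cannot infer a provider for the bare
# model name and the request fails before ever reaching the vLLM server.
response = litellm.completion(
    model="openai/Qwen/Qwen3-4B",
    api_key="dummy",
    api_base="http://localhost:8000/v1",
    messages=[{"role": "user", "content": "Say hi"}],
    temperature=0.0,
)
print(response.choices[0].message.content)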