Summary

This PR integrates the benchmarking code from Aditya's fork.

Key changes:

  • Fixed eval_runner.py to patch the version module, resolving import errors when the benchmarks code tries to access the non-existent vendor/software-agent-sdk path (see the sketch after this list)
  • Added EVAL_INTEGRATION.md with a practical quick-start guide
  • Added example LLM config file for vLLM setup
  • Added troubleshooting section for common issues encountered during setup
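
A minimal sketch of the patching approach used by eval_runner.py, assuming the benchmarks package imports an SDK version module from vendor/software-agent-sdk; the module and attribute names below are illustrative, not the exact ones in the script:

import sys
import types
from pathlib import Path

# Illustrative layout: the real eval_runner.py resolves these paths itself.
REPO_ROOT = Path(__file__).resolve().parent.parent
BENCHMARKS_DIR = REPO_ROOT / "benchmarks"

# Make the benchmarks submodule importable at runtime instead of installing it.
sys.path.insert(0, str(BENCHMARKS_DIR))

# Register a stub for the version module that the benchmarks code expects to
# find under vendor/software-agent-sdk, pointing it at the parent repo's SDK SHA.
version_stub = types.ModuleType("version")
version_stub.SDK_SHA = "<parent-repo-sdk-sha>"  # placeholder, not a real value
sys.modules["version"] = version_stub

# Only after patching can the benchmarks entry point be imported safely.
# from benchmarks.agentic_code_search import runner  # illustrative import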

How to Run and Verify

1. Start vLLM with tool calling enabled

uv run vllm serve Qwen/Qwen3-4B \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
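
Optionally, confirm the server is up before moving on. vLLM exposes an OpenAI-compatible API, so listing the served models is a quick sanity check (stdlib only; the port matches the serve command above):

import json
import urllib.request

# Query vLLM's OpenAI-compatible model listing.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

print([m["id"] for m in models["data"]])  # should include "Qwen/Qwen3-4B"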

2. Create an LLM config, or use configs/eval_llm_config_example.json

mkdir -p configs
cat > configs/llm_config.json << 'EOF'
{
  "model": "openai/Qwen/Qwen3-4B",
  "api_key": "dummy",
  "base_url": "http://localhost:8000/v1",
  "temperature": 0.0
}
EOF
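
A quick way to confirm the config parses and points at the local server before kicking off a run (field names follow the example config above):

import json

with open("configs/llm_config.json") as f:
    cfg = json.load(f)

# The model name keeps the litellm "openai/" provider prefix; base_url targets vLLM.
assert cfg["model"].startswith("openai/"), cfg["model"]
assert cfg["base_url"].endswith("/v1"), cfg["base_url"]
print(cfg["model"], "->", cfg["base_url"])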

3. Run the evaluation on a single instance

./scripts/run_eval.sh \
    --dataset_file benchmarks/gt_location.jsonl \
    --llm-config-path configs/llm_config.json \
    --system_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/system_prompt.j2 \
    --user_prompt_file benchmarks/benchmarks/agentic_code_search/prompts/file_module_short.j2 \
    --tools terminal \
    --max-iterations 10 \
    --num-workers 1 \
    --output-dir ./agentic_code_search_outputs \
    --n-limit 1 \
    --workspace_base_dir /tmp/testbed/

4. Verify results

cat ./agentic_code_search_outputs/agentic_code_search_gt_location/openai/Qwen/Qwen3-4B_sdk_*/output.jsonl | jq '.test_result.reward'

The expected output shows F1 scores for file-, module-, and entity-level localization:

{
  "file_reward": 0.5,
  "module_reward": 0.5,
  "entity_reward": 0.4
}
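
For intuition, each reward here is an F1-style overlap between predicted and gold locations. The sketch below shows the shape of such a metric only; it is not the exact implementation from the benchmarks repo:

def f1_reward(predicted: set[str], gold: set[str]) -> float:
    """Set-overlap F1 between predicted and gold localizations (files, modules, or entities)."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one of two predicted files matches one of two gold files -> F1 = 0.5.
print(f1_reward({"a.py", "b.py"}, {"a.py", "c.py"}))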

Key Files

  • scripts/eval_runner.py: Main entry point that patches the version module and imports the benchmarks code
  • scripts/run_eval.sh: Shell wrapper for uv run
  • docs/EVAL_INTEGRATION.md: Full documentation with quick start, implementation details, and troubleshooting
  • configs/eval_llm_config_example.json: Example LLM config for vLLM
  • benchmarks/benchmarks/agentic_code_search/: Evaluation code from the benchmarks submodule

- Add adityasoni9998/benchmarks as git submodule (agentic_code_search branch)
- Add eval_runner.py that uses sys.path to import benchmarks at runtime
- Add run_eval.sh wrapper script for running evaluations
- Add minimal deps (jinja2, pandas, tqdm, lmnr) needed for benchmarks

This allows running agentic_code_search evaluations using the benchmarks
repo while keeping our existing SDK and training setup intact.
- Fix version module patching in eval_runner.py to use the parent repo's SDK SHA
  instead of benchmarks/vendor/, which doesn't exist in our setup
- Restructure EVAL_INTEGRATION.md with practical quick start guide
- Add example LLM config file for vLLM setup
- Add troubleshooting section for common issues (litellm provider prefix,
  vLLM tool calling flags, stale output)
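
As an illustration of the litellm provider-prefix issue mentioned above: litellm picks its client from the model string's prefix, so a local vLLM endpoint is addressed through the openai/ provider. The values below mirror the example config; this is a sketch, not code from the PR:

import litellm

# Without the "openai/" prefix, litellm cannot infer a provider for the bare
# model name and the request fails before ever reaching the vLLM server.
response = litellm.completion(
    model="openai/Qwen/Qwen3-4B",
    api_key="dummy",
    api_base="http://localhost:8000/v1",
    messages=[{"role": "user", "content": "Say hi"}],
    temperature=0.0,
)
print(response.choices[0].message.content)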