Releases: HiThink-Research/GAGE

vLLM Optimization, Speech Understanding & LiteLLM Tooling Enhancement

11 Feb 09:03
f964a19

Highlights

  • vLLM Backend Optimization: Native support for data-parallel arguments and "fail-fast" error handling.
  • LiteLLM Tooling Enhancement: Full support for Tool Choice and toolchain execution logic.
  • Speech Understanding Benchmarks: Integrated MMSU and MMAU-Pro for advanced speech and audio understanding evaluation.
  • Multi-GPU Scalability: Enabled single-machine multi-GPU execution for high-performance inference.
  • Automated Diagnostics: Implemented automatic log dumping upon exceptions for rapid debugging.

vLLM & Backend Optimization

  • Distributed Support: Added pipeline/data-parallel fields and native argument propagation (rank, address, port) in VLLMBackendConfig (illustrated in the sketch after this list).
  • Reliability Mechanism: Implemented "fail-fast" strategy; removed unstable dummy-engine fallbacks to ensure backend errors are explicitly surfaced.
  • Inference Trace: Enhanced InferenceStep to identify structured backend errors and emit precise inference_error traces.
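
As referenced above, here is a minimal sketch of what a data-parallel-aware backend config might look like. It is not GAGE's actual VLLMBackendConfig; the data_parallel_* argument names follow recent vLLM releases but may differ across versions.

```python
# Minimal sketch, not GAGE's actual VLLMBackendConfig: a backend config that
# carries pipeline/data-parallel placement and forwards it as engine arguments.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMBackendConfigSketch:
    model: str
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    data_parallel_rank: Optional[int] = None      # rank of this replica
    data_parallel_address: Optional[str] = None   # coordinator address
    data_parallel_rpc_port: Optional[int] = None  # coordinator RPC port

    def engine_args(self) -> dict:
        """Propagate the parallelism fields to the engine without renaming them."""
        args = {
            "model": self.model,
            "tensor_parallel_size": self.tensor_parallel_size,
            "pipeline_parallel_size": self.pipeline_parallel_size,
            "data_parallel_size": self.data_parallel_size,
        }
        if self.data_parallel_rank is not None:
            args.update(
                data_parallel_rank=self.data_parallel_rank,
                data_parallel_address=self.data_parallel_address,
                data_parallel_rpc_port=self.data_parallel_rpc_port,
            )
        return args
```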

LiteLLM & Tooling

  • Tool Choice Logic: Added support for explicit tool selection (tool_choice) modes, handled consistently across diverse model providers (see the snippet after this list).
  • Toolchain Support: Standardized tool formatting and normalized definitions to ensure stable and reliable sequential tool execution.
  • API Compatibility: Refined LiteLLM backend to align with the latest standardized tool-calling protocols.
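
For context, the snippet below shows the OpenAI-style tool-calling shape that LiteLLM normalizes across providers. The model name and the lookup_price tool are placeholders; this illustrates the protocol, not GAGE's internal backend code.

```python
# Illustrative only: OpenAI-style tool calling routed through LiteLLM.
import litellm

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_price",
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = litellm.completion(
    model="openai/gpt-4o-mini",  # any LiteLLM-routable model
    messages=[{"role": "user", "content": "Price of AAPL?"}],
    tools=tools,
    # "auto" lets the model decide; a dict forces a specific tool:
    tool_choice={"type": "function", "function": {"name": "lookup_price"}},
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```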

Benchmark Expansion

  • Speech Understanding: Integrated MMSU (Multi-modal Speech Understanding) and MMAU-Pro into the core evaluation suite.
  • Data Loading Fixes: Stabilized ARC-AGI-2 loader by optimizing metadata path resolution to avoid canonicalization conflicts.
  • General Coverage: Broadened the benchmark suite with enhanced support for multi-modal and speech-centric evaluation tasks.

Infrastructure & Scalability

  • Multi-GPU Execution: Refactored the execution pipeline to natively support single-machine multi-GPU workloads, reducing latency for large model evaluations.
  • Exception Handling: Built an automated utility to capture and dump system logs immediately upon encountering runtime exceptions.
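
A minimal sketch of the log-dump idea, with hypothetical names rather than GAGE's actual utility: buffer log records in memory and write them to a crash file only when an exception escapes.

```python
import contextlib
import logging
import time


class _ListHandler(logging.Handler):
    """Collects formatted log records in memory."""

    def __init__(self, sink: list):
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(self.format(record))


@contextlib.contextmanager
def dump_logs_on_exception(path_prefix: str = "crash"):
    records: list = []
    handler = _ListHandler(records)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield
    except Exception:
        dump_path = f"{path_prefix}_{int(time.time())}.log"
        with open(dump_path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(records))
        raise  # re-raise so the caller still sees the failure
    finally:
        root.removeHandler(handler)


# Usage: a crash inside the block leaves a timestamped log dump behind.
with dump_logs_on_exception("eval_run"):
    logging.getLogger(__name__).warning("starting inference")
    # run_evaluation(...)
```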

Documentation & Tests

  • Documentation: Organized and enhanced benchmark descriptions; updated granular configuration guides for vLLM and LiteLLM.
  • Test Suite: Added validation tests for vLLM flag propagation and tool-calling consistency; updated backend unit tests to reflect new error-handling behaviors.

Notes

This release focuses on strengthening the infrastructure for large-scale multi-modal evaluation and complex agent-tool interactions while providing significantly more robust diagnostic capabilities.

Game Arena V2, Tau2 Support & Benchmark Expansion

30 Jan 09:24
9006e7d

Highlights

  • Game Arena V2 launched: PettingZoo MARL support and Mahjong environment integration.
  • Tau2 benchmark integrated: Full end-to-end evaluation flow with domain-specific configs.
  • Pipeline Refactoring: AppWorld and SWE-bench aligned with official workflows and sandbox behaviors.
  • Benchmark Suite Expansion: Added 8+ new benchmarks including SimpleQA Verified, Global PIQA, and LiveCodeBench.

Game Arena V2

  • PettingZoo Infrastructure: Implemented base adapters, environment wrappers, and standardized data protocols for multi-agent interaction (the loop shape is sketched after this list).
  • Mahjong Implementation: Built a full-featured Mahjong engine including Tenhou-style rules, hidden information masking, and structured scoring.
  • Arena V2 Core: Refactored execution loop to handle both simultaneous and turn-based multi-agent flows.
  • PettingZoo Games: Added loading mechanisms and configurations to support standard multi-agent game libraries.
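
The adapters presumably wrap a turn-based loop of the shape below, shown here against the stock PettingZoo tictactoe_v3 environment with a random policy standing in for the LLM player.

```python
# Standard PettingZoo AEC loop; the random policy is a stand-in for an agent.
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # finished agents must step with None
    else:
        mask = observation["action_mask"]          # illegal moves masked out
        action = env.action_space(agent).sample(mask)
    env.step(action)

env.close()
```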

Tau2 Benchmark Integration

  • End-to-End Flow: Added loader (Hugging Face), preprocessor, runtime, judge, metrics (including pass_hat@k; an estimator sketch follows this list), and summary generator.
  • Configurations: Added new Tau2 configs for airline, retail, telecom, and mock domains plus smoke tests.
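
For reference, a sketch of a pass_hat@k estimator under the τ-bench-style definition: the probability that k i.i.d. trials of a task all succeed, estimated without bias as C(c, k) / C(n, k) from n trials with c successes. Whether GAGE uses exactly this formula is an assumption.

```python
# pass_hat@k under the assumed τ-bench-style definition.
from math import comb


def pass_hat_k(n_trials: int, n_successes: int, k: int) -> float:
    if k > n_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(n_successes, k) / comb(n_trials, k)  # comb(c, k) is 0 when c < k


# Example: 4 trials per task, averaged over tasks.
per_task = [(4, 4), (4, 3), (4, 1)]  # (trials, successes)
for k in (1, 2, 4):
    score = sum(pass_hat_k(n, c, k) for n, c in per_task) / len(per_task)
    print(f"pass_hat@{k} = {score:.3f}")
```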

Pipeline Refactoring

  • AppWorld: Aligned evaluation flow, prompts, and tests with current helper-app and template behavior.
  • SWE-bench: Refactored Pro configs and tests to align with official run scripts and sandbox runtime expectations; cleaned legacy model config.

Benchmark Expansion

  • New Benchmarks: Integrated SimpleQA Verified, Global PIQA, MRCR v2, ARC-AGI-2, ScreenSpot-Pro, CharXiv Reasoning, bizfinbench.v2, and LiveCodeBench.

Documentation & Tests

  • Documentation: Updated guides for AppWorld, SWE-bench, and Tau2 evaluation usage and best practices.
  • Test Suite: Added validation tests for multi-agent synchronization and Mahjong constraints; updated tests to match current tool routing.

Notes

This release consolidates the transition to Game Arena V2 for complex agent interactions while significantly broadening the static benchmark coverage.

Agent Evaluation Infrastructure, Sandbox Runtime & Benchmark Expansion

16 Jan 09:14
0a1e6b4

Highlights

  • Full agent evaluation infrastructure delivered: DUTAgent, AgentLoop, toolchain routing, and extensible AgentBackends.
  • Sandbox runtime system introduced: multi-driver support (docker/local/remote), runtime profiles (aio/appworld/llm/opensandbox), plus pooling and lifecycle handling.
  • AppWorld evaluation integrated end-to-end: official JSONL config, dataset preprocessor, MCP tools, judge evaluation, metrics, and summary reporting.
  • GameArena expanded with Dou Dizhu (Landlord) and Human Showdown support.

Agent Evaluation Infrastructure

  • DUTAgent and AgentLoop provide a canonical evaluation loop with toolchain routing hooks.
  • AgentBackends make backend integration extensible across models and runtimes.

Sandbox Runtime System

  • Drivers: docker, local, and remote.
  • Runtime profiles: aio, appworld, llm, and opensandbox.
  • Pooling and lifecycle management for reusable runtime instances.
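
A rough sketch of the pooling idea, with entirely hypothetical names: reuse warm sandbox instances keyed by driver and profile instead of paying startup cost per task.

```python
import contextlib
import queue
from dataclasses import dataclass, field


@dataclass
class SandboxPool:
    driver: str    # e.g. "docker", "local", "remote"
    profile: str   # e.g. "aio", "appworld", "llm", "opensandbox"
    _idle: queue.Queue = field(default_factory=queue.Queue)

    def _create(self) -> dict:
        # The real system would start a container, local process, or remote session here.
        return {"driver": self.driver, "profile": self.profile}

    @contextlib.contextmanager
    def acquire(self):
        try:
            sandbox = self._idle.get_nowait()   # reuse a warm instance if one is idle
        except queue.Empty:
            sandbox = self._create()
        try:
            yield sandbox
        finally:
            self._idle.put(sandbox)             # return it for the next task


pool = SandboxPool(driver="docker", profile="appworld")
with pool.acquire() as sbx:
    print("running task in", sbx)
```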

AppWorld Evaluation

  • Official JSONL configuration and dataset preprocessor for AppWorld datasets.
  • End-to-end run: MCP tools, judge evaluation, metrics, and summary reporting.
  • Build/export scripts and diagnostics for dataset extraction and tool-doc probing.

Benchmarks

  • Added support for AIME 2024, AIME 2025, Math500, MME, HLE, and MMLU-Pro.

Resource Provider

  • Bundle provider added to pre-locate dataset-level resources.

GameArena Expansion

  • Dou Dizhu (Landlord) integrated with full replay support and human-agent interactive evaluation.
  • Human Showdown via Action Server enables real-time human intervention.
  • Arena docs expanded with Dou Dizhu guides and updated showcase positioning.

Documentation & Tests

  • Bilingual agent evaluation guide and framework overview updates.
  • Extensive test coverage added for agent, sandbox, and AppWorld flows.

Notes

This release deepens agent evaluation and runtime foundations; APIs/configs may continue to evolve during internal validation.

Standardized Sample Protocol & GameArena MVP (Gomoku + Tic-Tac-Toe)

31 Dec 09:16

Highlights

  • Standardized Sample protocol landed: canonical Sample helpers, validation, and runtime write-back are now aligned around a single contract.
  • GameArena MVP delivered: first-class arena step with registry-based environments, parsers, schedulers, players, and visualizers.

Standardized Sample Protocol

  • Sample helpers: append_predict_result, update_eval_result, and output resolution live in src/gage_eval/evaluation/sample_envelope.py (an illustrative shape is sketched after this list).
  • Schema validation: SampleValidator in src/gage_eval/assets/datasets/validation.py enforces configurable Sample checks.
  • Dataclass representation: Sample lives in src/gage_eval/assets/datasets/sample.py for consistent in-memory structure.
  • Normalization & mapping: utilities under src/gage_eval/assets/datasets/utils/ standardize messages/choices/multimodal fields.
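
The helper names above come from the release notes; the Sample fields and signatures in this sketch are assumptions about the contract, not the actual definitions.

```python
# Illustrative shape only, not the actual Sample contract.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sample:
    sample_id: str
    messages: list[dict] = field(default_factory=list)
    predict_results: list[dict] = field(default_factory=list)
    eval_result: dict[str, Any] = field(default_factory=dict)


def append_predict_result(sample: Sample, output: str, **extra: Any) -> Sample:
    """Write a model prediction back onto the sample (runtime write-back)."""
    sample.predict_results.append({"output": output, **extra})
    return sample


def update_eval_result(sample: Sample, **scores: Any) -> Sample:
    """Merge judge/metric scores into the sample's evaluation record."""
    sample.eval_result.update(scores)
    return sample


s = Sample(sample_id="demo-001", messages=[{"role": "user", "content": "2+2?"}])
append_predict_result(s, output="4", backend="vllm")
update_eval_result(s, exact_match=1.0)
```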

GameArena MVP

  • First-class runtime lane: arena step implemented in src/gage_eval/role/adapters/arena.py.
  • Registry-driven components: environments/parsers/renderers/players/schedulers under src/gage_eval/role/arena/.
  • Built-in games:
    • Gomoku: turn-based board evaluation with optional human interaction and renderer support.
    • Tic-Tac-Toe: lightweight grid game for fast human/LLM matchups.

Notes

  • This release focuses on Sample standardization and GameArena foundations; APIs/configs may still evolve during internal validation.

v0.0.1-alpha: Initial Release of Gage Eval

23 Dec 10:31
775b364

We are excited to announce the first alpha release of Gage Eval, a scalable and extensible framework for large model evaluation. This release marks the transition of the project into internal validation, providing a robust foundation for automated benchmarking across text, multimodal, and engineering tasks.

Core Architecture

  • Step-based Orchestration: Flexible pipeline management (support -> inference -> judge -> auto_eval).
  • Role-Backend Decoupling: RoleAdapter layer allows seamless switching between local engines and remote APIs without changing logic.
  • Unified Registry: Automatic asset discovery for datasets, backends, and metrics via @registry.asset (sketched after this list).
  • High-performance Runtime: Asynchronous execution with built-in backpressure control for maximum throughput.
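
An illustrative sketch of registry-based asset discovery: the @registry.asset decorator name comes from the notes above, while the registry implementation and its arguments here are assumed.

```python
# Hypothetical registry: register assets by kind and name, look them up later.
from typing import Callable, Dict


class Registry:
    def __init__(self) -> None:
        self._assets: Dict[str, Callable] = {}

    def asset(self, name: str, kind: str):
        def decorator(obj: Callable) -> Callable:
            self._assets[f"{kind}:{name}"] = obj   # discoverable by kind + name
            return obj
        return decorator

    def get(self, kind: str, name: str) -> Callable:
        return self._assets[f"{kind}:{name}"]


registry = Registry()


@registry.asset(name="exact_match", kind="metric")
def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


print(registry.get("metric", "exact_match")("4", "4"))  # 1.0
```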

Backend Support

  • vLLM (Local Primary): Unified engine for text and multimodal models (AsyncLLMEngine).
  • LiteLLM (API Primary): One-stop access to 100+ providers (OpenAI, Anthropic, Kimi, etc.).
  • High-throughput Engines: Support for SGLang and TGI.
  • Legacy & Debug: Standard openai_http and mock-ready dummy backends.

Supported Benchmarks

  • Text: MMLU, PIQA, GPQA.
  • Multimodal: MMMU, DocVQA, MathVista.
  • Engineering: Fully reproducible SWE-bench Pro integration with Docker support.

Observability

  • Structured Tracing: Detailed event streams (events.jsonl).
  • Robust Caching: Per-sample snapshots (samples.jsonl) for easy debugging.
  • Aggregated Reports: Final summary.json with comprehensive metrics and timings.
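
The artifacts above are plain JSON/JSONL, so they can be inspected directly. The file names come from the notes; the run directory and the field names accessed below are assumptions.

```python
import json
from pathlib import Path

run_dir = Path("runs/demo_echo_run_1")  # hypothetical output directory

# Stream the structured trace without loading it all into memory.
with open(run_dir / "events.jsonl", encoding="utf-8") as fh:
    for line in fh:
        event = json.loads(line)
        print(event.get("event"), event.get("sample_id"))

# Aggregated metrics and timings.
summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
print(summary.get("metrics"))
```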

Installation & Usage

  1. Setup environment: pip install -r requirements.txt

  2. Run a demo: python run.py --config config/run_configs/demo_echo_run_1.yaml --output-dir runs

Notes

This is an alpha release. The framework is currently in the internal validation phase. APIs, Sample structures, and configuration schemas are subject to change. Feedback is highly appreciated.