Releases: HiThink-Research/GAGE

vLLM Optimization, Speech Understanding & LiteLLM Tooling Enhancement

11 Feb 09:03
f964a19

Highlights

  • vLLM Backend Optimization: Native support for data-parallel arguments and "fail-fast" error handling.
  • LiteLLM Tooling Enhancement: Full support for Tool Choice and toolchain execution logic.
  • Speech Understanding Benchmarks: Integrated MMSU and MMAU-Pro for advanced speech and audio understanding evaluation.
  • Multi-GPU Scalability: Enabled single-machine multi-GPU execution for high-performance inference.
  • Automated Diagnostics: Implemented automatic log dumping upon exceptions for rapid debugging.

vLLM & Backend Optimization

  • Distributed Support: Added pipeline/data-parallel fields and native argument propagation (rank, address, port) in VLLMBackendConfig (illustrated in the sketch after this list).
  • Reliability Mechanism: Implemented "fail-fast" strategy; removed unstable dummy-engine fallbacks to ensure backend errors are explicitly surfaced.
  • Inference Trace: Enhanced InferenceStep to identify structured backend errors and emit precise inference_error traces.
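
As referenced above, here is a minimal sketch of what a data-parallel-aware backend config might look like. It is not GAGE's actual VLLMBackendConfig; the data_parallel_* argument names follow recent vLLM releases but may differ across versions.

```python
# Minimal sketch, not GAGE's actual VLLMBackendConfig: a backend config that
# carries pipeline/data-parallel placement and forwards it as engine arguments.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMBackendConfigSketch:
    model: str
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    data_parallel_rank: Optional[int] = None      # rank of this replica
    data_parallel_address: Optional[str] = None   # coordinator address
    data_parallel_rpc_port: Optional[int] = None  # coordinator RPC port

    def engine_args(self) -> dict:
        """Propagate the parallelism fields to the engine without renaming them."""
        args = {
            "model": self.model,
            "tensor_parallel_size": self.tensor_parallel_size,
            "pipeline_parallel_size": self.pipeline_parallel_size,
            "data_parallel_size": self.data_parallel_size,
        }
        if self.data_parallel_rank is not None:
            args.update(
                data_parallel_rank=self.data_parallel_rank,
                data_parallel_address=self.data_parallel_address,
                data_parallel_rpc_port=self.data_parallel_rpc_port,
            )
        return args
```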

LiteLLM & Tooling

  • Tool Choice Logic: Added support for explicit tool selection (tool_choice) modes, handled consistently across diverse model providers (see the snippet after this list).
  • Toolchain Support: Standardized tool formatting and normalized definitions to ensure stable and reliable sequential tool execution.
  • API Compatibility: Refined LiteLLM backend to align with the latest standardized tool-calling protocols.
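
For context, the snippet below shows the OpenAI-style tool-calling shape that LiteLLM normalizes across providers. The model name and the lookup_price tool are placeholders; this illustrates the protocol, not GAGE's internal backend code.

```python
# Illustrative only: OpenAI-style tool calling routed through LiteLLM.
import litellm

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_price",
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = litellm.completion(
    model="openai/gpt-4o-mini",  # any LiteLLM-routable model
    messages=[{"role": "user", "content": "Price of AAPL?"}],
    tools=tools,
    # "auto" lets the model decide; a dict forces a specific tool:
    tool_choice={"type": "function", "function": {"name": "lookup_price"}},
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```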

Benchmark Expansion

  • Speech Understanding: Integrated MMSU (Multi-modal Speech Understanding) and MMAU-Pro into the core evaluation suite.
  • Data Loading Fixes: Stabilized ARC-AGI-2 loader by optimizing metadata path resolution to avoid canonicalization conflicts.
  • General Coverage: Broadened the benchmark suite with enhanced support for multi-modal and speech-centric evaluation tasks.

Infrastructure & Scalability

  • Multi-GPU Execution: Refactored the execution pipeline to natively support single-machine multi-GPU workloads, reducing latency for large model evaluations.
  • Exception Handling: Built an automated utility to capture and dump system logs immediately upon encountering runtime exceptions.
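
A minimal sketch of the log-dump idea, with hypothetical names rather than GAGE's actual utility: buffer log records in memory and write them to a crash file only when an exception escapes.

```python
import contextlib
import logging
import time


class _ListHandler(logging.Handler):
    """Collects formatted log records in memory."""

    def __init__(self, sink: list):
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(self.format(record))


@contextlib.contextmanager
def dump_logs_on_exception(path_prefix: str = "crash"):
    records: list = []
    handler = _ListHandler(records)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield
    except Exception:
        dump_path = f"{path_prefix}_{int(time.time())}.log"
        with open(dump_path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(records))
        raise  # re-raise so the caller still sees the failure
    finally:
        root.removeHandler(handler)


# Usage: a crash inside the block leaves a timestamped log dump behind.
with dump_logs_on_exception("eval_run"):
    logging.getLogger(__name__).warning("starting inference")
    # run_evaluation(...)
```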

Documentation & Tests

  • Documentation: Organized and enhanced benchmark descriptions; updated granular configuration guides for vLLM and LiteLLM.
  • Test Suite: Added validation tests for vLLM flag propagation and tool-calling consistency; updated backend unit tests to reflect new error-handling behaviors.

Notes

This release focuses on strengthening the infrastructure for large-scale multi-modal evaluation and complex agent-tool interactions while providing significantly more robust diagnostic capabilities.

Game Arena V2, Tau2 Support & Benchmark Expansion

30 Jan 09:24
9006e7d

Highlights

  • Game Arena V2 launched: PettingZoo MARL support and Mahjong environment integration.
  • Tau2 benchmark integrated: Full end-to-end evaluation flow with domain-specific configs.
  • Pipeline Refactoring: AppWorld and SWE-bench aligned with official workflows and sandbox behaviors.
  • Benchmark Suite Expansion: Added 8+ new benchmarks including SimpleQA Verified, Global PIQA, and LiveCodeBench.

Game Arena V2

  • PettingZoo Infrastructure: Implemented base adapters, environment wrappers, and standardized data protocols for multi-agent interaction (the loop shape is sketched after this list).
  • Mahjong Implementation: Built a full-featured Mahjong engine including Tenhou-style rules, hidden information masking, and structured scoring.
  • Arena V2 Core: Refactored execution loop to handle both simultaneous and turn-based multi-agent flows.
  • PettingZoo Games: Added loading mechanisms and configurations to support standard multi-agent game libraries.
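
The adapters presumably wrap a turn-based loop of the shape below, shown here against the stock PettingZoo tictactoe_v3 environment with a random policy standing in for the LLM player.

```python
# Standard PettingZoo AEC loop; the random policy is a stand-in for an agent.
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # finished agents must step with None
    else:
        mask = observation["action_mask"]          # illegal moves masked out
        action = env.action_space(agent).sample(mask)
    env.step(action)

env.close()
```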

Tau2 Benchmark Integration

  • End-to-End Flow: Added loader (Hugging Face), preprocessor, runtime, judge, metrics (including pass_hat@k; an estimator sketch follows this list), and summary generator.
  • Configurations: Added new Tau2 configs for airline, retail, telecom, and mock domains plus smoke tests.
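
For reference, a sketch of a pass_hat@k estimator under the τ-bench-style definition: the probability that k i.i.d. trials of a task all succeed, estimated without bias as C(c, k) / C(n, k) from n trials with c successes. Whether GAGE uses exactly this formula is an assumption.

```python
# pass_hat@k under the assumed τ-bench-style definition.
from math import comb


def pass_hat_k(n_trials: int, n_successes: int, k: int) -> float:
    if k > n_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(n_successes, k) / comb(n_trials, k)  # comb(c, k) is 0 when c < k


# Example: 4 trials per task, averaged over tasks.
per_task = [(4, 4), (4, 3), (4, 1)]  # (trials, successes)
for k in (1, 2, 4):
    score = sum(pass_hat_k(n, c, k) for n, c in per_task) / len(per_task)
    print(f"pass_hat@{k} = {score:.3f}")
```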

Pipeline Refactoring

  • AppWorld: Aligned evaluation flow, prompts, and tests with current helper-app and template behavior.
  • SWE-bench: Refactored Pro configs and tests to align with official run scripts and sandbox runtime expectations; cleaned legacy model config.

Benchmark Expansion

  • New Benchmarks: Integrated SimpleQA Verified, Global PIQA, MRCR v2, ARC-AGI-2, ScreenSpot-Pro, CharXiv Reasoning, bizfinbench.v2, and LiveCodeBench.

Documentation & Tests

  • Documentation: Updated guides for AppWorld, SWE-bench, and Tau2 evaluation usage and best practices.
  • Test Suite: Added validation tests for multi-agent synchronization and Mahjong constraints; updated tests to match current tool routing.

Notes

This release consolidates the transition to Game Arena V2 for complex agent interactions while significantly broadening the static benchmark coverage.

Agent Evaluation Infrastructure, Sandbox Runtime & Benchmark Expansion

16 Jan 09:14
0a1e6b4

Highlights

  • Full agent evaluation infrastructure delivered: DUTAgent, AgentLoop, toolchain routing, and extensible AgentBackends.
  • Sandbox runtime system introduced: multi-driver support (docker/local/remote), runtime profiles (aio/appworld/llm/opensandbox), plus pooling and lifecycle handling.
  • AppWorld evaluation integrated end-to-end: official JSONL config, dataset preprocessor, MCP tools, judge evaluation, metrics, and summary reporting.
  • GameArena expanded with Dou Dizhu (Landlord) and Human Showdown support.

Agent Evaluation Infrastructure

  • DUTAgent and AgentLoop provide a canonical evaluation loop with toolchain routing hooks.
  • AgentBackends make backend integration extensible across models and runtimes.

Sandbox Runtime System

  • Drivers: docker, local, and remote.
  • Runtime profiles: aio, appworld, llm, and opensandbox.
  • Pooling and lifecycle management for reusable runtime instances.
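
A rough sketch of the pooling idea, with entirely hypothetical names: reuse warm sandbox instances keyed by driver and profile instead of paying startup cost per task.

```python
import contextlib
import queue
from dataclasses import dataclass, field


@dataclass
class SandboxPool:
    driver: str    # e.g. "docker", "local", "remote"
    profile: str   # e.g. "aio", "appworld", "llm", "opensandbox"
    _idle: queue.Queue = field(default_factory=queue.Queue)

    def _create(self) -> dict:
        # The real system would start a container, local process, or remote session here.
        return {"driver": self.driver, "profile": self.profile}

    @contextlib.contextmanager
    def acquire(self):
        try:
            sandbox = self._idle.get_nowait()   # reuse a warm instance if one is idle
        except queue.Empty:
            sandbox = self._create()
        try:
            yield sandbox
        finally:
            self._idle.put(sandbox)             # return it for the next task


pool = SandboxPool(driver="docker", profile="appworld")
with pool.acquire() as sbx:
    print("running task in", sbx)
```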

AppWorld Evaluation

  • Official JSONL configuration and dataset preprocessor for AppWorld datasets.
  • End-to-end run: MCP tools, judge evaluation, metrics, and summary reporting.
  • Build/export scripts and diagnostics for dataset extraction and tool-doc probing.

Benchmarks

  • Added support for AIME 2024, AIME 2025, Math500, MME, HLE, and MMLU-Pro.

Resource Provider

  • Bundle provider added to pre-locate dataset-level resources.

GameArena Expansion

  • Dou Dizhu (Landlord) integrated with full replay support and human-agent interactive evaluation.
  • Human Showdown via Action Server enables real-time human intervention.
  • Arena docs expanded with Dou Dizhu guides and updated showcase positioning.

Documentation & Tests

  • Bilingual agent evaluation guide and framework overview updates.
  • Extensive test coverage added for agent, sandbox, and AppWorld flows.

Notes

This release deepens agent evaluation and runtime foundations; APIs/configs may continue to evolve during internal validation.

Standardized Sample Protocol & GameArena MVP (Gomoku + Tic-Tac-Toe)

31 Dec 09:16

Highlights

  • Standardized Sample protocol landed: canonical Sample helpers, validation, and runtime write-back are now aligned around a single contract.
  • GameArena MVP delivered: first-class arena step with registry-based environments, parsers, schedulers, players, and visualizers.

Standardized Sample Protocol

  • Sample helpers: append_predict_result, update_eval_result, and output resolution live in src/gage_eval/evaluation/sample_envelope.py (an illustrative shape is sketched after this list).
  • Schema validation: SampleValidator in src/gage_eval/assets/datasets/validation.py enforces configurable Sample checks.
  • Dataclass representation: Sample lives in src/gage_eval/assets/datasets/sample.py for consistent in-memory structure.
  • Normalization & mapping: utilities under src/gage_eval/assets/datasets/utils/ standardize messages/choices/multimodal fields.
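
The helper names above come from the release notes; the Sample fields and signatures in this sketch are assumptions about the contract, not the actual definitions.

```python
# Illustrative shape only, not the actual Sample contract.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sample:
    sample_id: str
    messages: list[dict] = field(default_factory=list)
    predict_results: list[dict] = field(default_factory=list)
    eval_result: dict[str, Any] = field(default_factory=dict)


def append_predict_result(sample: Sample, output: str, **extra: Any) -> Sample:
    """Write a model prediction back onto the sample (runtime write-back)."""
    sample.predict_results.append({"output": output, **extra})
    return sample


def update_eval_result(sample: Sample, **scores: Any) -> Sample:
    """Merge judge/metric scores into the sample's evaluation record."""
    sample.eval_result.update(scores)
    return sample


s = Sample(sample_id="demo-001", messages=[{"role": "user", "content": "2+2?"}])
append_predict_result(s, output="4", backend="vllm")
update_eval_result(s, exact_match=1.0)
```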

GameArena MVP

  • First-class runtime lane: arena step implemented in src/gage_eval/role/adapters/arena.py.
  • Registry-driven components: environments/parsers/renderers/players/schedulers under src/gage_eval/role/arena/.
  • Built-in games:
    • Gomoku: turn-based board evaluation with optional human interaction and renderer support.
    • Tic-Tac-Toe: lightweight grid game for fast human/LLM matchups.

Notes

  • This release focuses on Sample standardization and GameArena foundations; APIs/configs may still evolve during internal validation.

v0.0.1-alpha: Initial Release of Gage Eval

23 Dec 10:31
775b364

We are excited to announce the first alpha release of Gage Eval, a scalable and extensible framework for large model evaluation. This release marks the transition of the project into internal validation, providing a robust foundation for automated benchmarking across text, multimodal, and engineering tasks.

Core Architecture

  • Step-based Orchestration: Flexible pipeline management (support -> inference -> judge -> auto_eval).
  • Role-Backend Decoupling: RoleAdapter layer allows seamless switching between local engines and remote APIs without changing logic.
  • Unified Registry: Automatic asset discovery for datasets, backends, and metrics via @registry.asset (sketched after this list).
  • High-performance Runtime: Asynchronous execution with built-in backpressure control for maximum throughput.
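
An illustrative sketch of registry-based asset discovery: the @registry.asset decorator name comes from the notes above, while the registry implementation and its arguments here are assumed.

```python
# Hypothetical registry: register assets by kind and name, look them up later.
from typing import Callable, Dict


class Registry:
    def __init__(self) -> None:
        self._assets: Dict[str, Callable] = {}

    def asset(self, name: str, kind: str):
        def decorator(obj: Callable) -> Callable:
            self._assets[f"{kind}:{name}"] = obj   # discoverable by kind + name
            return obj
        return decorator

    def get(self, kind: str, name: str) -> Callable:
        return self._assets[f"{kind}:{name}"]


registry = Registry()


@registry.asset(name="exact_match", kind="metric")
def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


print(registry.get("metric", "exact_match")("4", "4"))  # 1.0
```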

Backend Support

  • vLLM (Local Primary): Unified engine for text and multimodal models (AsyncLLMEngine).
  • LiteLLM (API Primary): One-stop access to 100+ providers (OpenAI, Anthropic, Kimi, etc.).
  • High-throughput Engines: Support for SGLang and TGI.
  • Legacy & Debug: Standard openai_http and mock-ready dummy backends.

Supported Benchmarks

  • Text: MMLU, PIQA, GPQA.
  • Multimodal: MMMU, DocVQA, MathVista.
  • Engineering: Fully reproducible SWE-bench Pro integration with Docker support.

Observability

  • Structured Tracing: Detailed event streams (events.jsonl).
  • Robust Caching: Per-sample snapshots (samples.jsonl) for easy debugging.
  • Aggregated Reports: Final summary.json with comprehensive metrics and timings.
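
The artifacts above are plain JSON/JSONL, so they can be inspected directly. The file names come from the notes; the run directory and the field names accessed below are assumptions.

```python
import json
from pathlib import Path

run_dir = Path("runs/demo_echo_run_1")  # hypothetical output directory

# Stream the structured trace without loading it all into memory.
with open(run_dir / "events.jsonl", encoding="utf-8") as fh:
    for line in fh:
        event = json.loads(line)
        print(event.get("event"), event.get("sample_id"))

# Aggregated metrics and timings.
summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
print(summary.get("metrics"))
```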

Installation & Usage

  1. Setup environment: pip install -r requirements.txt

  2. Run a demo: python run.py --config config/run_configs/demo_echo_run_1.yaml --output-dir runs

Notes

This is an alpha release. The framework is currently in the internal validation phase. APIs, Sample structures, and configuration schemas are subject to change. Feedback is highly appreciated.