Add tau 2 bench #16

cemde · 2025-12-22T20:38:12Z

Description

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed in CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tag GitHub issue with this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Example:
- Support for multi-agent tracing (PR:#123)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies
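
The core/interface separation rule above could be checked mechanically. The following is a minimal sketch, not part of this PR or of the project's tooling: the helper name `find_interface_imports` is hypothetical, and it only inspects absolute imports, so relative imports such as `from ..interface import x` would need separate handling.

```python
import ast


def find_interface_imports(source: str, forbidden: str = "maseval.interface") -> list[int]:
    """Return line numbers of absolute imports that reach into `forbidden`.

    Hypothetical helper for illustration; relative imports are not checked.
    """
    offending = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. `import maseval.interface.inference`
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # e.g. `from maseval.interface import x` or `from maseval import interface`
            names = [node.module] + [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        if any(n == forbidden or n.startswith(forbidden + ".") for n in names):
            offending.append(node.lineno)
    return sorted(offending)


example = (
    "import os\n"
    "from maseval.interface.inference import anthropic\n"
    "from maseval import interface\n"
    "import maseval.core.task\n"
)
print(find_interface_imports(example))
```

Running such a check over every file under maseval/core/ in CI would turn the checklist item into an automated gate, assuming the project wants one.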

Additional Notes

github-actions · 2025-12-22T20:39:25Z

Coverage report

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
maseval/benchmark/macs
data_loader.py					255, 297
macs.py
maseval/benchmark/tau2
__init__.py
data_loader.py					84-85, 100-104, 141, 150, 163, 168, 252, 296, 338, 343, 415, 470
environment.py					84, 137, 173-182, 251-253, 292, 310-315
evaluator.py					225, 239-241, 254-255, 278, 289-290, 316, 326-328, 344, 403, 412, 524
tau2.py					326, 555, 711, 731, 733, 735, 741, 815-818
utils.py					169-170, 174-179, 204-209
maseval/benchmark/tau2/domains
base.py					74, 82, 162-170, 271-275, 287, 328, 333-342
maseval/benchmark/tau2/domains/airline
db.py
models.py
tools.py					61, 69, 77, 91-94, 108, 112, 138, 150-162, 179, 182, 184, 330-339, 399, 440, 444, 475, 480, 495, 560, 662, 666
maseval/benchmark/tau2/domains/retail
db.py
models.py
tools.py					65, 86, 104, 124, 217, 243, 311, 367-368, 416, 426, 430, 442, 542, 551, 555, 566, 577-578, 584, 626, 632-665, 746, 752
maseval/benchmark/tau2/domains/telecom
db.py
models.py
tools.py					74, 83, 87, 92, 96, 101, 110, 126, 147, 160-180, 229, 255, 277, 501, 563-616, 634-640
user_models.py					255-257
user_tools.py					45, 51, 107, 142, 145, 147, 151, 208, 302-303, 386, 461, 478, 506, 538-539, 555-563, 577-578, 583-588
maseval/core
simulator.py
task.py
user.py					426, 429-434, 444, 518-523
maseval/interface/inference
anthropic.py
google_genai.py					178-188, 193
Project Total

The report is truncated to 25 files out of 29. To see the full report, please visit the workflow summary page.

This report was generated by python-coverage-comment-action

cemde added 23 commits December 26, 2025 13:06

added files about implementation strategies and plans.

23511d4

added comment to plans

758da11

updated claude plan

9c99065

consolidated plan

efae183

initial attempt

338b2b6

added agentic user with tool use

dd7a4e6

docs: Add TESTING.md with testing plan for Tau2 benchmark

cb8499a

test(tau2): implement comprehensive testing strategy per TESTING.md

6e3bb26

docs: Remove TESTING.md after implementation

05dc4c2

formatting

8f5f6a1

fix: resolve linting, duplication, and type errors in Tau2 and AgenticUser

217a2d7

style: final ruff formatting and lint fixes

cc8fb7c

updated testing

da518ac

fixing type hinting and formatting

0d1cc23

fixed dependency issue

80ee4ca

updated agent file

c5781eb

added better coverage scripts

11567c5

moved agentic user

1a0be47

fixed testing

df4262e

added default tau2 implementation

fa937ce

added gitignore to example

c9c9332

initial default agent file

ad5e67e

cleaned up docstrings for model adapters

91de571

cemde force-pushed the add-tau-2-bench branch from 068ba6b to 91de571 Compare December 26, 2025 12:16

cemde added 3 commits December 26, 2025 13:39

fixed tau2 and agenticuser tests given new ModelAdapter chat features

782c9bf

updated default agent with verbosity level

6087c8f

fixed bugs

eb24ce1

2 participants
