
Updating agent benchmarking to latest #886

Open
shreyasXplain wants to merge 15 commits into agent-benchmarking-v0.1 from development

Conversation

@shreyasXplain
Collaborator

No description provided.

kadirpekel and others added 15 commits March 19, 2026 21:39
* ENG-2886 Fixed dataclass formation causing a bug

* ENG-2886 Addressed review feedback on the dataclass formation fix
…hitecture (#874)

* docs: refresh README positioning, diagrams, and v2 quickstart

* docs: refine README copy and update team-agent diagram

* docs: mention built-in opt-in agent memory

* Update README positioning and examples

* Refine README intro copy

* Remove OpenClaw from README

* Refine README positioning copy

* Simplify README deployment wording

* Refresh README hero and MCP marketplace section

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
The Debugger agent fails with FAILED status because
run_response_generation defaults to False (set in ENG-2855).
Without response generation, the backend webhook cannot return
the agent output, causing a silent failure.

The Debugger requires response generation to synthesise its
debugging analysis, so default it to True via setdefault.

Made-with: Cursor

Co-authored-by: JP Maia <maiajp2305@gmail.com>
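The `setdefault` fix described above can be sketched as follows. This is a minimal illustration, assuming the option travels in a plain payload dict; the helper name and dict shape are hypothetical, not the SDK's actual internals.

```python
# Hypothetical sketch: default run_response_generation to True so the
# backend webhook can return the Debugger agent's output.
def build_debugger_payload(options: dict) -> dict:
    payload = dict(options)
    # setdefault only fills the key when the caller did not set it,
    # so an explicit False passed by the caller is still respected.
    payload.setdefault("run_response_generation", True)
    return payload
```

This preserves opt-out behavior: callers that say nothing get `True`, while an explicit `False` survives untouched.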
* Add pre-commit CI workflow and fix two failing unit tests

- Add .github/workflows/pre-commit.yaml to run pre-commit checks on all branches
- Fix v1 import in v2/core.py by using sys.modules lookup instead of direct import
- Fix test_api_key_validation to clear both TEAM_API_KEY and AIXPLAIN_API_KEY env vars

* Add pre-commit CI workflow and fix all pre-commit violations

- Add .github/workflows/pre-commit.yaml to run checks on all branches
- Scope ruff lint/format to aixplain/v2/ only
- Exclude docs/ from trailing-whitespace and end-of-file-fixer
- Fix v1 import in v2/core.py (use sys.modules instead of direct import)
- Fix test_api_key_validation to clear both API key env vars
- Fix trailing whitespace and end-of-file issues across the repo
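The `sys.modules` lookup mentioned in the bullets above follows a common pattern for avoiding a direct (and potentially circular) import. A minimal sketch, assuming the goal is to use the v1 package only when it has already been imported; the helper name is illustrative:

```python
import sys

def get_module_if_loaded(name: str):
    """Return the named module only if it is already imported, else None.

    Unlike a direct import, this never triggers module initialization,
    which sidesteps circular-import problems between v1 and v2 code.
    """
    return sys.modules.get(name)
```

For example, `get_module_if_loaded("aixplain")` would return the v1 package when the user has imported it, and `None` otherwise.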

* Remove duplicate pull_request trigger from pre-commit workflow

* Add coverage to CI workflow dependencies

* Set dummy TEAM_API_KEY in CI for v1 unit test collection

* Use real TEAM_API_KEY secret in pre-commit CI workflow

* Move 8 tests that make real API calls from unit to functional tests
* ENG-2836 Agent cloning introduced

* ENG-2836 Agent cloning functional tests

* ENG-2836 addressed review feedback

* ENG-2836 rename clone_subagents to duplicate_subagents for consistent naming

Backend payload key remains "cloneSubagents" as required by the API.
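The rename keeps the Python-facing parameter consistent while leaving the wire format untouched. A minimal sketch of that mapping, assuming a simple serialization helper (the function name is hypothetical; only the `"cloneSubagents"` key comes from the commit message):

```python
# The public kwarg uses the SDK's naming (duplicate_subagents), but the
# backend API still expects the camelCase key "cloneSubagents".
def build_clone_payload(duplicate_subagents: bool = True) -> dict:
    return {"cloneSubagents": duplicate_subagents}
```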

* update gitignore

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
#848)

* ENG-2847 fix the ActionInputsProxy to properly extract and coerce default values

* ENG-2847 minor fix

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
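Extracting and coercing default values, as the ENG-2847 fix describes, might look roughly like this. The spec field names (`"type"`, `"default"`) and the type vocabulary are assumptions for illustration, not the SDK's actual schema:

```python
def _to_bool(value) -> bool:
    # String booleans from a backend typically arrive as "true"/"false";
    # plain bool() would turn "false" into True, so compare explicitly.
    if isinstance(value, str):
        return value.strip().lower() in ("true", "1", "yes")
    return bool(value)

_COERCERS = {"integer": int, "number": float, "boolean": _to_bool, "string": str}

def coerce_default(spec: dict):
    """Return the spec's default coerced to its declared type, or None."""
    if "default" not in spec:
        return None
    coerce = _COERCERS.get(spec.get("type"))
    value = spec["default"]
    return coerce(value) if coerce else value
```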
* Add usage and asset fields to model response (V1 + V2)

Surface token usage (prompt_tokens, completion_tokens, total_tokens)
and asset info from model serving as first-class fields on model
responses. Fix V2 poll path which previously dropped usage and asset
from the filtered response dict for async models.

Also fix pre-existing broken tests: remove test_action_inputs_proxy.py
(imports removed ActionInputsProxy class) and fix subagents -> agents
assertion in test_v2_agent_duplicate.py.

* Remove unused mock import from test_action_inputs_proxy.py
* each actions spec retrieved + attributes -> list

* removed slug fallback

* revert to dict

* fixed deleted params
* ENG-2891 Tool saving foundation

* Fix tool reconnect: send empty name to avoid "Name already exists" error

The backend connect endpoint has a bug where it checks name uniqueness
against the tool itself during reconnect (with assetId). Without name,
it fails with a trim() error; with the tool's current name, it fails
with "Name already exists". Sending name="" satisfies the trim() call
while avoiding the uniqueness conflict. The metadata PUT handles the
actual name/description updates.

Also includes:
- Rename parent_model_id to integration_id for clarity
- Add integration_path convenience property
- Fix _extract_auth_scheme to handle attributes as dict (matches backend)
- Clear config/code after successful create/update to prevent false reconnects
- Update unit and functional tests

Related: ENG-2891, BUG-732
Made-with: Cursor
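The reconnect workaround described above can be sketched as follows; the payload field names are illustrative assumptions, with only the empty-name trick taken from the commit message:

```python
# Sketch of the reconnect workaround: send an empty name so the backend's
# trim() call succeeds without tripping the name-uniqueness check against
# the tool itself. Actual name/description changes go through the
# separate metadata PUT.
def build_reconnect_payload(asset_id: str) -> dict:
    return {
        "assetId": asset_id,
        "name": "",  # empty string avoids "Name already exists"
    }
```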

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* added available action

* use cached self.actions

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* Change default model for agents to GPT-5.4

* Fix stale model name references after GPT-5.4 default LLM update

Made-with: Cursor

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
…890)

* Add usage and asset fields to model response (V1 + V2)

Surface token usage (prompt_tokens, completion_tokens, total_tokens)
and asset info from model serving as first-class fields on model
responses. Fix V2 poll path which previously dropped usage and asset
from the filtered response dict for async models.

Also fix pre-existing broken tests: remove test_action_inputs_proxy.py
(imports removed ActionInputsProxy class) and fix subagents -> agents
assertion in test_v2_agent_duplicate.py.

* Remove unused mock import from test_action_inputs_proxy.py

* ENG-2922 Fix model.run() hanging on NaN usage tokens from backend

Several model providers (GPT-5.4, Claude, Mistral Large) return
"NaN"/null for token counts in the usage block. This caused:
1. Usage dataclass deserialization to fail (int("NaN"))
2. sync_poll to retry the same completed response forever until timeout
3. Sync-only model.run() to return IN_PROGRESS without polling

Changes:
- Make Usage fields Optional[int] with a safe decoder that handles
  NaN, null, strings, and floats gracefully
- Add poll fallback in resource.poll() for completed responses that
  fail deserialization
- Add polling after _run_sync_v2() for sync models that return a
  poll URL

Made-with: Cursor

* ENG-2922 Show token usage in agent progress at all verbosity levels

Display input/output/total tokens inline on each step line and
aggregate totals in the completion summary. Also includes curl
commands documenting backend token reporting inconsistencies.

Made-with: Cursor
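Aggregating per-step token counts into a completion summary, as described above, can be sketched like this; the step-record shape and key names are assumptions for illustration:

```python
# Sum input/output/total tokens across agent steps, tolerating steps
# whose counts are missing or None (see the backend inconsistencies
# noted in the commit message).
def summarize_usage(steps: list) -> dict:
    totals = {"input": 0, "output": 0, "total": 0}
    for step in steps:
        for key in totals:
            totals[key] += step.get(key) or 0
    return totals
```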

7 participants