Updating agent benchmarking to latest #886
Open
shreyasXplain wants to merge 15 commits into agent-benchmarking-v0.1 from
* ENG-2886 Fixed dataclass formation causing a bug
* ENG-2886 Fixed dataclass formation causing a bug review
…hitecture (#874)

* docs: refresh README positioning, diagrams, and v2 quickstart
* docs: refine README copy and update team-agent diagram
* docs: mention built-in opt-in agent memory
* Update README positioning and examples
* Refine README intro copy
* Remove OpenClaw from README
* Refine README positioning copy
* Simplify README deployment wording
* Refresh README hero and MCP marketplace section

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
The Debugger agent fails with FAILED status because run_response_generation defaults to False (set in ENG-2855). Without response generation, the backend webhook cannot return the agent output, causing a silent failure. The Debugger requires response generation to synthesise its debugging analysis, so default it to True via setdefault.

Made-with: Cursor
Co-authored-by: JP Maia <maiajp2305@gmail.com>
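The setdefault approach described in this commit can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the helper name `build_debugger_run_kwargs` is hypothetical, and only the `run_response_generation` key comes from the commit message.

```python
from typing import Any, Dict


def build_debugger_run_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: default response generation to True for the Debugger."""
    # setdefault only fills the key when the caller did not pass it,
    # so an explicit run_response_generation=False is still respected.
    kwargs.setdefault("run_response_generation", True)
    return kwargs


if __name__ == "__main__":
    print(build_debugger_run_kwargs())
    print(build_debugger_run_kwargs(run_response_generation=False))
```

Using setdefault rather than an unconditional assignment keeps the caller in control: the default changes, but an explicit opt-out is not overridden.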
* Add pre-commit CI workflow and fix two failing unit tests
  - Add .github/workflows/pre-commit.yaml to run pre-commit checks on all branches
  - Fix v1 import in v2/core.py by using sys.modules lookup instead of direct import
  - Fix test_api_key_validation to clear both TEAM_API_KEY and AIXPLAIN_API_KEY env vars
* Add pre-commit CI workflow and fix all pre-commit violations
  - Add .github/workflows/pre-commit.yaml to run checks on all branches
  - Scope ruff lint/format to aixplain/v2/ only
  - Exclude docs/ from trailing-whitespace and end-of-file-fixer
  - Fix v1 import in v2/core.py (use sys.modules instead of direct import)
  - Fix test_api_key_validation to clear both API key env vars
  - Fix trailing whitespace and end-of-file issues across the repo
* Remove duplicate pull_request trigger from pre-commit workflow
* Add coverage to CI workflow dependencies
* Set dummy TEAM_API_KEY in CI for v1 unit test collection
* Use real TEAM_API_KEY secret in pre-commit CI workflow
* Move 8 tests that make real API calls from unit to functional tests
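For readers unfamiliar with the sys.modules technique mentioned in the commits above, here is a minimal sketch. The helper name `get_loaded_module` is hypothetical, and the commit does not say which module v2/core.py looks up, so the module names below are placeholders.

```python
import sys
from types import ModuleType
from typing import Optional


def get_loaded_module(name: str) -> Optional[ModuleType]:
    """Return a module only if it has already been imported.

    Unlike an import statement, a sys.modules lookup never triggers
    the module's import-time side effects; it returns None instead.
    """
    return sys.modules.get(name)


if __name__ == "__main__":
    # "sys" is always loaded; the second name is a placeholder that is not.
    print(get_loaded_module("sys") is sys)
    print(get_loaded_module("some_module_that_was_never_imported"))
```

This pattern is useful when importing the module eagerly would fail or have side effects, for example when the import itself requires configuration such as an API key.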
* ENG-2836 Agent cloning introduced
* ENG-2836 Agent cloning functional tests
* ENG-2836 addressed feedbacks
* ENG-2836 rename clone_subagents to duplicate_subagents for consistent naming
  Backend payload key remains "cloneSubagents" as required by the API.
* update gitignore

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
#848)

* ENG-2847 fix the ActionInputsProxy to properly extract and coerce default values
* ENG-2847 minor fix

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* Add usage and asset fields to model response (V1 + V2)
  Surface token usage (prompt_tokens, completion_tokens, total_tokens) and asset info from model serving as first-class fields on model responses. Fix V2 poll path which previously dropped usage and asset from the filtered response dict for async models.
  Also fix pre-existing broken tests: remove test_action_inputs_proxy.py (imports removed ActionInputsProxy class) and fix subagents -> agents assertion in test_v2_agent_duplicate.py.
* Remove unused mock import from test_action_inputs_proxy.py
* each actions spec retrieved + attributes -> list
* removed slug fallback
* revert to dict
* fixed deleted params
* ENG-2891 Tool saving foundation
* Fix tool reconnect: send empty name to avoid "Name already exists" error
  The backend connect endpoint has a bug where it checks name uniqueness against the tool itself during reconnect (with assetId). Without name, it fails with a trim() error; with the tool's current name, it fails with "Name already exists". Sending name="" satisfies the trim() call while avoiding the uniqueness conflict. The metadata PUT handles the actual name/description updates.
  Also includes:
  - Rename parent_model_id to integration_id for clarity
  - Add integration_path convenience property
  - Fix _extract_auth_scheme to handle attributes as dict (matches backend)
  - Clear config/code after successful create/update to prevent false reconnects
  - Update unit and functional tests
  Related: ENG-2891, BUG-732
  Made-with: Cursor

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* added available action
* use cached self.actions

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* Change default model for agents to GPT-5.4
* Fix stale model name references after GPT-5.4 default LLM update
  Made-with: Cursor

---------

Co-authored-by: aix-ahmet <ahmet.gunduz@aixplain.com>
* Add usage and asset fields to model response (V1 + V2)
  Surface token usage (prompt_tokens, completion_tokens, total_tokens) and asset info from model serving as first-class fields on model responses. Fix V2 poll path which previously dropped usage and asset from the filtered response dict for async models.
  Also fix pre-existing broken tests: remove test_action_inputs_proxy.py (imports removed ActionInputsProxy class) and fix subagents -> agents assertion in test_v2_agent_duplicate.py.
* Remove unused mock import from test_action_inputs_proxy.py
…890)

* Add usage and asset fields to model response (V1 + V2)
  Surface token usage (prompt_tokens, completion_tokens, total_tokens) and asset info from model serving as first-class fields on model responses. Fix V2 poll path which previously dropped usage and asset from the filtered response dict for async models.
  Also fix pre-existing broken tests: remove test_action_inputs_proxy.py (imports removed ActionInputsProxy class) and fix subagents -> agents assertion in test_v2_agent_duplicate.py.
* Remove unused mock import from test_action_inputs_proxy.py
* ENG-2922 Fix model.run() hanging on NaN usage tokens from backend
  Several model providers (GPT-5.4, Claude, Mistral Large) return "NaN"/null for token counts in the usage block. This caused:
  1. Usage dataclass deserialization to fail (int("NaN"))
  2. sync_poll to retry the same completed response forever until timeout
  3. Sync-only model.run() to return IN_PROGRESS without polling
  Changes:
  - Make Usage fields Optional[int] with a safe decoder that handles NaN, null, strings, and floats gracefully
  - Add poll fallback in resource.poll() for completed responses that fail deserialization
  - Add polling after _run_sync_v2() for sync models that return a poll URL
  Made-with: Cursor
* ENG-2922 Show token usage in agent progress at all verbosity levels
  Display input/output/total tokens inline on each step line and aggregate totals in the completion summary. Also includes curl commands documenting backend token reporting inconsistencies.
  Made-with: Cursor
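The NaN-tolerant Usage decoding described in the ENG-2922 fix might look roughly like the sketch below. The `Usage` class name and the three field names come from the commit messages; the `safe_int` helper and the `from_dict` constructor are assumptions made for illustration and may not match the SDK's actual decoder.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, Optional


def safe_int(value: Any) -> Optional[int]:
    """Hypothetical decoder: coerce a token count to int, or None if unusable."""
    if value is None or isinstance(value, bool):
        return None
    try:
        number = float(value)  # accepts ints, floats, and numeric strings
    except (TypeError, ValueError):
        return None
    if math.isnan(number) or math.isinf(number):
        return None  # covers "NaN" strings as well as float NaN
    return int(number)


@dataclass
class Usage:
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    total_tokens: Optional[int] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Usage":
        # Assumed constructor: every field goes through the safe decoder,
        # so a malformed count never raises during deserialization.
        return cls(
            prompt_tokens=safe_int(data.get("prompt_tokens")),
            completion_tokens=safe_int(data.get("completion_tokens")),
            total_tokens=safe_int(data.get("total_tokens")),
        )


if __name__ == "__main__":
    print(Usage.from_dict({"prompt_tokens": "NaN", "completion_tokens": None, "total_tokens": 42.0}))
```

Returning None instead of raising is what lets a completed response deserialize even when the provider reports no usable token counts, which in turn avoids the retry-until-timeout behaviour listed in the commit.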
No description provided.