A minimal, framework-agnostic specification for agent tooling primitives. Any agent framework can implement this. Any tool built to this spec works everywhere.
The agent tooling ecosystem is fragmented. Every framework defines its own tool schema, trace format, and evaluation contract. This makes it impossible to share debuggers, testing frameworks, and monitoring dashboards across frameworks. This spec defines the minimum common abstractions needed for interoperability.
Design principles:
- Minimal. Only what every framework needs. Nothing framework-specific.
- JSON-serializable. All types round-trip through JSON.
- Additive. Frameworks can extend with extra fields; consumers ignore unknown fields.
- No runtime dependency. Just a schema + reference implementation.
A ToolSpec describes a callable tool. All agent frameworks define tools; this is the common form.
```json
{
  "name": "search_web",
  "description": "Search the web and return relevant results. Use when you need current information not in your training data.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query"
      },
      "limit": {
        "type": "integer",
        "description": "Max results to return",
        "default": 5
      }
    },
    "required": ["query"]
  },
  "returns": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "url": {"type": "string"},
        "title": {"type": "string"},
        "snippet": {"type": "string"}
      }
    }
  },
  "metadata": {
    "category": "retrieval",
    "cost_tier": "medium",
    "side_effects": false
  }
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Unique identifier, snake_case |
| `description` | string | yes | Human + LLM readable. What it does and when to use it. |
| `parameters` | JSON Schema object | yes | Input schema. Same format as OpenAI function calling. |
| `returns` | JSON Schema | no | Output schema. Helps agents understand what they'll receive. |
| `metadata.category` | string | no | `retrieval` \| `compute` \| `io` \| `side_effect` \| `coordination` |
| `metadata.cost_tier` | string | no | `free` \| `cheap` \| `medium` \| `expensive` |
| `metadata.side_effects` | bool | no | `true` if calling this tool changes external state |
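The reference implementation ships a Python dataclass for this shape. A minimal sketch of what it looks like (field names follow the table above; the method names here are illustrative, not necessarily the exact agentool API), demonstrating the JSON round-trip and unknown-field-tolerance design principles:

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

@dataclass
class ToolSpec:
    """Framework-agnostic description of a callable tool."""
    name: str
    description: str
    parameters: dict          # JSON Schema for inputs
    returns: Optional[dict] = None  # JSON Schema for output
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Drop None-valued fields so the wire form stays minimal.
        d = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(d)

    @classmethod
    def from_json(cls, raw: str) -> "ToolSpec":
        data = json.loads(raw)
        # Additive principle: consumers silently ignore unknown fields.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in data.items() if k in known})

spec = ToolSpec(
    name="search_web",
    description="Search the web and return relevant results.",
    parameters={"type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]},
    metadata={"category": "retrieval", "side_effects": False},
)
roundtrip = ToolSpec.from_json(spec.to_json())
assert roundtrip == spec  # round-trips through JSON, per the design principles
```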
A ToolCall records a single invocation of a tool; a ToolResult records its outcome, linked back by `call_id`.

```json
{
  "tool_call": {
    "id": "call_abc123",
    "name": "search_web",
    "arguments": {"query": "agent debugging tools 2026", "limit": 3},
    "timestamp": "2026-03-01T10:00:00Z"
  },
  "tool_result": {
    "call_id": "call_abc123",
    "output": [...],
    "error": null,
    "duration_ms": 1240,
    "cost_tokens": 0
  }
}
```

A ReasoningTrace is a structured record of one agent execution. Frameworks emit traces; debuggers, monitors, and evaluators consume them.
```json
{
  "trace_id": "tr_7f3a9b",
  "agent_id": "my-research-agent",
  "session_id": "sess_abc",
  "started_at": "2026-03-01T10:00:00Z",
  "finished_at": "2026-03-01T10:00:15Z",
  "input": "What are the best agent debugging tools available today?",
  "output": "Based on my research, the leading agent debugging tools are...",
  "status": "success",
  "steps": [
    {
      "step_id": "step_001",
      "type": "reasoning",
      "content": "I need to search for current information about agent debugging tools.",
      "context_tokens": 1200,
      "timestamp": "2026-03-01T10:00:01Z"
    },
    {
      "step_id": "step_002",
      "type": "tool_call",
      "tool_call": {
        "id": "call_abc123",
        "name": "search_web",
        "arguments": {"query": "agent debugging tools 2026"}
      },
      "timestamp": "2026-03-01T10:00:02Z"
    },
    {
      "step_id": "step_003",
      "type": "tool_result",
      "tool_result": {
        "call_id": "call_abc123",
        "output": [...],
        "duration_ms": 1240
      },
      "timestamp": "2026-03-01T10:00:03Z"
    },
    {
      "step_id": "step_004",
      "type": "reasoning",
      "content": "The search returned 3 results. I'll synthesize them into an answer.",
      "context_tokens": 2800,
      "timestamp": "2026-03-01T10:00:04Z"
    }
  ],
  "metrics": {
    "total_tokens": 3200,
    "total_cost_usd": 0.0032,
    "total_duration_ms": 15000,
    "step_count": 4,
    "tool_call_count": 1
  },
  "metadata": {}
}
```

| Type | Fields | Description |
|---|---|---|
| `reasoning` | `content`, `context_tokens` | LLM reasoning / internal monologue |
| `tool_call` | `tool_call` | Agent is calling a tool |
| `tool_result` | `tool_result` | Result received from tool |
| `handoff` | `to_agent`, `message` | Handing off to a sub-agent |
| `memory_read` | `query`, `results` | Reading from memory/retrieval |
| `memory_write` | `content`, `key` | Writing to memory |

Valid `status` values: `success` \| `error` \| `loop_detected` \| `context_overflow` \| `timeout` \| `refusal`
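Because every field above is plain JSON, a trace consumer needs only dict access to work across frameworks. A sketch (the function name is illustrative, not part of the spec) that walks the steps, pairs each tool call with its result by `call_id`, and recomputes the counts found in `metrics`:

```python
from typing import Any

def summarize_trace(trace: dict) -> dict:
    """Walk a ReasoningTrace's steps; pair calls with results, recount steps."""
    calls: dict = {}   # call_id -> tool_call payload, awaiting a result
    paired = []        # (tool_call, tool_result) tuples
    tool_call_count = 0
    for step in trace.get("steps", []):
        kind = step.get("type")
        if kind == "tool_call":
            tool_call_count += 1
            calls[step["tool_call"]["id"]] = step["tool_call"]
        elif kind == "tool_result":
            result = step["tool_result"]
            call = calls.get(result["call_id"])
            if call is not None:
                paired.append((call, result))
    return {
        "step_count": len(trace.get("steps", [])),
        "tool_call_count": tool_call_count,
        "unmatched_calls": tool_call_count - len(paired),
        "pairs": paired,
    }
```

On the example trace above this yields `step_count` 4 and `tool_call_count` 1, matching the `metrics` block, with one matched call/result pair.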
An EvalCase is a test case for an agent. An EvalResult is what a judge produces. The key design choice: assertions are properties, not exact matches, because agents are non-deterministic.

```json
{
  "eval_id": "eval_001",
  "case_id": "case_web_search_001",
  "description": "Agent should search for and summarize recent agent tooling news",
  "input": "What new agent debugging tools came out this year?",
  "context": {},
  "assertions": [
    {
      "type": "contains_tool_call",
      "tool_name": "search_web",
      "description": "Must call search_web at least once"
    },
    {
      "type": "output_property",
      "property": "mentions at least 2 specific tool names",
      "description": "Response should name concrete tools, not just describe categories"
    },
    {
      "type": "output_property",
      "property": "response length is between 100 and 500 words",
      "description": "Appropriate length — not too brief, not too long"
    },
    {
      "type": "no_hallucination",
      "description": "Claims should be supported by retrieved content"
    }
  ],
  "tags": ["retrieval", "synthesis"],
  "difficulty": "medium"
}
```

A corresponding EvalResult:

```json
{
  "eval_id": "eval_001",
  "case_id": "case_web_search_001",
  "trace_id": "tr_7f3a9b",
  "passed": true,
  "assertion_results": [
    {"assertion_type": "contains_tool_call", "passed": true, "score": 1.0, "reason": "search_web called at step_002"},
    {"assertion_type": "output_property", "passed": true, "score": 0.9, "reason": "Named LangSmith, Weights & Biases, and Braintrust"},
    {"assertion_type": "output_property", "passed": true, "score": 1.0, "reason": "Response is 287 words"},
    {"assertion_type": "no_hallucination", "passed": true, "score": 0.8, "reason": "All claims supported by search results"}
  ],
  "composite_score": 0.925,
  "judge_model": "gpt-4o-mini",
  "evaluated_at": "2026-03-01T10:01:00Z"
}
```

| Type | Description |
|---|---|
| `output_property` | LLM-judged property of the output (most flexible) |
| `contains_tool_call` | Tool was called (optionally: with specific args) |
| `tool_call_count` | Number of tool calls within min/max range |
| `no_hallucination` | Output claims are grounded in retrieved context |
| `response_format` | Output matches a JSON schema |
| `context_efficiency` | Token usage is within acceptable bounds |
| `latency` | Duration within acceptable range |
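Some assertion types (`contains_tool_call`, `tool_call_count`, `latency`) are deterministic and need no LLM judge. A sketch of a checker for two of them — note the `max_ms` field is an assumption for illustration, since the spec says only "duration within acceptable range" without naming the threshold fields:

```python
from typing import Any

def check_assertion(assertion: dict, trace: dict) -> dict:
    """Score one deterministic EvalCase assertion against a ReasoningTrace."""
    kind = assertion["type"]
    if kind == "contains_tool_call":
        hits = [s["step_id"] for s in trace["steps"]
                if s["type"] == "tool_call"
                and s["tool_call"]["name"] == assertion["tool_name"]]
        passed = bool(hits)
        reason = (f"{assertion['tool_name']} called at {hits[0]}"
                  if hits else f"{assertion['tool_name']} never called")
    elif kind == "latency":
        ms = trace["metrics"]["total_duration_ms"]
        passed = ms <= assertion["max_ms"]  # "max_ms" is a hypothetical field
        reason = f"trace took {ms} ms"
    else:
        raise ValueError(f"assertion type {kind!r} requires an LLM judge")
    return {"assertion_type": kind, "passed": passed,
            "score": 1.0 if passed else 0.0, "reason": reason}
```

One natural way to derive `composite_score` is the mean of the per-assertion scores: in the EvalResult example above, 0.925 is the mean of 1.0, 0.9, 1.0, and 0.8.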
A ToolRegistry is a discoverable catalog of tools. Agents can query it to find available tools.
```json
{
  "registry_id": "agora-tools-v1",
  "version": "0.1.0",
  "updated_at": "2026-03-01T00:00:00Z",
  "tools": [
    {
      "spec": { ... },
      "endpoint": "https://tools.example.com/search_web",
      "auth": "bearer",
      "provider": "example-corp",
      "tags": ["retrieval", "web"]
    }
  ]
}
```

| Component | Status |
|---|---|
| ToolSpec (Python dataclass) | ✅ Reference implementation in agentool/ |
| ReasoningTrace (Python dataclass) | ✅ Reference implementation in agentool/ |
| EvalCase + EvalResult | ✅ Reference implementation in agentool/ |
| JSON Schema files | ✅ schemas/ directory |
| ToolRegistry | 🔜 Planned for v0.2 |
| OpenTelemetry exporter | 🔜 Planned for v0.3 |
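Although ToolRegistry is still planned, querying one is plain dict filtering. A sketch of client-side discovery over the ToolRegistry JSON shape shown above, filtering by tag and by the ToolSpec `cost_tier` metadata (the function name and defaults are illustrative):

```python
from typing import Any, Optional

def find_tools(registry: dict, tag: Optional[str] = None,
               max_cost_tier: str = "expensive") -> list:
    """Return registry entries matching a tag, at or below a cost tier."""
    tiers = ["free", "cheap", "medium", "expensive"]  # cheapest to priciest
    limit = tiers.index(max_cost_tier)
    out = []
    for entry in registry["tools"]:
        if tag is not None and tag not in entry.get("tags", []):
            continue
        # Missing cost_tier metadata is treated as "free" here by assumption.
        tier = entry["spec"].get("metadata", {}).get("cost_tier", "free")
        if tiers.index(tier) <= limit:
            out.append(entry)
    return out
```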
This spec is designed to be compatible with:
- OpenAI function calling tool format (ToolSpec.parameters field is identical)
- OpenTelemetry semantic conventions for LLM (trace steps map to span events)
- Anthropic tool use format (minor field renaming)
Frameworks that implement this spec can interoperate with any tooling built to it, whether the underlying framework is LangChain, LlamaIndex, Semantic Kernel, CrewAI, or custom.
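To illustrate the compatibility claims above, a sketch converting a ToolSpec dict into the two vendor formats. The target layouts are the publicly documented ones (OpenAI wraps the schema under `function.parameters`; Anthropic renames it to `input_schema`); the wrapper functions themselves are hypothetical:

```python
from typing import Any

def to_openai(spec: dict) -> dict:
    # OpenAI function-calling tools entry: parameters pass through unchanged.
    return {"type": "function",
            "function": {"name": spec["name"],
                         "description": spec["description"],
                         "parameters": spec["parameters"]}}

def to_anthropic(spec: dict) -> dict:
    # Anthropic tool use: same JSON Schema, renamed to input_schema.
    return {"name": spec["name"],
            "description": spec["description"],
            "input_schema": spec["parameters"]}
```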