A minimal, framework-agnostic specification for agent tooling primitives.
Inspired by The Tooling Desert — a call for shared standards so the agent ecosystem stops reinventing the wheel.
Every agent framework defines its own tool schema, trace format, and evaluation contract. This makes debuggers, testing frameworks, and monitoring tools framework-specific. An evaluator built for LangChain doesn't work with LlamaIndex. A tracer for Semantic Kernel doesn't work with CrewAI.
Three common abstractions + a reference Python implementation:
| Primitive | What it is |
|---|---|
| ToolSpec | Framework-agnostic tool definition. Exports to OpenAI and Anthropic formats. |
| ReasoningTrace | Structured execution record. Emitted by agents, consumed by debuggers and monitors. |
| EvalCase / EvalResult | Property-based evaluation. Handles non-determinism — assertions are properties, not exact matches. |
No dependencies beyond Python 3.10+. Just copy the agentool/ directory into your project, or install:
```bash
pip install agentool  # when published — for now, copy agentool/
```

Define a tool once, export it anywhere:

```python
from agentool import ToolSpec

spec = ToolSpec(
    name="search_web",
    description="Search the web. Use when you need current information not in your training data.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "limit": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
    metadata={"cost_tier": "medium", "side_effects": False},
)

# Export to any framework
spec.to_openai()     # OpenAI function calling format
spec.to_anthropic()  # Anthropic tool use format
spec.to_dict()       # Canonical spec format
```
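The canonical-to-OpenAI export is essentially a re-nesting of the same fields. A rough sketch of the idea — not agentool's actual implementation — where the envelope follows OpenAI's published function-calling format and `parameters` passes through unchanged as JSON Schema:

```python
def to_openai(spec: dict) -> dict:
    # Wrap the canonical fields in OpenAI's function-calling envelope.
    # "parameters" is already JSON Schema, so it passes through as-is.
    return {
        "type": "function",
        "function": {
            "name": spec["name"],
            "description": spec["description"],
            "parameters": spec["parameters"],
        },
    }

canonical = {
    "name": "search_web",
    "description": "Search the web.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
print(to_openai(canonical)["function"]["name"])  # search_web
```

Because the canonical form keeps `parameters` as plain JSON Schema, exporters for other frameworks stay similarly thin.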
Record what an agent did in a standard shape:

```python
from agentool import ReasoningTrace, TraceStatus
from agentool.tool import ToolCall, ToolResult

trace = ReasoningTrace(agent_id="my-research-agent", input="What debugging tools exist for agents?")
trace.add_reasoning("I need to search for current agent debugging tools.")

call = ToolCall(name="search_web", arguments={"query": "agent debugging tools 2026"})
trace.add_tool_call(call)
trace.add_tool_result(ToolResult(call_id=call.id, output=[...], duration_ms=1240))

trace.add_reasoning("Found 3 results. Synthesizing an answer.")
trace.finish(output="The main tools are LangSmith, Braintrust, and...", status=TraceStatus.SUCCESS)

print(trace.to_dict())     # Full JSON-serializable trace
print(trace.has_loop())    # Loop detection
print(trace.tool_calls())  # All tool calls
```
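A shared trace format is what makes helpers like `has_loop()` portable across frameworks. The spec doesn't fix the detection algorithm; one plausible heuristic, sketched here purely as an assumption (flag repeated identical name/arguments pairs), is:

```python
import json
from collections import Counter

def has_loop(tool_calls: list[dict], threshold: int = 3) -> bool:
    # Count (tool name, canonicalized arguments) pairs; call it a loop
    # when any identical call repeats `threshold` or more times.
    counts = Counter(
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    )
    return any(n >= threshold for n in counts.values())

calls = [{"name": "search_web", "arguments": {"query": "agent tools"}}] * 3
print(has_loop(calls))  # True
```

Canonicalizing arguments with `sort_keys=True` means two calls with the same arguments in a different key order still count as identical.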
Describe expected behavior as properties, then check a trace against them:

```python
from agentool import EvalCase, Assertion
from agentool.eval import AssertionType, evaluate, evaluate_structural

case = EvalCase(
    case_id="search_and_summarize_001",
    description="Agent should search and cite specific tool names",
    input="What new agent debugging tools came out this year?",
    assertions=[
        Assertion(
            type=AssertionType.CONTAINS_TOOL_CALL,
            tool_name="search_web",
            description="Must search the web",
        ),
        Assertion(
            type=AssertionType.TOOL_CALL_COUNT,
            min_calls=1, max_calls=3,
            description="Shouldn't over-call tools",
        ),
        Assertion(
            type=AssertionType.OUTPUT_PROPERTY,
            property="mentions at least 2 specific tool names",
            description="Should cite concrete tools, not just categories",
        ),
        Assertion(
            type=AssertionType.LATENCY,
            max_ms=30000,
            description="Should respond within 30 seconds",
        ),
    ],
    tags=["retrieval", "synthesis"],
)

# Evaluate without an LLM judge (structural assertions only):
structural_results = evaluate_structural(case, trace)

# Evaluate with an LLM judge (all assertions):
def my_judge(output, context, property_desc):
    # Call your LLM here — return (passed, score, reason)
    ...

result = evaluate(case, trace, judge_fn=my_judge, judge_model="gpt-4o-mini")
print(result.passed, result.composite_score)
print(result.failed_assertions)
```

Full specification in SPEC.md. Covers:
- ToolSpec JSON schema
- ReasoningTrace format and step types
- EvalCase assertion types
- ToolRegistry format (v0.2)
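The `judge_fn` in the evaluation example above can be anything matching the `(output, context, property_desc) -> (passed, score, reason)` contract. A toy keyword-matching stand-in for offline tests — `keyword_judge` is hypothetical, not part of agentool — might look like:

```python
def keyword_judge(output: str, context: str, property_desc: str):
    # Toy judge: score by how many longer words from the property
    # description appear in the output. A real judge calls an LLM.
    keywords = [w for w in property_desc.lower().split() if len(w) > 4]
    hits = [w for w in keywords if w in output.lower()]
    score = len(hits) / len(keywords) if keywords else 0.0
    return score > 0.0, score, f"matched {hits}"
```

Swapping a stub like this in for `my_judge` lets `evaluate()` run in CI without network access, at the cost of much weaker `OUTPUT_PROPERTY` checks.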
| Format | ToolSpec | Trace |
|---|---|---|
| OpenAI function calling | to_openai() / from_openai() | — |
| Anthropic tool use | to_anthropic() / from_anthropic() | — |
| OpenTelemetry LLM spans | — | planned v0.3 |
| LangSmith traces | — | compatible (step types map to span events) |
This spec is intentionally minimal. If you're building agent tooling and want to target this spec, open an issue or PR. The goal is convergence, not lock-in.