The wrong question is:
"How do I build an agent that can do everything?"
The better question is:
"What is the smallest specialist I can build that is meaningfully better than a generalist for one important workflow?"
That is the frame for this workshop.
Tools like Claude Code and Codex are useful because they are broad generalists. They can do many things reasonably well.
That breadth is also the limitation.
A generalist usually lacks:
- Deep context for your exact domain
- Strong judgment for one repeated workflow
- Tight integration with your systems
- Clear evaluation criteria for one job
- Reliability within a narrow operating boundary
That is why we are building specialists.
A specialist agent should be better than a generalist in at least one important way:
- Better domain context
- Better workflow judgment
- Better integration with tools and systems
- Better consistency
- Better evaluation and feedback loops
If your specialist is not better at a specific job, it is not really a specialist. It is just a narrower demo.
When designing an agent, work through these questions:
If you cannot describe the job in one or two sentences, it is probably too broad.
Good examples:
- Triage inbound support tickets for a single product
- Draft research briefs for account executives
- Review pull requests in one repository against a fixed checklist
- Watch a workflow, detect failures, and escalate with context
Bad examples:
- Automate engineering
- Help with company operations
- Be our AI employee
Some problems need automation. Some just need a better UI, a script, or a search tool.
An agent makes sense when the workflow requires some combination of:
- Reasoning over messy input
- Tool use
- Multi-step decisions
- Interaction with humans
- Ongoing execution across a workflow
If the task is deterministic end to end, do not pretend it needs an agent.
Deciding which responsibilities belong to software and which belong to the model is one of the most important design decisions.
Use software for:
- Validation
- Authorization
- Data access
- State management
- Business rules
- Side effects
- Logging and auditing
Use the model for:
- Interpreting ambiguity
- Classification
- Summarization
- Extracting structure from messy inputs
- Generating drafts
- Choosing among bounded actions
If the model is controlling basic system behavior, the design is usually weak.
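The split above can be sketched in a few lines. This is a minimal, hypothetical example, not a real implementation: `call_model` is a keyword-based stand-in for an LLM classification call, and the ticket fields and queue names are invented for illustration. The point is the shape: software validates, applies business rules, and checks the model's output; the model only interprets messy input and chooses among bounded labels.

```python
# Sketch: software owns control flow and safety; the model only classifies.
# All names (fields, queues, labels) are illustrative assumptions.

ALLOWED_LABELS = {"billing", "bug", "feature_request"}

def call_model(ticket_text: str) -> str:
    # Placeholder for an LLM classification call (keyword demo only).
    text = ticket_text.lower()
    if "charge" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "feature_request"

def route_ticket(ticket: dict) -> str:
    # Software: validation
    if not ticket.get("body"):
        raise ValueError("empty ticket")
    # Software: business rule, not model judgment
    if ticket.get("customer_tier") == "enterprise":
        return "human_queue"
    # Model: interpret messy input, choose among bounded labels
    label = call_model(ticket["body"])
    # Software: validate the model's output before acting on it
    if label not in ALLOWED_LABELS:
        return "human_queue"  # fail closed
    return f"queue_{label}"
```

Note that the model's answer is treated as untrusted input: an unexpected label routes to a human rather than to a side effect.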
Every agent should have explicit answers to these questions:
- What can it do automatically?
- What requires approval?
- What should make it stop?
- What should it escalate?
- How will errors be detected?
Agents become useful when autonomy is bounded.
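One way to make those answers explicit is a policy table that every proposed action must pass through. This is a sketch under assumed action names; the actions and policies here are invented, and a real system would hold this table in configuration with audit logging around it.

```python
# Sketch of explicit autonomy boundaries: every action the agent can
# propose maps to a policy. Action names are illustrative assumptions.

POLICY = {
    "draft_reply":   "auto",       # safe and reversible: no approval needed
    "tag_ticket":    "auto",
    "issue_refund":  "approval",   # risky: requires a human sign-off
    "close_account": "forbidden",  # never available to the agent
}

def decide(action: str) -> str:
    policy = POLICY.get(action)
    if policy is None:
        return "escalate"  # unknown action: stop and ask, never guess
    return policy
```

The default matters most: anything not explicitly listed escalates rather than executes.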
To evaluate an agent, you need something more concrete than "it feels smart."
Use measures such as:
- Accuracy
- Completion rate
- Time saved
- Reduction in manual effort
- Escalation quality
- Failure rate on risky actions
If you cannot evaluate it, you cannot improve it.
One useful way to think about agents is through the five levels of agents: a progression from simple tool-using systems to broader agentic systems.
The practical lesson is not that higher levels are always better. It is that you should start at the lowest level that solves the problem and move up only when the workflow genuinely requires it.
The first level is a single agent with a clear prompt and access to a small set of tools. This is enough for more use cases than most teams think.
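A first-level agent can be sketched as a short loop: the model picks one bounded action per step, software executes it and enforces a step budget. Everything here is a hypothetical stand-in; `choose_action` replaces a real LLM call, and the tool names are invented.

```python
# Minimal sketch of a single agent: one prompt, a small fixed toolset,
# software owning the loop. `choose_action` stands in for the model.

TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def choose_action(prompt: str, observation):
    # Placeholder for an LLM that picks one bounded action per step.
    if observation is None:
        return ("lookup_order", "A-123")
    return ("finish", f"Order status: {observation['status']}")

def run(prompt: str, max_steps: int = 5):
    observation = None
    for _ in range(max_steps):
        tool, arg = choose_action(prompt, observation)
        if tool == "finish":
            return arg
        observation = TOOLS[tool](arg)  # software executes the tool
    raise RuntimeError("step budget exhausted")  # bounded autonomy
```

Even at this level, the loop, the toolset, and the step budget live in software, not in the prompt.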
The next level adds retrieval or persistent storage, for when the agent needs stable context outside the prompt window.
The level after that adds memory, for when continuity across interactions genuinely matters.
Then come multi-agent systems. Only move here when the problem decomposes cleanly into distinct specialties and handoffs. If one agent is still unclear, multiple agents usually make the system harder to reason about.
The final level is broader workflow automation, with monitoring, recovery, escalation, and operational controls around it. Most teams should not start here.
Taken together, the five levels are useful because they give people a way to reason about complexity without jumping straight to the most elaborate architecture in the room.
- Start narrower than you want to
- Build a specialist, not a personality
- Use the five levels of agents as a complexity ladder, not as a maturity contest
- Keep the model inside clear operational boundaries
- Let software own control, safety, and state
- Make human approval part of the system where risk is real
- Do not jump to multi-agent designs too early
- Evaluate the agent on usefulness, not novelty
The point is not to build the most ambitious system in the room.
The point is to build the most credible specialist: something narrow, useful, testable, and worth trusting for a real task.