Every engineer got their own code review agent. We generated `personal_agent.md` files that teammates could invoke via `/agents` in Claude Code for pre-PR reviews. These agent markdown files were generated in multiple ways; one of my colleagues, Nazeel, used his GitHub PR comments to teach the agent how he comments, the language he uses, and the things he focuses on in his PR reviews!
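As an illustration, a personal review agent file could look something like the sketch below. The structure follows the usual Claude Code agent-file shape (frontmatter with a name and description, then instructions), but every field value and focus area here is a hypothetical stand-in, not anyone's actual file:

```markdown
---
name: nazeel-review
description: Pre-PR code review in Nazeel's style. Use before opening a PR.
---

You are a code reviewer mimicking Nazeel's review style, learned from his
historical GitHub PR comments.

Focus areas (in his usual priority order):
- Error handling: every returned error is wrapped with context.
- Naming: short, intention-revealing identifiers; no package stuttering.
- Tests: new behavior ships with a table-driven test.

Tone: direct but friendly. Ask questions ("what happens if this is nil?")
rather than issuing commands.
```

The frontmatter gives Claude Code a short trigger description, and the body carries the scraped style guidance.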
**Personalization mattered more than we expected**. Generic review agents give you generic advice. Personal agents learned each engineer's patterns and preferred comment style, making reviews feel like feedback from a teammate who actually knew your code, not a linter with opinions.
---
### **4\. Building the Esper MCP Server**
I built the Esper MCP server *using* language-specific agents (Golang) + TDD.
**The workflow:** I'd write code in specific patterns, then have the agent (Codex in this case) understand and replicate the pattern across the codebase.
Teaching by example beats instruction when the pattern is complex. Writing 2-3 examples of the pattern I wanted, then letting the agent generalize, worked far better than trying to explain it in natural language. The agent broke down when patterns were too context-dependent or required domain knowledge it didn't have; at that point, the examples alone weren't enough.
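Concretely, the "teaching artifact" can be as small as one hand-written function plus a table-driven test in the exact shape the agent should replicate across the codebase. All names below are illustrative, not from the actual Esper code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDeviceName is a stand-in for the kind of small, patterned
// function I'd implement by hand first before asking the agent to
// generalize the shape to sibling functions.
func normalizeDeviceName(raw string) string {
	return strings.ToLower(strings.TrimSpace(raw))
}

// runNormalizeCases shows the table-driven test pattern the agent is
// expected to copy for every new function: a named case table, a loop,
// and a got/want comparison.
func runNormalizeCases() error {
	cases := []struct {
		name string
		in   string
		want string
	}{
		{"trims whitespace", "  Pixel-7  ", "pixel-7"},
		{"lowercases", "KIOSK-01", "kiosk-01"},
	}
	for _, c := range cases {
		if got := normalizeDeviceName(c.in); got != c.want {
			return fmt.Errorf("%s: got %q, want %q", c.name, got, c.want)
		}
	}
	return nil
}

func main() {
	if err := runNormalizeCases(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("all cases pass")
}
```

Handing the agent two or three pairs like this, then saying "do the same for the remaining functions", carried far more signal than a paragraph of prose describing the convention.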
The way I went about building the Esper MCP was a bit different from the traditional MCP setup. Due to the tenant-based nature of Esper, auth was complicated, so I relied on the token generation method we already had in place. I ended up building a small CLI tool for the MCP which handled the auth and had some other handy commands!
The MCP tools aren't a direct 1:1 mapping of Esper's APIs. Based on feedback from customer-facing teams, we identified common user workflows that required multiple API calls and consolidated them into single MCP tools. For example, detecting 'Device Blueprint Drift' previously required several API calls. We turned that entire workflow into one simple tool.
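To make the consolidation concrete, here is a minimal sketch of the drift check at the heart of such a tool. The types, field names, and policy shape are hypothetical stand-ins for the state the real tool assembles from several Esper API calls:

```go
package main

import "fmt"

// Blueprint is a hypothetical stand-in for the desired device config
// fetched from one API endpoint.
type Blueprint struct {
	Name   string
	Policy map[string]string
}

// Device is a hypothetical stand-in for per-device state fetched from
// separate API endpoints.
type Device struct {
	ID            string
	AppliedPolicy map[string]string
}

// detectDrift compares each device's applied policy against the blueprint
// and returns the IDs of devices whose settings have drifted. In the MCP
// tool, this one call replaces the multi-API workflow users ran by hand.
func detectDrift(bp Blueprint, devices []Device) []string {
	var drifted []string
	for _, d := range devices {
		for key, want := range bp.Policy {
			if d.AppliedPolicy[key] != want {
				drifted = append(drifted, d.ID)
				break
			}
		}
	}
	return drifted
}

func main() {
	bp := Blueprint{Name: "kiosk-v2", Policy: map[string]string{"kioskMode": "on"}}
	devices := []Device{
		{ID: "dev-1", AppliedPolicy: map[string]string{"kioskMode": "on"}},
		{ID: "dev-2", AppliedPolicy: map[string]string{"kioskMode": "off"}},
	}
	fmt.Println(detectDrift(bp, devices)) // prints: [dev-2]
}
```

The MCP tool wraps this comparison behind the fetches, so the LLM makes one tool call instead of orchestrating several API calls itself.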
Another aim was to keep the tool descriptions as tight as possible (to save context!) while still giving the LLM the exact info it needs to trigger the tool at the correct time. We kept iterating on this and made quite good progress.
<img width="610" height="532" alt="Screenshot 2026-01-05 at 19 58 29" src="https://github.com/user-attachments/assets/8299d0c1-96eb-4b2c-8343-004789bb55f9" />
Anthropic has talked extensively about [reward hacking](https://www.anthropic.com/research/reward-hacking) where models learn to game their training objectives rather than actually solving problems. They've worked to mitigate this in Claude, but in my experience with agentic workflows, it still surfaced in subtle ways.
The context decay problem from Claude.md rules showed up elsewhere too: premature task completion. Sonnet 4 and 4.5 would claim "Task complete!" without verifying anything actually worked. No tests run, no compilation checks, no output validation.

Vague instructions made it worse. "Here's the db schema, write the gorm queries" led Opus to confidently implement patterns that didn't match our architecture. Sonnet would announce completion without checking if the code compiled.
The frustration here is real: I'd give the agent verification tools, and it would still skip them. It felt like the model learned that "say you're done" correlates with positive feedback, so it rushed there.
**What helped:** Being explicit about verification. Don't ask just for implementation; ask for implementation plus test execution. Force the verification into the task definition itself, don't leave it optional.
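In practice that meant baking the verification steps into the prompt itself. A hypothetical task definition in this style (the path and wording are illustrative, not from the actual repo):

```text
Implement the gorm queries for the attached db schema, following the
repository pattern in internal/store/device_store.go.

Before reporting done, you MUST:
1. Run `go build ./...` and paste the output.
2. Run `go test ./internal/store/...` and paste the output.
3. If anything fails, fix it and re-run. "Done" means all checks pass.
```

Making "done" depend on pasted command output left the model much less room to declare victory early.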