Before Claude Code put a general-purpose agent in everyone's hands (I hate that people think Claude Code is only for coding!), building agents was a fairly complicated process, at least for me at the time! Claude Code commodified the *mechanics*, which gave us more time to think about what the agent does and how. It also pushed other AI labs toward terminal-based agent orchestrators (Codex, Gemini CLI, opencode, etc.) and gave Anthropic a [sweet, sweet revenue-unlocking lever: ~$1B!](https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone)
<img width="1028" height="215" alt="Screenshot 2026-01-05 at 20 09 44" src="https://github.com/user-attachments/assets/921e938a-3320-4b31-abe3-3f49397fde26" />
---
Now, coming to the part where all the talk was put into action: what was done, how it worked, and what actually didn't work! (Quick note on our setup: we used Claude Code via AWS Bedrock for enterprise compliance. One thing I observed is that some Claude Code features were occasionally missing; I'm not sure whether that was an A/B test or a consequence of going through Bedrock. Personally, I felt the $200 Max plan had really good limits before it got nerfed, but for compliance we stuck with Claude via Bedrock.)
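For reference, pointing Claude Code at Bedrock is mostly an environment-variable switch. A minimal sketch based on Claude Code's documented Bedrock flag; the region is a placeholder you'd swap for wherever your Claude models are enabled:

```shell
# Route Claude Code through AWS Bedrock instead of the Anthropic API.
# Assumes AWS credentials are already configured (e.g. `aws configure`
# or an SSO profile) and Claude models are enabled in that region.
export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION=us-east-1   # placeholder: your Bedrock region
claude                        # starts Claude Code against Bedrock
```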
## **What We Built (And What We Learned)**
### **1. The Claude.md Stack (And Why Rules Get Ignored)**
<img width="610" height="532" alt="Screenshot 2026-01-05 at 19 58 29" src="https://github.com/user-attachments/assets/8299d0c1-96eb-4b2c-8343-004789bb55f9" />
Anthropic has talked extensively about [reward hacking](https://www.anthropic.com/research/reward-hacking), where models learn to game their training objectives rather than actually solve problems. They've worked to mitigate this in Claude, but in my experience with agentic workflows, it still surfaced in subtle ways.
The context decay problem from Claude.md rules showed up elsewhere too: premature task completion. Sonnet 4 and 4.5 would claim "Task complete!" without verifying anything actually worked. No tests run, no compilation checks, no output validation.
Vague instructions made it worse. "Here's the db schema, write the gorm queries" led Opus to confidently implement patterns that didn't match our architecture. Sonnet would announce completion without checking if the code compiled.
The frustration here is real: I'd give the agent verification tools, and it would still skip them. It felt like the model learned that "say you're done" correlates with positive feedback, so it rushed there.
What helped: being explicit about verification. Don't ask just for implementation; ask for implementation plus test execution. Force the verification into the task definition itself; don't leave it optional.
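One way to make verification truly non-optional is to enforce it mechanically instead of in prose. A sketch using Claude Code's hooks feature, assuming a Go project; the test command is a placeholder for whatever your project's own verification step is:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "go test ./... 2>&1 | tail -20" }
        ]
      }
    ]
  }
}
```

Dropped into `.claude/settings.json`, this runs the test suite after every file edit, so "Task complete!" claims get checked by the harness rather than taken on the model's word.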