ML Engineer · Founder
Building agentic systems — from reasoning loops to production backends.
Incoming MLE at Robinhood (Agentic AI) · M.S. CS (AI/ML) at Duke · B.Comp. CS at NUS
Currently interested in: agent evaluation harnesses, context engineering for long-running workflows, and what it actually takes to benchmark agents that make real-world decisions over time.
VYNN AI — Agentic financial analyst platform (sole engineer, ~500 users, 50K+ LOC)
A LangGraph supervisor orchestrates 5 specialized agents for end-to-end equity research: data scraping → DCF modeling (6 sector strategies) → news intelligence → report generation, all in under 7 minutes. The hard part wasn't the LLM calls; it was making the numbers trustworthy. The recommendation engine uses a 3-layer architecture: deterministic math (RecommendationCalculator) → LLM narrative → regex-based validator that blocks publication if citation coverage drops below 95%. Built a custom 1,293-line Excel formula evaluator so the DCF workbook and downstream JSON stay perfectly consistent without requiring Excel. Nightly CI runs a golden-dataset regression suite across 100 QQQ companies and blocks deployment if valuations drift beyond a set threshold.
→ stock-analyst (agent backend) · vynnai-web (platform frontend) · api-runner (API layer)
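To make the validator layer above concrete, here is a minimal Python sketch of a regex-based citation-coverage gate. It is not VYNN's actual code: the function names, the bracketed citation format (`[1]`, `[10-K]`), and the rule that any sentence containing a digit counts as a claim are all assumptions for illustration. Only the 95% threshold comes from the description above.

```python
import re

# Assumption: citations are bracketed markers such as [1] or [10-K].
CITATION = re.compile(r"\[[^\[\]]+\]")
# Split after sentence-ending punctuation followed by whitespace,
# so decimals like 12.5 are not split.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
MIN_COVERAGE = 0.95  # threshold quoted in the project description


def citation_coverage(narrative: str) -> float:
    """Fraction of numeric claims that carry at least one citation marker."""
    sentences = [s for s in SENTENCE_SPLIT.split(narrative.strip()) if s]
    claims = []
    for sentence in sentences:
        # Remove citation markers first so "[3]" alone does not count as a number.
        bare = CITATION.sub("", sentence)
        if re.search(r"\d", bare):
            claims.append(sentence)
    if not claims:
        return 1.0  # nothing numeric to verify
    cited = sum(1 for s in claims if CITATION.search(s))
    return cited / len(claims)


def can_publish(narrative: str, min_coverage: float = MIN_COVERAGE) -> bool:
    """Gate publication on citation coverage."""
    return citation_coverage(narrative) >= min_coverage
```

A narrative with one cited and one uncited numeric sentence scores 0.5 under these rules and would be blocked. A production validator would likely need a stricter definition of a claim than "contains a digit".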
AutoCodeRover — Autonomous code repair agent · Core technology acquired by Sonar
Designed the Self-Fix Agent: when a patch fails to apply, an LLM-as-a-Judge diagnoses which pipeline stage (Context Retrieval or Patch Generation) caused the failure, generates corrective feedback, and replays from that stage, preserving upstream state via UUID-targeted responses. Also built a stateful replay mechanism so developers can inject feedback on any intermediate LLM response and trigger selective re-execution downstream. Result: 51.6% on SWE-bench Verified (up from 38.4%), with 1.8× the patch precision of the next-best open-source agent.
→ auto-code-rover (agent backend) · Jetbrains-IDE-Plugin (Kotlin, end-to-end)
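The replay idea described above can be sketched in a few lines of Python. This is a simplified illustration, not AutoCodeRover's implementation: `Record`, `run_pipeline`, and the two stub stages are hypothetical names, and the LLM judge is left out, so the caller supplies the ID of the response to replay from.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A stage maps (upstream output, optional feedback) to its own output.
Stage = Callable[[str, Optional[str]], str]


@dataclass
class Record:
    stage: str
    output: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def run_pipeline(
    stages: List[Tuple[str, Stage]],
    task: str,
    previous: Optional[List[Record]] = None,
    replay_id: Optional[str] = None,
    feedback: Optional[str] = None,
) -> List[Record]:
    """Run every stage, or replay from the record whose id is replay_id."""
    start = 0
    records: List[Record] = []
    if previous is not None and replay_id is not None:
        start = next(i for i, r in enumerate(previous) if r.id == replay_id)
        records = list(previous[:start])  # upstream records are reused as-is
    upstream = records[-1].output if records else task
    for i in range(start, len(stages)):
        name, fn = stages[i]
        # Feedback is injected only at the stage being replayed.
        output = fn(upstream, feedback if i == start else None)
        records.append(Record(name, output))
        upstream = output
    return records


# Stub stages standing in for LLM calls (hypothetical).
def retrieve(task: str, feedback: Optional[str]) -> str:
    return f"context({task})" + (f"+{feedback}" if feedback else "")


def generate(context: str, feedback: Optional[str]) -> str:
    return f"patch[{context}]" + (f"+{feedback}" if feedback else "")


STAGES = [("context_retrieval", retrieve), ("patch_generation", generate)]
```

Calling `run_pipeline(STAGES, "bug")` runs both stages. Passing the result back with the ID of the patch-generation record and a feedback string reruns only that stage, and the context-retrieval record is carried over unchanged.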
ACR JetBrains Plugin — IDE-integrated autonomous repair
Built end-to-end in Kotlin. Three things I'm most proud of: (1) GumTree 3-way AST merge — when you've edited code while the agent is patching the same file remotely, the plugin reconciles baseline → your edits → agent's patch at the AST level, not text level. (2) PSI-based context enrichment — before sending a task to ACR, the plugin extracts symbol references, cursor history (last 10 positions), and open files to narrow the agent's search scope. (3) Embedded SonarLint — runs static analysis locally, then lets you one-click send any issue to ACR for autonomous fixing.
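The decision rule behind a 3-way merge can be shown with a small Python sketch. It is a deliberate simplification and not the plugin's Kotlin or GumTree code: top-level declarations keyed by name stand in for AST nodes, `merge3` is a hypothetical name, and keeping the developer's version on conflict is an assumed policy.

```python
from typing import Dict, List, Optional, Tuple

Decls = Dict[str, str]  # declaration name -> source text (stand-in for AST nodes)


def merge3(base: Decls, ours: Decls, theirs: Decls) -> Tuple[Decls, List[str]]:
    """Merge per declaration: baseline, developer edits (ours), agent patch (theirs)."""
    merged: Decls = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b: Optional[str] = base.get(key)
        o: Optional[str] = ours.get(key)
        t: Optional[str] = theirs.get(key)
        if o == t:
            pick = o              # both sides agree (or both deleted)
        elif o == b:
            pick = t              # only the agent changed this node
        elif t == b:
            pick = o              # only the developer changed this node
        else:
            conflicts.append(key)
            pick = o              # assumed policy: prefer the developer's edit
        if pick is not None:      # None means the node was deleted
            merged[key] = pick
    return merged, conflicts
```

Working at the declaration level means an edit to one function and a patch to another in the same file merge cleanly even when their line ranges would overlap in a text diff. A real AST merge also has to handle moved and renamed nodes, which this sketch ignores.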
LUMINA — Multi-agent citation screening for medical systematic reviews (first author)
Four-agent pipeline: classifier triage → PICOS-guided Chain-of-Thought screening → LLM-as-a-Judge reviewer → self-correction agent. Evaluated across 15 SRMAs (~150K citations from BMJ, JAMA, Lancet). 98.2% sensitivity (10 of 15 at perfect 100%) with 35× fewer missed studies vs. prior baselines, at $0.007/article.
- Agent harness design — VYNN's golden-dataset regression suite and ACR's SWE-bench eval loop taught me the same lesson: the harness that catches agent regressions matters more than the agent itself. I'm interested in building eval infrastructure that can score long-running, multi-step agents where "correct" isn't a single number.
- Context engineering — Most agent failures I've debugged trace back to what the agent didn't know, not what it reasoned poorly about. PSI-based enrichment in the ACR plugin, MCP self-retrieval in VYNN, 33 externalized prompt templates — these are all different bets on the same problem: giving agents the right context at the right time.
- The gap between demo benchmarks and production trust — An agent that scores 51.6% on SWE-bench Verified still fails nearly half the time. VYNN's 3-layer recommendation validator exists because "usually right" isn't good enough for financial decisions. I'm drawn to the engineering that makes agents trustworthy enough to run unsupervised.
Last updated: Mar 2026


