feat: Add Claude Agent SDK with Dynamic Skills on AgentCore Runtime #181
Implements S3-based dynamic skill loading system for AWS AgentCore Runtime that enables zero-downtime skill updates without container rebuilds.

Key Features:
- S3-based centralized skill repository
- Runtime skill loading via startup orchestration
- Native Claude Agent SDK integration
- Production-ready Docker containerization
- Comprehensive documentation and setup guides

This addresses the limitation in existing AgentCore examples that require container rebuilds for skill changes, providing a scalable solution for enterprise AI agent deployments.

Fixes aws-samples#180
- Remove .bedrock_agentcore.yaml (contains personal account details)
- Remove auto-generated AgentCore Dockerfile
- Keep .bedrock_agentcore.yaml.template with proper templating
- All implementation files now use environment variables only

Ready for aws-samples/anthropic-on-aws community submission.

Fixes aws-samples#180
…ILL.md, emphasize AgentCore persistent storage workaround
…h verified AWS documentation
schuettc left a comment
Solid concept with a few bugs and documentation adjustments needed
Hi @shekharprateek, thanks for putting this together. The dynamic skill loading from S3 pattern is a useful addition -- we tested it end-to-end against Bedrock and AgentCore and the core architecture works well.
We do have some findings that need to be addressed before we can merge. I've broken them into bugs (blocking) and documentation adjustments (non-blocking but strongly recommended).
Bugs (blocking)
1. S3 key casing mismatch -- skills fail to load out of the box
Both s3_skill_loader.py:96 and claude_sdk_bedrock.py:53 look for skill.md (lowercase), but the README instructions, the bundled sample skills, and the Claude SDK convention all use SKILL.md (uppercase). As-is, the skill loader reports 0 skills loaded.
Fix: change skill.md to SKILL.md in both files:
- agent/s3_skill_loader.py line 96: skill_key = f'skills/{skill_name}/SKILL.md'
- claude_sdk_bedrock.py line 53: Key=f'skills/{skill_name}/SKILL.md'
We tested this fix and it takes skill loading from 0/6 to 5/6 (with the next bug explaining the remaining 1).
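S3 object keys are case-sensitive, so skills/<name>/skill.md and skills/<name>/SKILL.md name two different objects. One way to keep the two call sites from drifting apart again is to share a single definition of the filename. This is only a sketch: `SKILL_FILENAME` and `skill_key` are hypothetical names, not identifiers from the PR.

```python
# Hypothetical helper: one shared definition of the skill manifest key,
# so the loader and the Bedrock entry point cannot disagree on casing.
SKILL_FILENAME = "SKILL.md"  # uppercase, matching the README and sample skills


def skill_key(skill_name: str) -> str:
    """Return the S3 key for a skill's SKILL.md under the skills/ prefix."""
    return f"skills/{skill_name}/{SKILL_FILENAME}"


# S3 compares keys exactly, which a plain set lookup models well enough.
uploaded_keys = {"skills/web-research/SKILL.md"}
found_uppercase = skill_key("web-research") in uploaded_keys
found_lowercase = "skills/web-research/skill.md" in uploaded_keys
print(found_uppercase, found_lowercase)
```

With the lowercase key, the lookup misses even though the object exists, which matches the "0 skills loaded" symptom described above.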
2. Wrong exception type in _download_skill -- skills without implementation.py are incorrectly marked as failed
s3_skill_loader.py:123 catches self.s3_client.exceptions.NoSuchKey, but download_file raises botocore.exceptions.ClientError with a 404 status. This means skills like data-analysis-workflow that have a SKILL.md but no implementation.py fall through to the outer except block and get counted as failures, even though the SKILL.md downloaded successfully.
Fix: catch botocore.exceptions.ClientError (or a bare Exception) in the inner try/except at line 123.
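A possible shape for that inner try/except is sketched below. It assumes the loader calls download_file(Bucket, Key, Filename) on a boto3 S3 client; `download_skill`, `FakeS3`, and the bucket name are hypothetical, and the ImportError fallback exists only so the sketch runs where botocore is not installed. It is slightly narrower than a bare Exception: a 404 on implementation.py is treated as "optional file absent", and anything else is re-raised.

```python
import tempfile
from pathlib import Path

try:
    from botocore.exceptions import ClientError
except ImportError:
    # Stand-in with the same constructor shape, for illustration only.
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(operation_name)
            self.response = error_response


def download_skill(s3_client, bucket: str, skill_name: str, dest_dir: str) -> bool:
    """Download SKILL.md (required) and implementation.py (optional).

    Returns True if implementation.py was also downloaded, False if it
    does not exist. Any other S3 error propagates to the caller.
    """
    skill_dir = Path(dest_dir) / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    prefix = f"skills/{skill_name}"
    # A missing SKILL.md is a real failure, so this call is not guarded.
    s3_client.download_file(bucket, f"{prefix}/SKILL.md", str(skill_dir / "SKILL.md"))
    try:
        s3_client.download_file(
            bucket, f"{prefix}/implementation.py", str(skill_dir / "implementation.py")
        )
        return True
    except ClientError as err:
        code = err.response.get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey"):
            return False  # optional file is absent; the skill still loaded
        raise


class FakeS3:
    """In-memory stand-in for the S3 client (hypothetical, for the demo)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def download_file(self, Bucket, Key, Filename):
        if Key not in self.keys:
            raise ClientError({"Error": {"Code": "404", "Message": "Not Found"}}, "HeadObject")
        Path(Filename).write_text("placeholder", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    fake = FakeS3({"skills/data-analysis-workflow/SKILL.md"})
    has_impl = download_skill(fake, "example-bucket", "data-analysis-workflow", tmp)
    skill_md_exists = (Path(tmp) / "data-analysis-workflow" / "SKILL.md").exists()
```

Checking the error code rather than swallowing every exception keeps permission or throttling errors visible, at the cost of a few extra lines.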
3. Bash script uses Python comment syntax
startup.sh lines 2-11 use Python triple-quote """ for a block comment. This is harmless (bash interprets it as empty strings) but incorrect. Should use # comments.
Documentation adjustments (strongly recommended)
We went through the README claim-by-claim against the actual implementations. The core architecture claims are accurate and well-documented, and we appreciate the honest disclaimer at line 54:
"The included skills provide conceptual frameworks and prompting strategies that guide Claude's responses."
However, the Quick Start skill descriptions (lines 96-144) make specific capability claims that the sample doesn't deliver:
| README Claim | What Actually Happens |
|---|---|
| "Security Analysis: Detects OWASP Top 10 vulnerabilities" | SKILL.md guides Claude to reason about security -- Claude does a good job, but nothing "detects" or "scans" programmatically |
| "Dependency Audit: Scans for known CVEs in third-party libraries" | No CVE scanning occurs. Claude applies its general knowledge |
| "Multi-Source Research: Gathers information from documentation, forums, GitHub, Stack Overflow" | No HTTP requests are made. web-research/implementation.py imports requests but never uses it |
| "API Integration: Connects to REST APIs with OAuth, API keys, or basic auth" | data-fetcher/implementation.py returns {"data": f"API data for {query}"} -- connects to nothing |
| "Database Queries: Retrieves data from PostgreSQL, MySQL, MongoDB, DynamoDB" | Same -- returns a hardcoded string |
| "Container Security: Multi-stage Docker build with minimal surface area" (line 459) | The Dockerfile is a single FROM python:3.12-slim -- not a multi-stage build |
Suggested approach: Reframe the skill descriptions as what they are -- prompt templates that guide Claude's responses. For example: "Guides Claude to analyze code for OWASP Top 10 vulnerabilities" rather than "Detects OWASP Top 10 vulnerabilities." This is consistent with your own disclaimer at line 54.
On the implementation.py files: In our testing, the claude_agent_sdk discovers and uses the SKILL.md files as context but the implementation.py files are never invoked in the actual flow. They're effectively dead code with hardcoded return values. Consider either removing them, or explicitly labeling them as structural placeholders showing where real implementations would go.
What works well
- S3 skill loading works end-to-end (after the casing fix)
- The simple implementation correctly injects skills into the system prompt and gets skill-aware responses from Bedrock
- The advanced implementation's native skill discovery works -- skills in .claude/skills/ are found and influence Claude's responses
- The overall architecture (S3 as persistent skill store for AgentCore's ephemeral containers) is a genuinely useful pattern
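For readers unfamiliar with the simple implementation's approach, injecting skills into the system prompt can be sketched roughly as follows. This is not the PR's code: `build_system_prompt`, the "## Skill:" heading format, and the sample skill text are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_prompt: str, skills_dir: str) -> str:
    """Append the text of every <skills_dir>/<name>/SKILL.md to base_prompt."""
    sections = [base_prompt]
    # Sort for a stable prompt regardless of filesystem listing order.
    for skill_file in sorted(Path(skills_dir).glob("*/SKILL.md")):
        skill_name = skill_file.parent.name
        body = skill_file.read_text(encoding="utf-8").strip()
        sections.append(f"## Skill: {skill_name}\n{body}")
    return "\n\n".join(sections)


# Demo with two throwaway skills in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [
        ("web-research", "Guide for research."),
        ("data-fetcher", "Guide for fetching data."),
    ]:
        skill_dir = Path(tmp) / name
        skill_dir.mkdir()
        (skill_dir / "SKILL.md").write_text(text, encoding="utf-8")
    prompt = build_system_prompt("You are a helpful agent.", tmp)
```

The resulting string would be passed as the system prompt on the model call, so the skills act purely as context, consistent with the findings above.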
The concept is sound and fills a real gap in the AgentCore samples. Looking forward to seeing the fixes.
1. Fix S3 key casing mismatch - change skill.md to SKILL.md
   - agent/s3_skill_loader.py line 96
   - claude_sdk_bedrock.py line 53
2. Fix exception handling for missing implementation.py files
   - Change from NoSuchKey to Exception to catch ClientError
3. Fix bash script comments - replace Python triple-quotes with #
- Reframe skill descriptions to reflect prompt-based guidance
- Update 'Detects/Scans/Connects' to 'Guides Claude to...'
- Fix multi-stage Docker build claims (single-stage build)
- Add clarification about implementation.py files being placeholders
Fixes #180
Summary
Adds a new sample demonstrating dynamic skill management for Claude Agent SDK on Amazon Bedrock AgentCore Runtime. Skills are loaded from S3 at container startup, enabling zero-downtime updates without container rebuilds.
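As a rough illustration of the startup step, the sketch below mirrors everything under a skills/ prefix into a local directory using the boto3 list_objects_v2 paginator and download_file. It is not the sample's actual loader: `sync_skills`, `FakeS3`, and the bucket name are hypothetical, and the fake client exists only so the sketch runs without AWS credentials.

```python
import tempfile
from pathlib import Path


def sync_skills(s3_client, bucket: str, dest_root: str, prefix: str = "skills/") -> list:
    """Download every object under prefix into dest_root, preserving layout."""
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            relative = key[len(prefix):]
            if not relative or key.endswith("/"):
                continue  # skip the prefix itself and folder placeholder objects
            target = Path(dest_root) / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            s3_client.download_file(bucket, key, str(target))
            downloaded.append(key)
    return downloaded


class FakeS3:
    """In-memory stand-in for a boto3 S3 client (hypothetical, demo only)."""

    def __init__(self, objects):
        self.objects = objects  # maps key -> body text

    def get_paginator(self, name):
        assert name == "list_objects_v2"
        return self

    def paginate(self, Bucket, Prefix):
        keys = sorted(k for k in self.objects if k.startswith(Prefix))
        yield {"Contents": [{"Key": k} for k in keys]}

    def download_file(self, Bucket, Key, Filename):
        Path(Filename).write_text(self.objects[Key], encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    fake = FakeS3({
        "skills/": "",
        "skills/web-research/SKILL.md": "# Web research",
        "other/readme.txt": "not a skill",
    })
    synced = sync_skills(fake, "example-bucket", tmp)
    restored = (Path(tmp) / "web-research" / "SKILL.md").read_text(encoding="utf-8")
```

In a real container the client would come from boto3.client("s3") and dest_root would be wherever the agent looks for skills; how the bucket name reaches the container is left open here.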
Why This Matters
Current AgentCore samples (claude-agent-sdk-on-agentcore, claude-code-on-agentcore) bundle skills statically in containers, requiring full rebuilds to update agent capabilities. This creates operational overhead in production where skills need frequent updates. This sample solves that by using S3 as a persistent skill repository, working around AgentCore's ephemeral storage limitation.
What This Adds
Key Features
Testing
Related
Addresses the operational gap identified in issue #180 where existing AgentCore samples require container rebuilds for capability changes.