This document describes the enhanced harness plugin implementation that achieves 100% coverage of Anthropic's blog recommendations for long-running agents.
Reference: Effective Harnesses for Long-Running Agents
Implementation Date: December 2024
harness/
├── .claude-plugin/
│ └── plugin.json # Plugin metadata (v1.0.0)
├── core/ # Core Python modules
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── change_detector.py # Smart change classification
│ ├── test_runner.py # Test execution utilities
│ ├── browser_automation.py # Playwright/Puppeteer wrapper
│ ├── progress.py # Progress file utilities
│ └── features.py # Feature checklist management
├── hooks/ # Claude Code hooks
│ ├── hooks.json # Hook configuration
│ ├── session_start.py # SessionStart hook
│ ├── stop.py # Stop hook
│ ├── pre_tool_use.py # PreToolUse hook
│ └── post_tool_use.py # PostToolUse hook
├── commands/ # Slash commands
│ ├── init.md # Initialize harness
│ ├── status.md # Show status
│ ├── log.md # Add progress entry
│ ├── feature.md # Manage features
│ ├── checkpoint.md # Create git checkpoint
│ ├── configure.md # Configure settings
│ └── baseline.md # Run baseline tests
├── skills/
│ └── harness-workflow.md # Workflow guidance skill
├── README.md # User documentation
└── IMPLEMENTATION.md # This file
| Recommendation | Implementation | File(s) |
|---|---|---|
| Two-agent system | Skill guidance + init command separation | skills/harness-workflow.md |
| Init script execution | Auto-runs init.sh on session start |
hooks/session_start.py |
| Progress documentation | claude-progress.txt with auto-logging |
core/progress.py, hooks/post_tool_use.py |
| Feature list (JSON) | claude-features.json with status tracking |
core/features.py |
| Session startup routine | Reads git, progress, features automatically | hooks/session_start.py |
| Baseline tests on startup | Runs tests before starting work | hooks/session_start.py, core/test_runner.py |
| Auto-progress logging | Logs significant changes automatically | hooks/post_tool_use.py |
| Auto-checkpoint suggestions | Suggests commits after major changes | hooks/post_tool_use.py |
| Feature status automation | Parses test results, suggests updates | hooks/post_tool_use.py |
| One-feature discipline | Enforces working on declared feature | hooks/pre_tool_use.py |
| Stop validation | Validates tests run, features complete | hooks/stop.py |
| Browser automation | Playwright/Puppeteer for UI verification | core/browser_automation.py |
Purpose: Centralized configuration with three strictness levels.
Key Functions:
load_config(work_dir)- Load from.claude/claude-harness.jsonsave_config(config, work_dir)- Persist configurationis_strict_mode(),is_relaxed_mode(),is_standard_mode()- Mode checksget_setting(key),set_setting(key, value)- Individual settingsis_harness_initialized(work_dir)- Check initialization status
Default Configuration:
{
"strictness": "standard",
"auto_progress_logging": True,
"auto_checkpoint_suggestions": True,
"feature_enforcement": True,
"baseline_tests_on_startup": True,
"init_script_execution": True,
"browser_automation": False,
"checkpoint_interval_minutes": 30,
"test_commands": {
"node": "npm test -- --passWithNoTests 2>&1 || true",
"python": "python -m pytest -v 2>&1 || true",
"rust": "cargo test 2>&1 || true",
"go": "go test ./... 2>&1 || true"
}
}Purpose: Intelligently classify changes to determine when to auto-log and suggest checkpoints.
Change Levels:
TRIVIAL- Comments, formatting, imports, small edits (<100 chars)SIGNIFICANT- New functions, logic changes, test runs, moderate editsMAJOR- New files >100 lines, large refactors, git commits
Key Functions:
classify_change(tool_name, tool_input, tool_result)- Returns(ChangeLevel, reason)should_auto_log(level)- True for SIGNIFICANT and MAJORshould_suggest_checkpoint(level)- True for MAJOR only
Classification Logic:
- Analyzes Write tool: counts lines, checks for code patterns
- Analyzes Edit tool: measures diff size, detects new patterns
- Analyzes Bash tool: identifies test/build/git commands
Purpose: Unified test execution across multiple project types.
Supported Project Types:
- Node.js (
package.json) -npm test - Python (
pyproject.toml,setup.py) -pytest - Rust (
Cargo.toml) -cargo test - Go (
go.mod) -go test ./... - Java Maven (
pom.xml) -mvn test - Java Gradle (
build.gradle) -./gradlew test
Key Functions:
detect_project_type(work_dir)- Detect from config filesrun_tests(work_dir, timeout)- Execute tests, returnTestSummaryparse_test_output(output, project_type)- Extract pass/fail countsdid_tests_run_in_session(transcript_path)- Check session transcriptget_test_summary_string(summary)- Human-readable summary
TestSummary Dataclass:
@dataclass
class TestSummary:
total: int = 0
passed: int = 0
failed: int = 0
skipped: int = 0
errors: int = 0
duration: float = 0.0
raw_output: str = ""
result: TestResult = TestResult.NOT_RUN
failed_tests: List[str] = field(default_factory=list)Purpose: Playwright/Puppeteer wrapper for visual verification (opt-in).
Key Functions:
detect_browser_tool(work_dir)- Find Playwright or Puppeteertake_screenshot(url, output_path, ...)- Capture page screenshotverify_element(url, selector, expected_text)- Check element existsget_installation_instructions()- Setup instructions
BrowserResult Dataclass:
@dataclass
class BrowserResult:
success: bool
screenshot_path: Optional[str] = None
error: Optional[str] = None
output: Optional[str] = None
element_found: bool = False
element_text: Optional[str] = None{
"hooks": {
"SessionStart": [{"matcher": "*", "hooks": [...], "timeout": 120}],
"PreToolUse": [{"matcher": "Edit|Write", "hooks": [...], "timeout": 10}],
"PostToolUse": [{"matcher": "Edit|Write|Bash", "hooks": [...], "timeout": 15}],
"Stop": [{"matcher": "*", "hooks": [...], "timeout": 15}]
}
}Triggers: Every session start
Actions:
- Check if harness is initialized
- Load configuration
- Execute
init.shif exists (60s timeout, 10KB limit) - Run baseline tests if configured
- Read git status and recent commits
- Read progress file (last 50 lines)
- Read feature checklist summary
- Inject context via
systemMessage
Output Format:
=== AGENT HARNESS SESSION STARTUP ===
Session started: {timestamp}
Working directory: {path}
Mode: {strictness}
--- INIT SCRIPT ---
{init.sh output}
--- BASELINE TESTS ---
{test results}
--- GIT STATUS ---
{modified files}
--- RECENT COMMITS ---
{last 10 commits}
--- PROGRESS LOG ---
{last 50 lines}
--- FEATURE CHECKLIST STATUS ---
Total: X | Passing: X | Failing: X | In Progress: X
Next priority items:
[WIP] 1. Feature name: description
[TODO] 2. Feature name: description
=== END SESSION CONTEXT ===
Triggers: Before Edit or Write tools
Purpose: Enforce one-feature-at-a-time discipline
Actions:
- Check if features are defined
- Check if any feature is
in_progress - If no active feature:
- Standard mode: Show warning, allow operation
- Strict mode: Block with
permissionDecision: deny
Output:
- Warning message listing available features to start
- In strict mode: operation blocked
Triggers: After Edit, Write, or Bash tools
Purpose: Auto-logging, checkpoint suggestions, test result parsing
Actions:
- Classify the change (trivial/significant/major)
- Auto-log significant changes to progress file
- Track cumulative changes in session state file
- Suggest checkpoint when:
- Major change occurs
- 5+ significant changes accumulate
- Time threshold exceeded (default 30 min)
- Parse test results from Bash output
- Suggest feature status updates after test pass
Session State (stored in /tmp/harness-session-{id}.json):
{
"changes_since_checkpoint": 0,
"last_checkpoint_time": null,
"significant_changes": [],
"current_feature": null
}Triggers: When agent considers stopping
Purpose: Validate work completion before allowing stop
Validation Checks:
- Tests run if code was modified
- Uncommitted changes warning
- Features still in progress
- Progress log updated
Behavior by Mode:
- Relaxed: Single FYI message
- Standard: Strong warnings, no blocking
- Strict: Block with
decision: blockif validation fails
| Command | Description | Key Actions |
|---|---|---|
/harness:init |
Initialize harness | Create progress file, features file, marker |
/harness:status |
Show current state | Display git, progress, features summary |
/harness:log [type] [msg] |
Add progress entry | Append to progress file with timestamp |
/harness:feature [action] |
Manage features | add, start, pass, fail, list, next |
/harness:checkpoint [msg] |
Create git commit | Stage all, commit with prefix, log |
/harness:configure [setting] |
Change settings | Modify .claude/claude-harness.json |
/harness:baseline |
Run tests manually | Execute and display test results |
# Claude Agent Progress Log
# Project: {name}
# Created: {timestamp}
[2024-12-14 10:30:00] INITIALIZED: Progress tracking enabled
[2024-12-14 10:35:00] STARTED: User authentication feature
[2024-12-14 10:45:00] AUTO: Created auth.py (New code file, 45 code lines)
[2024-12-14 11:00:00] CHECKPOINT [abc1234]: Implemented login form
[2024-12-14 11:30:00] COMPLETED: User authentication feature
{
"metadata": {
"project": "my-app",
"created_at": "2024-12-14T10:30:00",
"last_updated": "2024-12-14T11:30:00"
},
"features": [
{
"id": 1,
"name": "User login",
"description": "Email/password authentication",
"status": "passing",
"category": "auth",
"priority": 1,
"created_at": "2024-12-14T10:30:00",
"updated_at": "2024-12-14T11:30:00",
"notes": [
{"timestamp": "2024-12-14T11:30:00", "content": "All tests passing"}
]
}
]
}{
"strictness": "standard",
"auto_progress_logging": true,
"auto_checkpoint_suggestions": true,
"feature_enforcement": true,
"baseline_tests_on_startup": true,
"init_script_execution": true,
"browser_automation": false,
"checkpoint_interval_minutes": 30
}| Feature | Relaxed | Standard | Strict |
|---|---|---|---|
| Auto-progress logging | Off | On | On |
| Checkpoint suggestions | Off | On | On |
| Feature enforcement | Off | Warn | Block |
| Stop validation | FYI | Warn | Block |
| Baseline tests | Off | On | On |
| Init script | On | On | On |
All hooks follow these principles:
- Never block on errors - always exit 0
- Wrap all code in try/except
- Return informative error messages via
systemMessage - Fallback gracefully if imports fail
- Use timeouts for external operations
- Test SessionStart: Start Claude Code in an initialized project
- Test PreToolUse: Try to edit without starting a feature
- Test PostToolUse: Make significant edits, check for suggestions
- Test Stop: Try to stop without running tests
- Test Strictness: Configure strict mode and verify blocking
Potential improvements not yet implemented:
- SubagentStop hook for validating subagent completion
- Notification hook for external integrations
- PreCompact hook to preserve critical context
- Integration with CI/CD pipelines
- Remote session support
- Custom test command configuration per project