Harness Plugin Implementation Details

This document describes the enhanced harness plugin implementation that achieves 100% coverage of Anthropic's blog recommendations for long-running agents.

Reference: Effective Harnesses for Long-Running Agents

Implementation Date: December 2024

Architecture Overview

harness/
├── .claude-plugin/
│   └── plugin.json              # Plugin metadata (v1.0.0)
├── core/                        # Core Python modules
│   ├── __init__.py
│   ├── config.py                # Configuration management
│   ├── change_detector.py       # Smart change classification
│   ├── test_runner.py           # Test execution utilities
│   ├── browser_automation.py    # Playwright/Puppeteer wrapper
│   ├── progress.py              # Progress file utilities
│   └── features.py              # Feature checklist management
├── hooks/                       # Claude Code hooks
│   ├── hooks.json               # Hook configuration
│   ├── session_start.py         # SessionStart hook
│   ├── stop.py                  # Stop hook
│   ├── pre_tool_use.py          # PreToolUse hook
│   └── post_tool_use.py         # PostToolUse hook
├── commands/                    # Slash commands
│   ├── init.md                  # Initialize harness
│   ├── status.md                # Show status
│   ├── log.md                   # Add progress entry
│   ├── feature.md               # Manage features
│   ├── checkpoint.md            # Create git checkpoint
│   ├── configure.md             # Configure settings
│   └── baseline.md              # Run baseline tests
├── skills/
│   └── harness-workflow.md      # Workflow guidance skill
├── README.md                    # User documentation
└── IMPLEMENTATION.md            # This file

Blog Recommendations Coverage

Recommendation	Implementation	File(s)
Two-agent system	Skill guidance + init command separation	`skills/harness-workflow.md`
Init script execution	Auto-runs `init.sh` on session start	`hooks/session_start.py`
Progress documentation	`claude-progress.txt` with auto-logging	`core/progress.py`, `hooks/post_tool_use.py`
Feature list (JSON)	`claude-features.json` with status tracking	`core/features.py`
Session startup routine	Reads git, progress, features automatically	`hooks/session_start.py`
Baseline tests on startup	Runs tests before starting work	`hooks/session_start.py`, `core/test_runner.py`
Auto-progress logging	Logs significant changes automatically	`hooks/post_tool_use.py`
Auto-checkpoint suggestions	Suggests commits after major changes	`hooks/post_tool_use.py`
Feature status automation	Parses test results, suggests updates	`hooks/post_tool_use.py`
One-feature discipline	Enforces working on declared feature	`hooks/pre_tool_use.py`
Stop validation	Validates tests run, features complete	`hooks/stop.py`
Browser automation	Playwright/Puppeteer for UI verification	`core/browser_automation.py`

Core Modules

config.py - Configuration Management

Purpose: Centralized configuration with three strictness levels.

Key Functions:

load_config(work_dir) - Load from .claude/claude-harness.json
save_config(config, work_dir) - Persist configuration
is_strict_mode(), is_relaxed_mode(), is_standard_mode() - Mode checks
get_setting(key), set_setting(key, value) - Individual settings
is_harness_initialized(work_dir) - Check initialization status

Default Configuration:

{
    "strictness": "standard",
    "auto_progress_logging": True,
    "auto_checkpoint_suggestions": True,
    "feature_enforcement": True,
    "baseline_tests_on_startup": True,
    "init_script_execution": True,
    "browser_automation": False,
    "checkpoint_interval_minutes": 30,
    "test_commands": {
        "node": "npm test -- --passWithNoTests 2>&1 || true",
        "python": "python -m pytest -v 2>&1 || true",
        "rust": "cargo test 2>&1 || true",
        "go": "go test ./... 2>&1 || true"
    }
}

change_detector.py - Change Classification

Purpose: Intelligently classify changes to determine when to auto-log and suggest checkpoints.

Change Levels:

TRIVIAL - Comments, formatting, imports, small edits (<100 chars)
SIGNIFICANT - New functions, logic changes, test runs, moderate edits
MAJOR - New files >100 lines, large refactors, git commits

Key Functions:

classify_change(tool_name, tool_input, tool_result) - Returns (ChangeLevel, reason)
should_auto_log(level) - True for SIGNIFICANT and MAJOR
should_suggest_checkpoint(level) - True for MAJOR only

Classification Logic:

Analyzes Write tool: counts lines, checks for code patterns
Analyzes Edit tool: measures diff size, detects new patterns
Analyzes Bash tool: identifies test/build/git commands

test_runner.py - Test Execution

Purpose: Unified test execution across multiple project types.

Supported Project Types:

Node.js (package.json) - npm test
Python (pyproject.toml, setup.py) - pytest
Rust (Cargo.toml) - cargo test
Go (go.mod) - go test ./...
Java Maven (pom.xml) - mvn test
Java Gradle (build.gradle) - ./gradlew test

Key Functions:

detect_project_type(work_dir) - Detect from config files
run_tests(work_dir, timeout) - Execute tests, return TestSummary
parse_test_output(output, project_type) - Extract pass/fail counts
did_tests_run_in_session(transcript_path) - Check session transcript
get_test_summary_string(summary) - Human-readable summary

TestSummary Dataclass:

@dataclass
class TestSummary:
    total: int = 0
    passed: int = 0
    failed: int = 0
    skipped: int = 0
    errors: int = 0
    duration: float = 0.0
    raw_output: str = ""
    result: TestResult = TestResult.NOT_RUN
    failed_tests: List[str] = field(default_factory=list)

browser_automation.py - UI Verification

Purpose: Playwright/Puppeteer wrapper for visual verification (opt-in).

Key Functions:

detect_browser_tool(work_dir) - Find Playwright or Puppeteer
take_screenshot(url, output_path, ...) - Capture page screenshot
verify_element(url, selector, expected_text) - Check element exists
get_installation_instructions() - Setup instructions

BrowserResult Dataclass:

@dataclass
class BrowserResult:
    success: bool
    screenshot_path: Optional[str] = None
    error: Optional[str] = None
    output: Optional[str] = None
    element_found: bool = False
    element_text: Optional[str] = None

Hooks Implementation

hooks.json - Hook Registration

{
  "hooks": {
    "SessionStart": [{"matcher": "*", "hooks": [...], "timeout": 120}],
    "PreToolUse": [{"matcher": "Edit|Write", "hooks": [...], "timeout": 10}],
    "PostToolUse": [{"matcher": "Edit|Write|Bash", "hooks": [...], "timeout": 15}],
    "Stop": [{"matcher": "*", "hooks": [...], "timeout": 15}]
  }
}

session_start.py - SessionStart Hook

Triggers: Every session start

Actions:

Check if harness is initialized
Load configuration
Execute init.sh if exists (60s timeout, 10KB limit)
Run baseline tests if configured
Read git status and recent commits
Read progress file (last 50 lines)
Read feature checklist summary
Inject context via systemMessage

Output Format:

=== AGENT HARNESS SESSION STARTUP ===
Session started: {timestamp}
Working directory: {path}
Mode: {strictness}

--- INIT SCRIPT ---
{init.sh output}

--- BASELINE TESTS ---
{test results}

--- GIT STATUS ---
{modified files}

--- RECENT COMMITS ---
{last 10 commits}

--- PROGRESS LOG ---
{last 50 lines}

--- FEATURE CHECKLIST STATUS ---
Total: X | Passing: X | Failing: X | In Progress: X
Next priority items:
  [WIP] 1. Feature name: description
  [TODO] 2. Feature name: description

=== END SESSION CONTEXT ===

pre_tool_use.py - PreToolUse Hook

Triggers: Before Edit or Write tools

Purpose: Enforce one-feature-at-a-time discipline

Actions:

Check if features are defined
Check if any feature is in_progress
If no active feature:
- Standard mode: Show warning, allow operation
- Strict mode: Block with permissionDecision: deny

Output:

Warning message listing available features to start
In strict mode: operation blocked

post_tool_use.py - PostToolUse Hook

Triggers: After Edit, Write, or Bash tools

Purpose: Auto-logging, checkpoint suggestions, test result parsing

Actions:

Classify the change (trivial/significant/major)
Auto-log significant changes to progress file
Track cumulative changes in session state file
Suggest checkpoint when:
- Major change occurs
- 5+ significant changes accumulate
- Time threshold exceeded (default 30 min)
Parse test results from Bash output
Suggest feature status updates after test pass

Session State (stored in /tmp/harness-session-{id}.json):

{
    "changes_since_checkpoint": 0,
    "last_checkpoint_time": null,
    "significant_changes": [],
    "current_feature": null
}

stop.py - Stop Hook

Triggers: When agent considers stopping

Purpose: Validate work completion before allowing stop

Validation Checks:

Tests run if code was modified
Uncommitted changes warning
Features still in progress
Progress log updated

Behavior by Mode:

Relaxed: Single FYI message
Standard: Strong warnings, no blocking
Strict: Block with decision: block if validation fails

Commands

Command	Description	Key Actions
`/harness:init`	Initialize harness	Create progress file, features file, marker
`/harness:status`	Show current state	Display git, progress, features summary
`/harness:log [type] [msg]`	Add progress entry	Append to progress file with timestamp
`/harness:feature [action]`	Manage features	add, start, pass, fail, list, next
`/harness:checkpoint [msg]`	Create git commit	Stage all, commit with prefix, log
`/harness:configure [setting]`	Change settings	Modify `.claude/claude-harness.json`
`/harness:baseline`	Run tests manually	Execute and display test results

File Formats

claude-progress.txt

# Claude Agent Progress Log
# Project: {name}
# Created: {timestamp}

[2024-12-14 10:30:00] INITIALIZED: Progress tracking enabled
[2024-12-14 10:35:00] STARTED: User authentication feature
[2024-12-14 10:45:00] AUTO: Created auth.py (New code file, 45 code lines)
[2024-12-14 11:00:00] CHECKPOINT [abc1234]: Implemented login form
[2024-12-14 11:30:00] COMPLETED: User authentication feature

claude-features.json

{
  "metadata": {
    "project": "my-app",
    "created_at": "2024-12-14T10:30:00",
    "last_updated": "2024-12-14T11:30:00"
  },
  "features": [
    {
      "id": 1,
      "name": "User login",
      "description": "Email/password authentication",
      "status": "passing",
      "category": "auth",
      "priority": 1,
      "created_at": "2024-12-14T10:30:00",
      "updated_at": "2024-12-14T11:30:00",
      "notes": [
        {"timestamp": "2024-12-14T11:30:00", "content": "All tests passing"}
      ]
    }
  ]
}

.claude/claude-harness.json

{
  "strictness": "standard",
  "auto_progress_logging": true,
  "auto_checkpoint_suggestions": true,
  "feature_enforcement": true,
  "baseline_tests_on_startup": true,
  "init_script_execution": true,
  "browser_automation": false,
  "checkpoint_interval_minutes": 30
}

Strictness Levels

Feature	Relaxed	Standard	Strict
Auto-progress logging	Off	On	On
Checkpoint suggestions	Off	On	On
Feature enforcement	Off	Warn	Block
Stop validation	FYI	Warn	Block
Baseline tests	Off	On	On
Init script	On	On	On

Error Handling

All hooks follow these principles:

Never block on errors - always exit 0
Wrap all code in try/except
Return informative error messages via systemMessage
Fallback gracefully if imports fail
Use timeouts for external operations

Testing the Implementation

Test SessionStart: Start Claude Code in an initialized project
Test PreToolUse: Try to edit without starting a feature
Test PostToolUse: Make significant edits, check for suggestions
Test Stop: Try to stop without running tests
Test Strictness: Configure strict mode and verify blocking

Future Enhancements

Potential improvements not yet implemented:

SubagentStop hook for validating subagent completion
Notification hook for external integrations
PreCompact hook to preserve critical context
Integration with CI/CD pipelines
Remote session support
Custom test command configuration per project

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harness Plugin Implementation Details

Architecture Overview

Blog Recommendations Coverage

Core Modules

config.py - Configuration Management

change_detector.py - Change Classification

test_runner.py - Test Execution

browser_automation.py - UI Verification

Hooks Implementation

hooks.json - Hook Registration

session_start.py - SessionStart Hook

pre_tool_use.py - PreToolUse Hook

post_tool_use.py - PostToolUse Hook

stop.py - Stop Hook

Commands

File Formats

claude-progress.txt

claude-features.json

.claude/claude-harness.json

Strictness Levels

Error Handling

Testing the Implementation

Future Enhancements

FilesExpand file tree

IMPLEMENTATION.md

Latest commit

History

IMPLEMENTATION.md

File metadata and controls

Harness Plugin Implementation Details

Architecture Overview

Blog Recommendations Coverage

Core Modules

config.py - Configuration Management

change_detector.py - Change Classification

test_runner.py - Test Execution

browser_automation.py - UI Verification

Hooks Implementation

hooks.json - Hook Registration

session_start.py - SessionStart Hook

pre_tool_use.py - PreToolUse Hook

post_tool_use.py - PostToolUse Hook

stop.py - Stop Hook

Commands

File Formats

claude-progress.txt

claude-features.json

.claude/claude-harness.json

Strictness Levels

Error Handling

Testing the Implementation

Future Enhancements