
Conversation

@konradsemsch (Collaborator) commented Oct 22, 2025

Summary

Implement Human-in-the-Loop (HITL) feedback system enabling real-time intervention and guidance during agent execution through React Terminal UI.

Architecture & Implementation

Core Components

Backend (Python)

  • FeedbackManager (feedback_manager.py): centralized state management with feedback queuing and message formatting
  • HITLHookProvider (hitl_hook_provider.py): strands SDK integration for tool execution interception
  • FeedbackInputHandler (feedback_handler.py): background stdin listener for React UI communication
  • FeedbackInjectionHook (feedback_injection_hook.py): system prompt modification before LLM calls
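
For orientation, a minimal sketch of how these backend pieces could fit together (the class and names below, e.g. PendingFeedback, are illustrative only, not the actual implementation):

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PendingFeedback:
    feedback_type: str  # e.g. "correction", "suggestion", "rejection"
    content: str
    tool_id: str

class FeedbackManagerSketch:
    """Illustrative core of the FeedbackManager: queue feedback per tool
    and format it for injection into the system prompt."""

    def __init__(self) -> None:
        self.feedback_queue: Dict[str, PendingFeedback] = {}

    def submit_feedback(self, feedback: PendingFeedback) -> None:
        self.feedback_queue[feedback.tool_id] = feedback

    def get_pending_feedback_message(self) -> Optional[str]:
        # Pop one queued item and format it for the injection hook
        if not self.feedback_queue:
            return None
        _, feedback = self.feedback_queue.popitem()
        return f"HUMAN FEEDBACK ({feedback.feedback_type}): {feedback.content}"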

Frontend (React/TypeScript)

  • HITLInterventionPanel component: interactive feedback UI with input isolation
  • HITL command utilities (hitlCommands.ts): stdin protocol for Python communication
  • Application state integration: seamless pause/resume workflow
    NOTE: pause & resume are not working yet; they still need to be implemented and tested extensively
  • Event rendering: visual feedback throughout the process (both for human-triggered and agent-triggered events; the latter hasn't been tested extensively yet)

Two Working Modes

1. Agent-Triggered HITL

  • Automatic pause on destructive operations (e.g., rm -rf, system modifications).
    NOTE: the set of events/operations that should trigger a pause needs further extension and clarification, including how to make it configurable
  • Low confidence detection when agent tool selection confidence < 70%
    NOTE: the threshold should be made configurable, still TBD (see the sketch after this list)
  • Tool review panel displays operation details with approve/modify/reject options
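
As referenced above, a minimal sketch of how the confidence threshold could be made configurable via an environment variable (the variable name CYBER_AGENT_HITL_CONFIDENCE_THRESHOLD is hypothetical, not an existing setting):

import os

DEFAULT_CONFIDENCE_THRESHOLD = 0.7  # mirrors the current hard-coded 70%

def get_confidence_threshold() -> float:
    # Hypothetical env var; falls back to the default on missing/bad values
    raw = os.environ.get("CYBER_AGENT_HITL_CONFIDENCE_THRESHOLD")
    try:
        return float(raw) if raw is not None else DEFAULT_CONFIDENCE_THRESHOLD
    except ValueError:
        return DEFAULT_CONFIDENCE_THRESHOLD

def should_pause_for_review(confidence: float) -> bool:
    # Agent-triggered HITL fires when tool-selection confidence is low
    return confidence < get_confidence_threshold()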

2. Pentester-Triggered HITL

  • Manual intervention via [i] key during any operation
  • Direct feedback input for real-time guidance and corrections
  • Flexible interaction allows strategic adjustments mid-assessment

Technical Flow

User Input (React) → Stdin Communication → Python Background Thread →
FeedbackManager State → Strands Hook System → System Prompt Injection →
LLM Receives Modified Context → Agent Adjusts Behavior

Implementation Details

  • React → Python: wrapped commands via stdin, delimited by __HITL_COMMAND__ ... __HITL_COMMAND_END__ (see the parsing sketch below)
  • Python → React: standard event emission through existing event system
  • Agent Integration: system prompt modification using Strands BeforeModelInvocationEvent
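
A minimal sketch of the Python-side parsing of one wrapped command (the actual _process_input_line() may differ; this only illustrates the delimiter protocol):

import json
from typing import Optional

START_MARKER = "__HITL_COMMAND__"
END_MARKER = "__HITL_COMMAND_END__"

def parse_hitl_command(line: str) -> Optional[dict]:
    """Extract and decode one wrapped command from a stdin line;
    returns None when the line carries no complete command."""
    start = line.find(START_MARKER)
    end = line.find(END_MARKER)
    if start == -1 or end == -1 or end <= start:
        return None
    payload = line[start + len(START_MARKER):end]
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None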

Current Status & Testing

Working Features

  • Complete UI integration with keyboard handlers (a/c/s/r for tool actions, y/n for confirmations)
  • Stdin communication flow validated through extensive logging system
  • Message formatting and injection pipeline functional
  • Both manual and automatic intervention modes operational
    NOTE: still needs further practical testing and the agent-triggered HITL has not been extensively reviewed

Testing / Debugging Focus

  • Agent Response Issue: while feedback successfully reaches the agent through system prompt modification (confirmed via logging in ac77e12; the logging can be reverted once we're certain it works), the agent is not consistently changing course based on the input
  • Message Ingestion: extensive logging shows the injection pipeline works correctly - feedback appears in the system prompt before LLM calls
  • Agent Behavior: need to investigate why the LLM doesn't consistently respond to injected feedback messages; the root cause is unclear
  • Haven't worked extensively on the agent-triggered HITL panel, needs further refinement
  • Haven't implemented the pause/resume flow yet - currently the agent keeps working in the background while the HITL panel opens

Next Steps

Priority: Debug agent response behavior

  • Consider tuning the agent instructions for HITL awareness and further optimizing the agent system prompt to watch for injected feedback
  • Investigate the LLM prompt processing flow for injected feedback and make sure it reaches the agent and influences the agent's execution
  • Test with different feedback message formats and check whether the agent changes course
  • Validate system prompt modification effectiveness
  • Further refine and test the agent-driven HITL panel
  • Implement the pause/resume system, which is currently not in place

@konradsemsch konradsemsch self-assigned this Oct 22, 2025
@konradsemsch konradsemsch added enhancement New feature or request help wanted Extra attention is needed priority: high Important, should be addressed soon status: in-progress Actively being worked on labels Oct 22, 2025
@konradsemsch konradsemsch added this to the release 0.1.4 milestone Oct 22, 2025
@github-actions

PR Quality Check

Pylint Score: 9.64/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

@konradsemsch konradsemsch changed the title Implements hook-based HITL in 2 working modes - 1) triggered by agent (e.g. when operation is destructive) 2) triggered by the pentester → [DRAFT] Implements hook-based HITL in 2 working modes - 1) triggered by agent (e.g. when operation is destructive) 2) triggered by the pentester Oct 22, 2025
@westonbrown (Owner)

@claude Please review this HITL implementation PR. Focus on:

  1. Code quality - PEP 8, logical flow, no over-engineering
  2. Security - proper error handling, no vulnerabilities
  3. Testing - coverage >80%, edge cases, mocking strategy
  4. Architecture - flag unnecessary abstractions or excessive nesting
  5. Documentation - clarity of implementation approach

Be direct and constructive.

caa-swe-agent bot commented Oct 23, 2025

Claude finished @westonbrown's task


Code Review: HITL Implementation

Conducted comprehensive review of the Human-in-the-Loop implementation across 27 files (3889 additions, 316 deletions).


1. Code Quality & Architecture

Over-Engineering Concerns ⚠️

Excessive Logging Infrastructure: The hitl_logger.py module creates a parallel logging system alongside Python's standard logging. This is redundant.

  • Issue: log_hitl() wrapper + dedicated file handler + global logger state
  • Impact: Dual logging paths (standard logger + HITL logger) create confusion and maintenance burden
  • Recommendation: Use standard Python logging with logger namespaces (logging.getLogger(__name__)) and configure handlers centrally. Delete hitl_logger.py.

Direct stdout bypass in production code (feedback_injection_hook.py:18-24):

import sys

def direct_log(msg: str):
    """Write directly to stdout bypassing all logging infrastructure."""
    sys.stdout.write(f"[HITL-HOOK-DIRECT] {msg}\n")
  • This is debugging code that made it to production
  • Remove direct_log() entirely - it bypasses normal logging for no valid reason

Dataclass Overkill: types.py defines dataclasses for simple structures that could be TypedDict or plain dicts:

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ToolInvocation:
    tool_name: str
    tool_id: str
    parameters: Dict[str, Any]
    confidence: Optional[float] = None
    reason: Optional[str] = None
  • These are never used for their dataclass features (no methods, no complex initialization)
  • Plain dicts or TypedDict would be simpler and more flexible

Logical Flow Issues

State Machine Complexity: FeedbackManager has 5 states (ACTIVE, PAUSE_REQUESTED, AWAITING_FEEDBACK, AWAITING_CONFIRMATION, REJECTED) but unclear state transitions:

  • Line 74: self.state = HITLState.PAUSE_REQUESTED
  • Line 96: self.state = HITLState.AWAITING_FEEDBACK (immediately after)
  • Why two states in succession? This suggests PAUSE_REQUESTED is redundant.

Unclear ownership: pending_feedback is stored in both:

  1. self.pending_feedback (single slot)
  2. self.feedback_queue (dict by tool_id)

This is redundant. Choose one.


2. Security Vulnerabilities 🔴

Critical: No Input Validation on stdin Commands

Location: feedback_handler.py:87-117

The _process_input_line() method parses JSON from stdin with zero validation:

command_json = line[start:end]
command = json.loads(command_json)
self.handle_feedback_command(command)

Vulnerabilities:

  1. No size limits: Malicious input could send megabytes of JSON causing memory exhaustion
  2. No schema validation: Any JSON structure is accepted
  3. Command injection risk: content field is not sanitized before being injected into system prompt
  4. No authentication: Anyone with stdin access can inject commands

Exploitation scenario:

python3 -c 'print("__HITL_COMMAND__{\"type\":\"submit_feedback\",\"feedback_type\":\"correction\",\"content\":\"" + "A"*10000000 + "\",\"tool_id\":\"x\"}__HITL_COMMAND_END__")'

Required fixes:

  1. Add max size check (e.g., 10KB limit):

    command_json = line[start:end]
    if len(command_json) > 10000:
        logger.warning("HITL command too large, ignoring")
        return
  2. Add schema validation using pydantic or explicit checks:

    REQUIRED_FIELDS = {"type": str, "tool_id": str}
    if not all(k in command and isinstance(command[k], v) for k, v in REQUIRED_FIELDS.items()):
        logger.warning("Invalid HITL command structure")
        return
  3. Sanitize content field to prevent prompt injection:

    content = command.get("content", "")[:5000]  # Hard limit
    # Strip control characters
    content = "".join(c for c in content if c.isprintable() or c.isspace())

Medium: Shell Command Detection is Bypassable

Location: hitl_hook_provider.py:124-154

Destructive operation detection uses simple substring matching:

destructive_patterns = ["rm ", "delete ", "drop ", "truncate ", "format ", "erase "]
for pattern in destructive_patterns:
    if pattern in command_lower:
        return True

Bypasses:

  • rm\t-rf (tab instead of space)
  • rm\n-rf (newline)
  • r""m -rf (bash quoting)
  • /bin/rm -rf
  • find . -delete

Fix: Use regex with word boundaries:

import re
destructive_patterns = [
    r'\brm\b.*-[rf]',
    r'\b(delete|drop|truncate)\b',
    r'\b(format|erase)\b',
]
for pattern in destructive_patterns:
    if re.search(pattern, command_lower):
        return True

Low: Race Condition in Threading

Location: feedback_handler.py:56-61

def stop_listening(self) -> None:
    self._running = False
    if self._listener_thread:
        self._listener_thread.join(timeout=1.0)

No check if thread actually stopped. If join() times out, thread keeps running and accessing self.feedback_manager.

Fix:

def stop_listening(self) -> None:
    self._running = False
    if self._listener_thread and self._listener_thread.is_alive():
        self._listener_thread.join(timeout=1.0)
        if self._listener_thread.is_alive():
            logger.error("Feedback listener thread did not stop cleanly")

3. Error Handling 🟡

Too many broad exception catches:

Count: 11 try/except blocks across HITL modules (against reviewer guidelines: "minimal try/except blocks").

Problematic examples:

  1. Silent failures (feedback_handler.py:83-85):
except Exception as e:
    logger.error("Error in feedback listener: %s", e, exc_info=True)
    log_hitl("InputHandler", f"ERROR in listen loop: {e}", "ERROR")
  • Logs error but continues loop
  • User has no idea their feedback was ignored
  • Fix: Emit error event to UI
  2. Overly broad (feedback_manager.py:408-409):
except Exception as e:
    logger.warning("Failed to store intervention in memory: %s", e)
  • Catches everything including SystemExit, KeyboardInterrupt
  • Fix: Catch specific exceptions only:
    except (AttributeError, TypeError, ValueError) as e:
        logger.warning("Failed to store intervention: %s", e)
  3. Missing edge case handling (feedback_injection_hook.py:113):
event.agent.system_prompt += f"\n\n{feedback_message}"
  • What if event.agent.system_prompt is None?
  • Fix: Add null check:
    if event.agent.system_prompt is None:
        event.agent.system_prompt = feedback_message
    else:
        event.agent.system_prompt += f"\n\n{feedback_message}"

4. Testing Coverage 📊

Current Coverage: ~60% (Estimated)

Test files:

  • test_hitl_components.py: 262 lines, 16 test methods ✓
  • test_hitl_feedback_injection.py: 547 lines, 32 test methods ✓
  • test_hitl_hook_manual.py: 189 lines, 1 manual test ✗ (not a unit test)

Production code: ~1,100 lines (excluding tests)

Coverage gaps:

  1. No integration tests: All tests use mocks. No real Strands SDK integration.
  2. No stdin testing: FeedbackInputHandler._listen_loop() is never tested with actual stdin data.
  3. No concurrency tests: Threading behavior untested.
  4. Missing edge cases:
    • Empty feedback content
    • Malformed JSON in stdin
    • System prompt = None (test sketched below)
    • Multiple pending feedbacks for same tool_id
    • Feedback during swarm operations
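
The None-prompt gap is easy to pin down; a minimal, self-contained sketch of that test (the event/agent objects are SimpleNamespace stand-ins for the real Strands types):

from types import SimpleNamespace

def inject(event, feedback_message: str) -> None:
    # Mirrors the null-safe injection recommended in section 3
    if event.agent.system_prompt is None:
        event.agent.system_prompt = feedback_message
    else:
        event.agent.system_prompt += f"\n\n{feedback_message}"

def test_injection_with_none_prompt():
    event = SimpleNamespace(agent=SimpleNamespace(system_prompt=None))
    inject(event, "HUMAN FEEDBACK: stop scanning")
    assert event.agent.system_prompt == "HUMAN FEEDBACK: stop scanning"

def test_injection_appends_to_existing_prompt():
    event = SimpleNamespace(agent=SimpleNamespace(system_prompt="base"))
    inject(event, "HUMAN FEEDBACK: stop scanning")
    assert event.agent.system_prompt == "base\n\nHUMAN FEEDBACK: stop scanning"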

Mocking Strategy Issues

Over-mocking reduces test value (test_hitl_feedback_injection.py:53-66):

@pytest.fixture
def mock_agent():
    """Create mock agent with system_prompt attribute."""
    agent = Mock()
    agent.system_prompt = "You are a helpful assistant..."
    return agent

This mocks the entire agent, so tests never verify if:

  • The hook actually registers with Strands
  • BeforeModelInvocationEvent fires correctly
  • Strands SDK sees the modified prompt

Recommendation: Use real Strands Agent with a mock Model that captures prompts.


5. Type Hints ✅ (Mostly Good)

Well-typed files:

  • types.py: Full coverage with dataclasses ✓
  • feedback_manager.py: Return types on all methods ✓
  • hitl_hook_provider.py: Complete type hints ✓

Missing type hints:

  1. feedback_handler.py:119 - Missing return type:
def handle_feedback_command(self, command: dict) -> None:  # Add -> None
  2. hitl_logger.py:15 - return type already specified:
def get_hitl_logger() -> logging.Logger:  # Already has it ✓
  3. React/TypeScript side: Good type coverage in HITLInterventionPanel.tsx

6. PEP 8 Compliance ✅

Pylint score: 9.64/10 - Good!

Minor issues found:

  1. Line length: Some lines > 100 chars (e.g., feedback_manager.py:331)
  2. Import order: from .hitl_logger import log_hitl should come before from .types import ... (PEP 8 groups)

7. Documentation Issues 📝

Unclear Implementation Approach

From PR description: "the agent is not consistently changing course based on the input"

Root cause hypothesis (from code review):

  1. Feedback timing: HITLFeedbackInjectionHook injects before EVERY model invocation, but feedback might be injected AFTER the agent already decided on a tool.

  2. System prompt vs conversation: LLMs pay more attention to recent conversation messages than system prompt additions. Injecting into system prompt may be ignored.

  3. Verification logging shows modification worked (line 124-132), so the issue is likely LLM behavior, not hook mechanics.

Recommended debugging steps (add to PR):

  1. Log the FULL modified system prompt to verify placement
  2. Try injecting as a conversation message instead of system prompt (sketched below)
  3. Test with different prompt formats (e.g., "CRITICAL:", "STOP:", "HUMAN OVERRIDE:")
  4. Add confidence scoring to detect when agent ignored feedback
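
For step 2, a sketch of message-based injection (assuming the agent exposes a Bedrock-style messages list of role/content blocks; the exact Strands message structure should be verified):

def inject_as_user_message(event, feedback_message: str) -> None:
    # Alternative to system-prompt injection: append the feedback as the
    # most recent conversation turn, which models typically weight more
    # heavily than additions buried in the system prompt.
    event.agent.messages.append({
        "role": "user",
        "content": [{"text": f"HUMAN OVERRIDE: {feedback_message}"}],
    })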

Missing Documentation

  • No docstring for direct_log() explaining why it exists (should be removed anyway)
  • hitl_hook_provider.py:38-45: destructive_patterns list should document why each pattern is included
  • TypeScript files lack JSDoc for complex functions

Summary of Critical Issues

Must Fix Before Merge 🔴

  1. Security: Add input validation to stdin parsing (size limits, schema validation, content sanitization)
  2. Security: Improve destructive operation detection (regex with word boundaries)
  3. Remove debugging code: Delete direct_log() and all calls to it
  4. Simplify logging: Remove hitl_logger.py, use standard Python logging

Should Fix 🟡

  1. Reduce exception handling: Make catch blocks specific, emit errors to UI
  2. Fix state machine: Remove redundant PAUSE_REQUESTED state or clarify purpose
  3. Add integration tests: Test with real Strands SDK components
  4. Test stdin parsing: Add tests with malformed input, oversized payloads
  5. Document why agent ignores feedback: Add investigation results to code comments

Nice to Have 🟢

  1. Replace dataclasses with TypedDict for simple structures
  2. Consolidate pending_feedback storage (remove redundant queue)
  3. Add thread safety checks in stop_listening()

Test Coverage Recommendation

Target: >80% - Current estimate: ~60%

Add these tests:

def test_stdin_oversized_command():
    """Test rejection of oversized HITL commands."""
    
def test_stdin_malformed_json():
    """Test handling of malformed JSON in stdin."""
    
def test_concurrent_feedback_submissions():
    """Test thread safety of feedback queue."""
    
def test_feedback_injection_with_none_prompt():
    """Test injection when system_prompt is None."""
    
def test_destructive_pattern_bypass_attempts():
    """Test detection of obfuscated destructive commands."""

Architecture Assessment

Unnecessary abstractions: The 6-file module structure is reasonable for a feature of this size. However:

  • hitl_logger.py is unnecessary abstraction (delete it)
  • types.py dataclasses could be simpler (but not critical)

Excessive nesting: No significant issues. Most functions are 1-2 levels deep. ✓

Missing edge cases (flagged above):

  • Null system prompt
  • Empty feedback content
  • Thread lifecycle edge cases
  • Input validation bypasses

Overall Assessment: Solid foundation with good test structure, but has critical security gaps in input validation and overly complex logging. The core HITL mechanism is well-designed, but needs hardening before production use.

Recommendation: Address security issues and simplify logging before merging. Consider this PR a strong alpha, not production-ready.


Branch: feat/hitl-reworked

@konradsemsch (Collaborator, Author) commented Oct 27, 2025

@westonbrown @aggr0cr4g

A bit of an update and explanation. Mostly focused on the pentester-triggered mode in this batch.

Addressed - see below for full explanations

  • Investigate LLM prompt processing flow for injected feedback and make sure it reaches the agent
  1. User submits feedback → React UI → stdin → FeedbackHandler._process_input_line()
  2. Feedback stored → FeedbackManager.submit_feedback() → sets pending_feedback
  3. Agent continues execution → calls model
  4. Before model invocation → HITLFeedbackInjectionHook.inject_feedback() fires
  5. Check for pending feedback → feedback_manager.get_pending_feedback_message()
  6. If feedback exists:
    • Get current system prompt
    • Append feedback with \n\n separator
    • Set modified prompt: event.agent.system_prompt = current + "\n\n" + feedback
    • Clear pending feedback to prevent duplicate injection
  7. Agent receives modified prompt → processes feedback as part of core context

User suggestions are injected at the end of the system prompt which is then used for the next loop.

# Line 116-117 in feedback_injection_hook.py
current_prompt = event.agent.system_prompt or ""
event.agent.system_prompt = f"{current_prompt}\n\n{feedback_message}"
  • Implement the pause/resume system

Analyzing the logs and looking at the HITL panel confirms the pause should be working now

13:47:24.793 - pause requested
13:48:19.315 - agent detects pause at step 2: "Manual pause detected before model invocation - blocking"
13:48:19.316 - Agent blocks, waiting for feedback (120s timeout)
13:49:24.796 - 120 seconds expired, auto-resume triggered
13:49:24.810 - Agent resumes: "Manual pause cleared - continuing execution"
13:49:24.812 - feedback arrives (just after timeout)
13:49:38.070 - Feedback injected at step 3: "✓ Pending feedback found: 214 chars"
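
For reference, the blocking behaviour in pseudo-form (helper names like is_paused() and clear_pause() are illustrative; the real logic lives in the hook provider):

import time

PAUSE_TIMEOUT_S = 120  # matches the 120s auto-resume visible in the log above

def wait_for_feedback_or_timeout(feedback_manager) -> None:
    # Runs before model invocation while a manual pause is active
    deadline = time.monotonic() + PAUSE_TIMEOUT_S
    while feedback_manager.is_paused():
        if feedback_manager.has_pending_feedback():
            break  # feedback arrived - resume and let the injection hook run
        if time.monotonic() >= deadline:
            feedback_manager.clear_pause()  # auto-resume after timeout
            break
        time.sleep(0.25)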

  • Further refine and test agent-driven HITL panel

Still outstanding

  • Do further testing and synchronisation of behaviours for the agent-triggered HITL panel. I can confirm that the agent indeed triggers it, but I wasn't able to fully test responsiveness depending on the user answer. This is what it looks like ATM:
    [Screenshot: agent-triggered HITL intervention panel, 2025-10-27]
  • Consider tuning the agent instructions for HITL awareness and further optimizing the agent system prompt to watch for injected feedback so it influences the agent execution
  • Test with different feedback message formats and check whether the agent changes course
  • Validate system prompt modification effectiveness
  • Cleaning up and modularizing all code. Taking Claude's suggestions into consideration
  • Implementing extensive testing

Potential issues

During one of my tests, when I triggered the HITL panel, the agent also triggered the HITL panel itself in the same loop. This caused the user-triggered timeout to reset while the agent-triggered timeout kicked in, effectively shortening the overall timeout somewhat. So we have a potential race condition here, but I would say we accept it for now.

@konradsemsch (Collaborator, Author)

@westonbrown @aggr0cr4g did some further changes and tests for the agent-triggered HITL panel. On a high level I can confirm that it seems to work (to the extent I had time to test), but I'm still not 100% sure. For instance: once it asked me for feedback on an operation and I rejected it, and then the agent crashed. On another occasion I approved and it seemed to work. I haven't really gotten far into testing with a correction. But the bottom line is that it's triggered either a) when a potentially destructive operation is detected (this needs to be further hardened) or b) when the agent has low confidence (for now I set the threshold to 90% so the trigger kicks in ASAP).

I think intuitively it works as expected, but needs much more testing to really confirm everything

How to test

  • when running in docker -> build the image and then run with the environment variable CYBER_AGENT_HITL_ENABLED set to true:
docker run --user root -it --rm \
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    -e AWS_REGION=us-east-1 \
    -e CYBER_AGENT_HITL_ENABLED=true \
    -v $(pwd)/outputs:/app/outputs \
    cyber-autoagent:latest
  • running without docker - I haven't tried that at all, so it would be great to get somebody to play with this

Still outstanding

  • Consider tuning the agent instructions for HITL awareness and further optimizing the agent system prompt to watch for injected feedback so it influences the agent execution
  • Validate system prompt modification effectiveness
  • Cleaning up (getting rid of extra logs) and modularizing all code. Taking Claude's suggestions into consideration
  • Implementing extensive testing

@westonbrown westonbrown changed the base branch from main to release-0.1.4 November 5, 2025 02:19
@westonbrown (Owner)

Base branch changed: main → release-0.1.4 for v0.1.4 release coordination. Will have time to test this weekend - cool progress!

@westonbrown (Owner) commented Nov 13, 2025

Thanks for pushing this — looks promising. Could you:

  • Rebase onto the latest release-0.1.4 to resolve conflicts.
  • Update to the Strands 1.16.0 release from today and switch from experimental hooks to strands.hooks (BeforeModelCallEvent, AfterModelCallEvent, BeforeToolCallEvent, AfterToolCallEvent) to clear deprecations and align with current payloads.
  • Keep the stream/event protocol unchanged when HITL is off.

I'd like to land this as a single PR, with the goal to merge by this weekend. I can help test and review once rebased; community help is welcome, especially around pause/resume and the agent-triggered path.
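
For the migration, roughly this shape (a sketch only; the registration API and event names should be verified against the Strands 1.16.0 docs):

from strands.hooks import (
    BeforeModelCallEvent,
    BeforeToolCallEvent,
    HookProvider,
    HookRegistry,
)

class HITLHooks(HookProvider):
    # Sketch of the provider after moving off the experimental hooks
    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.on_before_tool_call)
        registry.add_callback(BeforeModelCallEvent, self.on_before_model_call)

    def on_before_tool_call(self, event: BeforeToolCallEvent) -> None:
        # Agent-triggered HITL: pause/approve destructive tools here
        ...

    def on_before_model_call(self, event: BeforeModelCallEvent) -> None:
        # Inject any pending human feedback before the model is called
        ...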

@konradsemsch (Collaborator, Author)

hey @westonbrown - thanks! Yeah, will do an update in the next 1-2 days :)

@konradsemsch (Collaborator, Author)

@westonbrown working on it now but won't have enough time to complete it entirely. I should, however, be able to finish it by Tuesday EOD if that's fine with you.

Konrad and others added 4 commits November 16, 2025 19:48
…avior

- Create test_hitl_feedback_injection.py with 27 test cases
- Test feedback injection hook without requiring live LLM
- Cover all feedback types: CORRECTION, SUGGESTION, APPROVAL, REJECTION
- Verify system prompt modification and feedback clearing
- Test sequential feedback cycles and edge cases
- Tests are CI-friendly and run in milliseconds

Addresses feedback injection testing requirements:
- Debug agent response behavior for HITL feedback
- Test different feedback message formats
- Validate system prompt modification effectiveness
- Use mocked agent instead of live Ollama

Co-authored-by: Aaron Brown <westonbrown@users.noreply.github.com>
Enhance the agent-triggered HITL intervention panel based on user
feedback and testing:

- Fix yellow box alignment by using paddingX instead of padding
- Update panel title to sentence case for less aggressive tone
- Reorder action options: approve, reject, correction
- Fix correction feedback input mode activation
- Add rejection handling in hook provider to prevent tool execution
- Document HITL triggering conditions in system prompt

The rejection handler now raises RuntimeError to prevent destructive
operations when user clicks reject, while still allowing the agent to
receive the rejection feedback at next model invocation.
@github-actions

PR Quality Check

Pylint Score: N/A/10 [FAIL]
Tests: Check workflow status above

[FAIL] Pylint score must be 9.5 or higher

@github-actions

PR Quality Check

Pylint Score: 9.60/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

@konradsemsch konradsemsch marked this pull request as ready for review November 16, 2025 20:44
@github-actions

PR Quality Check

Pylint Score: 9.60/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

Removed duplicate class definitions that were causing isinstance()
validation failures. Classes were being imported from modules.config.types
but then redefined in manager.py, creating two different class objects
with the same name.

Changes:
- Removed duplicate definitions: ModelProvider, ModelConfig, LLMConfig,
  EmbeddingConfig, VectorStoreConfig, MemoryLLMConfig, MemoryEmbeddingConfig,
  MemoryVectorStoreConfig, MemoryConfig, EvaluationConfig, SwarmConfig,
  get_default_base_dir, SDKConfig, OutputConfig, ServerConfig
- Moved HITLConfig to types.py and added to ServerConfig
- Added HITLConfig to imports in manager.py
- Removed stale MEM0_PROVIDER_MAP duplicate (more complete version in types.py)
- Cleaned up merge conflict markers
@github-actions

PR Quality Check

Pylint Score: 9.71/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

Konrad added 6 commits November 18, 2025 15:05
Remove unconditional hooks list recreation that caused UnboundLocalError
when prompt optimization is disabled. The hook is only created when
CYBER_ENABLE_PROMPT_OPTIMIZATION is enabled, so we shouldn't reference
it unconditionally.
Test mocks were missing required hitl configuration attribute causing
AttributeError. Added hitl SimpleNamespace with default disabled config
to all affected test fixtures.
create_agent now returns three values (agent, callback_handler,
feedback_manager) instead of two. Updated all test assertions to
unpack three values correctly.
Adjust test expectations to work with actual snapshot data instead of
requiring specific mock data counts. Tests now use real client or have
flexible assertions that work with any valid snapshot.
Remove undefined feedback_injected_this_turn variable reference and
duplicate get_config_manager import. Add missing Dict type import for
type annotations.
Apply automatic code formatting to maintain consistent style across
the codebase.
@github-actions

PR Quality Check

Pylint Score: 9.72/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

Add missing mocks (get_config_manager, ReactBridgeHandler) that are
required for the test to reach the model creation code. Without these
mocks, the function fails earlier and never reaches the error handler
being tested.
@github-actions

PR Quality Check

Pylint Score: 9.72/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

Manual test test_hitl_hook_manual requires Ollama server running locally,
which is not available in CI environment. This was causing test failures
with connection errors.

Changes:
- Add 'manual' pytest marker in pyproject.toml
- Mark test_feedback_injection_hook with @pytest.mark.manual
- Update CI pytest command to skip manual tests with -m 'not manual'
- Add documentation on how to run manual tests locally

Manual tests can still be run with:
  pytest -m manual
  pytest tests/test_hitl_hook_manual.py
@github-actions

PR Quality Check

Pylint Score: 9.72/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

The pr-checks.yml workflow was missing the -m 'not manual' flag,
causing manual integration tests to run on PRs. This resulted in
test failures when Ollama server is not available in CI.

This completes the fix started in commit 5bed705 by ensuring both
ci.yml and pr-checks.yml workflows skip manual tests.
@github-actions

PR Quality Check

Pylint Score: 9.72/10 [PASS]
Tests: Check workflow status above

[PASS] All checks passed!

@westonbrown westonbrown deleted the branch release-0.1.4 November 29, 2025 18:55
@westonbrown westonbrown deleted the feat/hitl-reworked branch November 29, 2025 18:55